CN105575406A - Noise robustness detection method based on likelihood ratio test

Info

Publication number: CN105575406A
Application number: CN201610008285.2A
Authority: CN
Inventors: 李为; 朱杰; 包旭雷
Original assignee: Shenzhen Yinjiami Technology Co Ltd
Current assignee: Shenzhen Yinjiami Technology Co Ltd
Priority date: 2016-01-07
Filing date: 2016-01-07
Publication date: 2016-05-11

Abstract

The invention discloses a noise-robust detection method based on the likelihood ratio test. The method improves on three aspects: estimation of the signal-to-noise ratio (SNR), robust setting of the decision threshold, and elimination of trailing (hangover) distortion. As a result, the proposed algorithm detects better than existing algorithms at low SNR, and especially under non-stationary noise. Compared with the multiple-observation likelihood ratio test (MOLRT) algorithm based on harmonic features, the method achieves a similar speech boundary detection accuracy and a higher voice detection accuracy, which indicates that it outperforms the traditional method. The method also performs similarly at SNRs of 15 dB and 25 dB, which indicates that it is robust to noise. It can serve as an effective front-end preprocessing step for speech recognition or voiceprint recognition systems in practical environments.

Description

A noise-robust detection method based on the likelihood ratio test

Technical field

The present invention relates to the field of speech processing and signal processing, and in particular to a noise-robust detection method based on the likelihood ratio test.

Background technology

Speech endpoint detection, also called voice activity detection (VAD), is a crucial part of speech processing. It is used not only for speech/non-speech detection in speech enhancement, but also in processes such as feature extraction and speech dereverberation. Existing speech endpoint detection algorithms fall into three main classes: time-domain methods, frequency-domain methods, and methods based on statistical models.

In practical applications, high-precision speech endpoint detection is extremely important for subsequent speech enhancement, speech recognition or voiceprint recognition. However, existing techniques still have shortcomings, especially under real channel conditions. The spectral features of unvoiced sounds and fricatives closely resemble those of noise, and most existing endpoint detection algorithms separate speech from noise using the syllable characteristics of speech itself, so the initial or final sounds of an utterance may be lost during endpoint detection, causing truncation. In addition, most algorithms cannot retain all of the speech information, and detection performance degrades markedly as the SNR decreases.

Summary of the invention

The technical problem to be solved by the invention is to overcome the defects of the prior art by providing a noise-robust detection method based on the likelihood ratio test that improves on three aspects, namely SNR estimation, robust threshold setting and trailing (hangover) elimination, so that the proposed algorithm detects better than existing algorithms at low SNR, and especially under non-stationary noise.

To solve the above technical problem, the invention adopts the following technical solution: a noise-robust detection method based on the likelihood ratio test, comprising the following steps:

S1. Perform speech enhancement on the noisy speech signal with a Wiener filter, to weaken the influence of the noise on the clean speech and to make the filtered noise more stationary. After Wiener-filter enhancement, the noisy speech signal $x(n)$ is the superposition of the clean speech $s(n)$ and the interference noise $d(n)$:

$$x(n) = s(n) + d(n)$$

where $n$ is the time-sample index. After the Wiener filter, the clean speech signal and the interference noise can be treated as statistically independent and zero-mean.

S2. Apply the Fourier transform to the noisy speech. In the spectral domain, the spectral coefficient of the filtered noisy speech signal is the sum of the spectral coefficient of the clean speech signal and that of the interference noise:

$$H_0:\; X(m,k) = D(m,k), \qquad H_1:\; X(m,k) = S(m,k) + D(m,k) \tag{1}$$

where $X(m,k)$, $S(m,k)$ and $D(m,k)$ are the short-time Fourier coefficients of the noisy speech, the clean speech and the noise in each frame, $m$ is the frame index, $k$ is the frequency-bin index within the frame, and $H_0$ and $H_1$ denote a non-speech frame and a speech frame, respectively.

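S3. Compute the likelihood ratio. When the probability densities of the clean speech signal and the noise signal are both Gaussian, the probability density functions of the observed signal $X(m,k)$ under $H_0$ and $H_1$ are

$$p(X(m,k)\mid H_0) = \frac{1}{\pi\lambda_d(m,k)}\exp\left(-\frac{|X(m,k)|^2}{\lambda_d(m,k)}\right), \qquad p(X(m,k)\mid H_1) = \frac{1}{\pi\bigl(\lambda_d(m,k)+\lambda_s(m,k)\bigr)}\exp\left(-\frac{|X(m,k)|^2}{\lambda_d(m,k)+\lambda_s(m,k)}\right) \tag{2}$$

where $\lambda_s(m,k)$ is the power spectrum of the speech signal and $\lambda_d(m,k)$ is the power spectrum of the noise signal.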

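The likelihood ratio of the $k$-th frequency bin of frame $m$ is:

$$\Lambda(m,k) = \frac{p(X(m,k)\mid H_1)}{p(X(m,k)\mid H_0)} = \frac{1}{1+\xi(m,k)}\exp\left(\frac{\gamma(m,k)\,\xi(m,k)}{1+\xi(m,k)}\right) \tag{3}$$

where $\xi(m,k) = \lambda_s(m,k)/\lambda_d(m,k)$ and $\gamma(m,k) = |X(m,k)|^2/\lambda_d(m,k)$ denote the a priori SNR and the a posteriori SNR, respectively, $\lambda_s$ and $\lambda_d$ being the speech and noise power spectra. In the decision-directed estimator, the a priori SNR and the a posteriori SNR are related by:

$$\hat\xi(m,k) = \alpha\,\frac{|\hat S(m-1,k)|^2}{\lambda_d(m-1,k)} + (1-\alpha)\max\bigl(\gamma(m,k)-1,\,0\bigr) \tag{4}$$

where $\lambda_d(m-1,k)$ is the noise power spectrum of the previous frame, $\hat S(m-1,k)$ is the clean-speech spectral estimate of the previous frame, and $\alpha$ is a weighting factor.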

S4. Perform noise estimation and set a threshold $\eta$. Comparing the threshold with the value of the likelihood ratio determines whether the current frame is a speech segment or a non-speech segment: when the likelihood ratio is greater than the threshold, the frame is provisionally judged to be a speech frame, and when the likelihood ratio is less than the threshold, the frame is judged to be a non-speech frame. This can be expressed as:

$$\log\Lambda(m) = \frac{1}{K}\sum_{k=0}^{K-1}\log\Lambda(m,k) \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \eta \tag{5}$$

where $\Lambda(m,k)$ is the likelihood ratio of the $k$-th frequency bin of frame $m$, $K$ is the total number of frequency bins, and $H_0$ and $H_1$ denote a non-speech frame and a speech frame, respectively.

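S5. Determine the decision rule. The log-likelihood ratio of frame $m$ is:

$$\ell(m) = \frac{1}{K}\sum_{k=0}^{K-1}\log\Lambda(m,k)$$

Let $\ell_{m,M}$ denote the $2M+1$ consecutive frames centered on frame $m$; the decision rule based on these $2M+1$ log-likelihood ratios is then:

$$\ell_{m,M} = \sum_{j=m-M}^{m+M}\ell(j) \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \eta$$

Here $\log\Lambda(m,k)$ is the log-likelihood ratio of the $k$-th frequency bin. Substituting the probabilities of the observed signal under $H_0$ and $H_1$ gives:

$$\log\Lambda(m,k) = \frac{\gamma(m,k)\,\xi(m,k)}{1+\xi(m,k)} - \log\bigl(1+\xi(m,k)\bigr) = \gamma(m,k) - \log\gamma(m,k) - 1$$

where $\xi(m,k)$ is the a priori SNR and $\gamma(m,k)$ is the a posteriori SNR. The second equality holds because the a priori SNR is obtained from the a posteriori SNR by maximum-likelihood estimation, that is:

$$\hat\xi_{\mathrm{ML}}(m,k) = \gamma(m,k) - 1$$

Therefore, the value of the log-likelihood ratio depends on the noise energy spectrum $\lambda_d(m,k)$.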

S6. Trailing (hangover) elimination. When the SNR is low, the noise energy spectrum becomes large, and the chosen threshold $\eta$ is reduced to lower the probability of misjudging voiced segments; conversely, the threshold is increased to match high-SNR signals.

The noisy speech power spectrum $P(m,k)$ is obtained by smoothing the power spectrum $|X(m,k)|^2$ of the noisy signal, where the smoothing factor $\alpha(m,k)$ is a time- and frequency-dependent function; then:

$$P(m,k) = \alpha(m,k)\,P(m-1,k) + \bigl(1-\alpha(m,k)\bigr)\,|X(m,k)|^2$$

Noise estimation based on minimum statistics is then applied to obtain the minimum noise power spectrum of each frame. A threshold $\eta$ tied to this noise energy spectrum is then defined, scaled by a constant coefficient.

Compared with the prior art, the above technical solution has the following technical effects: the proposed VAD algorithm has an SBR (speech boundary detection) accuracy similar to that of the MOLRT algorithm based on harmonic features, but a considerably better VAcc (voice detection accuracy); and it performs similarly at SNRs of 15 dB and 25 dB, which shows that the method is robust to noise.

Brief description of the drawings

Fig. 1(a) is a schematic diagram of the clean speech.

Fig. 1(b) is a schematic diagram of the VAD result of Sohn's method.

Fig. 1(c) is a schematic diagram of the VAD result of Tan's method.

Fig. 1(d) is a schematic diagram of the VAD result of the method of the invention.

Fig. 2(a) is the segment-level performance comparison at different SNRs.

Fig. 2(b) is the frame-level performance comparison at different SNRs.

Fig. 2(c) is the number of correctly detected speech frames at different SNRs.

Fig. 3 is a block diagram of the speech endpoint detection framework based on speech enhancement used in the invention.

Embodiment

The technical solution of the invention is described in further detail below with reference to the accompanying drawings:

As shown in Fig. 3, the technical solution adopted by the invention is a noise-robust detection method based on the likelihood ratio test, comprising the following steps:

Specifically, the noisy speech $x(n)$ of the invention is obtained by superposing the clean speech $s(n)$ and the interference noise $d(n)$:

$$x(n) = s(n) + d(n)$$

where $n$ is the time-sample index.

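Assuming that the clean speech and the interference noise are statistically independent and zero-mean, the Fourier transform of the noisy speech can be expressed as

$$H_0:\; X(m,k) = D(m,k), \qquad H_1:\; X(m,k) = S(m,k) + D(m,k) \tag{1}$$

where $X(m,k)$, $S(m,k)$ and $D(m,k)$ are the short-time Fourier coefficients of the noisy speech, the clean speech and the noise in each frame, $m$ is the frame index, $k$ is the frequency-bin index within the frame, and $H_0$ and $H_1$ denote a non-speech frame and a speech frame, respectively. Suppose the probability densities of the clean speech signal and the noise signal are both Gaussian; the probability density functions of the observed signal $X(m,k)$ under $H_0$ and $H_1$ are then:

$$p(X(m,k)\mid H_0) = \frac{1}{\pi\lambda_d(m,k)}\exp\left(-\frac{|X(m,k)|^2}{\lambda_d(m,k)}\right), \qquad p(X(m,k)\mid H_1) = \frac{1}{\pi\bigl(\lambda_d(m,k)+\lambda_s(m,k)\bigr)}\exp\left(-\frac{|X(m,k)|^2}{\lambda_d(m,k)+\lambda_s(m,k)}\right) \tag{2}$$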

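where $\lambda_s(m,k)$ and $\lambda_d(m,k)$ are the power spectra of the speech signal and the noise signal, respectively. The likelihood ratio (LR) of the $k$-th frequency bin of frame $m$ is therefore:

$$\Lambda(m,k) = \frac{p(X(m,k)\mid H_1)}{p(X(m,k)\mid H_0)} = \frac{1}{1+\xi(m,k)}\exp\left(\frac{\gamma(m,k)\,\xi(m,k)}{1+\xi(m,k)}\right) \tag{3}$$

where $\xi(m,k) = \lambda_s(m,k)/\lambda_d(m,k)$ and $\gamma(m,k) = |X(m,k)|^2/\lambda_d(m,k)$ denote the a priori SNR and the a posteriori SNR, respectively. In the decision-directed (DD) estimator, the a priori SNR and the a posteriori SNR are related by:

$$\hat\xi(m,k) = \alpha\,\frac{|\hat S(m-1,k)|^2}{\lambda_d(m-1,k)} + (1-\alpha)\max\bigl(\gamma(m,k)-1,\,0\bigr) \tag{4}$$

where $\lambda_d(m-1,k)$ is the noise power spectrum of the previous frame, $\hat S(m-1,k)$ is the clean-speech spectral estimate of the previous frame, and $\alpha$ is a weighting factor.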
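The per-bin quantities above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard Gaussian-model form $\log\Lambda = \gamma\xi/(1+\xi) - \log(1+\xi)$ and the usual decision-directed update; the function names and the default weighting factor `alpha=0.98` are hypothetical choices, not values taken from the patent.

```python
import math


def posterior_snr(noisy_power, noise_power):
    """A posteriori SNR: gamma = |X|^2 / lambda_d."""
    return noisy_power / noise_power


def dd_prior_snr(prev_speech_power, prev_noise_power, gamma, alpha=0.98):
    """Decision-directed a priori SNR estimate.

    prev_speech_power: |S_hat(m-1,k)|^2, speech power estimated in the previous frame
    prev_noise_power:  lambda_d(m-1,k), noise power of the previous frame
    alpha:             weighting factor (0.98 is an assumed, commonly used value)
    """
    return (alpha * prev_speech_power / prev_noise_power
            + (1.0 - alpha) * max(gamma - 1.0, 0.0))


def log_likelihood_ratio(xi, gamma):
    """Log of the per-bin likelihood ratio: gamma*xi/(1+xi) - log(1+xi)."""
    return gamma * xi / (1.0 + xi) - math.log1p(xi)
```

With an a priori SNR of zero the log-likelihood ratio is zero for any a posteriori SNR, which matches the intuition that a bin with no expected speech energy carries no evidence for speech.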

Suppose a threshold $\eta$ is set and compared with the value of the LR to determine whether the current frame is a speech segment or a non-speech segment, so that:

$$\log\Lambda(m) = \frac{1}{K}\sum_{k=0}^{K-1}\log\Lambda(m,k) \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \eta \tag{5}$$

where $\Lambda(m,k)$ is the LR of the $k$-th frequency bin of frame $m$ and $K$ is the total number of frequency bins. Formula (5) shows that the value of the LR is closely tied to the a priori SNR and the a posteriori SNR. When the a posteriori SNR is very large, i.e. $\gamma(m,k) \gg 1$, the value of the LR also becomes very large; when the a posteriori SNR is small, the a priori SNR becomes the key parameter in computing the LR.
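The frame-level test described above can be sketched as follows. This is an illustrative sketch only: it assumes the per-bin log-likelihood ratio has the standard Gaussian-model form $\gamma\xi/(1+\xi) - \log(1+\xi)$ and that the frame statistic is the average over the $K$ bins; `frame_llr` and `is_speech` are hypothetical names.

```python
import math


def frame_llr(xis, gammas):
    """Average per-bin log-likelihood ratio over the K bins of one frame.

    xis:    a priori SNR for each frequency bin
    gammas: a posteriori SNR for each frequency bin
    """
    terms = [g * x / (1.0 + x) - math.log1p(x) for x, g in zip(xis, gammas)]
    return sum(terms) / len(terms)


def is_speech(xis, gammas, eta):
    """Declare a speech frame (H1) when the frame statistic exceeds eta."""
    return frame_llr(xis, gammas) > eta
```

For example, `is_speech([1.0, 1.0], [4.0, 4.0], eta=1.0)` evaluates the statistic $2 - \log 2 \approx 1.31$ and returns `True`, while a frame whose a priori SNRs are all zero gives a statistic of zero and is rejected by any positive threshold.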

The block diagram of the whole speech endpoint detection system based on speech enhancement is shown in Fig. 3. From the above derivation, the log-likelihood ratio (LLR) of frame $m$ is:

$$\ell(m) = \frac{1}{K}\sum_{k=0}^{K-1}\log\Lambda(m,k) \tag{6}$$

Let $\ell_{m,M}$ denote the $2M+1$ consecutive frames centered on frame $m$; the decision rule based on these $2M+1$ LLRs is then:

$$\ell_{m,M} = \sum_{j=m-M}^{m+M}\ell(j) \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \eta \tag{7}$$
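A multiple-observation decision over $2M+1$ consecutive frames can be sketched as below. This is a hedged illustration, not the patent's exact rule: the window is averaged rather than summed so that truncated windows at the start and end of the signal stay comparable, which for a full window is equivalent to comparing the sum with $(2M+1)\eta$. The function name `mo_decisions` is hypothetical.

```python
def mo_decisions(llrs, M, eta):
    """Classify each frame from the LLRs of the 2M+1 frames centered on it.

    llrs: per-frame log-likelihood ratios
    M:    half-width of the observation window
    eta:  decision threshold applied to the window average
    """
    decisions = []
    for m in range(len(llrs)):
        lo = max(0, m - M)               # truncate the window at the signal start
        hi = min(len(llrs), m + M + 1)   # and at the signal end
        window = llrs[lo:hi]
        decisions.append(sum(window) / len(window) > eta)
    return decisions
```

Pooling neighbouring frames smooths the decision: a single strong frame pulls its neighbours above the threshold, and an isolated weak frame inside a speech burst is less likely to be dropped.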

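For the log-likelihood ratio $\log\Lambda(m,k)$ of the $k$-th frequency bin, substituting the probabilities of the observed signal under $H_0$ and $H_1$ gives:

$$\log\Lambda(m,k) = \frac{\gamma(m,k)\,\xi(m,k)}{1+\xi(m,k)} - \log\bigl(1+\xi(m,k)\bigr) = \gamma(m,k) - \log\gamma(m,k) - 1 \tag{8}$$

where $\xi(m,k)$ is the a priori SNR and $\gamma(m,k)$ is the a posteriori SNR. The second equality holds because the a priori SNR can be obtained from the a posteriori SNR by the maximum-likelihood (ML) estimator:

$$\hat\xi_{\mathrm{ML}}(m,k) = \gamma(m,k) - 1$$

Therefore, the LLR can be regarded simply as a function of the a posteriori SNR; that is, the value of the LLR depends on the noise energy spectrum $\lambda_d(m,k)$.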
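The reduction of the LLR to a function of the a posteriori SNR alone can be checked numerically. The sketch below assumes the standard Gaussian-model LLR, $\gamma\xi/(1+\xi) - \log(1+\xi)$, and the ML estimate $\xi = \gamma - 1$; clamping the a posteriori SNR at 1 (so that the a priori SNR is never negative) is an added assumption, and both function names are hypothetical.

```python
import math


def general_llr(xi, gamma):
    """Per-bin LLR for an arbitrary a priori SNR xi."""
    return gamma * xi / (1.0 + xi) - math.log1p(xi)


def ml_llr(gamma):
    """Per-bin LLR when xi is replaced by its ML estimate gamma - 1.

    gamma is clamped at 1 so that the implied a priori SNR stays non-negative.
    """
    g = max(gamma, 1.0)
    return g - math.log(g) - 1.0
```

Evaluating `general_llr(gamma - 1.0, gamma)` and `ml_llr(gamma)` for any `gamma >= 1` gives the same number, which is the substitution step written out in code.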

On the other hand, when the SNR is very low, i.e. when the noise energy spectrum becomes large, a smaller threshold is needed to reduce the probability of misjudging voiced segments; conversely, a larger threshold is needed to match high-SNR signals. As the analysis above shows, the LLR depends mainly on the accuracy of the noise energy spectrum. Linking the threshold to the minimum noise energy spectrum of the current frame therefore not only makes the VAD algorithm more robust across SNR conditions; because the estimated minimum noise energy spectrum is a conservative lower estimate, it also leaves a margin that helps voiced segments be detected correctly.

Suppose the energy spectrum $P(m,k)$ is obtained by smoothing the power spectrum $|X(m,k)|^2$ of the noisy signal, where the smoothing factor $\alpha(m,k)$ is a time- and frequency-dependent function; then:

$$P(m,k) = \alpha(m,k)\,P(m-1,k) + \bigl(1-\alpha(m,k)\bigr)\,|X(m,k)|^2 \tag{9}$$

The minimum-statistics noise estimation proposed in the literature can then be used to obtain the minimum noise power spectrum of each frame.
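The smoothing and minimum-tracking idea can be sketched as below. This is a deliberately simplified illustration: it uses a fixed smoothing factor instead of the time- and frequency-dependent one described in the text, and a plain sliding-window minimum without the bias compensation of a full minimum-statistics estimator. The names `smooth_power` and `track_minimum` and the default values `alpha=0.85` and `window=8` are assumptions.

```python
def smooth_power(periodograms, alpha=0.85):
    """Recursively smooth |X(m,k)|^2 over frames, one value per frequency bin.

    periodograms: list of frames, each a list of K power values
    alpha:        fixed smoothing factor (assumed; the text uses an adaptive one)
    """
    smoothed = []
    prev = periodograms[0]  # initialise the recursion with the first frame
    for frame in periodograms:
        prev = [alpha * p + (1.0 - alpha) * x for p, x in zip(prev, frame)]
        smoothed.append(prev)
    return smoothed


def track_minimum(smoothed, window=8):
    """Per-bin minimum of the smoothed power over the last `window` frames."""
    minima = []
    for m in range(len(smoothed)):
        lo = max(0, m - window + 1)
        # zip(*...) regroups the frames in the window by frequency bin
        minima.append([min(bin_values) for bin_values in zip(*smoothed[lo:m + 1])])
    return minima
```

Because speech is intermittent, the minimum of the smoothed power over a window longer than a typical speech burst tends toward the noise floor, which is why the tracked minimum can stand in for the noise power even while speech is present.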

The threshold $\eta$ related to the noise energy spectrum is defined as:

(12)

where the coefficient in the formula is a constant of this threshold.

The performance of the proposed VAD method was verified as follows. The experiments use a self-recorded, non-broadcast clean speech corpus of 2906 recordings in total, sampled at fs = 8 kHz. The corpus is mixed with stationary and non-stationary noise to obtain noisy speech at different SNRs. The stationary noise was collected and recorded in a real environment; the non-stationary noises (car noise and babble noise) come from http://www.freesound.com and http://spib.rice.edu/spib/data/signals/noise/babble.html, respectively. A Hanning window of length 200 is used as the analysis window, and the total number of frequency bins is K = 256. In the noise estimation, a fixed smoothing factor is used, the prior probabilities of non-speech and speech are taken as equal, $P(H_0) = P(H_1)$, the parameter in equation (10) is fixed, and the number of consecutive LLRs is 2M+1 = 17.
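Mixing clean speech with noise at a prescribed SNR is commonly done by rescaling the noise so that the power ratio hits the target. The sketch below shows that standard procedure; it is an assumed illustration of how such test material can be prepared, not the authors' script, and `mix_at_snr` is a hypothetical name.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db, then add.

    speech, noise: equal-length lists of samples
    snr_db:        target signal-to-noise ratio in dB
    """
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(v * v for v in noise) / len(noise)
    # Required noise power is speech_power / 10^(snr_db/10)
    scale = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + scale * v for s, v in zip(speech, noise)]
```

In practice the speech power is often measured over active speech only, so that long pauses do not bias the SNR; the sketch measures it over the whole signal for brevity.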

Although the receiver operating characteristic (ROC) curve is a common tool for verifying the performance of VAD algorithms, it can only assess a VAD at the frame level: it shows how many speech/non-speech frames were classified correctly but says nothing about the detection of speech/non-speech segments. For example, the ROC curve of Sohn's VAD algorithm is comparatively good, yet in practice Sohn's method produces many fragments. Fig. 1 illustrates this situation with a noisy utterance.

Fig. 1(a) to Fig. 1(d) show that, at low SNR, Sohn's method cannot preserve the integrity of speech segments and produces many small fragments; Tan's method performs somewhat better in this respect. However, the excess of small fragments means that neither method can guarantee effective automatic speech recognition in noisy environments. Therefore, to verify the effectiveness of the VAD algorithm, the invention considers not only frame-level performance but also segment-level performance.
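The difference between frame-level and segment-level behaviour can be made concrete with two small measures. The sketch below is an illustration only: `frame_accuracy` and `count_segments` are hypothetical helpers, not the SBR or VAcc metrics used in the evaluation.

```python
def frame_accuracy(reference, hypothesis):
    """Fraction of frames whose speech/non-speech label matches the reference."""
    matches = sum(r == h for r, h in zip(reference, hypothesis))
    return matches / len(reference)


def count_segments(decisions):
    """Number of contiguous runs of speech frames (True values)."""
    count = 0
    previous = False
    for d in decisions:
        if d and not previous:  # a rising edge starts a new speech segment
            count += 1
        previous = d
    return count
```

For a reference with one five-frame speech segment, a hypothesis that drops every other speech frame still labels 5 of 7 frames correctly but splits the single segment into three fragments, which is the kind of failure a frame-level score alone does not reveal.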

Fig. 2(a) to Fig. 2(c) show the VAD results at different SNRs under stationary noise. Fig. 2(c) shows that the proposed algorithm is similar to Sohn's VAD method in the number of correctly detected speech frames and better than the VAD method proposed by Tan; Fig. 2(a) shows that it is far better than the other two methods at detecting speech/non-speech segments.

The number of correctly detected speech frames under car noise is tabulated as follows:

The performance comparison under different non-stationary noises is tabulated as follows:

The two tables above give the performance of the different VAD algorithms under different non-stationary noises. The first table shows that Sohn's method detects the largest number of speech frames; however, as noted above, frame-level correctness alone does not show that a VAD algorithm is optimal. The following conclusions can be drawn from the second table:

The proposed VAD algorithm has an SBR accuracy similar to that of the MOLRT algorithm based on harmonic features, but a considerably better VAcc, which shows that the endpoint detection method proposed in the invention performs better than the traditional method.

The proposed VAD algorithm performs similarly at SNRs of 15 dB and 25 dB, which shows that the VAD algorithm of the invention is robust to noise.

The embodiments of the invention have been explained in detail above with reference to the accompanying drawings, but the invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the concept of the invention.

Claims

1. A noise-robust detection method based on the likelihood ratio test, comprising the following steps:

(1)

(2)

The likelihood ratio of the k-th frequency bin of the frame is:

(3)

(4)

where the noise power spectrum is that of the previous frame;

(5)

S5. Determine the decision rule; the log-likelihood ratio of frame m is:

where the coefficient is a constant of the threshold.