KR20110076315A

KR20110076315A - Non-stationary/mixed noise estimation method based on minimum statistics and codebook driven short-term predictor parameter estimation

Info

Publication number: KR20110076315A
Application number: KR1020090132994A
Authority: KR
Inventors: 송재종; 이종설; 양창모; 박성주; 이석필
Original assignee: 전자부품연구원
Priority date: 2009-12-29
Filing date: 2009-12-29
Publication date: 2011-07-06
Also published as: KR101073665B1

Abstract

PURPOSE: A non-stationary/mixed noise estimation method based on minimum statistics and codebook driven short-term predictor parameter estimation is provided to improve noise cancellation performance. CONSTITUTION: A first estimate of the background noise contained in noisy speech is obtained using minimum statistics, and the background noise is removed from the noisy speech. A second estimate, of the residual noise, is then obtained using codebook driven short-term predictor parameter estimation, and the residual noise is removed in a second pass.

Description

Background noise removal method using minimum statistics and a short-term predictor coefficient codebook technique {Non-stationary/Mixed Noise Estimation Method based on Minimum Statistics and Codebook Driven Short-Term Predictor Parameter Estimation}

The present invention relates to a background noise removal method, and in particular to a method that removes background noise using two noise estimation algorithms, Minimum Statistics (MS) and Codebook Driven Short-Term Predictor parameter estimation (CDSTP), so as to obtain speech that is enhanced relative to the contaminated original speech.

Speech enhancement is needed in many fields, including communication on mobile devices, music information retrieval, and speech recognition for robot control. The noise removal technique itself matters for the performance of speech enhancement, but the performance of the noise estimation algorithm matters more fundamentally.

Noise removal techniques include the spectral subtraction (SS) method, which starts from the assumption that noise-contaminated speech is the sum of clean speech and background noise [Reference 1: S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech, Signal Processing, vol. 27, no. 2, pp. 113-120, 1979.]. The multi-band spectral subtraction (MBSS) method improves on it by assuming that the noise varies independently in each frequency band [Reference 2: L. Singh and S. Sridharan, "Speech enhancement using critical band spectral subtraction," Proc. Intern. Conf. Spoken Lang. Processing, pp. 2827-2830, 1998. Reference 3: S. Kamath and P. Loizou, "A multi-band spectral subtraction method for enhancing speech corrupted by colored noise," Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, 2002.].

Various noise estimation algorithms have been studied, including methods based on voice activity detection (VAD) and methods based on minimum statistics (MS) [Reference 4: R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Trans. Speech Audio Processing, vol. 9, no. 5, pp. 504-512, 2001.]. The MS noise estimation algorithm assumes that, within the most recent D frames of the contaminated input signal, the frame with the minimum power contains only noise. MS is known to work relatively well in stationary background noise but is not robust in non-stationary background noise.

In contrast, methods such as codebook driven short-term predictor parameter estimation (CDSTP) are generally robust in non-stationary as well as stationary background noise [Reference 5: S. Srinivasan, J. Samuelsson, and W. B. Kleijn, "Codebook driven short-term predictor parameter estimation for speech enhancement," IEEE Trans. Speech Audio Processing, vol. 14, no. 1, pp. 163-176, 2006.]. The CDSTP algorithm represents the spectra of speech and noise by linear predictive coding (LPC) coefficients and stores representative spectral shapes in codebooks in LPC form. It then estimates the noise by using maximum-likelihood estimation to determine the speech and noise codebook parameters and their respective gains.

However, CDSTP is vulnerable to noise whose spectral shape is not stored in the codebook, so the present invention examines how to use CDSTP in combination with MS. Moreover, real-world noise mostly occurs as a mixture of two or more noises. On a battlefield, for example, gunfire and aircraft noise are present together, so the present invention also examines speech enhancement for such mixed noise environments. To remove mixed noise, combining noise estimation algorithms with different characteristics is more effective than using any single existing method.

In summary, MS is robust to stationary background noise but relatively weak against non-stationary background noise. CDSTP is robust to non-stationary background noise but vulnerable to background noise that is not represented in the codebook.

An object of the present invention is to provide a background noise removal method that is robust to a variety of background noises by combining CDSTP and MS.

To achieve the above object, a background noise removal method according to one embodiment of the present invention removes background noise from contaminated speech, that is, speech contaminated with background noise, and comprises: step (a) of obtaining a first estimate of the background noise contained in the contaminated speech using Minimum Statistics (MS); step (b) of performing a first removal of background noise from the contaminated speech using the first estimate obtained in step (a); step (c) of obtaining, using Codebook Driven Short-Term Predictor parameter estimation (CDSTP), a second estimate of the residual noise remaining after the MS-based noise removal; and step (d) of performing a second removal, using the second estimate obtained in step (c), of the residual noise remaining after the MS-based noise removal.

In the background noise removal method according to this embodiment, multi-band spectral subtraction (MBSS) is used for the first noise removal.

In addition, in the background noise removal method according to this embodiment, a Wiener filter is used for the second noise removal.

In addition, the noise removed in the first pass is used as the codebook for computing the second estimate. This scheme is called MS+CDSTPv2 (see Table 1).

A background noise removal method according to another embodiment of the present invention removes background noise from contaminated speech, that is, speech contaminated with background noise, and comprises: step (a) of obtaining a first estimate of the background noise contained in the contaminated speech using Codebook Driven Short-Term Predictor parameter estimation (CDSTP); step (b) of performing a first removal of background noise from the contaminated speech using the first estimate obtained in step (a); step (c) of obtaining, using Minimum Statistics (MS), a second estimate of the residual noise remaining after the CDSTP-based noise removal; and step (d) of performing a second removal, using the second estimate obtained in step (c), of the residual noise remaining after the CDSTP-based noise removal.

In the background noise removal method according to this other embodiment, a Wiener filter is used for the first noise removal.

In addition, in the background noise removal method according to this other embodiment, multi-band spectral subtraction (MBSS) is used for the second noise removal.

According to the background noise removal method of the present invention, using MS as a post-processor for CDSTP in noise environments that were considered when training the noise codebook, and using MS as a pre-processor for CDSTP in noise environments that were not considered during training, each removes noise better than using either algorithm on its own (see Table 1). In particular, in mixed noise environments the method is considerably more robust than the existing algorithms.

Hereinafter, a background noise removal method according to preferred embodiments of the present invention is described in detail with reference to the accompanying drawings.

1. MBSS and MS

MBSS is based on the assumption that the noise varies independently in each frequency band, and it uses an independent subtraction factor for each band. If the noise in each band changes abruptly from one frame to the next, the SS method reaches its limits and partial speech distortion results. MBSS divides the speech spectrum into N non-overlapping bands and removes noise by applying SS to each band independently.

MBSS removes noise as follows. The incoming noise-contaminated speech is first transformed to the frequency domain with an FFT and is then smoothed by applying a weighting window in the frequency domain, as in Equation 1 below.

The quantities in Equation 1 are as follows: the weighting-window value of the i-th frame; M, the number of past and future frames needed for smoothing; the k-th frequency (k = 0, 1, ..., N-1); the start and end frequencies of the i-th frame; the spectral magnitude of the noise-contaminated speech at the i-th frame and k-th frequency; and the smoothed spectral magnitude at the j-th frame and k-th frequency. Using the smoothed spectrum instead of the original spectrum statistically narrows the spread of the spectrum and reduces the variation of the residual noise.
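The smoothing step can be sketched in Python as follows. This is only a minimal illustration, assuming that the smoothed magnitude of a frame is a weighted sum of the magnitudes of the 2M+1 surrounding frames, as in the MBSS literature cited as Reference 3. The function name `smooth_magnitude`, the example weights, and the edge handling (repeating the first and last frames) are hypothetical choices, not taken from the patent.

```python
import numpy as np


def smooth_magnitude(mag, weights):
    """Weighted smoothing of a magnitude spectrogram across frames.

    mag:     array of shape (frames, bins), the noisy magnitude spectrum.
    weights: 2M+1 window values; weights[M] applies to the current frame.
    """
    mag = np.asarray(mag, dtype=float)
    weights = np.asarray(weights, dtype=float)
    M = (len(weights) - 1) // 2
    n_frames = mag.shape[0]
    # Assumption: repeat the edge frames so every frame has 2M+1 neighbours.
    padded = np.pad(mag, ((M, M), (0, 0)), mode="edge")
    out = np.zeros_like(mag)
    for offset, w in zip(range(-M, M + 1), weights):
        start = M + offset
        out += w * padded[start:start + n_frames]
    return out
```

With `weights = [0.25, 0.5, 0.25]` (M = 1), each frame is averaged with its immediate neighbours; weights that sum to one leave a constant spectrum unchanged.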

That is, the spectral magnitude of the i-th frame after noise removal can be obtained as in Equation 2 below.

In Equation 2, the estimated-noise term is the spectral magnitude of the estimated noise at the i-th frame and k-th frequency, and the subtraction factor is a value determined by the SNR.
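A band-wise subtraction of this kind might look like the sketch below. It is an illustration under stated assumptions, not the patent's Equation 2: subtraction is done on power spectra, the subtraction factor is a linear function of the band SNR clipped to the range -5 dB to 20 dB, and a small spectral floor prevents negative power. The name `band_subtract`, the factor mapping, and the floor value are all hypothetical.

```python
import numpy as np


def band_subtract(noisy_mag, noise_mag, band_edges, floor=0.002):
    """Per-band spectral subtraction for one frame.

    noisy_mag, noise_mag: 1-D magnitude spectra of the same length.
    band_edges: list of (start, end) bin indices, end exclusive.
    """
    noisy_pow = np.asarray(noisy_mag, dtype=float) ** 2
    noise_pow = np.asarray(noise_mag, dtype=float) ** 2
    clean_pow = noisy_pow.copy()  # bins outside every band stay untouched
    for b, e in band_edges:
        y, d = noisy_pow[b:e], noise_pow[b:e]
        snr_db = 10.0 * np.log10(max(y.sum(), 1e-12) / max(d.sum(), 1e-12))
        # Assumed mapping: subtract more aggressively when the band SNR is low.
        alpha = 4.0 - 0.15 * np.clip(snr_db, -5.0, 20.0)
        sub = y - alpha * d
        # Spectral floor keeps the result non-negative.
        clean_pow[b:e] = np.where(sub > floor * y, sub, floor * y)
    return np.sqrt(clean_pow)
```

Because each band computes its own factor, a band dominated by noise is attenuated strongly while a band with high SNR is left almost unchanged.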

Because the MS noise estimation method assumes that, within the most recent D frames of the contaminated input signal, the frame with the minimum power contains only noise, the D-frame window must be large enough to include a silent interval in which the speech signal is inactive, yet small enough that the statistical properties of the noise do not change within it. The detailed procedure of the MS noise estimation algorithm is as follows.

First, the periodogram of the noise-contaminated speech is computed. Because this periodogram fluctuates rapidly, it is smoothed: a smoothed power spectrum is computed from it using a smoothing factor.

Then the minimum power spectrum within the D-frame window is found.

The minimum power spectrum obtained with Equation 4 above is compensated with a bias correction factor and is then used to estimate the noise of the current frame.
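The MS procedure described above (smooth the periodogram, take the minimum over the last D frames, apply a bias correction) can be sketched as follows. This is a simplified illustration: it assumes a fixed smoothing factor and a constant bias factor, whereas Reference 4 derives time-varying values for both. The function name and the default values of `alpha`, `D`, and `bias` are hypothetical placeholders.

```python
import numpy as np


def ms_noise_estimate(periodogram, alpha=0.85, D=100, bias=1.5):
    """Minimum-statistics style noise power estimate.

    periodogram: array of shape (frames, bins) holding |Y|^2 per frame.
    Returns an array of the same shape with the estimated noise power.
    """
    periodogram = np.asarray(periodogram, dtype=float)
    n_frames, n_bins = periodogram.shape
    smoothed = np.empty((n_frames, n_bins))
    noise = np.empty((n_frames, n_bins))
    prev = periodogram[0]
    for t in range(n_frames):
        # First-order recursive smoothing of the periodogram.
        prev = alpha * prev + (1.0 - alpha) * periodogram[t]
        smoothed[t] = prev
        # Minimum over the most recent D frames, per frequency bin.
        start = max(0, t - D + 1)
        noise[t] = bias * smoothed[start:t + 1].min(axis=0)
    return noise
```

A short burst of speech raises the smoothed power but not the windowed minimum, so the estimate stays at the noise floor as long as the window still contains a speech-free frame.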

2. CDSTP

In recent years, much research has also addressed the processing of non-stationary background noise. One such approach is the CDSTP algorithm [Reference 5: S. Srinivasan, J. Samuelsson, and W. B. Kleijn, "Codebook driven short-term predictor parameter estimation for speech enhancement," IEEE Trans. Speech Audio Processing, vol. 14, no. 1, pp. 163-176, 2006.]. The CDSTP algorithm represents the spectral shapes of speech and noise by LPC coefficients, stores representative spectral shapes in codebooks in LPC form, and then estimates the noise using those codebooks. It assumes that noise-contaminated speech can be separated into a speech signal and a noise signal, and it uses maximum-likelihood estimation to determine the codebook parameters of each [Reference 6: M. Kuropatwinski and W. B. Kleijn, "Estimation of the excitation variances of speech and noise AR-models for enhanced speech coding," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 1, pp. 669-672, 2001.]. In addition to the spectral shapes of noise and speech, the algorithm also computes their gains.

CDSTP selects candidate speech and noise shape code vectors from the speech and noise spectral-shape codebooks and computes a likelihood value for each candidate pair. It then estimates the codebook entries and gains that maximize the likelihood, which is equivalent to minimizing the Itakura-Saito distortion, and finally uses them as the coefficients of a Wiener filter. CDSTP uses the Itakura-Saito distortion to find suitable code vectors for speech and noise [Reference 7: R. M. Gray, A. Buzo, A. H. Gray, Jr., and Y. Matsuyama, "Distortion measures for speech processing," IEEE Trans. Acoustics, Speech, Signal Processing, vol. 28, no. 4, pp. 367-376, 1980.]. The parameters are estimated by repeating this process for every frame.

The parameters used in CDSTP are the LPC coefficients and the gains. Estimating the gains that go with the noise and speech shape code vectors plays an important role in estimating non-stationary background noise.

FIG. 1 is a basic block diagram of CDSTP. It shows the process of obtaining gain values from the speech and noise shape code vectors and finding the optimal parameters using the Itakura-Saito distortion. The shapes and gains of the noise and speech are estimated from the spectrum of the noise-contaminated input signal and the information stored in the codebooks.

The process of selecting the i-th speech code vector from the speech codebook and the j-th noise code vector from the noise codebook is expressed by Equation 5 below.

Here, the two variance terms denote the excitation variances of speech and noise. Assuming that the pdf in Equation 5 is Gaussian and expressing the likelihood in the log domain gives Equation 6 below, whose terms are the spectrum of the input signal, the spectrum of the i-th speech code vector, and the spectrum of the j-th noise code vector.

Combining Equation 5 and Equation 6 gives Equation 7 below.

Here, the modeled spectrum is the spectrum expressed through the speech and noise code vectors. The distortion term denotes the Itakura-Saito distortion between the spectrum of the original signal and the spectrum of the synthesized signal, and it can be expressed as Equation 8 below.

The gains are obtained in the course of minimizing Equation 8, by solving Equation 9 below.

Here, C and D are given by Equation 10 below. Several criteria can be applied when estimating the short-term predictor (STP) parameters of noise and speech; the focus here is on enhancing the waveform.
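The codebook search can be illustrated with the sketch below. It evaluates every pair of speech and noise spectral shapes and keeps the pair with the smallest Itakura-Saito distortion against the observed spectrum. As a deliberate simplification, the gains are chosen from a coarse grid instead of being solved in closed form as the text describes, and the codebook entries are passed in as precomputed spectral envelopes. The names `itakura_saito`, `search_codebooks`, and `gain_grid` are hypothetical.

```python
import itertools

import numpy as np


def itakura_saito(p_obs, p_model):
    """Discrete Itakura-Saito distortion between two positive power spectra."""
    ratio = p_obs / p_model
    return float(np.mean(ratio - np.log(ratio) - 1.0))


def search_codebooks(p_obs, speech_shapes, noise_shapes, gain_grid):
    """Exhaustive search over code-vector pairs and candidate gains.

    Returns (distortion, (i, j, speech_gain, noise_gain)) for the best match.
    """
    best = (np.inf, None)
    pairs = itertools.product(range(len(speech_shapes)), range(len(noise_shapes)))
    for i, j in pairs:
        for gx, gw in itertools.product(gain_grid, repeat=2):
            if gx == 0 and gw == 0:
                continue  # an all-zero model spectrum is not valid
            model = gx * speech_shapes[i] + gw * noise_shapes[j]
            d = itakura_saito(p_obs, model)
            if d < best[0]:
                best = (d, (i, j, gx, gw))
    return best
```

The distortion is zero only when the modeled spectrum equals the observed one, so the search recovers the correct pair and gains whenever the observation is exactly representable by the codebooks and the grid.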

Once the optimal LPC coefficients and gains have been selected, a Wiener filter as in Equation 11 below can be constructed and applied to noise removal [Reference 8: T. Sreenivas and P. Kirnapure, "Codebook constrained Wiener filtering for speech enhancement," IEEE Trans. Speech Audio Processing, vol. 4, no. 5, pp. 383-389, 1996.].
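A Wiener filter built from the selected parameters might be sketched as follows. The sketch assumes the usual autoregressive model, in which each power spectrum is an excitation variance divided by the squared magnitude of the LPC polynomial, and the gain at each frequency is the ratio of modeled speech power to modeled speech-plus-noise power. The sign convention for the LPC coefficients, the FFT size, and the function names are assumptions, not taken from Equation 11.

```python
import numpy as np


def ar_envelope(lpc, n_fft=256):
    """Spectral envelope 1/|A(w)|^2 for A(z) = 1 + a1*z^-1 + ... + ap*z^-p.

    lpc: coefficients [1, a1, ..., ap] (assumed sign convention).
    """
    A = np.fft.rfft(np.asarray(lpc, dtype=float), n_fft)
    return 1.0 / np.maximum(np.abs(A) ** 2, 1e-12)


def wiener_gain(lpc_speech, var_speech, lpc_noise, var_noise, n_fft=256):
    """Frequency response of a Wiener filter from AR models of speech and noise."""
    p_speech = var_speech * ar_envelope(lpc_speech, n_fft)
    p_noise = var_noise * ar_envelope(lpc_noise, n_fft)
    return p_speech / (p_speech + p_noise)
```

Multiplying the noisy spectrum of a frame by this gain attenuates frequencies where the modeled noise dominates and passes frequencies where the modeled speech dominates.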

3. Proposed noise removal algorithm

FIG. 2 is a block diagram of the background noise removal method according to the present invention.

MS is robust to stationary background noise but relatively weak against non-stationary background noise. CDSTP is robust to non-stationary background noise as well, but it is vulnerable to noise whose shape is not stored in the codebook. The present invention therefore uses CDSTP and MS in combination.

In real-life environments one rarely hears only a single noise. In most cases we are exposed to two, three, or more noises, and speech technologies such as communication on mobile devices and music information retrieval must operate in such environments. Therefore, alongside the single-noise removal considered by existing algorithms, an algorithm combining MS and CDSTP was implemented with the handling of mixed noise in mind. As shown in FIG. 2, the present invention proposes an algorithm that, instead of using a single noise estimation method, chains two noise estimation methods and thereby performs more robustly than the existing algorithms.

3.1 MS+MS

When noise is removed using a noise estimation algorithm such as MS or CDSTP, the resulting speech is improved relative to the noise-contaminated original. However, most algorithms cannot estimate the noise accurately in every frame, so residual noise remains. When the estimated noise is larger than the actual noise, too much is removed in places and musical noise appears; when the estimated noise is smaller than the actual noise, too little is removed and residual noise is left. For this reason, the European Telecommunications Standards Institute (ETSI) recommends the following approach to speech enhancement for speech recognition [Reference 9: ETSI ES 202 050, "Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithms," January 2007.]. A Wiener filter is first applied to the input signal to remove noise in a first pass, and a Wiener filter is then applied again to deal with the residual noise.

In the present invention, MBSS is used in multiple stages in place of the Wiener filter recommended by ETSI, with the MS algorithm serving as the noise estimation method for MBSS. Although the MS algorithm described above has been reported to perform well in a variety of noise environments, it too suffers from residual noise, so the residual noise is removed in an MS + MS arrangement as shown in FIG. 2(a).

3.2 CDSTP+MS

The MS + MS algorithm applies MS a second time to the residual noise that remains after the first MS-based noise removal, and was expected to be more robust than applying MS only once. Contrary to expectation, however, MS + MS brought no improvement in terms of perceptual evaluation of speech quality (PESQ) [Reference 10: A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ) - A new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 2, pp. 749-752, 2001.]. Because MS + MS removes noise a second time by the same method from a signal that has already had that noise removed, the performance gain was marginal.

It therefore appears advantageous, when combining noise removal algorithms, to pair noise estimation algorithms with different characteristics. The present invention proposes two methods built around the CDSTP algorithm. The first combines CDSTP and MS as shown in FIG. 2(b). Here, a Wiener filter is used for noise removal after CDSTP-based noise estimation, and MBSS is used for residual noise removal after MS-based noise estimation.

Because CDSTP holds the characteristics of the noise in a trained codebook, it can be expected to be more robust than the MS algorithm. Even after noise has been removed using the codebook, however, residual noise remains, and MS, a noise estimation algorithm with different characteristics, is used to handle it. The CDSTP + MS method can thus be expected to be more robust than using either MS or CDSTP alone.

3.3 MS+CDSTP

The CDSTP + MS method is designed to make the first noise estimate with the codebook and to estimate the residual noise with MS. It can be expected to perform well when the noise environment of the test speech matches the environment in which the CDSTP codebook was trained. In general, however, the test environment does not match the training environment, and the MS + CDSTP method is proposed for that case. MS + CDSTP first uses MS to remove the noise that the codebook cannot model and then models and removes the remaining noise with CDSTP. Here, MBSS is used for noise removal after MS-based noise estimation, and a Wiener filter is used for residual noise removal after CDSTP-based noise estimation. The strength of the MS + CDSTP algorithm is therefore that it can also handle noise outside the codebook, which is the weakness of CDSTP. In addition, both the CDSTP + MS and MS + CDSTP algorithms are expected to be more robust than other algorithms in mixed noise environments, where two or more different noises are combined.
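The three cascades differ only in which estimator and which removal step are placed first, which the following sketch makes explicit. It is a structural illustration only: `min_tracker` and the template estimator are toy stand-ins for MS and for a one-entry noise codebook, plain subtraction stands in for both MBSS and the Wiener filter, and every name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class Stage:
    estimate: Callable[[np.ndarray], np.ndarray]            # spectrogram -> noise power
    remove: Callable[[np.ndarray, np.ndarray], np.ndarray]  # (spectrogram, noise) -> cleaned


def two_stage_denoise(noisy_pow, first, second):
    noise1 = first.estimate(noisy_pow)           # step (a): first noise estimate
    cleaned1 = first.remove(noisy_pow, noise1)   # step (b): first removal
    noise2 = second.estimate(cleaned1)           # step (c): residual noise estimate
    return second.remove(cleaned1, noise2)       # step (d): second removal


def subtract(power, noise):
    # Stand-in removal step: power subtraction floored at zero.
    return np.maximum(power - noise, 0.0)


def min_tracker(power):
    # Toy stand-in for MS: per-bin minimum over all frames.
    return np.broadcast_to(power.min(axis=0), power.shape)


def make_template_estimator(shape):
    # Toy stand-in for a codebook: scale one fixed spectral shape
    # so that it fits under each frame.
    def estimate(power):
        gain = (power / shape).min(axis=1, keepdims=True)
        return gain * shape
    return estimate


ms_stage = Stage(min_tracker, subtract)
cb_stage = Stage(make_template_estimator(np.array([1.0, 2.0, 1.0])), subtract)
```

Calling `two_stage_denoise(p, ms_stage, cb_stage)` corresponds to the MS + CDSTP ordering and `two_stage_denoise(p, cb_stage, ms_stage)` to CDSTP + MS. With these toy stages, running `ms_stage` twice changes nothing in the second pass, since the per-bin minimum is already zero after the first.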

4. Experiments and results

To evaluate the performance of the proposed algorithms, a total of six experiments were conducted, including existing algorithms. The algorithms used in the experiments are as follows.

1) MS

2) CDSTP

3) MS + MS

4) CDSTP + MS

5) MS + CDSTP

6) MS + CDSTPv2

The basic experimental conditions are as follows. The sampling rate of the speech and background noise was 8 kHz, a Hamming window was used, and the frame length was set to 20 ms. The TIMIT database was used to train the CDSTP speech codebook. From a total of 1680 utterances recorded by 168 speakers in TIMIT (10 sentences per speaker), 10th-order LPC coefficients were extracted and converted to line spectral frequency (LSF) coefficients, and a 10-bit speech codebook was generated using the Generalized Lloyd Algorithm (GLA). White, volvo, machinegun, and babble noise were used as inside noise to train the CDSTP noise codebooks. The spectrogram of each noise is shown in FIG. 3; the LPC orders of the noise codebooks were 6, 6, 16, and 10, and the bit allocations were 3, 3, 4, and 2 bits. Raising the LPC order and bit allocation represents the spectral envelope more accurately but increases complexity, so the values were chosen experimentally as the best trade-off between performance and complexity.
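The framing and codebook figures above translate into concrete sizes, which the following sketch works out. The constant and function names are illustrative, and the pairing of each LPC order and bit allocation with a noise type assumes the two lists follow the order in which the noises are named (white, volvo, machinegun, babble); the text does not state the pairing explicitly.

```python
import math

SAMPLE_RATE_HZ = 8000  # from the text
FRAME_MS = 20          # from the text
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000  # samples per frame


def hamming(n):
    """Standard Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def codebook_entries(bits):
    """A b-bit codebook holds 2**b code vectors."""
    return 2 ** bits


SPEECH_CODEBOOK_BITS = 10  # 10-bit speech codebook, 10th-order LPC

# (LPC order, bits) per noise type; the pairing is an assumption (see lead-in).
NOISE_CODEBOOKS = {
    "white": (6, 3),
    "volvo": (6, 3),
    "machinegun": (16, 4),
    "babble": (10, 2),
}

window = hamming(FRAME_LEN)
speech_entries = codebook_entries(SPEECH_CODEBOOK_BITS)
noise_entries = {name: codebook_entries(bits)
                 for name, (_order, bits) in NOISE_CODEBOOKS.items()}
```

Under these assumptions each frame holds 160 samples, the speech codebook has 1024 entries, and the noise codebooks are far smaller, which keeps the joint search over speech and noise entries manageable.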

Speech and noise that were not used during training were used to evaluate the algorithms. The test speech consisted of 10 sentences from each of 50 speakers (24 male, 26 female) taken from the TIMIT database, for a total of 500 sentences. FIG. 4 shows the spectrograms of F16, polyphonic ringtone, and machinegun+white noise, which are outside noises that were not considered during training and were used only for testing. These noises were mixed with the speech signals at SNRs of 0 dB, 10 dB, and 20 dB.
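Mixing noise into speech at a target SNR, as in the test setup above, amounts to scaling the noise so that the ratio of average powers matches the target. The sketch below shows one common way to do this; the function names are hypothetical, and the text does not say how signal power was measured (for example, whether silent segments were excluded), so whole-signal average power is an assumption.

```python
import math


def average_power(signal):
    """Mean squared amplitude over the whole signal (assumed power measure)."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power equals snr_db, then add."""
    noise_power = average_power(noise)
    if noise_power == 0:
        raise ValueError("noise signal has zero power")
    target_noise_power = average_power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]


def snr_db(speech, mixed):
    """SNR in dB of a mixture, given the clean speech it was built from."""
    added = [m - s for m, s in zip(mixed, speech)]
    return 10 * math.log10(average_power(speech) / average_power(added))


speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed = mix_at_snr(speech, noise, 10)
```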

Table 1. PESQ scores of each algorithm by noise type and SNR

Noise type (SNR)            MS     MS+MS  CDSTP  CDSTP+MS  MS+CDSTP  MS+CDSTPv2

Inside training noise
machinegun (0dB)            2.43   2.41   2.76   2.72      2.44      2.44
machinegun (10dB)           3.06   3.04   3.32   3.30      3.06      3.08
machinegun (20dB)           3.56   3.55   3.72   3.70      3.56      3.60
volvo (0dB)                 3.21   3.24   3.58   3.61      3.26      3.44
volvo (10dB)                3.82   3.84   4.03   4.05      3.82      4.00
volvo (20dB)                4.25   4.22   4.26   4.25      4.22      4.30
white (0dB)                 1.70   1.70   1.76   1.77      1.85      1.86
white (10dB)                2.29   2.30   2.51   2.55      2.54      2.55
white (20dB)                2.93   2.94   3.15   3.2       3.19      3.19
babble (0dB)                1.91   1.90   1.88   1.88      1.94      1.92
babble (10dB)               2.57   2.57   2.61   2.63      2.67      2.67
babble (20dB)               3.21   3.21   3.27   3.30      3.32      3.33

Outside training noise
F16 (0dB)                   2.07   2.07   1.93   1.99      2.09      2.09
F16 (10dB)                  2.70   2.70   2.66   2.72      2.73      2.74
F16 (20dB)                  3.34   3.34   3.31   3.39      3.37      3.37
ringtone (0dB)              1.83   1.83   1.85   1.85      1.86      1.87
ringtone (10dB)             2.49   2.49   2.52   2.52      2.51      2.51
ringtone (20dB)             3.12   3.12   3.16   3.16      3.14      3.14
machinegun + white (0dB)    1.49   1.47   1.36   1.40      1.58      1.55
machinegun + white (10dB)   2.20   2.19   1.95   2.02      2.41      2.34
machinegun + white (20dB)   2.87   2.86   2.74   2.80      3.07      3.04

Table 1 shows the speech quality of each algorithm, computed as perceptual evaluation of speech quality (PESQ) scores [Reference 10: A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ) - A new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 2, pp. 749-752, 2001]. The results are analyzed as follows.

4.1 MS vs MS + MS

When noise is removed using the MS noise estimation algorithm, not all of the noise is removed and some residual noise remains. MS was therefore applied twice to handle the residual noise, and the results were compared. However, as Table 1 shows, applying MS twice brings no significant improvement over applying it once.

4.2 MS vs CDSTP

MS is generally known to be robust to stationary background noise and requires no separate training on the noise signal. CDSTP is robust not only to stationary background noise but also to non-stationary background noise. For inside noise, that is, noise present in the noise codebook, CDSTP generally outperforms MS. This is because CDSTP holds information about the noise to be removed in a codebook of LP (Linear Predictive) spectra, which makes it more robust than MS, which estimates the noise using only the past D frames. The improvement is especially large for machinegun noise, which is a non-stationary background noise.
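The remark that MS estimates noise from only the past D frames can be made concrete with a heavily simplified minimum tracker. This sketch is not the full minimum statistics algorithm: it omits bias compensation and adaptive smoothing, uses a fixed recursive smoothing factor, and the function name and default parameter values are hypothetical.

```python
from collections import deque


def track_minimum(power_frames, D=8, alpha=0.85):
    """Per-bin noise estimate: minimum of the smoothed power over the last D frames.

    power_frames: list of frames, each a list of per-bin power values.
    alpha: fixed recursive smoothing factor (an assumption of this sketch).
    """
    smoothed = None
    history = deque(maxlen=D)  # keeps only the D most recent smoothed frames
    estimates = []
    for frame in power_frames:
        if smoothed is None:
            smoothed = list(frame)
        else:
            smoothed = [alpha * s + (1 - alpha) * p
                        for s, p in zip(smoothed, frame)]
        history.append(smoothed)
        # Minimum over the window, taken independently for each bin.
        estimates.append([min(column) for column in zip(*history)])
    return estimates


# One bin, no smoothing (alpha=0), window of two frames.
estimates = track_minimum([[1.0], [5.0], [6.0], [2.0]], D=2, alpha=0.0)
```

Because the estimate follows the window minimum, a short high-energy burst inside an otherwise quiet window does not raise it, which is consistent with the difficulty MS has with impulsive noise such as machinegun.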

On the other hand, for outside noise environments that are not present in the noise codebook, MS performs better than CDSTP. A notable exception is ringtone noise: although it was not considered during codebook training, CDSTP performs better than MS because the LP spectra of some frames are similar to codebook entries. In other words, when CDSTP is applied in practice, it is not necessary to train the codebook on every kind of noise; training on a selected set of representative noise classes can still be expected to outperform MS.

4.3 Proposed Algorithm

Compared with enhancing speech using CDSTP alone, estimating the noise with the MS algorithm applied before or after CDSTP generally improves performance. This is because adding MS compensates for the weakness of CDSTP with outside noise. For inside noise, the CDSTP+MS method, which runs CDSTP first and handles the remaining residual noise with MS, performed best; for outside noise, the MS+CDSTP method performed best.

Machinegun noise is an exception. Because it is discrete (impulsive) in nature, it cannot be removed by a method such as MS, which measures noise by examining a window of the past D frames, and using CDSTP alone, without MS, gives the best performance.

In addition, the proposed algorithm is more robust than the other algorithms in the machinegun+white environment, which is a mixed noise environment. In a mixed noise environment, the noise that MS can remove is removed first, and CDSTP is then used to remove the machinegun noise that is difficult to remove with MS, so performance is more robust than with the existing algorithms.
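The mixed-noise claim can be checked directly against the machinegun + white rows of Table 1. The sketch below averages the three SNR conditions per algorithm and ranks the algorithms; the scores are copied from Table 1, while the variable names and the use of a plain mean across SNRs as the summary statistic are choices made for this illustration.

```python
# PESQ scores for machinegun + white noise at 0, 10, and 20 dB (from Table 1).
MIXED_PESQ = {
    "MS":         [1.49, 2.20, 2.87],
    "MS+MS":      [1.47, 2.19, 2.86],
    "CDSTP":      [1.36, 1.95, 2.74],
    "CDSTP+MS":   [1.40, 2.02, 2.80],
    "MS+CDSTP":   [1.58, 2.41, 3.07],
    "MS+CDSTPv2": [1.55, 2.34, 3.04],
}


def mean(values):
    return sum(values) / len(values)


# Algorithms ordered from highest to lowest mean PESQ.
ranking = sorted(MIXED_PESQ, key=lambda name: mean(MIXED_PESQ[name]), reverse=True)
best = ranking[0]
gain_over_ms = mean(MIXED_PESQ[best]) - mean(MIXED_PESQ["MS"])
```

On these rows MS+CDSTP ranks first and CDSTP alone ranks last, with MS+CDSTP scoring highest at each of the three SNRs.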

Therefore, the CDSTP+MS method performs best in predictable noise environments, and the MS+CDSTP method performs best when a variety of hard-to-predict noise environments, including mixed noise, must be considered.

In the MS+CDSTP method, MS runs first, which changes the shape of the noise spectrum seen by CDSTP, so it is preferable to train the CDSTP noise codebook on noise signals that have been processed by MS. This variant is named MS+CDSTPv2. As Table 1 shows, MS+CDSTPv2 generally performs better than MS+CDSTP for inside noise and performs similarly for outside noise. It performs worse for mixed noise, because the MS+CDSTPv2 codebook was not designed with the combined machinegun and white noise in mind.

5. Conclusion

The present invention proposes combining MS and CDSTP as a background noise estimation algorithm that is robust to non-stationary noise environments, including mixed noise. For noise environments considered during noise codebook training, using MS as post-processing for CDSTP outperformed using either algorithm on its own; for noise environments not considered during training, using MS as pre-processing for CDSTP did so. Whereas applying the same noise estimation algorithm, MS, twice in succession yields only a marginal improvement, the proposed combination of CDSTP, which is robust in non-stationary environments, with MS, which requires no separate training, yields a considerable improvement in terms of PESQ. In particular, for the mixed noise environment examined in the present invention, the proposed method was much more robust than the other existing algorithms.

The background noise removal method of the present invention is not limited to the embodiments described above and may be modified in various ways within the scope permitted by the technical idea of the present invention.

FIG. 1 is a basic block diagram of CDSTP.

FIG. 3 shows spectrograms of the inside noises: from top to bottom, babble, volvo, white, and machinegun noise.

FIG. 4 shows spectrograms of the outside noises: from top to bottom, ringtone, F16, and machinegun+white noise.

Claims

A background noise removal method for removing background noise from contaminated speech, which is speech contaminated with background noise, the method comprising:

(a) obtaining a first estimate of the background noise contained in the contaminated speech using Minimum Statistics (MS);

(b) performing a primary removal of background noise from the contaminated speech using the first estimate obtained in step (a);

(c) obtaining, using Codebook Driven Short-Term Predictor parameter estimation (CDSTP), a second estimate of the residual noise remaining after the noise removal using the minimum statistics; and

(d) performing a secondary removal, using the second estimate obtained in step (c), of the residual noise remaining after the noise removal using the minimum statistics.

The method of claim 1, wherein Multi-Band Spectral Subtraction (MBSS) is used for the primary noise removal.

The method of claim 2, wherein a Wiener filter is used for the secondary noise removal.

The method of claim 3, wherein the noise removed in the primary removal is used as a codebook for calculating the second estimate.

(a) obtaining a first estimate of the background noise contained in the contaminated speech using Codebook Driven Short-Term Predictor parameter estimation (CDSTP);

(c) obtaining, using Minimum Statistics (MS), a second estimate of the residual noise remaining after the noise removal using the CDSTP; and

(d) performing a secondary removal, using the second estimate obtained in step (c), of the residual noise remaining after the noise removal using the CDSTP;

The method of claim 5, wherein, for the primary noise removal,

The method of claim 6, wherein, for the secondary noise removal,

A computer-readable medium having recorded thereon a program for executing the background noise removal method of any one of claims 1 to 7.