KR20110061781A - Apparatus and method for subtracting noise based on real-time noise estimation

Info

Publication number: KR20110061781A
Application number: KR1020090118290A
Authority: KR
Inventors: 강병옥; 정호영; 이성주; 박기영
Original assignee: 한국전자통신연구원
Priority date: 2009-12-02
Filing date: 2009-12-02
Publication date: 2011-06-10

Abstract

PURPOSE: A speech processing apparatus and method for removing noise based on real-time noise estimation are provided to improve speech recognition performance by removing dynamic noise from noisy input speech based on a real-time noise estimate. CONSTITUTION: An input spectrum estimator (201) estimates an input spectrum, which is the frequency spectrum of an input signal. A noise estimator (202) estimates the power spectral density of the noise included in the input signal based on the input spectrum. An average value estimator (203) estimates the average value of the power spectral density of the input signal from the input spectrum. A Wiener filter calculator (204) calculates a Wiener filter based on the power spectral density of the input signal and the power spectral density of the noise.

Description

Speech processing apparatus and method for removing noise based on real-time noise estimation {Apparatus and Method for Subtracting Noise based on Real-time Noise Estimation}

The present invention relates to a method and apparatus for removing dynamic noise included in input speech in a noisy environment, and more particularly, to a method and apparatus for improving speech recognition performance by removing dynamic noise from noisy input speech based on real-time noise estimation.

The present invention is derived from research conducted as part of the IT Growth Engine Technology Development Project of the Ministry of Knowledge Economy [Project management number: 2006-S-036-04, Project title: Development of large-capacity interactive distributed/embedded processing voice interface technology for new growth engine industries].

With the recent development of information and communication technology and computer technology, interest is growing in speech recognition technology, which provides a human interface to various communication devices and data processing devices. Within this trend, research is being actively conducted to solve the problem of speech recognition performance degrading due to noise. Even a speech recognition system that performs almost perfectly in a clean, noise-free environment often shows a sharp drop in recognition rate in a real environment where noise is mixed with the input speech. The aim is therefore to minimize the influence of the noise added when speech is input, in order to improve speech recognition performance.

Several approaches have been proposed so far for handling noise mixed with input speech in order to improve speech recognition performance. Two representative approaches are a signal-processing-based spectral enhancement technique, which improves the characteristics of the speech in the preprocessing stage of the speech recognition system, and a statistical-model-based model adaptation technique, which adapts the recognition model of the speech recognition system to the noise environment.

The spectral enhancement technique estimates the original clean speech by removing noise from noisy input speech. For example, a noise spectrum estimated in a non-speech section, where no speech is present in the input signal, is subtracted from the spectrum of the input signal in a speech section, where speech is present. One such spectral enhancement technique uses a decision-directed Wiener filter, and a form extended to two stages has been adopted as the ETSI (European Telecommunications Standards Institute) AFE (Advanced Front-End) standard.
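The subtraction described above can be illustrated with a minimal sketch. It is not taken from the patent or from the ETSI AFE standard; the function name `spectral_subtraction` and the `floor` parameter are assumptions introduced here so that no frequency bin becomes negative.

```python
def spectral_subtraction(noisy_psd, noise_psd, floor=0.01):
    """Subtract a noise PSD estimate from a noisy PSD, bin by bin.

    `floor` (an assumption, not from the text) keeps a small fraction of
    the noisy power so that the result never drops below zero.
    """
    return [max(x - n, floor * x) for x, n in zip(noisy_psd, noise_psd)]


# Noise PSD measured in a non-speech section, applied to a speech-section frame.
noisy = [4.0, 1.0, 0.5]
noise = [1.0, 1.0, 1.0]
clean = spectral_subtraction(noisy, noise)
```

Bins where the noise estimate exceeds the observed power fall back to the floor, which is one common way to limit the residual artifacts of plain subtraction.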

The model adaptation technique adapts the speech recognition model to the noise environment instead of modifying the noisy input speech. For example, when a hidden Markov model (HMM) is adopted as the speech recognition model, an HMM trained on a large amount of clean speech is transformed into the form it would have if it had been trained on noisy speech. A representative example is PMC (Parallel Model Combination), which represents clean speech and noise with separate models and combines the two to generate a model of noisy speech. PMC performs better than other techniques for the noise environments covered by the noise model, but it has the disadvantage of requiring a large amount of computation.

FIG. 1 is a diagram illustrating an exemplary structure of a speech processing apparatus that removes noise from input speech using a conventional Wiener filter.

Referring to FIG. 1, the speech processing apparatus (100) includes an input spectrum estimator (101), a non-speech section noise estimator (102), a power spectral density average value estimator (103), a Wiener filter calculator (104), and a Wiener filter performing unit (105).

The input spectrum estimator (101) estimates the spectrum of the input signal (S_in) per unit of time and frequency.

The non-speech section noise estimator (102) estimates whether speech is present in the input signal based on the spectrum estimated by the input spectrum estimator (101), and then estimates the frequency characteristics of the noise in the most recent non-speech section.

The average value estimator (103) estimates the average value of the power spectral density (PSD) from the spectrum estimated by the input spectrum estimator (101).

The Wiener filter calculator (104) designs the currently optimal Wiener filter using the spectrum estimated by the input spectrum estimator (101), the frequency characteristics of the noise in the most recent non-speech section estimated by the non-speech section noise estimator (102), and the average value of the power spectral density estimated by the average value estimator (103).

The Wiener filter performing unit (105) applies the Wiener filter calculated by the Wiener filter calculator (104) and outputs an output signal (S_out) in which noise has been removed from the input signal (S_in) supplied to the speech processing apparatus (100).

The conventional noise removal method and apparatus using a Wiener filter can effectively remove stationary noise. However, the performance improvement is limited in environments where the noise varies strongly, and in particular when dynamic noise is mixed into speech sections.

The present invention has been made in recognition of the above problems. Its object is to provide a speech processing method and apparatus that can effectively remove dynamic noise in speech sections by estimating the power spectral density of the noise in real time in both non-speech sections and speech sections and using the estimate to calculate the Wiener filter.

The objects of the present invention are not limited to the object mentioned above, and other objects not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present invention pertains.

According to an embodiment of the present invention for achieving the above object, there is provided a speech processing apparatus including: an input spectrum estimator that estimates an input spectrum, which is the frequency spectrum of an input signal; a noise estimator that estimates the power spectral density of the noise included in the input signal based on the input spectrum; an average value estimator that estimates the average value of the power spectral density of the input signal from the input spectrum; a Wiener filter calculator that calculates a Wiener filter based on the power spectral density of the noise and the power spectral density of the input signal; and a Wiener filter performing unit that removes the noise included in the input signal using the Wiener filter and thereby calculates an estimate of the speech signal included in the input signal.

In addition, there is provided a speech processing apparatus in which the noise estimator further includes a speech/non-speech section estimator that estimates, based on the input spectrum, whether the current section is one in which the input signal includes a speech signal.

In addition, there is provided a speech processing apparatus in which the noise estimator further includes: a non-speech section noise estimator that estimates a first power spectral density, which is the power spectral density of the noise included in the input signal in a section where the input signal does not include a speech signal; and a speech section noise estimator that estimates a second power spectral density, which is the power spectral density of the noise included in the input signal in a section where the input signal includes a speech signal.

In addition, there is provided a speech processing apparatus in which the speech section noise estimator estimates the second power spectral density based on the first power spectral density estimated by the non-speech section noise estimator and a power spectral density estimated by applying a noise tracking technique to the noise included in the input signal.

In addition, there is provided a speech processing apparatus in which the Wiener filter calculator calculates the Wiener filter based on the first power spectral density and the power spectral density of the input signal in a section where the input signal does not include a speech signal, and based on the second power spectral density and the power spectral density of the input signal in a section where the input signal includes a speech signal.

According to another embodiment of the present invention for achieving the above object, there is provided a speech processing method including: (a) estimating an input spectrum, which is the frequency spectrum of an input signal; (b) estimating the power spectral density of the noise included in the input signal based on the input spectrum; (c) estimating the average value of the power spectral density of the input signal from the input spectrum; (d) calculating a Wiener filter based on the power spectral density of the noise and the power spectral density of the input signal; and (e) removing the noise included in the input signal using the Wiener filter and thereby calculating an estimate of the speech signal included in the input signal.

In addition, there is provided a speech processing method in which step (b) further includes estimating, based on the input spectrum, whether the current section is one in which the input signal includes a speech signal.

In addition, there is provided a speech processing method in which step (b) further includes: estimating a first power spectral density, which is the power spectral density of the noise included in the input signal in a section where the input signal does not include a speech signal; and estimating a second power spectral density, which is the power spectral density of the noise included in the input signal in a section where the input signal includes a speech signal.

In addition, there is provided a speech processing method in which the second power spectral density is estimated based on the first power spectral density and a power spectral density estimated by applying a noise tracking technique to the noise included in the input signal.

In addition, there is provided a speech processing method in which step (d) calculates the Wiener filter based on the first power spectral density and the power spectral density of the input signal in a section where the input signal does not include a speech signal, and based on the second power spectral density and the power spectral density of the input signal in a section where the input signal includes a speech signal.

Specific details of further embodiments of the present invention are included in the following detailed description and drawings.

According to an embodiment of the present invention, the noise added to the input speech is removed efficiently, which reduces speech distortion and residual noise. As a result, the original clean speech is estimated more accurately in the noisy environments where speech recognition systems are actually used, and speech recognition performance is improved.

The advantages and features of the present invention, and the methods for achieving them, will become clear with reference to the embodiments described below in detail together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure of the present invention complete and to fully inform those of ordinary skill in the art to which the present invention pertains of the scope of the invention, and the present invention is defined only by the scope of the claims. In addition, where a detailed description of a related well-known configuration or function is judged likely to obscure the gist of the present invention, that detailed description is omitted.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 2 is a diagram illustrating an exemplary structure of a speech processing apparatus according to an embodiment of the present invention.

Referring to FIG. 2, the speech processing apparatus (200) includes an input spectrum estimator (201), a noise estimator (202), an average value estimator (203), a Wiener filter calculator (204), and a Wiener filter performing unit (205). Such a speech processing apparatus (200) may be used, for example, in the preprocessing stage of a speech recognition system to improve speech quality by removing the noise included in the input speech.

The input spectrum estimator (201) estimates the frequency-domain spectrum of the input signal (S_in). The input spectrum estimator (201) may estimate a discrete frequency spectrum of the input signal (S_in) for each frame. For example, it may perform a discrete Fourier transform (DFT) on the input signal (S_in) in each frame to obtain the component corresponding to each frequency bin of the input signal. Here, the input signal (S_in) is a speech signal with ambient noise added to it, and the spectrum in the current frame t can be expressed as the following equation.

X(t) = S(t) + N(t)

Here, X(t) is the spectrum of the input signal (S_in), S(t) is the spectrum of the original clean speech signal, and N(t) is the spectrum of the noise.
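As an illustration of the per-frame DFT and the additive model for X(t), S(t), and N(t), the following sketch uses only the Python standard library. The helper names `split_frames` and `frame_spectrum`, the frame length, and the test signals are assumptions made for this example; the patent does not specify a frame size, window, or transform implementation.

```python
import cmath
import math


def split_frames(signal, frame_len, hop):
    """Cut a signal into overlapping frames of `frame_len` samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def frame_spectrum(frame):
    """Return the DFT of one frame as a list of complex bins (naive O(n^2))."""
    n = len(frame)
    return [
        sum(frame[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


# Because the DFT is linear, an additive mixture in time is additive per bin.
speech = [math.sin(2 * math.pi * 2 * m / 8) for m in range(8)]
noise = [0.1 * (-1) ** m for m in range(8)]
mixed = [s + v for s, v in zip(speech, noise)]

X = frame_spectrum(mixed)
S = frame_spectrum(speech)
N = frame_spectrum(noise)
```

A practical front end would normally apply a window and use an FFT; the naive transform is used here only to keep the sketch dependency-free.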

The noise estimator (202) estimates, for each frame, the power spectral density of the noise added to the input signal (S_in) based on the input spectrum estimated by the input spectrum estimator (201). The noise estimator (202) includes a speech/non-speech section estimator (211), a non-speech section noise estimator (212), and a speech section noise estimator (213), so that it can estimate the noise according to whether the current frame is a speech section or a non-speech section.

The speech/non-speech section estimator (211) estimates, for each frame, whether a speech signal is present in the input signal (S_in) based on the input spectrum estimated by the input spectrum estimator (201), and thereby estimates whether the current frame corresponds to a speech section or a non-speech section. For example, since the spectrum of the original clean speech signal and the spectrum of the noise can be represented by a probability model such as a complex Gaussian model, speech sections and non-speech sections can be distinguished using voice activity detection (VAD) based on a hard decision, or using a soft decision method.
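A hard-decision VAD of the kind mentioned above could be sketched as follows. This is one common likelihood-ratio formulation under a complex Gaussian model, not the specific detector of the patent, which does not give one. The function name `is_speech_frame`, the threshold value, and the crude a priori SNR estimate are all assumptions.

```python
import math


def is_speech_frame(frame_psd, noise_psd, threshold=0.5):
    """Hard-decision VAD from a mean log-likelihood ratio over frequency bins.

    Per bin: gamma = |X|^2 / noise PSD (a posteriori SNR), and xi is a
    crude a priori SNR estimate, max(gamma - 1, 0) (an assumption).
    """
    total = 0.0
    for x, n in zip(frame_psd, noise_psd):
        gamma = x / n
        xi = max(gamma - 1.0, 0.0)
        total += gamma * xi / (1.0 + xi) - math.log(1.0 + xi)
    return total / len(frame_psd) > threshold


noise_floor = [1.0, 1.0, 1.0, 1.0]
quiet_frame = [1.1, 0.9, 1.0, 1.2]   # close to the noise floor
loud_frame = [10.0, 8.0, 1.0, 6.0]   # well above the noise floor
```

A soft-decision variant would return the likelihood ratio (or a speech presence probability derived from it) instead of comparing it with a threshold.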

When the speech/non-speech section estimator (211) determines that the current frame is a non-speech section, the non-speech section noise estimator (212) estimates the power spectral density of the noise in the current frame.
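The patent does not state how the non-speech section estimate is computed. One widely used choice, shown here only as an assumption, is first-order recursive averaging of the frame power in non-speech frames; the function name `update_noise_psd` and the `forgetting` factor are hypothetical.

```python
def update_noise_psd(prev_noise_psd, frame_psd, forgetting=0.9):
    """Recursively average the frame PSD into the noise PSD, bin by bin.

    Intended to be called only for frames judged to be non-speech.
    `forgetting` close to 1 gives a slowly varying estimate (assumed value).
    """
    return [forgetting * p + (1.0 - forgetting) * x
            for p, x in zip(prev_noise_psd, frame_psd)]


noise_psd = [1.0, 2.0]
noise_psd = update_noise_psd(noise_psd, [3.0, 2.0], forgetting=0.5)
```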

When the speech/non-speech section estimator (211) determines that the current frame is a speech section, the speech section noise estimator (213) estimates the power spectral density of the noise in the current frame. For example, the speech section noise estimator (213) may estimate the final power spectral density of the noise by combining the power spectral density of the noise in the most recent non-speech section with the power spectral density of the noise in the current frame as estimated by a noise tracking technique such as MS (Minimum Statistics) or MCRA (Minima Controlled Recursive Averaging). Here, the noise tracking technique may be run with the power spectral density of the noise estimated in the most recent non-speech section as its initial value. In this case, the power spectral density of the noise for the k-th frequency bin in the current frame t, which is a speech section, may be estimated as in the following equation.

Here, the estimated quantity is the power spectral density of the noise in the current frame t, which is a speech section. It is expressed in terms of the power spectral density of the noise estimated in frame t_n, the most recent non-speech section, the power spectral density of the noise obtained through the noise tracking technique in the current frame t, which is a speech section, and a smoothing factor a.
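The combination of the two noise estimates with the smoothing factor a might look like the sketch below. The equation itself is not reproduced in this text, so the convex-combination form and the choice of which term is weighted by a are assumptions; the function name `speech_section_noise_psd` is hypothetical, and the MS or MCRA tracker that would produce `tracked_psd` is not implemented here.

```python
def speech_section_noise_psd(last_nonspeech_psd, tracked_psd, a=0.7):
    """Blend two per-bin noise PSD estimates for a speech-section frame.

    ASSUMPTION: a convex combination in which `a` weights the estimate from
    the most recent non-speech frame t_n and (1 - a) weights the estimate
    from the noise tracker in the current frame t.
    """
    return [a * p + (1.0 - a) * q
            for p, q in zip(last_nonspeech_psd, tracked_psd)]


last_nonspeech = [1.0, 4.0]   # estimate from frame t_n
tracked = [3.0, 2.0]          # estimate from a noise tracker at frame t
combined = speech_section_noise_psd(last_nonspeech, tracked, a=0.5)
```

With a close to 1 the estimate stays near the last non-speech measurement; with a close to 0 it follows the tracker, which reacts to noise changes during speech.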

The average value estimator (203) estimates the average value of the power spectral density of the input signal (S_in) in the current frame from the input spectrum estimated by the input spectrum estimator (201).

The Wiener filter calculator (204) uses the power spectral density of the noise and the power spectral density of the speech signal to calculate the frequency component, for the k-th frequency bin, of the Wiener filter that is optimal in the current frame t, as in the following equation.

Here, the power spectral density of the noise is selected according to the result of the speech/non-speech section estimator (211) determining whether the current frame t is a speech section or a non-speech section. In a speech section, the power spectral density of the noise estimated by the speech section noise estimator (213) is used, and in a non-speech section, the power spectral density of the noise estimated by the non-speech section noise estimator (212) is used. The power spectral density of the speech signal is obtained by removing the noise estimated through the previous frame from the power spectral density of the input signal (S_in), which in turn is obtained based on the input spectrum estimated by the input spectrum estimator (201) and the average value estimated by the average value estimator (203).
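The selection of the noise estimate and the per-bin filter calculation can be sketched as follows. The gain formula used here is the textbook Wiener gain, speech PSD divided by the sum of speech PSD and noise PSD; the patent's own equation is not reproduced in this text, so treating it as this standard form is an assumption. The function names and the clamping of the speech PSD at zero are also assumptions.

```python
def estimate_speech_psd(input_psd, prev_noise_psd):
    """Remove the previous frame's noise estimate from the input PSD."""
    return [max(x - n, 0.0) for x, n in zip(input_psd, prev_noise_psd)]


def wiener_filter(input_psd, first_noise_psd, second_noise_psd,
                  prev_noise_psd, is_speech):
    """Per-bin Wiener gain using the noise PSD chosen by the VAD decision.

    first_noise_psd:  estimate from the non-speech section noise estimator
    second_noise_psd: estimate from the speech section noise estimator
    """
    noise_psd = second_noise_psd if is_speech else first_noise_psd
    speech_psd = estimate_speech_psd(input_psd, prev_noise_psd)
    return [s / (s + n) if s + n > 0.0 else 0.0
            for s, n in zip(speech_psd, noise_psd)]


input_psd = [4.0, 1.0]
gains_speech = wiener_filter(input_psd, [1.0, 1.0], [2.0, 2.0], [1.0, 1.0], True)
gains_nonspeech = wiener_filter(input_psd, [1.0, 1.0], [2.0, 2.0], [1.0, 1.0], False)
```

Each gain lies between 0 and 1: bins dominated by speech pass almost unchanged, while bins dominated by noise are attenuated.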

The Wiener filter performing unit (205) filters the input signal (S_in) supplied to the speech processing apparatus (200) with the Wiener filter calculated by the Wiener filter calculator (204) to calculate an estimate of the speech signal included in the input signal (S_in), and thereby outputs an output signal (S_out) in which noise has been removed from the input signal (S_in).
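Applying the filter can be illustrated in the frequency domain: each complex bin of the input spectrum is scaled by the corresponding gain, and the result is transformed back to the time domain. This is a minimal sketch with hypothetical function names; the patent does not say whether filtering is done per bin in the frequency domain or with a time-domain filter, so that choice is an assumption.

```python
import cmath


def apply_wiener_filter(spectrum, gains):
    """Scale each complex frequency bin by its real-valued Wiener gain."""
    return [g * x for g, x in zip(gains, spectrum)]


def inverse_dft(spectrum):
    """Naive inverse DFT; returns the real part of each output sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * m / n)
             for k in range(n)) / n).real
        for m in range(n)
    ]


filtered = apply_wiener_filter([1 + 1j, 2 + 0j], [0.5, 0.0])
samples = inverse_dft([4 + 0j, 0j, 0j, 0j])
```

For a real-valued output, the gains should be symmetric across the positive and negative frequency bins, as the spectrum of a real input signal is.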

Embodiments of the present invention may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The medium may be specially designed and constructed for the present invention, or it may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. The medium may also be a transmission medium, such as an optical line, a metal line, or a waveguide, that includes a carrier wave transmitting a signal specifying program instructions, data structures, and the like. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above are illustrative in all respects and should not be understood as limiting.

Claims

A speech processing apparatus comprising:

an input spectrum estimator that estimates an input spectrum, which is the frequency spectrum of an input signal;

a noise estimator that estimates the power spectral density of the noise included in the input signal based on the input spectrum;

an average value estimator that estimates the average value of the power spectral density of the input signal from the input spectrum;

a Wiener filter calculator that calculates a Wiener filter based on the power spectral density of the noise and the power spectral density of the input signal; and

a Wiener filter performing unit that removes the noise included in the input signal using the Wiener filter and thereby calculates an estimate of the speech signal included in the input signal.