KR100931181B1

KR100931181B1 - Method of processing noise signal and computer readable recording medium therefor

Info

Publication number: KR100931181B1
Application number: KR1020080008616A
Authority: KR
Inventors: 정성일; 신옥근; 하동경; 유강주; 김형순; 양성일
Original assignee: 한양대학교 산학협력단
Priority date: 2008-01-28
Filing date: 2008-01-28
Publication date: 2009-12-10
Also published as: KR20090082699A

Abstract

Disclosed is a procedure for processing a noisy speech signal that includes a new recursive-averaging-based noise estimation in a frequency domain such as the wavelet packet transform domain or the Fourier transform domain. According to an embodiment of the present invention, the scaling factor for noise estimation is defined as the ratio of the magnitude of the wavelet packet transform coefficients (WPTCs) of the estimated noise to the magnitude of the WPTCs of the noisy speech signal. Unlike conventional recursive-averaging-based noise estimation, the embodiment estimates noise by giving a higher weight to the current input noisy speech signal than to the previously estimated noise signal. The embodiment can therefore track fluctuations of non-stationary noise quickly and accurately. In addition, the embodiment determines whether a frame is a silent frame by using the WPTC magnitude ratio between adjacent nodes at the node level and reflects the result directly in the estimation of the scaling factor, so the scaling factor can be estimated accurately. As a result, the embodiment can prevent a time lag from occurring in noise-dominant frames and can also prevent residual musical noise and speech distortion.

Noise estimation, recursive averaging, uniform wavelet packet transform, Fourier transform, silent frame

Description

Method for processing noisy speech signals and computer-readable recording medium therefor {Procedure for processing noisy speech signals and program therefor}

The present invention relates to the processing of noisy speech signals and, more specifically, to a procedure for processing a noisy speech signal that can efficiently remove background noise even in a non-stationary noise environment, and to a computer-readable recording medium therefor.

Because speakerphones facilitate communication among multiple parties and can also provide a hands-free setting for an individual user, they have become an essential part of many communication devices. With recent advances in wireless communication technology, communication devices for video calls have also become widespread. Hearing aids, too, have been developed and distributed to help people with impaired hearing. Speakerphones, hearing aids, and video-call devices all include a noisy speech signal processing apparatus that removes background noise from a noisy speech signal, that is, a speech signal mixed with noise, so that only the speech signal is processed.

The performance of such a noisy speech processing apparatus strongly affects the performance of the speech-based application device that contains it, because background noise almost always contaminates the speech signal and can sharply degrade speech-based applications such as speech codecs, cellular telephony, and speech recognition. Research on improving noisy speech signal processing apparatuses by minimizing the influence of background noise is therefore actively under way.

To improve sound quality in a single channel in which noise and speech coexist, the noise component must be removed efficiently without damaging the speech component of the noisy speech signal. Most noisy speech processing procedures therefore include, as a basic step, a noise estimation procedure that obtains the noise component of the noisy speech signal. The estimated noise signal is then used to remove the noise component from the noisy speech signal. Removing noise by subtracting the spectrum of the estimated noise signal from the spectrum of the noisy speech signal is generally called spectral subtraction.
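The spectral subtraction described above can be sketched as follows. This is a minimal illustration rather than the patented procedure: the function name `spectral_subtract` and the `floor` parameter are assumptions, and clamping negative results to a floor is a common convention that the text does not specify.

```python
def spectral_subtract(noisy_mag, noise_mag, floor=0.0):
    """Subtract estimated noise magnitudes from noisy magnitudes, bin by bin.

    `floor` is an assumed parameter: it keeps a bin from going negative
    when the noise estimate exceeds the noisy magnitude in that bin.
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same length")
    return [max(y - n, floor) for y, n in zip(noisy_mag, noise_mag)]


# The second bin has an overestimated noise level and is clamped to the floor.
enhanced = spectral_subtract([4.0, 2.0, 1.0], [1.0, 2.5, 0.5])
```

An overestimated noise bin removes speech energy along with the noise, while an underestimated one leaves residual noise behind, which is why the quality of the noise estimate dominates the result.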

In a noisy speech signal processing apparatus that uses spectral subtraction, accurate noise estimation matters above all. Accurately estimating noise from a noisy speech signal in real time, however, is far from easy, and obtaining a clean speech signal by accurately estimating noise from a noisy speech signal contaminated in various non-stationary environments is very difficult. Inaccurate noise estimation can cause two kinds of side effects. If the estimated noise is lower than the actual noise, annoying residual noise or residual musical noise can be perceived in the enhanced speech signal. If the estimated noise is higher than the actual noise, speech distortion occurs in the enhanced speech signal.

Many methods have been proposed for accurate noise estimation. A fairly intuitive and direct one is the voice activity detection (VAD) based method, which estimates noise from statistics obtained from previous noise frames. A noise frame is a silent frame (speech-absent frame) that contains no speech. When the background noise is non-stationary or level-varying, however, the conventional VAD-based method relies on past statistics and therefore has difficulty obtaining reliable information about the current noise level.

Several new methods have been proposed to overcome the shortcomings of the VAD-based method. A well-known approach is the minimum statistics (MS) algorithm, which tracks the minimum of the smoothed power spectrum of the noisy speech signal over a search window covering roughly the most recent 1.5 seconds of frames. The MS algorithm generally performs well, but it cannot track changes in the noise level quickly, particularly in a noise-dominant signal, in which noise makes up most of the signal. The result is a time lag in noise estimation.
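The time lag of minimum tracking can be seen in a small sketch. It is a heavily simplified, single-bin illustration and not the full MS algorithm (which also includes steps such as bias compensation); the class name `MinimumTracker`, the smoothing constant `alpha`, and the window length in frames are all assumptions chosen for the example.

```python
from collections import deque


class MinimumTracker:
    """Track the minimum of a recursively smoothed power value over a sliding window."""

    def __init__(self, window_frames, alpha=0.9):
        self.alpha = alpha        # assumed smoothing constant
        self.smoothed = None      # smoothed power of the previous frame
        self.history = deque(maxlen=window_frames)  # the search window

    def update(self, power):
        # First-order recursive smoothing of the power in this bin.
        if self.smoothed is None:
            self.smoothed = power
        else:
            self.smoothed = self.alpha * self.smoothed + (1.0 - self.alpha) * power
        self.history.append(self.smoothed)
        # The noise estimate is the smallest smoothed value still in the window.
        return min(self.history)


tracker = MinimumTracker(window_frames=3, alpha=0.5)
# The noise power jumps from about 2 to 10 at the third frame.
estimates = [tracker.update(p) for p in [4.0, 2.0, 10.0, 10.0, 10.0]]
```

In this run the estimate stays at 3.0 for two frames after the jump and only starts to rise once the old minimum leaves the window, which is the lag the text describes.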

Various modified MS algorithms have been proposed to address this shortcoming. Most of them share two features. First, an indicator is used continuously to distinguish whether speech is present or absent in the frame or frequency bin under consideration. Second, a recursive averaging (RA) based noise estimator is used.

The modified MS algorithms reduced the time lag in noise estimation shown by the original MS algorithm to some extent, but they could not eliminate it. This is because a modified MS algorithm works in essentially the same way as the MS algorithm: when it estimates the noise of the current frame, the estimated noise signal of the previous frame is reflected with a large weight. Consequently, neither the conventional MS algorithm nor the modified MS algorithms can quickly and accurately estimate background noise whose level changes abruptly, especially in noise-dominant frames.

An object of the present invention is to solve the problems arising from the conventional noisy speech signal processing procedures described above by providing a sound quality improvement method that can effectively improve sound quality even under non-stationary noise conditions and various types of noise, and in particular can effectively suppress residual musical noise, together with a computer-readable recording medium therefor.

Another object of the present invention is to provide a method of processing a noisy speech signal that can track changes in noise quickly and accurately even for a noise-dominant signal and can effectively prevent a time lag from occurring, together with a computer-readable recording medium therefor.

Yet another object of the present invention is to provide a method of processing a noisy speech signal that can prevent the speech distortion caused by overestimating the noise level in a signal that is mostly speech, together with a computer-readable recording medium therefor.

To solve the problems described above, the present invention proposes an adaptive noise estimation procedure in a frequency domain such as the uniform wavelet packet transform domain or the Fourier transform domain, and a method and apparatus for processing a noisy speech signal using that procedure. The proposed algorithm has two main features: RA-based noise estimation that gives a large weight to the noisy speech signal of the current frame even in noise-dominant frames, and continuous scaling factor refinement through accurate determination of silent frames.

Embodiments of the present invention also use RA-based noise estimation and in that respect have something in common with the conventional MS algorithm and the weighted average (WA) algorithm. In terms of the scaling factor, however, the RA-based noise estimation of the embodiments differs from the conventional RA-based noise estimation generally used for sound quality improvement. In the conventional MS and WA algorithms, RA-based noise estimation gives a larger weight to the noise estimate of the previous frame and a smaller weight to the input signal, that is, the noisy speech signal of the current frame. The embodiments do the opposite: when the ratio of noise to the noisy speech signal is at or above a predetermined threshold, a larger weight is assigned to the current input noisy speech signal, which enables fast and accurate noise estimation and prevents a time lag. When the ratio of noise to the noisy speech signal is at or below the threshold, the noise is estimated from the current input noisy speech signal alone, which prevents the noise from being overestimated.

Regarding scaling factor refinement, the embodiments update the scaling factor continuously whenever the frequency-domain node under consideration (for example, a wavelet packet node) is determined to be speech-absent. In deciding whether a segment is silent, the embodiments use the magnitude ratio of transform coefficients between adjacent nodes, for example wavelet packet transform coefficients (WPTCs), rather than power or entropy, which improves the accuracy of the silence decision.

In general, detecting the absence of speech helps greatly in estimating noise. An incorrect silent-frame decision, however, can introduce a large error into the noise estimate, so the decision must be accurate. The embodiments exploit the property that, for a noise signal, the magnitudes of the transform coefficients are similar between neighboring nodes in the frequency domain, for example the wavelet packet transform domain. More specifically, to examine whether speech is present in a frame, the present invention considers the ratio between the magnitudes of the transform coefficients (for example, WPTCs) at the current node and at its neighboring node in the frequency domain. The embodiments can thereby improve the accuracy of deciding whether the current frame is a silent frame even in a highly non-stationary noise environment.

One embodiment of the present invention for solving the above problems relates to a procedure for processing a noisy speech signal. The procedure transforms the input noisy speech signal of the current frame into the frequency domain to generate a transform signal composed of transform coefficients, and estimates noise using the transform signal as follows: if the scaling factor is smaller than a predetermined threshold, a first noise is estimated using only the input noisy speech signal; if the scaling factor is equal to or greater than the threshold, a second noise is estimated using both the input noisy speech signal and the noise estimate of the previous frame, with a larger weight assigned to the input noisy speech signal than to the noise estimate. The procedure then performs spectral subtraction using the estimated noise to obtain an enhanced speech signal. Inverse-transforming the enhanced speech signal back to the time domain yields a speech signal from which the noise has been removed.

According to one aspect of this embodiment, the transform signal may be generated by applying a uniform wavelet packet transform to the input noisy speech signal. In this case, the procedure may further include, before the noise estimation, smoothing the transform signal using the smoothed transform signal of the previous frame and the magnitude of the wavelet packet transform signal of the current frame at an arbitrary fixed wavelet packet tree level, and the noise estimation step may use the smoothed transform signal.
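The text states which quantities enter the smoothing step but gives no formula for it here. The sketch below therefore assumes a first-order recursive form, S_i(j) = beta * S_{i-1}(j) + (1 - beta) * |X_i(j)|; that form, the function name `smooth_magnitudes`, and the smoothing constant `beta` are all assumptions made for illustration.

```python
def smooth_magnitudes(prev_smoothed, coeffs, beta=0.5):
    """Smooth coefficient magnitudes across frames (assumed first-order form).

    prev_smoothed: smoothed magnitudes of the previous frame, one per coefficient
    coeffs:        transform coefficients of the current frame at the fixed tree level
    beta:          assumed smoothing constant between 0 and 1
    """
    if len(prev_smoothed) != len(coeffs):
        raise ValueError("frames must have the same number of coefficients")
    return [beta * s + (1.0 - beta) * abs(c) for s, c in zip(prev_smoothed, coeffs)]


# Coefficients may be negative; only their magnitudes are smoothed.
smoothed = smooth_magnitudes([2.0, 4.0], [-4.0, 0.0], beta=0.5)
```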

According to another aspect of this embodiment, the scaling factor is estimated as the ratio of the sum of the magnitudes of the transform coefficients of the noise signal estimated in the previous frame to the sum of the magnitudes of the transform coefficients of the noisy speech signal in the current frame, and the scaling factor may be set to 1 when this ratio is greater than 1. The estimated first noise may be obtained as the product of the scaling factor and the input noisy speech signal. The estimated second noise may be obtained as the sum of two products: the scaling factor times the input noisy speech signal, and (1 - the scaling factor) times the noise estimate of the previous frame.
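The scaling factor and the two noise estimates described above can be sketched as follows, working on coefficient magnitudes. The function names and the default `threshold=0.5` are assumptions: the text only says the threshold is predetermined. The value 0.5 is chosen because the second estimate weights the input by the scaling factor and the previous estimate by (1 - scaling factor), so the input carries the larger weight exactly when the scaling factor exceeds 0.5.

```python
def scaling_factor(prev_noise, noisy):
    """Ratio of summed previous-noise magnitudes to summed noisy magnitudes, capped at 1."""
    num = sum(abs(n) for n in prev_noise)
    den = sum(abs(y) for y in noisy)
    if den == 0.0:
        return 1.0  # assumed convention for an all-zero input frame
    return min(num / den, 1.0)


def estimate_noise(noisy, prev_noise, threshold=0.5):
    """Return the noise magnitude estimate for the current frame."""
    sf = scaling_factor(prev_noise, noisy)
    if sf < threshold:
        # First noise: scaled copy of the current input only.
        return [sf * abs(y) for y in noisy]
    # Second noise: recursive average that favors the current input.
    return [sf * abs(y) + (1.0 - sf) * abs(n) for y, n in zip(noisy, prev_noise)]


speech_dominant = estimate_noise([4.0, 4.0], [1.0, 1.0])  # scaling factor 0.25
noise_dominant = estimate_noise([2.0, 2.0], [1.0, 2.0])   # scaling factor 0.75
```

In the speech-dominant case the previous estimate affects the result only through the scaling factor, which keeps a loud speech frame from inflating the noise estimate. In the noise-dominant case the estimate follows the current frame closely, which is what lets it track abrupt level changes.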

According to yet another aspect of this embodiment, the noise estimation further determines whether the current frame is a silent frame, and the silent-frame decision may use the magnitude ratio of the transform coefficients between adjacent nodes. In this case, the decision may compare a first magnitude ratio, taken between the transform coefficients of adjacent nodes in the current frame, with a second magnitude ratio, taken between the transform coefficients of adjacent nodes in a reference frame that is a silent frame. The frame may be determined to be silent if the ratio between the first and second magnitude ratios is close to 1. Furthermore, the current frame may be determined not to be a silent frame only when the ratio between the first and second magnitude ratios indicates a speech-present frame at three or more consecutive nodes.
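A minimal sketch of this silent-frame decision follows. It assumes each node is summarized by one positive magnitude, and it treats each adjacent-node pair as one position when counting consecutive speech indications. The names, the tolerance `tol` that defines "close to 1", and the guard value `eps` are assumptions; the text does not give numeric values.

```python
def adjacent_ratios(node_mags, eps=1e-12):
    """Magnitude ratio between each node and its right-hand neighbor."""
    return [a / max(b, eps) for a, b in zip(node_mags, node_mags[1:])]


def is_silent_frame(cur_mags, ref_mags, tol=0.5, min_run=3, eps=1e-12):
    """Compare current-frame ratios with those of a known silent reference frame.

    The frame is declared non-silent only if `min_run` or more consecutive
    positions have a ratio-of-ratios farther than `tol` from 1.
    """
    run = 0
    for cur, ref in zip(adjacent_ratios(cur_mags), adjacent_ratios(ref_mags)):
        q = cur / max(ref, eps)  # ratio between first and second magnitude ratios
        run = run + 1 if abs(q - 1.0) > tol else 0
        if run >= min_run:
            return False
    return True


reference = [1.0, 1.0, 1.0, 1.0, 1.0]                            # silent reference frame
louder_noise = is_silent_frame([2.0, 2.0, 2.0, 2.0, 2.0], reference)
speech_like = is_silent_frame([1.0, 8.0, 1.0, 8.0, 1.0], reference)
```

Because only ratios between neighbors are compared, a uniform change in noise level leaves the decision unchanged, which is the property that makes the test usable when the noise level varies.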

According to yet another aspect of this embodiment, when the current frame is determined to be a silent frame, the scaling factor is updated using the information of the current frame and the noise is then estimated using the updated scaling factor; when the current frame is determined not to be a silent frame, the noise may be estimated directly.

Another embodiment of the present invention for achieving the above technical objects relates to a method of determining a silent frame in a noisy speech signal. The method transforms the input noisy speech signal of the current frame into the frequency domain to generate a transform signal composed of transform coefficients, and determines whether the current frame is a silent frame using the magnitude ratio of the transform coefficients between adjacent nodes of the transform signal at a transform level of a predetermined order.

According to one aspect of this embodiment, the transform signal may be generated by applying a uniform wavelet packet transform to the input noisy speech signal. In this case, the silent-frame decision may compare a first magnitude ratio, taken between the wavelet packet transform coefficients of adjacent nodes in the current frame, with a second magnitude ratio, taken between the wavelet packet transform coefficients of adjacent nodes in a reference frame that is a silent frame.

Yet another embodiment of the present invention for achieving the above technical objects is a procedure for processing a noisy speech signal, comprising: generating a transform signal by applying a uniform wavelet packet transform to the input noisy speech signal of the current frame; obtaining a smoothed transform signal by smoothing the transform signal using the smoothed transform signal of the previous frame and the magnitude of the transform signal of the current frame at an arbitrary fixed wavelet packet tree level; determining whether the current frame is a silent frame using the magnitude ratio of the wavelet packet transform coefficients between adjacent nodes of the smoothed transform signal at the fixed wavelet packet transform level; estimating noise using the smoothed transform signal, after updating the scaling factor if the current frame is a silent frame, or directly if the current frame is not a silent frame; obtaining an enhanced speech signal by performing spectral subtraction using the estimated noise; and applying an inverse uniform wavelet packet transform to the enhanced speech signal.

Yet another embodiment of the present invention for achieving the above technical objects is a computer-readable recording medium storing a program that controls a computer to generate an enhanced speech signal from the input noisy speech signal of the current frame. The program causes the computer to execute: a process of transforming the input noisy speech signal into the frequency domain to generate a transform signal composed of transform coefficients; a process of estimating noise using the transform signal, in which a first noise is estimated using only the input noisy speech signal if the scaling factor is smaller than a predetermined threshold, and a second noise is estimated using both the input noisy speech signal and the noise estimate of the previous frame, with a larger weight assigned to the input noisy speech signal than to the noise estimate, if the scaling factor is equal to or greater than the threshold; a process of obtaining an enhanced speech signal by performing spectral subtraction using the estimated noise; and a process of inverse-transforming the enhanced speech signal to the time domain.

Yet another embodiment of the present invention for achieving the above technical objects is a computer-readable recording medium storing a program designed to control a computer to determine, from the input noisy speech signal of the current frame, whether the current frame is a silent frame. The program causes the computer to execute: a process of transforming the input noisy speech signal of the current frame into the frequency domain to generate a transform signal composed of transform coefficients; and a process of determining whether the current frame is a silent frame using the magnitude ratio of the transform coefficients between adjacent nodes of the transform signal at a transform level of a predetermined order.

Yet another embodiment of the present invention for achieving the above technical objects is a computer-readable recording medium storing a program that controls a computer to generate an enhanced speech signal from the input noisy speech signal of the current frame. The program causes the computer to execute: a process of generating a transform signal by applying a uniform wavelet packet transform to the input noisy speech signal; a process of obtaining a smoothed transform signal by smoothing the transform signal using the smoothed transform signal of the previous frame and the magnitude of the transform signal of the current frame at an arbitrary fixed wavelet packet tree level; a process of determining whether the current frame is a silent frame using the magnitude ratio of the wavelet packet transform coefficients between adjacent nodes of the smoothed transform signal at the fixed wavelet packet transform level; a process of estimating noise using the smoothed transform signal, after updating the scaling factor if the current frame is a silent frame, or directly if the current frame is not a silent frame; a process of obtaining an enhanced speech signal by performing spectral subtraction using the estimated noise; and a process of applying an inverse uniform wavelet packet transform to the enhanced speech signal.

In recursive averaging (RA) based noise estimation, the previous noise estimate is information obtained from past frames, and the input noisy speech signal is information obtained from the current frame. According to the embodiments, when the ratio of the previous frame's noise estimate to the input noisy speech signal, that is, the scaling factor, is greater than a predetermined threshold, the noise of the current frame is estimated by giving a higher weight to the signal present in the current frame than to the noise estimated from the previous frame. The noise estimator can therefore track changes in the noise contained in the noisy speech signal of the current frame more quickly and adaptively. Because the embodiments can accurately and adaptively estimate rapidly changing noise even when noise makes up more than a certain proportion of the signal and the environment is highly non-stationary or level-varying, they can solve the fundamental limitation of the conventional MS and WA algorithms, namely the residual musical noise caused by the time lag.

Further, according to an embodiment of the present invention, when the ratio of the previous frame's estimated noise to the input noisy speech signal (the scaling factor) is less than or equal to the threshold, the noise is estimated using only the input noisy speech signal of the current frame. As a result, the noise estimation procedure according to the embodiment prevents the noise level from being overestimated in a noisy speech signal with a high proportion of speech, and thereby prevents speech distortion.

In addition, according to an embodiment of the present invention, whether the current frame is a silent frame is determined using the magnitude ratio of the transform coefficients (for example, WPTCs) between neighboring nodes in a frequency domain such as the wavelet packet transform domain. Because this silent frame determination method exploits a fundamental property of noise signals that does not change across a variety of non-stationary noise environments, it further improves the accuracy of silent frame determination. The latest information obtained from a silent frame determined with this improved accuracy is used to redefine the scaling factor used in noise estimation and, where necessary, to update the information of the reference frame used for silent frame determination. The method therefore allows the processing of the noisy speech signal to use information from a previous frame that is closer to the current frame, which further improves the accuracy of noise estimation.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The embodiments described below are intended to illustrate the technical idea of the present invention, and the technical idea should not be construed as being limited by them. The reference numerals attached to the components in the description and drawings are given merely for convenience of description.

The embodiments described below are explained only for the case in which a wavelet packet transform is applied as the frequency transform. It will be apparent to those skilled in the art, however, that the embodiments apply equally when a Fourier transform is used instead of a wavelet packet transform, so a detailed description of the Fourier transform case is omitted below.

FIG. 1 is a flowchart showing a procedure for processing a noisy speech signal according to an embodiment of the present invention. Referring to FIG. 1, the procedure includes a uniform wavelet packet transform step for the input noisy speech signal (Uniform Wavelet Packet Transform, S10), a smoothing step (Smoothing, S20), a noise estimation step (Noise Estimation, S30), a modified spectral subtraction step (Modified Spectrum Subtraction, S40), and an inverse uniform wavelet packet transform step (Inverse Uniform Wavelet Packet Transform, S50). Each step of this embodiment, which processes the input noisy speech signal and outputs enhanced speech, is described in more detail below.

The input noisy speech signal y(l) can be expressed as the sum of clean speech and additive noise, as in Equation 1 below. In Equation 1, l, s(l), and n(l) denote the discrete time index, the clean speech, and the additive uncorrelated noise, respectively.

A uniform wavelet packet transform is then performed on the input noisy speech signal y(l) to generate a transform signal in the wavelet packet transform domain (S10). The transform signal consists of the transform coefficients in the uniform wavelet packet transform domain, and its structure is shown in FIG. 2. Referring to FIG. 2, with the total tree level denoted J, the level at which no wavelet packet transform has been applied is level 0, and the number of nodes at level 0 is taken to be 1. Each wavelet packet transform stage increases the tree level by 1 and doubles the number of nodes, so the number of nodes at the j-th tree level (0 ≤ j ≤ J-1) is 2^j. Each node holds one or more transform coefficients, and every node holds the same number of coefficients. The wavelet packet coefficients of the input noisy speech signal, which have this uniform band structure, can be expressed by Equation 2 below.
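The tree structure described above can be illustrated with a minimal sketch. The Haar filter pair, the function names `haar_split` and `uniform_wpt`, and the 64-sample test frame are assumptions made for illustration only; the source does not specify the wavelet. The nodes are returned in natural (filter-bank) order, which is not necessarily increasing frequency order.

```python
import math

def haar_split(x):
    """Split a sequence into Haar low-pass and high-pass halves."""
    s = math.sqrt(2.0)
    low = [(x[2 * n] + x[2 * n + 1]) / s for n in range(len(x) // 2)]
    high = [(x[2 * n] - x[2 * n + 1]) / s for n in range(len(x) // 2)]
    return low, high

def uniform_wpt(frame, level):
    """Return the 2**level nodes of a full (uniform) packet tree.

    The frame length is assumed to be divisible by 2**level.
    """
    nodes = [list(frame)]              # level 0: a single node
    for _ in range(level):
        next_nodes = []
        for node in nodes:             # every node is split, so the tree stays uniform
            low, high = haar_split(node)
            next_nodes.extend([low, high])
        nodes = next_nodes
    return nodes

frame = [float(n % 7) for n in range(64)]   # hypothetical 64-sample frame
nodes = uniform_wpt(frame, 3)               # fixed tree level j = 3
```

With j = 3 the sketch yields 2^3 = 8 nodes of 8 coefficients each, matching the statement that every node holds the same number of coefficients.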

Here, i, j, k (0 ≤ k ≤ K-1), and m (m = 0, 1, …, M-1) denote the frame index, the wavelet packet tree level index, the node index, and the coefficient bin index within each node, respectively. S_{i,j,k}(m) denotes the wavelet packet transform coefficients of the clean speech, and N_{i,j,k}(m) denotes the wavelet packet transform coefficients of the noise.

Because the embodiment of the present invention handles the wavelet packet transform signal at a single fixed tree level (for example, j = 3), the wavelet packet tree level index j can be dropped from the transform signal. With the subscript j removed, Equation 2 can equivalently be written as Equation 3.
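For reference, Equations 1 to 3 can be written out from the definitions in the text: additive mixing in the time domain and, because the wavelet packet transform is linear, the same additive form for the coefficients. This is a reconstruction from the surrounding definitions rather than a reproduction of the original formula layout.

```latex
\begin{align}
y(l) &= s(l) + n(l) \tag{1}\\
Y_{i,j,k}(m) &= S_{i,j,k}(m) + N_{i,j,k}(m) \tag{2}\\
Y_{i,k}(m) &= S_{i,k}(m) + N_{i,k}(m) \tag{3}
\end{align}
```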

Next, smoothing is performed on the uniform wavelet packet transform signal (S20). In general, the wavelet packet coefficients Y_{i,k}(m) have sharp peaks and valleys along the time axis, that is, across the coefficient bin index m within each node, and handling them carelessly often leaves residual noise in the enhanced speech signal. The smoothing procedure in this step is intended to reduce that risk during noise estimation.

Equation 4 below expresses the smoothing step according to an embodiment of the present invention, for the case in which recursive averaging is used to remove the sharp peaks and valleys. In Equation 4, α_X (0 < α_X < 1) and X_{i,k}(m) denote the smoothing factor and the smoothed wavelet packet transform coefficient, respectively. As Equation 4 shows, the embodiment uses the magnitude of the wavelet packet transform signal rather than its power. This is related to the use of the magnitude ratio of the transform coefficients between adjacent nodes to find silent intervals, as described later.
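A minimal sketch of this smoothing step follows. It assumes the usual first-order recursive form X_{i,k}(m) = α_X · X_{i-1,k}(m) + (1 - α_X) · |Y_{i,k}(m)|, which matches the description (previous smoothed value plus current magnitude) but is not confirmed term by term by the text. The function name and the value 0.7 for α_X are hypothetical.

```python
def smooth_magnitudes(prev_smoothed, coeffs, alpha_x=0.7):
    """Recursively average coefficient magnitudes for one node.

    prev_smoothed: smoothed values of the previous frame, one per bin m
    coeffs:        raw coefficients of the current frame, one per bin m
    alpha_x:       smoothing factor, 0 < alpha_x < 1 (0.7 is an assumed value)
    """
    return [alpha_x * p + (1.0 - alpha_x) * abs(y)
            for p, y in zip(prev_smoothed, coeffs)]

# One node with two bins; the magnitude |y| is used, not the power y**2.
example = smooth_magnitudes([1.0, 1.0], [-3.0, 1.0], alpha_x=0.5)
```

A larger α_X suppresses peaks and valleys more strongly but makes the smoothed value react more slowly to the current frame.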

Next, the noise estimation procedure is performed using the smoothed wavelet packet transform coefficients (S30). This procedure estimates the noise signal that the spectral subtraction method will use to remove background noise from the current input noisy speech signal.

Recursive averaging (RA) and weighted averaging (WA) algorithms have long been widely used in noise estimation procedures for speech enhancement. Representative examples include the paper by Rangachari, Loizou, and Hu, which uses RA to estimate the noise power spectrum in highly non-stationary environments ("A noise estimation algorithm with rapid adaptation for highly non-stationary environments", IEEE ICASSP, pp. 305-308, May 2004); Minima Controlled Recursive Averaging (MCRA), which uses RA to estimate the noise variance (I. Cohen, B. Berdugo, "Noise estimation by minima controlled recursive averaging for robust speech enhancement", IEEE Signal Processing Letters, vol. 9, no. 1, pp. 12-15, Jan. 2002); and improved MCRA (I. Cohen, "Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging", IEEE Trans. Speech and Audio Processing, vol. 11, no. 5, pp. 466-475, Sept. 2003).

The basic concept of these RA-based methods can be expressed as Equation 5 below.

Here, the two parameters in Equation 5 represent, respectively, the estimated noise of the i-th frame (a magnitude, power, variance, or the like) and the input noisy speech X_i, and α is the scaling factor. The scaling factor is generally expressed in terms of two quantities: a parameter representing the noise estimate of the previous frame, and P(X_i), a parameter representing the input noisy speech. In a noise-dominant frame the scaling factor α therefore takes a relatively large value, and as a result the noise estimate of the current frame depends mostly on the noise estimate of the previous frame. Such conventional noise estimation algorithms are known to work well in stationary noise environments but, as noted above, can show serious defects in non-stationary noise environments and in noise-dominant frames.

When noise is estimated on the basis of Equation 5, a time delay of about 0.5 to 2 seconds has been observed when the noise level increases. The main cause is that, as the value of the scaling factor α increases, the noise estimate of the current frame is influenced more strongly by the noise estimate of the previous frame. Consequently, the existing noise estimation methods based on Equation 5 cannot effectively reflect noise fluctuation in non-stationary environments. To cope with such fluctuation in regions where noise makes up most of the noisy speech signal, that is, where α is close to 1, the noise estimate needs to depend more on the whole noisy speech signal of the current frame than on the noise estimate of the previous frame.
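The tracking delay can be illustrated with a small simulation. The sketch assumes the common recursive averaging form, estimate_i = α · estimate_{i-1} + (1 - α) · X_i, with α held fixed near 1 as in a noise-dominant segment. The step sizes, the value α = 0.95, and the names are hypothetical, and frames are not converted to seconds because the frame rate is not given here.

```python
def recursive_average(inputs, alpha, initial):
    """Track a per-frame level with a fixed-weight recursive average."""
    estimate = initial
    track = []
    for x in inputs:
        # A weight alpha near 1 means the previous estimate dominates.
        estimate = alpha * estimate + (1.0 - alpha) * x
        track.append(estimate)
    return track

# Noise-only input whose magnitude jumps from 1.0 to 4.0 at frame 20.
inputs = [1.0] * 20 + [4.0] * 80
track = recursive_average(inputs, alpha=0.95, initial=1.0)

# Number of frames after the jump until the estimate covers 90% of the step.
target = 1.0 + 0.9 * (4.0 - 1.0)
lag = next(i for i, v in enumerate(track[20:]) if v >= target)
```

With these assumed values the estimate needs several tens of frames to catch up with the new noise level, which is the kind of lag that motivates weighting the current frame more heavily.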

The noise estimation procedure (S30) according to an embodiment of the present invention is therefore, first of all, an adaptive recursive averaging noise estimation: it distinguishes the case in which the scaling factor is greater than a predetermined threshold from the case in which it is smaller, and estimates the noise differently in each. In particular, it differs from existing noise estimation methods in that, when the scaling factor is greater than or equal to the threshold and the frame is thus noise-dominant, the estimate reflects the noisy speech signal of the current frame more strongly than the noise signal estimated for past frames.

According to one aspect of the embodiment, silent intervals are examined before noise estimation so that the scaling factor can be updated continuously, and the magnitude ratio of the WPTCs between adjacent nodes is used to detect the silent intervals. This determination procedure improves the accuracy of silent interval detection.

FIG. 3 is a flowchart showing the noise estimation procedure according to an embodiment of the present invention. Referring to FIG. 3, the noise estimation procedure (S30) includes a scaling factor estimation step (S31), a magnitude ratio based silence determination step (S32), and a check of whether the frame is a silent frame (S33). If the frame is determined to be silent, the scaling factor is updated (S34) and the noise is then estimated (S35); if the frame is determined not to be silent, the noise is estimated directly (S35).

To estimate the scaling factor in step S31, the scaling factor applicable to the embodiment is first defined. The scaling factor φ_i(k) according to an embodiment of the present invention is a magnitude-based scaling factor at an arbitrary node k. It is expressed as the ratio of the sum of the WPTC magnitudes at node k of the noise signal estimated in the (i-1)-th frame (Equation 7) to the sum of the WPTC magnitudes at node k of the noisy speech signal in the i-th frame (Equation 6). This magnitude-based scaling factor cannot be greater than 1.

Thus, in step S31 the scaling factor φ_i(k) is calculated as the ratio of h_{i-1}(k) to g_i(k). The scaling factor serves as an estimate of the proportion of noise in the noisy speech signal. Because it is calculated for each node k, it is a node-level scaling factor.
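The node-level scaling factor can be sketched as follows. The sketch assumes that g_i(k) and h_{i-1}(k) are plain sums of coefficient magnitudes over the bins of node k, and that the upper bound of 1 is enforced by clipping. The function names and the small constant `eps`, which guards against division by zero, are illustrative assumptions.

```python
def node_magnitude_sum(node_coeffs):
    """Sum of coefficient magnitudes over all bins m of one node."""
    return sum(abs(c) for c in node_coeffs)

def scaling_factor(prev_noise_node, noisy_node, eps=1e-12):
    """phi_i(k): previous-frame noise magnitude over current noisy magnitude.

    prev_noise_node: estimated noise coefficients of node k in frame i-1
    noisy_node:      noisy speech coefficients of node k in frame i
    """
    g = node_magnitude_sum(noisy_node)        # g_i(k)
    h = node_magnitude_sum(prev_noise_node)   # h_{i-1}(k)
    return min(1.0, h / (g + eps))            # clipped so it never exceeds 1

phi_speech = scaling_factor([1.0, -1.0], [2.0, -2.0])   # noise is about half of the input
phi_noise = scaling_factor([5.0], [1.0])                # ratio above 1 is clipped to 1
```

A value near 1 indicates a noise-dominant node, while a small value indicates that speech makes up most of the magnitude at that node.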

Once the scaling factor has been estimated, it is next determined whether the current frame is a silent frame (S32). This step is optional: according to the embodiment, the silent frame determination need not be performed for every frame. For example, it may be performed at predetermined frame intervals rather than every frame, or only when needed.

The primary purpose of the silence determination is to update, in step S34, the scaling factor estimated in step S31. In the embodiment, whether the current frame is a silent frame is determined by comparing the magnitude ratio between the WPTC magnitude at the node k currently being processed and the WPTC magnitude at an adjacent node, for example node (k-1) and/or node (k+1), with the corresponding magnitude ratio in a reference frame that is known to be silent.

In noisy speech processing and the associated noise estimation procedure, silent frames can be used to redefine or update many of the parameters the processing requires. It is therefore important to identify silent frames accurately and as often as possible. In general, when the noise is non-stationary, it is quite difficult to determine accurately whether speech is present in a single-channel noisy speech input. To improve the accuracy of the silence determination even in non-stationary noise environments, the embodiment determines at each node of the wavelet packet transform domain whether speech is present or the node is silent.

The embodiment further proposes a magnitude ratio based decision approach. Many parameters, such as spectral energy, zero crossing rate, and entropy, have been used to determine whether speech is present. However, no method has so far been known that determines silence on the basis of the magnitude ratio of the WPTCs between adjacent nodes in the wavelet packet transform domain.

The embodiment is based on the following observation: a natural noise spectrum, even when the noise is non-stationary, does not fluctuate abruptly in one band of the wavelet packet transform domain independently of its neighboring bands. Exploiting this, silence can be determined readily by using the WPTC magnitude ratio between neighboring bands as a parameter and comparing it with the WPTC magnitude ratio in a reference noise frame.

Accordingly, as described below, the embodiment determines silence in the noisy speech signal through the WPTC magnitude ratio. For example, for each node k, the WPTC magnitude ratios with respect to the two adjacent nodes (k+1) and (k-1) can be defined as in Equation 8. In Equation 8, the WPTC magnitude ratio between node k and node (k+1) is called the upward ratio (UPR), γ_i^UP(k), and the WPTC magnitude ratio between node k and node (k-1) is called the downward ratio (DNR), γ_i^DN(k). Here g_i(k) is as defined in Equation 6, and K denotes the number of nodes at the wavelet packet tree level j in use.
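The two ratios can be sketched as follows. The orientation chosen here, g_i(k)/g_i(k+1) for the UPR and g_i(k)/g_i(k-1) for the DNR, is an assumption, since the text names the node pairs but not which magnitude is the numerator. The function name and `eps` are likewise illustrative. Only interior nodes (1 ≤ k ≤ K-2) receive both ratios.

```python
def magnitude_ratios(g, eps=1e-12):
    """Upward and downward magnitude ratios for each interior node.

    g: list of node magnitude sums g_i(k) for k = 0 .. K-1
    Returns (upr, dnr); boundary entries are None because one neighbor is missing.
    """
    K = len(g)
    upr = [None] * K
    dnr = [None] * K
    for k in range(1, K - 1):
        upr[k] = g[k] / (g[k + 1] + eps)   # node k against node k+1
        dnr[k] = g[k] / (g[k - 1] + eps)   # node k against node k-1
    return upr, dnr

upr, dnr = magnitude_ratios([1.0, 2.0, 4.0, 8.0])
```

Whatever orientation is actually used, the same definition has to be applied to the reference frame so that the current and reference ratios are comparable.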

Two reference ratios, the upward reference ratio (R-UPR), λ^UP(k), and the downward reference ratio (R-DNR), λ^DN(k), are defined in the same way as the ratios in Equation 8. These two reference ratios serve as the baseline against which the ratios defined in Equation 8 are compared, and they can be obtained from a frame already known to be silent. Such a known silent frame may be, for example, the starting frame of an utterance, which is usually regarded as containing no speech, or the silent frame most recently identified by the silence determination procedure according to the embodiment.

If the ratio of the UPR to the R-UPR is close to 1, the ratio between the noise proportion of the k-th node and that of the (k+1)-th node in the current frame is similar to the corresponding ratio in the reference frame. The same applies to the ratio of the DNR to the R-DNR. In other words, if the ratio of the UPR (or DNR) to the R-UPR (or R-DNR) is close to 1, the noise magnitude trend between the two nodes in the current frame is similar to the noise magnitude trend between the two nodes in the reference frame, regardless of the absolute magnitudes. In that case the current frame shows a WPTC magnitude ratio similar to that of the reference frame, so the current frame is also very likely to be a silent frame.

FIG. 4 illustrates why using the UPR and DNR to determine silence, as in the embodiment, is effective. FIG. 4(a) shows the waveform of noisy speech at an SNR (signal-to-noise ratio) of 5 dB, and FIG. 4(d) shows the waveform of the background noise in (a). FIGS. 4(b) and 4(e) show the spectrograms of the noisy speech in (a) and the noise in (d), respectively, over the frequency range 0 to 3 kHz. FIGS. 4(c) and 4(f) show the UPR γ^UP (dotted line) and the DNR γ^DN (solid line) of (a) and (d), respectively, at node index k = 2.

Here, the spectrogram frequency range of 0 to 3 kHz corresponds to node indices 0 to 2. The noise waveform can be obtained by modulating the amplitude of white Gaussian noise, and is also called 'fluctuating white Gaussian noise'. As FIG. 4(f) shows, the UPR and DNR of the noise signal are relatively flat, whereas FIG. 4(c) shows that the UPR and DNR of the noisy speech signal fluctuate roughly in proportion to the amplitude of the speech signal. Because the UPR and DNR are ratios of magnitudes, they are hardly affected by changes in energy level. When speech is present, the speech signal causes the UPR and DNR to fluctuate, but in a silent frame consisting only of noise the UPR and DNR remain almost constant.
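The insensitivity of magnitude ratios to the overall noise level can be checked with a short sketch. The node magnitudes and the helper name `adjacent_ratios` are made up for illustration; the point is only that scaling every node by the same gain leaves the ratios unchanged, while energy added to a single node does not.

```python
def adjacent_ratios(g):
    """Ratio of each node magnitude to the magnitude of the next node."""
    return [g[k] / g[k + 1] for k in range(len(g) - 1)]

quiet = [0.8, 1.0, 1.25, 1.0]        # hypothetical noise-only node magnitudes
loud = [4.0 * v for v in quiet]      # same noise shape at four times the level

speechy = list(loud)
speechy[1] += 6.0                    # extra energy in one band, as speech would add

same = adjacent_ratios(quiet)
scaled = adjacent_ratios(loud)
with_speech = adjacent_ratios(speechy)
```

Here `same` and `scaled` agree even though the level changed by a factor of four, whereas `with_speech` departs clearly from both, which is the behavior the figure describes.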

The above use of the UPR and DNR can be formalized as follows. First, the upward and downward resemblance parameters are defined as in Equation 9.

The parameter indicating the presence of speech at the k-th node is denoted Λ_i(k), as in Equation 10. Here, the threshold η is a predetermined reference value for determining whether speech is present. As Equation 10 shows, when either ξ_i^UP or ξ_i^DN is greater than the threshold η, the k-th node is tentatively assumed to contain speech. Equation 10 does not define Λ_i(k) at the boundaries, that is, for k = 0 or k = K-1; there it can be assumed that Λ_i(0) = Λ_i(1) and Λ_i(K-1) = Λ_i(K-2).
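A possible node-level decision is sketched below. The form of the resemblance parameter is an assumption: the sketch takes ξ as the deviation of the current-to-reference ratio from 1, |γ/λ - 1|, which is one way to make "greater than η" mean "unlike the silent reference". The function name, argument layout, and the value of η in the example are hypothetical.

```python
def speech_flags(upr, dnr, ref_upr, ref_dnr, eta):
    """Return Lambda_i(k) for k = 0 .. K-1 (1 = speech assumed, 0 = silent).

    upr, dnr:         current-frame ratios per node (boundary entries unused)
    ref_upr, ref_dnr: reference ratios R-UPR and R-DNR per node
    eta:              decision threshold (its value is an assumption here)
    """
    K = len(upr)
    flags = [0] * K
    for k in range(1, K - 1):
        xi_up = abs(upr[k] / ref_upr[k] - 1.0)   # assumed resemblance measure
        xi_dn = abs(dnr[k] / ref_dnr[k] - 1.0)
        flags[k] = 1 if (xi_up > eta or xi_dn > eta) else 0
    if K >= 3:
        flags[0] = flags[1]            # boundary rule: Lambda(0) = Lambda(1)
        flags[K - 1] = flags[K - 2]    # boundary rule: Lambda(K-1) = Lambda(K-2)
    return flags

ref = [1.0] * 5
flags = speech_flags([0.0, 1.0, 1.05, 3.0, 0.0],
                     [0.0, 1.0, 0.95, 1.0, 0.0],
                     ref, ref, eta=0.5)
```

In this example only the node whose upward ratio is three times the reference is flagged, and the upper boundary node copies that flag from its neighbor.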

Thus, according to the embodiment, the presence of speech is determined using the upward and downward resemblance parameters. In particular, according to one aspect of the embodiment, it is preferable to determine whether the frame is silent using one or more of the similarity parameters Λ_i(k). The components of most natural speech are spread over a wide region of the frequency domain rather than confined to a narrow band, so using several similarity parameters improves the accuracy of the decision. Experiments show that speech components are spread over at least 2 to 3 kHz. In a preferred embodiment, therefore, the current frame is determined to contain speech only when at least three consecutive similarity parameters Λ_i(k) indicate that speech is present.
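The frame-level rule in the preferred embodiment, at least three consecutive nodes flagged as speech, can be sketched as follows. The function name and the `min_run` parameter are illustrative; the default of 3 follows the text.

```python
def frame_has_speech(flags, min_run=3):
    """True if at least min_run consecutive nodes are flagged as speech."""
    run = 0
    for flag in flags:
        run = run + 1 if flag == 1 else 0   # length of the current run of 1s
        if run >= min_run:
            return True
    return False

wide_band = frame_has_speech([0, 1, 1, 1, 0, 0, 0, 0])    # three adjacent nodes
scattered = frame_has_speech([1, 1, 0, 1, 1, 0, 1, 0])    # never three in a row
```

Requiring a run of adjacent nodes reflects the observation that speech energy spreads over several neighboring bands, so isolated flags are treated as silence.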

Equation 11 is an example of an expression for the step, performed in step S34, of redefining the scaling factor and updating the R-UPR and R-DNR using the similarity parameter Λ_i(k) obtained from Equation 10. Equation 11 is simplified heuristic pseudo code.

Referring to Equation 11, the scaling factor φ_i(k) is redefined for a node without speech, that is, when the similarity parameter Λ_i(k) is 0. When Λ_i(k) is 0, the reference ratios R-UPR and R-DNR are also updated. For a node with speech, that is, when Λ_i(k) is 1, neither the scaling factor nor the reference ratios are changed or updated.
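The update logic described for Equation 11 can be sketched as below. The text states which quantities change when Λ_i(k) = 0 but not the values they take, so the sketch leaves the new scaling factor as a caller-supplied argument, `phi_for_silence`, and simply copies the current ratios into the references. Both choices, and all names, are assumptions.

```python
def update_on_silence(flags, phi, upr, dnr, ref_upr, ref_dnr, phi_for_silence):
    """Update per-node state in place for nodes flagged as silent.

    flags: Lambda_i(k) per node (0 = silent, 1 = speech)
    phi:   scaling factors phi_i(k), modified in place
    """
    for k, flag in enumerate(flags):
        if flag != 0:
            continue                   # speech node: nothing is changed
        phi[k] = phi_for_silence       # redefined scaling factor (assumed value)
        if upr[k] is not None:
            ref_upr[k] = upr[k]        # refresh R-UPR from the current frame
        if dnr[k] is not None:
            ref_dnr[k] = dnr[k]        # refresh R-DNR from the current frame

flags = [0, 0, 1, 1]
phi = [0.3, 0.4, 0.5, 0.6]
upr = [None, 0.9, 2.0, None]
dnr = [None, 1.1, 0.5, None]
ref_upr = [None, 1.0, 1.0, None]
ref_dnr = [None, 1.0, 1.0, None]
update_on_silence(flags, phi, upr, dnr, ref_upr, ref_dnr, phi_for_silence=1.0)
```

The value 1.0 passed in the example is illustrative only; it corresponds to treating a silent node as consisting entirely of noise.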

The greatest advantage of the silence detection method according to the embodiment is that, because it compares WPTC magnitude ratios between adjacent nodes, it can determine quite accurately whether a frame is silent even when the amplitude of the noise signal varies. The embodiment therefore does not need to rely on a conventional VAD algorithm or on long-term noise statistics, and can determine silent frames more accurately. The scaling factor φ_i(k) optionally redefined according to the embodiment can be used in the noise estimation procedure of step S35, and the updated R-UPR and R-DNR can be used to compare magnitude ratios in subsequent processing.

Referring again to FIG. 3, the noise estimation step (S35) is performed after the scaling factor has been newly defined and, if necessary, R-UPR and R-DNR have been updated (S34) when the frame is determined to be silent in step S33, or immediately when speech is determined to be present in step S33. According to the embodiment of the present invention, the noise estimation step applies a different method depending on whether the estimated or newly defined scaling factor is greater than a predetermined threshold θ. The specific value of the threshold may be determined experimentally.

First, the case where the scaling factor φ_i(k) is smaller than the predetermined threshold θ is described. A scaling factor φ_i(k) smaller than θ means that the frame is not a noise-dominant frame. In this case, the current input signal is generally regarded as containing both noise and speech. According to the embodiment of the present invention, only the magnitude of the current noisy speech signal is used to estimate the noise in this case. For example, the estimated noise may be expressed as the product of the current noisy speech signal and the scaling factor φ_i(k). Because the estimated noise is then proportional to the current noisy speech signal, it depends entirely on the magnitude of the input noisy speech signal.

Therefore, according to this embodiment of the present invention, the noise can be prevented from being overestimated in a signal that contains speech. By contrast, in the conventional method expressed by Equation 5, when the scaling factor α is small the noise estimate takes a large value because of the contribution of the input noisy signal (X_i). If, in addition, the estimated noise of the previous frame has a non-negligible value, the noise may be estimated too high.

According to the embodiment of the present invention, when the estimated scaling factor is equal to or greater than the threshold θ, the noise is estimated using both the current input noisy signal and the estimated noise of the previous frame. The case where the scaling factor is equal to or greater than θ includes noise-dominant frames. In form, this embodiment resembles the conventional noise estimation procedure in that it uses both the current input noisy signal and the previous frame.

However, in the embodiment of the present invention, for a noise-dominant frame, that is, when the scaling factor φ_i(k) has a large value, most of the estimated noise magnitude depends on the noisy speech signal of the current frame, and the proportion contributed by the estimated noise signal of the previous frame is kept very small. For example, the noise estimate may be obtained as the sum of two products: the noisy speech signal of the current frame multiplied by the scaling factor, and the noise estimate of the previous frame multiplied by (1 - scaling factor). Therefore, even in a noise-dominant frame, the estimated noise can immediately follow variations in the input noisy signal of the current frame.

As described above, the noise estimation procedure according to the embodiment of the present invention defines the scaling factor using the magnitudes of the WPTCs and estimates the noise in the wavelet packet transform coefficient domain using a new RA-based method. That is, when the scaling factor is smaller than the predetermined threshold, so that speech and noise are mixed, a certain proportion of the current input noisy speech signal is estimated as noise. When the scaling factor is equal to or greater than the threshold, so that the frame is noise dominant, the noise is estimated using both the current input noisy signal and the noise signal estimated in the previous frame, with more weight given to the former.
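The two-branch estimate summarized above can be sketched in a few lines of Python. This is a minimal illustration for a single node, not the patented implementation: the function names are hypothetical, the inputs are assumed to be lists of (smoothed) WPTC magnitudes, the scaling factor follows the wording of the claims (ratio of the summed previous-frame noise magnitudes to the summed current noisy magnitudes, capped at 1), and the default threshold 0.075 is the value reported in the test section. The handling of an all-zero frame is an added assumption.

```python
def scaling_factor(noisy_mag, prev_noise_mag):
    """Ratio of summed previous noise magnitudes to summed noisy magnitudes, capped at 1."""
    denom = sum(noisy_mag)
    if denom == 0.0:
        return 1.0  # assumption: treat an all-zero frame as noise dominant
    return min(sum(prev_noise_mag) / denom, 1.0)


def estimate_noise(noisy_mag, prev_noise_mag, phi, theta=0.075):
    """Estimate the noise magnitudes of one node for the current frame."""
    if phi < theta:
        # Speech and noise mixed: use only the current noisy magnitudes.
        return [phi * x for x in noisy_mag]
    # Otherwise: combine the current input with the previous estimate.
    return [phi * x + (1.0 - phi) * n
            for x, n in zip(noisy_mag, prev_noise_mag)]


phi = scaling_factor([2.0, 2.0], [1.0, 1.0])         # 0.5
noise = estimate_noise([2.0, 2.0], [1.0, 1.0], phi)  # [1.5, 1.5]
```

In the second branch the weight on the current input equals the scaling factor itself, so the closer the factor is to 1, the less the previous estimate contributes.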

The noise estimation procedure according to the embodiment of the present invention has at least two advantages. First, when noise accounts for a large proportion of the signal, the noise is estimated by giving more weight to the current input noisy signal than to the noise estimate of the previous frame, which prevents the time delay and residual musical noise seen in conventional RA-based methods. Second, when only a little noise is mixed in, the noise is estimated using only the current input noisy signal, which prevents overestimation of the noise and, as a result, prevents speech distortion.

Referring again to FIG. 1, spectral subtraction is performed using the noise estimate for each node output by step S30 (S40). According to one aspect of this embodiment, a modified spectral subtraction in the wavelet packet transform domain, as in Equation 12, may be used. Referring to Equation 12, the estimated speech signal of the i-th frame in the wavelet packet transform domain depends on the relative magnitudes of the input noisy speech signal Y_{i,k}(m) and the estimated noise signal. That is, when the input noisy speech signal Y_{i,k}(m) is larger than the estimated noise signal, the estimated speech signal is the difference between the two values; otherwise, the estimated speech signal is 0.
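The thresholded subtraction described for Equation 12 can be sketched as below. This is an illustrative reading, not the exact form of Equation 12: the function name is hypothetical, the comparison is assumed to be made on coefficient magnitudes, and restoring the sign of each coefficient after subtraction is an added assumption (wavelet packet coefficients are real and signed).

```python
def spectral_subtract(noisy, noise_est):
    """Subtract estimated noise magnitudes from noisy coefficients of one node.

    noisy     -- signed coefficients of the noisy speech signal
    noise_est -- non-negative estimated noise magnitudes, same length
    """
    enhanced = []
    for y, n in zip(noisy, noise_est):
        diff = abs(y) - n
        if diff > 0.0:
            # Noisy magnitude exceeds the noise estimate: keep the difference,
            # with the original sign (assumption).
            enhanced.append(diff if y >= 0 else -diff)
        else:
            # Noise estimate is at least as large: output zero.
            enhanced.append(0.0)
    return enhanced


cleaned = spectral_subtract([3.0, -2.0, 0.5], [1.0, 0.5, 1.0])
```

Here the first two coefficients are reduced by their noise estimates and the third, which lies below its noise estimate, is set to zero.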

Then, by applying the inverse wavelet packet transform to the estimated speech signal obtained by spectral subtraction, the enhanced speech signal obtained in the frequency domain is converted into an enhanced speech signal in the time domain (S50).

Next, another embodiment of the present invention is described. This embodiment is an apparatus that processes an input noisy speech signal according to the noisy speech signal processing procedure of the embodiment described above and outputs an enhanced speech signal. The apparatus may be implemented in various ways: for example, as software embedded in a voice-based application device such as a mobile phone, in the form of a computer-readable recording medium that causes the processor (computer) of the voice-based application to execute the procedure, or as a chip mounted in a voice-based application device such as a hearing aid or a video call system.

FIG. 5 is a block diagram of an apparatus for processing a noisy speech signal according to an embodiment of the present invention. Referring to FIG. 5, the apparatus 100 includes a uniform wavelet packet transform unit 110, a smoothing unit 120, a noise estimation unit 130, a spectral subtraction unit 140, and a uniform inverse wavelet packet transform unit 150. The functions of the components 110, 120, 130, 140, and 150 are the same as those of steps S10, S20, S30, S40, and S50 of the processing procedure described above, respectively, so a detailed description is omitted. The apparatus 100 may be provided in a voice-based application device such as a speakerphone, a video call communication device, or a hearing aid, and used to remove the noise signal from a noisy speech signal.

FIG. 6 is a block diagram of a voice-based application device according to an embodiment of the present invention, which includes the noisy speech signal processing apparatus 100 shown in FIG. 5. Referring to FIG. 6, the voice-based application device 200 includes an input means for acquiring a noisy speech signal, for example a microphone 210; a noisy speech signal processing apparatus 220 for processing the noisy speech signal acquired through the input means; and an output device, for example a speaker 230, for outputting the enhanced speech signal generated by the processing apparatus. The microphone 210 is a means for inputting noise-contaminated speech into the device 200. The processing apparatus 220 has the same configuration as the apparatus 100 shown in FIG. 5 and processes the noisy speech signal according to the processing procedure of the embodiment to output an enhanced speech signal. The speaker 230 is a means for outputting the enhanced speech signal as enhanced speech that a person can perceive. The voice-based application device shown in FIG. 6 may be an application device that directly outputs the enhanced speech signal from the processing apparatus 100, such as a hearing aid, but is not limited thereto.

FIG. 7 is a block diagram of a voice-based application device according to another embodiment of the present invention, which also includes the noisy speech signal processing apparatus 100 shown in FIG. 5. Referring to FIG. 7, the voice-based application device 300 includes a microphone 310, a noisy speech signal processing apparatus 320, and a transmission device 330. The microphone 310 is a means for inputting noise-contaminated speech into the device 300. The processing apparatus 320 has the same configuration as the apparatus 100 shown in FIG. 5 and processes the noisy speech signal according to the processing procedure of the embodiment to output an enhanced speech signal. The transmission device 330 is a means for transmitting the enhanced speech signal over a wired or wireless communication network. It may be a device that carries the enhanced speech signal on a carrier wave as is, according to a conventional analog scheme, or a device that encodes the enhanced speech signal according to a predetermined compression coding scheme and then transmits the encoded speech data on a carrier wave. In the latter case, the transmission device 330 may further include a speech encoding means, although this is not shown in the figure. The voice-based application device shown in FIG. 7 may be an application device that transmits the enhanced speech signal from the processing apparatus 100 to another person over a communication network, such as a speakerphone or a video call device, but is not limited thereto.

Test results

To evaluate the performance of the noise estimation method according to the present invention and of the noisy speech signal processing procedure that uses it, both quantitative and qualitative tests were performed. Here, the qualitative test means informal, subjective listening tests and spectral inspection, and the quantitative test means measuring the segmental noise estimation error. The subjective tests showed that, with the embodiment of the present invention, residual musical noise was hardly observed and distortion in the enhanced speech signal was considerably lower than with the other conventional methods. The conventional methods used as references for the performance comparison were the MS method (spectral flooring factor subf = 0.01) and the WA method (scaling factor α = 0.95, threshold β = 2). As described below, the results of the quantitative test supported those of the qualitative test.

In the quantitative test, 30 seconds of speech (15 seconds of male speech and 15 seconds of female speech) were selected from the TIMIT database, with a duration of 6 seconds or more. Four types of noise signal were used. Three were selected from the NoiseX-92 database: babble noise, F16 noise, and stationary white Gaussian noise (SWGN). The fourth was a fluctuating white Gaussian noise (FWGN) obtained by adjusting the amplitude of the SWGN to various energy levels at intervals of about 0.7 seconds. Each speech signal was mixed with each type of noise at SNRs of 0 dB, 5 dB, and 10 dB. All signals were sampled at 16 kHz, and each frame consisted of 512 samples (32 ms) with 50% overlap. A uniform wavelet packet transform with a Daubechies basis was performed, with the wavelet packet tree level index j set to 3. The following parameters were chosen for the embodiment of the present invention: the smoothing parameter α_x in Equation 4 was 0.5, the threshold η in Equation 10 was 0.85, and the threshold in step S35 was 0.075.
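The framing figures quoted above can be checked with a short calculation. The sketch below is illustrative only: the helper name `frame_starts` is hypothetical, and applying it to a 30-second signal is an assumption made to show the resulting frame count, which the text does not report.

```python
def frame_starts(num_samples, frame_len=512, overlap=0.5):
    """Start indices of full frames with the given fractional overlap."""
    hop = int(frame_len * (1.0 - overlap))  # 256 samples for 50% overlap
    return list(range(0, num_samples - frame_len + 1, hop))


fs = 16000                        # sampling frequency in Hz
frame_ms = 1000.0 * 512 / fs      # frame duration in milliseconds
starts = frame_starts(30 * fs)    # frames covering 30 s of signal
```

With 512-sample frames at 16 kHz, each frame spans 32 ms and consecutive frames start 256 samples (16 ms) apart.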

To calculate the estimation error between the original noise signal n(l) and the estimated noise signal, the segmental noise estimation error (Seg.NEE) given by Equation 13 was calculated in the time domain.

Here, F and L denote the total number of frames and the number of samples per frame, respectively. The test also evaluated the improved segmental SNR (Seg.SNR_Imp), using the widely used measure of SNR improvement given by Equation 14, where Seg.SNR_Output and Seg.SNR_Input are the segmental SNR of the enhanced speech signal and of the noisy speech signal, respectively.

Seg.SNR_Imp = Seg.SNR_Output - Seg.SNR_Input
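As an illustration of how this difference might be computed, the sketch below uses the commonly used definition of segmental SNR (the per-frame SNR in dB averaged over frames). That definition, the non-overlapping framing, the small constant `eps`, and the function names are all assumptions; the text gives only the difference of Equation 14.

```python
import math


def segmental_snr(clean, processed, frame_len=512, eps=1e-12):
    """Mean per-frame SNR in dB of `processed` relative to `clean`.

    Requires at least one full frame of samples.
    """
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        p = processed[start:start + frame_len]
        signal_energy = sum(v * v for v in s)
        error_energy = sum((a - b) ** 2 for a, b in zip(s, p))
        # eps guards against division by zero and log of zero.
        snrs.append(10.0 * math.log10((signal_energy + eps) / (error_energy + eps)))
    return sum(snrs) / len(snrs)


def seg_snr_improvement(clean, noisy, enhanced, frame_len=512):
    """Seg.SNR_Imp = Seg.SNR_Output - Seg.SNR_Input."""
    return (segmental_snr(clean, enhanced, frame_len)
            - segmental_snr(clean, noisy, frame_len))
```

A positive result means the enhanced signal is closer to the clean reference than the noisy input was.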

FIG. 8 shows an example of the smoothed WPTC magnitudes of the noisy speech signal shown in FIG. 4(a) (thin solid line), the noise signal shown in FIG. 4(d) (dotted line), and the noise signal estimated according to the embodiment of the present invention (thick solid line). Referring to FIG. 8, although the noise is still partially overestimated at valleys and peaks and the time delay is not completely eliminated, the embodiment considerably reduces both the overestimation and the time delay and considerably improves the accuracy of the noise estimate.

FIG. 9 illustrates that the embodiment of the present invention can track noise fluctuations quite accurately; it shows the average WPTC contour of the noisy speech signal and the average WPTC contours of the estimated noise signals. The estimated noise signals were obtained with the embodiment of the present invention and with several conventional approaches (the MA method and the WA method). The noisy speech signal was generated by mixing a speech signal with an FWGN signal whose SNR changes from 15 dB to 0 dB at about 0.4 seconds. Referring to FIG. 9, the embodiment tracks the original noise signal quite well, whereas the conventional WA method tracks poorly when the noise fluctuates and the conventional MA method exhibits a time delay of about 1.5 seconds.

FIGS. 10 and 11 show the waveforms and spectrograms of a noisy speech signal and of the speech signal enhanced according to the embodiment of the present invention. In FIG. 10 the noise is babble noise at an SNR of 5 dB, and in FIG. 11 it is stationary white Gaussian noise at an SNR of 0 dB. In both figures, (a) and (c) correspond to the noisy speech signal and (b) and (d) to the enhanced speech signal. FIGS. 10 and 11 show that almost no residual musical noise is observed in the enhanced speech signal.

FIGS. 12 and 13 compare the average performance of the embodiment of the present invention with that of the conventional MS and WA methods, in terms of Seg.NEE and Seg.SNR_Imp respectively. In FIGS. 12 and 13, (a) is babble noise, (b) F16 noise, (c) fluctuating white Gaussian noise, and (d) stationary white Gaussian noise. Referring to FIGS. 12 and 13, the overall average Seg.NEE (or Seg.SNR_Imp) of the embodiment is improved by 0.327 (or 3.68 dB) over the conventional WA method and by 0.145 (or 3.78 dB) over the conventional MS method. The graphs also show that the performance gain over the MS and WA methods grows as the input SNR decreases. In addition, the gain is much larger for the fluctuating noise signal (FWGN) than for the other noise signals, mainly because the embodiment uses magnitude ratios when determining silent frames.

FIG. 1 is a flowchart of a procedure for processing a noisy speech signal according to an embodiment of the present invention.

FIG. 2 is a diagram showing the structure of the uniform wavelet packet transform.

FIG. 3 is a flowchart showing an example of the noise estimation procedure of FIG. 1.

FIG. 4 illustrates why determining silence using the upward ratio (UPR) and the downward ratio (DNR) is effective according to an embodiment of the present invention: (a) is the waveform of a noisy speech signal at an SNR of 5 dB contaminated by the noise signal of (d); (b) is the spectrogram of the noisy speech signal (a) in the 0-3 kHz frequency range; (c) is the UPR (dotted line) and DNR (solid line) of the noisy speech signal (a) at node index k=2; (d) is the waveform of a white Gaussian noise signal; (e) is the spectrogram of the noise signal (d) in the 0-3 kHz frequency range; and (f) is the UPR (dotted line) and DNR (solid line) of the noise signal (d) at node index k=2.

FIG. 5 is a block diagram showing the configuration of an apparatus for processing a noisy speech signal according to an embodiment of the present invention.

FIG. 6 is a block diagram showing the configuration of a voice-based application device according to an embodiment of the present invention that includes the noisy speech signal processing apparatus of FIG. 5.

FIG. 7 is a block diagram showing the configuration of a voice-based application device according to another embodiment of the present invention that includes the noisy speech signal processing apparatus of FIG. 5.

FIG. 8 shows an example of the smoothed WPTC magnitudes of the noisy speech signal shown in FIG. 4(a) (thin solid line), the noise signal shown in FIG. 4(d) (dotted line), and the noise signal estimated according to an embodiment of the present invention (thick solid line).

FIG. 9 shows the average WPTC contour of a noisy speech signal and the average WPTC contour of the noise signal estimated according to an embodiment of the present invention.

FIGS. 10 and 11 each show the waveforms and spectrograms of a noisy speech signal and of the speech signal enhanced according to an embodiment of the present invention.

FIGS. 12 and 13 compare the average performance of the embodiment of the present invention with that of the conventional MS and WA methods, in terms of Seg.NEE and Seg.SNR_Imp respectively, where (a) is babble noise, (b) F16 noise, (c) fluctuating white Gaussian noise, and (d) stationary white Gaussian noise.

Claims

A method of processing a noisy speech signal, the method comprising: converting an input noisy speech signal of a current frame into the frequency domain to generate a transform signal composed of transform coefficients;

estimating noise using the transform signal, wherein, if a scaling factor is smaller than a predetermined threshold, a first noise is estimated using only the input noisy speech signal, and, if the scaling factor is equal to or greater than the threshold, a second noise is estimated using both the input noisy speech signal and a noise estimate of a previous frame while assigning a greater weight to the input noisy speech signal than to the noise estimate; and

obtaining an enhanced speech signal by performing spectral subtraction using the estimated noise.

The method of claim 1, wherein generating the transform signal comprises performing a uniform wavelet packet transform on the input noisy speech signal.

The method of claim 2, further comprising, before the noise estimation step, smoothing the transform signal using the smoothed transform signal of the previous frame and the magnitude of the wavelet packet transform signal of the current frame at an arbitrary fixed wavelet packet tree level, wherein the noise estimation step uses the smoothed transform signal.

The method of claim 3, wherein the smoothing step obtains the smoothed transform signal according to the following expression (E1).

(E1)

where i is the frame index, k is the node index, m is the coefficient bin index at each node, α_X is the smoothing factor (0 < α_X < 1), Y_{i,k}(m) is the wavelet packet transform signal at an arbitrary fixed wavelet packet tree level, and X_{i,k}(m) is the smoothed wavelet packet transform signal of Y_{i,k}(m).

The method of claim 1, wherein the noise estimation step further comprises estimating the scaling factor using the ratio of the sum of the magnitudes of the transform coefficients of the estimated noise signal in the previous frame to the sum of the magnitudes of the transform coefficients of the noisy speech signal in the current frame,

wherein, in the estimating of the scaling factor, the scaling factor is set to 1 when the ratio of the sums is greater than 1.
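The scaling factor described in this claim can be sketched as follows. The sketch assumes that both sums run over every node and coefficient bin of a frame, which the claim does not state explicitly, and the zero-denominator guard is an added assumption. The name `scaling_factor` is hypothetical.

```python
def scaling_factor(prev_noise, current):
    """Ratio of summed noise magnitudes (previous frame) to summed
    noisy-speech magnitudes (current frame), limited to at most 1.

    Both arguments are nested lists: one inner list of coefficients per node.
    """
    noise_sum = sum(abs(n) for node in prev_noise for n in node)
    signal_sum = sum(abs(x) for node in current for x in node)
    if signal_sum == 0.0:
        # Assumption: an all-zero current frame is treated as ratio > 1.
        return 1.0
    return min(noise_sum / signal_sum, 1.0)
```

Capping the value at 1 keeps the factor usable as a weight in the later noise-estimation step.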

The method of claim 5, wherein the ratio of the sums is defined as the ratio of the following equation (E3) to the following equation (E2).

(E2)

(E3)

where i is the frame index, k is the node index, m is the coefficient bin index within each node, X_{i,k}(m) is the transform signal in the frequency domain, and the other symbol appearing in the equations denotes the estimated noise signal.

The method of claim 1, wherein the estimated first noise is obtained as the product of the scaling factor and the input noisy speech signal.

The method of claim 1, wherein the estimated second noise is obtained as the sum of the product of the scaling factor and the input noisy speech signal and the product of (1 - the scaling factor) and the noise estimate of the previous frame.
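The two noise estimates described in the preceding claims can be sketched together. The branch on the threshold follows the claims that state the first noise is used when the scaling factor is below a predetermined threshold and the second noise otherwise. The threshold value is not given in this text, so it is a required parameter here, and the flat-list layout and the name `estimate_noise` are assumptions.

```python
def estimate_noise(current, prev_noise, xi, threshold):
    """Noise estimate for one frame, given scaling factor xi.

    current:    magnitudes of the (smoothed) noisy-speech coefficients.
    prev_noise: noise estimate of the previous frame, same length.
    threshold:  predetermined threshold (value not specified in the source).
    """
    if xi < threshold:
        # First noise: scaled input only, previous estimate is ignored.
        return [xi * x for x in current]
    # Second noise: weighted sum of the input and the previous estimate.
    return [xi * x + (1.0 - xi) * n for x, n in zip(current, prev_noise)]
```

In the second branch the input receives weight xi and the previous estimate receives 1 - xi, so the input dominates whenever xi exceeds 0.5.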

The method of claim 1, wherein the noise estimation step further comprises determining whether the current frame is a silent frame,

wherein the determining of the silent frame uses the magnitude ratio of the transform coefficients between adjacent nodes.

The method of claim 9, wherein the determining of the silent frame compares a first magnitude ratio of the transform coefficients between adjacent nodes in the current frame with a second magnitude ratio of the transform coefficients between adjacent nodes in a reference frame that is a silent frame.

The method of claim 10, wherein the current frame is determined to be a silent frame if the ratio between the first magnitude ratio and the second magnitude ratio is close to 1.
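The comparison in this claim can be illustrated with a small sketch. It assumes that a node's magnitude is the sum of the absolute values of its coefficients, that the adjacent-node ratio is taken as node k over node k+1, and that "close to 1" means within a tolerance `tol`. None of these details are fixed by the claim text, and all function names are hypothetical.

```python
def node_magnitudes(frame):
    """Sum of absolute coefficient values for each node of a frame."""
    return [sum(abs(c) for c in node) for node in frame]


def adjacent_ratios(mags):
    """Ratio of each node magnitude to that of the next node (assumed nonzero)."""
    return [mags[k] / mags[k + 1] for k in range(len(mags) - 1)]


def silent_like(current, reference, tol):
    """Per node pair: True when the current-frame ratio divided by the
    reference-frame ratio lies within tol of 1."""
    cur = adjacent_ratios(node_magnitudes(current))
    ref = adjacent_ratios(node_magnitudes(reference))
    return [abs(g / r - 1.0) <= tol for g, r in zip(cur, ref)]
```

For example, with a reference frame whose magnitudes halve from node to node, a current frame with the same spectral shape yields `True` at every node pair regardless of its overall level.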

The method of claim 11, wherein, in the determining of the silent frame, the current frame is determined not to be a silent frame only when the ratio between the first and second magnitude ratios indicates a speech-present frame at three or more consecutive nodes.

The method of claim 9, wherein, when the current frame is determined to be a silent frame, the scaling factor is updated using information from the current frame and the noise is then estimated using the updated scaling factor, and when the current frame is determined not to be a silent frame, the noise is estimated immediately.

A method of determining a silent frame, the method comprising:

converting an input noisy speech signal of a current frame into the frequency domain to generate a transform signal composed of transform coefficients; and

determining whether the current frame is a silent frame using the magnitude ratio of the transform coefficients between adjacent nodes of the transform signal at a transform level of a predetermined order,

wherein the generating of the transform signal comprises performing a uniform wavelet packet transform on the input noisy speech signal, and

wherein, in the determining of the silent frame, similarity parameters are computed by comparing a first magnitude ratio of wavelet packet transform coefficients between adjacent nodes in the current frame with a second magnitude ratio of wavelet packet transform coefficients between adjacent nodes in a reference frame that is a silent frame, and the silent frame is determined using at least two of the similarity parameters.

(Deleted)

The method of claim 14, wherein a similarity parameter as in the following equation (E4) is defined using the first magnitude ratio and the second magnitude ratio, and whether the frame is a silent frame is determined according to the algorithm defined in the following equation (E5).

(E4)

(E5)

where i is the frame index, k is the node index, K is the number of nodes at an arbitrary wavelet packet transform tree level, γ_i^UP(k) is the upward magnitude ratio, i.e., the ratio of the WPTC magnitudes of node k and node (k+1), γ_i^DN(k) is the downward magnitude ratio, i.e., the ratio of the WPTC magnitudes of node k and node (k-1), λ^UP(k) is the upward reference ratio, i.e., the upward magnitude ratio in the reference frame, λ^DN(k) is the downward reference ratio, i.e., the downward magnitude ratio in the reference frame, the threshold η is a predetermined reference value for determining whether speech is present, and Λ_i(k) equal to 0 indicates a silent frame while Λ_i(k) equal to 1 indicates a frame in which speech is present.

The method of claim 17, wherein, in the determining of the silent frame, speech is determined to be present only when Λ_i(k) is 1 at three or more consecutive nodes.
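The consecutive-node rule in this claim can be sketched as a run-length check over the per-node flags Λ_i(k). The flags are taken as given here, since equations (E4) and (E5) that produce them are not reproduced in this text. The function name `speech_present` and the `min_run` parameter are hypothetical.

```python
def speech_present(flags, min_run=3):
    """Return True only if the per-node flags contain at least min_run
    consecutive entries equal to 1 (speech indicated at those nodes)."""
    run = 0
    for flag in flags:
        run = run + 1 if flag == 1 else 0
        if run >= min_run:
            return True
    return False
```

Isolated flags, such as a single node set to 1, therefore leave the frame classified as silent, which makes the decision less sensitive to a spurious indication at one node.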

generating a transform signal by performing a uniform wavelet packet transform on an input noisy speech signal of a current frame;

obtaining a smoothed transform signal by smoothing the transform signal using the smoothed transform signal of the previous frame and the magnitude of the transform signal of the current frame at an arbitrary fixed wavelet packet tree level;

determining whether the current frame is a silent frame using the magnitude ratio of wavelet packet transform coefficients between adjacent nodes of the smoothed transform signal at the fixed wavelet packet transform level;

if the current frame is a silent frame, updating a scaling factor and then estimating noise using the smoothed transform signal, and if the current frame is not a silent frame, estimating the noise immediately using the smoothed transform signal; and

The method of claim 19, wherein, in the noise estimation step,

if the scaling factor is smaller than a predetermined threshold, a first noise is estimated using only the input noisy speech signal, and if the scaling factor is greater than or equal to the threshold, a second noise is estimated using both the input noisy speech signal and the noise estimate of the previous frame, with a greater weight assigned to the input noisy speech signal than to the noise estimate.

The method of claim 20, wherein the scaling factor is estimated using the ratio of the sum of the magnitudes of the wavelet packet transform coefficients of the estimated noise signal in the previous frame to the sum of the magnitudes of the wavelet packet transform coefficients of the noisy speech signal in the current frame, and the scaling factor is set to 1 when the ratio of the sums is greater than 1.

A computer-readable recording medium storing a program that controls a computer to generate an enhanced speech signal from an input noisy speech signal of a current frame, wherein the program causes the computer to execute:

a process of converting the input noisy speech signal into the frequency domain to generate a transform signal composed of transform coefficients;

a process of estimating noise using the transform signal, in which, if a scaling factor is smaller than a predetermined threshold, a first noise is estimated using only the input noisy speech signal, and if the scaling factor is greater than or equal to the threshold, a second noise is estimated using both the input noisy speech signal and the noise estimate of the previous frame, with a greater weight assigned to the input noisy speech signal than to the noise estimate; and

a process of obtaining the enhanced speech signal by performing spectral subtraction using the estimated noise.
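The spectral subtraction step named in this claim can be illustrated with a basic magnitude-subtraction sketch. The claim does not specify the subtraction rule, so the form below (subtract the noise magnitude, floor the result at zero, keep the sign of the original coefficient) is a generic textbook variant chosen as an assumption, and the name `spectral_subtract` is hypothetical.

```python
import math


def spectral_subtract(coeffs, noise_mag):
    """Subtract the estimated noise magnitude from each coefficient
    magnitude, floor at zero, and restore the coefficient's sign."""
    enhanced = []
    for c, n in zip(coeffs, noise_mag):
        mag = max(abs(c) - n, 0.0)
        enhanced.append(math.copysign(mag, c))
    return enhanced
```

Flooring at zero prevents negative magnitudes when the noise estimate exceeds the observed coefficient.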

A computer-readable recording medium storing a program designed to control a computer to determine, from an input noisy speech signal of a current frame, whether the current frame is a silent frame, wherein the program causes the computer to execute:

a process of converting the input noisy speech signal of the current frame into the frequency domain to generate a transform signal composed of transform coefficients; and

a process of determining whether the current frame is a silent frame using the magnitude ratio of the transform coefficients between adjacent nodes of the transform signal at a transform level of a predetermined order.

a process of generating a transform signal by performing a uniform wavelet packet transform on the input noisy speech signal;

a process of obtaining a smoothed transform signal by smoothing the transform signal using the smoothed transform signal of the previous frame and the magnitude of the transform signal of the current frame at an arbitrary fixed wavelet packet tree level;

a process of determining whether the current frame is a silent frame using the magnitude ratio of wavelet packet transform coefficients between adjacent nodes of the smoothed transform signal at the fixed wavelet packet transform level;

a process of, if the current frame is a silent frame, updating a scaling factor and then estimating noise using the smoothed transform signal, and if the current frame is not a silent frame, estimating the noise immediately using the smoothed transform signal,