KR100639930B1

KR100639930B1 - Voice 2 stage end-point detection apparatus for automatic voice recognition system and method therefor

Info

Publication number: KR100639930B1
Application number: KR1020040097113A
Authority: KR
Inventors: 이성주
Original assignee: 한국전자통신연구원
Priority date: 2004-11-24
Filing date: 2004-11-24
Publication date: 2006-11-01
Also published as: KR20060057919A

Abstract

An object of the present invention is to provide a two-stage speech end-point detection apparatus and method for an automatic speech recognition system that combine the advantages of log energy-based end-point detection and statistical model-based end-point detection, so that the start point or end point of speech is detected more accurately in both dynamic and stationary noise environments. To achieve this object, the two-stage speech end-point detection apparatus of the present invention comprises: an input signal sound quality improving unit that removes additive noise components from an input signal to improve its sound quality; a log energy-based speech detector that detects the start point or end point of speech using the log energy of the noise-reduced input signal output from the input signal sound quality improving unit; and a statistical model-based speech detector that uses the start point or end point information output from the log energy-based speech detector and detects the start or end point of speech by an end-point detection method using a statistical model.

Speech end-point detection, speech extraction, non-speech rejection, log energy, statistical model

Description

Two-stage speech end-point detection apparatus for an automatic speech recognition system and method therefor {VOICE 2 STAGE END-POINT DETECTION APPARATUS FOR AUTOMATIC VOICE RECOGNITION SYSTEM AND METHOD THEREFOR}

FIG. 1 is a block diagram showing the configuration of a two-stage speech end-point detection apparatus of an automatic speech recognition system according to an embodiment of the present invention;

FIG. 2 is a detailed functional block diagram of the input signal sound quality improving unit of FIG. 1;

FIG. 3 is a detailed functional block diagram of the log energy-based speech detector of FIG. 1;

FIG. 4 is a state diagram of the speech start point/end point detector of FIG. 3;

FIG. 5 is a detailed functional block diagram of the statistical model-based speech detector of FIG. 1;

FIG. 6 is an operation flowchart showing a two-stage speech end-point detection method of an automatic speech recognition system according to an embodiment of the present invention;

FIG. 7 is a detailed operation flowchart of the input signal sound quality improving step of FIG. 6;

FIG. 8 is a detailed operation flowchart of the log energy-based speech detection step of FIG. 6;

FIG. 9 is a detailed operation flowchart of the statistical model-based speech detection step of FIG. 6.

<Explanation of reference numerals for the main parts of the drawings>

100: input signal sound quality improving unit

200: log energy-based speech detector

300: statistical model-based speech detector

The present invention relates to a two-stage speech end-point detection apparatus and method for an automatic speech recognition system, and more particularly to such an apparatus and method that detect the start point or end point of speech more accurately in both dynamic and stationary noise environments.

In general, automatic speech recognition is a technique for extracting the linguistic information contained in human speech: it analyzes the features of speech input through a microphone, headset, wired or wireless telephone, mobile phone, or the like, recognizes the speech, and performs a corresponding operation. Automatic speech recognition is widely used in fields closely tied to everyday life, such as home automation, speech recognition toys, speech recognition language-learning devices, speech recognition web browsers, speech recognition games, speech recognition mobile communication terminals, speech recognition home appliances, securities trading systems, automatic guidance systems, and speech recognition dialing systems.

An automatic speech recognition method generally comprises: a speech signal input process of receiving a speech signal through an input device such as a microphone or headset; a speech end-point detection process of extracting only the pure speech portion, excluding ambient noise, from the received signal; a speech feature extraction process of analyzing the frequency characteristics of the speech from the pure speech signal; a speech recognition process of recognizing the speech using a recognition algorithm; and a post-processing process of determining whether the recognition result is correct or a misrecognition, or of correcting the recognition result.

The speech end-point detection process extracts only the pure speech signal from the input signal. It is a preprocessing step of the overall automatic speech recognition process and strongly affects the performance of the whole system. One factor that strongly affects end-point detection performance is additive noise. Additive noise is broadly classified into stationary noise and dynamic noise. Stationary noise is additive noise whose frequency characteristics hardly change over time, and dynamic noise is additive noise whose frequency characteristics change dynamically over time.

Speech end-point detection is generally known to be more difficult in a dynamic noise environment than in a stationary one. Assuming that the input speech signal is relatively large compared with the ambient noise, an end-point detection method using the log energy of the input signal can detect approximate end points in both dynamic and stationary noise environments, but it has difficulty detecting the exact start and end points of speech. An end-point detection method using a statistical model can detect the start and end points of speech relatively accurately in a stationary noise environment, but its accuracy degrades in a dynamic noise environment.

Accordingly, the present invention has been made to solve the above problems of the prior art. An object of the present invention is to provide a two-stage speech end-point detection apparatus and method for an automatic speech recognition system that combine the advantages of the log energy-based end-point detection method and the statistical model-based end-point detection method, so that the start point or end point of speech can be detected more accurately in both dynamic and stationary noise environments, thereby improving the performance of the automatic speech recognition system.

To achieve the above object, the two-stage speech end-point detection apparatus of an automatic speech recognition system according to the present invention comprises: input signal sound quality improving means for removing additive noise components from an input signal to improve the sound quality of the input signal; log energy-based speech detection means for detecting the start point or end point of speech using the log energy of the noise-reduced input signal output from the input signal sound quality improving means; and statistical model-based speech detection means for using the start point or end point information output from the log energy-based speech detection means and detecting the start or end point of speech by an end-point detection method using a statistical model.

To achieve the above object, the two-stage speech end-point detection method of an automatic speech recognition system according to the present invention comprises: a first step of, when an input signal is received, removing additive noise components from the input signal to improve its sound quality; a second step of detecting the start point or end point of speech using the log energy of the noise-reduced input signal; and a third step of using the detected start point or end point information and detecting the start or end point of speech by an end-point detection method using a statistical model.

Hereinafter, a two-stage speech end-point detection apparatus and method for an automatic speech recognition system according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a functional block diagram of a two-stage speech end-point detection apparatus of an automatic speech recognition system according to an embodiment of the present invention.

As shown in FIG. 1, the two-stage speech end-point detection apparatus according to an embodiment of the present invention comprises: an input signal sound quality improving unit (100) that removes additive noise components from an input signal to improve its sound quality; a log energy-based speech detector (200) that detects the start point or end point of speech using the log energy of the noise-reduced input signal; and a statistical model-based speech detector (300) that uses the start point or end point information and detects the start or end point of speech by an end-point detection method using a statistical model.

FIG. 2 is a detailed functional block diagram of the input signal sound quality improving unit of FIG. 1.

As shown in FIG. 2, the input signal sound quality improving unit (100) comprises: an input signal buffering and framing unit (101) that buffers the input signal and frames, from the buffered input, a relatively short segment (10 msec or 20 msec) of the signal whose sound quality is to be improved; an input signal spectrum estimator (102) that analyzes the framed signal and estimates its frequency spectrum; a speech detector (103) that determines whether the framed signal is a speech signal; a noise spectrum estimator (104) that estimates the noise spectrum using the detection result of the speech detector (103); a noise removal filter coefficient estimator (105) that estimates the signal-to-noise ratio (SNR) of the input signal from the input spectrum estimated by the input signal spectrum estimator (102) and the noise spectrum estimated by the noise spectrum estimator (104), and estimates the noise removal filter coefficients based on that SNR; and a signal reproduction unit (106) that applies the noise removal filter coefficients to the signal framed by the input signal buffering and framing unit (101) and outputs a speech signal with improved sound quality.

FIG. 3 is a detailed functional block diagram of the log energy-based speech detector of FIG. 1.

As shown in FIG. 3, the log energy-based speech detector (200) comprises: a log energy estimator (201) that estimates the log energy of each frame of the input signal whose sound quality has been improved by the input signal sound quality improving unit (100); a noise log energy mean estimator (202) that estimates the mean log energy of the noise based on the frame log energy estimated by the log energy estimator (201) and the detection result of a speech detector (203); the speech detector (203), which compares the mean noise log energy estimated by the noise log energy mean estimator (202) with the log energy estimated by the log energy estimator (201) to determine whether the input frame is speech or background noise; and a speech start point/end point detector (204) that detects the start and end points of speech based on the detection result of the speech detector (203).

FIG. 5 is a detailed functional block diagram of the statistical model-based speech detector of FIG. 1.

As shown in FIG. 5, the statistical model-based speech detector (300) comprises: an input signal probability distribution function estimator (301) that estimates the probability distribution function of the input signal detected by the log energy-based speech detector (200); a noise probability distribution function estimator (302) that estimates the probability distribution function of the noise in the signal passed through the input signal probability distribution function estimator (301); a probability calculation and speech detection unit (303) that calculates, from the input frame, the probability that speech is absent or the probability that speech is present, based on the probability distribution function of the input signal estimated by the estimator (301) and the probability distribution function of the noise estimated by the estimator (302), and compares this probability with a threshold to decide whether speech is detected; and a speech start point/end point detector (304) that detects the start and end points of speech based on the detection result of the probability calculation and speech detection unit (303).

The operation of the two-stage speech end-point detection apparatus configured as above will now be described in detail with reference to FIG. 4 and FIGS. 6 to 9.

First, when an input signal is received (S100), the input signal sound quality improving unit (100) removes additive noise components from the input signal to improve its sound quality (S200).

The input signal sound quality improving step (S200) is described in detail below with reference to FIG. 7.

To process the digital input signal in real time, the input signal buffering and framing unit (101) buffers the input signal and frames, from the buffered input, a relatively short segment (10 msec or 20 msec) of the signal whose sound quality is to be improved (S201).
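The framing in step S201 can be illustrated with a minimal Python sketch. The function name `frame_signal`, the use of non-overlapping frames, and the decision to drop a trailing partial frame are assumptions made for illustration; the description only states that frames are 10 msec or 20 msec long.

```python
def frame_signal(samples, sample_rate_hz, frame_ms=20):
    """Split a buffered sample sequence into consecutive short frames.

    Assumption: frames do not overlap and a trailing partial frame is
    discarded. The patent text does not specify either choice.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len])
            for i in range(n_frames)]
```

At an 8 kHz sampling rate, a 20 msec frame holds 160 samples, so 800 buffered samples yield five frames.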

The input signal spectrum estimator (102) analyzes the input signal and estimates its frequency spectrum (S202). The estimated spectrum of the input signal is used both when the noise spectrum is estimated and when the noise removal filter coefficients are estimated.

The speech detector (103) decides whether speech is detected, for the purpose of estimating the noise spectrum (S203). That is, it determines whether the signal framed by the input signal buffering and framing unit (101) is a speech signal.

The noise spectrum estimator (104) estimates the noise spectrum using the detection result of the speech detector (103) (S204). That is, the noise spectrum is updated for frames in which the speech detector (103) detects no speech, and is not updated for frames in which speech is detected.
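The gated update of step S204 can be sketched as follows. The description only says that the noise spectrum is updated on non-speech frames and held on speech frames; the recursive averaging rule, the smoothing factor `alpha`, and the function name `update_noise_spectrum` are assumptions chosen for illustration.

```python
def update_noise_spectrum(noise_psd, frame_psd, is_speech, alpha=0.9):
    """Return the new per-bin noise power estimate.

    The estimate is frozen while speech is detected and refreshed
    otherwise. Assumption: first-order recursive averaging with
    smoothing factor alpha; the patent does not give the update rule.
    """
    if is_speech:
        return list(noise_psd)  # speech frame: keep the previous estimate
    return [alpha * n + (1.0 - alpha) * x
            for n, x in zip(noise_psd, frame_psd)]
```

A value of `alpha` close to 1 makes the estimate change slowly, which suits stationary noise; a smaller value tracks changing noise faster at the cost of a noisier estimate.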

The noise removal filter coefficient estimator (105) estimates the signal-to-noise ratio (SNR) of the input signal from the input spectrum estimated by the input signal spectrum estimator (102) and the noise spectrum estimated by the noise spectrum estimator (104), and estimates the noise removal filter coefficients based on that SNR (S205). The estimated noise removal filter coefficients are used by the signal reproduction unit (106).
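As one possible reading of step S205, the sketch below derives a per-bin gain from an SNR estimate. The description does not state which filter is used; the Wiener-type gain `snr / (1 + snr)`, the power-subtraction SNR estimate, the gain floor, and the name `suppression_gains` are all assumptions, not the patented formula.

```python
def suppression_gains(frame_psd, noise_psd, gain_floor=0.1, eps=1e-12):
    """Compute per-bin noise removal gains from input and noise power.

    Assumptions: the SNR is estimated by power subtraction and mapped to
    a Wiener-type gain; gain_floor limits over-suppression artifacts.
    """
    gains = []
    for x, n in zip(frame_psd, noise_psd):
        snr = max(x - n, 0.0) / (n + eps)  # estimated signal-to-noise ratio
        gain = snr / (1.0 + snr)           # Wiener-type gain in [0, 1)
        gains.append(max(gain, gain_floor))
    return gains
```

Multiplying each spectral bin of the framed signal by its gain and transforming back to the time domain would correspond to the role of the signal reproduction unit (106).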

The signal reproduction unit (106) applies the noise removal filter coefficients to the input signal to obtain a speech signal with improved sound quality (S206).

Next, the log energy-based speech detector (200) detects the start point or end point of speech using the log energy of the input signal from which the input signal sound quality improving unit (100) has removed the additive noise (S300).

The log energy-based speech detection step (S300) is described in detail below with reference to FIG. 8.

In general, an end-point detection algorithm based on log energy has the disadvantage that its performance degrades when the signal-to-noise ratio is low. That is, when the energy level of the additive noise is high, the noise is more likely to be confused with the pure speech signal. The input signal sound quality improving unit (100), which removes the additive noise, therefore contributes substantially to the performance of the log energy-based end-point detection algorithm.

The log energy estimator (201) estimates the log energy of each frame of the input signal whose sound quality has been improved by the input signal sound quality improving unit (100) (S301).

The noise log energy mean estimator (202) estimates the mean log energy of the noise based on the frame log energy estimated by the log energy estimator (201) and the detection result of the speech detector (203) (S302).

The speech detector (203) compares the estimated mean noise log energy with the input log energy to determine whether the input frame is speech or background noise (S303). That is, if the input log energy exceeds the mean noise log energy by more than a certain margin, the frame is judged to be speech; otherwise it is judged to be background noise. This detection method rests on the basic assumption that the input speech signal has greater energy than the background noise.
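Steps S301 to S303 can be sketched together as a small frame classifier. The class name `LogEnergyVad`, the natural-log energy definition, the decision `margin`, and the recursive mean update with factor `alpha` are assumptions; the description only requires that a frame be called speech when its log energy exceeds the mean noise log energy by some margin, and that the noise mean depend on the detection result.

```python
import math


def log_energy(frame, eps=1e-10):
    """Log of the frame energy; eps avoids log(0) on all-zero frames."""
    return math.log(sum(s * s for s in frame) + eps)


class LogEnergyVad:
    """Frame-level speech/noise decision from log energy (a sketch)."""

    def __init__(self, initial_noise_mean, margin=1.0, alpha=0.95):
        self.noise_mean = initial_noise_mean  # running mean noise log energy
        self.margin = margin                  # assumed decision margin
        self.alpha = alpha                    # assumed smoothing factor

    def is_speech(self, frame):
        energy = log_energy(frame)
        speech = energy > self.noise_mean + self.margin
        if not speech:
            # Only background-noise frames refresh the noise mean.
            self.noise_mean = (self.alpha * self.noise_mean
                               + (1.0 - self.alpha) * energy)
        return speech
```

In use, the noise mean might be initialized from the first few frames of a recording, which are assumed here to contain only background noise.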

The speech start point/end point detector (204) detects the start and end points of speech based on the speech detection result. In general, when at least a predetermined number of consecutive frames are classified as speech frames, the first of those consecutive speech frames is determined to be the start point of speech (S304). After the start point has been detected, when at least a predetermined number of consecutive input frames are classified as background noise frames, the first of those consecutive background noise frames is determined to be the end point of speech (S305).

FIG. 4 is a state diagram of the speech start point/end point detector.

The detector starts in the silence state. While the speech detection result is background noise, that is, silence, transition (1) is taken, which means the detector remains in the silence state. If the detection result is speech while in the silence state, the detector transitions to the first-half state (2). In the first-half state, if a predetermined number of consecutive speech frames are detected, the detector transitions to the speech state (5); if silence is detected, it returns to the silence state (3). While the number of consecutively detected speech frames is still at or below the predetermined number, the detector remains in the first-half state (4). A transition to the speech state therefore means that the predetermined number of speech frames have already been detected, so the start point of speech is the speech frame the predetermined number of frames before the moment of transition from the first-half state to the speech state. In the speech state, the detector remains in the speech state while speech frames are detected (6) and transitions to the second-half state when a silence frame is detected (7). In the second-half state, if a speech frame is detected the detector returns to the speech state (8); otherwise it remains in the second-half state (9). When a predetermined number of silence frames are detected consecutively, the detector transitions to the silence state (10), and the end point of speech is judged to have been detected at the frame the predetermined number of frames earlier.
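The four-state behavior of FIG. 4 can be sketched as a small state machine over per-frame speech flags. The state labels, the function name `detect_endpoints`, and the thresholds `min_speech` and `min_silence` (the "predetermined numbers", whose values the description does not give) are assumptions. The sketch returns the first utterance only, with the end point taken as the first frame of the closing silence run.

```python
SILENCE, FIRST_HALF, SPEECH, SECOND_HALF = (
    "silence", "first-half", "speech", "second-half")


def detect_endpoints(flags, min_speech=3, min_silence=3):
    """Return (start_frame, end_frame) of the first utterance, or None
    for a point that was not detected. flags holds one truthy/falsy
    speech decision per frame."""
    state, run, start = SILENCE, 0, None
    for i, is_speech in enumerate(flags):
        if state == SILENCE:
            if is_speech:
                state, run = FIRST_HALF, 1      # transition (2)
        elif state == FIRST_HALF:
            if is_speech:
                run += 1                        # transition (4)
            else:
                state, run = SILENCE, 0         # transition (3)
        elif state == SPEECH:
            if not is_speech:
                state, run = SECOND_HALF, 1     # transition (7)
        else:  # SECOND_HALF
            if is_speech:
                state, run = SPEECH, 0          # transition (8)
            else:
                run += 1                        # transition (9)
        if state == FIRST_HALF and run >= min_speech:
            state, start = SPEECH, i - run + 1  # transition (5)
        elif state == SECOND_HALF and run >= min_silence:
            return start, i - run + 1           # transition (10)
    return start, None
```

For the flag sequence `[0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0]` with both thresholds set to 3, the isolated speech frame at index 2 and the isolated silence frame at index 8 are ignored, and the sketch reports a start at frame 4 and an end at frame 10.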

Next, the statistical model-based speech detector (300) uses the start point or end point information output from the log energy-based speech detector (200) and detects the start or end point of speech by an end-point detection method using a statistical model (S400).

The statistical model-based speech detection step (S400) is described in detail below with reference to FIG. 9.

An end-point detection method using a statistical model assumes that the human speech signal follows a Gaussian, Laplacian, or gamma distribution and that the background noise follows a Gaussian distribution, and under these assumptions calculates the probability that speech is present or the probability that speech is absent and uses it for end-point detection. Compared with the log energy-based method, this method can detect the start or end point of speech more accurately, but in a dynamic noise environment it reacts sensitively to changes in the frequency characteristics of the noise, so its performance degrades. In the present invention, the end-point detection system is designed with the statistical model-based speech detector (300) as the stage following the log energy-based speech detector (200), which reduces the degree of variation in the frequency characteristics of the noise and thereby brings out the advantages of the statistical model-based method as fully as possible.

When the log energy-based speech detector (200) detects the start point of speech, it transmits the input signal it has been buffering, with a margin large enough to include a sufficient amount of the background noise preceding the speech, to the input signal probability distribution function estimator (301) of the statistical model-based speech detector (300).
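The hand-off described above can be pictured as a bounded buffer of the most recent frames that is flushed to the second stage once the first stage reports a start point. The class name `PreRollBuffer` and the parameter `margin_frames` are hypothetical; the description does not say how large the margin is or how the buffer is implemented.

```python
from collections import deque


class PreRollBuffer:
    """Keep the most recent frames so that background noise preceding
    the detected start point can be passed on to the second stage."""

    def __init__(self, margin_frames):
        # A deque with maxlen silently drops the oldest frame when full.
        self._frames = deque(maxlen=margin_frames)

    def push(self, frame):
        self._frames.append(frame)

    def flush(self):
        """Return the buffered frames in arrival order and empty the buffer."""
        frames = list(self._frames)
        self._frames.clear()
        return frames
```

In a streaming setup, every enhanced frame would be pushed, and `flush()` would be called once when the first stage signals the start point; `margin_frames` would need to cover the start-detection delay plus the desired amount of leading noise.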

The input signal probability distribution function estimator (301) estimates the probability distribution function of the input signal from these input frames (S401).

The noise probability distribution function estimator (302) estimates the probability distribution function of the noise based on the probability distribution function of the input signal estimated by the estimator (301) and the detection result of the probability calculation and speech detection unit (303) (S402). Gaussian, Laplacian, or gamma distributions are mainly used to estimate the probability distribution of the speech signal, and a Gaussian distribution is mainly used to estimate the probability distribution of the noise signal.

The probability calculation and speech detection unit (303) calculates, from the input frame, the probability that speech is absent or the probability that speech is present, based on the estimated probability distribution functions of the input signal and the noise signal (S403), and decides whether speech is detected by comparing this probability with a threshold (S404). That is, if the probability that speech is absent is higher than a specific threshold, the input frame is judged to be a non-speech frame; otherwise it is judged to be a speech frame.
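The decision in steps S403 and S404 can be sketched under simplifying assumptions: frame samples are treated as independent zero-mean Gaussians for both hypotheses (the description also allows Laplacian or gamma speech models), the variances are given rather than estimated, and a prior speech-absence probability is introduced. The function names and all parameter values are illustrative.

```python
import math


def gaussian_loglik(frame, variance):
    """Log-likelihood of the samples under a zero-mean Gaussian."""
    return sum(-0.5 * math.log(2.0 * math.pi * variance)
               - s * s / (2.0 * variance) for s in frame)


def speech_absence_probability(frame, noise_var, speech_var,
                               prior_absence=0.5):
    """Posterior probability that the frame contains only noise."""
    ll_noise = gaussian_loglik(frame, noise_var)
    # Assumption: speech and noise are independent, so variances add.
    ll_speech = gaussian_loglik(frame, noise_var + speech_var)
    d = ll_speech - ll_noise + math.log((1.0 - prior_absence) / prior_absence)
    if d > 700.0:  # math.exp would overflow; the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(d))


def is_speech_frame(frame, noise_var, speech_var, threshold=0.5):
    """Non-speech if the absence probability exceeds the threshold."""
    return speech_absence_probability(frame, noise_var, speech_var) <= threshold
```

Working in the log domain keeps the likelihood products from underflowing on frames of realistic length, which is why the posterior is computed from the log-likelihood difference rather than from the raw densities.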

The speech start point/end point detector (304) detects the start point and end point of the speech based on the speech detection result. In general, when at least a predetermined number of consecutive frames are judged to be speech frames, the first of those consecutive speech frames is determined to be the start point of the speech (S405). After the start point has been detected, when at least a predetermined number of consecutive input frames are judged to be background noise frames, the first of those consecutive background noise frames is determined to be the end point of the speech (S406).
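The run-length rule of steps S405 and S406 may be sketched as follows. The function name `find_endpoints` and the run lengths of three speech frames and five noise frames are assumptions chosen for the example; the original text specifies only "a predetermined number".

```python
def find_endpoints(decisions, min_speech_run=3, min_noise_run=5):
    """Return (start, end) frame indices from per-frame speech decisions.

    start: first frame of the first run of at least min_speech_run speech frames.
    end:   first frame of the first later run of at least min_noise_run noise
           frames. Either value is None if the corresponding run never occurs.
    """
    start = end = None
    run = 0
    for i, is_speech in enumerate(decisions):
        if start is None:
            run = run + 1 if is_speech else 0
            if run >= min_speech_run:
                start = i - run + 1  # first frame of the speech run (S405)
                run = 0
        else:
            run = run + 1 if not is_speech else 0
            if run >= min_noise_run:
                end = i - run + 1  # first frame of the noise run (S406)
                break
    return start, end


# Usage: isolated speech frames (index 1) and short pauses (index 7)
# are ignored because their runs are too short.
decisions = [0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1]
endpoints = find_endpoints(decisions)
```

Requiring runs rather than single frames keeps brief noise bursts from triggering a start point and brief pauses from triggering an end point.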

Thus, the statistical model-based method of detecting the start and end points of speech is nearly the same as the log energy-based method, except that a control function is added: when the log energy-based speech detector (200) has detected the end point of the speech, the end point is deemed to have been detected even if the statistical model-based speech detector (300) has not detected it.
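The control function described above may be illustrated by the following sketch. The function name `resolve_end_point` is hypothetical, and the choice to prefer the statistical detector's end point when both stages report one is an assumption, since the original text addresses only the case in which the statistical detector has not yet found an end point.

```python
def resolve_end_point(statistical_end, log_energy_end):
    """Combine the end points reported by the two stages.

    Each argument is a frame index, or None if that stage has not
    detected an end point.
    """
    if statistical_end is not None:
        # Assumption: the second-stage result is used when it is available.
        return statistical_end
    # Control function: the first-stage end point forces end detection.
    return log_energy_end


# Usage: the statistical stage has no end point yet, so the
# log energy-based end point is taken as the end of the speech.
forced_end = resolve_end_point(None, 120)
```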

Although the present invention has been described above in detail with reference to several embodiments, it is not necessarily limited to these embodiments and may be modified in various ways without departing from its technical spirit.

As described above, the two-stage speech endpoint detection apparatus and method for an automatic speech recognition system according to the present invention enable more accurate detection of the start point or end point of speech in dynamic as well as static noise environments. This not only improves the performance of the automatic speech recognition system but also improves its efficiency by preventing the load that inaccurate endpoint detection can place on it.

Claims

A two-stage speech endpoint detection apparatus for an automatic speech recognition system, comprising:

input signal sound quality improving means for removing additive noise components from an input signal to improve the sound quality of the input signal;

log energy-based speech detection means for detecting a start point or an end point of speech using the log energy of the input signal, output from the input signal sound quality improving means, from which the additive noise has been removed; and

statistical model-based speech detection means for detecting the start point or end point of the speech by a statistical model-based endpoint detection method, using the start point or end point information output from the log energy-based speech detection means, whereby the start point and end point of the speech are detected more accurately in both dynamic and static noise environments.

The apparatus of claim 1, wherein the input signal sound quality improving means comprises:

an input signal buffering and framing unit for buffering the input signal and framing, from the buffered input signal, a predetermined short segment of the speech signal whose sound quality is to be improved;

an input signal spectrum estimator for analyzing the frequency spectrum of the signal framed by the input signal buffering and framing unit and estimating that spectrum;

a speech detector for determining whether the signal framed by the input signal buffering and framing unit is a speech signal;

a noise spectrum estimator for estimating a noise spectrum using the speech detection result of the speech detector;

a noise removal filter coefficient estimator for estimating the signal-to-noise ratio of the input signal using the input signal spectrum estimated by the input signal spectrum estimator and the noise spectrum estimated by the noise spectrum estimator, and estimating noise removal filter coefficients based on that ratio; and

a signal reproducing unit for applying the noise removal filter coefficients to the signal framed by the input signal buffering and framing unit and outputting a speech signal with improved sound quality.

The apparatus of claim 2,

wherein the noise spectrum estimator updates the noise spectrum for frames in which the speech detector does not detect speech, and does not update the noise spectrum for frames in which speech is detected.

The apparatus of claim 1, wherein the log energy-based speech detection means comprises:

a log energy estimator for estimating log energy from a frame of the input signal whose sound quality has been improved by the input signal quality improving means;

a noise log energy average estimator for estimating the average log energy of the noise based on the log energy of the input frame estimated by the log energy estimator and the speech detection result of a speech detector;

the speech detector, which compares the average log energy of the noise estimated by the noise log energy average estimator with the log energy estimated by the log energy estimator to determine whether the input frame is speech or background noise; and

a speech start point/end point detector for detecting the start point and end point of the speech based on the speech detection result of the speech detector.

The apparatus of claim 4,

wherein the noise log energy average estimator judges the input to be speech if the input log energy exceeds the average log energy of the noise by at least a predetermined amount, and otherwise judges it to be background noise.

The apparatus of claim 4,

wherein, when at least a predetermined number of consecutive frames are judged to be speech frames, the speech detector determines the first of those consecutive speech frames to be the start point of the speech, and, after the start point has been detected, when at least a predetermined number of consecutive input frames are judged to be background noise frames, determines the first of those consecutive background noise frames to be the end point of the speech.

The apparatus of claim 1, wherein the statistical model-based speech detection means comprises:

an input signal probability distribution function estimator for estimating the probability distribution function of the input signal detected by the log energy-based speech detection means;

a noise probability distribution function estimator for estimating the probability distribution function of the noise in the signal passed through the input signal probability distribution function estimator;

a probability calculation and speech detector for calculating, from the input frame, the probability that speech is absent or the probability that speech is present, based on the probability distribution function of the input signal estimated by the input signal probability distribution function estimator and the probability distribution function of the noise estimated by the noise probability distribution function estimator, and comparing this probability with a threshold to decide whether speech is detected; and

a speech start point/end point detector for detecting the start point and end point of the speech based on the speech detection result of the probability calculation and speech detector.

The apparatus of claim 7,

wherein the speech detector judges the input frame to be a non-speech frame if the probability that speech is absent is higher than a specific threshold, and otherwise judges it to be a speech frame.

The apparatus of claim 7,

wherein, when at least a predetermined number of consecutive frames are judged to be speech frames, the speech start point/end point detector determines the first of those consecutive speech frames to be the start point of the speech, and, after the start point has been detected, when at least a predetermined number of consecutive input frames are judged to be background noise frames, determines the first of those consecutive background noise frames to be the end point of the speech.

A two-stage speech endpoint detection method for an automatic speech recognition system, comprising:

a first step of, when an input signal is received, removing additive noise components from the input signal to improve the sound quality of the input signal;

a second step of detecting a start point or an end point of speech using the log energy of the input signal from which the additive noise has been removed; and

a third step of detecting the start point or end point of the speech by a statistical model-based endpoint detection method, using the detected start point or end point information, whereby the start point and end point of the speech are detected more accurately in both dynamic and static noise environments.

The method of claim 10, wherein the first step comprises:

a first process of buffering the input signal so that the digital input signal can be processed in real time, and framing, from the buffered input signal, a predetermined short segment of the speech signal whose sound quality is to be improved;

a second process of analyzing the frequency spectrum of the framed signal to estimate the spectrum of the input signal;

a third process of determining whether the framed signal is a speech signal;

a fourth process of estimating a noise spectrum using the speech detection result;

a fifth process of estimating the signal-to-noise ratio of the input signal using the estimated input signal spectrum and the estimated noise spectrum, and estimating noise removal filter coefficients based on that ratio; and

a sixth process of applying the noise removal filter coefficients to the framed signal to obtain a speech signal with improved sound quality.

The method of claim 11,

wherein the fourth process updates the noise spectrum for frames in which speech is not detected, and does not update the noise spectrum for frames in which speech is detected.

The method of claim 10, wherein the second step comprises:

a first process of estimating log energy from a frame of the input signal whose sound quality has been improved in the first step;

a second process of estimating the log energy of the noise based on the estimated log energy of the input frame and the speech detection result;

a third process of comparing the estimated average log energy of the noise with the input log energy to determine whether the input frame is speech or background noise; and

a fourth process of detecting the start point and end point of the speech based on the speech detection result.

The method of claim 13,

wherein the third process judges the input to be speech if the input log energy exceeds the average log energy of the noise by at least a predetermined amount, and otherwise judges it to be background noise.

The method of claim 13, wherein the fourth process comprises:

a process 4-1 of, when at least a predetermined number of consecutive frames are judged to be speech frames, determining the first of those consecutive speech frames to be the start point of the speech; and

a process 4-2 of, after the start point has been detected, when at least a predetermined number of consecutive input frames are judged to be background noise frames, determining the first of those consecutive background noise frames to be the end point of the speech.

The method of claim 10, wherein the third step comprises:

a first process of estimating the probability distribution function of the input signal from the input frame of the second step;

a second process of estimating the probability distribution function of the noise based on the estimated probability distribution function of the input signal and the speech detection result;

a third process of calculating, from the input frame, the probability that speech is absent or the probability that speech is present, based on the estimated probability distribution function of the input signal and the probability distribution function of the noise signal;

a fourth process of deciding whether speech is detected by comparing this probability with a threshold; and

a fifth process of detecting the start point and end point of the speech based on the speech detection result.

The method of claim 16,

wherein the fourth process judges the input frame to be a non-speech frame if the probability that speech is absent is higher than a specific threshold, and otherwise judges it to be a speech frame.

The method of claim 16, wherein the fifth process comprises:

a sub-process 5-1 of, when at least a predetermined number of consecutive frames are judged to be speech frames, determining the first of those consecutive speech frames to be the start point of the speech; and

a sub-process 5-2 of, after the start point has been detected, when at least a predetermined number of consecutive input frames are judged to be background noise frames, determining the first of those consecutive background noise frames to be the end point of the speech.