KR100586045B1

KR100586045B1 - Recursive Speaker Adaptation Automation Speech Recognition System and Method using EigenVoice Speaker Adaptation

Info

Publication number: KR100586045B1
Application number: KR1020030078383A
Authority: KR
Inventors: 전형배; 김동국
Original assignee: 한국전자통신연구원
Priority date: 2003-11-06
Filing date: 2003-11-06
Publication date: 2006-06-07
Also published as: KR20050043472A

Abstract

The present invention relates to a recursive speaker-adaptive speech recognition system and method using eigenvoice speaker adaptation. The input speaker's speech is first recognized, eigenvoice (EigenVoice) speaker adaptation is then performed in an unsupervised manner (unsupervised adaptation) using the first-pass recognition result, and the speaker's speech is recognized a second time using the speaker-adapted acoustic model, which further improves the recognition rate.

The system comprises a speech feature extraction unit that extracts feature vectors used for speech recognition from an input speech signal; a primary recognition unit that recognizes the vocabulary of the speech signal using the extracted feature vectors and a previously trained speaker-independent acoustic model; a speaker adaptation unit that performs speaker adaptation with reference to label information, which is the recognition result of the primary recognition unit, and observation data, which are the feature vectors from the speech feature extraction unit; and a secondary recognition unit that outputs the recognition result of the speech signal using the speaker-dependent acoustic model provided by the speaker adaptation unit.

Speech recognition system, speaker adaptation, eigenvoice (EigenVoice), unsupervised method

Description

Recursive Speaker Adaptation Automation Speech Recognition System and Method using EigenVoice Speaker Adaptation

FIG. 1 is a block diagram of a general speaker-independent speech recognition system.

FIG. 2 is a block diagram of a recursive speaker-adaptive speech recognition system according to the present invention.

FIG. 3 is a detailed block diagram of the recursive speaker-adaptive speech recognition system according to the present invention.

FIG. 4 is a flowchart of an eigenvoice (EigenVoice) speaker adaptation method according to the present invention.

FIG. 5 is a flowchart of a recursive speaker-adaptive speech recognition method according to the present invention.

<Explanation of reference numerals for the main parts of the drawings>

100: speech signal
200: recursive speaker-adaptive speech recognition system
210: speech feature extraction unit
220: primary recognition unit
230: speaker adaptation unit
240: secondary recognition unit
300: recognition result

The present invention relates to a recursive speaker-adaptive speech recognition system and method using eigenvoice (EigenVoice) speaker adaptation. More particularly, it uses an unsupervised adaptation scheme in which speaker adaptation is performed on the speech that a user utters while using the speech recognition service, so that a specific speaker's speech can be recognized more accurately even when very little data is available for adaptation.

A general large-scale speech recognition system performs recognition for unspecified speakers, so speech data are collected from many training speakers to train a speaker-independent acoustic model. Such a system is called a speaker-independent speech recognition system.

FIG. 1 is a block diagram of a general speaker-independent speech recognition system. As shown in FIG. 1, when the speech signal of an unspecified speaker is input to the speech recognition system (20), the speech feature extraction unit (21) extracts from the input signal the feature vectors needed for speech recognition.

The search unit (23) uses the feature vectors extracted by the speech feature extraction unit (21) and a previously trained speaker-independent HMM acoustic model (22) to search, in a search space made up of the vocabulary to be recognized, for the words most similar to the speech, and outputs the word with the highest probability as the recognition result (30).

However, when a particular speaker uses such a system continuously, the speaker-independent speech recognition system, which relies on speech data from unspecified speakers, performs worse than a speaker-dependent speech recognition system that uses a speaker-dependent acoustic model trained on that speaker's own speech data.

For this reason, a speaker adaptation method is needed that converts the speaker-independent acoustic model into a speaker-dependent acoustic model using the specific speaker's speech. One existing speaker adaptation method is fast speaker adaptation based on supervised adaptation.

In the fast speaker adaptation method, the user utters a few predetermined sentences, and speaker adaptation is performed on those sentences in a supervised manner (supervised adaptation).

However, the supervised adaptation scheme is inconvenient in a real speech recognition service environment, because the user must utter the specific sentences before speech recognition can be used.

Accordingly, the present invention is intended to solve the conventional problems described above. Its object is to provide a recursive speaker-adaptive speech recognition system and method using eigenvoice speaker adaptation in which, for unsupervised adaptation, label information for the speech signal is obtained from a primary recognition unit, a first speaker adaptation is performed using the obtained label information, and a second speaker adaptation is then performed using the speaker-adapted acoustic model, so that speaker adaptation performance improves the more the system is used and the performance of the speech recognition system improves as a result.

To achieve this object, the recursive speaker-adaptive speech recognition system using eigenvoice speaker adaptation comprises: a speech feature extraction unit that extracts feature vectors used for speech recognition from an input speech signal; a primary recognition unit that recognizes the vocabulary of the speech signal using the extracted feature vectors and a previously trained speaker-independent acoustic model; a speaker adaptation unit that performs speaker adaptation with reference to label information, which is the recognition result of the primary recognition unit, and observation data, which are the feature vectors from the speech feature extraction unit; and a secondary recognition unit that outputs the recognition result of the speech signal using the speaker-dependent acoustic model provided by the speaker adaptation unit.

To achieve this object, the recursive speaker-adaptive speech recognition method using eigenvoice speaker adaptation comprises the steps of: (a) determining whether the input speech signal results from a change of speaker and, if so, extracting feature vectors used for speech recognition from the input speech signal and extracting a first recognition result for the speech signal using the extracted feature vectors and a previously trained speaker-independent acoustic model; (b) performing speaker adaptation with reference to the first recognition result as label information of the speech signal and to the speech feature vectors as observation data; and (c) extracting a second recognition result for the speech signal using the speaker-dependent acoustic model obtained in the speaker adaptation step.

In addition, when the input speech signal does not result from a change of speaker because the previous speaker is still speaking, feature vectors used for speech recognition are extracted from the input speech signal, and the first recognition result is extracted using the extracted feature vectors and the speaker-dependent acoustic model that was adapted using the previous speech signal.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 2 is a block diagram of a recursive speaker-adaptive speech recognition system according to the present invention.

When a speech signal arrives, the recursive speaker-adaptive speech recognition system (200) first recognizes the input speaker's speech in the primary recognition unit (220), using the feature vectors obtained by the speech feature extraction unit (210) and an acoustic model. The result recognized by the primary recognition unit (220) serves as the label information required for unsupervised adaptation, and the speaker adaptation unit (230) performs speaker adaptation with it.

The acoustic model adapted by the speaker adaptation unit (230) is used by the secondary recognition unit (240) to recognize the speaker's speech a second time, and the recognition result (300) output by the secondary recognition unit (240) is the final result.
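The data flow of FIG. 2 can be sketched in a few lines of Python. This is only an illustrative outline: the callables `extract`, `decode`, and `adapt` and the `Model` type are hypothetical placeholders for units 210 to 240, not interfaces defined in the patent.

```python
from typing import Callable, List, Sequence, Tuple

Features = List[List[float]]  # one feature vector per frame
Model = dict                  # hypothetical stand-in for an HMM acoustic model


def two_pass_recognize(
    signal: Sequence[float],
    model: Model,
    extract: Callable[[Sequence[float]], Features],
    decode: Callable[[Features, Model], str],
    adapt: Callable[[Model, Features, str], Model],
) -> Tuple[str, Model]:
    """Run first-pass decoding, unsupervised adaptation, then second-pass decoding."""
    features = extract(signal)                    # speech feature extraction unit (210)
    first_pass = decode(features, model)          # primary recognition unit (220)
    adapted = adapt(model, features, first_pass)  # speaker adaptation unit (230)
    final = decode(features, adapted)             # secondary recognition unit (240)
    return final, adapted
```

The first-pass hypothesis is used only as a label for adaptation; the value returned to the user is the second-pass result.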

FIG. 3 is a detailed block diagram of the recursive speaker-adaptive speech recognition system according to the present invention.

As shown in FIG. 3, the system comprises: a speech feature extraction unit (210) that extracts feature vectors used for speech recognition from the input speech signal (100); a primary recognition search unit (212) that searches for the words most similar to the input speech signal (100) using the feature vectors extracted by the speech feature extraction unit (210) and a previously trained speaker-independent acoustic model (223); a primary recognition unit (220) that outputs, as the first recognition result of the speech recognition system, the word with the highest probability among the words found by the primary recognition search unit (212); a speaker adaptation unit (230) that performs speaker adaptation with reference to the first recognition result as label information and to the speech feature vectors extracted by the speech feature extraction unit (210) as observation data; a secondary recognition search unit (233) that performs second-pass recognition of the speech signal using the speaker-adapted, speaker-dependent acoustic model (231) obtained by the speaker adaptation unit (230); and a secondary recognition unit (240) that outputs the recognition result of the secondary recognition search unit (233) as the final result.

FIG. 4 is a flowchart of the eigenvoice (EigenVoice) speaker adaptation method according to the present invention. The step in which the speaker adaptation unit (230) generates the speaker-dependent acoustic model is described in more detail below.

First, from the label information given by the recognition result of the primary recognition unit (220), alignment information is found that maps the triphones, which are the units of speech recognition, onto the speech signal (S232).

The confidence calculation unit of the speaker adaptation unit (230) then uses the triphone alignment information and the feature vectors obtained by the speech feature extraction unit (210) to calculate, for each frame (the unit along the time axis), the confidence that the feature data match the corresponding triphone (S233).

The confidence calculation uses the log likelihood ratio (LLR) between the triphone and an anti-model. The anti-model is built by pooling all monophones other than the monophone that corresponds to the center phone of the triphone.

Next, the LLR value obtained by the confidence calculation unit is compared with a threshold (S234). If the LLR value is greater than the threshold, the corresponding information is accumulated as observation information needed for the speaker adaptation calculation (S235). If the LLR value is smaller than the threshold, the information of that feature vector is ignored and not used.
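As a rough illustration of steps S233 to S235, the sketch below scores each frame with a log likelihood ratio and keeps only frames above a threshold. It assumes, purely for brevity, that the aligned triphone state and its anti-model are each a single diagonal-covariance Gaussian given as a `(mean, variance)` pair; the patent's models are HMMs, and the function names here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Gaussian = Tuple[Sequence[float], Sequence[float]]  # (mean, variance), diagonal covariance


def diag_gauss_loglik(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of x under a diagonal-covariance Gaussian."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def select_reliable_frames(
    frames: Sequence[Sequence[float]],
    aligned_models: Sequence[Gaussian],  # model aligned to each frame (from S232)
    anti_models: Sequence[Gaussian],     # anti-model paired with each frame
    threshold: float,
) -> List[Tuple[int, float]]:
    """Return (frame index, LLR) for frames whose LLR exceeds the threshold."""
    kept = []
    for t, (x, tri, anti) in enumerate(zip(frames, aligned_models, anti_models)):
        llr = diag_gauss_loglik(x, *tri) - diag_gauss_loglik(x, *anti)  # S233
        if llr > threshold:                                             # S234
            kept.append((t, llr))                                       # S235
    return kept
```

Frames that fail the test are simply skipped, so a mislabeled stretch of the first-pass result contributes nothing to the adaptation statistics.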

In this way, the coefficients of the eigenvoices (C_i and W_i in Equation 1, respectively) are estimated using the observation information obtained from reliable label information, that is, from the portions of the speech signal whose confidence exceeds the reference threshold (S236).
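The patent text does not spell out the estimator used in S236. As one simplified possibility, if the eigenvoices are assumed orthonormal, every Gaussian is assumed to have equal variance, and an observed mean is available for every dimension, the least-squares estimate of each coefficient reduces to a projection of the mean-centered observation onto the corresponding eigenvoice. The function below sketches only that special case, and all names in it are hypothetical.

```python
from typing import List, Sequence


def estimate_coefficients(
    observed: Sequence[float],               # observed mean supervector for the new speaker
    mean_sv: Sequence[float],                # average supervector over training speakers
    eigenvoices: Sequence[Sequence[float]],  # assumed orthonormal
) -> List[float]:
    """Project the mean-centered observation onto each eigenvoice."""
    residual = [o - m for o, m in zip(observed, mean_sv)]
    return [sum(w_d * r_d for w_d, r_d in zip(w, residual)) for w in eigenvoices]
```

With sparse or unevenly reliable observations, a weighted estimate that solves a small linear system in the P coefficients would be needed instead of this plain projection.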

Here, an eigenvoice (EigenVoice) is an eigenvector computed from supervectors, each of which is formed by concatenating into a single vector the parameters of a speaker-dependent HMM acoustic model obtained from one of many training speakers. A linear combination of the eigenvoices therefore yields a new HMM acoustic model.

The new acoustic model obtained by the eigenvoice speaker adaptation scheme is expressed by Equation 1 below.

Equation 1

$$\hat{\mu} = \bar{\mu} + \sum_{i=1}^{P} C_i W_i$$

Equation 1 is an example in which only the Gaussian mean values of the HMM acoustic model are adapted. Here $\hat{\mu}$ is the adapted mean supervector, $W_i$ are the eigenvoices, $C_i$ is the coefficient of each eigenvoice, $\bar{\mu}$ is the average of the speaker-dependent HMM Gaussian mean values, and $P$ is the number of principal eigenvoices among the eigenvoices obtained during training.
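The linear combination described for Equation 1 can be written directly as code. In this minimal sketch the supervectors are plain Python lists, and the function and argument names are illustrative rather than taken from the patent.

```python
from typing import List, Sequence


def adapted_mean(
    mean_sv: Sequence[float],                # average of the speaker-dependent Gaussian means
    eigenvoices: Sequence[Sequence[float]],  # W_1 .. W_P, each as long as mean_sv
    coeffs: Sequence[float],                 # C_1 .. C_P
) -> List[float]:
    """Return mean_sv + sum_i C_i * W_i."""
    if len(eigenvoices) != len(coeffs):
        raise ValueError("need exactly one coefficient per eigenvoice")
    out = list(mean_sv)
    for c, w in zip(coeffs, eigenvoices):
        for d, w_d in enumerate(w):
            out[d] += c * w_d
    return out
```

Only P coefficients have to be estimated for a new speaker, which is why the approach can work with very little adaptation data.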

In this way, a newly adapted speaker-dependent acoustic model is generated from the eigenvoices obtained from the many speaker-dependent HMM acoustic models and the eigenvoice coefficients calculated in the coefficient estimation step (S238).

However, when only a small amount of observation information passes the confidence test and is accumulated, the estimated eigenvoice coefficients become unreliable, and the adapted acoustic model obtained from them does not perform well either.

To compensate for this, a coefficient estimation step that uses prior distribution information of the eigenvoice coefficients is included. The prior distribution information consists of the Gaussian mean and standard deviation obtained for each training speaker by modeling, with a Gaussian distribution, the coefficients computed through the eigenvoice coefficient estimation procedure on the training speech data during acoustic model training.

In this step, the distance between the eigenvoice coefficient values computed from the actual user's speech and each training speaker's eigenvoice coefficient distribution is calculated using the Gaussian mean and standard deviation, and the prior eigenvoice coefficient distribution with the smallest distance is selected (S237).

Next, the estimated eigenvoice coefficient values are corrected by the MAP (maximum a posteriori) method using the mean and standard deviation of the Gaussian distribution function of the selected prior eigenvoice coefficient distribution.

Applying the eigenvoice coefficient values corrected with the prior information to Equation 1 generates a new speaker-dependent acoustic model adapted by the eigenvoice method (S238).
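The prior selection (S237) and the MAP correction can be illustrated as follows. The patent does not give the distance or the MAP formula explicitly, so this sketch makes two assumptions: the distance is a variance-normalized squared distance, and the correction is the standard posterior mean for a Gaussian prior with a Gaussian observation whose variance `obs_var` is a hypothetical tuning value (larger when less adaptation data has been accumulated).

```python
from typing import List, Sequence, Tuple

Prior = Tuple[Sequence[float], Sequence[float]]  # (means, standard deviations) per coefficient


def nearest_prior(coeffs: Sequence[float], priors: Sequence[Prior]) -> Prior:
    """Pick the training-speaker prior closest to the estimated coefficients."""
    def distance(prior: Prior) -> float:
        means, stds = prior
        return sum(((c - m) / s) ** 2 for c, m, s in zip(coeffs, means, stds))
    return min(priors, key=distance)


def map_correct(coeffs: Sequence[float], prior: Prior, obs_var: float) -> List[float]:
    """Shrink each estimated coefficient toward the prior mean (Gaussian posterior mean)."""
    means, stds = prior
    corrected = []
    for c, m, s in zip(coeffs, means, stds):
        prior_var = s * s
        corrected.append((prior_var * c + obs_var * m) / (prior_var + obs_var))
    return corrected
```

With a small `obs_var` the estimate is left almost unchanged, while a large `obs_var` pulls it toward the selected training speaker's mean, which matches the stated purpose of stabilizing estimates made from little data.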

The acoustic model adapted by the eigenvoice method is used by the secondary recognition unit (240) for second-pass recognition of the speaker's speech.

As shown in FIG. 5, it is determined whether the speech signal input to the speech recognition system (200) results from a change of speaker (S301). If it does, feature vectors used for speech recognition are extracted from the input speech signal, and the first recognition result for the speech signal is extracted (S304) using the extracted feature vectors and the previously trained speaker-independent acoustic model (S302).

Whether the speaker has changed is determined either by having the new speaker enter a unique speaker code when the speaker changes, or by using a GMM (Gaussian Mixture Model) trained on speaker characteristics extracted from the speaker's speech signal.
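The two detection options can be combined in a small helper. This is a sketch under stated assumptions: the GMM has diagonal covariances and is given as `(weight, mean, variance)` components, a change is declared when the average frame log-likelihood under the current speaker's GMM falls below a hypothetical `threshold`, and `frames` is non-empty. The patent does not specify the decision rule.

```python
import math
from typing import Optional, Sequence, Tuple

Component = Tuple[float, Sequence[float], Sequence[float]]  # (weight, mean, variance)


def gmm_loglik(x: Sequence[float], gmm: Sequence[Component]) -> float:
    """Log-likelihood of one frame under a diagonal-covariance GMM (log-sum-exp)."""
    logs = [
        math.log(w) - 0.5 * sum(
            math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
            for xi, m, v in zip(x, mean, var)
        )
        for w, mean, var in gmm
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))


def speaker_changed(
    frames: Sequence[Sequence[float]],
    current_gmm: Sequence[Component],
    threshold: float,
    entered_code: Optional[str] = None,
    current_code: Optional[str] = None,
) -> bool:
    """Use an explicit speaker code when one was entered, otherwise the GMM score."""
    if entered_code is not None:
        return entered_code != current_code
    avg = sum(gmm_loglik(x, current_gmm) for x in frames) / len(frames)
    return avg < threshold
```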

Next, speaker adaptation is performed with reference to the extracted first recognition result as label information of the speech signal and to the speech feature vectors as observation data (S305).

The speaker adaptation step is as described for FIG. 4. The speaker-dependent acoustic model obtained through speaker adaptation in this step is updated and stored as the speaker-adapted acoustic model (S306), and is used to output the second recognition result for the speaker (S307).

On the other hand, when the speech input to the speech recognition system comes from the same speaker as the previous speech signal, the first recognition step uses the speaker-dependent acoustic model that was adapted using the previous speech signal (S303).

The recognition rate of the first recognition step is then better than with the speaker-independent acoustic model. As a result, more observation information passes the confidence test and more accurate information is passed to the speaker adaptation unit, so the resulting acoustic model comes closer to a true speaker-dependent model.

Finally, the speaker-dependent acoustic model is used by the secondary recognition unit for second-pass recognition of the speaker's speech, and the result is output (S308).
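The recursive flow of FIG. 5 amounts to keeping the adapted model between utterances and resetting to the speaker-independent model when the speaker changes. The class below is an illustrative outline only: `decode` and `adapt` are hypothetical callables, and passing the first-pass model into `adapt` is an assumption of this sketch rather than something the patent specifies.

```python
from typing import Any, Callable, Optional


class RecursiveRecognizer:
    """Two-pass recognizer that reuses the adapted model for the same speaker."""

    def __init__(
        self,
        si_model: Any,
        decode: Callable[[Any, Any], Any],
        adapt: Callable[[Any, Any, Any], Any],
    ) -> None:
        self.si_model = si_model            # speaker-independent model (S302)
        self.sd_model: Optional[Any] = None  # speaker-adapted model, once available
        self.decode = decode
        self.adapt = adapt

    def recognize(self, features: Any, speaker_changed: bool) -> Any:
        if speaker_changed or self.sd_model is None:
            first_model = self.si_model     # new speaker: start from the SI model
        else:
            first_model = self.sd_model     # same speaker: reuse adapted model (S303)
        labels = self.decode(features, first_model)                # first pass (S304)
        self.sd_model = self.adapt(first_model, features, labels)  # adapt and store (S305, S306)
        return self.decode(features, self.sd_model)                # second pass
```

Each additional utterance from the same speaker starts its first pass from a better model than the previous one did, which is the source of the "the more it is used, the better it gets" behavior described in the text.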

Thus, a speech recognition system that uses the recursively speaker-adapted acoustic model has the advantage that the more a speaker uses the system, the more the speaker adaptation is updated and the closer the model comes to a speaker-dependent model, which improves performance.

The recursive speaker-adaptive speech recognition system and method using eigenvoice speaker adaptation according to the present invention may be implemented as a computer program and stored on a recording medium such as a hard disk, floppy disk, magneto-optical disk, CD-ROM, ROM, or RAM.

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to those embodiments. Anyone with ordinary skill in the art to which the invention pertains can make various modifications without departing from the gist of the invention as claimed, and such modifications fall within the scope of the claims.

As described above, the present invention performs speaker adaptation without significant inconvenience to the user while adapting the acoustic model toward a speaker-dependent model, which greatly improves the recognition performance of the speech recognition system.

In addition, because a continuously adapted acoustic model is used, the more a speaker uses the speech recognition system, the more closely the model adapts toward a speaker-dependent model, which increases the recognition performance of the system.

Claims

A recursive speaker-adaptive speech recognition system using eigenvoice speaker adaptation, comprising:

a speech feature extraction unit that extracts feature vectors used for speech recognition from an input speech signal;

a primary recognition unit that recognizes the vocabulary of the speech signal using the extracted feature vectors and a previously trained speaker-independent acoustic model;

a speaker adaptation unit that performs speaker adaptation and includes a confidence calculation unit which, with reference to triphone alignment information obtained from the label information that is the recognition result of the primary recognition unit and to observation data that are the feature vectors obtained from the speech feature extraction unit, calculates the confidence that the feature data in each frame match the corresponding triphone; and

a secondary recognition unit that outputs the recognition result of the speech signal using the speaker-dependent acoustic model provided by the speaker adaptation unit.

(Deleted)

A recursive speaker adaptation method using eigenvoice speaker adaptation, comprising the steps of:

(a) determining whether an input speech signal results from a change of speaker; if it does, extracting feature vectors used for speech recognition from the input speech signal and extracting a first recognition result for the speech signal using the extracted feature vectors and a previously trained speaker-independent acoustic model; and if it does not, because the previous speaker is still speaking, extracting feature vectors used for speech recognition from the input speech signal and extracting the first recognition result using the extracted feature vectors and the speaker-dependent acoustic model adapted using the previous speech signal;

(b) performing speaker adaptation with reference to the first recognition result as label information of the speech signal and to the speech feature vectors as observation data; and

(c) extracting a second recognition result for the speech signal using the speaker-dependent acoustic model obtained in the speaker adaptation step.

The recursive speaker adaptation method using eigenvoice speaker adaptation of claim 3, wherein step (c) comprises the steps of:

(d) finding, from the label information of the first recognition result extracted in step (a), alignment information that maps the triphones, which are the units of speech recognition, onto the speech signal;

(e) calculating, using the triphone alignment information and the feature vectors obtained by the speech feature extraction unit, the confidence that the feature data in each frame match the corresponding triphone;

(f) comparing the LLR (log likelihood ratio) value used as the confidence measure with a threshold and, when the LLR value is greater than the threshold, accumulating the corresponding information as observation information needed for the speaker adaptation calculation; and

(g) estimating the coefficients of the eigenvoices (EigenVoice) using the accumulated observation information, and generating an HMM acoustic model from the estimated coefficients and the eigenvoices obtained in advance from a plurality of speaker-dependent HMM acoustic models.

The recursive speaker adaptation method using eigenvoice speaker adaptation of claim 4, wherein step (e) uses the LLR of an anti-model built from monophones when the speaker-independent acoustic model is trained.

The recursive speaker adaptation method using eigenvoice speaker adaptation of claim 4, wherein step (g) comprises the steps of:

(h) when the amount of accumulated observation information is small, calculating the distance between the eigenvoice (EigenVoice) coefficient values obtained from the actual user's speech and the eigenvoice coefficient distributions of the training speakers, using the Gaussian mean and standard deviation of those distributions;

(i) selecting the prior eigenvoice (Prior EigenVoice) coefficient distribution whose distance calculated in step (h) is smallest;

(j) correcting the eigenvoice coefficient values by the MAP (maximum a posteriori) method, using the mean and standard deviation of the Gaussian distribution function of the selected prior eigenvoice coefficient distribution; and

(k) generating an acoustic model by applying the eigenvoice coefficient values corrected in step (j) to Equation 1.

The recursive speaker adaptation method using eigenvoice speaker adaptation of claim 6, wherein Equation 1 of step (k) is

$$\hat{\mu} = \bar{\mu} + \sum_{i=1}^{P} C_i W_i$$

(Deleted)