KR100655490B1

KR100655490B1 - Speech recognition service method using adaptation anti-phone model

Info

Publication number: KR100655490B1
Application number: KR1020040106607A
Authority: KR
Inventors: 전형배; 김상훈
Original assignee: 한국전자통신연구원
Priority date: 2004-12-15
Filing date: 2004-12-15
Publication date: 2006-12-08
Also published as: KR20060067714A

Abstract

The present invention relates to a speech recognition service method using an adaptive anti-phone model. The method comprises: step 100 (S100) of training an acoustic model and an anti-phone model; step 200 (S200) of collecting speech signals in an environment similar to the service environment; step 300 (S300) of adapting the anti-phone model using the development speech signal set and the anti-phone model trained on the training speech signal set; step 400 (S400) of performing the actual speech recognition service using the adapted anti-phone model; step 500 (S500) of collecting the speech signals used in the actual speech recognition service; step 600 (S600) of adapting the anti-phone model using the actual-service speech signal set; and step 700 (S700) of performing the actual speech recognition service using the anti-phone model adapted to the new actual service environment. The invention improves the performance of utterance verification and the reliability of the output of the speech recognition system.

Speech recognition system, utterance verification, adaptive anti-phone model, log-likelihood ratio

Description

Speech recognition service method using an adaptive anti-phone model {SPEECH RECOGNITION SERVICE METHOD USING ADAPTATION ANTI-PHONE MODEL}

FIG. 1 is a functional block diagram showing the configuration of a speech recognition system used in the present invention;

FIG. 2 is an operation flowchart showing a speech recognition service method using an adaptive anti-phone model according to an embodiment of the present invention;

FIG. 3 is an operation flowchart showing the detailed process of step 300 (S300) in the speech recognition service method using the adaptive anti-phone model of FIG. 2;

FIG. 4 is an operation flowchart showing the detailed process of step 400 (S400) in the speech recognition service method using the adaptive anti-phone model of FIG. 2;

FIG. 5 is an operation flowchart showing the detailed process of step 600 (S600) in the speech recognition service method using the adaptive anti-phone model of FIG. 2; and

FIG. 6 is an operation flowchart showing the detailed process of step 700 (S700) in the speech recognition service method using the adaptive anti-phone model of FIG. 2.

<Explanation of reference numerals for the main parts of the drawings>

11: endpoint detection unit    12: speech feature extraction unit

13: search unit    14: utterance verification unit

The present invention relates to a speech recognition service (Speech Recognition Service) method using an adaptive anti-phone model (Adaptation Anti-Phone Model). More specifically, the anti-phone model is adapted to the actual speech recognition service environment so that the likelihood of rejection rises for vocabulary patterns that are misrecognized and for speaking tendencies that are likely to be misrecognized. This improves the performance of the utterance verification (Utterance Verification) stage and, as a result, the reliability of the output of the speech recognition system.

As is well known, a typical speech recognition system does not reach 100% recognition accuracy, so users can be inconvenienced by misrecognized results; an utterance verification stage is provided to prevent this. The utterance verification stage evaluates the confidence (Confidence) of the recognition result. A result whose confidence is below a threshold (Threshold) is rejected and the user is asked to speak again, whereas a result whose confidence is above the threshold is sent to the output of the speech recognition system.

Confidence in the utterance verification stage is generally determined by the log-likelihood ratio (Log-Likelihood Ratio, LLR). The LLR is the logarithm of the ratio between the total likelihood of the recognition result predicted in the search process and the total likelihood of the anti-phone (Anti-phone), the opposite concept of the recognized phoneme. The log-likelihood ratio for one frame (Frame) of the input speech signal (Ot) is defined by [Equation 1] below.
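One form of [Equation 1] that is consistent with the definitions in this section can be sketched as follows. The notation is assumed here and is not the patent's own: $\lambda^{\mathrm{CD}}_{s}$ stands for state $s$ of the context-dependent model of the recognition result, and $\lambda^{\mathrm{anti}}_{s}$ for the corresponding state of its anti-phone model.

```latex
\mathrm{LLR}(O_t)
  = \log \frac{p\left(O_t \mid \lambda^{\mathrm{CD}}_{s}\right)}
              {p\left(O_t \mid \lambda^{\mathrm{anti}}_{s}\right)}
  = \log p\left(O_t \mid \lambda^{\mathrm{CD}}_{s}\right)
  - \log p\left(O_t \mid \lambda^{\mathrm{anti}}_{s}\right)
```

A frame that fits the recognized model much better than the anti-phone model gives a large positive value; a frame that the anti-phone model explains better gives a negative value.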

In [Equation 1], the recognition-result term is the likelihood (Likelihood), in the corresponding state, of the context-dependent (CD: Context Dependent) acoustic model for the recognition result that the search unit predicted for the frame. The state-level likelihood uses the observation probability of that acoustic model.

The anti-phone term in [Equation 1] is the likelihood in the corresponding state of the anti-phone model for that context-dependent (CD) acoustic model (triphone). The sentence-level confidence, the log-likelihood ratio (LLR), is obtained by summing the frame-level log-likelihood ratios over the sentence and normalizing by the number of frames and the number of models (phones) in the sentence.
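The frame-level and sentence-level scores described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and the normalization shown (average over the frames of each phone, then average over phones) is only one plausible reading of "normalized by the number of frames and the number of models".

```python
import math


def frame_llr(cd_likelihood, anti_likelihood):
    """Log-likelihood ratio of one frame: log p(CD state) - log p(anti-phone state)."""
    return math.log(cd_likelihood) - math.log(anti_likelihood)


def utterance_llr(phone_segments):
    """Sentence-level confidence.

    phone_segments: one entry per recognized phone; each entry is a non-empty
    list of (cd_likelihood, anti_likelihood) pairs, one pair per frame.
    Assumed normalization: frame average per phone, then average over phones.
    """
    per_phone = []
    for frames in phone_segments:
        total = sum(frame_llr(cd, anti) for cd, anti in frames)
        per_phone.append(total / len(frames))   # normalize by frame count
    return sum(per_phone) / len(per_phone)      # normalize by phone count


if __name__ == "__main__":
    segments = [
        [(0.5, 0.25), (0.5, 0.25)],  # phone well explained by the CD model
        [(0.2, 0.4)],                # phone better explained by the anti-phone
    ]
    print(utterance_llr(segments))
```

Averaging per phone first keeps long phones from dominating the score; a single division by the total frame count would be an equally defensible reading.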

The performance of this general utterance verification method depends on the characteristics of the anti-phone model. The anti-phone (Anti-phone) model is trained on the speech database (Training DB) used to train the acoustic models of the speech recognition system. One way to train it starts from the monophones (monophone), the context-independent (CI: Context Independent) acoustic models trained during training of the context-dependent (Context Dependent) acoustic models used for search, and builds each anti-phone model by pooling the models of all phonemes other than the phoneme itself. Another way uses vector quantization (VQ: Vector Quantization) to train the Gaussian mixture (Gaussian Mixture) of the anti-phone model from the data in the training database that do not belong to the phoneme itself.
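The first training method, pooling every monophone except the phoneme itself, might look like the sketch below. The data layout (each monophone as a list of `(weight, mean, variance)` Gaussian components) and the weight renormalization are assumptions made for illustration; the source does not say how the pooled components are weighted.

```python
def build_anti_phone_models(monophones):
    """Build one anti-phone mixture per phone from all *other* monophones.

    monophones: dict mapping phone name -> list of (weight, mean, variance)
    components (hypothetical layout). Returns a dict with the same keys.
    """
    if len(monophones) < 2:
        raise ValueError("need at least two monophones to pool from")
    anti = {}
    for phone in monophones:
        # Collect the components of every phone except the phone itself.
        pooled = [
            comp
            for other, comps in monophones.items()
            if other != phone
            for comp in comps
        ]
        # Renormalize so the pooled mixture weights sum to one (assumption).
        total = sum(weight for weight, _, _ in pooled)
        anti[phone] = [(w / total, mean, var) for w, mean, var in pooled]
    return anti


if __name__ == "__main__":
    models = {
        "a": [(1.0, 0.0, 1.0)],
        "b": [(1.0, 1.0, 1.0)],
        "c": [(0.5, 2.0, 1.0), (0.5, 3.0, 1.0)],
    }
    print(build_anti_phone_models(models)["a"])
```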

The general anti-phone (Anti-phone) model training methods described above use the training database of the speech recognizer. However, utterance verification with an anti-phone model trained only on the training database does not perform well when the surrounding environment or the channel changes in actual service.

The present invention was made to solve this problem of the prior art. Its object is to provide a speech recognition service method using an adaptive anti-phone model that improves utterance verification performance by adapting the anti-phone model, through a development database or a service database, for the actual service of the speech recognition system.

Another object is to provide a speech recognition service method using an adaptive anti-phone model whose performance improves in the actual service environment: the anti-phone model is first adapted using a development database (Development Database) that resembles the actual service environment, and after the service starts it is adapted again using the speech database collected during actual service (Real Service Database).

A further object is to provide a speech recognition service method using an adaptive anti-phone model that performs speech recognition and adapts on the speech signals that were misrecognized, thereby increasing the anti-phone likelihood for misrecognized patterns and improving the reliability of the log-likelihood ratio for speech signals that are likely to be misrecognized.

To achieve the above objects, the speech recognition service method using an adaptive anti-phone model according to the present invention comprises: step 100 of training, from a training speech signal set, an acoustic model to be used for search and an anti-phone model to be used for utterance verification;

step 200 of collecting speech signals in an environment similar to the service environment to build a development speech signal set;

step 300 of adapting the anti-phone model using the development speech signal set and the anti-phone model trained on the training speech signal set;

step 400 of performing the actual speech recognition service using the adapted anti-phone model;

step 500 of collecting the speech signals used in the actual speech recognition service to build an actual-service speech signal set;

step 600 of adapting the anti-phone model using the actual-service speech signal set; and

step 700 of performing the actual speech recognition service using the anti-phone model adapted to the new actual service environment.

Hereinafter, a speech recognition service method using an adaptive anti-phone model according to an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a functional block diagram showing the configuration of a speech recognition system used in the present invention. When the speech signal of an arbitrary speaker enters the speech recognition system, the speech endpoint detection unit (11) finds the start point of the speech and sends the signal after that start point to the speech feature extraction unit (12). The endpoint detection unit (11) then keeps looking for the end of speech, and once the end point is detected it no longer passes the subsequent signal to the speech feature extraction unit (12).
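The gating behaviour of the endpoint detection unit (11) can be illustrated with a small sketch. The source does not say how start and end points are found, so the short-time energy test, the `energy_threshold` value, and the `hangover` count of silent frames used here are all assumptions chosen only to make the forwarding logic concrete.

```python
def gate_speech_frames(frames, energy_threshold=0.01, hangover=3):
    """Return only the frames between the detected start and end points.

    frames: list of sample lists, one list per frame.
    A frame counts as speech when its mean squared amplitude reaches
    energy_threshold; `hangover` consecutive non-speech frames after the
    start point are taken as the end point (both rules are assumptions).
    """
    def energy(frame):
        return sum(x * x for x in frame) / len(frame)

    forwarded = []
    started = False
    silent_run = 0
    for frame in frames:
        is_speech = energy(frame) >= energy_threshold
        if not started:
            if is_speech:                 # start point found
                started = True
                forwarded.append(frame)
            continue
        forwarded.append(frame)
        silent_run = 0 if is_speech else silent_run + 1
        if silent_run >= hangover:        # end point found
            del forwarded[-hangover:]     # drop the trailing silence
            break                         # later signal is not forwarded
    return forwarded


if __name__ == "__main__":
    silence, speech = [0.0, 0.0], [1.0, -1.0]
    signal = [silence, speech, speech, silence, silence, silence, speech]
    print(len(gate_speech_frames(signal)))
```

Frames that arrive after the end point are ignored even if they contain speech, which mirrors the description that the unit stops forwarding once the end point is detected.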

The speech feature extraction unit (12) computes the feature vectors used for speech recognition. The search unit (13) uses the already-trained context-dependent (Context Dependent) acoustic models to find, in a search space made up of the recognition-target vocabulary, the word most similar to the input speech. The search unit (13) then passes the word with the highest probability, together with its likelihood, to the utterance verification unit (14). The utterance verification unit (14) uses the anti-phone model (Anti-Phone Model) built through training to compute the anti-phone likelihoods for the recognized word handed over by the search unit (13), and from them computes the confidence measure, the log-likelihood ratio. Finally, as its recognition result, the speech recognition system outputs the word recognized by the search unit (13) if the confidence computed by the utterance verification unit (14) is higher than the threshold, and outputs a rejection of that word if the confidence is lower than the threshold.

The speech recognition service method using an adaptive anti-phone model according to an embodiment of the present invention is now described with reference to the accompanying drawings. FIG. 2 is an operation flowchart showing this method.

First, an acoustic model to be used for search and an anti-phone model to be used for utterance verification are trained from the training speech signal set (S100). Then, speech signals are collected in an environment similar to the service environment to build a development speech signal set (S200).

Next, the anti-phone model is adapted using the development speech signal set and the anti-phone model trained on the training speech signal set (S300). The detailed process of step 300 (S300) is described below with reference to FIG. 3. First, speech recognition is performed on the development speech signal set (Database) built in step 200 (S200) (S310). The recognition result is then compared with the reference transcription list (Reference File) of the speech signals to determine whether it is a correct recognition or a misrecognition (S320). If step 320 (S320) finds a correct recognition, the process returns to step 310 (S310). If it finds a misrecognition, the phonemes that were recognized differently from the reference are located in the recognition result, and for those phonemes the observation probabilities (Observation Probability) of the input speech frames, which are the observation data, are accumulated (S330). Next, it is checked whether the accumulation has been carried out for every speech signal to be used for adaptation, that is, whether accumulation over all speech signals is complete (S340). If accumulation is not complete in step 340 (S340) (NO), the process returns to step 310 (S310). If it is complete (YES), the accumulated observation information is used to adapt the anti-phone model previously trained on the training speech signal set (Database), by the maximum a posteriori (MAP: Maximum A Posteriori) adaptation method expressed in [Equation 2] (S350).

[Equation 2] involves the following quantities: the observation probability at time t; the mean of the Gaussian mixture (Gaussian mixture) of the existing anti-phone model; the new mean adapted by the MAP method; the feature vector of the corresponding frame of the input speech signal, which is the observation data; and the parameter that determines the degree of adaptation. Using this MAP adaptation method, the means of the anti-phone model are adapted to the misrecognized observation data. The MAP-adapted anti-phone model outputs a large likelihood (Likelihood) for inputs similar to the observation data used for adaptation, which makes the confidence measure, the log-likelihood ratio (LLR), smaller. This raises the probability that the utterance verification stage rejects the vocabulary, speaking styles, and speaking environments typical of misrecognitions.
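A common form of the MAP mean update that matches the quantities listed above is sketched below. The symbols are assumptions chosen here rather than the patent's notation: $\gamma_t$ for the observation probability at time $t$, $\mu$ for the existing mean, $\hat{\mu}$ for the adapted mean, $o_t$ for the feature vector of frame $t$, and $\tau$ for the parameter that sets the degree of adaptation.

```latex
\hat{\mu} = \frac{\tau\,\mu + \sum_{t} \gamma_t\, o_t}{\tau + \sum_{t} \gamma_t}
```

With few accumulated frames the result stays close to $\mu$; as $\sum_t \gamma_t$ grows, $\hat{\mu}$ moves toward the weighted average of the misrecognized frames. A larger $\tau$ slows that movement.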

After the anti-phone model adaptation step (S300), the actual speech recognition service is performed using the adapted anti-phone model (S400). FIG. 4 is an operation flowchart showing the detailed process by which the speech recognition system of FIG. 1 performs step 400 (S400). First, when a speech signal is input, the endpoint detection unit (11) performs endpoint detection and the speech feature extraction unit (12) extracts feature vectors useful for speech recognition (S410). Next, the search unit (13) searches the search space composed of the recognition-target vocabulary for the word most similar to the extracted feature vectors (S420).

The utterance verification unit (14) receives the recognized word delivered by the search process and the likelihoods of the phonemes that make up that word, and performs utterance verification with the adapted anti-phone model (S430). In more detail, the utterance verification step (S430) proceeds as follows. The utterance verification unit (14) receives the recognized word and the likelihoods of its constituent phonemes from the search process (S431). It then computes the likelihood of the recognized word under the adapted anti-phone model (S432). Next, it computes the log-likelihood ratio between the likelihood under the context-dependent (CD) acoustic models used to search for the recognized word and the likelihood under the adapted anti-phone model (S433). It compares this log-likelihood ratio (LLR), the confidence of the utterance verification process, with the threshold, accepting the recognition result if the LLR is greater than the threshold and rejecting it if the LLR is smaller (S434). Finally, according to the acceptance or rejection, the utterance verification unit (14) either outputs the recognized word or asks the user to speak again (S435).
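The decision logic of steps S433 to S435 can be condensed into a short sketch. The function name, the log-domain inputs, and the returned action labels are hypothetical, and normalization of the score is left out for brevity. The source does not say what happens when the LLR equals the threshold; this sketch treats that case as a rejection.

```python
def verify_utterance(word, cd_log_likelihood, anti_log_likelihood, threshold):
    """Accept or reject a recognized word.

    cd_log_likelihood:   log-likelihood of the word under the CD acoustic models
    anti_log_likelihood: log-likelihood under the adapted anti-phone model
    Returns ("output", word) on acceptance or ("ask_again", None) on rejection.
    """
    llr = cd_log_likelihood - anti_log_likelihood   # S433: log-likelihood ratio
    if llr > threshold:                             # S434: compare with threshold
        return ("output", word)                     # S435: emit the recognized word
    return ("ask_again", None)                      # S435: request a new utterance


if __name__ == "__main__":
    print(verify_utterance("call home", -40.0, -46.0, threshold=2.5))
    print(verify_utterance("call home", -40.0, -41.0, threshold=2.5))
```

Because adaptation raises the anti-phone likelihood for error-prone inputs, the same threshold rejects more of them after adaptation without any change to this decision rule.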

After step 400 (S400), the speech signals used in the actual speech recognition service are collected to build an actual-service speech signal set (S500).

Then, the anti-phone model is adapted using the actual-service speech signal set (S600). Step 600 (S600) is described in more detail below with reference to FIG. 5. First, speech recognition is performed on the actual-service speech signal set that has been built (S610). The recognition result is then compared with the reference transcription list of the speech signals to determine whether it is a correct recognition or a misrecognition (S620). If step 620 (S620) finds a correct recognition, the process returns to step 610 (S610). If it finds a misrecognition, the erroneous phoneme segments are located in the recognition result and the observation information needed for adaptation is accumulated (S630). Next, it is checked whether the accumulation has been carried out for every speech signal to be used for adaptation, that is, whether accumulation over all speech signals is complete (S640). If accumulation is not complete in step 640 (S640) (NO), the process returns to step 610 (S610). If it is complete (YES), the accumulated observation information is used to adapt the anti-phone model by the MAP method of [Equation 2] (S650).
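The accumulate-then-adapt loop of steps S610 to S650 (and likewise S310 to S350) can be sketched as follows. Several things here are assumptions made for illustration: the dictionary layout of an utterance, the requirement that recognized and reference phone sequences are already aligned one to one, a single mean vector per anti-phone model instead of a full mixture, and the standard MAP mean update with parameter `tau`.

```python
from collections import defaultdict


def accumulate_error_stats(utterances):
    """Accumulate MAP statistics over misrecognized phones only (S610-S640).

    Each utterance (hypothetical layout) has aligned lists "recognized" and
    "reference" of phone names, and "frames": per phone, a list of
    (gamma, feature_vector) pairs.
    """
    stats = defaultdict(lambda: {"gamma": 0.0, "weighted_sum": None})
    for utt in utterances:
        if utt["recognized"] == utt["reference"]:      # S620: correct, skip
            continue
        for phone, ref, frames in zip(utt["recognized"], utt["reference"], utt["frames"]):
            if phone == ref:
                continue
            acc = stats[phone]                         # S630: error segment
            for gamma, vec in frames:
                if acc["weighted_sum"] is None:
                    acc["weighted_sum"] = [0.0] * len(vec)
                acc["gamma"] += gamma
                acc["weighted_sum"] = [s + gamma * x for s, x in zip(acc["weighted_sum"], vec)]
    return stats


def map_adapt_means(anti_means, stats, tau=10.0):
    """S650: new_mean = (tau * old_mean + sum(gamma * o)) / (tau + sum(gamma))."""
    adapted = dict(anti_means)
    for phone, acc in stats.items():
        if phone not in anti_means or acc["weighted_sum"] is None:
            continue
        denom = tau + acc["gamma"]
        adapted[phone] = [(tau * m + s) / denom for m, s in zip(anti_means[phone], acc["weighted_sum"])]
    return adapted


if __name__ == "__main__":
    data = [{"recognized": ["a", "b"], "reference": ["a", "c"],
             "frames": [[(1.0, [0.0, 0.0])], [(1.0, [2.0, 4.0]), (1.0, [4.0, 8.0])]]}]
    print(map_adapt_means({"a": [1.0, 1.0], "b": [0.0, 0.0]}, accumulate_error_stats(data), tau=2.0))
```

Phones that were never misrecognized keep their original means, so adaptation only shifts the anti-phone models that actually took part in errors.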

After step 600 (S600), the actual speech recognition service is performed using the anti-phone model adapted to the new actual service environment (S700). FIG. 6 is an operation flowchart showing the detailed process by which the speech recognition system of FIG. 1 performs step 700 (S700). First, when a speech signal is input, the endpoint detection unit (11) performs endpoint detection and the speech feature extraction unit (12) extracts feature vectors useful for speech recognition (S710). Next, the search unit (13) searches the search space composed of the recognition-target vocabulary for the word most similar to the extracted feature vectors (S720).

The utterance verification unit (14) receives the recognized word delivered by the search process and the likelihoods of the phonemes that make up that word, and performs utterance verification with the anti-phone model adapted to the new actual service environment (S730). In more detail, the utterance verification step (S730) proceeds as follows. The utterance verification unit (14) receives the recognized word and the likelihoods of its constituent phonemes from the search process (S731). It then computes the likelihood of the recognized word under the anti-phone model adapted to the new actual service environment (S732). Next, it computes the log-likelihood ratio between the likelihood under the context-dependent (CD) acoustic models used to search for the recognized word and the likelihood under the adapted anti-phone model (S733). It compares this log-likelihood ratio (LLR), the confidence of the utterance verification process, with the threshold, accepting the recognition result if the LLR is greater than the threshold and rejecting it if the LLR is smaller (S734). Finally, according to the acceptance or rejection, the utterance verification unit (14) either outputs the recognized word or asks the user to speak again (S735).

The method of the present invention as described above may be implemented as a program and stored on a computer-readable recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, etc.).

Although the present invention has been described in detail with reference to several embodiments, it is not limited to those embodiments and may be modified in various ways without departing from its technical idea.

As described above, in the speech recognition service method using an adaptive anti-phone model according to the present invention, the anti-phone model is adapted to the actual speech recognition service environment so that the likelihood of rejection rises for misrecognized vocabulary patterns and for speaking tendencies that are likely to be misrecognized. This improves the performance of the utterance verification stage and thereby the reliability of the output of the speech recognition system.

Claims

음성인식 시스템을 이용한 음성인식 서비스 방법에 있어서, In the voice recognition service method using a voice recognition system,

훈련용 음성신호 집합으로부터 탐색에 사용할 음향 모델을 훈련하고, 발화검증에 사용할 반음소 모델을 훈련하는 제 100 단계; Training a sound model to be used for searching from the training voice signal set, and training a semi-phoneme model to be used for speech verification;

구축된 개발용 음성 신호 집합에 대해 음성인식을 수행하는 제 310 단계, 상기 음성인식 결과를 음성 신호 정답 목록과 비교하여 정인식인지, 아니면 오인식인지의 여부를 판단하는 제 320 단계, 상기 제 320 단계에서 정인식이면 다시 상기 제 310 단계로 진행하는 한편, 오인식이면 그 음성인식 결과 중 오류 음소 구간을 찾아 적응에 필요한 관측 정보를 누적하는 제 330 단계, 누적 과정이 적응에 사용할 모든 음성 신호에 대해서 수행되었는지 확인하여, 모든 음성신호에 대한 누적이 완료되었는지의 여부를 판단하는 제 340 단계 및 상기 제 340 단계에서 누적이 완료되지 않으면 다시 상기 제 310 단계로 진행하는 한편, 완료되면 누적된 관측 정보를 사용하여 맵(MAP) 방법에 의한 반음소 모델 적응을 수행하는 제 350 단계로 이루어지고, 상기 개발용 음성 신호 집합과 상기 훈련용 음성신호 집합으로 훈련된 반음소 모델을 사용하여 반음소 모델을 적응시키는 제 300 단계; In step 310, the voice recognition is performed on the constructed speech signal set. In step 320, the voice recognition result is compared with the voice signal answer list to determine whether the recognition is correct or incorrect. If the recognition is correct, the process proceeds to step 310 again. If the error is recognized, the error phoneme section is found in the speech recognition result to accumulate observation information necessary for adaptation, and the accumulation process is performed for all speech signals used for adaptation. In step 340, which determines whether accumulation of all voice signals is completed or not, if the accumulation is not completed in step 340, the process proceeds to step 310 again. In step 350, a half-phoneme model adaptation by the MAP method is performed. A step 300 for adapting the semitone phone model using a semitone phone model trained as a set and the training voice signal set;

구축된 실제 서비스 음성 신호 집합에 대해 음성인식을 수행하는 제 610 단계, 상기 음성인식 결과를 음성 신호 정답 목록과 비교하여 정인식인지, 아니면 오인식인지의 여부를 판단하는 제 620 단계, 상기 제 620 단계에서 정인식이면 다시 상기 제 610 단계로 진행하는 한편, 오인식이면 그 음성인식 결과 중 오류 음소 구간을 찾아 적응에 필요한 관측 정보를 누적하는 제 630 단계, 누적 과정이 적응에 사용할 모든 음성 신호에 대해서 수행되었는지 확인하여, 모든 음성신호에 대한 누적이 완료되었는지의 여부를 판단하는 제 640 단계 및 상기 제 640 단계에서 누적이 완료되지 않으면 다시 상기 제 610 단계로 진행하는 한편, 완료되면 누적된 관측 정보를 사용하여 맵(MAP) 방법에 의한 반음소 모델 적응을 수행하는 제 650 단계로 이루어지고, 상기 실제 서비스 음성신호 집합을 사용하여 반음소 모델을 적응시키는 제 600 단계; 및 In step 610, the voice recognition is performed on the constructed real service voice signal set. In step 620, the voice recognition result is compared with a voice signal answer list to determine whether the recognition is correct or incorrect. If the recognition is correct, the process proceeds to the step 610 again. If the error is recognized, the error phoneme section is found in the speech recognition result to accumulate observation information necessary for the adaptation, and the accumulation process is performed for all the speech signals used for the adaptation. In step 640 of determining whether or not accumulation of all voice signals is completed or not, if the accumulation is not completed in step 640, the process proceeds to the step 610 again. A step 650 of performing a phoneme model adaptation by the MAP method; Step 600 for adapting the semi-phoneme model using the speech signal set; And

a step 700 of performing the actual speech recognition service using the anti-phone model adapted to the new actual service environment, these steps together constituting the speech recognition service method using an adaptive anti-phone model.
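The accumulation loop of steps 310 to 340 (and likewise 610 to 640) can be sketched as follows. This is a minimal illustration, not the patented implementation: the recognizer is replaced by precomputed hypotheses, the reference and hypothesis phoneme sequences are assumed to be aligned one to one, features are scalars, and the statistics of a misrecognized segment are assumed to be credited to the anti-phone model of the hypothesized phoneme. The function name `accumulate_error_statistics` and the dictionary layout are hypothetical.

```python
from collections import defaultdict


def accumulate_error_statistics(utterances):
    """Gather adaptation statistics from misrecognized segments only.

    Each utterance is a dict with (all hypothetical field names):
      "reference":  correct phoneme labels (the answer list entry)
      "hypothesis": recognized phoneme labels, aligned 1:1 with the reference
      "frames":     one list of scalar feature values per phoneme segment
    """
    stats = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for utt in utterances:
        # Step 320: correct recognitions contribute nothing; move on.
        if utt["hypothesis"] == utt["reference"]:
            continue
        # Step 330: find the erroneous phoneme segments and accumulate.
        for ref, hyp, frames in zip(
            utt["reference"], utt["hypothesis"], utt["frames"]
        ):
            if ref != hyp:
                # Assumption: the segment feeds the anti-phone model of the
                # phoneme that was wrongly hypothesized.
                stats[hyp]["count"] += len(frames)
                stats[hyp]["sum"] += sum(frames)
    # Step 340: the loop ends once every utterance has been visited.
    return dict(stats)


if __name__ == "__main__":
    data = [
        {"reference": ["a", "b"], "hypothesis": ["a", "b"],
         "frames": [[1.0], [2.0]]},
        {"reference": ["a", "b"], "hypothesis": ["a", "c"],
         "frames": [[1.0, 1.5], [2.0, 4.0]]},
    ]
    print(accumulate_error_statistics(data))
```

The returned counts and sums are the kind of observation information that a subsequent MAP update (step 350 or 650) would consume.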

(Deleted)

The method of claim 1,

wherein the MAP method of step 350 adapts the anti-phone model using [Equation 2] below.

[Equation 2] (equation image not reproduced in this text)

where the symbols of Equation 2 denote, in the order listed in the claim: the observation probability value at time t; the mean of the Gaussian distribution of the existing anti-phone model; the new mean adapted by the MAP method; and the parameter that determines the degree of adaptation.
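As an illustration of how the four quantities named above can interact, the sketch below uses the textbook MAP update for a Gaussian mean. This form is an assumption: Equation 2 itself is not reproduced in this text, so the patent's exact expression may differ. The names `map_adapt_mean`, `prior_mean`, `occupancies`, and `tau` are hypothetical, and features are scalars for brevity.

```python
def map_adapt_mean(prior_mean, observations, occupancies, tau):
    """Textbook MAP update of a Gaussian mean (assumed form, not Equation 2).

    new_mean = (tau * prior_mean + sum_t g_t * o_t) / (tau + sum_t g_t)

    prior_mean:   mean of the existing anti-phone model Gaussian
    observations: scalar feature values o_t from the erroneous segments
    occupancies:  observation probability g_t for each time t
    tau:          parameter controlling the degree of adaptation
    """
    if tau < 0:
        raise ValueError("tau must be non-negative")
    if len(observations) != len(occupancies):
        raise ValueError("observations and occupancies must align")
    weighted_sum = sum(g * o for g, o in zip(occupancies, observations))
    total_occupancy = sum(occupancies)
    denominator = tau + total_occupancy
    if denominator == 0:
        # No prior weight and no data: keep the existing mean.
        return prior_mean
    return (tau * prior_mean + weighted_sum) / denominator


if __name__ == "__main__":
    # Two observations with full occupancy, prior mean 0.0, tau = 2.0.
    print(map_adapt_mean(0.0, [2.0, 4.0], [1.0, 1.0], tau=2.0))
```

In this form a large `tau` keeps the adapted mean close to the existing model, while `tau = 0` reduces the update to the weighted average of the new observations.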

The method of claim 1,

wherein step 400 consists of: a step 410 of, when a speech signal is input, performing endpoint detection and then extracting feature vectors useful for speech recognition;

a step 420 of searching a search space composed of the recognition-target vocabulary for the word most similar to the extracted feature vectors; and

a step 430 of receiving the recognized word delivered by the search process together with the likelihood values of the phonemes constituting that word, and performing utterance verification using the adapted anti-phone model.

The method of claim 4,

wherein step 430 consists of: a step 431 of receiving, from the search process, the recognized word and the likelihood values of the phonemes constituting that word;

a step 432 of computing the likelihood of the recognized word under the adapted anti-phone model;

a step 433 of computing the log-likelihood ratio between the likelihood under the context-dependent (CD) acoustic model used in the search for the recognized word and the likelihood under the adapted anti-phone model;

a step 434 of comparing the log-likelihood ratio (LLR), which serves as the confidence measure of the utterance verification process, with a threshold, accepting the recognition result if the LLR is greater than the threshold and rejecting it if the LLR is less than the threshold; and

a step 435 of outputting the recognized word or asking the user to speak again, according to whether the utterance verification accepted or rejected the result.
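The accept or reject decision of steps 432 to 435 can be sketched as below. This is a minimal illustration under stated assumptions: per-phoneme log-likelihoods are taken as given, the word-level LLR is assumed to be the average of the per-phoneme differences (the claim does not say how phoneme scores are combined), and an LLR exactly equal to the threshold is rejected because the claim only covers the greater and smaller cases. The name `verify_utterance` is hypothetical.

```python
def verify_utterance(cd_log_likelihoods, anti_log_likelihoods, threshold):
    """Return (llr, accepted) for one recognized word.

    cd_log_likelihoods:   log-likelihood of each phoneme under the
                          context-dependent (CD) acoustic model
    anti_log_likelihoods: log-likelihood of the same phoneme segments under
                          the adapted anti-phone model
    threshold:            decision threshold on the LLR
    """
    if not cd_log_likelihoods:
        raise ValueError("at least one phoneme score is required")
    if len(cd_log_likelihoods) != len(anti_log_likelihoods):
        raise ValueError("score lists must have the same length")
    # In the log domain the likelihood ratio becomes a difference.
    per_phoneme = [
        cd - anti
        for cd, anti in zip(cd_log_likelihoods, anti_log_likelihoods)
    ]
    # Assumption: average over phonemes to get a word-level confidence.
    llr = sum(per_phoneme) / len(per_phoneme)
    return llr, llr > threshold


if __name__ == "__main__":
    llr, accepted = verify_utterance([-10.0, -12.0], [-14.0, -13.0], 2.0)
    # Step 435: output the word on acceptance, otherwise ask to speak again.
    print("output recognized word" if accepted else "ask user to repeat", llr)
```

Averaging keeps the score comparable across words of different lengths, which is one common reason to normalize before thresholding; a summed LLR would be an equally plausible reading of the claim.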

(Deleted)

The method of claim 1,

wherein the MAP method of step 650 adapts the anti-phone model using [Equation 2] below.

[Equation 2] (equation image not reproduced in this text)

where the symbols of Equation 2 denote, in the order listed in the claim: the observation probability value at time t; the new mean adapted by the MAP method; and the parameter that determines the degree of adaptation.

The method of claim 1,

wherein step 700 consists of: a step 710 of, when a speech signal is input, performing endpoint detection and then extracting feature vectors useful for speech recognition;

a step 720 of searching a search space composed of the recognition-target vocabulary for the word most similar to the extracted feature vectors; and

a step 730 of receiving the recognized word delivered by the search process together with the likelihood values of the phonemes constituting that word, and performing utterance verification using the anti-phone model adapted to the new actual service environment.

The method of claim 8,

wherein step 730 consists of: a step 731 of receiving, from the search process, the recognized word and the likelihood values of the phonemes constituting that word;

a step 732 of computing the likelihood of the recognized word under the anti-phone model adapted to the new actual service environment;

a step 733 of computing the log-likelihood ratio between the likelihood under the context-dependent (CD) acoustic model used in the search for the recognized word and the likelihood under the adapted anti-phone model;

a step 734 of comparing the log-likelihood ratio (LLR), which serves as the confidence measure of the utterance verification process, with a threshold, accepting the recognition result if the LLR is greater than the threshold and rejecting it if the LLR is less than the threshold; and

a step 735 of outputting the recognized word or asking the user to speak again, according to whether the utterance verification accepted or rejected the result.