KR20210101971A

KR20210101971A - Server, method and computer program for providing voice recognition service

Info

Publication number: KR20210101971A
Application number: KR1020200016609A
Authority: KR
Inventors: 유승우; 김희경; 박성원
Original assignee: 주식회사 케이티
Priority date: 2020-02-11
Filing date: 2020-02-11
Publication date: 2021-08-19
Also published as: KR102605159B1

Abstract

The present invention relates to classification of voice signals based on a defined language-based tree rule and provision of a voice recognition service based on a classification result. A server for providing a voice recognition service based on a sophisticated phoneme separation model comprises: a dictionary building unit that builds a pronunciation dictionary defined by distinguishing a plurality of pronunciation strings for each Korean pronunciation and foreign pronunciation; a learning unit that trains a phoneme separation model based on the language-based tree rule defined for classification of similar pronunciations; an input unit for receiving a voice signal; a division unit that divides a voice signal into a preset unit size; a recognition unit for recognizing the separated voice signal based on the phoneme separation model; and a provision unit that provides a voice recognition service for a voice signal based on the recognition result.

Description

Server, method and computer program for providing voice recognition service

The present invention relates to a server, a method, and a computer program for providing a voice recognition service.

Conventional speech recognition training uses speech and uttered sentences and builds a single model by clustering on the basis of similar pronunciations (phonemes). Because a Korean pronunciation (e.g., '아') and an English pronunciation (e.g., 'a') are similar, the two become mixed. Such training therefore has the disadvantage that it cannot distinguish Korean pronunciation from English pronunciation when the two are similar.

This mixing of similarly pronounced phoneme components leads to a failure to distinguish Korean pronunciation from English pronunciation during speech recognition.

Referring to FIG. 2A, the existing tree rule judges only by the similarity of voice signals, so there is no boundary between English pronunciation sequences and Korean pronunciation sequences. In addition, because the existing speech recognition model shares pronunciation sequences without distinguishing Korean ones from English ones, the boundary between English and Korean pronunciation sequences inevitably becomes blurred.

As the level of English education among Koreans rises, more Koreans speak English the way foreigners do. Because the recognition boundary between Koreans' English pronunciation and foreigners' English pronunciation is disappearing, speech recognition using the existing speech recognition model has difficulty distinguishing and recognizing a Korean's English pronunciation that resembles a foreigner's. For example, referring to FIG. 2B, when a Korean 207 and a foreigner 209 utter "BTS 노래 love 틀어" ("play the BTS song love") and the English pronunciation of the Korean 207 is similar to that of the foreigner 209, the recognition rate of the existing speech recognition model inevitably needs improvement.

Meanwhile, in recent voice recognition services, a speech recognition model is selected according to the keypad setting of a mobile phone (e.g., Korean typing or English typing), and a language-specific voice recognition service is provided through the selected model.

A conventional artificial intelligence speaker either focuses on Korean recognition, or uses a language identification model to determine whether the spoken language is Korean or English and then provides the voice recognition service using the speech recognition model corresponding to the selected language.

Korean Registered Patent Publication No. 10-1482148 (registered January 7, 2015)

The present invention is intended to solve the above problems of the prior art by classifying phoneme components extracted from a voice signal based on a language-based tree rule defined to distinguish similar pronunciations. The present invention is also intended to provide a voice recognition service based on the classification result.

However, the technical problems to be solved by the present embodiment are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above technical problem, a server that provides a voice recognition service based on a phoneme separation model according to a first aspect of the present invention may include: a dictionary building unit that builds a pronunciation dictionary in which a plurality of pronunciation sequences are defined separately for Korean pronunciation and foreign pronunciation; a learning unit that trains the phoneme separation model based on a language-based tree rule defined to distinguish similar pronunciations; an input unit that receives a voice signal; a division unit that divides the voice signal into a preset unit size; a recognition unit that recognizes the divided voice signal based on the phoneme separation model; and a provision unit that provides the voice recognition service based on the recognition result.

A method by which a server provides a voice recognition service based on a phoneme separation model according to a second aspect of the present invention may include the steps of: building a pronunciation dictionary in which a plurality of pronunciation sequences are defined separately for Korean pronunciation and foreign pronunciation; training the phoneme separation model based on a language-based tree rule defined to distinguish similar pronunciations; receiving a voice signal; dividing the voice signal into a preset unit size; recognizing the divided voice signal based on the phoneme separation model; and providing a voice recognition service for the voice signal based on the recognition result.

A computer program stored in a medium according to a third aspect of the present invention, which includes a sequence of instructions for providing a voice recognition service based on a phoneme separation model, may include a sequence of instructions that, when executed by a computing device, builds a pronunciation dictionary in which a plurality of pronunciation sequences are defined separately for Korean pronunciation and foreign pronunciation, trains the phoneme separation model based on a language-based tree rule defined to distinguish similar pronunciations, receives a voice signal, divides the voice signal into a preset unit size, recognizes the divided voice signal based on the phoneme separation model, and provides a voice recognition service for the voice signal based on the recognition result.

The above means for solving the problem are merely exemplary and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and in the detailed description of the invention.

According to any one of the above means for solving the problem, the present invention can classify phoneme components extracted from a voice signal based on a language-based tree rule defined to distinguish similar pronunciations, and can provide a voice recognition service based on the classification result.

Accordingly, because the present invention classifies Korean pronunciation and foreign pronunciation based on the defined language-based tree rule without using a separate language classification model, it can eliminate the mixing of Korean and foreign pronunciations and thereby improve the English recognition rate while maintaining the Korean recognition rate. In addition, even when voice data containing both Korean and English is input, the present invention enables speech recognition that switches between Korean and English through the defined language-based tree rule.

FIG. 1 is a block diagram of a voice recognition service providing server according to an embodiment of the present invention.
FIGS. 2A and 2B are diagrams for explaining a conventional speech recognition method.
FIGS. 3A to 3C are diagrams for explaining a method of training a phoneme separation model according to an embodiment of the present invention.
FIGS. 4A and 4B are diagrams for comparing a conventional method of recognizing a voice signal with the method of recognizing a voice signal according to the present invention.
FIG. 5 is a flowchart illustrating a method of providing a voice recognition service according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a method of generating a phoneme separation model according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the present invention clearly, and similar reference numerals denote similar parts throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case of being "directly connected" but also the case of being "electrically connected" with another element interposed between them. In addition, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

In this specification, a "unit" (부, 部) includes a unit realized by hardware, a unit realized by software, and a unit realized using both. In addition, one unit may be realized using two or more pieces of hardware, and two or more units may be realized by one piece of hardware.

In this specification, some of the operations or functions described as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, specific details for carrying out the present invention are described with reference to the accompanying configuration diagrams and process flowcharts.

FIG. 1 is a block diagram of a voice recognition service providing server 10 according to an embodiment of the present invention.

Referring to FIG. 1, the voice recognition service providing server 10 may include a learning unit 100, an input unit 110, a division unit 120, a recognition unit 130, a dictionary building unit 140, a tree rule generator 150, and a provision unit 160. Here, the tree rule generator 150 may include a language level determiner 152, a phoneme level determiner 154, and a prosody level determiner 156. However, the voice recognition service providing server 10 shown in FIG. 1 is only one implementation example of the present invention, and various modifications are possible based on the components shown in FIG. 1.

Hereinafter, FIG. 1 is described with reference also to FIGS. 3A to 4B.

The dictionary building unit 140 may separate pronunciation sequences in order to minimize interference caused by the similarity between pronunciation sequences for Korean pronunciation and those for foreign pronunciation. For example, because the Korean vowel 'ㅏ' and the English vowel 'a' are similar, the dictionary building unit 140 may separate the pronunciation sequences for Korean pronunciation and foreign pronunciation in order to minimize interference caused by this similarity.

The dictionary building unit 140 may define a plurality of pronunciation sequences for each of Korean pronunciation and foreign pronunciation, and may separate the pronunciation sequences for Korean pronunciation from those for foreign pronunciation. The dictionary building unit 140 may also set a stress or length for the pronunciation sequences of foreign pronunciation and for the pronunciation sequences of Korean pronunciation. For example, the dictionary building unit 140 may set the stress or length of the pronunciation of each consonant or vowel that makes up a pronunciation sequence of foreign pronunciation.

For example, referring to FIG. 3A, the dictionary building unit 140 may define first pronunciation sequences 301 for foreigners' English pronunciation (e.g., 26 pronunciation sequences in uppercase form) and second pronunciation sequences for Koreans' English pronunciation and Koreans' Korean pronunciation (e.g., 40 Korean pronunciation sequences in Hangul form), thereby separating the pronunciation sequences for each English pronunciation.

For example, when a Korean pronounces 'bts 노래 love 틀어' ("play the BTS song love") 305, the Korean's pronunciation sequence for that utterance may be [b t s ㄴ ㅗ ㄹ ㅐ l o v e ㅌ ㅡ ㄹ ㅇ ㅓ], and when a foreigner pronounces 'BTS 노래 LOVE 틀어' 307, the foreigner's pronunciation sequence for that utterance may be [B T S ㄴ ㅗ ㄹ ㅐ L O V E ㅌ ㅡ ㄹ ㅇ ㅓ].
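
The separation of the two symbol sets in this example can be illustrated with a minimal Python sketch. It assumes, for illustration only, that each Latin letter maps to one symbol (lowercase for a Korean speaker, uppercase for a foreign speaker) and that Hangul syllables are decomposed into jamo. The function name `pronunciation_sequence` and the `speaker` parameter are hypothetical and do not appear in the specification.

```python
# Jamo tables for decomposing a precomposed Hangul syllable (U+AC00..U+D7A3).
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def pronunciation_sequence(text, speaker):
    """Return phoneme symbols for `text`; `speaker` is 'korean' or 'foreign'.

    Assumption: letter case alone marks the speaker group for Latin letters.
    """
    symbols = []
    for ch in text:
        if "가" <= ch <= "힣":
            # Standard arithmetic decomposition: 21 vowels x 28 tails = 588.
            index = ord(ch) - ord("가")
            symbols.append(LEADS[index // 588])
            symbols.append(VOWELS[(index % 588) // 28])
            if index % 28:
                symbols.append(TAILS[index % 28])
        elif ch.isascii() and ch.isalpha():
            symbols.append(ch.upper() if speaker == "foreign" else ch.lower())
        # Spaces and other characters carry no phoneme symbol in this sketch.
    return symbols


korean = pronunciation_sequence("BTS 노래 love 틀어", "korean")
foreign = pronunciation_sequence("BTS 노래 love 틀어", "foreign")
```

Because the Latin symbols differ in case, the two sequences share only their Hangul symbols, which is the property the separated dictionary relies on.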

The dictionary building unit 140 may arrange phoneme components in triphone form based on training data that includes training text data and training voice data corresponding to the training text data, and may perform labeling on each of the arranged triphones. The dictionary building unit 140 may classify whether the training voice data pronouncing the training text data was spoken by a Korean or by a foreigner, and may perform labeling for each triphone, in which the phoneme components making up the training text data are arranged three at a time. In addition, the dictionary building unit 140 may label each triphone by mapping to it the corresponding pronunciation sequence and the start and end information (i.e., the phoneme section) of the phonemes that make up the triphone.

For example, referring to FIG. 3B, a phoneme extraction unit (not shown) may extract phoneme components from the training text data 309. From the training text data 309 consisting of 'BTS 노래 LOVE 틀어', the phoneme extraction unit (not shown) may extract the phoneme components 'B', 'T', 'S', 'ㄴ', 'ㅗ', 'ㄹ', 'ㅐ', 'L', 'O', 'V', 'E', 'ㅌ', 'ㅡ', 'ㄹ', 'ㅇ', and 'ㅓ'.

An arrangement unit (not shown) may group the extracted phoneme components three at a time and arrange them in triphone (Tri-Phone) form 313. For example, the arrangement unit (not shown) may arrange the phoneme components 'B', 'T', 'S', 'ㄴ', 'ㅗ', ..., 'ㅓ' in triphone form 313 as [<s>, B, T], [B, T, S], [T, S, ㄴ], [S, ㄴ, ㅗ], ..., [ㄹ, ㅇ, ㅓ], [ㅇ, ㅓ, <s>].
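
The arrangement described above amounts to a sliding window of width three over the phoneme list, padded with a boundary symbol at each end. The following Python sketch shows one way to write it; the function name `to_triphones` is hypothetical, and the boundary symbol `<s>` is taken from the example above.

```python
def to_triphones(phonemes, boundary="<s>"):
    """Slide a window of three over the phonemes, padding both ends."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


phonemes = ["B", "T", "S", "ㄴ", "ㅗ", "ㄹ", "ㅐ", "L", "O", "V", "E",
            "ㅌ", "ㅡ", "ㄹ", "ㅇ", "ㅓ"]
triphones = to_triphones(phonemes)
```

With this padding there is exactly one triphone per phoneme, and each phoneme appears once as the middle element.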

For each triphone, the dictionary building unit 140 may extract from the training voice data 311 the phoneme section 315 of the phonemes that make up the triphone, and may label the triphone by mapping to it the extracted phoneme section 315 and the pronunciation sequence corresponding to the triphone.

Of the phoneme components that make up each triphone, the middle phoneme component is used to classify whether the triphone corresponds to Korean pronunciation or to foreign pronunciation, and the remaining phoneme components may be used to identify the context information of the preceding and following intervals between triphones (i.e., acoustic context information).
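
As an illustration of the labeling and of the role of the middle phoneme, the sketch below attaches a language tag and a phoneme section to each triphone. It is a sketch under stated assumptions: the `TriphoneLabel` type, the rule that uppercase Latin symbols denote foreign pronunciation, the choice to store the section of the middle phoneme, and the millisecond values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TriphoneLabel:
    left: str          # preceding context
    center: str        # phoneme that decides the language class
    right: str         # following context
    language: str      # "korean" or "foreign", from the center phoneme only
    section_ms: tuple  # (start, end) of the center phoneme (assumed layout)


def label_triphones(phonemes, sections_ms, boundary="<s>"):
    """Build one label per phoneme, using its neighbours as context."""
    padded = [boundary] + list(phonemes) + [boundary]
    labels = []
    for i, center in enumerate(phonemes):
        # Assumption: uppercase Latin symbols mark foreign pronunciation.
        is_foreign = center.isascii() and center.isupper()
        labels.append(TriphoneLabel(
            left=padded[i],
            center=center,
            right=padded[i + 2],
            language="foreign" if is_foreign else "korean",
            section_ms=sections_ms[i],
        ))
    return labels


# Hypothetical alignment times for three phonemes.
labels = label_triphones(["S", "ㄴ", "ㅗ"], [(0, 90), (90, 150), (150, 260)])
```

Here the triphone around 'ㄴ' is tagged as Korean even though its left context 'S' is a foreign symbol, since only the middle phoneme decides the class.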

The dictionary building unit 140 may build a pronunciation dictionary in which a plurality of pronunciation sequences are defined separately for Korean pronunciation and foreign pronunciation.

When a pronunciation dictionary is built in which pronunciation sequences are defined separately for Korean pronunciation and foreign pronunciation, natural transitions between Korean and English phonemes become possible in speech recognition of utterances that are in English only or that mix Korean and English.

The dictionary building unit 140 may classify the phoneme components of the plurality of pronunciation sequences included in the pronunciation dictionary into Korean pronunciation and foreign pronunciation and store them in a first database; classify whether each pronunciation sequence classified as Korean or foreign pronunciation corresponds to a consonant or a vowel and store the result in a second database; and classify the stress or length of each pronunciation sequence classified as a consonant or vowel according to preset stress levels and length levels and store the result in a third database.

The dictionary building unit 140 may build the pronunciation dictionary using the plurality of pronunciation sequences classified and stored in the first, second, and third databases.

The tree rule generator 150 may generate the defined language-based tree rule using the built pronunciation dictionary. Here, the defined language-based tree rule may consist of rules that first determine and classify whether a phoneme component corresponds to Korean pronunciation or to foreign pronunciation, second determine and classify whether the phoneme component is a consonant or a vowel, and third determine and classify which of a plurality of predefined stresses or lengths the stress or length of the phoneme component falls into.

The tree rule generator 150 may classify the phoneme components arranged in triphone form based on the defined language-based tree rule, in cooperation with the language level determiner 152, the phoneme level determiner 154, and the prosody level determiner 156.

The language level determiner 152 may determine the language level of a phoneme component by classifying whether the phoneme component making up the training data corresponds to Korean pronunciation or to foreign pronunciation.

The language level determiner 152 may determine the language level of a phoneme component by classifying whether the middle one of the three phoneme components that make up a triphone corresponds to Korean pronunciation or to foreign pronunciation. For example, in [B T S] arranged in triphone form, the language level determiner 152 may classify the middle phoneme component 'T' as a phoneme component corresponding to foreign pronunciation. Alternatively, for [S ㄴ ㅗ] arranged in triphone form, the middle phoneme component is 'ㄴ', so the language level determiner 152 may classify it as a phoneme component corresponding to Korean pronunciation.

Once the language level of a phoneme component is determined, the phoneme level determiner 154 may determine the phoneme level of the phoneme component by classifying whether it is a consonant or a vowel.

For example, when [B T S] arranged in triphone form is determined to be a phoneme component corresponding to foreign pronunciation, the phoneme level determiner 154 may classify whether the middle phoneme component 'T' of [B T S] is a consonant or a vowel. Alternatively, when [S ㄴ ㅗ] arranged in triphone form is determined to be a phoneme component corresponding to Korean pronunciation, the phoneme level determiner 154 may classify whether the middle phoneme component 'ㄴ' of [S ㄴ ㅗ] is a consonant or a vowel.

Once the phoneme level of a phoneme component is determined, the prosody level determiner 156 may determine the prosody level of the phoneme component by judging its stress or length and classifying it into one of a plurality of predefined stresses or lengths.

The defined language-based tree rule can, for example, separate the '아' pronunciation of Korean pronunciation from the 'a' pronunciation of foreign pronunciation, thereby eliminating the phenomenon in which these pronunciations are grouped into the same cluster.
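
The three decision levels of the tree rule (language, then consonant or vowel, then stress or length) can be sketched as a chain of tests on the middle phoneme. This is an illustrative sketch only: the function name `classify`, the use of letter case to mark foreign pronunciation, the two length classes, and the 80 ms threshold are assumptions that the specification does not state.

```python
LATIN_VOWELS = set("AEIOU")


def classify(phoneme, duration_ms, long_threshold_ms=80):
    """Return (language, kind, prosody) for one phoneme symbol."""
    # Level 1 (language): uppercase Latin symbols denote foreign pronunciation.
    is_foreign = phoneme.isascii() and phoneme.isupper()
    language = "foreign" if is_foreign else "korean"

    # Level 2 (phoneme): Hangul vowel jamo occupy U+314F..U+3163.
    is_vowel = phoneme.upper() in LATIN_VOWELS or "\u314f" <= phoneme <= "\u3163"
    kind = "vowel" if is_vowel else "consonant"

    # Level 3 (prosody): bucket the length into hypothetical predefined classes.
    prosody = "long" if duration_ms >= long_threshold_ms else "short"
    return language, kind, prosody


korean_a = classify("ㅏ", 60)
foreign_a = classify("A", 60)
```

Since the language test comes first, 'ㅏ' and 'A' fall into different branches before any acoustic similarity is considered, so they cannot end up in the same leaf.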

The learning unit 100 may generate a phoneme separation model based on the defined language-based tree rule, and may train the phoneme separation model based on the language-based tree rule defined to distinguish similar pronunciations.

The learning unit 100 may train the phoneme separation model based on the classification results for triphone-form phoneme components classified according to the defined language-based tree rule. For example, referring to FIG. 3C, when [ㄴ, ㅗ, ㄹ] arranged in triphone form is input to the phoneme separation model, the learning unit 100 may train the model, based on the 'ㅗ' phoneme component of [ㄴ, ㅗ, ㄹ], to classify [ㄴ, ㅗ, ㄹ] as a phoneme component corresponding to Korean pronunciation, then as a phoneme component corresponding to a vowel, and then into the predefined stress or length corresponding to the 'ㅗ' phoneme component. The learning unit 100 may also train the model to perform speech recognition by deriving the training voice data corresponding to the pronunciation sequence and phoneme section mapped to [ㄴ, ㅗ, ㄹ].

Likewise, when [H, L, O] arranged in triphone form is input to the phoneme separation model, the learning unit 100 may train the model, based on the 'L' phoneme component of [H, L, O], to classify [H, L, O] as a phoneme component corresponding to foreign pronunciation, then as a phoneme component corresponding to a consonant, and then into the predefined stress or length corresponding to the 'L' phoneme component. The learning unit 100 may also train the model to perform speech recognition by deriving the training voice data corresponding to the pronunciation sequence and phoneme section mapped to [H, L, O].

Because the classification results for triphone-form phoneme components classified according to the defined language-based tree rule form a boundary between Korean pronunciation and foreign pronunciation, the phoneme separation model can be trained precisely. In addition, because the trained phoneme separation model separates similar pronunciations between Korean pronunciation and foreign pronunciation, Korean pronunciation and foreign pronunciation can be distinguished.

The input unit 110 may receive a voice signal from a user. For example, the input unit 110 may receive, from a foreigner (or a Korean), a voice signal that includes voice data corresponding to foreign English pronunciation (or voice data corresponding to Korean-style English pronunciation).

The division unit 120 may divide the voice signal into units of a preset size. For example, the division unit 120 may divide the voice signal into units of the same size as the voice length of the triphones learned by the phoneme separation model. For example, when the triphone size is 10 ms, the division unit 120 may divide the voice signal into 10 ms units.
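As a non-limiting illustration, the division into 10 ms units described above might be sketched as follows. The function name `split_into_units`, the 16 kHz sampling rate, the use of non-overlapping units, and the handling of a shorter final unit are all assumptions made for this sketch and are not specified in the description.

```python
from typing import List, Sequence


def split_into_units(samples: Sequence[float], sample_rate_hz: int,
                     unit_ms: int = 10) -> List[Sequence[float]]:
    """Split a signal into consecutive, non-overlapping units of unit_ms each."""
    unit_len = sample_rate_hz * unit_ms // 1000  # number of samples per unit
    if unit_len <= 0:
        raise ValueError("unit size must cover at least one sample")
    # The final unit may be shorter when the signal is not a whole multiple.
    return [samples[i:i + unit_len] for i in range(0, len(samples), unit_len)]


signal = [0.0] * 16000                  # one second of audio at an assumed 16 kHz
units = split_into_units(signal, sample_rate_hz=16000)
print(len(units), len(units[0]))        # 100 160
```

With these assumed values, one second of audio yields 100 units of 160 samples each.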

The recognition unit 130 may recognize the divided voice signal based on the trained phoneme separation model. For example, the recognition unit 130 may input the divided voice signal to the phoneme separation model, extract the node that has a high similarity to that divided voice signal, and recognize the voice signal based on the extracted node.
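As a non-limiting illustration, extracting the node with the highest similarity might be sketched as follows. The description does not define the similarity measure or the feature representation; cosine similarity, the three-dimensional feature vectors, and the names `cosine_similarity` and `most_similar_node` are assumptions chosen only for this sketch.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar_node(unit_features: Sequence[float],
                      node_features: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Return the node name and score with the highest similarity to the unit."""
    scored = ((name, cosine_similarity(unit_features, vec))
              for name, vec in node_features.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical feature vectors for two leaf nodes.
nodes = {"[ㄴ, ㅗ, ㄹ]": [0.9, 0.1, 0.0], "[L, O, V]": [0.1, 0.9, 0.2]}
best, score = most_similar_node([0.2, 0.8, 0.1], nodes)
print(best)  # [L, O, V]
```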

Meanwhile, the conventional voice signal recognition method and the voice signal recognition method of the present invention are compared below with reference to FIGS. 2B, 3C, 4A, and 4B together.

Referring to FIG. 2B, in the voice signal recognition method using the existing tree rule, the voice recognition model is trained by classifying all voice signals solely by their similarity, without distinguishing between foreign pronunciation and Korean pronunciation. As a result, the boundary between Korean pronunciation and foreign pronunciation is inevitably blurred.

Referring to FIGS. 2B and 4A together, when the existing tree rule is applied to a voice signal containing 'BTS 노래 LOVE 틀어' ('play the BTS song LOVE'), [B, T, S] and [E, ㅌ, ㅡ] are clustered into a first node because of their similar pronunciation; [ㅌ, ㅡ, ㄹ], [ㄴ, ㅗ, ㄹ], and [L, O, V] are clustered into a third node because of their similar pronunciation; and [ㅡ, ㄹ, ㅇ], [ㅗ, ㄹ, ㅐ], and [ㅐ, L, O] are clustered into a fifth node because of their similar pronunciation.

A voice recognition model to which this existing tree rule is applied recognizes [L, O, V] as one of [ㅌ, ㅡ, ㄹ], [ㄴ, ㅗ, ㄹ], and [L, O, V], which all have the same pronunciation similarity as [L, O, V], so the voice recognition rate inevitably drops.

Referring to FIG. 3C, the present invention, through the phoneme separation model, sets the triphones classified according to the defined language-based tree rule as leaf nodes and, when voice recognition is performed through the phoneme separation model, may use the corresponding leaf node as the voice recognition result.

Referring to FIGS. 3C and 4B together, the phoneme separation model to which the defined language-based tree rule is applied extracts and recognizes only [L, O, V] of the fifth node, which has a high similarity to the divided voice signal corresponding to [L, O, V] in the voice signal containing 'BTS 노래 LOVE 틀어', so the voice recognition rate can be increased.
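As a non-limiting illustration, the contrast between the two approaches can be expressed as the number of candidate triphones left after a node is selected. The conventional node contents below are taken from the FIG. 4A example; for the language-based tree, only the fifth node containing [L, O, V] is stated in the description, so the other leaves are omitted. The helper name `candidates` is hypothetical.

```python
from typing import Dict, List, Tuple

Triphone = Tuple[str, str, str]

# Conventional tree rule: triphones are clustered by pronunciation similarity only.
conventional_nodes: Dict[int, List[Triphone]] = {
    1: [("B", "T", "S"), ("E", "ㅌ", "ㅡ")],
    3: [("ㅌ", "ㅡ", "ㄹ"), ("ㄴ", "ㅗ", "ㄹ"), ("L", "O", "V")],
    5: [("ㅡ", "ㄹ", "ㅇ"), ("ㅗ", "ㄹ", "ㅐ"), ("ㅐ", "L", "O")],
}

# Language-based tree rule: the leaf reached for this triphone holds only itself.
language_based_nodes: Dict[int, List[Triphone]] = {
    5: [("L", "O", "V")],
}


def candidates(nodes: Dict[int, List[Triphone]], triphone: Triphone) -> List[Triphone]:
    """Return every triphone sharing a node with the given triphone."""
    for members in nodes.values():
        if triphone in members:
            return members
    return []


target = ("L", "O", "V")
print(len(candidates(conventional_nodes, target)))    # 3
print(len(candidates(language_based_nodes, target)))  # 1
```

With the conventional clustering, three equally similar candidates remain for [L, O, V]; with the language-based leaf, only one remains.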

The providing unit 160 may provide a voice recognition service for the voice signal based on the recognition result. In addition, the providing unit 160 may provide the voice recognition service while distinguishing Korean pronunciation from foreign pronunciation through the phoneme separation model.

Meanwhile, those skilled in the art will readily understand that the learning unit 100, the input unit 110, the division unit 120, the recognition unit 130, the dictionary building unit 140, the tree rule generation unit 150, and the providing unit 160 may each be implemented separately, or that one or more of them may be implemented in an integrated form.

FIG. 5 is a flowchart illustrating a method of providing a voice recognition service according to an embodiment of the present invention.

Referring to FIG. 5, in step S501 the voice recognition service providing server 10 may train the phoneme separation model based on a language-based tree rule defined to distinguish similar pronunciations.

In step S503, the voice recognition service providing server 10 may receive a voice signal.

In step S505, the voice recognition service providing server 10 may divide the voice signal into units of a preset size.

In step S507, the voice recognition service providing server 10 may recognize the divided voice signal based on the phoneme separation model.

In step S509, the voice recognition service providing server 10 may provide a voice recognition service for the voice signal based on the recognition result.

In the above description, steps S501 to S509 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.
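As a non-limiting illustration, the order of steps S501 to S509 might be sketched as follows. The trained phoneme separation model is replaced here by a toy nearest-template recognizer, and the names `train_model`, `divide`, and `serve`, the two-sample units, and the template values are all hypothetical stand-ins used only to show how the steps connect.

```python
from typing import Callable, Dict, List, Sequence

Unit = Sequence[float]


def train_model(templates: Dict[str, Unit]) -> Callable[[Unit], str]:
    """S501 stand-in: return a recognizer that picks the closest stored template."""
    def recognize(unit: Unit) -> str:
        return min(templates,
                   key=lambda label: sum((a - b) ** 2
                                         for a, b in zip(unit, templates[label])))
    return recognize


def divide(signal: Sequence[float], unit_len: int) -> List[Unit]:
    """S505: divide the signal into units of a preset size."""
    return [signal[i:i + unit_len] for i in range(0, len(signal), unit_len)]


def serve(signal: Sequence[float], recognize: Callable[[Unit], str],
          unit_len: int) -> str:
    units = divide(signal, unit_len)             # S505
    labels = [recognize(u) for u in units]       # S507
    return " ".join(labels)                      # S509: hand the result to the service


recognize = train_model({"L-O-V": [1.0, 0.0], "ㄴ-ㅗ-ㄹ": [0.0, 1.0]})  # S501
result = serve([0.9, 0.1, 0.2, 0.8], recognize, unit_len=2)             # S503 input
print(result)  # L-O-V ㄴ-ㅗ-ㄹ
```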

FIG. 6 is a flowchart illustrating a method of generating a phoneme separation model according to an embodiment of the present invention.

Referring to FIG. 6, in step S601 the voice recognition service providing server 10 may extract phoneme components from training data.

In step S603, the voice recognition service providing server 10 may determine a language level for a phoneme component by classifying whether the phoneme component corresponds to Korean pronunciation or to foreign pronunciation.

In step S605, once the language level is determined, the voice recognition service providing server 10 may determine a phoneme level for the phoneme component by classifying whether the phoneme component corresponds to a consonant or to a vowel.

In step S607, once the phoneme level is determined, the voice recognition service providing server 10 may determine a prosodeme level for the phoneme component by evaluating the stress or length of the phoneme component and classifying it into one of a plurality of predefined stresses or lengths.

In step S609, the voice recognition service providing server 10 may train the phoneme separation model based on the classification results for the triphone-form phoneme components (the language level, phoneme level, and prosodeme level of each phoneme component).

In the above description, steps S601 to S609 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.
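As a non-limiting illustration, the three classification levels of steps S603 to S607 might be sketched as follows. In the description the language level is derived from the pronunciation dictionary; here it is approximated by the script of the symbol (Hangul compatibility jamo versus Latin letters), which mirrors the notation of the examples but is an assumption. The two-class length decision, the 80 ms threshold, and the names `PhonemeLabel` and `classify` are likewise hypothetical.

```python
from dataclasses import dataclass

LATIN_VOWELS = set("AEIOU")


@dataclass(frozen=True)
class PhonemeLabel:
    language: str   # language level: "korean" or "foreign"
    kind: str       # phoneme level: "consonant" or "vowel"
    prosodeme: str  # prosodeme level: "short" or "long" (assumed classes)


def is_hangul_jamo(symbol: str) -> bool:
    """True for Hangul compatibility jamo (U+3131 to U+3163)."""
    return "\u3131" <= symbol <= "\u3163"


def is_hangul_vowel(symbol: str) -> bool:
    """True for Hangul compatibility vowel jamo (U+314F to U+3163)."""
    return "\u314f" <= symbol <= "\u3163"


def classify(symbol: str, duration_ms: float,
             long_threshold_ms: float = 80.0) -> PhonemeLabel:
    # S603: language level (assumed here to follow the script of the symbol).
    language = "korean" if is_hangul_jamo(symbol) else "foreign"
    # S605: phoneme level (consonant or vowel).
    if language == "korean":
        kind = "vowel" if is_hangul_vowel(symbol) else "consonant"
    else:
        kind = "vowel" if symbol.upper() in LATIN_VOWELS else "consonant"
    # S607: prosodeme level, reduced to a single assumed length threshold.
    prosodeme = "long" if duration_ms >= long_threshold_ms else "short"
    return PhonemeLabel(language, kind, prosodeme)


print(classify("ㅗ", 120.0))  # Korean vowel, as in the [ㄴ, ㅗ, ㄹ] example
print(classify("L", 40.0))    # foreign consonant, as in the [H, L, O] example
```

The resulting labels correspond to the classification results that step S609 uses as the basis for training.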

An embodiment of the present invention may also be implemented in the form of a recording medium containing computer-executable instructions, such as a program module executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and non-volatile media as well as removable and non-removable media. A computer-readable medium may also include a computer storage medium. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The foregoing description of the present invention is provided for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing the technical spirit or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is indicated by the claims that follow rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

10: voice recognition service providing server
100: learning unit
110: input unit
120: division unit
130: recognition unit
140: dictionary building unit
150: tree rule generation unit
152: language level determination unit
154: phoneme level determination unit
156: prosodeme level determination unit
160: providing unit

Claims

A server for providing a voice recognition service based on a phoneme separation model, the server comprising:
a dictionary building unit that builds a pronunciation dictionary in which a plurality of pronunciation strings are separately defined for each of Korean pronunciation and foreign pronunciation;
a learning unit that trains the phoneme separation model based on a language-based tree rule defined to distinguish similar pronunciations;
an input unit that receives a voice signal;
a division unit that divides the voice signal into units of a preset size;
a recognition unit that recognizes the divided voice signal based on the phoneme separation model; and
a providing unit that provides the voice recognition service based on a recognition result.

The server of claim 1, further comprising
a tree rule generation unit that generates the language-based tree rule using the pronunciation dictionary.

The server of claim 2,
wherein the tree rule generation unit comprises
a language level determination unit that determines a language level by classifying whether a phoneme component constituting training data corresponds to the Korean pronunciation or to the foreign pronunciation.

The server of claim 3,
wherein the tree rule generation unit further comprises
a phoneme level determination unit that determines a phoneme level by classifying whether the phoneme component corresponds to a consonant or to a vowel.

The server of claim 4,
wherein the tree rule generation unit further comprises
a prosodeme level determination unit that determines a prosodeme level for the phoneme component by evaluating the stress or length of the phoneme component and classifying it into one of a plurality of predefined stresses or lengths.

The server of claim 5,
wherein the learning unit trains the phoneme separation model based on a classification result for triphone-form phoneme components classified according to the defined language-based tree rule.

The server of claim 1,
wherein the dictionary building unit:
classifies the phoneme components of the plurality of pronunciation strings included in the pronunciation dictionary into the Korean pronunciation and the foreign pronunciation and stores them;
classifies whether each pronunciation string classified as one of the Korean pronunciation and the foreign pronunciation corresponds to a consonant or to a vowel and stores the result; and
classifies the stress or length of each pronunciation string classified as a consonant or a vowel according to preset stress levels and length levels and stores the result.

A method of providing a voice recognition service, performed by a server that provides a voice recognition service based on a phoneme separation model, the method comprising:
building a pronunciation dictionary in which a plurality of pronunciation strings are separately defined for each of Korean pronunciation and foreign pronunciation;
training the phoneme separation model based on a language-based tree rule defined to distinguish similar pronunciations;
receiving a voice signal;
dividing the voice signal into units of a preset size;
recognizing the divided voice signal based on the phoneme separation model; and
providing a voice recognition service for the voice signal based on a recognition result.

The method of claim 8,
further comprising generating the defined language-based tree rule using the pronunciation dictionary.

The method of claim 9,
wherein generating the tree rule comprises
determining a language level by classifying whether a phoneme component constituting training data corresponds to the Korean pronunciation or to the foreign pronunciation.

The method of claim 10,
wherein generating the tree rule comprises
determining a phoneme level by classifying whether the phoneme component corresponds to a consonant or to a vowel.

The method of claim 11,
wherein generating the tree rule comprises
determining a prosodeme level for the phoneme component by evaluating the stress or length of the phoneme component and classifying it into one of a plurality of predefined stresses or lengths.

The method of claim 12,
wherein the training comprises
training the phoneme separation model based on a classification result for triphone-form phoneme components classified according to the defined language-based tree rule.

The method of claim 9,
wherein building the pronunciation dictionary comprises:
classifying the phoneme components of the plurality of pronunciation strings included in the pronunciation dictionary into the Korean pronunciation and the foreign pronunciation and storing them;
classifying whether each pronunciation string classified as one of the Korean pronunciation and the foreign pronunciation corresponds to a consonant or to a vowel and storing the result; and
classifying the stress or length of each pronunciation string classified as a consonant or a vowel according to preset stress levels and length levels and storing the result.

A computer program stored in a medium and comprising a sequence of instructions for providing a voice recognition service based on a phoneme separation model,
wherein the computer program, when executed by a computing device, comprises a sequence of instructions that cause the computing device to:
build a pronunciation dictionary in which a plurality of pronunciation strings are separately defined for each of Korean pronunciation and foreign pronunciation;
train the phoneme separation model based on a language-based tree rule defined to distinguish similar pronunciations;
receive a voice signal;
divide the voice signal into units of a preset size;
recognize the divided voice signal based on the phoneme separation model; and
provide a voice recognition service for the voice signal based on a recognition result.