KR100202292B1

KR100202292B1 - Text analyzer

Info

Publication number: KR100202292B1
Application number: KR1019960065620A
Authority: KR
Inventors: 오영환; 이상호
Original assignee: 윤덕용; 한국과학기술원
Priority date: 1996-12-14
Filing date: 1996-12-14
Publication date: 1999-06-15
Also published as: KR19980047177A

Abstract

A text analyzer for a text-to-speech (TTS) system, which converts a given input document into speech, applies statistical language processing so that the prosody generation module, which determines the naturalness of the synthesized speech, receives more accurate phonetic transcriptions and syntactic structure. The analyzer comprises: preprocessing means that extracts the input document one sentence at a time and converts non-Hangul characters into Hangul using a nondeterministic finite automaton; morphological analysis means that finds all possible morphological analyses for each eojeol (space-delimited word unit) of the sentence supplied by the preprocessing means; a part-of-speech tagger that, on the basis of probability information, selects the most probable analyses and extracts the optimal morphological analysis sequence; phonetic transcription conversion means that derives the pronunciation of each eojeol from the tagger output; and parsing means that uses a probabilistic dependency grammar to find the head-dependent relations among eojeols and outputs, as the final result, the phonetic transcription of each eojeol, the morpheme part-of-speech sequence, and the dependency tree for the input document.
The part-of-speech tagger achieves an accuracy of 85.11% and the parser 78.68%; when the document being processed contains no unknown (out-of-dictionary) words, the tagger reaches 96.46%.

Description

Text analyzer for a Korean text-to-speech system

The present invention relates to a text-to-speech (TTS) system that converts a given input document into speech, and more particularly to a text analyzer for a Korean TTS system that applies statistical language processing so that the prosody generation module, which determines the naturalness of the synthesized speech, is supplied with more accurate phonetic transcriptions and syntactic structure.

In general, a TTS system converts a given input document into speech and has many applications, such as reading machines for the blind.

Such a system consists broadly of a text analyzer and prosody generation and signal synthesis modules. The text analyzer is needed because the pronunciation of a word and the prosody of a sentence are closely tied to the word's part of speech and to the sentence's syntactic structure, which must therefore be analyzed.

Existing Korean text analyzers extract only a single result at the morphological analysis stage, using techniques such as longest matching, and perform pronunciation conversion and parsing on that result.

More recently, a phonetic transcription module based on the two-level model has been developed, and much research is under way to extract more accurate information.

However, such systems analyze the part-of-speech information of the input document using parts of speech fixed in advance by the programmer. When the same syllables can carry different parts of speech, a wrong part-of-speech decision produces a wrong phonetic transcription, which lowers the reliability of text-to-speech conversion.

The present invention has been devised in view of this problem. Its object is to use statistical language processing to determine the phonetic transcription of each eojeol (space-delimited word unit), the morpheme part-of-speech sequence, and the dependency tree, thereby supplying the prosody generation module, which determines the naturalness of the synthesized speech, with more accurate phonetic transcriptions and syntactic structure and enabling reliable text-to-speech conversion.

To achieve this object, the present invention comprises: preprocessing means that extracts the document input for speech conversion one sentence at a time and converts non-Hangul characters into Hangul using a nondeterministic finite automaton;

morphological analysis means that finds all possible morphological analyses for each eojeol of the sentence supplied by the preprocessing means;

a part-of-speech tagger that, on the basis of probability information, selects the most probable analyses of the morphologically analyzed input sentence and extracts the optimal morphological analysis sequence;

phonetic transcription conversion means that derives the phonetic transcription of each eojeol from the morphological analysis sequence supplied by the part-of-speech tagger; and

parsing means that uses a probabilistic dependency grammar to find the head-dependent relations among the eojeols and outputs, as the final result, the phonetic transcription of each eojeol, the morpheme part-of-speech sequence, and the dependency tree for the input document.

FIG. 1 is a block diagram of the text analyzer for a Korean text-to-speech system according to the present invention.

FIG. 2 is a state diagram of the automaton used for preprocessing in the present invention.

FIG. 3 is a block diagram of the morphological analyzer of FIG. 1.

FIG. 4 shows the morpheme lattice for the eojeol '나는'.

FIG. 5 shows the lattice construction algorithm of the present invention.

FIG. 6 shows the morphological analysis lattice for '신을 신고 신고하기'.

FIG. 7 shows the dependency tree for '신을 신고 신고한다.'.

FIG. 8 shows the phrase-structure tree for '신을 신고 신고한다.'.

A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

As shown in FIG. 1, in the text analyzer for a Korean text-to-speech system according to the present invention, the preprocessing module (10) uses a nondeterministic finite automaton to convert the non-Hangul characters of the input document into Hangul and passes the text to the morphological analyzer (20) one sentence at a time.

The morphological analyzer (20) obtains all possible morphological analyses. If analysis fails, it concludes that the eojeol under analysis contains an unknown word, estimates the unknown word, and then uses linguistic heuristics to reduce the number of unknown-word candidates.

The part-of-speech tagger (30) extracts the optimal morphological analysis sequence from the analyzed input sentence on the basis of probability information obtained from a corpus.

The phonetic transcription conversion module (40) derives the phonetic transcription of each eojeol from the morphological analysis sequence supplied by the part-of-speech tagger (30).

The parser (50) uses a probabilistic dependency grammar to find the head-dependent relations among the eojeols and outputs, as the final result, the phonetic transcription of each eojeol, the morpheme part-of-speech sequence, and the dependency tree for the input document.

To resolve ambiguity, the part-of-speech tagger (30) and the parser (50) are provided with probability information for part-of-speech tagging (70) and probability information for parsing (80), respectively.

The speech conversion of an input Korean document by the invention configured as above proceeds as follows.

When a document to be converted is input to the preprocessing module (10), a nondeterministic finite automaton with 48 states in total is used. Twelve of these are final states, each implemented as a separate module for converting one kind of non-Hangul text into Hangul. The preprocessing module (10) extracts sentences from the input document one at a time and, at the same time, converts every non-Hangul character in the sentence into Hangul.

Non-Hangul characters are classified as English words, English abbreviations, numbers, telephone numbers, years, times, and so on, and each is rendered in Hangul taking into account the Hangul characters that follow it.

For example, when the text 'Mr. Lee의 전화번호' is input, 'Mr.' leads to state 4 of the preprocessing automaton shown in FIG. 2. If 'Mr.' is registered in the dictionary it is rendered as '미스터'; otherwise it is spelled out as '엠 알'.

Here 'Mr.' is a registered English abbreviation, so it is rendered as '미스터'. The following 'Lee' is rendered in state 5 as 'Lee' or '리', and the character '의' that has just been read is fed back into the input so that the Hangul is processed in state 2.

This process yields the final result '미스터 리의 전화번호는'.
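The dictionary lookup with a spell-out fallback described above can be sketched as follows. This is only an illustration under stated assumptions: a regular expression stands in for the 48-state automaton, the two small dictionaries and the function names `to_hangul` and `spell_out` are hypothetical, and only Latin-letter tokens are handled (numbers, telephone numbers, years, and times are passed through unchanged).

```python
import re

# Hypothetical dictionaries; the patent does not list their contents.
ABBREVIATIONS = {"Mr.": "미스터"}
ENGLISH_WORDS = {"Lee": "리"}

# Korean names of the Latin letters, used to spell out unregistered tokens.
LETTER_NAMES = {
    "a": "에이", "b": "비", "c": "씨", "d": "디", "e": "이", "f": "에프",
    "g": "지", "h": "에이치", "i": "아이", "j": "제이", "k": "케이",
    "l": "엘", "m": "엠", "n": "엔", "o": "오", "p": "피", "q": "큐",
    "r": "알", "s": "에스", "t": "티", "u": "유", "v": "브이",
    "w": "더블유", "x": "엑스", "y": "와이", "z": "제트",
}

# A token is either a run of Latin letters (optionally ending in a period)
# or a run of anything else (Hangul, spaces, digits, punctuation).
TOKEN = re.compile(r"[A-Za-z]+\.?|[^A-Za-z]+")


def spell_out(token):
    """Read a Latin-letter token letter by letter, e.g. 'Mr.' -> '엠 알'."""
    return " ".join(LETTER_NAMES[c] for c in token.lower() if c in LETTER_NAMES)


def to_hangul(text):
    """Replace Latin-letter tokens with Hangul, leaving other text as is."""
    out = []
    for tok in TOKEN.findall(text):
        if tok in ABBREVIATIONS:            # registered abbreviation
            out.append(ABBREVIATIONS[tok])
        elif tok in ENGLISH_WORDS:          # registered English word
            out.append(ENGLISH_WORDS[tok])
        elif tok[0].isascii() and tok[0].isalpha():
            out.append(spell_out(tok))      # unregistered: spell it out
        else:
            out.append(tok)                 # Hangul and everything else
    return "".join(out)


print(to_hangul("Mr. Lee의 전화번호는"))
```

With the toy dictionaries above, the call renders 'Mr.' and 'Lee' through the dictionaries and leaves the Hangul remainder untouched; removing 'Mr.' from `ABBREVIATIONS` would make it fall back to the letter-by-letter reading.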

The morphological analyzer (20) then scans each eojeol of the preprocessed sentence from right to left, finding morphemes and building a morpheme lattice. As shown in FIG. 3, for an input eojeol (21) the position estimation module (22) computes in advance the positions where irregular conjugation and contraction may occur, and morphological analysis is then carried out by the algorithm of the lattice construction module (23).

The result of morphological analysis is expressed as a morpheme lattice. For example, the final analysis of '나는' is shown in FIG. 4, where every possible path from 'INI' to 'FIN' represents a distinct analysis.

To represent a morpheme lattice such as that of FIG. 4, the set L shown in [Table 1] is defined.

An element (k, w, t, I) of L represents one node of the morpheme lattice together with the edges attached to its right side; the set L corresponding to FIG. 4 is given in [Table 2].

The lattice construction algorithm of the lattice construction module (23) first converts the eojeol into a sequence of graphemes (jamo) and then, for the convenience of the algorithm, inserts numbers between the graphemes, starting at 0 and increasing by a specified step.

For example, with a step of 6, '나는' becomes '0 ㄴ 6 ㅏ 12 ㄴ 18 ㅡ 24 ㄴ 30'.
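The grapheme indexing in this example can be reproduced with standard Unicode arithmetic for precomposed Hangul syllables. The sketch below is an illustration only; the function names `to_jamo` and `index_jamo` are hypothetical and do not come from the patent.

```python
# Compatibility jamo in the order used by the Unicode Hangul syllable block.
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initials
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 medials
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals


def to_jamo(eojeol):
    """Split each Hangul syllable into initial, medial, and optional final."""
    jamo = []
    for ch in eojeol:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:                 # precomposed Hangul syllable
            jamo.append(CHO[code // 588])
            jamo.append(JUNG[(code % 588) // 28])
            if code % 28:
                jamo.append(JONG[code % 28])
        else:
            jamo.append(ch)                   # leave other characters alone
    return jamo


def index_jamo(eojeol, step=6):
    """Interleave the graphemes with position numbers 0, step, 2*step, ..."""
    parts = ["0"]
    for n, j in enumerate(to_jamo(eojeol), start=1):
        parts += [j, str(n * step)]
    return " ".join(parts)


print(index_jamo("나는"))  # 0 ㄴ 6 ㅏ 12 ㄴ 18 ㅡ 24 ㄴ 30
```

Leaving gaps between the position numbers presumably makes room for graphemes inserted later (for instance a restored 'ㄹ'), although the text does not state the reason for the step.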

Once the eojeol has been expressed as a grapheme sequence in this way, the set L is obtained with the algorithm shown in FIG. 5. Here $w_{i,j}$ denotes the grapheme string between the numbers i and j, and the function LookupDict finds in the dictionary the possible parts of speech of the input grapheme string $w_{i,j}$.

The function SearchL returns the set of indices of the elements of L to which $(w_{i,j}, t)$ can connect; it does so using a part-of-speech connectivity table built in advance.

The set J used in the algorithm collects the left-hand numbers of the nodes from which, in the lattice built so far, a path to the 'FIN' node exists. It is used to avoid unnecessary dictionary lookups.

Irregular forms are handled using the irregular-position information computed beforehand. For '나는', 'ㄹ' deletion may have occurred at positions 12 and 24, so when j is 12 or 24 a 'ㄹ' is appended, a new variable h is decreased from j to 0, and $w_{h,j}$ is looked up in the dictionary.

The element (4, 날, verb, 2) of the set L is therefore one that was added when j was 12.
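The right-to-left construction can be sketched as below. This is not the algorithm of FIG. 5, which is not reproduced in this text, but a simplified stand-in under several assumptions: it works on syllables rather than indexed graphemes, it omits irregular and contraction handling (so '날' is not produced), and the dictionary, the connectivity table, and the tags `np`, `jx`, and `etm` are hypothetical toy data.

```python
# Hypothetical toy dictionary: surface string -> possible tags.
DICT = {"나": ["np", "vb"], "는": ["jx", "etm"]}
# Hypothetical connectivity table: (left tag, right tag) pairs that may be
# adjacent. "FIN" marks the end of the eojeol.
CONNECT = {("np", "jx"), ("vb", "etm"), ("jx", "FIN"), ("etm", "FIN")}


def lookup_dict(surface):
    """Plays the role of LookupDict: possible tags of a surface string."""
    return DICT.get(surface, [])


def build_lattice(eojeol):
    """Return nodes (k, left, right, surface, tag, successors), built right to left."""
    n = len(eojeol)
    nodes = []
    starts = {n: [("FIN", -1)]}   # position -> (tag, node index) starting there
    reachable = {n}               # like set J: positions with a path to FIN
    for j in range(n, 0, -1):
        if j not in reachable:    # nothing to connect to: skip the lookups
            continue
        for i in range(j - 1, -1, -1):
            surface = eojeol[i:j]
            for tag in lookup_dict(surface):
                # Like SearchL: nodes to the right that this tag may precede.
                succ = [k for rtag, k in starts[j] if (tag, rtag) in CONNECT]
                if succ:
                    k = len(nodes)
                    nodes.append((k, i, j, surface, tag, succ))
                    starts.setdefault(i, []).append((tag, k))
                    reachable.add(i)
    return nodes


def analyses(nodes):
    """Enumerate every path from the left edge of the eojeol to FIN."""
    def walk(k):
        if k == -1:
            return [[]]
        _, _, _, surface, tag, succ = nodes[k]
        return [[(surface, tag)] + rest for s in succ for rest in walk(s)]
    return [path for node in nodes if node[1] == 0 for path in walk(node[0])]


for path in analyses(build_lattice("나는")):
    print(path)
```

Each node stores the indices of the nodes it can connect to on its right, mirroring the element (k, w, t, I) of L, and the `reachable` set plays the part of J by skipping positions from which 'FIN' cannot be reached.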

When the eojeol under analysis contains an unknown word, all possible unknown-word candidates are generated from the incomplete morpheme lattice to complete the lattice.

Assuming that all function words, such as particles and endings, are registered in the dictionary, only content words such as nominals and predicates can be unknown, so an unknown word always appears in the left part of the eojeol.

Guessing the unknown word therefore amounts to assigning to the grapheme string remaining on the left a part of speech that can connect to the part of speech of the node on its right. In this way a morpheme lattice with estimated unknown words is generated.

However, such a lattice admits too many possible analyses, which can cause many errors in the subsequent part-of-speech tagging step. To prevent this, the number of candidates is reduced using syllable information and clue morphemes.

The syllable-information method excludes a candidate when the last syllable of the unknown word is not used as the last syllable of words of the estimated part of speech.

For example, only six Korean predicates end in the syllable '느'. Hence, if the estimated part of speech of an unknown word is predicate and its last syllable is '느', that node can be removed from the lattice.
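A minimal sketch of the syllable filter follows. The table of permitted final syllables is hypothetical toy data, as are the candidate words; in practice such a table would presumably be collected from the dictionary or a tagged corpus.

```python
# Hypothetical table: for each tag, the final syllables that words of that
# tag are allowed to end in. Tags missing from the table are not filtered.
ALLOWED_FINAL = {"vb": {"키", "하", "가", "오"}}


def prune_by_last_syllable(candidates, allowed_final=ALLOWED_FINAL):
    """Drop unknown-word guesses whose last syllable is implausible for the tag."""
    kept = []
    for surface, tag in candidates:
        if tag in allowed_final and surface[-1] not in allowed_final[tag]:
            continue                      # e.g. a guessed predicate ending in '느'
        kept.append((surface, tag))
    return kept


guesses = [("우회시키", "vb"), ("흐느", "vb"), ("우회", "na")]
print(prune_by_last_syllable(guesses))
```

Here the guessed predicate ending in '느' is removed, while the noun guess passes through because no restriction is recorded for its tag.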

The clue-morpheme method applies when the incomplete lattice contains a morpheme that is very frequent and is considered sufficient to infer the composition of the eojeol: only the unknown-word candidate immediately preceding that morpheme is kept, and the rest are removed from the lattice.

For example, if '우회' is an unknown word in the eojeol '우회시켜라', the unknown-word candidates include '우회/action common noun', '우회시키/verb', and '우회시키/adjective'. Because '시키' is a clue morpheme, only '우회/action common noun' is kept and all the others are removed.
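The clue-morpheme filter for this example might look like the sketch below. The candidate representation (unknown surface, guessed tag, following morphemes) and the split of '시켜라' into '시키' and '어라' are assumptions made for illustration; only '시키' as a clue morpheme comes from the text.

```python
# '시키' is the clue morpheme named in the example; a real list would be longer.
CLUE_MORPHEMES = {"시키"}


def prune_by_clue(candidates):
    """Keep only guesses directly followed by a clue morpheme, if any exist."""
    with_clue = [c for c in candidates if c[2] and c[2][0] in CLUE_MORPHEMES]
    return with_clue or candidates   # no clue found: leave the list unchanged


# Each candidate: (unknown surface, guessed tag, morphemes that follow it).
candidates = [
    ("우회", "na", ["시키", "어라"]),
    ("우회시키", "vb", ["어라"]),
    ("우회시키", "vj", ["어라"]),
]
print(prune_by_clue(candidates))
```

Falling back to the full candidate list when no clue morpheme is present keeps the filter from discarding every analysis of an eojeol.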

With these methods, the number of morphological analyses per eojeol, for eojeols containing unknown words, is reduced from 22.01 to 10.83.

Once morphological analysis of the input sentence is complete, the part-of-speech tagger (30) performs probability-based tagging to find the best analysis among the candidates for each eojeol, with particular emphasis on how unknown words are handled.

Probability-based tagging is the problem of finding the optimal morphological analysis for a sentence of n eojeols, that is, for the eojeol sequence $w_{1..n} = w_1 w_2 w_3 \ldots w_n$. Writing the analysis of the i-th eojeol as a pair of a morpheme sequence $m_i$ and a part-of-speech sequence $t_i$, the part-of-speech tagging function of $w_{1..n}$ is defined as in Equation 1.

Applying Bayes' rule and a first-order Markov assumption to Equation 1 gives Equation 2.

Equation 2 amounts to finding the morphological analysis sequence that maximizes the product, over all eojeols, of the probability of a transition from the previous eojeol's part-of-speech sequence to the current one and the probability that the eojeol's part-of-speech sequence generates a given morpheme sequence.

This can be viewed as treating each morphological analysis as a node, placing edges between the analyses of adjacent eojeols, assigning $P(m_i \mid t_i)$ to each node and $P(t_i \mid t_{i-1})$ to each edge, and taking the path with the highest probability.
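The displayed forms of Equations 1 and 2 do not survive in this text. From the description above (an argmax over analyses, then a product of a transition term and a generation term per eojeol), they plausibly have the following shape. The name $T$ for the tagging function and the exact typography are assumptions.

```latex
\begin{align}
T(w_{1..n}) &= \operatorname*{argmax}_{m_{1..n},\, t_{1..n}}
               P(m_{1..n}, t_{1..n} \mid w_{1..n}) \tag{1} \\
            &\approx \operatorname*{argmax}_{m_{1..n},\, t_{1..n}}
               \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(m_i \mid t_i) \tag{2}
\end{align}
```

The second line matches the node and edge weights named in the text: $P(m_i \mid t_i)$ on each node and $P(t_i \mid t_{i-1})$ on each edge.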

For example, the eojeol sequence '신을 신고 신고하기' can be represented as in FIG. 6, and a problem of this kind is solved with the Viterbi algorithm.
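A minimal Viterbi sketch over such a lattice is given below. All probabilities are invented for illustration, the candidate analyses per eojeol are an assumed reading of FIG. 6 (which is not reproduced here), and the edge weight is simplified to a transition from the last tag of one candidate to the first tag of the next.

```python
import math

# Invented toy probabilities: (last tag of previous, first tag of current).
TRANS = {("INI", "nc"): 1.0, ("po", "vb"): 0.5, ("po", "na"): 0.1,
         ("ec", "na"): 0.3, ("na", "na"): 0.2}
FLOOR = 1e-6   # assumed floor for unseen transitions, avoids log(0)


def tags_of(candidate):
    """'신/vb+고/ec' -> ['vb', 'ec']; the start marker maps to ['INI']."""
    if candidate == "INI":
        return ["INI"]
    return [m.split("/")[1] for m in candidate.split("+")]


def edge(prev, cur):
    return TRANS.get((tags_of(prev)[-1], tags_of(cur)[0]), FLOOR)


def viterbi(columns, node=lambda cand: 1.0):
    """columns: one list of candidate analyses per eojeol. Returns the best path."""
    best = {"INI": (0.0, [])}                  # candidate -> (log score, path)
    for column in columns:
        nxt = {}
        for cand in column:
            score, path = max(
                ((s + math.log(edge(prev, cand)) + math.log(node(cand)), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0])
            nxt[cand] = (score, path + [cand])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


columns = [["신/nc+을/po"],
           ["신/vb+고/ec", "신고/na"],
           ["신고/na+하/xv+기/en"]]
print(viterbi(columns))
```

With these toy numbers the verb reading of the second eojeol wins (0.5 × 0.3 against 0.1 × 0.2); node probabilities default to 1.0 here and would be supplied through the `node` argument in a fuller model.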

Here the tags are defined as follows: nc, common noun; po, objective case particle; vb, verb; ex, auxiliary connective ending; ec, connective ending; xj, adjective-deriving suffix; en, nominalizing ending; na, action common noun; xv, verb-deriving suffix; vx, auxiliary predicate.

Using Equation 2 as it stands, however, can run into data sparseness, so it is approximated with probabilities over smaller units to reduce the number of model parameters.

First, $t_i$ in $P(t_i \mid t_{i-1})$ of Equation 2 is a sequence of morpheme parts of speech, so it can be written out as $t_{i,1} t_{i,2} \ldots t_{i,N_i}$.

Here $t_{i,j}$ is the j-th morpheme part of speech in a given morphological analysis of the i-th eojeol, and $N_i$ is the number of parts of speech used in that analysis.

For example, if $t_i$ is vb, ex, vx, en, then $t_{i,3}$ is vx and $N_i$ is 4.

Here vb is a verb, ex an auxiliary connective ending, vx an auxiliary predicate, and en a nominalizing ending.

With this notation, $P(t_i \mid t_{i-1})$ can be expanded as follows.

Equation 4 is obtained from Equation 3 under the assumption that the part-of-speech sequence of the eojeol being processed depends only on the last morpheme part of speech of the previous eojeol. It is expanded into Equation 5 by the chain rule, and applying a first-order Markov assumption gives Equation 6.

In other words, the transition probability between part-of-speech sequences is approximated by the probability of a transition from the last morpheme part of speech of the previous eojeol to the first morpheme part of speech of the current eojeol, multiplied by the part-of-speech transition probabilities within the eojeol.
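The approximation just described, one cross-eojeol transition followed by the transitions inside the eojeol, can be computed as in the sketch below. The bigram table holds invented values, and the function name `sequence_transition` is hypothetical.

```python
def sequence_transition(prev_tags, cur_tags, bigram):
    """Approximate P(t_i | t_{i-1}) from tag-bigram probabilities.

    bigram[(a, b)] is read as P(b | a), the probability that tag b follows tag a.
    """
    # Last tag of the previous eojeol -> first tag of the current eojeol.
    p = bigram[(prev_tags[-1], cur_tags[0])]
    # Transitions between adjacent tags inside the current eojeol.
    for a, b in zip(cur_tags, cur_tags[1:]):
        p *= bigram[(a, b)]
    return p


# Invented toy values for the tag sequence used in the text's example.
bigram = {("po", "vb"): 0.5, ("vb", "ex"): 0.4,
          ("ex", "vx"): 0.5, ("vx", "en"): 0.1}
print(sequence_transition(["nc", "po"], ["vb", "ex", "vx", "en"], bigram))
```

The point of the factorization is that the model needs only a table of tag-to-tag probabilities rather than one entry per pair of whole tag sequences.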

$P(m_i \mid t_i)$ in Equation 2 is the probability that a morpheme sequence is generated from a part-of-speech sequence. The point to note here is that the first morpheme $m_{i,1}$ may be an unknown word.

A new variable $k_i$, indicating whether $m_{i,1}$ is a registered word or an unknown word, is therefore introduced, and the expression is redefined as follows.

Letting $k_i$ in Equation 7 be 1 when $m_{i,1}$ is a registered word and 0 when it is an unknown word, the expression expands as follows.

In Equation 9, $P(k_i \mid t_i)$ is the probability, given a morpheme part-of-speech sequence, that its first part of speech belongs to an unknown word or to a registered word. Under an independence assumption it can be approximated as follows.

$P(m_i \mid t_i, k_i)$ in Equation 9 can in turn be expanded as follows.

Equation 15 is obtained using the assumptions of Equations 16 and 17; in the end $k_i$ affects only the first term of Equation 15.

The first term of Equation 15 is handled differently depending on the value of $k_i$: when $k_i$ is 1 the approximation of Equation 18 is used, and when it is 0 the assumption of Equation 19 is used.

Equation 19 assumes that, in Korean, the estimate of an unknown word depends on the morpheme and part of speech that follow it, which can be regarded as a kind of linguistic heuristic.

Using these two expressions, the following approximations are obtained according to the value of $k_i$.

Substituting these two expressions and Equations 6 and 9 into Equation 2 gives the following part-of-speech tagging formula.

In Equation 23, $P(k_i \mid t_{i,1})$ is the probability that an unknown word or a registered word occurs given a part of speech; these probabilities are obtained from a corpus that contains unknown words.

In particular, because these values depend on the number of entries in the dictionary currently in use, a probabilistically optimal tagging result is obtained regardless of the size of the dictionary.

The formula developed so far can be solved with the Viterbi algorithm after assigning the corresponding node probabilities to the nodes and edge probabilities to the edges of FIG. 6.

Inspection of the nodes of FIG. 6 shows that they are simply the morpheme lattices output by the morphological analyzer laid out in full, so the result can be obtained more quickly by linking the morpheme lattices of the eojeols and running the Viterbi algorithm on them directly.

Once the morphological analysis of each eojeol has been settled in this way, the phonetic transcription conversion module (40) must recognize exactly how every eojeol in the sentence is used in order to find the pronunciation of the sentence. It therefore relies on the output of the part-of-speech tagger, so that even identical eojeols are pronounced differently according to their use.

The pronunciation conversion rules applied are those of the 'Standard Pronunciation Rules' (표준 발음법) promulgated by the Ministry of Education.

Conversion first assigns, on the basis of the tagging result, a part of speech to each grapheme of the input eojeol and records which morpheme of the analysis the grapheme belongs to.

For example, the part-of-speech assignment for '신을 신고 신고하기' is shown in [Table 3].

In Standard Korean pronunciation, pronunciation rules are triggered by the medial vowels 'ㅕ', 'ㅖ', and 'ㅢ' and by final consonants. Accordingly, 3 rules for medials and 27 rules for finals were written, and the applicable rule is executed by inspecting the medial and final of the input eojeol.

For example, the rules for final 'ㄴ' cover tensification, consonant assimilation, and liaison, as shown in [Table 4], all of which are triggered depending on the part of speech of the following initial consonant.

In the example above, the first '신고' therefore undergoes tensification and is pronounced '신꼬', whereas '신고하기' undergoes no change and is pronounced '신고하기'.
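The part-of-speech-dependent tensification in this example can be sketched as follows. [Table 4] is not reproduced in this text, so the rule below is a simplified reading of one clause of the Standard Pronunciation Rules (an ending-initial ㄱ, ㄷ, ㅅ, or ㅈ becomes tense after a predicate stem ending in ㄴ or ㅁ). The function names are hypothetical, treating tags that start with 'e' as endings is an assumption based on the tag list in this document, and only precomposed Hangul syllables are handled.

```python
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
TENSE = {"ㄱ": "ㄲ", "ㄷ": "ㄸ", "ㅅ": "ㅆ", "ㅈ": "ㅉ"}
NASAL_FINALS = {4, 16}          # final-consonant indices of ㄴ and ㅁ
STEM_TAGS = {"vb", "vj"}        # verb and adjective stems


def split(syllable):
    """Return (initial, medial, final) indices of a precomposed syllable."""
    code = ord(syllable) - 0xAC00
    return code // 588, (code % 588) // 28, code % 28


def join(cho, jung, jong):
    return chr(0xAC00 + cho * 588 + jung * 28 + jong)


def pronounce(morphemes):
    """morphemes: list of (surface, tag). Applies stem-final tensification only."""
    out = []
    for idx, (surface, tag) in enumerate(morphemes):
        syllables = list(surface)
        if idx > 0:
            prev_surface, prev_tag = morphemes[idx - 1]
            prev_final = split(prev_surface[-1])[2]
            cho, jung, jong = split(syllables[0])
            if (prev_tag in STEM_TAGS and tag.startswith("e")
                    and prev_final in NASAL_FINALS and CHO[cho] in TENSE):
                tense_index = CHO.index(TENSE[CHO[cho]])
                syllables[0] = join(tense_index, jung, jong)
        out.append("".join(syllables))
    return "".join(out)


print(pronounce([("신", "vb"), ("고", "ec")]))                 # 신꼬
print(pronounce([("신고", "na"), ("하", "xv"), ("기", "en")]))  # 신고하기
```

The two calls differ only in the tags, which is the point of feeding the tagger output into the pronunciation rules: the same spelling '신고' is tensified as verb stem plus ending and left unchanged as a noun.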

When the part-of-speech information of the morphological analysis sequence extracted as described above is supplied to the phonetic transcription conversion module (40), the module handles most pronunciation changes by rule, and treats as exception words those that show tensification in compounds (e.g. 문고리 [문꼬리]), sound insertion (e.g. 솜이불 [솜니불]), or tensification in Sino-Korean words (e.g. 갈증 [갈쯩]).

Part-of-speech information thus yields the pronunciation of the input sentence, but obtaining the exact pronunciation of every eojeol requires a semantic processing step.

Accordingly, the parser (50), which determines the syntactic structure of the input sentence, finds that structure using a probabilistic dependency grammar.

Unlike a phrase-structure grammar, a dependency grammar is used to find the head-dependent relations among the words of a sentence, and it can be expressed as an ordinary phrase-structure grammar.

Parsing techniques for phrase-structure grammars can therefore be used as they are.

Because parsing requires nonterminal symbols that can stand for eojeols, the parser (50) first generates an eojeol part of speech for each eojeol from the tagging result.

The eojeol part of speech is basically formed by concatenating the part of speech of the leftmost morpheme of the input eojeol with that of the rightmost morpheme; if punctuation marks such as a comma appear on the right, their symbols are appended.

For example, the eojeol '피어 있습니다.' is analyzed as 피/vb+어/ex+있/vx+습니다/ef+./se, and its eojeol part of speech is vbefse.

In some cases two morpheme parts of speech combine into a different morpheme part of speech. For example, '공부하다.' is analyzed as 공부/na+하/xv+다/ef+./se, but because '공부하' functions semantically as a verb, the eojeol part of speech is made vbefse.

이렇게 두 개의 형태소 품사가 하나의 형태소 품사로 바뀌는 규칙들은 다음과 같다.The rules for changing two morpheme parts into a morpheme part are as follows.

1. Stative common noun + adjective-deriving suffix → adjective

(e.g., 건강/ns+하/xj+다/ef+./se → vjefse)

2. Action common noun + verb-deriving suffix → verb

(e.g., 공부/na+하/xv+다/ef+./se → vbefse)

3. Common noun + noun suffix → common noun

(e.g., 사람/nc+들/xn → nc)

4. Stative common noun + adverb-deriving suffix → adverb

(e.g., 간단/ns+히/xa → ad)

5. Verb + adverb-deriving suffix → adverb

(e.g., 소중/ns+하/xj+게/xa → ad)
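
The tag-generation procedure and the five merge rules can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function names are hypothetical, the tag labels are those appearing in the examples, the (vj, xa) entry is an assumption added so that the rule 5 example (which starts from an adjective) resolves to ad, and only the sentence-final tag se is treated as punctuation because the text gives no tag for commas.

```python
# Merge rules from the text: (left tag, right tag) -> merged tag.
MERGE_RULES = {
    ("ns", "xj"): "vj",  # rule 1: stative noun + adjective-deriving suffix
    ("na", "xv"): "vb",  # rule 2: action noun + verb-deriving suffix
    ("nc", "xn"): "nc",  # rule 3: common noun + noun suffix
    ("ns", "xa"): "ad",  # rule 4: stative noun + adverb-deriving suffix
    ("vb", "xa"): "ad",  # rule 5: verb + adverb-deriving suffix
    ("vj", "xa"): "ad",  # assumption: covers the rule 5 example 소중/ns+하/xj+게/xa
}
# Assumption: only "se" is known from the examples; other punctuation
# tags (e.g. for commas) would be added here.
PUNCT_TAGS = {"se"}


def parse_analysis(analysis):
    """Split '피/vb+어/ex' into [('피', 'vb'), ('어', 'ex')]."""
    return [tuple(m.rsplit("/", 1)) for m in analysis.split("+")]


def merge_tags(tags):
    """Apply the merge rules left to right on adjacent tags."""
    merged = []
    for tag in tags:
        if merged and (merged[-1], tag) in MERGE_RULES:
            merged[-1] = MERGE_RULES[(merged[-1], tag)]
        else:
            merged.append(tag)
    return merged


def eojeol_tag(analysis):
    """Leftmost tag + rightmost tag, followed by trailing punctuation tags."""
    tags = merge_tags([tag for _, tag in parse_analysis(analysis)])
    punct = []
    while tags and tags[-1] in PUNCT_TAGS:
        punct.insert(0, tags.pop())
    if not tags:
        return "".join(punct)
    # Assumption: a single remaining tag is used once, as in 사람/nc+들/xn -> nc.
    core = tags[0] if len(tags) == 1 else tags[0] + tags[-1]
    return core + "".join(punct)


print(eojeol_tag("피/vb+어/ex+있/vx+습니다/ef+./se"))
print(eojeol_tag("공부/na+하/xv+다/ef+./se"))
```

Applying the rules greedily from left to right is a design choice of this sketch; the text does not specify an order of application.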

Once eojeol part-of-speech tags have been obtained for all eojeols of the sentence as described above, the probabilistically optimal syntactic structure is found using a CYK table.

In general, a dependency tree can be represented by a phrase-structure tree. Korean in particular follows the governor-final principle (a governor always follows its dependents), so an even simpler form of phrase-structure tree suffices.

For example, the dependency tree of the sentence '신을 신고 신고한다.' (roughly, 'puts on shoes and reports') is shown in the attached FIG. 7, and the corresponding phrase-structure tree is shown in the attached FIG. 8.

Every rule used in the phrase-structure tree of FIG. 8 can be observed to have one of the following three forms.

1. S → A

(S: start symbol, A: eojeol part-of-speech tag)

2. A → B A

(A, B: eojeol part-of-speech tags)

3. A → a

(A: eojeol part-of-speech tag, a: eojeol)
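
To make the correspondence between a governor-final dependency tree and these three rule forms concrete, the following Python sketch lists the rules implied by a set of dependencies. The function name and the list-based representation are hypothetical; the example dependencies (신을 depends on 신고, which depends on 신고한다.) are inferred from the rules that the text attributes to FIG. 8.

```python
def rules_from_dependencies(eojeols, tags, heads):
    """Return the phrase-structure rules implied by a dependency tree.

    heads[i] is the index of the governor of eojeol i; the last eojeol
    governs the sentence and has head None.
    Each rule is a pair (left-hand side, right-hand side tuple).
    """
    rules = [("S", (tags[-1],))]                      # form 1: S -> A
    for i, head in enumerate(heads):
        if head is None:
            continue
        if head <= i:
            raise ValueError("a governor must follow its dependent")
        rules.append((tags[head], (tags[i], tags[head])))  # form 2: A -> B A
    for tag, eojeol in zip(tags, eojeols):
        rules.append((tag, (eojeol,)))                # form 3: A -> a
    return rules


eojeols = ["신을", "신고", "신고한다."]
tags = ["ncpo", "vbec", "vbefse"]
heads = [1, 2, None]
for lhs, rhs in rules_from_dependencies(eojeols, tags, heads):
    print(lhs, "->", " ".join(rhs))
```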

The probability of obtaining a parse tree such as that of FIG. 8 is expressed as the product of the probabilities of all rules used in the tree. In the implemented parser, rules of the first and third forms are all assigned a probability of 1.0.

This relies on two facts. First, the last eojeol of a sentence is always the governor of that sentence, so rules of the first form do not affect the search for the optimal parse tree. Second, when only the best part-of-speech tagging result is used, the probability of a particular eojeol being generated from its eojeol part-of-speech tag likewise does not affect that search.

Therefore, the probability of the parse tree in the attached FIG. 8 is P(vbefse → vbec vbefse) · P(vbec → ncpo vbec).
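
The search for the most probable tree over rules of the form A → B A can be sketched with a CYK-style table as below. This is a minimal sketch under stated assumptions, not the patented parser: the function name is hypothetical, the rule probabilities in the example are invented toy values, and rules of forms 1 and 3 are given probability 1.0 as described in the text.

```python
def best_parse(tags, rule_prob):
    """Find the most probable governor-final dependency tree.

    tags: eojeol part-of-speech tags of the sentence, in order.
    rule_prob: maps (A, B, A) to P(A -> B A); missing rules count as 0.
    Returns (probability, heads), where heads[i] is the governor index
    of eojeol i and the last eojeol has head None.
    """
    n = len(tags)
    # best[i][j]: probability of the best subtree spanning eojeols i..j
    # (inclusive), which is always headed by eojeol j.
    best = [[0.0] * n for _ in range(n)]
    split = [[None] * n for _ in range(n)]
    for i in range(n):
        best[i][i] = 1.0  # form 3 (A -> a) has probability 1.0
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                # Left part i..k is headed by k and depends on j.
                rule = (tags[j], tags[k], tags[j])
                p = best[i][k] * best[k + 1][j] * rule_prob.get(rule, 0.0)
                if p > best[i][j]:
                    best[i][j] = p
                    split[i][j] = k

    heads = [None] * n

    def backtrack(i, j):
        k = split[i][j]
        if k is None:
            return
        heads[k] = j
        backtrack(i, k)
        backtrack(k + 1, j)

    backtrack(0, n - 1)
    return best[0][n - 1], heads


# Toy probabilities, invented for illustration only.
toy_probs = {
    ("vbec", "ncpo", "vbec"): 0.4,
    ("vbefse", "vbec", "vbefse"): 0.3,
    ("vbefse", "ncpo", "vbefse"): 0.1,
}
prob, heads = best_parse(["ncpo", "vbec", "vbefse"], toy_probs)
print(prob, heads)
```

If no rule sequence covers the sentence, the returned probability is 0.0 and every head stays None; a real system would need smoothing or a fallback for that case.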

For the document analyzer operating as described above, the results of the part-of-speech tagger performance evaluation are as follows.

To evaluate the performance of the part-of-speech tagger (30), the model was trained on a manually tagged training corpus of 49,506 eojeols and tested on 4,729 eojeols.

Of the 49,506 eojeols, 3,652 were reserved for learning the probabilities associated with unregistered words: a dictionary was built from the remaining 45,854 eojeols, and any word in the 3,652 eojeols that was not found in that dictionary was treated as an unregistered word.
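
The held-out procedure for collecting unregistered words can be illustrated as follows. The corpus representation (a list of eojeols, each a list of (morpheme, tag) pairs), the function name, and the toy data are all assumptions made for this sketch; the text does not describe the actual data format.

```python
def find_unregistered(dictionary_part, heldout_part):
    """Return held-out (morpheme, tag) entries missing from the dictionary.

    Both arguments are lists of eojeols; each eojeol is a list of
    (morpheme, tag) pairs. The dictionary is built from dictionary_part only.
    """
    lexicon = {entry for eojeol in dictionary_part for entry in eojeol}
    return [entry for eojeol in heldout_part for entry in eojeol
            if entry not in lexicon]


# Toy data: '신' is known only as a verb, so 신/nc is unregistered.
dictionary_part = [
    [("신", "vb"), ("고", "ec")],
    [("책", "nc"), ("을", "po")],
]
heldout_part = [[("신", "nc"), ("을", "po")]]
print(find_unregistered(dictionary_part, heldout_part))

# The split sizes reported in the text add up to the full training corpus.
TRAIN_TOTAL, HELDOUT, DICTIONARY = 49_506, 3_652, 45_854
print(DICTIONARY + HELDOUT == TRAIN_TOTAL)
```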

The performance of the part-of-speech tagger is shown in [Table 5] below, where a registered eojeol is one that contains no unregistered word and an unregistered eojeol is one that contains an unregistered word.

As [Table 5] shows, the accuracy for registered eojeols is 88.03%, which is not particularly high. This is because the implemented tagger assumes that an eojeol contains an unregistered word only when no morpheme lattice could be constructed for it.

For example, in the sentence '신을 신고 신고한다', the correct tagging of '신을' is '신/common noun + 을/objective case particle'. If the dictionary lists '신' only as a verb, however, the morphological analyzer builds only the lattice '신/verb + ㄹ/adnominal transformative ending', and the part-of-speech tagger, assuming that there is no unregistered word, tags the eojeol accordingly.

Errors of this kind are why the tagging accuracy for eojeols containing no unregistered word is not very high.

When the words used in the test corpus were added to the dictionary before tagging, the accuracy was 96.46% at the eojeol level and 98.01% at the morpheme level.

These results suggest that the tagger performs better as the number of dictionary entries increases.

The results of the performance evaluation of the parser (50) are as follows.

The model was trained on manually constructed parse trees for 498 sentences and then tested on 100 sentences.

The 100 sentences consist of 738 eojeols. Because the last eojeol of each sentence is always taken as its own governor, there are 638 governor-dependent relations to be determined.

Two experiments were run: one in which the tagger output was fully corrected in order to evaluate the parser itself, and one in which the tagger output was used as is. The accuracy was 80.87% in the former and 78.68% in the latter.

This means, for example, that in a sentence of about 11 eojeols, roughly 8 of the 10 eojeols other than the last select their governor correctly.
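
The arithmetic behind these figures can be checked with a few lines of Python. The counts of correctly attached eojeols are derived here from the reported percentages and are approximations; they are not stated in the text.

```python
EOJEOLS = 738
SENTENCES = 100

# One eojeol per sentence (the last) is its own governor by definition.
relations = EOJEOLS - SENTENCES

ACC_GOLD_TAGS = 0.8087    # tagger output corrected by hand
ACC_TAGGER_TAGS = 0.7868  # tagger output used as is

# Approximate number of correct governor choices implied by each accuracy.
correct_gold = round(ACC_GOLD_TAGS * relations)
correct_tagger = round(ACC_TAGGER_TAGS * relations)

# An 11-eojeol sentence has 10 eojeols that must choose a governor.
expected_correct_in_11 = ACC_TAGGER_TAGS * (11 - 1)

print(relations, correct_gold, correct_tagger, round(expected_correct_in_11))
```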

As described above, the document analyzer according to the present invention preprocesses the input document using a nondeterministic finite automaton, obtains all possible morphological analyses of each eojeol in the morphological analysis step, and selects the most probable of them. The phonetic transcription conversion module then assigns a part of speech to every grapheme of the eojeol and applies 3 rules triggered by medial vowels and 27 rules triggered by final consonants to obtain the pronunciation of the eojeol. Finally, the parser uses a probabilistic dependency grammar to find the governor-dependent relations among the input eojeols, which are used to generate prosody, thereby supporting reliable text-to-speech conversion.

The document analyzer according to the present invention achieves an accuracy of 85.11% in the part-of-speech tagger and 78.68% in the parser; the tagger in particular reaches 96.46% when the document being processed contains no unregistered words.

In addition to text-to-speech systems, the document analyzer according to the present invention can be useful for building a pronunciation dictionary for a speech recognition system.

Claims

In a text-to-speech system, a document analyzer for a Korean text-to-speech system, comprising: preprocessing means for extracting the document input for speech conversion one sentence at a time and converting non-Hangul characters into Hangul characters using a nondeterministic finite automaton; morphological analysis means for obtaining all possible morphological analyses of each eojeol of each sentence supplied by the preprocessing means; phonetic transcription conversion means for selecting, on the basis of probability information, the most probable morphological analysis of the morphologically analyzed input sentence and obtaining the phonetic transcription of each eojeol using the resulting optimal morpheme sequence; and parsing means for obtaining the governor-dependent relations among the eojeols using a probabilistic dependency grammar and outputting, as final results for the input document, the phonetic transcription of each eojeol, the morpheme part-of-speech sequence, and the dependency tree.

The document analyzer for a Korean text-to-speech system according to claim 1, wherein the part-of-speech tagger and the parsing means are provided with probability information for part-of-speech tagging and probability information for parsing, respectively, in order to perform disambiguation.

The document analyzer for a Korean text-to-speech system according to claim 1, wherein the preprocessing means classifies non-Hangul characters as English, abbreviations, numbers, telephone numbers, years, times, and the like, and converts them into Hangul taking the Hangul characters that follow them into account.

The document analyzer for a Korean text-to-speech system according to claim 1, wherein the preprocessing means is implemented as a nondeterministic finite automaton with a total of 48 states, 12 of which are final states, each consisting of a different module for converting non-Hangul text into Hangul.

The document analyzer for a Korean text-to-speech system according to claim 1, wherein, when the morphological analysis means looks up a word read from the input document in the dictionary and the word is an exceptional-pronunciation word, it passes the phonetic transcription directly to the phonetic transcription conversion means.

The document analyzer for a Korean text-to-speech system according to claim 1, wherein, when morphological analysis of a sentence supplied by the preprocessing means fails, the morphological analysis means determines that the eojeol being analyzed contains an unregistered word, estimates the unregistered word, and then minimizes the number of candidates using linguistic heuristics.

The document analyzer for a Korean text-to-speech system according to claim 1, wherein the lattice construction algorithm set in the morphological analysis means operates as follows.

The document analyzer for a Korean text-to-speech system according to claim 1, wherein the morphological analysis means precomputes the positions of irregular conjugation and contraction phenomena in the input eojeol, performs morphological analysis according to the lattice construction algorithm, and represents the result as a morpheme lattice.

The document analyzer for a Korean text-to-speech system according to claim 1, wherein the morphological analysis means constructs the morpheme lattice by searching for morphemes from right to left in each eojeol of the input sentence.

The document analyzer for a Korean text-to-speech system according to claim 1, wherein the morphological analysis means uses syllable information to exclude a node from the lattice when the last syllable of the unregistered word is not one that occurs as the estimated syllable.

The document analyzer for a Korean text-to-speech system according to claim 1, wherein the morphological analysis means uses clue morphemes such that, when a morpheme that is very frequent and considered sufficient to infer the composition of the eojeol is found in the incomplete morpheme lattice, only the unregistered word estimated immediately before that morpheme is kept and the remaining candidates are removed from the lattice.

The document analyzer for a Korean text-to-speech system according to claim 1, wherein the part-of-speech tagging formula executed in the part-of-speech tagger is as follows.

The document analyzer for a Korean text-to-speech system according to claim 1, wherein the phonetic transcription conversion means assigns a part of speech to every grapheme of the eojeol and then obtains the phonetic transcription of the eojeol using 3 rules triggered by medial vowels and 27 rules triggered by final consonants.

The document analyzer for a Korean text-to-speech system according to claim 1, wherein the phonetic transcription conversion means defines as exceptional-pronunciation words those words that exhibit tensification or sound insertion in compounds, or tensification in Sino-Korean words.

The document analyzer for a Korean text-to-speech system according to claim 1, wherein the parsing means generates the basic eojeol part-of-speech tag by concatenating the part-of-speech tag of the leftmost morpheme of the input eojeol with that of the rightmost morpheme.

The document analyzer for a Korean text-to-speech system according to claim 1, wherein, when punctuation such as a comma is present on the right side of the phrase, the eojeol part-of-speech tag is generated by appending that symbol.