KR101799681B1

KR101799681B1 - Apparatus and method for disambiguating homograph word sense using lexical semantic network and word embedding

Info

Publication number: KR101799681B1
Application number: KR1020160074761A
Authority: KR
Inventors: 옥철영; 신준철; 이주상
Original assignee: 울산대학교 산학협력단
Priority date: 2016-06-15
Filing date: 2016-06-15
Publication date: 2017-11-20

Abstract

The present invention relates to an apparatus and a method for disambiguating homographs using a lexical semantic network and word embedding, capable of accurately disambiguating homographs even for unlearned patterns. The apparatus comprises: a data storage unit that stores a corpus containing at least one learning word phrase, and a standard dictionary and a lexical semantic network containing word meaning data; a word list generation unit that generates a list of words to be learned from the corpus; a processed data generation unit that converts the convertible word meaning data, among the word meaning data of the words to be learned in the generated word list, into a corpus, and processes the generated list of words to be learned, the converted corpus, and the word meaning data to suit word embedding learning, thereby generating processed data; a word embedding learning unit that uses the generated processed data to learn the words to be learned through word embedding learning with a learning model composed of an input/output layer and a projection layer, thereby generating word vectors; and a homograph disambiguation unit that uses the generated word vectors to compare the similarity between a homograph and its adjacent word phrases and disambiguates the homograph according to the comparison result.

Description

APPARATUS AND METHOD FOR DISAMBIGUATING HOMOGRAPH WORD SENSE USING LEXICAL SEMANTIC NETWORK AND WORD EMBEDDING

The present invention relates to an apparatus and method for disambiguating homographs using a lexical semantic network and word embedding. More specifically, it learns, through word embedding learning, from a list of words to be learned, a converted corpus, and word meaning data drawn from various resources (e.g., a corpus, a standard dictionary, and a lexical semantic network), and disambiguates a homograph by comparing the similarity between the homograph and its adjacent word phrases, so that homographs can be readily disambiguated even for unlearned patterns.

As information technology advances, technologies for automatically processing natural language are advancing with it. These technologies are applied to document classification, machine translation, portal search, and the like. Korean is an agglutinative language, which makes morphological analysis difficult, but machine learning methods trained on corpora have recently achieved high accuracy (e.g., about 98%) in morpheme restoration and part-of-speech tagging.

However, homograph disambiguation accuracy is still somewhat lower (about 96.5%), and cases in which machine translation renders the meaning of a word completely wrong are easy to find. Reducing such errors requires a focus on improving homograph disambiguation accuracy.

Methods that use context information for Korean homograph disambiguation fall broadly into corpus learning methods and lexical network learning methods. The former are statistical learning approaches and the latter are knowledge-based approaches. Corpus learning trains on a large tagged corpus, such as the Sejong corpus, to learn patterns of adjacent word phrases. This approach stores all or part of each adjacent word phrase as is: the whole word phrase, some of its morphemes, its parts of speech, or its syllables.

However, conventional corpus learning has no effect on anything it has not directly learned. Suppose the sentence "사과는 열매다." ("An apple is a fruit.", where 열매 is the word for the fruit of a plant) is to be analyzed. The approach works only if the corpus contains a case in which '사과 (apple)' is adjacent to '열매'. The Sejong corpus contains no such sentence, and even if the corpus contains similar sentences such as "사과는 식물이다." ("An apple is a plant.") or "사과는 과일이다." ("An apple is a fruit.", using the word 과일), they have no effect unless '열매' itself appears. It is possible, though, to measure at the level of individual morphemes the ratio at which '사과' is tagged as 'apple' and to use that ratio. This yields only baseline-level accuracy.
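The baseline described above can be sketched as a most-frequent-sense lookup. This is a minimal illustration only: the tiny corpus, the sense labels such as `사과_apple`, and the function names are hypothetical and are not taken from the Sejong corpus or from the patent.

```python
from collections import Counter, defaultdict

# Hypothetical sense-tagged corpus: (surface form, sense tag) pairs.
# The tags are illustrative labels, not real Sejong corpus tag IDs.
tagged_corpus = [
    ("사과", "사과_apple"),
    ("사과", "사과_apple"),
    ("사과", "사과_apology"),
    ("배", "배_pear"),
]

def train_baseline(pairs):
    """Count how often each surface form carries each sense tag."""
    counts = defaultdict(Counter)
    for surface, sense in pairs:
        counts[surface][sense] += 1
    return counts

def most_frequent_sense(counts, surface):
    """Return (sense, ratio) for the most frequent sense, or None if unseen."""
    if surface not in counts:
        return None
    sense, n = counts[surface].most_common(1)[0]
    return sense, n / sum(counts[surface].values())

counts = train_baseline(tagged_corpus)
print(most_frequent_sense(counts, "사과"))
print(most_frequent_sense(counts, "열매"))  # unseen form: no information
```

The lookup ignores context entirely, which is why it can only reach baseline accuracy and returns nothing for forms that never appeared in training.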

Because conventional corpus learning cannot handle unlearned patterns at all, recent research has turned to lexical networks such as WordNet, the Korean lexical semantic network (KorLex, Korean WordNet), and the Korean lexical map (UWordMap). These methods share a problem with corpus learning in that they have no effect when a word is not registered in the hypernym-hyponym relations or the predicate-noun relations of the lexical network. Their recall, however, can be considered substantially higher in part, because most well-known nouns are registered in the hypernym-hyponym network, and predicate-noun relations are written using only least upper bound nodes in that network. If one node is chosen at random, it can be expected to have hundreds or thousands of descendant nodes beneath it. Nevertheless, the predicate-noun relation network itself is quite sparse, so the recall problem remains. Mitigating it requires continually supplementing the lexical network at great cost.

Recently, as the usefulness of word embedding has become known, attempts have been made to apply it to various areas of natural language processing and natural language understanding. Word2Vec is a representative word embedding model. Word embedding by itself makes it possible to group similar words together or to compute the degree of semantic similarity or relatedness between words.
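As a minimal sketch of how embeddings support such similarity computations, the following uses cosine similarity over hand-written toy vectors. The vectors and the helper names `cosine_similarity` and `nearest` are assumptions made for illustration; they are not output of Word2Vec or of the apparatus described here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy three-dimensional vectors, written by hand for illustration only.
toy_vectors = {
    "apple": [0.9, 0.1, 0.0],
    "fruit": [0.8, 0.2, 0.1],
    "apology": [0.0, 0.2, 0.9],
}

def nearest(word, vectors):
    """Return the other word whose vector is most similar to `word`."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine_similarity(vectors[word], vectors[w]),
    )

print(nearest("apple", toy_vectors))
```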

Meanwhile, homograph disambiguation is essential for developing semantic processing systems. In research to date, corpus-learning-based methods have produced relatively accurate results. Conventional corpus learning disambiguates homographs by memorizing part of the content (syllables or morphemes) of adjacent word phrases as is.

However, conventional corpus learning has low accuracy on patterns it has never learned.

Korean Registered Patent Publication No. 10-1358614 (registered January 27, 2014)

Embodiments of the present invention aim to provide an apparatus and method for disambiguating homographs using a lexical semantic network and word embedding that learn, through word embedding learning, from a list of words to be learned, a converted corpus, and word meaning data drawn from various resources (e.g., a corpus, a standard dictionary, and a lexical semantic network), and that disambiguate a homograph by comparing the similarity between the homograph and its adjacent word phrases, thereby accurately disambiguating homographs even for unlearned patterns.

Embodiments of the present invention also aim to provide such an apparatus and method in which, because word embedding learning uses the semantic information of content morphemes in the corpus rather than conventional word embedding, learning is possible with little training data and can take less training time.

Embodiments of the present invention further aim to provide such an apparatus and method in which learning is based on the meaning of the word to be learned rather than simply on the positions at which words appear, so that learning is effective even for low-frequency words, and in which, because the words covered are those that appear in the dictionary, more words can be represented as vectors than with conventional word embedding.

Embodiments of the present invention further aim to provide such an apparatus and method in which, instead of conventional position-based word embedding, word vectors are generated by treating the word phrases adjacent to a content morpheme, together with the word meaning data (e.g., hypernyms) of those adjacent word phrases, as adjacent word phrases, so that the relations between words can be examined through cosine similarity.

According to a first aspect of the present invention, there may be provided an apparatus for disambiguating homographs using a lexical semantic network and word embedding, comprising: a data storage unit that stores a corpus containing at least one learning word phrase, and a standard dictionary and a lexical semantic network containing word meaning data; a word list generation unit that generates a list of words to be learned from the corpus; a processed data generation unit that converts the convertible word meaning data, among the word meaning data of the words to be learned in the generated word list, into a corpus, and that processes the generated list of words to be learned, the converted corpus, and the word meaning data to suit word embedding learning, thereby generating processed data; a word embedding learning unit that uses the generated processed data to learn the words to be learned through word embedding learning with a learning model composed of an input/output layer and a projection layer, thereby generating word vectors; and a homograph disambiguation unit that uses the generated word vectors to compare the similarity between a homograph and its adjacent word phrases and disambiguates the homograph according to the comparison result.

The data storage unit may store a corpus containing at least one learning word phrase; a standard dictionary containing usage examples and definitions as word meaning data; and a lexical semantic network containing predicate-noun relation information, hypernyms, and antonyms as word meaning data.

The processed data generation unit may convert into corpus form at least one of the corpus containing at least one learning word phrase, the usage examples and definitions contained in the standard dictionary, and the predicate-noun relation information contained in the lexical semantic network.

The processed data generation unit may treat, as adjacent word phrases of the word to be learned, the word phrases adjacent to that word in the converted corpus and the hypernyms of those adjacent word phrases, or may treat predicate-noun relation information as adjacent word phrases of the word to be learned.

The word embedding learning unit may place the processed data for the word to be learned, the adjacent word phrases, the predicate-noun relation information, and the antonyms in a single input/output layer that merges the input layer and the output layer of the skip-gram model of word embedding, and may learn the word to be learned through word embedding learning with a learning model composed of that input/output layer and a mirror layer, thereby generating word vectors.

The word embedding learning unit may perform word embedding learning through a feedforward process and a backpropagation process, and in the backpropagation process may leave unchanged the weight values connected to the processed data of the word to be learned while changing the weight values connected to the word to be learned.
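One possible reading of this one-sided update can be sketched as follows. The sketch assumes a single shared vector table (standing in for the merged input/output layer), a sigmoid output, and a log loss; these choices, the toy vectors, the labels such as `사과_apple`, and the function `train_pair` are all hypothetical and are not specified in the text above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_pair(vectors, target, feature, label, lr=0.1):
    """One gradient step on a (target, feature) pair.

    label 1 marks a related feature, label 0 a negative sample.
    Only the target word's vector changes; the feature vector is frozen.
    """
    t, f = vectors[target], vectors[feature]
    pred = sigmoid(sum(a * b for a, b in zip(t, f)))  # feedforward
    grad = pred - label  # gradient of the log loss w.r.t. the dot product
    # Backpropagation: rewrite the target word's vector only.
    vectors[target] = [a - lr * grad * b for a, b in zip(t, f)]
    return pred

# Hand-written toy vectors (assumed, for illustration only).
vectors = {
    "사과_apple": [0.1, 0.1],
    "열매": [1.0, 0.0],   # related feature (positive example)
    "용서": [0.0, 1.0],   # unrelated word used as a negative sample
}
for _ in range(50):
    train_pair(vectors, "사과_apple", "열매", 1)
    train_pair(vectors, "사과_apple", "용서", 0)
print(vectors["사과_apple"])
```

Freezing the feature side means a word's vector is shaped by its features without the features drifting in return, which is one plausible motivation for the update rule described above.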

The word embedding learning unit may use negative sampling from word embedding to train the word to be learned on incorrect answers other than the processed data.

The word embedding learning unit may, through word embedding learning, generate word vectors for the content morphemes that are adjacent to the word to be learned, excluding particles and endings.

The homograph disambiguation unit may use the generated content morpheme word vectors to compare the similarity between the content morpheme of the homograph to be disambiguated and the content morphemes of the adjacent word phrases, and thereby disambiguate the homograph.

Meanwhile, according to a second aspect of the present invention, there may be provided a method for disambiguating homographs using a lexical semantic network and word embedding, comprising: generating a list of words to be learned from a corpus containing at least one learning word phrase; converting into a corpus the convertible word meaning data among the word meaning data of the words to be learned in the generated word list; processing the generated list of words to be learned, the converted corpus, and the word meaning data to suit word embedding learning, thereby generating processed data; using the generated processed data to learn the words to be learned through word embedding learning with a learning model composed of an input/output layer and a projection layer, thereby generating word vectors; and using the generated word vectors to compare the similarity between a homograph and its adjacent word phrases and disambiguating the homograph according to the comparison result.

The converting into a corpus may convert into a corpus the convertible word meaning data among the word meaning data in the standard dictionary, which contains usage examples and definitions, and in the lexical semantic network, which contains predicate-noun relation information, hypernyms, and antonyms.

The converting into a corpus may convert into corpus form at least one of the corpus containing at least one learning word phrase, the usage examples and definitions contained in the standard dictionary, and the predicate-noun relation information contained in the lexical semantic network.

The generating of the processed data may treat, as adjacent word phrases of the word to be learned, the word phrases adjacent to that word in the converted corpus and the hypernyms of those adjacent word phrases, or may treat predicate-noun relation information as adjacent word phrases of the word to be learned.

The generating of the word vectors may place the processed data for the word to be learned, the adjacent word phrases, the predicate-noun relation information, and the antonyms in a single input/output layer that merges the input layer and the output layer of the skip-gram model of word embedding, and may learn the word to be learned through word embedding learning with a learning model composed of that input/output layer and a mirror layer, thereby generating the word vectors.

The generating of the word vectors may perform word embedding learning through a feedforward process and a backpropagation process, and in the backpropagation process may leave unchanged the weight values connected to the processed data of the word to be learned while changing the weight values connected to the word to be learned.

The generating of the word vectors may use negative sampling from word embedding to train the word to be learned on incorrect answers other than the processed data.

The generating of the word vectors may, through word embedding learning, generate word vectors for the content morphemes that are adjacent to the word to be learned, excluding particles and endings.

The disambiguating of the homograph may use the generated content morpheme word vectors to compare the similarity between the content morpheme of the homograph to be disambiguated and the content morphemes of the adjacent word phrases, and thereby disambiguate the homograph.

Embodiments of the present invention learn, through word embedding learning, from a list of words to be learned, a converted corpus, and word meaning data drawn from various resources (e.g., a corpus, a standard dictionary, and a lexical semantic network), and disambiguate a homograph by comparing the similarity between the homograph and its adjacent word phrases, so that homographs can be accurately disambiguated even for unlearned patterns.

Because embodiments of the present invention perform word embedding learning using the semantic information of content morphemes in the corpus rather than conventional word embedding, learning is possible with little training data and can take less training time.

In addition, embodiments of the present invention learn on the basis of the meaning of the word to be learned rather than simply on the positions at which words appear, so learning is effective even for low-frequency words, and because the words covered are those that appear in the dictionary, more words can be represented as vectors than with conventional word embedding.

In addition, instead of conventional position-based word embedding, embodiments of the present invention generate word vectors by treating the word phrases adjacent to a content morpheme, together with the word meaning data (e.g., hypernyms) of those adjacent word phrases, as adjacent word phrases, so that the relations between words can be examined through cosine similarity.

FIG. 1 is a block diagram of an apparatus for disambiguating homographs using a lexical semantic network and word embedding according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram of a learning model composed of an input/output layer and a mirror layer according to an embodiment of the present invention.
FIG. 3 is an explanatory diagram of the skip-gram model in conventional word embedding.
FIG. 4 is an explanatory diagram of the Feature Mirror model, a modified skip-gram model, in the homograph disambiguation apparatus according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram of a morpheme-level learning process using the conventional skip-gram model.
FIG. 6 is a flowchart of a method for disambiguating homographs using a lexical semantic network and word embedding, performed by the homograph disambiguation apparatus according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described with reference to the accompanying drawings. The description focuses on the parts needed to understand the operation and effect of the invention. In describing the embodiments, technical content that is well known in the art and not directly related to the invention is omitted, so that the gist of the invention is conveyed more clearly without being obscured by unnecessary description.

In describing the components of the invention, components with the same name may be given different reference numerals in different drawings, or the same reference numeral even across different drawings. Even so, this does not mean that the component has different functions in different embodiments or the same function in different embodiments; the function of each component should be judged from the description of that component in the corresponding embodiment.

FIG. 1 is a block diagram of an apparatus for disambiguating homographs using a lexical semantic network and word embedding according to an embodiment of the present invention.

As shown in FIG. 1, the homograph disambiguation apparatus 100 using a lexical semantic network and word embedding according to an embodiment of the present invention includes a data storage unit 110, a word list generation unit 120, a processed data generation unit 130, a word embedding learning unit 140, and a homograph disambiguation unit 150.

The homograph disambiguation apparatus 100 according to an embodiment of the present invention learns from various resources using the Feature Mirror model, a modified skip-gram model of word embedding, and generates word vectors for content morphemes.

To this end, the homograph disambiguation apparatus 100 first generates a list of words to be learned from the corpus 111. It converts the corpus 111 containing learning word phrases and the usage examples of the standard dictionary 112 into corpus form and learns from them. It likewise converts the predicate-noun relations of the lexical semantic network 113 into corpus form and learns from them. In addition, it may learn using the hypernyms of the lexical semantic network 113 and the definitions of the standard dictionary 112. The antonyms of the lexical semantic network 113 are used for negative sampling. All learning resources used are tagged at the homograph level.
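The two conversions mentioned above (predicate-noun relations into corpus form, and antonyms into negative samples) might look like the following sketch. The data layout, the sense labels, and the function names `relations_to_corpus` and `antonym_negatives` are assumptions made for illustration; the text does not specify the actual format of the lexical semantic network 113.

```python
def relations_to_corpus(relations):
    """Turn each (predicate, noun) relation into a two-token pseudo-sentence.

    The noun is placed first, following Korean word order (assumed here).
    """
    return [[noun, predicate] for predicate, noun in relations]

def antonym_negatives(antonyms):
    """Emit (word, antonym, 0) triples; label 0 marks a negative sample."""
    return [(word, ant, 0) for word, ants in antonyms.items() for ant in ants]

# Hypothetical fragment of a lexical semantic network, tagged at sense level.
predicate_noun = [("먹다_eat", "사과_apple"), ("받다_receive", "사과_apology")]
antonyms = {"크다_big": ["작다_small"]}

pseudo_sentences = relations_to_corpus(predicate_noun)
negatives = antonym_negatives(antonyms)
print(pseudo_sentences)
print(negatives)
```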

As the result of this word embedding learning, the homograph disambiguation apparatus 100 generates word vectors for morphemes. It can then compute the similarity between two morphemes from their word vectors and thereby disambiguate homographs.
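A minimal sketch of similarity-based disambiguation follows. It assumes that each candidate sense of a homograph has its own vector and that the sense with the highest summed cosine similarity to the neighboring content morphemes is chosen; the summation rule, the toy vectors, and the names `cosine` and `disambiguate` are illustrative assumptions, not details given in the text.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def disambiguate(candidates, neighbors, vectors):
    """Pick the candidate sense most similar, in total, to the neighbors."""
    def score(sense):
        return sum(cosine(vectors[sense], vectors[n])
                   for n in neighbors if n in vectors)
    return max(candidates, key=score)

# Hand-written toy vectors for two senses of 사과 and one neighbor.
vectors = {
    "사과_apple": [0.9, 0.1],
    "사과_apology": [0.1, 0.9],
    "열매": [1.0, 0.0],
}
chosen = disambiguate(["사과_apple", "사과_apology"], ["열매"], vectors)
print(chosen)
```

Because the decision relies on vector similarity rather than on a stored co-occurrence, it can still produce an answer for a neighbor that never appeared next to the homograph in training, provided the neighbor has a vector.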

The homograph disambiguation apparatus 100 according to an embodiment of the present invention is applicable not only to Korean but to any language for which a corpus 111, a standard dictionary 112, and a lexical semantic network 113 can be built, such as English (WordNet) and Chinese (HowNet).

The specific configuration and operation of each component of the homograph disambiguation apparatus 100 of FIG. 1 are described below.

The data storage unit 110 stores a corpus 111 containing at least one learning word phrase, together with a standard dictionary 112 and a lexical semantic network 113 containing word meaning data. Specifically, the standard dictionary 112 contains usage examples and definitions as word meaning data, and the lexical semantic network 113 contains predicate-noun relation information, hypernyms, and antonyms as word meaning data. The corpus 111 is a predetermined corpus; for example, the Sejong corpus may be used.

The homograph disambiguation apparatus 100 according to an embodiment of the present invention uses the corpus 111, the standard dictionary 112, and the lexical semantic network 113 as the sources of the kinds of words to be learned and of the word meaning data. The lexical semantic network 113 is also referred to as a lexical map. It is a network in which nouns, predicates, and adverbs are interconnected through semantic constraints on the basis of a standard dictionary.

The word list generation unit 120 generates a list of words to be learned from the corpus 111 stored in the data storage unit 110.
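Word list generation can be sketched as collecting the content morphemes of a morpheme-tagged corpus. The corpus layout (a list of word phrases, each a list of `(morpheme, tag)` pairs), the Sejong-style tag whitelist in `CONTENT_TAGS`, and the function name `build_word_list` are assumptions made for illustration; the text only states that the list is generated from the corpus 111.

```python
# Assumed whitelist of content-morpheme tags (Sejong-style part-of-speech tags):
# common noun, proper noun, verb, adjective, general adverb.
CONTENT_TAGS = {"NNG", "NNP", "VV", "VA", "MAG"}

def build_word_list(corpus):
    """Collect the unique content morphemes in a tagged corpus, sorted."""
    words = set()
    for word_phrase in corpus:
        for morph, tag in word_phrase:
            if tag in CONTENT_TAGS:  # particles and endings are skipped
                words.add(f"{morph}/{tag}")
    return sorted(words)

# "사과는 열매다." split into two word phrases with hypothetical tagging.
corpus = [
    [("사과", "NNG"), ("는", "JX")],
    [("열매", "NNG"), ("이", "VCP"), ("다", "EF"), (".", "SF")],
]
word_list = build_word_list(corpus)
print(word_list)
```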

The processed data generation unit 130 retrieves from the data storage unit 110 the word meaning data of the words in the list generated by the word list generation unit 120, and processes it to suit word embedding learning, generating processed data for learning.

In doing so, the processed data generation unit 130 converts into a corpus the convertible word meaning data among the word meaning data of the words to be learned in the list generated by the word list generation unit 120.

Specifically, the processed data generation unit 130 converts into corpus form at least one of the corpus 111 containing at least one learning word phrase, the usage examples contained in the standard dictionary 112, and the predicate-noun relation information contained in the lexical semantic network 113.

In the converted corpus, the processed data generation unit 130 may treat the words adjacent to the word to be learned, together with the hypernyms of those adjacent words, as adjacent words of the word to be learned. It may likewise treat predicate-noun relation information as adjacent words of the word to be learned.

The processed data generation unit 130 then processes the generated list of words to be learned, the converted corpus, and the word meaning data to suit word embedding learning, thereby generating the processed data.

Using the processed data generated by the processed data generation unit 130, the word embedding learning unit 140 learns each word to be learned through word embedding learning with a learning model composed of an input/output layer and a projection layer, and generates word vectors. In doing so, the word embedding learning unit 140 may generate word vectors for the content morphemes adjacent to the word to be learned, excluding particles and endings.

Here, the word embedding learning unit 140 uses a learning model composed of a mirror layer and a single input/output layer that merges the input layer and the output layer of the Skip-Gram model of word embedding. With each item of processed data placed on the input/output layer, it learns the word to be learned through word embedding learning and generates word vectors.

Specifically, the word embedding learning unit 140 places the processed data for the word to be learned, its adjacent words, the predicate-noun relation information, and the antonyms on the single input/output layer that merges the Skip-Gram input and output layers. It then learns the word to be learned through word embedding learning with the model composed of this input/output layer and the mirror layer, and generates word vectors.

The word embedding learning unit 140 performs word embedding learning through a feedforward pass and a backpropagation pass. During backpropagation, it changes only the weights connected to the word to be learned and leaves unchanged the weights connected to that word's processed data.

The word embedding learning unit 140 may also use negative sampling (Negative-Sampling) to teach the word to be learned wrong answers that are not in the processed data. It performs negative sampling on antonyms, setting the output value to 0 only in the antonym case. It also performs negative sampling on the other homographs of the word to be learned. For example, when learning '사과_05/NNG', negative sampling may be performed against the other morphemes whose surface form is '사과' and whose part of speech is NNG, such as '사과_08/NNG'.
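The homograph-based negative sampling described above can be sketched as a simple lookup. This is a minimal illustration, not the patented implementation: the function names `parse_tag` and `homograph_negatives` are hypothetical, and the tag layout `surface_sense/POS` is inferred from examples such as '사과_05/NNG'.

```python
def parse_tag(tag):
    """Split a tag such as '사과_05/NNG' into (surface, sense, pos).

    The layout 'surface_sense/POS' is an assumption based on the examples
    in the text; a tag without a sense number yields an empty sense.
    """
    body, pos = tag.rsplit("/", 1)
    surface, _, sense = body.partition("_")
    return surface, sense, pos


def homograph_negatives(target, vocabulary):
    """Return the other senses that share the target's surface form and POS."""
    surface, _, pos = parse_tag(target)
    negatives = []
    for candidate in vocabulary:
        cand_surface, _, cand_pos = parse_tag(candidate)
        if candidate != target and cand_surface == surface and cand_pos == pos:
            negatives.append(candidate)
    return negatives


# Hypothetical mini-vocabulary for illustration only.
vocabulary = ["사과_05/NNG", "사과_08/NNG", "배_03/NNG"]
negatives = homograph_negatives("사과_05/NNG", vocabulary)
```

With this toy vocabulary, the only negative sample for '사과_05/NNG' is '사과_08/NNG', which matches the example in the text.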

Meanwhile, the homograph disambiguation unit 150 uses the word vectors generated by the word embedding learning unit 140 to compare the similarity between a homograph and its adjacent words, and disambiguates the homograph according to the comparison result.

Here, the homograph disambiguation unit 150 uses the word vectors of content morphemes. That is, using the content-morpheme word vectors generated by the word embedding learning unit 140, it compares the similarity between the content morpheme of the homograph to be disambiguated and the content morphemes of the adjacent words, and disambiguates the homograph accordingly.

In this way, the homograph disambiguation apparatus 100 learns not only from the corpus but also from usage examples and definitions, treating them as plain corpus text. From the converted corpus, it generates as processed data the word that appears in the word or morpheme immediately following the word to be learned. It also uses hypernyms, antonyms, and predicate-noun relations as processed data.

Meanwhile, the homograph disambiguation apparatus 100 combines the corpus 111 with the usage examples of the standard dictionary 112 and treats them as a single corpus. Conventional Word2Vec and other known word embedding models target English and English corpora.

To apply to languages other than English (for example, an agglutinative language such as Korean), the homograph disambiguation apparatus 100 adapts word embedding learning to the characteristics of each language.

For example, when the homograph disambiguation apparatus 100 is applied to Korean, particles such as '을' and '를' and endings such as '하다' and '였다' are given no word vectors. That is, the word embedding learning unit 140 generates word vectors for content morphemes such as common nouns (NNG), verbs (VV), and adjectives (VA). It keeps only the content morphemes of the converted corpus and uses immediately adjacent content morphemes for word embedding learning. The word embedding learning unit 140 may also perform negative sampling each time an adjacent morpheme has been learned a preset number of times (for example, once). It performs negative sampling against other homographs that share the same form as the content morpheme of a randomly selected homograph. For example, when learning '사과_05/NNG', it performs negative sampling against the other morphemes whose surface form is '사과' and whose part of speech is common noun (NNG), such as '사과_08/NNG'.
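The content-morpheme filtering described above can be sketched as follows. This is a minimal sketch under stated assumptions: the set `CONTENT_TAGS` contains only the three tags named in the text (the text says "such as", so the real set is presumably larger), the function name `content_pairs` is hypothetical, and the tagged sentence is hand-written for illustration using particle and ending tags (JKB, JKO, EP, EF) in the style of the Sejong tag set.

```python
# Tags treated as content morphemes; only the three named in the text are
# listed here, which is an assumption about the full set.
CONTENT_TAGS = {"NNG", "VV", "VA"}


def content_pairs(morphemes):
    """Drop particles and endings, then pair each content morpheme with the
    content morpheme that immediately follows it."""
    content = [m for m in morphemes if m.rsplit("/", 1)[1] in CONTENT_TAGS]
    return list(zip(content, content[1:]))


# Hand-tagged example sentence: "학교에서 사과를 먹었다."
sentence = ["학교/NNG", "에서/JKB", "사과/NNG", "를/JKO", "먹/VV", "었/EP", "다/EF"]
pairs = content_pairs(sentence)
```

Here the particles and endings are removed before pairing, so '사과/NNG' is paired with '먹/VV' even though a particle separates them in the sentence.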

The process of learning the lexical semantic network 113, which holds various kinds of information, is as follows. Of this information, the homograph disambiguation apparatus 100 uses hypernyms, antonyms, and predicate-noun relation information. When processing the converted corpus, the processed data generation unit 130 handles each adjacent word and, at the same time, treats the hypernym of that word as if it were adjacent too. The word embedding learning unit 140 performs negative sampling on antonyms.
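Treating the hypernym of an adjacent word as adjacent can be sketched as an expansion of the training pairs. This is an illustrative sketch only: the function name `expand_with_hypernyms`, the dictionary-based hypernym lookup, and the example entries are assumptions, not details taken from the patent.

```python
def expand_with_hypernyms(pairs, hypernyms):
    """For each (target, neighbor) pair, also emit (target, hypernym)
    for every hypernym recorded for the neighbor."""
    expanded = []
    for target, neighbor in pairs:
        expanded.append((target, neighbor))
        for hyper in hypernyms.get(neighbor, []):
            expanded.append((target, hyper))
    return expanded


# Hypothetical adjacency pair and hypernym entry for illustration.
pairs = [("먹/VV", "사과_05/NNG")]
hypernyms = {"사과_05/NNG": ["과일/NNG"]}
expanded = expand_with_hypernyms(pairs, hypernyms)
```

The design choice is that hypernyms add training pairs without changing the corpus itself, so a word with no recorded hypernym is simply left as is.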

After this process has been carried out once over the entire converted corpus, the homograph disambiguation apparatus 100 may learn the predicate-noun relation information. It does so by treating each predicate and noun linked in the predicate-noun relation network as adjacent to each other.

FIG. 2 is an explanatory diagram of a learning model composed of an input/output layer and a mirror layer according to an embodiment of the present invention.

As shown in FIG. 2, to learn each word the homograph disambiguation apparatus 100 uses negative sampling (Negative-Sampling) together with a learning model that modifies the Skip-Gram of conventional word embedding, or Word2Vec. The input values and result values of the processed data are represented in one-hot form.

The homograph disambiguation apparatus 100 represents each word as a 50-dimensional word vector.

FIG. 2 shows the learning model used in the embodiment. To learn word A, the homograph disambiguation apparatus 100 uses the weight values connected to word B and learns A through edges 201 and 202. When the weight values are actually changed, only the weight 204 connected to the word A being learned is changed; the weight 203 connected to word B is not changed.

This learning model is used so that the embedding result for word A also incorporates information about the words connected to word B. The model of FIG. 2 is thus used to obtain a semantic chain effect among words. The homograph disambiguation apparatus 100 sets the output value to 1 when learning definitions and hypernyms, and sets it to 0 only for antonyms.
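The labelling rule above (output 1 for definition words and hypernyms, 0 for antonyms) can be sketched as a small sample builder. The function name `build_samples`, the tuple layout, and the example entries are all hypothetical and chosen only for illustration.

```python
def build_samples(target, definition_words, hypernyms, antonyms):
    """Return (target, feature, label) triples.

    Definition words and hypernyms get label 1; antonyms get label 0,
    following the output values described in the text.
    """
    samples = [(target, word, 1) for word in definition_words]
    samples += [(target, word, 1) for word in hypernyms]
    samples += [(target, word, 0) for word in antonyms]
    return samples


# Illustrative entries; a word without antonyms contributes no 0-labelled sample.
samples = build_samples("사과", ["사과나무", "열매"], ["과일"], [])
```

Keeping the label next to each pair lets a training loop use the same update code for positive and antonym samples.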

In addition, the homograph disambiguation apparatus 100 can raise learning accuracy by using negative sampling to teach each word wrong answers that are not in the training data.

Hereinafter, the differences between the Skip-Gram model of conventional word embedding and the feature mirror model according to an embodiment of the present invention are described with reference to FIGS. 3 and 4.

FIG. 3 is an explanatory diagram of the Skip-Gram model in conventional word embedding.

As shown in FIG. 3, the conventional Skip-Gram model is an artificial neural network built from three layers: an input layer, a projection layer, and an output layer.

The word to be learned enters the input layer in one-hot form. The output layer receives, for training, the two words that precede the input word, w(t-2) and w(t-1), and the two words that follow it, w(t+1) and w(t+2). In the Skip-Gram model shown in FIG. 3, the edges between the input layer and the projection layer and the edges between the projection layer and the output layer have different values.

FIG. 4 is an explanatory diagram of the feature mirror model in the homograph disambiguation apparatus according to an embodiment of the present invention.

As shown in (a) and (b) of FIG. 4, the feature mirror model according to an embodiment of the present invention consists of two layers: a mirror layer and a single input/output layer that merges the input layer and the output layer of the conventional Skip-Gram model.

In the input/output layer of the feature mirror model, the word to be learned X(t) and the word meaning data of X(t) are entered in one-hot form. The word meaning data of X(t) may include the adjacent word X(t+1), predicate-noun relation (V-N relation) information, and antonyms. For example, the word meaning data may include the word definition, hypernyms, and antonyms.

The homograph disambiguation apparatus 100 performs word embedding learning through a feedforward pass and a backpropagation pass. During backpropagation, it changes only the weights connected to the word to be learned and leaves unchanged the weights connected to that word's word meaning data.

That is, when training with the feature mirror model, the homograph disambiguation apparatus 100 changes only the edges (weights) connected to the word to be learned, that is, the target word x(target), and does not change the edges (weights) connected to that word's definition, hypernyms, antonyms, and so on.

Hereinafter, the learning processes using the conventional Skip-Gram model and the Skip-Gram model according to an embodiment of the present invention are described with reference to FIG. 5.

FIG. 5 is an exemplary diagram of a morpheme-level learning process using the conventional Skip-Gram model.

Conventional word embedding learns a word from the words surrounding it in a sentence. As shown in FIG. 5, when the word to be learned in the sentence "학교에서 사과를 먹었다." ("I ate an apple at school.") is "사과", the surrounding morphemes "학교", "에서", "를", and "먹" are used for word embedding. These may have no semantic relation to "사과", but they are used because they are adjacent to it in the sentence. Conventional word embedding thus trains on adjacent particles and endings as well as on the surrounding content morphemes.
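The conventional windowing just described can be sketched as follows. This is an illustrative sketch rather than Word2Vec's actual implementation: the function name `skipgram_pairs` is hypothetical, and the window of two morphemes on each side follows the w(t-2) to w(t+2) description of the conventional model.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every token with each token up to `window` positions away,
    without filtering out particles or endings."""
    pairs = []
    for position, center in enumerate(tokens):
        for offset in range(-window, window + 1):
            context = position + offset
            if offset != 0 and 0 <= context < len(tokens):
                pairs.append((center, tokens[context]))
    return pairs


# Morphemes of the example sentence, as listed in the text.
tokens = ["학교", "에서", "사과", "를", "먹"]
apple_pairs = [pair for pair in skipgram_pairs(tokens) if pair[0] == "사과"]
```

For the target "사과", the contexts are "학교", "에서", "를", and "먹", so two of the four training pairs involve particles.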

To cover enough words with adequate accuracy, conventional word embedding requires a large corpus annotated with morpheme sense numbers. Because the amount of corpus data affects the embedding result, conventional methods recommend using as much corpus data as possible.

Meanwhile, the homograph disambiguation apparatus 100 uses a variant of Word2Vec, namely a modified Skip-Gram model. Using the word vectors of the learned words, it performs the following computation to resolve homograph ambiguity.

The similarity computation using the word vectors of content morphemes proceeds as follows.

To resolve the ambiguity of the homograph '사과' in the sentence "사과를 먹다" ("to eat an apple"), the homograph disambiguation apparatus 100 looks up all the word vectors corresponding to '사과'. These include 사과(05) (the fruit of the apple tree) and 사과(08) (admitting one's fault and asking forgiveness).

The homograph disambiguation apparatus 100 then computes a value for each candidate from its word vector and the word vector of 먹다 (to take food into the stomach), using the cosine similarity of [Equation 1] below.

[Equation 1]
$$\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

Here, A and B denote the word vectors of the two words A and B, respectively.

The homograph disambiguation apparatus 100 then takes the sense whose vector yields the highest value as the meaning of '사과' in the sentence "사과를 먹다".
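The selection step above can be sketched with plain Python lists. The vectors below are made-up three-dimensional toy values (the text describes 50-dimensional vectors), and the function names `cosine` and `disambiguate` are hypothetical; the sketch also assumes no vector is all zeros.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def disambiguate(sense_vectors, neighbor_vector):
    """Return the sense whose vector is most similar to the neighbor's vector."""
    return max(sense_vectors, key=lambda sense: cosine(sense_vectors[sense], neighbor_vector))


# Toy vectors invented for illustration; real vectors come from training.
senses = {
    "사과_05/NNG": [0.9, 0.1, 0.0],
    "사과_08/NNG": [0.0, 0.2, 0.9],
}
eat_vector = [0.8, 0.3, 0.1]
best = disambiguate(senses, eat_vector)
```

With these toy values the sense chosen is '사과_05/NNG', because its vector points in nearly the same direction as the vector standing in for 먹다.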

As an example, consider resolving the ambiguity of the homograph '배' in "배가 썩는다" ("the pear rots").

This sentence does not occur in the Sejong corpus, so it is an unlearned pattern. More precisely, the Sejong corpus contains no case in which '배' in the sense of the fruit is adjacent to '썩다' (to rot). With conventional learning methods, accuracy drops markedly when such an unlearned pattern occurs.

With the word vectors generated by the homograph disambiguation apparatus 100, however, the drop in accuracy is small. This is because the vector of the fruit '배' points in a direction similar to the vectors of '열매' (fruit), '사과' (apple), and '음식' (food), which do occur adjacent to '썩다'.

The vectors are learned to be similar because '열매' appears among the hypernyms of '배' in the lexical semantic network 113. In addition, the words adjacent to '배' resemble the words adjacent to '음식' (for example, 먹다 'eat' and 삼키다 'swallow'), so their word vectors are learned to be similar. Words such as '사과' and '음식' do occur adjacent to the verb '썩다', and the vectors of adjacent words become similar to one another. As a result, the vector of the fruit '배' becomes similar to that of '썩다'.

The homograph disambiguation apparatus 100 also performs negative sampling so that the word vectors of the different senses of '배' (for example, ship and abdomen) point away from one another. The vector of '배' meaning ship therefore differs greatly from that of the fruit '배', and consequently has low similarity to the vector of '썩다'.

In short, because the homograph disambiguation apparatus 100 uses the word vectors produced by word embedding learning in the disambiguation process, it tends to be robust to unlearned patterns. It disambiguates homographs more accurately still because it additionally learns the lexical semantic network 113 and performs negative sampling between homographs that differ only in sense number (for example, the fruit '배' against the ship '배').

Next, consider how the feature mirror model learns the word "사과" (the edible apple).

The homograph disambiguation apparatus 100 takes the definition of 사과, "사과나무의 열매" ("the fruit of the apple tree"), extracts its nouns and predicates, and uses the resulting words "사과나무" (apple tree) and "열매" (fruit) as processed data. It also adds "과일", the hypernym of the word to be learned "사과", to the processed data for 사과. Because "사과" has no antonym, no antonym is added to the processed data; when a word does have an antonym, the antonym is added.

After "사과나무", "열매", and "과일" have been found as the processed data for the word "사과", word embedding learning is performed using the learning model shown on the right.

As an example, the word embedding learning process for the word to be learned "사과" and the word "사과나무" from its definition is described with reference to [Equation 2] and [Equation 3] below.

[Equation 2]
$$s = \sum_{i} v_{\text{apple},i} \, v_{\text{apple tree},i}$$

Here, $v_{\text{apple}}$ denotes the vector of 사과, $v_{\text{apple tree}}$ denotes the vector of 사과나무, and $s$ denotes the value obtained by multiplying the vector of 사과 by the vector of 사과나무 element by element and summing all the products.

[Equation 2] is the formula used to compute the output value between 사과 and 사과나무.

The sum obtained according to [Equation 2] is passed through the sigmoid function shown in [Equation 3] to give the output value $y$:

$$y = \sigma(s) = \frac{1}{1 + e^{-s}}$$

[Equation 3]
$$E = d - y$$

Here, $E$ is the error value and $d$ is the correct-answer value in the processed data.

As in [Equation 3], the error value $E$ is obtained by subtracting the output value $y$ from $d$. Here, $d$ is the correct-answer value in the processed data: it is 1 when the sample is not an antonym and 0 when it is an antonym.

The word embedding learning unit 140 uses the error value (E) to compute the amount of change in the weights. After computing the amount of change, it applies the change only to the vector of 사과. That is, the word embedding learning unit 140 applies the change only to the weights connected to 사과 and does not apply it to the weights connected to 사과나무.

FIG. 6 is a flowchart of a homograph disambiguation method using a lexical semantic network and word embedding, performed by the homograph disambiguation apparatus according to an embodiment of the present invention.

The homograph disambiguation apparatus 100 generates a list of words to be learned from a corpus containing one or more training words (S101).

The homograph disambiguation apparatus 100 then converts into a corpus the convertible word meaning data among the word meaning data of the words to be learned in the generated list (S102). The word meaning data is retrieved from the corpus 111, the standard dictionary 112, and the lexical semantic network 113 stored in the data storage unit 110.

Next, the homograph disambiguation apparatus 100 processes the generated list of words to be learned, the converted corpus, and the word meaning data to suit word embedding learning, thereby generating processed data (S103).

Then, using the generated processed data, the homograph disambiguation apparatus 100 learns each word to be learned through word embedding learning with a learning model composed of an input/output layer and a projection layer, and generates word vectors (S104).

The homograph disambiguation apparatus 100 then uses the generated word vectors to compare the similarity between a homograph and its adjacent words (S105).

Finally, the homograph disambiguation apparatus 100 disambiguates the homograph according to the comparison result of step S105 (S106).

Meanwhile, the procedure and results of an experiment conducted with the homograph disambiguation apparatus 100 are described with reference to [Table 1] below.

To handle patterns it has never learned, the homograph disambiguation apparatus 100 vectorizes homograph morphemes into word vectors and compares the similarity between a homograph and the content morphemes of its adjacent words.

The training corpus 111, the standard dictionary 112, and the lexical semantic network 113 are used in vectorizing the words. A small-scale experiment on unlearned patterns produced sufficiently meaningful results, as shown in [Table 1] below.

| Noun to disambiguate | Adjacent morpheme | Answer |
|---|---|---|
| 사과 | 열매_01/NNG | 사과_05/NNG |
| 배 | 썩/VV | 배_03/NNG |
| 면 | 삶/VV | 면_08/NNG |

[Table 1] shows part of the test set used in the simple experiment.

A simple experimental environment was set up to check quickly whether the homograph disambiguation apparatus 100 is effective.

First, for the experiment, a noun was selected that has several homographs occurring with roughly even frequency, that is, a noun with a relatively low baseline (for example, 사과).

Next, a morpheme was chosen that can occur adjacent to '사과' but never does so in the training corpus 111 (for example, 열매). [Table 1] lists only three examples from the test set used in the experiment.

For example, the homograph disambiguation apparatus 100 receives '사과' and '열매_01/NNG' as input. If, among the homographs of '사과', the one computed to be most similar to '열매' is '사과_05/NNG', the answer is correct.

The homograph disambiguation apparatus 100 answered 18 of the 24 items correctly (75%). Although the test set is very small, this indicates that the homograph disambiguation apparatus 100 is effective.

Thus, to disambiguate homographs in a way applicable not only to Korean but to any language, such as English and Chinese, the embodiment learns through word embedding learning and disambiguates homographs by comparing the similarity between word vectors.

For word embedding learning, the homograph disambiguation apparatus 100 according to the embodiment of the present invention uses not only the corpus 111 but also the standard dictionary 112 and the lexical semantic network 113. As shown in [Table 1] above, a simple experiment was conducted using only unlearned patterns, and it produced meaningful results.
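One way to picture how the three resources feed the learning is the sketch below, which builds (word to be trained, neighbor) pairs from a tokenized sentence and adds the hypernyms of each adjacent token as extra neighbors, in line with the claims. The window size, the function name `neighbor_pairs`, and the toy tokens and hypernym table are all hypothetical.

```python
def neighbor_pairs(tokens, hypernyms, window=1):
    """Build (target, neighbor) training pairs from one tokenized sentence.
    Each adjacent token and each of its hypernyms counts as a neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            pairs.append((target, tokens[j]))
            # Hypernyms come from the lexical semantic network (toy table here).
            for h in hypernyms.get(tokens[j], ()):
                pairs.append((target, h))
    return pairs

# Invented toy data: a two-token sentence and a one-entry hypernym table.
tokens = ["apple_05", "eat_01"]
hypernyms = {"apple_05": ["fruit_01"]}
pairs = neighbor_pairs(tokens, hypernyms)
print(pairs)
# [('apple_05', 'eat_01'), ('eat_01', 'apple_05'), ('eat_01', 'fruit_01')]
```

Usage examples and definitions from the dictionary could be passed through the same function once they have been converted to corpus form.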

For unlearned patterns, conventional methods can be expected to achieve only a very low, baseline-level accuracy. The homograph disambiguation apparatus 100 according to the embodiment of the present invention can therefore also be used as a follow-up supplementary module for patterns that a conventional homograph disambiguator fails to recall. Used in this way, it can improve overall accuracy. That is, when the homograph disambiguation apparatus 100 is applied, a homograph disambiguator that also operates stably on unlearned patterns can be provided. In this manner, the homograph disambiguation apparatus 100 according to the embodiment of the present invention can be integrated into a conventional homograph disambiguator (UTagger).
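The supplementary-module arrangement can be sketched as a simple fallback. The function `primary` below stands in for a conventional disambiguator and `fallback` for the embedding-based apparatus. Both are hypothetical stubs, and no actual UTagger interface is implied.

```python
def disambiguate(word, neighbor, primary, fallback):
    """Use the conventional disambiguator first; if it has no answer for this
    pattern (returns None), defer to the embedding-based fallback."""
    tag = primary(word, neighbor)
    return tag if tag is not None else fallback(word, neighbor)

# Stub standing in for a conventional disambiguator that only knows seen patterns.
LEARNED = {("w1", "n1"): "w1_01"}

def primary(word, neighbor):
    return LEARNED.get((word, neighbor))

# Stub standing in for the embedding-based apparatus; it always returns a tag.
def fallback(word, neighbor):
    return word + "_02"

seen = disambiguate("w1", "n1", primary, fallback)    # learned pattern
unseen = disambiguate("w1", "n9", primary, fallback)  # unlearned pattern
print(seen, unseen)
```

In this arrangement the embedding-based module is consulted only for patterns the conventional disambiguator cannot resolve.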

In addition, the homograph disambiguation apparatus 100 can generate word vectors for dependent nouns, adverbs, suffixes, prefixes, and the like. Because these differ in character from common nouns and predicates, the homograph disambiguation apparatus 100 can adapt its word embedding learning to suit each of them.

The embodiments described above are merely examples, and those of ordinary skill in the art to which the present invention pertains may make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed herein are intended to describe, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention.

100: homograph disambiguation apparatus
110: data storage unit
111: corpus
112: standard dictionary
113: lexical semantic network
120: word list generation unit
130: processed data generation unit
140: word embedding learning unit
150: homograph disambiguation unit

Claims

An apparatus for disambiguating homographs using a lexical semantic network and word embedding, the apparatus comprising:
a data storage unit configured to store a corpus including at least one training word, and a standard dictionary and a lexical semantic network that include word meaning data;
a word list generation unit configured to generate a list of words to be trained from the corpus;
a processed data generation unit configured to convert convertible word meaning data, among the word meaning data of the words to be trained included in the generated word list, into a corpus, and to generate processed data by processing the generated list of words to be trained, the converted corpus, and the word meaning data to suit word embedding learning;
a word embedding learning unit configured to generate a word vector by learning the word to be trained, using the generated processed data, through word embedding learning that uses a learning model composed of an input/output layer and a projection layer; and
a homograph disambiguation unit configured to compare, using the generated word vector, the similarity between a homograph and neighboring words, and to disambiguate the homograph according to the comparison result,
wherein the word embedding learning unit performs the word embedding learning through a feedforward process and a back propagation process and, in the back propagation process, changes the weight values connected to the word to be trained without changing the weight values connected to the processed data of the word to be trained.

The apparatus of claim 1, wherein the data storage unit stores a corpus including at least one training word, a standard dictionary including usage examples and definitions as word meaning data, and a lexical semantic network including predicate-noun relation information, hypernyms, and antonyms as word meaning data.

The apparatus of claim 1, wherein the processed data generation unit converts into corpus form at least one of the corpus including at least one training word, the usage examples and definitions included in the standard dictionary, and the predicate-noun relation information included in the lexical semantic network.

The apparatus of claim 1, wherein the processed data generation unit treats, in the converted corpus, a word adjacent to the word to be trained and a hypernym of the adjacent word as neighboring words of the word to be trained, or treats predicate-noun relation information as a neighboring word of the word to be trained.

The apparatus of claim 1, wherein the word embedding learning unit places processed data, in which the word to be trained, the neighboring words, the predicate-noun relation information, and the antonyms have each been processed, in a single input/output layer formed by merging the input layer and the output layer of Skip-Gram in word embedding, and generates a word vector by learning the word to be trained through word embedding learning that uses a learning model composed of a mirror layer and the input/output layer in which each piece of processed data is placed.

(Deleted)

The apparatus of claim 1, wherein the word embedding learning unit uses negative sampling in word embedding to train the word to be trained on wrong answers other than the processed data.

The apparatus of claim 1, wherein the word embedding learning unit generates, through word embedding learning, word vectors of content morphemes that are adjacent to the word to be trained and that exclude postpositional particles and endings.

The apparatus of claim 1, wherein the homograph disambiguation unit disambiguates the homograph by using the generated word vectors of content morphemes to compare the similarity between the content morpheme of the homograph to be disambiguated and the content morphemes of the neighboring words.

A method for disambiguating homographs using a lexical semantic network and word embedding, the method being performed by a homograph disambiguation apparatus and comprising:
generating a list of words to be trained from a corpus including at least one training word;
converting convertible word meaning data, among the word meaning data of the words to be trained included in the generated word list, into a corpus;
generating processed data by processing the generated list of words to be trained, the converted corpus, and the word meaning data to suit word embedding learning;
generating a word vector by learning the word to be trained, using the generated processed data, through word embedding learning that uses a learning model composed of an input/output layer and a projection layer; and
comparing, using the generated word vector, the similarity between a homograph and neighboring words, and disambiguating the homograph according to the comparison result,
wherein the generating of the word vector comprises performing the word embedding learning through a feedforward process and a back propagation process and, in the back propagation process, changing the weight values connected to the word to be trained without changing the weight values connected to the processed data of the word to be trained.

The method of claim 10, wherein the converting into a corpus comprises converting into a corpus the convertible word meaning data among the word meaning data in a standard dictionary that includes usage examples and definitions and in a lexical semantic network that includes predicate-noun relation information, hypernyms, and antonyms.

The method of claim 10, wherein the converting into a corpus comprises converting into corpus form at least one of the corpus including at least one training word, the usage examples and definitions included in the standard dictionary, and the predicate-noun relation information included in the lexical semantic network.

The method of claim 10, wherein the generating of the processed data comprises treating, in the converted corpus, a word adjacent to the word to be trained and a hypernym of the adjacent word as neighboring words of the word to be trained, or treating predicate-noun relation information as a neighboring word of the word to be trained.

The method of claim 10, wherein the generating of the word vector comprises placing processed data, in which the word to be trained, the neighboring words, the predicate-noun relation information, and the antonyms have each been processed, in a single input/output layer formed by merging the input layer and the output layer of Skip-Gram in word embedding, and generating a word vector by learning the word to be trained through word embedding learning that uses a learning model composed of a mirror layer and the input/output layer in which each piece of processed data is placed.

(Deleted)

The method of claim 10, wherein the generating of the word vector comprises using negative sampling in word embedding to train the word to be trained on wrong answers other than the processed data.

The method of claim 10, wherein the generating of the word vector comprises generating, through word embedding learning, word vectors of content morphemes that are adjacent to the word to be trained and that exclude postpositional particles and endings.

The method of claim 10, wherein the disambiguating of the homograph comprises disambiguating the homograph by using the generated word vectors of content morphemes to compare the similarity between the content morpheme of the homograph to be disambiguated and the content morphemes of the neighboring words.