KR20200064880A

KR20200064880A - System and Method for Word Embedding using Knowledge Powered Deep Learning based on Korean WordNet

Info

Publication number: KR20200064880A
Application number: KR1020190080209A
Authority: KR
Inventors: 권혁철; 김민호
Original assignee: 부산대학교 산학협력단
Priority date: 2018-11-29
Filing date: 2019-07-03
Publication date: 2020-06-08
Also published as: KR102347505B1

Abstract

The present invention relates to a word embedding apparatus and method using knowledge-powered deep learning based on Korean WordNet, which improve the efficiency of word embedding by decomposing words according to a subword model and computing similarity from the resulting units. The word embedding apparatus comprises: a set definition unit that defines a complex word composed of two or more morphemes as a set of content morphemes; a vector calculation unit that calculates a word vector for each n-gram using the skip-gram method; and a similarity calculation unit that calculates the similarity between a word and its context during word embedding using the set of content morphemes. Because a complex word composed of two or more morphemes is defined as a set of content morphemes, its morphological characteristics are reflected without error.

Description

System and Method for Word Embedding using Knowledge Powered Deep Learning based on Korean WordNet

The present invention relates to word embedding and, more specifically, to a word embedding apparatus and method using knowledge-powered deep learning based on Korean WordNet that improve the efficiency of word embedding by decomposing words according to a subword model and computing similarity from the resulting units.

The most important task in a statistical language model based on deep learning is the distributed representation of words. A distributed representation expresses the meaning of a word as a vector in a multidimensional space and is also called word embedding. Word embedding methods are based on the distributional hypothesis, which holds that words appearing in similar contexts have similar meanings.

Word embedding methods based on the distributional hypothesis fall broadly into count-based methods, such as latent semantic analysis (LSA), and predictive methods, such as the neural probabilistic language model (NPLM).

Various methods built on the predictive approach have been proposed in this regard, and the following four architectures are the most widely used.

(1) CBOW (Continuous Bag-of-Words): A predictive method based on the distributional hypothesis. It resembles the feedforward neural network language model (NNLM) but has no hidden layer. It predicts a word from its context words and performs well on small datasets.

(2) Skip-gram: A predictive method based on the distributional hypothesis, provided in word2vec together with CBOW. Unlike CBOW, it predicts the context words from a word. Because it trains quickly, it is mainly used for word embedding on large-scale data.

(3) GloVe (Global Vectors for Word Representation): A count-based method based on the distributional hypothesis that uses each row of the global co-occurrence matrix as a word vector.

(4) fastText: A variant of skip-gram that generates subwords from n-grams in order to reflect the morphological characteristics of words.
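The difference between CBOW and skip-gram in items (1) and (2) comes down to which side of the word/context relation is predicted. The following minimal Python sketch generates the training examples each architecture would see for one tokenized sentence. The function names, the window size, and the whitespace tokenization are illustrative assumptions and are not taken from word2vec or from this document.

```python
def skipgram_pairs(tokens: list[str], window: int = 1) -> list[tuple[str, str]]:
    """Skip-gram: each (center, context) pair asks the model to predict a context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: list[str], window: int = 1) -> list[tuple[list[str], str]]:
    """CBOW: each (context words, target) example asks the model to predict the target from its context."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples


sentence = ["나는", "집을", "산다"]  # toy sentence, split on whitespace
print(skipgram_pairs(sentence))
print(cbow_examples(sentence))
```

With a window of one, skip-gram produces one training pair per neighboring word, while CBOW produces one example per position with all neighbors grouped as the input.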

Recently, methods that use subword information based on character n-grams have performed well for English. For Korean, word embedding that uses subword information based on syllable n-grams has also been proposed.

Word embedding methods that use subword information can learn the morphological characteristics of words and can also handle out-of-vocabulary (OOV) words that do not appear in the training data. However, character or syllable n-grams may introduce incorrect subword information and produce unintended results.

FIG. 1 is a diagram showing an example of words that are semantically similar to the word '달력' ("calendar").

For example, in a Korean word embedding produced by fastText, the similarity between '달력' ("calendar") and '전달력' ("ability to convey") is computed as 0.6472, a large value. This is because '달력' is a subword of '전달력'.

Accordingly, there is a need for a new technique that prevents the quality of word vectors from degrading because of information imbalance during word embedding.

Korean Patent Application Publication No. 10-2018-0008199
Korean Registered Patent No. 10-1797365
Korean Registered Patent No. 10-1799681

The present invention is intended to solve the problems of conventional word embedding techniques. One object is to provide a word embedding apparatus and method using knowledge-powered deep learning based on Korean WordNet that improve the efficiency of word embedding by decomposing words according to a subword model and computing similarity from the resulting units.

Another object is to provide a word embedding apparatus and method that improve the quality of word vectors by converting words, which are high-dimensional data, into concepts, which are low-dimensional data, while preserving the semantic characteristics of the words.

Another object is to provide a word embedding apparatus and method that prevent the degradation of word vector quality caused by information imbalance by automatically converting words into concepts before embedding them.

Another object is to provide a word embedding apparatus and method based on dimensionality reduction, in which words are replaced with concepts at embedding time so that the amount of data is reduced while its characteristics are preserved.

Another object is to provide a word embedding apparatus and method that, by converting the words of the training data into concepts during word embedding, yield vector representations reflecting the semantic information of words that appear rarely, or not at all, in the training data.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above objects, a word embedding apparatus using knowledge-powered deep learning based on Korean WordNet according to the present invention comprises: a set definition unit that defines a complex word composed of two or more morphemes as a set of content morphemes; a vector calculation unit that calculates a word vector for each n-gram using the skip-gram method; and a similarity calculation unit that calculates the similarity between a word and its context during word embedding using the set of content morphemes, wherein a complex word composed of two or more morphemes is defined as a set of content morphemes so that its morphological characteristics are reflected without error.

Here, in the word embedding process using the set of content morphemes, the scoring function $s(w, c)$ between a word $w$ and a context $c$ is defined as

$$s(w, c) = \sum_{m \in M_w} \mathbf{z}_m^{\top} \mathbf{v}_c$$

where $\mathbf{z}_m$ is the vector of a content morpheme $m$ belonging to the set $M_w$ of content morphemes appearing in the word $w$, and $\mathbf{v}_c$ is the vector of the context word.

In addition, in order to convert words, which are high-dimensional data, into concepts, which are low-dimensional data, while preserving the semantic characteristics of the words during word embedding, the apparatus further comprises: a morphological analysis unit that performs morphological analysis when a sentence is input; a semantic analysis unit that performs semantic analysis when the word determination and the sense determination cannot be made for a morphologically analyzed word; and a concept conversion unit that converts each word into a concept using a Korean lexical-semantic network.

When a word is looked up in the Korean lexical-semantic network and only one concept exists for it, the word is converted directly into that concept. When the word is a homonym or a polysemous word that can be used in several senses, its sense is first determined through semantic analysis and the word is then converted into the resulting concept.

The concept conversion unit performs either dynamic conversion, in which concept conversion is carried out on individual sentences as the raw corpus is input for training the word embedding, or static conversion, in which all words in the raw corpus are converted in a batch into the concepts registered in the Korean lexical-semantic network during the training process for word embedding.

To achieve another object, a word embedding method using knowledge-powered deep learning based on Korean WordNet according to the present invention comprises: a set definition step of defining a complex word composed of two or more morphemes as a set of content morphemes; a vector calculation step of calculating a word vector for each n-gram using the skip-gram method; and a similarity calculation step of calculating the similarity between a word and its context during word embedding using the set of content morphemes, wherein a complex word composed of two or more morphemes is defined as a set of content morphemes so that its morphological characteristics are reflected without error.

Here, in order to convert words, which are high-dimensional data, into concepts, which are low-dimensional data, while preserving the semantic characteristics of the words during word embedding, the method further comprises: a morphological analysis step of performing morphological analysis when a sentence is input; a semantic analysis step of performing semantic analysis when the word determination and the sense determination cannot be made for a morphologically analyzed word; and a concept conversion step of converting each word into a concept using a Korean lexical-semantic network.

The concept conversion step performs either dynamic conversion, in which concept conversion is carried out on individual sentences as the raw corpus is input for training the word embedding, or static conversion, in which all words in the raw corpus are converted in a batch into the concepts registered in the Korean lexical-semantic network during the training process for word embedding.

The method further comprises a step of using the word embedding result. In this step, to obtain the word vector for an input word, the word is first converted into its corresponding concept in the same way as in the training step, and the word is then mapped to that concept in the word embedding result.

If the input word is an out-of-vocabulary (unregistered) word when the word embedding result is used, the hypernyms, hyponyms, and synonyms related to that word are found in the Korean lexical-semantic network, and the embedding result of those related words is borrowed.
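The handling of unregistered words can be illustrated with a short Python sketch. Everything named here is a hypothetical stand-in: RELATED plays the role of the lexical-semantic network, VECTORS plays the role of the trained embedding, and averaging the vectors of the related words is only one assumed way of "borrowing" an embedding, since the text does not specify how the borrowed result is combined.

```python
from typing import Optional

# Hypothetical relations as they might be read from a lexical-semantic network.
RELATED = {
    "연립주택": {"synonyms": [], "hypernyms": ["주택"], "hyponyms": []},
}

# Hypothetical trained vectors; "연립주택" is deliberately missing.
VECTORS = {"주택": [0.1, 0.3]}


def borrow_vector(word: str) -> Optional[list[float]]:
    """Return the word's own vector, or one borrowed from its related words."""
    if word in VECTORS:
        return VECTORS[word]
    relations = RELATED.get(word, {})
    candidates = [
        VECTORS[rel]
        for kind in ("synonyms", "hypernyms", "hyponyms")
        for rel in relations.get(kind, [])
        if rel in VECTORS
    ]
    if not candidates:
        return None  # nothing to borrow from
    dim = len(candidates[0])
    # Assumption: combine the borrowed vectors by element-wise averaging.
    return [sum(vec[i] for vec in candidates) / len(candidates) for i in range(dim)]


print(borrow_vector("연립주택"))
```

A word with no related entry in either table yields None, which leaves the decision about a fallback (for example, skipping the word) to the caller.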

As described above, the word embedding apparatus and method using knowledge-powered deep learning based on Korean WordNet according to the present invention have the following effects.

First, the efficiency of word embedding is improved by replacing the step in fastText that defines a word as a set of n-gram subwords with one that defines it as a set of content morphemes.

Second, the quality of word vectors can be improved by converting words, which are high-dimensional data, into concepts, which are low-dimensional data, while preserving the semantic characteristics of the words during word embedding.

Third, automatically converting words into concepts before embedding them prevents the degradation of word vector quality caused by information imbalance.

Fourth, because the vector representation of words is based on dimensionality reduction, in which words are replaced with concepts at embedding time, the amount of data can be reduced while its characteristics are preserved.

Fifth, converting the words of the training data into concepts during word embedding makes it possible to reflect the semantic information of words that appear rarely, or not at all, in the training data.

FIG. 1 is a diagram showing an example of words that are semantically similar to the word '달력' ("calendar").
FIG. 2 is a block diagram of a word embedding apparatus using knowledge-powered deep learning based on Korean WordNet according to a first embodiment of the present invention.
FIG. 3 is a flowchart showing a word embedding method using knowledge-powered deep learning based on Korean WordNet according to the first embodiment of the present invention.
FIG. 4 is a diagram showing the process for the vector representation of words using knowledge-powered deep learning based on Korean WordNet according to the present invention.
FIG. 5 is a block diagram of a word embedding apparatus using knowledge-powered deep learning based on Korean WordNet according to a second embodiment of the present invention.
FIG. 6 is a flowchart showing a word embedding method using knowledge-powered deep learning based on Korean WordNet according to the second embodiment of the present invention.
FIG. 7 is a diagram showing an example of the word embedding method of FIG. 6.
FIG. 8 is a diagram showing static conversion in concept conversion.
FIG. 9 is a flowchart showing a method of using the trained word embedding according to the present invention.
FIG. 10 is a diagram showing an example of handling an out-of-vocabulary word when the word embedding result is used.

Hereinafter, preferred embodiments of the word embedding apparatus and method using knowledge-powered deep learning based on Korean WordNet according to the present invention are described in detail.

The features and advantages of the word embedding apparatus and method using knowledge-powered deep learning based on Korean WordNet according to the present invention will become apparent from the detailed description of each embodiment below.

FIG. 2 is a block diagram of a word embedding apparatus using knowledge-powered deep learning based on Korean WordNet according to a first embodiment of the present invention.

To improve the efficiency and accuracy of word embedding, the apparatus and method according to the present invention may include two configurations: one that improves the efficiency of word embedding by decomposing words according to a subword model and computing similarity from the resulting units, and one that improves the quality of word vectors by converting words, which are high-dimensional data, into concepts, which are low-dimensional data, while preserving the semantic characteristics of the words.

The configuration that decomposes words according to a subword model and computes similarity from the resulting units is described first.

In word embedding methods that reflect subword information, such as fastText, n-grams are the most common way of decomposing a word.

fastText defines a word as the set of the n-grams within the word together with the word itself.

The symbols '<' and '>' are added at the start and end of the word, respectively, to mark its boundaries.

For example, when n = 2, the word '전달력' is defined as <전, 전달, 달력, 력>, <전달력>, and the skip-gram method is used to calculate a vector for each n-gram.

In other words, the vector of a given word is determined by combining the vectors of the n-grams that make up the word.
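The n-gram decomposition described above can be sketched in Python as follows. The helper name char_ngrams is hypothetical, and the sketch assumes precomposed Hangul, so that each syllable is a single character. It follows the n = 2 example in the text rather than the default settings of the fastText library.

```python
def char_ngrams(word: str, n: int = 2) -> list[str]:
    """Return the n-grams of a boundary-marked word, followed by the whole marked word."""
    marked = f"<{word}>"  # '<' and '>' mark the start and end of the word
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams + [marked]  # the whole word is kept as its own unit


units_a = char_ngrams("전달력")
units_b = char_ngrams("달력")
print(units_a)
print(units_b)
print(set(units_a) & set(units_b))  # units shared by the two words
```

Running this on '전달력' and '달력' shows that the two words share the units '달력' and '력>', which is the kind of overlap that pulls their vectors together even though the words are unrelated in meaning.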

This way of reflecting subword information has the advantage that a word vector can be inferred even for a word that does not appear in the training corpus, by decomposing the word into known n-grams.

On the other hand, because n-grams do not reflect the morphological characteristics of the word, training sometimes goes wrong.

FIG. 1 shows an example of a word embedding result from fastText: the words output as semantically similar to the word '달력' ("calendar").

The words similar to '달력' include '전달력' ("ability to convey"). This happens because the two words share the subword (n-gram) '달력'.

To overcome this limitation, the first embodiment of the present invention proposes a word embedding method that modifies fastText so that a word is defined as a set of content morphemes instead of a set of n-gram subwords.

As shown in FIG. 2, the word embedding apparatus using knowledge-powered deep learning based on Korean WordNet according to the first embodiment of the present invention includes a set definition unit (21) that defines a complex word composed of two or more morphemes as a set of content morphemes, a vector calculation unit (22) that calculates a word vector for each n-gram using the skip-gram method, and a similarity calculation unit (23) that calculates the similarity between a word and its context during word embedding using the set of content morphemes.

In Korean, a simple word consists of a single morpheme, and a complex word consists of two or more morphemes. Defining a simple word as a set of syllable n-grams therefore risks reflecting its morphological characteristics incorrectly, whereas defining a complex word as a set of content morphemes captures meaningful morphological characteristics.

For example, '달력' is a simple word and is therefore represented as <달력>, while '전달력' combines the noun '전달' with the suffix '-력' and is therefore represented as <전달, -력>, <전달력>.

Likewise, '붙잡다' is a complex word because its stem '붙잡-' combines two morphemes: the content morpheme '붙-' and the content morpheme '잡-'. It is therefore represented as <붙-, 잡다>, <붙잡다>.
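A minimal Python sketch of this set definition follows. The analyses are hand-written for the two example words and stored in a hypothetical table named MORPHEME_ANALYSES; a real system would obtain them from a morphological analyzer, which is not shown here. The function name subword_units is likewise an illustrative assumption.

```python
# Hypothetical, hand-written analyses for the example words in the text.
MORPHEME_ANALYSES = {
    "달력": ["달력"],          # simple word: a single morpheme
    "전달력": ["전달", "-력"],  # noun '전달' + suffix '-력'
}


def subword_units(word: str) -> set[str]:
    """Return the morphemes of a word together with the whole word itself."""
    morphemes = MORPHEME_ANALYSES.get(word, [word])  # unknown words are treated as simple
    return set(morphemes) | {word}


print(subword_units("달력"))
print(subword_units("전달력"))
print(subword_units("달력") & subword_units("전달력"))  # empty: no shared unit
```

Unlike the n-gram decomposition, the two words no longer share any unit, so '달력' contributes nothing to the vector of '전달력'.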

In the word embedding process using such a set of content morphemes, the scoring function $s(w, c)$ between a word $w$ and a context $c$ is defined as in Equation 1.

[Equation 1]
$$s(w, c) = \sum_{m \in M_w} \mathbf{z}_m^{\top} \mathbf{v}_c$$

Here, $\mathbf{z}_m$ is the vector of a content morpheme $m$ belonging to the set $M_w$ of content morphemes appearing in the word $w$, and $\mathbf{v}_c$ is the vector of the context word.
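The scoring function can be illustrated with a small Python sketch. It assumes the function has the same form as the fastText score, with the n-gram set replaced by the content-morpheme set: the dot products between each morpheme vector and the context vector are summed. The function name, the two-dimensional toy vectors, and the choice to include the whole word among the units are illustrative assumptions.

```python
def score(unit_vectors: dict[str, list[float]], units: list[str], context_vector: list[float]) -> float:
    """Sum, over the units of a word, of the dot product between the unit vector and the context vector."""
    total = 0.0
    for unit in units:
        z = unit_vectors[unit]
        total += sum(a * b for a, b in zip(z, context_vector))
    return total


# Toy vectors for the units of '전달력' (values chosen only for illustration).
unit_vectors = {
    "전달": [1.0, 0.0],
    "-력": [0.0, 2.0],
    "전달력": [0.5, 0.5],
}
v_c = [2.0, 1.0]  # toy vector of one context word

print(score(unit_vectors, ["전달", "-력", "전달력"], v_c))
```

Because the score is a plain sum over units, a word's score against a context changes only through the vectors of its own morphemes, not through accidental syllable overlaps with other words.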

FIG. 3 is a flowchart showing a word embedding method using knowledge-powered deep learning based on Korean WordNet according to the first embodiment of the present invention.

The word embedding method according to the first embodiment of the present invention includes a step (S301) of defining a complex word composed of two or more morphemes as a set of content morphemes, a step (S302) of calculating a word vector for each n-gram using the skip-gram method, and a step (S303) of calculating the similarity between a word and its context during word embedding using the set of content morphemes.

In addition to the configuration that improves the efficiency of word embedding by decomposing words according to a subword model and computing similarity from the resulting units, the present invention can apply a configuration that improves the quality of word vectors by converting words, which are high-dimensional data, into concepts, which are low-dimensional data, while preserving the semantic characteristics of the words.

This configuration replaces words with concepts at embedding time and is based on dimensionality reduction, which reduces the amount of data while preserving its characteristics.

That is, words, which are high-dimensional data, are converted into concepts, which are low-dimensional data, while their semantic characteristics are preserved. This differs from merely using related words during the word embedding process.

FIG. 4 is a diagram showing the process for the vector representation of words using knowledge-powered deep learning based on Korean WordNet according to the present invention, and FIG. 5 is a block diagram of a word embedding apparatus using knowledge-powered deep learning based on Korean WordNet according to a second embodiment of the present invention.

As shown in FIG. 4, whereas conventional word embedding is performed on each individual word, the present invention converts words into concepts and then performs word embedding.

Because the words '주택' ("housing") and '집' ("house") express the same concept, both can be converted into the concept 'SYN001' (an arbitrary concept number used for illustration).

As shown in FIG. 5, the word embedding apparatus according to the second embodiment of the present invention includes a morphological analysis unit (51) that performs morphological analysis when a sentence is input, a semantic analysis unit (52) that performs semantic analysis when the word determination and the sense determination cannot be made for a morphologically analyzed word, and a concept conversion unit (53) that converts each word into a concept using a Korean lexical-semantic network.

FIG. 6 is a flowchart showing a word embedding method using knowledge-powered deep learning based on Korean WordNet according to the second embodiment of the present invention, and FIG. 7 is a diagram showing an example of the word embedding method of FIG. 6.

First, when a sentence is input, morphological analysis is performed (S601).

For example, the sentence '나는 집을 산다' ("I buy a house") is analyzed as '나/noun+는/particle 집/noun+을/particle 사다/verb+ㄴ다/ending'.

After morphological analysis, each word is converted into a concept.

Concept conversion uses the Korean lexical-semantic network.

When a word is looked up in the Korean lexical-semantic network and only one concept exists for it, the word is converted directly into that concept.

However, when the word is a homonym or a polysemous word that can be used in several senses, the word determination (S602) and the sense determination (S603) cannot be made, so the sense of the word is first analyzed through semantic analysis (S604).

The word is then converted into the analyzed concept (S605).
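The conversion flow of steps S601 to S605 can be sketched in Python as below. This is only an illustration under several assumptions: the input is taken to be already morphologically analyzed, LEXICON is a hypothetical miniature stand-in for the lexical-semantic network, every concept ID other than SYN001 is invented for the example, the cue-word overlap in disambiguate is a toy replacement for real semantic analysis, and words missing from the network are simply kept as they are.

```python
# Hypothetical network: word -> list of (concept ID, cue words that suggest this sense).
LEXICON: dict[str, list[tuple[str, set[str]]]] = {
    "집": [("SYN001", set())],
    "주택": [("SYN001", set())],
    "배": [
        ("SYN_BELLY", {"아프다"}),
        ("SYN_BOAT", {"타다"}),
        ("SYN_PEAR", {"먹다"}),
    ],
}


def disambiguate(word: str, context: list[str]) -> str:
    """Toy semantic analysis: pick the sense whose cue words overlap most with the sentence."""
    senses = LEXICON[word]
    best = max(senses, key=lambda sense: len(sense[1] & set(context)))
    return best[0]


def to_concepts(tokens: list[str]) -> list[str]:
    """Convert each analyzed token into a concept ID where the network allows it."""
    concepts = []
    for token in tokens:
        senses = LEXICON.get(token)
        if not senses:
            concepts.append(token)  # assumption: unknown words keep their surface form
        elif len(senses) == 1:
            concepts.append(senses[0][0])  # single concept: convert directly
        else:
            concepts.append(disambiguate(token, tokens))  # several senses: analyze first
    return concepts


print(to_concepts(["집", "사다"]))
print(to_concepts(["배", "아프다"]))
```

The single-sense branch never calls the disambiguator, which mirrors the point in the text that semantic analysis is needed only for homonyms and polysemous words.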

FIG. 7 is a detailed example of a case in which the meaning determination of FIG. 6 is required and a case in which it is not.

The word '컴퓨터' (computer) is registered in the Korean WordNet with only one sense.

Therefore '컴퓨터' can be converted to 'SYN02971359' without a separate semantic analysis step.

By contrast, '배' is a polysemous word registered in the Korean WordNet with several senses.

In this case, semantic analysis is performed using the sentence that contains '배'.

For example, in the sentence '배가 아프다' ("my stomach hurts"), '배' is used in the sense of '배04', so '배' is converted to 'SYN02971359'.
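
The conversion flow described above (direct conversion for a single-sense word, semantic analysis first for a word with several senses) can be illustrated with a minimal Python sketch. It is not the patented implementation: the dictionary `LEXICON` is a toy stand-in for the Korean lexical semantic network, the cue-overlap rule in `disambiguate` is a deliberately crude substitute for the semantic analysis step (S604), and every identifier for '배' as well as the sense labels other than '배04' are hypothetical placeholders. Only 'SYN02971359' for '컴퓨터' and the sense label '배04' come from the text.

```python
# Toy stand-in for the Korean lexical semantic network: word -> {sense label: concept ID}.
# 'SYN02971359' and the label '배04' come from the text; everything else is hypothetical.
LEXICON = {
    "컴퓨터": {"컴퓨터01": "SYN02971359"},
    "배": {
        "배01": "SYN_BOAT",   # hypothetical ID
        "배03": "SYN_PEAR",   # hypothetical ID
        "배04": "SYN_BELLY",  # hypothetical ID
    },
}

# Hypothetical context cues per sense, used only to imitate semantic analysis (S604).
SENSE_CUES = {
    "배01": {"타다", "바다"},
    "배03": {"먹다", "과일"},
    "배04": {"아프다", "고프다"},
}


def disambiguate(word, context):
    """Pick the sense whose cue words overlap most with the sentence."""
    senses = LEXICON[word]
    return max(senses, key=lambda s: len(SENSE_CUES.get(s, set()) & set(context)))


def to_concept(word, context):
    """Convert one analyzed word to a concept ID."""
    senses = LEXICON.get(word)
    if not senses:
        return word  # assumption: words absent from the network are left unchanged
    if len(senses) == 1:
        return next(iter(senses.values()))  # single concept: convert directly
    return senses[disambiguate(word, context)]  # several senses: analyze first (S604, S605)


def convert_sentence(lemmas):
    """Apply concept conversion to every word of a morphologically analyzed sentence."""
    return [to_concept(word, lemmas) for word in lemmas]
```

Calling `convert_sentence(["배", "아프다"])` selects the '배04' sense because '아프다' is one of its cue words, while `convert_sentence(["컴퓨터"])` skips disambiguation entirely.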

FIG. 8 is a diagram showing static conversion in the concept conversion.

FIG. 7 shows dynamic conversion, in which concept conversion is performed on each individual sentence as the raw corpus is input for training the word embedding.

By contrast, static conversion treats every word in the raw corpus as a concept registered in the Korean lexical semantic network and, during the training process for word embedding, converts them in a batch through morphological analysis (S801), meaning determination (S802), and semantic analysis (S803).
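
One way to read the difference between the two modes is lazy, per-sentence conversion versus a single up-front pass over the whole corpus. The sketch below illustrates that reading only; the function names, the `TABLE` lookup, and `toy_convert` are hypothetical and stand in for the full analysis pipeline.

```python
from typing import Callable, Iterable, Iterator, List

Sentence = List[str]
Converter = Callable[[Sentence], Sentence]


def dynamic_conversion(corpus: Iterable[Sentence], convert: Converter) -> Iterator[Sentence]:
    """Dynamic mode: convert each sentence only when it is fed to training."""
    for sentence in corpus:
        yield convert(sentence)


def static_conversion(corpus: Iterable[Sentence], convert: Converter) -> List[Sentence]:
    """Static mode: convert the whole corpus once, in a batch, and keep the result."""
    return [convert(sentence) for sentence in corpus]


# Hypothetical single-sense table, used only to exercise the two modes.
TABLE = {"컴퓨터": "SYN02971359"}


def toy_convert(sentence: Sentence) -> Sentence:
    """Replace each token that has a table entry with its concept ID."""
    return [TABLE.get(token, token) for token in sentence]


corpus = [["컴퓨터", "사다"], ["집", "사다"]]
static_result = static_conversion(corpus, toy_convert)
dynamic_result = list(dynamic_conversion(corpus, toy_convert))
```

Both modes yield the same converted sentences; under this reading they differ only in when the conversion cost is paid and whether the converted corpus is stored.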

FIG. 9 is a flowchart showing a method of using the trained word embedding according to the present invention.

The word embedding result of the present invention is a kind of concept vector, trained after words have been converted to concepts.

Therefore, to obtain the word vector for an input word, the concept conversion process is also required at the usage stage.

First, as in the training stage, each word is converted to its corresponding concept, and that concept is then mapped to its entry in the word embedding result.

FIG. 10 is a diagram showing an example of handling out-of-vocabulary (unregistered) words when using the word embedding result.

Ordinarily, if an input word is out-of-vocabulary when the word embedding result is used, there is no choice but to perform word embedding again for that word.

In the present invention, however, it suffices to find words related to the out-of-vocabulary word (hypernyms, hyponyms, synonyms) in the Korean lexical semantic network and borrow the embedding result of those words.

For example, when the word embedding result for '가랑비' (drizzle) is needed, '가랑비' is an out-of-vocabulary word and is absent from the existing word embedding result.

However, because '가랑비' is a hyponym of '비4' (rain), borrowing the vector of '비4' is more effective than assigning an arbitrary vector.
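
The borrowing strategy can be sketched as a lookup with a fallback through related entries. This is an illustration under stated assumptions, not the patented implementation: `EMBEDDINGS` holds made-up numbers, `RELATED` is a toy stand-in for the relations stored in the Korean lexical semantic network, and the search order (synonyms, then hypernyms, then hyponyms) is an arbitrary choice that the text does not specify.

```python
from typing import Dict, List, Optional

Vector = List[float]

# Made-up vector for illustration; only the entry name '비4' comes from the text.
EMBEDDINGS: Dict[str, Vector] = {"비4": [0.2, -0.1, 0.7]}

# Toy relation table: '가랑비' is a hyponym of '비4', so '비4' is its hypernym.
RELATED: Dict[str, Dict[str, List[str]]] = {
    "가랑비": {"synonyms": [], "hypernyms": ["비4"], "hyponyms": []},
}


def lookup(entry: str,
           embeddings: Dict[str, Vector],
           related: Dict[str, Dict[str, List[str]]]) -> Optional[Vector]:
    """Return the vector for `entry`, borrowing from a related entry if it is missing."""
    if entry in embeddings:
        return embeddings[entry]
    relations = related.get(entry, {})
    # Assumed preference order; the text only lists the three relation types.
    for kind in ("synonyms", "hypernyms", "hyponyms"):
        for neighbor in relations.get(kind, []):
            if neighbor in embeddings:
                return embeddings[neighbor]
    return None  # nothing to borrow from
```

With these tables, `lookup("가랑비", EMBEDDINGS, RELATED)` returns the vector stored for '비4', and an entry with no related words in the embedding result yields `None`.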

As described above, in order to increase the efficiency and accuracy of word embedding, the word embedding apparatus and method using Korean WordNet-based knowledge-powered deep learning according to the present invention include a configuration that raises the efficiency of word embedding by decomposing words into a subword model and analyzing them to calculate similarity, and a configuration that raises the quality of word vectors by converting words, which are high-dimensional data, into concepts, which are low-dimensional data, while preserving the semantic characteristics of the words.

It will be understood from the above description that the present invention may be implemented in modified forms without departing from its essential characteristics.

Therefore, the disclosed embodiments should be considered illustrative rather than restrictive. The scope of the present invention is indicated by the claims rather than by the foregoing description, and all differences within a scope equivalent to the claims should be construed as being included in the present invention.

21. Set definition unit
22. Vector calculator
23. Similarity calculator

Claims

A word embedding apparatus using Korean WordNet-based knowledge-powered deep learning, comprising:
a set definition unit that defines a compound word composed of two or more morphemes as a set of substantive morphemes;
a vector calculator that calculates a word vector for each n-gram using the skip-gram method; and
a similarity calculator that calculates the similarity between a word and a context in the word embedding process using the substantive morpheme set,
wherein a compound word composed of two or more morphemes is defined as a set of substantive morphemes so that morphological characteristics are reflected without error.

The apparatus of claim 1, wherein, in the word embedding process using the substantive morpheme set, the scoring function of a word and a context is defined by a formula (not reproduced in this text) whose terms are the vector of each substantive morpheme belonging to the set of substantive morphemes appearing in the word, and the vector of the context word.
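
As a reading aid for the claim above, one plausible form of the scoring function is given here. It assumes the usual subword-model pattern, in which the score is the sum of the dot products between each part vector and the context vector. The symbols $s$, $w$, $c$, $M_w$, $\mathbf{z}_m$, and $\mathbf{v}_c$ are chosen for illustration and are not taken from the original publication.

```latex
s(w, c) = \sum_{m \in M_w} \mathbf{z}_m^{\top} \mathbf{v}_c
```

Here $M_w$ would be the set of substantive morphemes appearing in the word $w$, $\mathbf{z}_m$ the vector of the substantive morpheme $m$, and $\mathbf{v}_c$ the vector of the context word $c$.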

The apparatus of claim 1, further comprising, in order to convert words, which are high-dimensional data, into concepts, which are low-dimensional data, while preserving the semantic characteristics of the words in the word embedding process:
a morphological analysis unit that performs morphological analysis when a sentence is input;
a semantic analysis unit that performs semantic analysis when the word determination and meaning determination cannot be made for a morphologically analyzed word; and
a concept conversion unit that performs concept conversion for each word using the Korean lexical semantic network.

The apparatus of claim 3, wherein, if only one concept exists when a word is looked up in the Korean lexical semantic network, the word is converted directly to that concept, and
if the word is a homonym or a polysemous word that can be used in several senses, the sense of the word is first determined through semantic analysis and the word is then converted to the determined concept.

The apparatus of claim 3, wherein the concept conversion unit performs either
dynamic conversion, in which concept conversion is performed on individual sentences when the raw corpus is input for training the word embedding, or
static conversion, in which all words in the raw corpus are converted in a batch to concepts registered in the Korean lexical semantic network during the training process for word embedding.

A word embedding method using Korean WordNet-based knowledge-powered deep learning, comprising:
a set definition step of defining a compound word composed of two or more morphemes as a set of substantive morphemes;
a vector calculation step of calculating a word vector for each n-gram using the skip-gram method; and
a similarity calculation step of calculating the similarity between a word and a context in the word embedding process using the substantive morpheme set,
wherein a compound word composed of two or more morphemes is defined as a set of substantive morphemes so that morphological characteristics are reflected without error.

The method of claim 6, further comprising, in order to convert words, which are high-dimensional data, into concepts, which are low-dimensional data, while preserving the semantic characteristics of the words in the word embedding process:
a morphological analysis step of performing morphological analysis when a sentence is input;
a semantic analysis step of performing semantic analysis when the word determination and meaning determination cannot be made for a morphologically analyzed word; and
a concept conversion step of performing concept conversion for each word using the Korean lexical semantic network.

The method of claim 7, wherein, if only one concept exists when a word is looked up in the Korean lexical semantic network, the word is converted directly to that concept, and
if the word is a homonym or a polysemous word that can be used in several senses, the sense of the word is first determined through semantic analysis and the word is then converted to the determined concept.

The method of claim 7, wherein the concept conversion step performs either
dynamic conversion, in which concept conversion is performed on individual sentences when the raw corpus is input for training the word embedding, or
static conversion, in which all words in the raw corpus are converted in a batch to concepts registered in the Korean lexical semantic network during the training process for word embedding.

The method of claim 7, further comprising a step of using the word embedding result,
wherein, in the step of using the word embedding result,
in order to obtain the word vector for an input word, the word is converted to its corresponding concept as in the training stage, and that concept is then mapped to its entry in the word embedding result.

The method of claim 10, wherein, if the input word is an out-of-vocabulary word when the word embedding result is used,
hypernyms, hyponyms, and synonyms related to the out-of-vocabulary word are found in the Korean lexical semantic network, and the embedding result of those words is borrowed.