KR101999152B1

KR101999152B1 - English text formatting method based on convolution network

Info

Publication number: KR101999152B1
Application number: KR1020170182571A
Authority: KR
Inventors: 한욱신; 김현지; 박용희; 김경민; 소병훈
Original assignee: 포항공과대학교 산학협력단
Priority date: 2017-12-28
Filing date: 2017-12-28
Publication date: 2019-07-11
Also published as: KR20190080234A

Abstract

The present invention relates to a convolutional neural network based English text formatting method that converts a set of documents consisting of English text into structured data having a predefined data schema. The method comprises: a) preprocessing the input text to convert it into real-valued vector form; b) extracting candidate keywords from the input text; c) indexing the sentences that contain each candidate keyword; d) extracting features for the candidate keywords; e) predicting the label of each keyword using the extracted features as input data; and f) mapping keywords to each attribute by performing negative-sampling-based neural network training on the computed keyword attributes.

Description

Convolutional neural network based English text formatting method {English text formatting method based on convolution network}

The present invention relates to a convolutional neural network based English text formatting method and, more particularly, to a method that uses a convolutional neural network model to format text efficiently even when keywords are not given as input and an arbitrary data schema with three or more attributes is given.

In general, text data formatting means finding, given a text and a data schema, the keywords in the input text that best correspond to the given data schema.

Text data formatting is a key technology in fields such as natural language processing, artificial intelligence, and databases, and is used in many applications. Many neural network based algorithms have recently been proposed for the formatting problem, but all existing algorithms assume that the candidate keywords are given as input, and most can handle only binary relations, which have two attributes.

Prior art document 1 (Mary Elaine Califf and Raymond J. Mooney: Relational Learning of Pattern-Match Rules for Information Extraction. AAAI/IAAI 1999: 328-334) proposes RAPIER, a rule-learning-based formatting technique that, given English text and a data schema, converts the text into tuples of a relational database. RAPIER defines three patterns for learning rules.

The first is a pre-filler pattern, which matches the text immediately preceding the filler. The second is the filler pattern, which matches the actual filler text. The third is a post-filler pattern, which matches the text immediately following the filler.

RAPIER uses these three patterns to learn rules for data formatting. Based on pattern rule learning, it automatically recognizes patterns that appear frequently in text data and extracts structured data, so it achieves higher accuracy than other algorithms. However, because it cannot handle patterns that never appeared in the data, its recall is very low.

With the recent development of knowledge base systems, the amount of data for supervised learning has increased, and neural network based formatting techniques are being actively studied. These studies can be broadly divided into binary relation classification, slot filling, and N-ary relation extraction.

Studies on binary relation classification that introduce methods based on convolutional neural networks include prior art document 2 (Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, Jun Zhao: Relation Classification via Convolutional Deep Neural Network. COLING 2014: 2335-2344), prior art document 3 (Daojian Zeng, Kang Liu, Yubo Chen, Jun Zhao: Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks. EMNLP 2015: 1753-1762), and prior art document 4 (Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, Maosong Sun: Neural Relation Extraction with Selective Attention over Instances. ACL (1) 2016).

Methods based on recurrent neural networks and long short-term memory (LSTM) networks are introduced in prior art document 5 (Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, Houfeng Wang: A Dependency-Based Neural Network for Relation Classification. ACL (2) 2015: 285-290) and prior art document 6 (Daniil Sorokin, Iryna Gurevych: Context-Aware Representations for Knowledge Base Relation Extraction. EMNLP 2017: 1785-1790).

Prior art document 2 is a convolutional neural network based binary relation classification method. It extracts features for two nouns given as input and learns sentence-level features with a convolutional neural network to predict the relation between the two nouns.

Prior art document 3 proposes an algorithm that applies piecewise max pooling to the convolutional neural network approach, reducing the errors of earlier feature extraction methods.

Prior art document 4 uses a convolutional neural network approach with selective attention, improving the accuracy of the learned model while also addressing the problem of wrongly labeled training data.

Prior art document 5 finds the shortest path between two keywords in the dependency-based parse tree of a sentence and then uses a dependency-based recurrent neural network to analyze the meaning of the words on that shortest path, improving relation classification performance.

Prior art document 6 uses a long short-term memory network approach. It defines as context the data in the dataset that contains only one of the two keywords, and feeds this context to the neural network, proposing a learning model capable of more precise relation extraction than before.

Although various neural network models have thus been proposed for binary relation classification, all of these techniques are limited to extracting the relation between two keywords.

Because they assume that the schema has two attributes, they cannot handle structured data with three or more attributes.

In addition, the studies described above assume that the two keywords are given as input and that both appear in the same sentence, so the formatting they can perform is very limited.

Next, slot filling is the problem of finding, given a text corpus, a keyword, and one attribute, another keyword in the corpus that has that attribute relation with the given keyword.

Prior art document 7 (Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, Christopher D. Manning: Position-aware Attention and Supervised Data Improve Slot Filling. EMNLP 2017: 35-45) uses a long short-term memory network and exploits keyword position information to improve the extraction of other keywords that have an attribute relation with the given keyword. Slot filling can extract N-ary relations, but it is limited to the case where the keyword is given as input.

Meanwhile, N-ary relation extraction using graph-based long short-term memory networks (Graph LSTM) has recently been proposed.

Prior art document 8 (Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, Wen-tau Yih: Cross-Sentence N-ary Relation Extraction with Graph LSTMs. TACL 5: 101-115 (2017)) addresses N-ary relation extraction, the problem of finding, given a text and a list of N-ary relations, which relation in the list a tuple of N keywords in the text maps to. However, N-ary relation extraction also assumes that the tuple of N keywords is given as input, so it cannot be performed when that input is not available.

Dark data is the vast body of data held in unstructured form, such as the official documents of public institutions and companies, and web pages. Of the large volume of data a company holds, only 12% is actually used (smart data) [16]. The remaining 88% (dark data) is mostly unstructured, cannot be processed by a commercial DBMS, and therefore goes almost unused. Formatting unstructured data is the problem of converting given unstructured data into structured data that is easier to use; the unstructured data may include text, photographs, speech, and so on.

FIG. 1 shows an example of converting unstructured data into structured data.

Referring to FIG. 1, (a) is unstructured text data and (b) is the schema of the structured data. In this example the unstructured text is English text forming sentences. To extract the important text from these sentences and fit it to a structured schema such as (b), which tabulates persons and countries, the text for each person's name and country must be extracted.

Formatting (a) against (b) yields the structured data of (c). If unstructured data can be converted into structured data in this way, the 88% of data that is dark and underused can be put to a wide range of uses. It also becomes possible to manage the analysis software for structured and unstructured data in an integrated way, and the full functionality of a commercial DBMS can be applied to the formatted data, enabling more advanced analysis.

Techniques for formatting unstructured text data are widely used in human trafficking detection, genetics-based data mining, and the analysis of geological and paleontological data. In human trafficking detection, formatting is used to combine phone numbers and e-mail addresses found in sex advertisement text with the location where the advertisement appeared, in order to identify the scope and network of the sex trade conducted by the advertiser.

In genetics, genes, gene variants, and phenotypes are extracted from literature written as unstructured text and stored as structured data; the relationships among genes inferred from analyzing these data are applied to clinical genetic diagnosis, reproductive counseling, and the like.

Formatting techniques are also used for in-depth analysis of unstructured data in geology and paleontology.

Recently, neural network based techniques that analyze unstructured text data and extract usable data have been actively studied in many fields. Unlike existing rule-based techniques, neural network based formatting does not require domain knowledge and can be applied broadly to a variety of data schemas. It also achieves a higher level of recall than rule-based methods.

However, as described above, many prior studies are limited to extracting a binary relation between two keywords. Because these studies assume that the data schema has two attributes, they cannot extract the varied real-world structured data that has three or more attributes.

These studies also assume that the target keywords are given as input at inference time. In real applications, however, it is often unknown which of the many keywords in a text are the important ones.

Finally, these studies take as input a single sentence containing both keywords and analyze it to extract the relation between the two keywords; they do not consider cases where the two keywords appear repeatedly in the text or where they do not appear together in one sentence. Existing methods therefore cannot be used as they are to extract the entities for a given data schema from text.

As noted above, slot filling is a line of prior work that can handle data schemas with three or more attributes. Slot filling is the problem of determining, given a document corpus, a keyword, and a list of attributes as input, which attribute the keyword is an entity of. Like the cases above, however, these studies assume that the keyword is given as input. Moreover, because the attribute of a keyword is determined uniquely over the whole document collection, it is difficult to extend this approach to the case where keywords are not given as input and the attribute of a keyword varies from text to text.

Work on extracting N-ary relations from text consisting of several sentences has recently been proposed, but it too is limited in that the N keywords are given as input.

The technical problem to be solved by the present invention is to provide an efficient formatting technique that, given English text and a table schema, extracts features for each keyword appearing in the text and maps keywords to each attribute of the data schema using a neural network that takes the extracted features as input.

In particular, it is to provide a convolutional neural network based English text formatting method that, when handling structured data with three or more attributes (N-ary relations) rather than binary relation extraction, can take into account the relations between candidate keywords belonging to different sentences.

Another technical problem to be solved by the present invention is to provide a convolutional neural network based English text formatting method that can format data even when the text consists of several sentences and the target keywords are not given as input.

To solve the above problems, the convolutional neural network based English text formatting method, which converts a set of documents consisting of English text into structured data having a predefined data schema, comprises: a) preprocessing the input text to convert it into real-valued vector form; b) extracting candidate keywords from the input text; c) indexing the sentences that contain each candidate keyword; d) extracting features for the candidate keywords; e) predicting the label of each keyword using the extracted features as input data; and f) mapping keywords to each attribute by performing negative-sampling-based neural network training on the extracted keyword features.

According to an embodiment of the present invention, step a) may include a tokenization step of splitting the English text, which is unstructured data, into sentences and then identifying each word in each sentence, and an embedding step of representing each identified word as a vector and vectorizing the part-of-speech tag and named-entity class of each word.

According to an embodiment of the present invention, step b) may extract the keywords contained in every sentence of the text using windows whose size ranges from a minimum of 1 to a maximum of k_max (where k is a positive integer), and, when a keyword sequence appears several times in the text, remove the duplicate sequences to obtain the keyword set.
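The windowed extraction of step b) can be sketched as follows. This is a minimal illustration rather than the patent's implementation; the function name `extract_candidates` and the assumption that sentences are already tokenized into word lists are both hypothetical.

```python
from typing import List, Tuple


def extract_candidates(sentences: List[List[str]], k_max: int) -> List[Tuple[str, ...]]:
    """Collect every word sequence of length 1..k_max, keeping each sequence once.

    `sentences` is a list of token lists (an assumed input format).
    """
    seen = set()
    candidates = []
    for tokens in sentences:
        for k in range(1, k_max + 1):               # window sizes 1..k_max
            for start in range(len(tokens) - k + 1):
                seq = tuple(tokens[start:start + k])
                if seq not in seen:                 # drop repeated sequences
                    seen.add(seq)
                    candidates.append(seq)
    return candidates


sentences = [["Kim", "lives", "in", "Korea"], ["Kim", "works"]]
candidates = extract_candidates(sentences, k_max=2)
```

With `k_max=2`, the repeated single word `Kim` is kept only once, while the two-word window `("Kim", "works")` is still produced from the second sentence.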

According to an embodiment of the present invention, step c) may extract from the input text the set of all sentences that contain each extracted keyword.

According to an embodiment of the present invention, the set of sentences may be extracted by storing, for each keyword, the index information of the sentences containing that keyword while the keyword set is being built, and then using that sentence index information.
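One way to record sentence indices while the keyword set is built is an inverted index from keyword to sentence numbers. The sketch below is an assumption about the data structure (the patent does not specify one); `build_keyword_index` and its input format are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_keyword_index(sentences: List[List[str]], k_max: int) -> Dict[Tuple[str, ...], List[int]]:
    """Map each candidate keyword to the sorted indices of the sentences containing it."""
    index = defaultdict(set)
    for sent_id, tokens in enumerate(sentences):
        for k in range(1, k_max + 1):
            for start in range(len(tokens) - k + 1):
                # Store the sentence index together with the keyword.
                index[tuple(tokens[start:start + k])].add(sent_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}


sentences = [["Kim", "lives", "in", "Korea"], ["Kim", "works"]]
keyword_index = build_keyword_index(sentences, k_max=2)
# Sentences containing a keyword are then a simple lookup.
kim_sentences = [sentences[i] for i in keyword_index[("Kim",)]]
```

Building the index in the same pass as candidate extraction avoids rescanning the text for every keyword later.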

According to an embodiment of the present invention, step d) may include a word feature extraction process and a sentence feature extraction process. The word feature extraction process obtains the word embedding and tag embedding vectors of a keyword and then applies an attention technique to the two vectors to compute an embedding vector, so that the size of the keyword's embedding vector remains constant regardless of the number of words in the keyword. The sentence feature extraction process obtains sentence features by using the keyword's word embedding vectors to compute the relative position of each word with respect to the keyword in each sentence.

According to an embodiment of the present invention, step e) may compute the attribute to which a keyword belongs, and its score, by using the features as input to softmax regression.
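Softmax regression turns a feature vector into one probability per label and picks the highest. The following is a minimal pure-Python sketch; the weights, biases, and label names are made-up illustrative values, not parameters from the patent.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(features: Sequence[float],
                  weights: Sequence[Sequence[float]],
                  biases: Sequence[float],
                  labels: Sequence[str]) -> Tuple[str, float]:
    """Return the most probable label and its probability (used as the score)."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]


# Hypothetical parameters: one weight row per label.
labels = ["person", "country", "NA"]
weights = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
biases = [0.0, 0.0, 0.0]
label, score = predict_label([1.0, 0.0], weights, biases, labels)
```

In practice the weights and biases would be learned jointly with the feature extractors; here they are fixed only to make the example self-contained.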

According to an embodiment of the present invention, step f) may include: setting NA as the label of any target keyword, among the selected keywords, that does not match a value stored in the structured data; adjusting, through negative sampling among the labeled keywords, the ratio between the number of keywords labeled with a data schema attribute and the number of keywords labeled NA; and training, on the selected target keywords and labels, a neural network comprising a convolutional neural network based sentence feature and an attention neural network based keyword feature.
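The labeling and negative-sampling steps can be sketched as below. This is an illustration under stated assumptions: exact string matching against the structured record, a `neg_per_pos` ratio parameter, and the function names are all hypothetical; the patent does not fix the ratio or the matching rule.

```python
import random
from typing import Dict, List, Tuple

NA = "NA"


def label_candidates(candidates: List[str], record: Dict[str, str]) -> List[Tuple[str, str]]:
    """Label a keyword with its schema attribute if it matches a stored value, else NA."""
    value_to_attr = {value: attr for attr, value in record.items()}
    return [(kw, value_to_attr.get(kw, NA)) for kw in candidates]


def negative_sample(labeled: List[Tuple[str, str]], neg_per_pos: float,
                    seed: int = 0) -> List[Tuple[str, str]]:
    """Keep all attribute-labeled keywords and at most neg_per_pos NA keywords per positive."""
    positives = [item for item in labeled if item[1] != NA]
    negatives = [item for item in labeled if item[1] == NA]
    n_neg = min(len(negatives), int(neg_per_pos * len(positives)))
    rng = random.Random(seed)          # seeded for reproducibility
    return positives + rng.sample(negatives, n_neg)


record = {"person": "Kim", "country": "Korea"}   # one row of the structured data
labeled = label_candidates(["Kim", "lives", "in", "Korea", "Seoul", "works"], record)
training_set = negative_sample(labeled, neg_per_pos=1.0)
```

Because most windowed candidates match no stored value, NA examples would otherwise dominate training; downsampling them controls the class balance.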

According to an embodiment of the present invention, the method may further include: inferring structured data from a new text document based on the trained neural network; selecting, in the inference step, the keywords in the document by the method of step b); computing, for each selected keyword, a relevance score against each attribute of the data schema and against NA; ranking the computed relevance scores to map the target keyword to one data schema attribute or to NA; when the target keyword appears several times in a single document, mapping it in the mapping step to the data schema attribute or NA with the highest relevance score; and mapping the target keyword to NA when its relevance score does not exceed a threshold in the inference step.
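The ranking and threshold rules at inference time can be sketched as follows. The score dictionaries stand in for the trained network's output; the function name `map_keyword` and the example scores and threshold are assumptions for illustration.

```python
from typing import Dict, List, Tuple

NA = "NA"


def map_keyword(occurrence_scores: List[Dict[str, float]],
                threshold: float) -> Tuple[str, float]:
    """Map a keyword to one attribute or NA.

    `occurrence_scores` holds one {label: relevance score} dict per occurrence
    of the keyword in the document. The label with the highest score over all
    occurrences wins; if that score does not exceed `threshold`, the keyword
    is mapped to NA.
    """
    best_label, best_score = NA, float("-inf")
    for scores in occurrence_scores:
        label, score = max(scores.items(), key=lambda item: item[1])
        if score > best_score:
            best_label, best_score = label, score
    if best_score <= threshold:
        return NA, best_score
    return best_label, best_score


scores = [
    {"person": 0.6, "country": 0.1, "NA": 0.3},   # first occurrence
    {"person": 0.2, "country": 0.7, "NA": 0.1},   # second occurrence
]
accepted = map_keyword(scores, threshold=0.5)
rejected = map_keyword(scores, threshold=0.8)
```

Here the same keyword occurs twice; the second occurrence's `country` score is the highest overall, so it decides the mapping unless the threshold rejects it.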

According to an embodiment of the present invention, the neural network may include as features: the vector of the sentence containing the target keyword, computed with a convolutional neural network; the embedding vector of the target keyword; the embedding vector of the target keyword's part-of-speech (morphological analysis) value and named-entity recognition value; the embedding vector of the part-of-speech and named-entity recognition values of the word immediately before the target keyword in the sentence; and the embedding vector of the part-of-speech and named-entity recognition values of the word immediately after the target keyword in the sentence.

According to an embodiment of the present invention, in the mapping step, for a data schema attribute that stores a single record, when several target keywords are mapped to that attribute, the target keywords may be ranked by relevance score and only the one with the highest relevance score stored.
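The single-record rule can be sketched as a small post-processing step over the mapped keywords. The tuple layout, the `single_valued` set, and the function name are hypothetical choices made for this example.

```python
from typing import Dict, List, Set, Tuple

NA = "NA"


def resolve_attributes(mappings: List[Tuple[str, str, float]],
                       single_valued: Set[str]) -> Dict[str, List[Tuple[str, float]]]:
    """Group (keyword, attribute, score) mappings by attribute.

    Attributes listed in `single_valued` keep only their highest-scoring
    keyword; other attributes keep every mapped keyword.
    """
    result: Dict[str, List[Tuple[str, float]]] = {}
    for keyword, attr, score in mappings:
        if attr == NA:
            continue                                  # NA keywords are not stored
        if attr in single_valued:
            current = result.get(attr)
            if current is None or score > current[0][1]:
                result[attr] = [(keyword, score)]     # keep the top-ranked keyword only
        else:
            result.setdefault(attr, []).append((keyword, score))
    return result


mappings = [
    ("Kim", "person", 0.9), ("Lee", "person", 0.6),
    ("Korea", "country", 0.8), ("Japan", "country", 0.7),
    ("in", "NA", 0.9),
]
table_row = resolve_attributes(mappings, single_valued={"person"})
```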

According to an embodiment of the present invention, when the target keyword consists of two or more words, the embedding vector of the target keyword among the features may be computed by applying an attention neural network to the embedding vectors of the individual words of the target keyword.
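The patent does not give the form of the attention network. As one possible reading, the sketch below uses dot-product attention against a query vector (assumed to be a learned parameter) to pool any number of word vectors into one fixed-size vector; `attention_pool` and the example values are hypothetical.

```python
import math
from typing import List, Sequence


def attention_pool(word_vectors: Sequence[Sequence[float]],
                   query: Sequence[float]) -> List[float]:
    """Weighted average of word vectors; weights are a softmax of dot products with `query`."""
    scores = [sum(v * q for v, q in zip(vec, query)) for vec in word_vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(word_vectors[0])
    # The output dimension equals the word-vector dimension, whatever the word count.
    return [sum(w * vec[j] for w, vec in zip(weights, word_vectors)) for j in range(dim)]


two_words = attention_pool([[1.0, 0.0], [0.0, 1.0]], query=[0.0, 0.0])
three_words = attention_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], query=[0.1, 0.2])
```

Both calls return a vector of length 2, which is the property the text relies on: the keyword embedding has the same size regardless of how many words the keyword contains.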

According to an embodiment of the present invention, the sentence vector among the features may be computed with a single-layer convolutional neural network that takes as input the embedding vector of each word in the sentence and the embedding vector of each word's position relative to the target keyword in the sentence.
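The inputs to the sentence encoder can be sketched in plain Python. The position convention (0 inside the keyword, negative before it, positive after it), the use of max pooling over the convolution outputs, and all function names are assumptions; the patent states only that a single convolution layer receives word and relative-position embeddings.

```python
from typing import Dict, List, Sequence


def relative_positions(n_tokens: int, kw_start: int, kw_end: int) -> List[int]:
    """Signed distance of each token to the keyword span [kw_start, kw_end)."""
    positions = []
    for i in range(n_tokens):
        if i < kw_start:
            positions.append(i - kw_start)
        elif i >= kw_end:
            positions.append(i - kw_end + 1)
        else:
            positions.append(0)
    return positions


def build_inputs(word_vectors: Sequence[List[float]], positions: Sequence[int],
                 position_table: Dict[int, List[float]]) -> List[List[float]]:
    """Concatenate each word embedding with the embedding of its relative position."""
    return [vec + position_table[p] for vec, p in zip(word_vectors, positions)]


def conv_max_pool(inputs: Sequence[Sequence[float]],
                  filters: Sequence[Sequence[float]], width: int) -> List[float]:
    """Single convolution layer over token windows, followed by max pooling per filter.

    Each filter is a flat weight vector of length width * embedding_dim.
    Requires len(inputs) >= width.
    """
    windows = [[x for vec in inputs[i:i + width] for x in vec]
               for i in range(len(inputs) - width + 1)]
    return [max(sum(w * x for w, x in zip(f, win)) for win in windows)
            for f in filters]


positions = relative_positions(5, kw_start=1, kw_end=3)
inputs = build_inputs([[1.0], [0.0], [1.0]], [-1, 0, 1],
                      {-1: [0.0], 0: [1.0], 1: [1.0]})
sentence_vector = conv_max_pool(inputs, filters=[[1.0, 1.0, 1.0, 1.0]], width=2)
```

The resulting sentence vector has one entry per filter, so its size does not depend on sentence length.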

According to an embodiment of the present invention, the embedding vectors of the part-of-speech and named-entity recognition values of the target keyword and of the words before and after it may be variable during neural network training.

According to an embodiment of the present invention, the embedding vectors of the relative word positions, which are among the inputs to the convolutional neural network, may be variable during neural network training.

The convolutional neural network based English text formatting method of the present invention can format data having three or more attributes.

It can also format data effectively even when the text consists of several sentences and the target keywords are not given as input.

FIG. 1 shows an example of converting unstructured data into structured data.
FIG. 2 is a flowchart of a convolutional neural network based English text formatting method according to a preferred embodiment of the present invention.
FIG. 3 shows an example of the preprocessing step of FIG. 2.
FIG. 4 shows the neural network based formatting algorithm applied in the present invention.
FIG. 5 illustrates the features of a keyword c_i.
FIG. 6 illustrates the features of a sentence S_i.

Hereinafter, the convolutional neural network based English text formatting method of the present invention is described in detail with reference to the accompanying drawings.

The embodiments of the present invention are provided to explain the invention more completely to those of ordinary skill in the art. The embodiments described below may be modified into various other forms, and the scope of the invention is not limited to them. Rather, these embodiments are provided to make this disclosure more thorough and complete and to fully convey the concept of the invention to those skilled in the art.

The terminology used in this specification is for describing particular embodiments and is not intended to limit the invention. As used in this specification, singular forms may include plural forms unless the context clearly indicates otherwise. The terms "comprise" and/or "comprising", when used in this specification, specify the presence of the stated shapes, numbers, steps, operations, members, elements, and/or groups thereof, and do not preclude the presence or addition of one or more other shapes, numbers, operations, members, elements, and/or groups. As used in this specification, the term "and/or" includes any one of, and all combinations of one or more of, the listed items.

Although the terms first, second, and so on are used herein to describe various members, regions, and/or parts, it is evident that these members, components, regions, layers, and/or parts are not limited by these terms. The terms do not imply any particular order, position, or superiority, and are used only to distinguish one member, region, or part from another. Thus, a first member, region, or part described below could be termed a second member, region, or part without departing from the teachings of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to drawings that schematically illustrate them. In the drawings, variations of the illustrated shapes may be expected, for example depending on manufacturing techniques and/or tolerances. Accordingly, embodiments of the present invention should not be construed as limited to the particular shapes of the regions illustrated herein, and should include, for example, changes in shape that result from manufacturing.

FIG. 2 is a flowchart of a convolutional neural network based English text formatting method according to a preferred embodiment of the present invention.

Referring to FIG. 2, the convolutional neural network based English text formatting method according to a preferred embodiment of the present invention includes: preprocessing the input text to convert it into a real-valued vector format (S10); extracting candidate keywords from the input text (S20); indexing the sentences that contain each candidate keyword (S30); extracting features for the candidate keywords (S40); predicting the label of each keyword using the extracted features as input data (S50); and mapping keywords to each attribute by performing negative sampling based neural network training on the computed keyword attributes (S60).

Hereinafter, the configuration and operation of the method, configured as above according to a preferred embodiment of the present invention, for extracting tuples of a relational database from English text based on a convolutional neural network will be described in more detail.

First, in step S10, the data to be processed is preprocessed.

FIG. 3 is an exemplary diagram for explaining step S10.

The preprocessing step (S10) used in the present invention includes a tokenization step (S11) and a step of embedding the tokenized data (S12).

The tokenization step (S11) splits the English text, which is unstructured data, into sentences and then identifies each word in each sentence using a tokenization technique.

For sentence splitting and tokenization, the tokenizer of Stanford CoreNLP (Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, David McClosky: The Stanford CoreNLP Natural Language Processing Toolkit. ACL (System Demonstrations) 2014: 55-60) may be used.

Through this process, in an input text T = (S_1, ..., S_{n_s}) containing n_s sentences, each sentence S_i = (w_{i,1}, w_{i,2}, ..., w_{i,m_i}) containing m_i words can be transformed into a sequence of embedding vectors E_i = (v_{i,1}, ..., v_{i,m_i}).
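The sentence splitting and tokenization of step S11 can be sketched as follows. This is a minimal stand-in that assumes simple regular-expression rules; it is not the Stanford CoreNLP tokenizer named above, and the function names `split_sentences` and `tokenize` are hypothetical.

```python
import re

def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive tokenizer (assumption): runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "We are seeking a Senior Database Administrator. Apply today!"
# T = (S_1, ..., S_{n_s}), each S_i a list of tokens w_{i,1..m_i}
sentences = [tokenize(s) for s in split_sentences(text)]
```

A production pipeline would likely replace both helpers with a trained tokenizer, since abbreviations such as "Inc." break the whitespace rule used here.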

After the tokenization step (S11), word embedding and tag embedding (based on part-of-speech tagging and named-entity recognition) are performed on the tokenized data.

In word embedding, given a sentence S = (w_1, w_2, ..., w_m) containing m words, each word w_i is converted into a real-valued vector v_i of dimension d^{word}. A d^{word} × |V| embedding matrix W^{word} is used for this, where V is the vocabulary and |V| is the vocabulary size, a fixed value. Representative techniques for obtaining W^{word} include word2vec (Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, Jeffrey Dean: Distributed Representations of Words and Phrases and their Compositionality. NIPS 2013: 3111-3119) and GloVe (Jeffrey Pennington, Richard Socher, Christopher D. Manning: Glove: Global Vectors for Word Representation. EMNLP 2014: 1532-1543). In the present invention, an embedding matrix trained by Google using word2vec is used as W^{word}.

Here, d^{word} is 300 and |V| is 3,000,000.

To convert each word w_i in the sentence into v_i, w_i is first looked up in V, the index of w_i is one-hot encoded into a |V|-dimensional vector o_i, and a matrix-vector product of W^{word} and o_i is then computed.

This can be expressed as Equation 1 below.

(Equation 1)

v_i = W^{word} o_i

If w_i is a word that does not exist in V, v_i is set to the d^{word}-dimensional zero vector.
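Equation 1 and the out-of-vocabulary rule can be illustrated with a small pure-Python sketch. The toy vocabulary, the 2 × 3 matrix values, and the helper names are assumptions made for illustration; the embodiment itself uses a 300 × 3,000,000 pretrained matrix.

```python
def one_hot(index, size):
    """Return the |V|-dimensional one-hot vector o_i for a vocabulary index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def mat_vec(matrix, vec):
    """Matrix-vector product; matrix is a list of rows."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def embed_word(word, vocab, W_word):
    """v_i = W_word o_i, or the zero vector when the word is not in V."""
    d_word = len(W_word)
    if word not in vocab:
        return [0.0] * d_word
    return mat_vec(W_word, one_hot(vocab[word], len(vocab)))

# Toy values (hypothetical): d_word = 2, |V| = 3
vocab = {"database": 0, "administrator": 1, "senior": 2}
W_word = [[0.1, 0.4, 0.7],
          [0.2, 0.5, 0.8]]
v = embed_word("administrator", vocab, W_word)
```

Because o_i has a single non-zero entry, the product simply selects one column of W^{word}; practical implementations usually index the column directly instead of materializing o_i.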

Next, in the tag embedding process, part-of-speech tagging and named-entity recognition are applied to the sentence S to obtain the part-of-speech tag and the named-entity class of each word w_i in the sentence.

Once the triple (w_i, pos_i, nec_i) has been obtained, pos_i and nec_i are converted into vectors vpos_i and vnec_i of dimension d^{pos} and d^{nec}, respectively. To embed pos_i and nec_i, a d^{pos} × |V^{pos}| embedding matrix W^{pos} and a d^{nec} × |V^{nec}| embedding matrix W^{nec} are used, respectively.

Here, V^{pos} is the list of part-of-speech tags and V^{nec} is the list of named-entity classes.

Representative techniques for obtaining the embedding matrices (W^{pos}, W^{nec}) that convert part-of-speech tags and named-entity classes into vectors include word2vec (Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, Jeffrey Dean: Distributed Representations of Words and Phrases and their Compositionality. NIPS 2013: 3111-3119) and GloVe (Jeffrey Pennington, Richard Socher, Christopher D. Manning: Glove: Global Vectors for Word Representation. EMNLP 2014: 1532-1543). In the present invention, embedding matrices trained by Google using word2vec are used as W^{pos} and W^{nec}.

To convert pos_i and nec_i into the vectors vpos_i and vnec_i of dimension d^{pos} and d^{nec}, pos_i is looked up in V^{pos} and nec_i in V^{nec}, the resulting indices are one-hot encoded into |V^{pos}|-dimensional and |V^{nec}|-dimensional vectors, and matrix-vector products of W^{pos} and W^{nec} with the respective one-hot vectors are computed.

W^{pos} and W^{nec} are parameters that are updated during training, and their initial values are set using Xavier initialization.

Because V^{pos} and V^{nec} are lists containing every part-of-speech tag and every named-entity class, respectively, every pos and nec is guaranteed to be contained in V^{pos} and V^{nec}.
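As a rough illustration of the tag embedding lookup with Xavier-initialized matrices, the sketch below assumes the uniform variant of Xavier initialization, with bound sqrt(6 / (fan_in + fan_out)). The tag lists, the dimensions, and the function names are hypothetical examples, not values from the embodiment.

```python
import math
import random

def xavier_uniform(rows, cols, rng):
    """Xavier (Glorot) uniform initialization; the uniform variant is an assumption."""
    limit = math.sqrt(6.0 / (rows + cols))
    return [[rng.uniform(-limit, limit) for _ in range(cols)] for _ in range(rows)]

def embed_tag(tag, tag_list, W):
    """Select the column of W for `tag`; equivalent to W times the one-hot vector."""
    idx = tag_list.index(tag)
    return [row[idx] for row in W]

# Illustrative tag inventories (hypothetical subsets)
V_pos = ["DT", "JJ", "NNP", "VBG", "."]
V_nec = ["O", "TITLE", "ORGANIZATION"]
d_pos, d_nec = 4, 3

rng = random.Random(0)
W_pos = xavier_uniform(d_pos, len(V_pos), rng)  # d_pos x |V_pos|
W_nec = xavier_uniform(d_nec, len(V_nec), rng)  # d_nec x |V_nec|

vpos = embed_tag("NNP", V_pos, W_pos)
vnec = embed_tag("TITLE", V_nec, W_nec)
```

No out-of-vocabulary branch is needed here, since every tag and class is contained in its list by construction.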

Next, in step S20, candidate keywords are extracted from the English text data.

In the candidate keyword extraction process, the keywords contained in every sentence S_1, ..., S_{n_s} of the text T are extracted using windows whose size ranges from a minimum of 1 to a maximum of k_max. Let w_{i,j:j+k-1} denote the sequence (w_{i,j}, w_{i,j+1}, ..., w_{i,j+k-1}) of k consecutive words that starts at word w_{i,j} in S_i, which consists of m_i words in total; then j+k-1 is less than or equal to m_i.

Using a window of size k (1 ≤ k ≤ min(k_max, m_i)), a total of m_i-k+1 sequences w_{i,1:k}, w_{i,2:k+1}, ..., w_{i,m_i-k+1:m_i} can be obtained from S_i.

In other words, for each window size k between 1 and k_max, max(0, m_i-k+1) sequences are obtained from S_i. In this way, for each sentence S_i of the text T, every sequence w_{i,j:j+k-1} is included among the candidate keywords.

For a given text T, each sentence of the text is scanned for each window size to compute the candidate keyword set KC = [c_1, c_2, ..., c_{n_k}]. The same sequence may appear several times in the text; KC equals the set obtained by removing such duplicate sequences from [w_{1,1:1}, w_{1,2:2}, ..., w_{n_s, m_{n_s}-k_max+1:m_{n_s}}].
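The windowed extraction with duplicate removal can be sketched as below. The function name `extract_candidates` and the choice to preserve first-seen order are assumptions; sentences are taken to be lists of tokens.

```python
def extract_candidates(sentences, k_max):
    """Collect every k-word window (1 <= k <= k_max) from every sentence, without duplicates."""
    candidates = []
    seen = set()
    for tokens in sentences:
        m = len(tokens)
        for k in range(1, min(k_max, m) + 1):
            # A window of size k yields m - k + 1 sequences from this sentence.
            for j in range(m - k + 1):
                seq = tuple(tokens[j:j + k])
                if seq not in seen:
                    seen.add(seq)
                    candidates.append(seq)
    return candidates

KC = extract_candidates([["a", "b", "a"]], k_max=2)
```

In this toy call the second occurrence of the one-word sequence ("a",) is dropped, leaving four candidates.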

FIG. 4 shows the neural network based formatting algorithm applied to the present invention.

The process of obtaining the candidate keyword set KC described above begins by receiving the unstructured text data T and the data schema SC as input.

In FIG. 4, INITIALIZESTRUCTURE(·) is a function that initializes d, which will store the resulting structured data; d is initialized to an empty tuple whose data schema is the input SC (line 1).

EXTRACTKEYWORDS(·) is a function that generates the candidate keyword set KC from the unstructured text data T (line 2).

The function that obtains the candidate keyword set KC is as described in detail above.

As can be seen from the formatting algorithm, the present invention is embodied as an algorithm executed by a computer, and the executing entity may be a computer or a computing device equivalent to a computer.

Next, in step S30, the sentences that contain each candidate keyword are indexed.

After the candidate keyword set KC has been obtained as described above, the set SC_i of all sentences that contain each keyword c_i is extracted from the text T.

The sentence indexing process can be performed efficiently by storing, for each keyword c_i, the index information of the sentences that contain c_i while KC is being computed.

Using the stored index information, the following ordered pairs are obtained: (KC, SC) = [(c_1, SC_1), ..., (c_{n_k}, SC_{n_k})].
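One way to record the sentence indices during candidate extraction, as a minimal sketch: the dictionary layout (keyword tuple mapped to a list of sentence indices) and the name `index_candidates` are assumptions made for illustration.

```python
from collections import defaultdict

def index_candidates(sentences, k_max):
    """Map each candidate keyword c_i to the indices of the sentences containing it (SC_i)."""
    index = defaultdict(list)
    for i, tokens in enumerate(sentences):
        m = len(tokens)
        for k in range(1, min(k_max, m) + 1):
            for j in range(m - k + 1):
                seq = tuple(tokens[j:j + k])
                # Record each sentence once, even if the keyword occurs in it repeatedly.
                if not index[seq] or index[seq][-1] != i:
                    index[seq].append(i)
    return dict(index)

pairs = index_candidates([["db", "admin"], ["admin", "role"]], k_max=2)
```

Building the index in the same pass as extraction avoids a second scan of T for every keyword.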

The process of obtaining these ordered pairs is given on line 4 of the algorithm in FIG. 4: after the two functions described above have been called to initialize d and generate KC, all sentences in T that contain c are found for each keyword c in KC.

Next, in step S40, the features of each keyword c_i are extracted using the (KC, SC) pairs obtained above and EP_i, the embedding result of each sentence S_i.

As explained above, SC_i may contain several sentences for a single keyword. In the feature extraction process, the features of (c_i, S_{c_i,j}) are extracted for each sentence S_{c_i,j} in SC_i and used as the input of the neural network model.

That is, a single keyword c_i is input to the neural network model |SC_i| times, once for each sentence that contains c_i, and in the subsequent inference process the most suitable output is selected from the |SC_i| neural network outputs.

The features are broadly divided into features of the keyword c_i and features of the sentence S_{c_i,j}. The features of c_i are seven in total: 1) the word embedding vector of c_i, 2) the part-of-speech tag embedding vector of c_i, 3) the named-entity class embedding vector of c_i, 4) the part-of-speech tag embedding vector and 5) the named-entity class embedding vector of the word immediately preceding c_i in S_{c_i,j}, and 6) the part-of-speech tag embedding vector and 7) the named-entity class embedding vector of the word immediately following c_i.

Each embedding vector assumes a single word, whereas c_i may consist of several words. For items 1) to 3), therefore, a method of computing the embedding vector of a multi-word keyword must additionally be considered.

In the present invention, attention is applied to obtain the embedding vector of a multi-word keyword. The features of S_{c_i,j} are extracted by passing through the convolutional neural network the sequence EC_{c_i,j} = [(v_{k,1}, vposition_{k,1}), (v_{k,2}, vposition_{k,2}), ..., (v_{k,m_k}, vposition_{k,m_k})], which joins the sequence E_{c_i,j} of word embedding vectors corresponding to S_{c_i,j} with the sequence Epos_{c_i,j} of relative position embedding vectors vposition, each of which embeds the relative position between c_i and a word in S_{c_i,j}.

This process is given on lines 7 and 8 of the algorithm in FIG. 4. The features are divided into features of the sentence and features of the keyword c: the features of sentence S are extracted using the convolutional neural network (line 7), and the features of c are extracted through the GETFEATURE(·) function (line 8).

FIGS. 5 and 6 illustrate the feature extraction process using the keyword 'Database Administrator' and a sentence containing it, 'We are seeking a Senior Database Administrator.', as an example. FIG. 5 illustrates the features of the keyword c_i, and FIG. 6 illustrates the features of the sentence S_i.

Referring to FIGS. 5 and 6, the target keyword consists of two words, 'Database' and 'Administrator'.

First, the word embedding and tag embedding vectors of the two words are obtained. The attention technique is then applied to the two vectors to compute the embedding vector of 'Database Administrator'.

As a result, the size of the keyword's embedding vector remains constant regardless of the number of words in the keyword.

In the sentence, the word immediately before 'Database Administrator' is 'Senior' and the word immediately after it is '.', so the part-of-speech tag embedding vectors and named-entity class embedding vectors of these two words are obtained.

FIG. 6, meanwhile, shows the process of obtaining the features of a sentence.

First, each word of the sentence is converted into a vector through the word embedding process, and a vector embedding the position of each word in the sentence relative to the target keyword is obtained.

As shown in FIG. 6, one sentence is represented as one matrix, and this matrix is input to the convolutional neural network to obtain the features of the sentence.

The convolutional neural network here is assumed to be a single-layer convolutional neural network using three filters, but the present invention is not limited to this network structure.

An attention vector v^{att} is used when computing the embedding vector of c_i with attention. Here, v^{att} is a (d^{word} + d^{pos} + d^{nec})-dimensional vector, a parameter that is updated during training and whose initial value is set using Xavier initialization.

Suppose that the keyword c_i consists of k words (w_1, w_2, ..., w_k) and that the embedding vector of word w_j is vp_j = (v_j, vpos_j, vnec_j). The attention a_j of w_j and the embedding vector of c_i obtained by applying attention are each defined by a formula that is given as an image in the original publication and is not reproduced in this text.
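Because the formulas for the attention a_j and for the attended keyword embedding are not reproduced in this text, the following sketch is only one plausible reading. It assumes that a_j is a softmax over the dot products of v^{att} with each vp_j and that the keyword embedding is the a_j-weighted sum of the vp_j; the name `attention_pool` is hypothetical.

```python
import math

def attention_pool(word_vectors, v_att):
    """Assumed form: a_j = softmax_j(v_att . vp_j); keyword vector = sum_j a_j * vp_j."""
    scores = [sum(a * b for a, b in zip(v_att, vp)) for vp in word_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(v_att)
    pooled = [sum(w * vp[t] for w, vp in zip(weights, word_vectors)) for t in range(dim)]
    return weights, pooled

# Two-word keyword with toy 2-dimensional vp_j vectors (hypothetical values)
weights, pooled = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```

Whatever the exact formula, a weighted sum of this kind yields a vector with the same dimension as each vp_j, which matches the statement that the keyword embedding size does not depend on the number of words.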

The convolutional neural network used to extract the features of S_{c_i,j} may consist of one convolution layer and one max-pooling layer. In the convolution layer, a window of size w (w = 3) is used to extract the local features of w consecutive words in the sentence.

When conv_k denotes the feature of the word sequence w_{k:k+w-1}, it can be expressed as conv_k = EC_{c_i,j}[k:k+w-1]^T × W^{conv}. Here, W^{conv} is a parameter vector updated during training whose dimension is w times the dimension of one element of EC_{c_i,j}. With EC_{c_i,j}[k] denoting the k-th vector of EC_{c_i,j}, EC_{c_i,j}[k:k+w-1] is the vector formed by concatenating EC_{c_i,j}[k], EC_{c_i,j}[k+1], ..., EC_{c_i,j}[k+w-1].

Each conv_k obtained in this way is a scalar, and for a sentence of length m_{c_i,j}, scalar values are obtained from m_{c_i,j}-w+1 different windows.

That is, a vector conv of dimension m_{c_i,j}-w+1 is computed. In the present invention, zero padding is applied so that a vector conv of dimension m_{c_i,j} is computed.

In the max-pooling layer, the largest value of the vector conv is extracted for each filter, and these values form the features of S_{c_i,j}. To ensure diversity of features, d^{conv} different filters are applied in the present invention, so that a d^{conv}-dimensional feature vector is finally extracted. Here, d^{conv} is a hyperparameter with an integer value.
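The convolution and max-pooling steps can be sketched in pure Python as follows. The text does not say on which side the zero padding is applied, so symmetric padding is assumed here; bias terms and activation functions are omitted, and the name `conv_feature` is hypothetical.

```python
def conv_feature(EC, filters, w=3):
    """EC: list of m per-word vectors; filters: list of flat weight vectors of length w * dim.

    Returns one max-pooled value per filter, i.e. a d_conv-dimensional feature vector.
    """
    m, dim = len(EC), len(EC[0])
    pad = [[0.0] * dim]
    # Zero padding (assumed symmetric) so that exactly m windows are produced.
    padded = pad * (w // 2) + EC + pad * (w - 1 - w // 2)
    features = []
    for W_conv in filters:
        conv = []
        for k in range(m):
            # Concatenate the w vectors of the window into one flat vector.
            window = [x for vec in padded[k:k + w] for x in vec]
            conv.append(sum(a * b for a, b in zip(window, W_conv)))
        features.append(max(conv))  # max pooling over the sentence
    return features

# Toy sentence of three 1-dimensional word vectors and two filters (hypothetical values)
feat = conv_feature([[1.0], [2.0], [3.0]], [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]], w=3)
```

With the first filter the padded windows sum to 3, 6, and 5, so the pooled value is 6; the second filter picks the middle element of each window, giving 3.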

The relative position embedding vectors used as input to the convolutional neural network are computed using a relative position embedding matrix. The relative position is 0 for every word belonging to the keyword c_i = [w_1, ..., w_k]. For words in the sentence to the left of w_1, the relative position decreases by 1 with each step away from w_1, and for words to the right of w_k, it increases by 1 with each step away from w_k.

The example of FIG. 6 shows relative position values set in this way. 'We' is the leftmost word, and its relative position from 'Database' is -5. The word to the right of 'Administrator' is '.', and its relative position is 1.
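The relative position rule can be checked against the example sentence with a short sketch. The function name `relative_positions` and the use of inclusive start and end token indices for the keyword are assumptions.

```python
def relative_positions(num_tokens, start, end):
    """Relative position of each token to a keyword spanning token indices start..end (inclusive)."""
    positions = []
    for idx in range(num_tokens):
        if idx < start:
            positions.append(idx - start)  # negative, counted from the first keyword word
        elif idx > end:
            positions.append(idx - end)    # positive, counted from the last keyword word
        else:
            positions.append(0)            # words inside the keyword
    return positions

tokens = ["We", "are", "seeking", "a", "Senior", "Database", "Administrator", "."]
rel = relative_positions(len(tokens), 5, 6)  # keyword 'Database Administrator'
```

Before embedding, these integers would still need to be mapped to indices of the relative position list, for example by adding an offset; that mapping is not specified in the text.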

Relative position values are embedded into d^{position}-dimensional vectors in the same way as part-of-speech tags and named-entity classes. The d^{position} × |V^{position}| relative position embedding matrix W^{position} used in this process is a parameter that is updated during training.

Next, in step S50, the label of the keyword c_i is predicted using the keyword features.

In the present invention, softmax regression is used to predict the label of the keyword c_i, and the extracted features are the input of the softmax regression. When the data schema of an N-ary relation is denoted (r_1, r_2, ..., r_N), the discrete set of classes Y consists of N+1 classes in total: [r_1, r_2, ..., r_N, NA].

Here, using the feature f extracted from the keyword c_i and a sentence S_{c_i,j} that contains it, the final score is computed as score(c_i, S_{c_i,j}) = Softmax(W^{(S)} f), where W^{(S)} is the (N+1) × (d^{word} + 3d^{pos} + 3d^{nec} + d^{conv})-dimensional softmax matrix.

With W^{(S)}_r denoting the r-th row vector of W^{(S)}, the final class of the keyword c_i is r(c_i) = argmax_r (score(W^{(S)}_r f)).
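A minimal sketch of the softmax regression step, assuming no bias term and a toy two-attribute schema; the class names, matrix values, and function names are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(W_S, f, classes):
    """Return the class with the highest softmax score for feature vector f, with its score."""
    logits = [sum(w * x for w, x in zip(row, f)) for row in W_S]  # one logit per row of W_S
    probs = softmax(logits)
    best = max(range(len(classes)), key=probs.__getitem__)
    return classes[best], probs[best]

# Toy schema with N = 2 attributes plus NA (hypothetical values)
classes = ["title", "company", "NA"]
W_S = [[1.0, 0.0],
       [0.0, 1.0],
       [0.0, 0.0]]
label, score = predict_label(W_S, [2.0, 0.5], classes)
```

Since softmax is monotonic, taking the argmax over the scores selects the same class as taking the argmax over the raw products of each row with f.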

This process corresponds to line 9 of the algorithm in FIG. 4, where the extracted features are used as the input of the softmax regression to compute the attribute to which the keyword c belongs and its score. As in lines 6 to 11, feature extraction and softmax regression are repeated for every sentence S that contains the keyword c.

The attribute and score of the keyword c may differ from sentence to sentence.
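Since the same keyword can receive a different attribute and score in each sentence, one output must be chosen per keyword. The text does not state the selection rule, so the sketch below simply assumes that the prediction with the highest score wins; `best_prediction` is a hypothetical name.

```python
def best_prediction(per_sentence):
    """per_sentence: list of (label, score) pairs, one per sentence containing the keyword.

    Assumed rule: keep the pair with the highest score.
    """
    return max(per_sentence, key=lambda pair: pair[1])

# Predictions for one keyword from three sentences (hypothetical values)
choice = best_prediction([("title", 0.6), ("NA", 0.9), ("company", 0.3)])
```

Other rules, such as preferring the best non-NA prediction, would fit the description equally well.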

Next, in step S60, the mapping is performed through neural network training.

The training data for the neural network consists of text data and the tuples corresponding to that text data. Each entity stored in a tuple must be a keyword that appears in the text data.

For each keyword in the text data that is identical to an entity value in the tuple, the label of the keyword is set to the attribute of that entity. All other keywords are labeled NA.

If every candidate keyword obtained with the windows is used as training data, the problem arises that most candidate keywords are labeled NA. To solve this problem, the present invention applies a negative sampling technique to the training process.

Among the candidate keywords, those whose label is not NA are defined as positive data, and those whose label is NA are defined as negative data.

All positive keywords are used as training data, while negative keywords are randomly drawn from the full set of negative data according to a uniform distribution so that the ratio of positive to negative keywords satisfies Po:Ne. Here, Po and Ne are hyperparameters with real values.
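The negative sampling step can be sketched as follows, assuming that training examples are (keyword, label) pairs and that the number of negatives is rounded to the nearest integer and capped at the number available; `sample_training_set` and its `seed` parameter are hypothetical.

```python
import random

def sample_training_set(labelled, po, ne, seed=0):
    """Keep all positives and draw negatives uniformly so that positives:negatives is about po:ne."""
    positives = [(kw, lab) for kw, lab in labelled if lab != "NA"]
    negatives = [(kw, lab) for kw, lab in labelled if lab == "NA"]
    n_neg = min(len(negatives), round(len(positives) * ne / po))
    rng = random.Random(seed)
    # random.sample draws without replacement, uniformly over the negatives.
    return positives + rng.sample(negatives, n_neg)

# 2 positive and 10 negative keywords (hypothetical data), ratio Po:Ne = 1:2
data = [("dba", "title"), ("acme", "company")] + [("w%d" % i, "NA") for i in range(10)]
train = sample_training_set(data, po=1.0, ne=2.0)
```

Drawing a fresh sample at every epoch, rather than once, is a common variation that this sketch does not cover.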

The training cost function of the present invention, which is based on the cross-entropy cost function, is given by Equation 2 below, where θ denotes all updatable parameters.

(Equation 2)
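The body of Equation 2 is not reproduced in this text. As a rough sketch only, a plain cross-entropy cost averaged over the training examples is assumed below; any regularization or weighting term in the actual equation is unknown, and `cross_entropy` is a hypothetical name.

```python
import math

def cross_entropy(prob_rows, target_indices):
    """Mean negative log-probability assigned to the correct class (assumed form of the cost)."""
    total = 0.0
    for probs, target in zip(prob_rows, target_indices):
        total -= math.log(probs[target])
    return total / len(target_indices)

# Softmax outputs for two keywords and their true class indices (hypothetical values)
loss = cross_entropy([[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]], [0, 1])
```

In training, θ would be updated by gradient descent on this cost, with the gradient flowing through the softmax matrix, the convolution filters, the attention vector, and the trainable embedding matrices.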

The present invention is not limited to the embodiments described above, and it will be apparent to those of ordinary skill in the art to which the invention pertains that it may be modified and varied in many ways without departing from its technical gist.

Claims

1. A convolutional neural network based English text formatting method that uses a computer to convert a set of documents consisting of English text into structured data having a predefined data schema, the method comprising:
a) preprocessing input text to convert the input text into a real-valued vector format;
b) extracting candidate keywords from the input text;
c) indexing the sentences that contain each candidate keyword;
d) extracting features for the candidate keywords;
e) predicting a label of each keyword using the extracted features as input data; and
f) mapping keywords to each attribute by performing negative sampling based neural network training on the extracted keyword features.

2. The method according to claim 1,
wherein step a) comprises:
a tokenization step of splitting the English text, which is unstructured data, into sentences and then identifying each word in each sentence; and
an embedding step of representing each identified word as a vector and vectorizing the part-of-speech tag and named-entity class of each word.

3. The method according to claim 1,
wherein step b) extracts the keywords contained in every sentence of the text using windows whose size ranges from a minimum of 1 to a maximum of k_max (k being a positive integer), and,
when a keyword sequence appears multiple times in the text, removes the duplicate sequences to extract a keyword set.

The method of claim 3,
wherein step c) comprises:
extracting, from the input text, the set of all sentences that contain each extracted keyword.

The method of claim 4,
wherein extracting the set of sentences comprises storing, for each keyword and during the process of obtaining the keyword set, index information of the sentences that contain the keyword, and extracting the set of sentences using the stored sentence index information.
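
One way to read this claim is as an inverted index built in the same pass that enumerates the keywords. The sketch below is an illustration under that assumption; build_keyword_index and sentences_for are hypothetical names, and the input is assumed to be tokenized already.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Keyword = Tuple[str, ...]


def build_keyword_index(sentences: List[List[str]], k_max: int) -> Dict[Keyword, Set[int]]:
    """Record, for each candidate keyword, the indices of the sentences containing it."""
    index: Dict[Keyword, Set[int]] = defaultdict(set)
    for sent_id, tokens in enumerate(sentences):
        for size in range(1, k_max + 1):
            for start in range(len(tokens) - size + 1):
                index[tuple(tokens[start:start + size])].add(sent_id)
    return dict(index)


def sentences_for(keyword: Keyword, index: Dict[Keyword, Set[int]],
                  sentences: List[List[str]]) -> List[List[str]]:
    """Look up the sentence set of a keyword through the stored indices."""
    return [sentences[i] for i in sorted(index.get(keyword, ()))]


# Toy input (assumed).
sentences = [["Seoul", "is", "large"], ["Seoul", "is", "old"]]
index = build_keyword_index(sentences, k_max=2)
seoul_sentences = sentences_for(("Seoul",), index, sentences)
```

Storing only sentence indices avoids rescanning the whole text for every keyword when the sentence sets are needed later.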

The method of claim 1,
wherein step d) comprises:
a word feature extraction process of obtaining a word embedding vector and a tag embedding vector for the keyword and then applying an attention technique to the two vectors to compute an embedding vector, such that the size of the embedding vector of the keyword remains constant regardless of the number of words the keyword contains; and
a sentence feature extraction process of obtaining sentence features by determining positions relative to the keyword within each sentence, using the result of converting the keyword into word embedding vectors.
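
The fixed-size property described here can be sketched as attention pooling over the word vectors of a keyword. The claim does not specify the attention form; the dot-product scoring against a query vector used below is an assumption, as are the names attention_pool and query.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(word_vectors: List[List[float]], query: List[float]) -> List[float]:
    """Weighted sum of word vectors; the output size equals the embedding size.

    Assumption: attention scores are dot products with a query vector
    (which would be learned in a real model).
    """
    scores = [sum(w * q for w, q in zip(vec, query)) for vec in word_vectors]
    weights = softmax(scores)
    dim = len(query)
    return [
        sum(weights[i] * word_vectors[i][d] for i in range(len(word_vectors)))
        for d in range(dim)
    ]


query = [0.5, 0.5]  # hypothetical parameter
one_word = attention_pool([[1.0, 0.0]], query)
three_words = attention_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], query)
```

Both a one-word and a three-word keyword yield a vector of the same length, which is the behavior the claim requires.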

The method of claim 1,
wherein step e) comprises:
using the features as the input to softmax regression to compute the attribute to which the keyword belongs and a corresponding score.
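
As an illustration of the softmax regression step, the following sketch computes one linear score per label, normalizes the scores with softmax, and returns the best label with its probability. The weights, biases, label names, and the function predict_attribute are made up for the example and are not values from the patent.

```python
import math
from typing import List, Tuple


def predict_attribute(features: List[float], weights: List[List[float]],
                      biases: List[float], labels: List[str]) -> Tuple[str, float]:
    """Softmax regression: one linear score per label, normalized to probabilities."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical schema attributes plus the NA label, with toy parameters.
labels = ["name", "city", "NA"]
weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
biases = [0.0, 0.0, 0.0]
attribute, score = predict_attribute([2.0, 0.5], weights, biases, labels)
```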

The method of claim 1,
wherein step f) comprises:
setting NA as the label of any target keyword, among the extracted keywords, that does not match a value stored in the structured data;
adjusting, through negative sampling among the labeled keywords, the ratio between the number of keywords labeled with an attribute of the data schema and the number of keywords labeled NA; and
training, on the target keywords and labels thus set, a neural network that includes convolutional neural network based sentence features and attention neural network based keyword features.
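
The labeling and ratio-adjustment steps can be sketched as follows. The dictionary structured, the parameter na_per_positive, and both function names are hypothetical; the claim states only that the ratio is adjusted through negative sampling, so the uniform random subsampling of NA-labeled keywords used here is an assumption.

```python
import random
from typing import Dict, List, Tuple

NA = "NA"


def label_keywords(candidates: List[str], structured: Dict[str, str]) -> List[Tuple[str, str]]:
    """Label each keyword with its schema attribute, or NA when no stored value matches.

    `structured` maps a stored value to its attribute (assumed representation).
    """
    return [(kw, structured.get(kw, NA)) for kw in candidates]


def negative_sample(labeled: List[Tuple[str, str]], na_per_positive: float,
                    seed: int = 0) -> List[Tuple[str, str]]:
    """Keep all attribute-labeled keywords and a random subset of NA-labeled ones."""
    positives = [pair for pair in labeled if pair[1] != NA]
    negatives = [pair for pair in labeled if pair[1] == NA]
    keep = min(len(negatives), int(len(positives) * na_per_positive))
    rng = random.Random(seed)
    return positives + rng.sample(negatives, keep)


# Toy data (assumed).
candidates = ["Seoul", "is", "large", "Korea", "the", "of"]
structured = {"Seoul": "city", "Korea": "country"}
labeled = label_keywords(candidates, structured)
balanced = negative_sample(labeled, na_per_positive=1.0)
```

Because most candidate keywords match no stored value, leaving the NA class unsampled would let it dominate training; the ratio parameter controls that imbalance.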

(Deleted)

The method of claim 8,
wherein the neural network includes, as features:
a vector of the sentence containing the target keyword, computed using a convolutional neural network;
an embedding vector of the target keyword;
an embedding vector of the morphological analysis value / named entity recognition value of the target keyword;
an embedding vector of the morphological analysis value / named entity recognition value of the word immediately preceding the target keyword in the sentence; and
an embedding vector of the morphological analysis value / named entity recognition value of the word immediately following the target keyword in the sentence.

(Deleted)

The method of claim 10,
wherein, among the features, the embedding vector of the target keyword is computed,
when the target keyword consists of two or more words, by applying an attention neural network to the embedding vector of each word of the target keyword.

The method of claim 10,
wherein, among the features, computing the vector of the sentence using the convolutional neural network
uses a single-layer convolutional neural network that receives as input the embedding vector of each word in the sentence and an embedding vector encoding the position of each word in the sentence relative to the target keyword in the sentence.
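
A minimal sketch of such a sentence encoder is given below. The claim specifies only a single convolution layer over word and relative-position embeddings; the ReLU activation, the max-over-time pooling, the zero offset for tokens inside the keyword span, and all names and toy values are assumptions made for illustration.

```python
from typing import Dict, List


def relative_positions(n_tokens: int, kw_start: int, kw_end: int) -> List[int]:
    """Signed offset of each token from the keyword span; 0 inside the span."""
    positions = []
    for i in range(n_tokens):
        if i < kw_start:
            positions.append(i - kw_start)
        elif i > kw_end:
            positions.append(i - kw_end)
        else:
            positions.append(0)
    return positions


def build_inputs(tokens: List[str], kw_start: int, kw_end: int,
                 word_emb: Dict[str, List[float]],
                 pos_emb: Dict[int, List[float]]) -> List[List[float]]:
    """Concatenate each word embedding with the embedding of its relative position."""
    offsets = relative_positions(len(tokens), kw_start, kw_end)
    return [word_emb[tok] + pos_emb[off] for tok, off in zip(tokens, offsets)]


def conv_max_pool(inputs: List[List[float]],
                  filters: List[List[List[float]]]) -> List[float]:
    """One convolution layer over token windows, then max-over-time pooling.

    filters[f][j] is the weight vector applied to the j-th token of a window.
    """
    sentence_vector = []
    for filt in filters:
        width = len(filt)
        responses = []
        for start in range(len(inputs) - width + 1):
            total = 0.0
            for j in range(width):
                total += sum(w * x for w, x in zip(filt[j], inputs[start + j]))
            responses.append(max(0.0, total))  # ReLU (assumed activation)
        sentence_vector.append(max(responses, default=0.0))
    return sentence_vector


# Toy embeddings (assumed); the keyword is the first token.
tokens = ["Seoul", "is", "old"]
word_emb = {"Seoul": [1.0, 0.0], "is": [0.0, 1.0], "old": [1.0, 1.0]}
pos_emb = {0: [0.0], 1: [0.5], 2: [1.0]}
inputs = build_inputs(tokens, 0, 0, word_emb, pos_emb)
filters = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]  # one filter of width 2
sentence_vector = conv_max_pool(inputs, filters)
```

The sentence vector has one entry per filter, so its size does not depend on sentence length.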

The method of claim 10,
wherein, among the features, the embedding vectors of the morphological analysis values / named entity recognition values of the target keyword and of the words immediately before and after the target keyword are variable during the neural network training process.

The method of claim 13,
wherein, among the input values of the convolutional neural network, the embedding vectors encoding the relative positions of words are variable during the neural network training process.