KR102128852B1

KR102128852B1 - Device and method for visualizing key words of features extracted by applying principal component analysis to word vectors from text data

Info

Publication number: KR102128852B1
Application number: KR1020200038520A
Authority: KR
Inventors: 김지혁; 김건민
Original assignee: (주)위세아이텍
Priority date: 2020-03-30
Filing date: 2020-03-30
Publication date: 2020-07-01

Abstract

The present invention relates to a device for visualizing the key words of features. The device comprises: a feature information input unit that inputs importance values, with respect to a target variable, of a plurality of features extracted from text data using principal component analysis; a reference importance input unit that sets a reference value associated with the importance values based on user input information; a word-specific importance extraction unit that extracts, for each word vector, its degree of influence on each of the plurality of features; a word-specific importance calculation unit that converts the values extracted by the word-specific importance extraction unit into percentages; and a visualization execution unit that visualizes the importance of each word and information on the extracted features in the form of a Venn diagram graph.

Description

Device and method for visualizing key words of features extracted by applying principal component analysis to word vectors from text data

The present application relates to a device and method for visualizing the key words of features extracted by applying principal component analysis to word vectors generated from text data.

Conventional text data visualization performs morphological analysis on the data and applies a word cloud to the resulting collection of words, indicating importance by word frequency.

However, when features are extracted by vectorizing the words in text data and then applying principal component analysis (PCA), this approach cannot pick out and visualize the words that most strongly influenced each feature.

When text data is analyzed by vectorizing words and extracting features, it is necessary to know which words strongly influenced each feature so that the features can be explained.

This requires a technique for visualizing the key words of features extracted by applying principal component analysis to word vectors generated from text data.

The background art of the present application is disclosed in Korean Patent Publication No. 10-2015-0048751.

The present application is intended to solve the above-described problems of the prior art. It aims to visualize the key words of features extracted by applying principal component analysis to word vectors generated from text data, so that a user can identify which words strongly influenced each feature.

The present application also aims to provide a word cloud visualization, laid out in the shape of a Venn diagram, for three features extracted by applying principal component analysis to word vectors generated from text data.

The present application also aims to provide a device and method for extracting features by applying principal component analysis to word vectors generated from text data, which can improve performance when data that mixes structured data and text data is analyzed with machine learning.

The present application also aims to provide a device and method that, when analyzing data that mixes structured data and text data, can extract features by applying principal component analysis to word vectors generated by applying the TF-IDF method to the text data.

However, the technical problems to be addressed by the embodiments of the present application are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above technical objects, a device for visualizing key words of features according to an embodiment of the present application may include: a feature information input unit that inputs importance values, with respect to a target variable, of a plurality of features extracted from text data using principal component analysis; a reference importance input unit that sets a reference value associated with the importance values based on user input information; a word-specific importance extraction unit that extracts, for each word vector, its degree of influence on each of the plurality of features; a word-specific importance calculation unit that converts the values extracted by the word-specific importance extraction unit into percentages; and a visualization execution unit that visualizes the importance of each word and information on the extracted features in the form of a Venn diagram graph.

In addition, when the target variable is set, the feature information input unit may calculate the degree of influence between the extracted features and the target variable, and input the importance values with respect to the target variable.

In addition, the visualization execution unit may determine an output position, an output size, and an output color according to the importance of each word and the frequency of the extracted features, and render the result in the form of a word cloud.

In addition, the device for visualizing key words of features may further include a word extraction processor that extracts key words from input text data using a plurality of dictionaries, and a feature extraction execution unit that vectorizes the extracted key words with a text analysis algorithm and then extracts features of the text data using principal component analysis.

In addition, when a variable name of the text data and the target variable are present, the feature information input unit may input variable importance values with respect to the target variable based on the features extracted by the feature extraction execution unit.

According to an embodiment of the present application, a method for visualizing key words of features may include: receiving importance values, with respect to a target variable, of a plurality of features extracted from text data using principal component analysis; setting a reference value associated with the importance values based on user input information; extracting, for each word vector, its degree of influence on each of the plurality of features; converting the extracted degrees of influence into percentages; and visualizing the importance of each word and information on the extracted features in the form of a Venn diagram graph.

The above-described means for solving the problems are merely exemplary and should not be construed as limiting the present application. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and the detailed description of the invention.

According to the above-described means of the present application, the key words of features extracted by applying principal component analysis to word vectors generated from text data can be visualized.

According to the above-described means of the present application, the features can be explained when words are vectorized and features are extracted during text data analysis.

According to the above-described means of the present application, features can be extracted by applying principal component analysis to word vectors generated from text data.

According to the above-described means of the present application, performance can be improved when data that mixes structured data and text data is analyzed with machine learning.

However, the effects obtainable from the present application are not limited to those described above, and other effects may exist.

FIG. 1 is a schematic configuration diagram of a device for visualizing key words of features according to an embodiment of the present application.
FIG. 2 is a diagram for explaining the visualization tool of the device for visualizing key words of features according to an embodiment of the present application.
FIG. 3 is a diagram visualizing the list of words used in the features extracted from first text data by the device for visualizing key words of features according to an embodiment of the present application.
FIGS. 4A to 4C are diagrams visualizing the list of words used in the features extracted from second text data by the device for visualizing key words of features according to an embodiment of the present application.
FIGS. 5A to 5C are diagrams visualizing the list of words used in the features extracted from third text data by the device for visualizing key words of features according to an embodiment of the present application.
FIGS. 6A to 6C are diagrams visualizing the list of words used in the features extracted from fourth text data by the device for visualizing key words of features according to an embodiment of the present application.
FIGS. 7A to 7C are diagrams visualizing the list of words used in the features extracted from fifth text data by the device for visualizing key words of features according to an embodiment of the present application.
FIG. 8 is a schematic block diagram of a text data feature extraction device according to an embodiment of the present application.
FIG. 9 is a schematic diagram illustrating the feature extraction process of the text data feature extraction device according to an embodiment of the present application.
FIG. 10 is an operation flowchart of a method for visualizing key words of features according to an embodiment of the present application.

Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present application pertains can readily practice them. However, the present application may be implemented in various different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the present application clearly, and similar parts are given similar reference numerals throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" or "indirectly connected" with another element in between.

Throughout this specification, when a member is said to be located "on", "above", "at the top of", "under", "below", or "at the bottom of" another member, this includes not only the case where the member is in contact with the other member but also the case where a further member exists between the two.

Throughout this specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

FIG. 1 is a schematic configuration diagram of a device for visualizing key words of features according to an embodiment of the present application.

Referring to FIG. 1, the device (150) for visualizing key words of features may include a feature information input unit (151), a reference importance input unit (152), a word-specific importance extraction unit (153), a word-specific importance calculation unit (154), and a visualization execution unit (160). The device (150) may receive extracted features from a text feature extraction device (120). The text feature extraction device (120) may include a word extraction processor (130) and a feature extraction execution unit (140). However, the configuration of the text feature extraction device (120) is not limited thereto. For example, the text feature extraction device (120) may include a text data input unit (not shown) for receiving text data. The text feature extraction device (120) may also include a database (not shown) for storing the extracted features.

According to an embodiment of the present application, text data is input, key nouns are extracted, the nouns are vectorized using a text analysis algorithm (for example, the TF-IDF technique), and principal component analysis (PCA) is then applied to extract three features. In this process the user can see only the finally extracted features. Because these features are produced by PCA, they consist solely of numbers, and the user cannot tell which words strongly influenced each feature.

The device (150) for visualizing key words of features can determine the influence of each word vector on each feature from PCA.components_, an attribute of the PCA class in Scikit-Learn's decomposition module. It visualizes the words with high influence so that the user can see intuitively which words were most important. The influence is presented as a value converted into a percentage, and the visualization is designed to show only words whose influence is at or above a specified value.
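The percentage conversion and thresholding described above can be illustrated with a minimal sketch. It assumes that the percentage for a word is its squared loading multiplied by 100, which is one natural reading because the squared loadings of a principal component sum to 1; the source does not spell out the formula at this point. The function names, the word labels, and the reference value of 10% are hypothetical.

```python
def to_percent(loadings):
    """Convert signed PCA loadings of one feature into percentage shares.

    Assumption: share = squared loading * 100, so the shares sum to
    about 100 whenever the loading vector has unit length.
    """
    return {word: (value ** 2) * 100.0 for word, value in loadings.items()}


def filter_by_reference(shares, reference_percent):
    """Keep only words whose share meets the reference value, largest first."""
    kept = {w: s for w, s in shares.items() if s >= reference_percent}
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))


# Hypothetical loadings of five words on one extracted feature.
loadings = {"word1": -0.3, "word2": -0.4, "word3": 0.5, "word4": 0.7, "word5": 0.1}

shares = to_percent(loadings)
top_words = filter_by_reference(shares, reference_percent=10.0)
print(list(top_words))
```

Squaring discards the sign, so a strongly negative loading counts as much as a strongly positive one; whether that matches the intended display is a design choice this sketch does not settle.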

When a variable name of the text data and a target variable are present, the device (150) may input variable importance values, with respect to the target variable, of the features extracted by principal component analysis. The device (150) may also receive a reference value entered by the user in order to extract words whose importance is at or above a specified value. The device (150) may extract, for the three features obtained by applying principal component analysis to the word vectors, the influence of each word vector on each feature. The device (150) may convert the values extracted by the word-specific importance extraction unit into percentages. Finally, the device (150) may visualize the importance of each word and information on the extracted features using a word cloud laid out in the shape of a Venn diagram.

According to an embodiment of the present application, the text feature extraction device (120) may receive text data (110) from an external server (not shown) or from a user terminal (not shown). The text data (110) is the data from which features are extracted and may be a document containing Korean, English, special characters, numbers, and the like. The text data (110) may include legal documents, contracts, specifications, ITBs, offshore and onshore plant data, ERP (enterprise resource planning) data, PMIS (project management information system) data, commercial data, public data, integrated big data information, Open API data, and the like. The user inputs the text (document) from which features are to be extracted into the text feature extraction device (120). The text feature extraction device (120) performs morphological analysis on the input text data, extracts key words using dictionaries, vectorizes the extracted key words by applying a text analysis algorithm, and then applies principal component analysis to the vectors to extract a preset number (for example, three) of features.

According to an embodiment of the present application, the word extraction processor (130) may extract key words from the input text data using a plurality of dictionaries. The plurality of dictionaries may include a user dictionary, a built-in dictionary, a non-noun dictionary, a personal-name dictionary, a self-explanatory dictionary, and the like. The word extraction processor (130) may build the user dictionary based on user input information. The word extraction processor (130) may build a noun dictionary containing a plurality of common nouns, or may obtain such a noun dictionary from an external server. The word extraction processor (130) may also split the text data into morphemes and extract the morphemes that can serve as words.

For example, the word extraction processor (130) may extract noun candidates from the text data. When the text data is in Korean, the word extraction processor (130) may extract noun candidates by applying a noun extractor (for example, LRNounExtractor_v2 of soynlp). The word extraction processor (130) may also extract nouns using a dictionary. A noun is a class of words that semantically denote an entity and may represent the name of an object.

The word extraction processor (130) may also extract nouns using a plurality of dictionaries. For example, the words extracted with a noun extractor (for example, LRNounExtractor_v2 of soynlp) include, besides nouns, single-character words, words mixed with special characters or foreign-language characters, and words that are not nouns. In addition, there may be words that the extractor failed to extract but that the user wants, and among the extracted words there may be key words that the user considers especially important. The word extraction processor (130) uses the prepared dictionaries to improve the quality of the word list so that only key words remain for vectorization.
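The dictionary-based clean-up described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `refine_candidates`, the rule that a valid word consists solely of Hangul syllables, and the sample dictionaries are all hypothetical, and the actual device may apply different rules.

```python
import re

# Assumption: a usable Korean noun consists only of Hangul syllables.
HANGUL_ONLY = re.compile(r"^[가-힣]+$")


def refine_candidates(candidates, non_noun_dict, user_dict):
    """Filter noun candidates and add user-requested words.

    - drop single-character words
    - drop words containing special or foreign-language characters
    - drop words listed in the non-noun dictionary
    - add user dictionary words that the extractor missed
    """
    kept = []
    for word in candidates:
        if len(word) < 2:
            continue
        if not HANGUL_ONLY.match(word):
            continue
        if word in non_noun_dict:
            continue
        kept.append(word)
    # Append user words in order, without creating duplicates.
    for word in user_dict:
        if word not in kept:
            kept.append(word)
    return kept


# Hypothetical extractor output and dictionaries.
candidates = ["계약", "것", "납품#1", "그리고", "플랜트"]
non_noun_dict = {"그리고"}
user_dict = ["시방서"]

key_words = refine_candidates(candidates, non_noun_dict, user_dict)
print(key_words)
```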

According to an embodiment of the present application, the feature extraction execution unit (140) may vectorize the extracted key words by applying a text analysis algorithm and then extract features of the text data using principal component analysis. Principal component analysis (PCA) is a technique that rotates a data set so that the resulting variables are statistically uncorrelated. After the rotation, often only some of the new variables are selected, according to how important they are for explaining the data. These new variables are called principal components (each principal component is influenced by all of the variables in the original data), and because selecting only some of them reduces the number of dimensions, PCA is also used for dimensionality reduction. If every vectorized word were used as a feature, the number of features would become too large, and prediction performance could suffer. The feature extraction execution unit (140) therefore extracts features using principal component analysis. The number of features to extract may default to a preset number (for example, three), which is convenient to visualize and can typically represent 80% to 90% of the variance of the vectors.
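How much variance three components actually retain can be checked with the `explained_variance_ratio_` attribute of Scikit-Learn's `PCA`. The sketch below uses a random matrix as a hypothetical stand-in for a TF-IDF matrix, so the retained share it reports says nothing about real text; the 80% to 90% figure is the source's claim and depends on the data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in for a TF-IDF matrix: 50 documents x 20 word columns.
X = rng.random((50, 20))

pca = PCA(n_components=3)
features = pca.fit_transform(X)  # one row of three features per document

# Fraction of the total variance captured by each of the three components.
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)
print(features.shape, cumulative[-1])
```

If the cumulative share is too low for a given corpus, the number of components could be raised, at the cost of a visualization that no longer fits a three-circle Venn layout.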

The feature extraction execution unit (140) may vectorize the key nouns using a vector extraction package (for example, feature_extraction.text, a subpackage of Scikit-Learn's feature_extraction). This package provides document preprocessing classes, and the key nouns are vectorized with one of them that weights words based on their frequency (for example, TfidfTransformer).
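A minimal sketch of this vectorization step is shown below. In Scikit-Learn, `TfidfTransformer` operates on a term-count matrix, so the sketch first builds counts with `CountVectorizer`; that pairing, and the representation of each document as a space-separated string of already extracted nouns, are assumptions made for illustration. The sample documents are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical documents, each already reduced to its key nouns.
docs = [
    "contract delivery schedule",
    "contract penalty clause",
    "delivery inspection schedule",
]

# Step 1: count how often each word occurs in each document.
counter = CountVectorizer().fit(docs)
count_matrix = counter.transform(docs)

# Step 2: reweight the counts by TF-IDF (rows are L2-normalized by default).
tfidf = TfidfTransformer().fit_transform(count_matrix)

# Column order of the matrix, recovered from the fitted vocabulary.
vocabulary = sorted(counter.vocabulary_, key=counter.vocabulary_.get)
print(vocabulary)
print(tfidf.toarray().round(3))
```

Each column of `tfidf` corresponds to one word, which is what later allows a PCA loading to be traced back to a specific word.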

According to an embodiment of the present application, the feature information input unit (151) may input importance values, with respect to a target variable, of a plurality of features extracted from the text data (110) using principal component analysis. For example, the feature information input unit (151) may receive information on the plurality of features extracted from the text data (110) from the text feature extraction device (120). For example, when the text variable name is 'text' and the words extracted from this variable are 'word 1', 'word 2', and so on, the feature information input unit (151) may express how much each variable used in the PCA influences the extracted features, as shown in Table 1. The influences shown in Table 1 are still signed values that indicate positive or negative correlation.

[Table 1]
Word      text_PCA_1   text_PCA_2   text_PCA_3
Word 1     0.003124    -0.093125     0.007445
Word 2     0.024356     0.098631     0.024323
Word 3    -0.345432    -0.145223    -0.343252
Word 4    -0.002244    -0.032174    -0.001175
Word 5     0.000229     0.011312    -0.09022
...        ...          ...          ...

When a target variable is set, the feature information input unit (151) may calculate the degree of influence between the extracted features and the target variable and input the importance values with respect to the target variable. The feature information input unit (151) may calculate the degree of influence between the set target variable and the features extracted by the feature extraction unit (142) by applying principal component analysis to the vectorized words. When a variable name of the text data (110) and a target variable are present, the feature information input unit (151) may input variable importance values with respect to the target variable based on the features extracted by the feature extraction execution unit (140). For example, when a target variable exists, the feature information input unit (151) may input a value (predictor influence) indicating how strongly each feature extracted from the text is related to the target variable. For example, when the target variable name is 'target' and the text variable name is 'text', the influence on the target variable may be calculated and input for each of the three extracted features 'text_PCA_1', 'text_PCA_2', and 'text_PCA_3'. The influence calculation method used here may include a tree-based calculation method or a p-value test method, but is not limited thereto.
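One possible tree-based calculation is sketched below using the `feature_importances_` attribute of Scikit-Learn's `RandomForestClassifier`. The source only says "tree-based", so the choice of a random forest, the synthetic data, and the binary target are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
names = ["text_PCA_1", "text_PCA_2", "text_PCA_3"]

# Hypothetical PCA features for 200 records.
X = rng.normal(size=(200, 3))
# Hypothetical binary target driven almost entirely by the first feature.
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they are non-negative and sum to 1.
importance = dict(zip(names, model.feature_importances_))
print(importance)
```

Because the importances sum to 1, they can be shown directly as shares next to each circle of the Venn diagram, although the source does not state that the device does so.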

According to an embodiment of the present application, the reference importance input unit (152) may set a reference value associated with the importance values based on user input information. The reference importance input unit (152) may receive the reference value that the user enters in order to extract words whose importance is at or above a specified value. The reference importance input unit (152) may receive the reference value from a user terminal (not shown) and may provide a reference value setting menu to the user terminal. The user may enter the reference value information using the user terminal. For example, when the user input is 0.01, the reference importance input unit (152) may set the reference value to 0.01. When no user input is received, the reference importance input unit (152) may set a preset reference value. The preset reference value may be 0.001%, 0.005%, 0.01%, 0.05%, 0.1%, or 0.5%.

According to an embodiment of the present application, the word-specific importance extracting unit 153 may extract, for the plurality of features, the degree of influence that each word vector has on each feature. For example, for three features extracted by applying principal component analysis to the word vectors, the word-specific importance extracting unit 153 may extract the influence of each word vector on each feature.

In addition, when features are extracted using principal component analysis, the word-specific importance extracting unit 153 may extract the degree of influence that each variable used in the principal component analysis has on the extracted features. For example, the word-specific importance extracting unit 153 may extract the influence using PCA.components_, provided by the decomposition module of Scikit-Learn. When features are extracted using PCA, this attribute expresses how much each variable used in the PCA affects each extracted feature.

For example, when two variables M and N are extracted by applying PCA to variables A, B, C, D, and E, influence values such as those in [Table 2] may be extracted.

[Table 2]
      A      B      C      D      E
M   -0.3   -0.4    0.5    0.7    0.1
N    0.1   -0.7   -0.4   -0.5    0.3

The influences of variables A, B, C, D, and E on variable M are -0.3, -0.4, 0.5, 0.7, and 0.1, respectively, and their influences on variable N are 0.1, -0.7, -0.4, -0.5, and 0.3, respectively. That is, the variable with the strongest positive correlation with M is D and the variable with the strongest negative correlation with M is B, while the variable with the strongest positive correlation with N is E and the variable with the strongest negative correlation with N is B. The word-specific importance extracting unit 153 may generate the influence values so that their squares sum to 1 for each feature. For example, squaring the influence values for M in Table 2 and adding them gives (-0.3)^2 + (-0.4)^2 + 0.5^2 + 0.7^2 + 0.1^2 = 0.09 + 0.16 + 0.25 + 0.49 + 0.01 = 1, and the same holds for N.

According to an embodiment of the present application, the word-by-word importance calculation unit 154 may convert the values extracted by the word-specific importance extracting unit 153 into importance percentages. Based on the values extracted by the word-specific importance extracting unit 153, the feature extraction unit 142 may express as a percentage how much each original variable affects a feature extracted by PCA. Because the influence values are computed so that their squares sum to 1, the word-by-word importance calculation unit 154 can obtain the percentage by squaring each influence value and multiplying the result by 100.

Squaring the influence values in Table 2 gives Table 3, and expressing those squares as percentages gives Table 4. From this, the word-specific importance extracting unit 153 can identify which variables were most important (most influential) for each feature.

[Table 3]
      A      B      C      D      E
M   0.09   0.16   0.25   0.49   0.01
N   0.01   0.49   0.16   0.25   0.09

[Table 4]
      A      B      C      D      E
M     9%    16%    25%    49%     1%
N     1%    49%    16%    25%     9%
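The conversion from influence values to percentages can be checked with a short sketch. The numbers are the Table 2 values; the helper name `to_percentages` and the rounding to six decimals (used only to hide floating-point noise) are illustrative assumptions.

```python
# Influence values (PCA loadings) from Table 2: one row per extracted feature,
# one column per original variable A..E.
variables = ["A", "B", "C", "D", "E"]
loadings = {
    "M": [-0.3, -0.4, 0.5, 0.7, 0.1],
    "N": [0.1, -0.7, -0.4, -0.5, 0.3],
}


def to_percentages(row):
    """Square each influence value and multiply by 100."""
    return [round(value ** 2 * 100, 6) for value in row]


percent = {feature: to_percentages(row) for feature, row in loadings.items()}

for feature, row in percent.items():
    most_influential = variables[row.index(max(row))]
    print(feature, dict(zip(variables, row)),
          "total:", round(sum(row), 6), "most influential:", most_influential)
```

Because the sign is lost when squaring, the percentage shows only how strongly a variable contributes to a feature, not the direction of its contribution.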

According to an embodiment of the present application, the visualization execution unit 160 may visualize the importance of each word and information on the extracted features in the form of a Venn diagram graph. The importance of each word and the information on the extracted features may also be visualized as a word cloud laid out in the shape of a Venn diagram. The visualization execution unit 160 may determine the output position, output size, and output color according to the importance of each word and the frequency of the extracted features, and visualize the result in cloud form. Cloud-form visualization makes key words stand out, for example by rendering the words used in an article in different sizes and colors according to their frequency. The visualization execution unit 160 may also provide the cloud-form graph to a user terminal (not shown).

FIG. 2 is a diagram for explaining the visualization tool of the device for visualizing key words of a feature according to an embodiment of the present application. Referring to FIG. 2, the visualization execution unit 160 may visualize the key words of the features by applying a word cloud in the shape of a Venn diagram.

Referring to FIG. 2(a) by way of example, each of the plurality of extracted features (for example, each of three features) has its own key words. If these groups of words are called W1, W2, and W3, each group consists of distinct words, and the groups may share words with one another. Because the shared words influenced every feature concerned, they can be regarded as important words. The visualization execution unit 160 may therefore render the visualization in the shape of a Venn diagram to show this. For example, the visualization execution unit 160 may output the first region (for example, the W1 region) of the first feature variable in a first color, and may vary the output position and output size of each word in that region based on the variable importance and on the importance and influence of the key words. Likewise, the visualization execution unit 160 may output the second region (for example, the W2 region) of the second feature variable in a second color, varying the output position and output size of each word on the same basis. The visualization execution unit 160 may also output the intersection region of the first and second feature variables (for example, W1&W2) in a third color, again varying the output position and output size of each word based on the variable importance and on the importance and influence of the key words in that region.
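One way to assign words to Venn diagram regions is sketched below. The word sets are invented for illustration, and the helper `venn_regions` is a hypothetical name; the description does not specify how the regions are computed.

```python
from itertools import combinations

# Hypothetical key-word sets for three extracted features.
key_words = {
    "W1": {"price", "delivery", "quality"},
    "W2": {"delivery", "refund", "quality"},
    "W3": {"quality", "service"},
}


def venn_regions(sets_by_name):
    """Map each combination of set names to the words found in exactly those sets."""
    names = sorted(sets_by_name)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Words present in every set of the combination ...
            inside = set.intersection(*(sets_by_name[n] for n in combo))
            # ... minus words that also appear in any other set.
            others = [sets_by_name[n] for n in names if n not in combo]
            outside = set().union(*others)
            regions["&".join(combo)] = inside - outside
    return regions


regions = venn_regions(key_words)
for name, words in regions.items():
    print(name, sorted(words))
```

Each region can then be drawn with its own background color, with the central region (here `W1&W2&W3`) holding the words shared by all features.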

Referring to FIG. 2(b), a word cloud is a visual representation of the frequency of the words used in a document. Because frequently occurring words are drawn larger, the core content of a document can be grasped at a glance. A word cloud is also called a tag cloud. A tag is a label attached to clothing or goods to describe the material or handling instructions; on web pages and social network services (SNS), the keywords attached to describe content are called tags. A tag cloud shows the importance of tags on a website through font size or color. Depending on the nature of the content being represented, word clouds are sometimes divided into text clouds and data clouds. A text cloud visualizes the words contained in a document, whereas a data cloud expresses numeric information through size and color instead of words. For example, country names may be drawn in different sizes or colors according to population, or the size and color of company names may be determined by stock price movements and trading volume in the stock market. In addition, the key words of each region of the Venn diagram are displayed using the word cloud technique. The visualization execution unit 160 substitutes importance for the value that would normally be the frequency, so that words of higher importance are drawn larger. When importance is substituted, the percentage is converted appropriately so that it becomes an integer. Furthermore, because the background color already differs for each region of the Venn diagram, differently colored words could be hard to read, so the visualization execution unit 160 may render all words in black.
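The substitution of importance for frequency can be sketched as below. The scale factor of 1000 and the floor of 1 are assumptions made for illustration; the description says only that the percentage is converted appropriately into an integer.

```python
def to_integer_weights(importance_percent, scale=1000):
    """Convert percentage importances to positive integers usable as pseudo-frequencies.

    `scale` is an assumed factor; the floor of 1 keeps very small
    importances from disappearing entirely.
    """
    return {
        word: max(1, round(pct * scale))
        for word, pct in importance_percent.items()
    }


# Hypothetical importances (in percent) for one Venn diagram region.
weights = to_integer_weights(
    {"delivery": 0.125, "price": 0.04, "refund": 0.0312, "misc": 0.0001}
)
print(weights)
```

The resulting integers can be passed to a word cloud renderer in place of word counts, so that a word with higher importance is drawn larger.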

FIG. 3 is a diagram in which the device for visualizing key words of a feature according to an embodiment of the present application visualizes the list of words used in the features extracted from first text data.

Referring to FIG. 3 by way of example, the key word visualization device 150 may visualize the list of words used in the features by generating a word cloud in the form of a Venn diagram. The key word visualization device 150 may show, in the Venn-diagram-shaped word cloud, the words whose influence on a feature is at or above the reference value.

When the text variable is named 'text', the key word visualization device 150 may display, as word clouds inside the Venn diagram, the words whose influence on each of the three extracted features 'text_PCA_1', 'text_PCA_2', and 'text_PCA_3' is at or above the reference value. In the overlapping parts (intersections) of the Venn diagram, the key word visualization device 150 may display words that strongly influenced several features at once. That is, the words at the very center are those that strongly influenced all three features. The key word visualization device 150 may also show a variable information display box. When the target variable is named 'target' and the text variable is named 'text', the variable information display box may include, for each of the three extracted features 'text_PCA_1', 'text_PCA_2', and 'text_PCA_3', its variable influence on target and its number of key words.
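The selection of words at or above the reference value, and the per-feature key word counts shown in the variable information display box, can be sketched as follows. All importance values and the helper name `select_key_words` are made up for illustration.

```python
# Hypothetical per-feature word importances, in percent.
importance = {
    "text_PCA_1": {"delivery": 0.40, "price": 0.02, "refund": 0.005},
    "text_PCA_2": {"delivery": 0.15, "quality": 0.30, "price": 0.001},
    "text_PCA_3": {"service": 0.25, "delivery": 0.03},
}
reference_value = 0.01  # assumed reference value


def select_key_words(importance_by_feature, threshold):
    """Keep, for each feature, the words whose importance is at or above the threshold."""
    return {
        feature: {word for word, value in words.items() if value >= threshold}
        for feature, words in importance_by_feature.items()
    }


key_words = select_key_words(importance, reference_value)

# Number of key words per feature, as listed in the information box.
counts = {feature: len(words) for feature, words in key_words.items()}

# Words that pass the threshold for every feature (center of the Venn diagram).
central_words = set.intersection(*key_words.values())

print(counts)
print(sorted(central_words))
```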

FIG. 8 is a schematic block diagram of a text data feature extraction apparatus according to an embodiment of the present application. Referring to FIG. 8, the word extraction processor 130 may include a user dictionary input unit 131, a built-in dictionary input unit 132, a word candidate extracting unit 133, and a key word extracting unit 134.

According to an embodiment of the present application, the user dictionary input unit 131 may obtain input information related to key words from a user. The user dictionary input unit 131 may build a user dictionary based on the user input information that the user enters through a user terminal (not shown). When there are words that the user wants to extract from the text data, the user dictionary input unit 131 may build the user dictionary in order to extract those words. In other words, the user dictionary input unit 131 may build the user dictionary from user input information containing at least one of Hangul, English, and special characters, entered through the user terminal (not shown). Words included in the user dictionary may be given the highest priority when the key word extracting unit extracts words. The user dictionary is entered by the user to overcome the limitation that the built-in dictionary of the built-in dictionary input unit 132 may not be able to extract every word that fits the characteristics of the data, and to put the words that the user considers especially important at the highest priority.

According to an embodiment of the present application, the user dictionary input unit 131 may provide a word input menu to the user terminal (not shown). For example, the user terminal (not shown) may download and install an application program provided by the user dictionary input unit 131, and the word input menu may be provided through the installed application.

Through the word input menu, the user may enter the words that should be extracted preferentially. The user dictionary input unit 131 may build the user dictionary by collecting the words entered from the user terminal (not shown).

The user dictionary input unit 131 may include any kind of server, terminal, or device that exchanges data, content, and various communication signals with the user terminal (not shown) over a network and that has data storage and processing functions.

The user terminal (not shown) is a device that interworks with the user dictionary input unit 131 through a network. It may be any kind of wireless communication device, such as a smartphone, smart pad, tablet PC, or wearable device, or a PCS (Personal Communication System), GSM (Global System for Mobile communication), PDC (Personal Digital Cellular), PHS (Personal Handyphone System), PDA (Personal Digital Assistant), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (W-Code Division Multiple Access), or Wibro (Wireless Broadband Internet) terminal, or it may be a fixed terminal such as a desktop computer or smart TV.

Examples of the network for sharing information between the user dictionary input unit 131 and the user terminal (not shown) include, but are not limited to, a 3GPP (3rd Generation Partnership Project) network, an LTE (Long Term Evolution) network, a 5G network, a WIMAX (World Interoperability for Microwave Access) network, the wired and wireless Internet, a LAN (Local Area Network), a Wireless LAN (Wireless Local Area Network), a WAN (Wide Area Network), a PAN (Personal Area Network), a Bluetooth network, a Wifi network, an NFC (Near Field Communication) network, a satellite broadcasting network, an analog broadcasting network, and a DMB (Digital Multimedia Broadcasting) network.

According to an embodiment of the present application, the built-in dictionary input unit 132 may obtain a noun dictionary containing a plurality of common nouns. For example, the noun dictionary may be a dictionary of nouns that the apparatus holds itself. By way of example, the noun dictionary contains 317,151 words. It contains few technical terms (for example, specialized words used in particular professions or fields, such as legal terms) but may contain a large number of everyday words. The built-in dictionary input unit 132 may also obtain a database containing a plurality of common nouns from an external server. The built-in dictionary input unit 132 may obtain a specific noun dictionary from an external server (not shown). In other words, the built-in dictionary input unit 132 may obtain a specific noun dictionary from the external server (not shown) in consideration of the characteristics of the input text data 110. For example, when the input text data 110 is court judgment data, the built-in dictionary input unit 132 may obtain a noun dictionary containing legal terms from the external server (not shown).

In addition, the built-in dictionary input unit 132 may include a non-noun dictionary. The non-noun dictionary consists of words whose part of speech is not a noun, such as '~친다, ~란다, ~렵다, ~므로, ~었다, ~있다, ~없다, ~한다, ~된다, 시킴, ~됨, 있음, 없음, ~하여, ~하면, ~되면, ~렀다, 어디, 않다', and of strings that cannot be nouns, such as '??, ?O'. The non-noun dictionary may include words corresponding to non-noun parts of speech and to typographical errors. For words that are not confirmed to be 100% nouns, the non-noun dictionary can be used to remove those that are clearly not nouns.

In addition, the built-in dictionary input unit 132 may include a person name dictionary and a place name dictionary. The person name dictionary may be a dictionary consisting of people's names. The place name dictionary may be a dictionary containing region names, subway stations, city names, and the like. People's names, region names, subway stations, famous places, and the like are words that may or may not be needed depending on the data. If the user considers them unnecessary, they can be removed.

According to an embodiment of the present application, the word candidate extracting unit 133 may separate the text data into morphemes and extract the morphemes that can be words. For example, the word candidate extracting unit 133 may perform morphological analysis on the text data and extract the morphemes that can be words from sentences, phrases, and the like. To extract nouns from the text data, the word candidate extracting unit 133 may perform morphological analysis with Komoran (Korean Morphological Analyzer). Unlike conventional morphological analysis, Komoran can analyze several space-separated words as a single part of speech, so it can tokenize text data containing spaces more accurately.

A Korean word has an L + [R] structure. The distribution of the R parts that appear next to an L part is a good hint as to whether L is a noun. However, nouns cannot be found by rules alone simply because an R part is a particle. For example, '-은' is a typical particle, but '손나은' (Son Na-eun) is not '손나 + 은'. In entertainment news, '에이핑크 맴버 손나은' (Apink member Son Na-eun) appears frequently, whereas forms such as '손나 + 은', '손나 + 이', and '손나 + 에게' do not. In this way, the word extraction processor 130 can determine whether a word is a noun by using the information in the bipartite graph of L and R. The word extraction processor 130 has collected the set of R parts that appear after nouns using the Sejong corpus, and a noun score has been learned for each entry in this R set. Noun scores lie in the range [-1, 1]. For example, the score of '내서' is -0.530702, that of '있게' is 1.000000, that of '있는' is 0.327824, that of '쓰는' is 0.079298, that of '었다며' is -1.000000, and that of '였다며' is 0.437399; for '했 + 었다며', the noun score of '했' is -1.0. If '재미 + 있게' appears three times and '재미 + 있는' appears twice, the noun score of '재미' is (3 x 1.0 + 2 x 0.33) / 5 = 0.732. If the noun extraction threshold is 0.5, '재미' can be extracted as a noun. In other words, the word candidate extracting unit 133 may extract a given noun based on a preset noun extraction threshold.
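The worked example for '재미' can be reproduced with a short sketch. It assumes that the noun score of an L part is the frequency-weighted average of the scores of the R parts observed next to it, which is what the arithmetic above implies; the function name `noun_score` is hypothetical, and the score of '있는' is rounded to 0.33 as in the example.

```python
# Noun scores of R parts, using the rounded values from the worked example.
r_scores = {"있게": 1.0, "있는": 0.33}


def noun_score(r_counts, scores):
    """Frequency-weighted average of R-part scores for one L candidate.

    Assumes every observed R part has an entry in `scores`.
    """
    total = sum(r_counts.values())
    if total == 0:
        return 0.0
    return sum(count * scores[r] for r, count in r_counts.items()) / total


THRESHOLD = 0.5  # noun extraction threshold from the example

# '재미 + 있게' observed 3 times, '재미 + 있는' observed 2 times.
score = noun_score({"있게": 3, "있는": 2}, r_scores)
is_noun = score >= THRESHOLD
print(round(score, 3), is_noun)
```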

This method learns the structure of the words in a given document set and extracts the nouns of that document set. It is a statistics-based unsupervised learning method that requires no training data. It therefore has the advantage of being able to extract nouns that are not registered in a dictionary. The word candidate extracting unit 133 of the present application may extract noun candidates using a noun extraction function (for example, the LRNounExtractor_v2 function of soynlp).

In addition, the word candidate extracting unit 133 may extract the morphemes that can be words from Hangul words of two or more characters. The words extracted with a particular function (for example, the LRNounExtractor_v2 function) include, besides nouns, one-character words, words mixed with special characters or foreign-language characters, and words that are not nouns. There may also be words that the function failed to extract but that the user wants extracted, and among the extracted words there may be core words that the user considers especially important. The word candidate extracting unit 133 may improve the quality of the words by using the prepared dictionaries so that only the key words remain for vectorization. The word candidate extracting unit 133 may separate the text data 110 into morphemes and extract nouns. The word candidate extracting unit 133 may also extract from the text data 110 only the morphemes that consist solely of two or more Hangul characters.

By way of example, the word candidate extracting unit 133 may keep only Hangul words of two or more characters. One-character words often have many different meanings and are often particles, so the word candidate extracting unit 133 may exclude them. In addition, so that words containing special characters or foreign-language characters are used only when the user has entered them, the word candidate extracting unit 133 may exclude (delete) any word in the text data that contains special characters or foreign-language characters.
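A minimal sketch of this filter, assuming that "Hangul" means precomposed Hangul syllables (U+AC00 to U+D7A3); the function name `keep_hangul_words` and the sample candidates are illustrative.

```python
import re

# Two or more precomposed Hangul syllables and nothing else.
HANGUL_WORD = re.compile(r"[가-힣]{2,}")


def keep_hangul_words(candidates):
    """Drop one-character words and words mixed with other characters."""
    return [word for word in candidates if HANGUL_WORD.fullmatch(word)]


candidates = ["재미", "은", "AI기술", "손나은", "C++", "데이터!"]
filtered = keep_hangul_words(candidates)
print(filtered)
```

Words entered through the user dictionary would bypass this filter, since the text states that words with special or foreign-language characters are used only when the user supplies them.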

According to an embodiment of the present application, the key word extracting unit 134 may extract key words from the morpheme list extracted by the word candidate extracting unit 133 by using a plurality of dictionaries. In other words, the key word extracting unit 134 may extract words from the morpheme list extracted by the word candidate extracting unit 133 by using the user dictionary, the built-in noun dictionary, the non-noun dictionary, the person name dictionary, and the place name dictionary.

In addition, the key word extracting unit 134 may extract the words that are nouns from the text data by using the noun dictionary. The key word extracting unit 134 may extract the nouns among the extracted words by using the dictionaries of other konlpy analyzers, such as kkma and mecab, together with the self-built noun dictionary. The key word extracting unit 134 may extract nouns from the text data using the noun dictionary regardless of whether a user dictionary exists. Words extracted in this way can be treated as 100% nouns. The noun dictionary holds 317,151 words. It contains few technical terms (for example, specialized words used in particular professions or fields, such as legal terms) but a large number of everyday words.

In addition, the key word extracting unit 134 may remove words that are not nouns by using the non-noun dictionary. Among the words that are not confirmed to be 100% nouns, the key word extracting unit 134 may delete (remove) non-noun words using the non-noun dictionary. Here, the non-noun dictionary consists of words whose part of speech is not a noun, such as '~친다, ~란다, ~렵다, ~므로, ~었다, ~있다, ~없다, ~한다, ~된다, 시킴, ~됨, 있음, 없음, ~하여, ~하면, ~되면, ~렀다, 어디, 않다', and of strings that cannot be nouns, such as '??, ?O'. The words that remain after filtering with the non-noun dictionary can be regarded as neologisms, compound nouns, technical terms, and the like obtained through unsupervised learning, and whether to use them can then be decided.

For example, people's names, region names, subway stations, famous places, and the like are words that may or may not be needed depending on the data. The key word extracting unit 134 may delete (remove) such words based on the user's input information. When the user additionally wants to remove person names and place names from the collection of words left after filtering with the non-noun dictionary, the key word extracting unit 134 may apply the person name dictionary and the place name dictionary to remove (delete) those words.
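The dictionary-based filtering described above might look like the following sketch. Every dictionary here is a tiny made-up stand-in, `filter_candidates` is a hypothetical helper, and exact-match lookup is an assumption: the description does not say how entries such as '~친다' are matched against candidates.

```python
def filter_candidates(candidates, noun_dict, non_noun_dict,
                      name_dicts=(), remove_names=False):
    """Split candidates into confirmed nouns and unresolved words.

    Words found in the noun dictionary are treated as nouns; words found in
    the non-noun dictionary are dropped; the rest stay as unresolved
    candidates (possible neologisms, compound nouns, technical terms).
    """
    confirmed, unresolved = [], []
    for word in candidates:
        if word in noun_dict:
            confirmed.append(word)
        elif word not in non_noun_dict:
            unresolved.append(word)
    if remove_names:
        # Optionally drop person and place names at the user's request.
        blocked = set().union(*name_dicts)
        confirmed = [w for w in confirmed if w not in blocked]
        unresolved = [w for w in unresolved if w not in blocked]
    return confirmed, unresolved


candidates = ["재미", "서울", "않다", "딥러닝", "어디"]
noun_dict = {"재미", "서울"}        # stand-in for the built-in noun dictionary
non_noun_dict = {"않다", "어디"}    # stand-in for the non-noun dictionary
place_names = {"서울"}              # stand-in for the place name dictionary

confirmed, unresolved = filter_candidates(
    candidates, noun_dict, non_noun_dict,
    name_dicts=[place_names], remove_names=True,
)
print(confirmed, unresolved)
```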

In addition, the key word extraction unit 134 may remove words using a user dictionary. The user dictionary may be a dictionary into which the user has entered the words that the user judges to be important for extracting features from the text in question.

In addition, the key word extraction unit 134 may assign a first priority to words included in the user dictionary and a second priority to words included in the noun dictionary, and may extract key words from the prioritized word list. For example, when a word in the morpheme list extracted by the word candidate extraction unit 133 appears in the user dictionary, the key word extraction unit 134 may assign that word the first priority; when a word in the morpheme list appears in the noun dictionary, it may assign that word the second priority. The key word extraction unit 134 may then extract the final key words based on the priorities assigned to the morpheme list, for example by extracting the words that fall within a preset rank.

Meanwhile, the key word extraction unit 134 may assign a first weight to a word in the morpheme list that appears in the user dictionary and a second weight to a word that appears in the noun dictionary. Here, the first weight may be higher than the second weight. The key word extraction unit 134 may regenerate the word list with these weights applied and extract the key words from it.
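The weighting scheme can be illustrated with a short sketch. The weight values, the dictionaries, and the name `rank_key_words` are hypothetical; the text only states that the first weight may be higher than the second and that words within a preset rank are kept.

```python
USER_DICT = {"상고"}                      # hypothetical user dictionary
NOUN_DICT = {"상고", "법원", "판결"}      # hypothetical noun dictionary
FIRST_WEIGHT, SECOND_WEIGHT = 2.0, 1.0    # assumed values; only first > second is stated


def rank_key_words(morphemes, top_k):
    """Weight each morpheme by dictionary membership and keep the top_k weighted words."""
    scored = []
    for position, word in enumerate(morphemes):
        if word in USER_DICT:
            weight = FIRST_WEIGHT         # user dictionary takes precedence
        elif word in NOUN_DICT:
            weight = SECOND_WEIGHT
        else:
            weight = 0.0                  # in neither dictionary
        # Sort by descending weight, then by original position for stable ties.
        scored.append((-weight, position, word))
    scored.sort()
    return [word for neg_weight, _, word in scored[:top_k] if neg_weight < 0]


key_words = rank_key_words(["법원", "그러므로", "상고", "판결"], top_k=2)
```

Checking the user dictionary first means a word present in both dictionaries receives the higher weight, which mirrors the stated priority order.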

According to an embodiment of the present application, the word vector generation unit 141 may vectorize the key words extracted by the key word extraction unit 134 by applying a text analysis algorithm (for example, TF-IDF) to them. As an example, the word vector generation unit 141 may use a vector extraction package (for example, feature_extraction.text, a subpackage of Scikit-Learn's feature_extraction). This package provides document preprocessing classes, and the key nouns may be vectorized with the text analysis function among them that vectorizes words based on their frequency (for example, TfidfTransformer).

As an example, the word vector generation unit 141 may calculate the frequency of each word in a document. The frequency of a word in the text data (document) may be expressed as Equation 1, where d is the text data (document) and t is a word. That is, Equation 1 denotes the frequency of word t in document d. The TF value may be normalized depending on the situation.

[Equation 1]

In addition, the word vector generation unit 141 may use Boolean frequency, recording 1 when the text data 110 contains the word and 0 when it does not. As an example, Boolean frequency may be expressed as Equation 2.

[Equation 2]

As an example, by recording only whether a word is present in the text data 110, Boolean frequency prevents the TF (term frequency) value from becoming excessively large. On the other hand, it assigns the same weight whether a word appears once or 100 times, so the word vector generation unit 141 may apply Boolean frequency when the count itself is unimportant, that is, when only the presence or absence of the word matters.

In addition, the word vector generation unit 141 may use logarithmically scaled frequency, expressed as Equation 3. Logarithmically scaled frequency is the frequency converted to a log scale to reduce its magnitude. With this scaling, a change in count produces a large change in TF when a word appears only a few times in the text data 110, but almost no change in TF once the count has grown very large.

[Equation 3]

In addition, the word vector generation unit 141 may adjust the relative frequency of a word according to the length of the text data 110. In other words, augmented frequency adjusts the relative frequency of a word for document length by dividing the word's frequency by the largest word frequency among the words in the document. Augmented frequency may be expressed as Equation 4, which may be a form adjusted to reduce deviation.

[Equation 4]

For example, when word A appears 100 times in a first document, 5 times in a second document, and 8 times in a third document, the word vector generation unit 141 may fix the denominator at 100.
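The four term-frequency variants described above can be compared side by side in a small sketch. The function names are illustrative, and the exact formulas are assumptions based on the conventional definitions: log(1 + count) for the logarithmic form, and 0.5 + 0.5 × count / max count for the augmented form.

```python
import math
from collections import Counter


def raw_tf(term, doc_tokens):
    """Raw count of the term in the document."""
    return Counter(doc_tokens)[term]


def boolean_tf(term, doc_tokens):
    """1 if the term occurs at all, otherwise 0."""
    return 1 if term in doc_tokens else 0


def log_tf(term, doc_tokens):
    """Log-scaled count; the +1 keeps a zero count at zero (assumed form)."""
    return math.log(1 + raw_tf(term, doc_tokens))


def augmented_tf(term, doc_tokens):
    """Count divided by the largest count in the document, shifted into [0.5, 1] (assumed form)."""
    counts = Counter(doc_tokens)
    return 0.5 + 0.5 * counts[term] / max(counts.values())


doc = "this is the second second document".split()
summary = {
    "raw": raw_tf("second", doc),
    "boolean": boolean_tf("second", doc),
    "log": log_tf("second", doc),
    "augmented": augmented_tf("second", doc),
}
```

For the word 'second', which appears twice and is the most frequent word in this sentence, the raw count is 2, the Boolean value is 1, and the augmented value reaches its maximum of 1.0.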

IDF (inverse document frequency) may be a number inversely proportional to the number of documents that contain a particular word. IDF may be used to lower the importance of words that occur frequently across the whole document set. If the plain reciprocal were used, the IDF value would become very large as the total number of documents increases, so the logarithm is taken for IDF as well. IDF may be expressed as Equation 5.

[Equation 5]

TF-IDF encoding raises the importance of words that appear often within a single document (TF) and lowers the importance of words that occur frequently across the whole document set (IDF). TF-IDF may be expressed as Equation 6.

[Equation 6]

For example, Equation 6 may mean that the value (importance) becomes larger.
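The bodies of Equations 1 to 6 do not appear in this text. As a point of reference only, and assuming the conventional definitions that match the descriptions above, they could be written as follows, where f_{t,d} is the raw count of word t in document d and D is the set of all documents. The constant 0.5 in Equation 4 and the +1 inside the logarithm of Equation 3 are common conventions and are assumptions here.

```latex
\begin{align}
\mathrm{tf}(t,d) &= f_{t,d} \tag{1}\\
\mathrm{tf}(t,d) &= \begin{cases} 1 & \text{if } t \text{ occurs in } d \\ 0 & \text{otherwise} \end{cases} \tag{2}\\
\mathrm{tf}(t,d) &= \log\bigl(1 + f_{t,d}\bigr) \tag{3}\\
\mathrm{tf}(t,d) &= 0.5 + \frac{0.5\, f_{t,d}}{\max\{ f_{w,d} : w \in d \}} \tag{4}\\
\mathrm{idf}(t,D) &= \log \frac{|D|}{\bigl|\{ d \in D : t \in d \}\bigr|} \tag{5}\\
\mathrm{tfidf}(t,d,D) &= \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D) \tag{6}
\end{align}
```

Under these forms, a word that is frequent in one document but rare across the document set receives a large TF-IDF value, while a word present in every document receives an IDF of zero.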

As an example, the word vector generation unit 141 may receive the text data 110 and assign a number to each word it contains.

Corpus = ['This is the first document.', 'This is the second second document.', 'And the third one.', 'Is this the first document?']

Numbers assigned per word: {'and': 0, 'document': 1, 'first': 2, 'is': 3, 'last': 4, 'one': 5, 'second': 6, 'the': 7, 'third': 8, 'this': 9}

The word vector generation unit 141 may generate a matrix using TF values only. Here, each row corresponds to a sentence and each column to the number assigned to a word.

When only TF values are used:
array([[0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 0, 1, 0, 1]])

In addition, the word vector generation unit 141 may generate a matrix using TF-IDF values. Here, each row corresponds to a sentence and each column to the number assigned to a word.

array([[0.    , 0.3894, 0.5577, 0.4629, 0.    , 0.    , 0.    , 0.3294, 0.    , 0.4629],
       [0.    , 0.2415, 0.    , 0.2870, 0.    , 0.    , 0.8573, 0.2042, 0.    , 0.2870],
       [0.5566, 0.    , 0.    , 0.    , 0.    , 0.5566, 0.    , 0.2652, 0.5566, 0.    ],
       [0.    , 0.3894, 0.5577, 0.4629, 0.    , 0.    , 0.    , 0.3294, 0.    , 0.4629]])

The example described above is merely one example, and the present application is not limited thereto. Various other embodiments may exist.
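The count matrix in the example can be reproduced with a few lines of plain Python. This is a sketch under stated assumptions: the tokenizer (lowercase alphabetic runs) and the helper names are invented for illustration, and the TF-IDF step uses the unsmoothed, unnormalized product tf × log(N / df), so its values are not expected to match the TF-IDF matrix listed above. The word numbering is copied from the example as given, including 'last', which does not occur in the four listed sentences.

```python
import math
import re

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]
# Word numbering copied from the example ('last' has no occurrence in these sentences).
vocab = {"and": 0, "document": 1, "first": 2, "is": 3, "last": 4,
         "one": 5, "second": 6, "the": 7, "third": 8, "this": 9}


def count_matrix(docs, vocabulary):
    """One row per sentence, one column per numbered word."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in re.findall(r"[a-z]+", doc.lower()):
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows


def tfidf_matrix(counts):
    """Plain tf * log(N / df); words with df == 0 get weight 0 (assumed handling)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[tf * w for tf, w in zip(row, idf)] for row in counts]


tf = count_matrix(corpus, vocab)
tfidf = tfidf_matrix(tf)
```

In this sketch 'the' appears in every sentence, so its IDF and therefore its TF-IDF weight is zero, while 'second', which appears twice in a single sentence, receives the largest weight in its row.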

According to an embodiment of the present application, the feature extraction unit 142 may extract features by applying principal component analysis to the words vectorized by the word vector generation unit 141. The feature extraction unit 142 may reduce the word vectors to three dimensions and thereby extract three new features. Principal component analysis (PCA) is a technique that rotates a data set so that the resulting variables are statistically uncorrelated. After the rotation, often only some of the new variables are selected, according to how important they are for explaining the data. These new variables are called principal components (each principal component is influenced by all of the variables in the original data), and because selecting only some of them lowers the dimensionality, PCA is also used for dimensionality reduction.

In addition, if every vectorized word were used as a feature, the number of features would become too large and prediction performance could suffer. The feature extraction unit 142 therefore extracts features using principal component analysis. The default number of extracted features is a preset number (for example, three) that is convenient to visualize and that generally captures 80% to 90% of the variance of the vectors.

In addition, the feature extraction unit 142 may generate names for the extracted features. For example, when the name of the text variable in the text data 110 is 'text', the three extracted features may be named 'text_PCA_1', 'text_PCA_2', and 'text_PCA_3', in that order.

According to an embodiment of the present application, when features are extracted using principal component analysis, the feature extraction unit 142 may extract the degree to which each variable used in the analysis influences each extracted feature. For example, the feature extraction unit 142 may obtain this influence from components_, an attribute of the PCA class in Scikit-Learn's decomposition module. It expresses how strongly each variable used in PCA contributes to each extracted feature.
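A minimal sketch of this step is given below. It computes the principal axes with NumPy's SVD rather than calling Scikit-Learn, which should agree with `PCA.components_` up to the sign of each row. The random input matrix, the helper names, and in particular the conversion of loadings to percentages (absolute value divided by the row sum) are assumptions for illustration; the text does not spell out that formula.

```python
import numpy as np


def pca_components(X, n_components):
    """Principal axes of X, one row per component (unit length, mutually orthogonal)."""
    X_centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return vt[:n_components]


def influence_percent(components):
    """Assumed conversion: share of each word's absolute loading within a component."""
    magnitude = np.abs(components)
    return 100.0 * magnitude / magnitude.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
X = rng.random((20, 6))            # stand-in for a 20-document, 6-word TF-IDF matrix
components = pca_components(X, 3)  # shape: (3 features, 6 words)
percent = influence_percent(components)

# Feature names built from the text variable name, as in the naming example.
variable_name = "text"
feature_names = [f"{variable_name}_PCA_{i + 1}" for i in range(3)]
```

Each row of `percent` sums to 100, so the entry for a word can be read as that word's share of influence on the corresponding feature.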

As an example, the text feature extraction device 120 may receive the text data 110. The text data 110 may include text such as that shown in Table 5. For example, the text data 110 may include court judgment data.

[Table 5]
precedentText

<Applicant, Petitioner> <Respondent> <Judgment Below> Daegu High Court, judgment of 1981.3.31, case 80르93. <Order> The application for leave to appeal is dismissed. <Reasons> The application for leave to appeal provided for in Article 12 of the Act on Special Cases Concerning Expedition of Legal Proceedings applies to civil litigation cases, and it is clear in light of Article 2 of the same Act that it does not apply to family adjudication cases such as this one (the applicant has also filed an appeal with this Court as case 81므43). Therefore this application is unlawful and is dismissed, and it is so decided as in the Order by the unanimous opinion of the participating justices.

<Defendant> <Appellant> Defendant <Judgment Below> Daegu District Court, judgment of 1990.2.2, case 89노1675. <Order> The appeal is dismissed. <Reasons> The Court considers the defendant's grounds of appeal. Article 66(1) of the State Public Officials Act cannot be regarded as a law that violates Article 11(1), Article 21(1), Article 31(4), Article 33, or Article 37(2) of the Constitution (see this Court's judgment of 1990.4.10, case 90도332), so the argument is without merit. Therefore the participating justices unanimously agree to dismiss the defendant's appeal, and judgment is rendered as in the Order.

<Debtor> Goseong Dain Laser Co., Ltd. <Administrator> <Order> The rehabilitation proceedings in this case are discontinued. <Reasons> The rehabilitation plan submitted by the administrator on 2010.5.31 and amended with permission on 2010.7.26 failed to obtain the consent required for approval under Article 237 of the Debtor Rehabilitation and Bankruptcy Act at the meeting of interested parties held on 2010.7.26 to vote on the plan, and was therefore rejected. Accordingly, this Court decides as in the Order pursuant to Article 286(1)2 of the Debtor Rehabilitation and Bankruptcy Act.

<Defendant> <Appellant> Prosecutor <Prosecutor> Bae Seong-hun <Defense Counsel> Judicial trainee Jo Min-geun (court-appointed) <Judgment Below> Seoul Southern District Court, judgment of 2011.4.15, case 2010고단4002. <Order> The prosecutor's appeal is dismissed. <Reasons> As grounds for appeal, the prosecutor argues that the statute of limitations for prosecution was suspended while the defendant stayed in Japan; however, as the court below found in its reasons, the statute of limitations cannot be regarded as suspended. The prosecutor's appeal is therefore without merit and is dismissed pursuant to Article 364(4) of the Criminal Procedure Act.

<Applicant> Head of the Dongdaegu Tax Office <Respondent> Kim Gyeong-jo <Judgment Below> Daegu High Court, judgment of 1981.2.10, case 80구200. <Order> The application for leave to appeal is dismissed. <Reasons> The application for leave to appeal provided for in Article 12 of the Act on Special Cases Concerning Expedition of Legal Proceedings applies to civil litigation cases, and it is clear in light of Article 2 of the same Act that it does not apply to administrative litigation cases such as this one. Therefore this application, which seeks leave to appeal in an administrative litigation case, is unlawful and is dismissed, and it is so decided as in the Order by the unanimous opinion of the participating justices.

<Re-appellant> Lee Sang-hak <Court Below> Seoul High Court, judgment of 1960.10.6, cases 60민공178 and 179. <Reasons> Examining the matter ex officio: under Article 34 of the Auction Act, in a real estate auction under that Act, unlike a compulsory auction of real estate, bidding may be ordered in place of an auction only upon application and cannot be ordered ex officio. Nevertheless, according to the record, the auction court ordered bidding ex officio without any application from an interested party and approved the successful bid in this case, which cannot but be held unlawful, and the decision of the court below, which overlooked this illegality, is likewise unlawful.

<Applicants> <Respondent> Minister of Education and Human Resources Development (Counsel: Attorney Kim Jong-in) <Order> All applications in this case are dismissed. <Relief Sought> The effect of the respondent's disposition of 2004.2.5 revoking approval of the appointment as officers of applicants Ryu Jeong-hui, Son Byeong-gi, Bae In-ho, Yun Eun-hyeon, Jeong Seong-gyun, and Lee Jin-bok, and of its disposition of the same day appointing non-parties Seo Jeong-mun, Lee Jeong-do, Jeong In-gi, Kim Byeong-chan, Kim Jong-chan, Kang Yeong-sin, and Jeong Gi-o as temporary directors of the school foundation Yusin Hagwon, shall be suspended until judgment is rendered in this Court's case 2004구합5751. <Reasons> All applications in this case are without merit, and it is so decided as in the Order.

<Applicant> Sin Bun-sik <Respondent> Head of Nam-gu Office, Busan Directly Governed City <Judgment Below> Daegu High Court, judgment of 1982.8.17, case 81구120. <Order> The application for leave to appeal is dismissed. <Reasons> The application for leave to appeal provided for in Article 12 of the Act on Special Cases Concerning Expedition of Legal Proceedings applies to civil litigation cases, and it is clear in light of Article 2 of the same Act that it does not apply to administrative litigation cases such as this one. Therefore this application, which seeks leave to appeal in an administrative litigation case, is unlawful and is dismissed, and it is so decided as in the Order by the unanimous opinion of the participating justices.

<Plaintiffs, Appellees> Yu Sang-rok and one other <Defendant, Appellant> Director of the Property Custody Bureau of Seoul Special City <Intervenor Assisting the Defendant, Appellant> Ju Gil-hwan <Judgment Below> First instance, Seoul High Court, judgment of 1956.8.16, case 56행24. <Reasons> According to the record, the original judgment found that the plaintiffs were the first to apply to the defendant for a lease of the land in question, which was in an unmanaged state, and held that the plaintiffs held a preferential right of connection to the land. However, that fact alone cannot give rise to such a right over vested property, so the original judgment is unlawful in that it misinterpreted the law.

<Defendant> <Appellant> Defendant <Judgment Below> Seoul Criminal District Court, judgment of 1991.7.24, case 91노2429. <Order> The appeal is dismissed. <Reasons> The Court considers the grounds of appeal. According to the evidence cited in the first-instance judgment adopted by the court below, it can be found that the defendant painted over an advertising signboard owned by another person with white paint and erased the advertising text. On those facts the conduct constitutes the crime of destruction of property, so the judgment below contains no misapprehension of legal principle or violation of the rules of evidence. Therefore the appeal is dismissed, and judgment is rendered as in the Order by the unanimous opinion of the participating justices.

The word extraction processor 130 may perform morphological analysis on the judgment data 'precedentText' shown in Table 5 and then extract key words using the dictionaries. The feature extraction execution unit 140 may vectorize the extracted key words with TF-IDF and then apply principal component analysis to the vectors to extract three features. Here, the number of judgments used may be 20,000, the user dictionary may contain 9,400 statutory terms, the noun dictionary 317,151 words, the non-noun dictionary 9,682 words, the personal-name dictionary 34,675,174 words, and the place-name dictionary 50,675 words.

In addition, the word candidate extraction unit 133 extracted 104,887 words, some of which are shown below.

('“Feraud"부분', '건축사업무정지처분은', '"디자인학원경영업"', '권리범위확인심판청구', '거절사정불복항고심판', '판단한다.원판결이유', '한국도시개발주식회사', '1966.11.23', '제95조,제100조', '하천부지점용허가신청', '없이만연', '지점소재', '중임등기', '충당사용', '입금교부', '조사평가', '대여할수', …, '해태사업', '행복동인', '당해영업', '시정가능', '가령식품', '청문장소', '5일전인', '예상손해', '계약번호', '기득임차', '대지29', '197평', '허가된다', '이의제출', '단일채무')

In addition, the key word extraction unit 134 extracted 101,625 words, some of which are shown below.

('가감', '가검역증', '가격결정', '가격결정방법', '가격기준', '가격동향', '가격변동', '가격변동지수', '가격비교', '가격사정', '가격수준', '가격시점', '가격안정', '가격정보', …, '희망가액', '희망돼지', '희망돼지저금통', '희망백화점', '희망신청', '희망신청서', '희박하므', '히로뽕성분', '히로뽕제조', '히로뽕제조원료', '히로뽕주사')

FIGS. 4A to 4C visualize the list of words used in the features extracted from second text data, as produced by the key word visualization device for features according to an embodiment of the present application.

According to an embodiment of the present application, the second text data may be the judgment data 'precedentText'. For the judgment data shown in Table 5, the key word visualization device 150 may visualize, as a word cloud in the shape of a Venn diagram, which words exerted the most influence on the features that were extracted by applying principal component analysis to the vectorized words obtained through morphological analysis, word extraction, and word vectorization. The morphological analysis, word extraction, word vectorization, and principal component analysis may be performed by the text feature extraction device 120. The text feature extraction device 120 may perform morphological analysis on the judgment data 'precedentText', extract key words, vectorize them with TF-IDF, and then apply principal component analysis to the vectors to extract three features. Here, the number of judgments used may be 20,000 and the number of words used in feature extraction may be 13,185. In addition, the key word visualization device 150 may separately visualize the sets of words whose importance is at least 0.01%, 0.05%, 0.1%, 0.5%, and 1%, so that, as shown in FIGS. 4A to 4C, it is possible to see intuitively which words had a high influence at each importance reference value.

As an example, (a) of FIG. 4A may visualize, as a word cloud in the shape of a Venn diagram, the words in the second text data whose feature influence is 0.01% or more. Likewise, (b) of FIG. 4A may show the words with a feature influence of 0.05% or more, (c) of FIG. 4B those with 0.1% or more, (d) of FIG. 4B those with 0.5% or more, and (e) of FIG. 4C those with 1% or more.
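One way to prepare the data for such a Venn-shaped word cloud is to keep, for each reference value, only the words that reach it and to group them by the set of features in which they do so. The sketch below assumes that the overlapping regions of the Venn diagram hold words that are influential in more than one feature; that reading, the helper name `venn_regions`, and the sample importance values are all hypothetical.

```python
def venn_regions(importance, threshold):
    """Group words by the set of features in which their importance (%) reaches the threshold."""
    features_per_word = {}
    for feature, words in importance.items():
        for word, pct in words.items():
            if pct >= threshold:
                features_per_word.setdefault(word, set()).add(feature)
    regions = {}
    for word, features in features_per_word.items():
        regions.setdefault(frozenset(features), []).append(word)
    return regions


# Hypothetical importance values (percent) for three words across three features.
importance = {
    "text_PCA_1": {"상고": 12.0, "법원": 0.4, "판결": 0.03},
    "text_PCA_2": {"상고": 0.2, "법원": 8.0, "판결": 0.02},
    "text_PCA_3": {"상고": 0.008, "법원": 0.6, "판결": 30.0},
}

# One grouping per reference value used in the figures.
regions_by_threshold = {t: venn_regions(importance, t) for t in (0.01, 0.05, 0.1, 0.5, 1)}
```

Raising the reference value empties the overlapping regions first in this sample: at 0.1% one word sits in all three features, whereas at 1% every word belongs to a single feature.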

FIGS. 5A to 5C visualize the list of words used in the features extracted from third text data, as produced by the key word visualization device for features according to an embodiment of the present application. According to an embodiment of the present application, the third text data may include Naver movie review data. The text feature extraction device 120 may apply principal component analysis to the vectorized words obtained from the third text data through morphological analysis, word extraction, and word vectorization. The third text data may be data generated by collecting 15,931 reviews registered on Naver movie reviews. The text feature extraction device 120 may generate a user dictionary from keywords, entered directly by the user, that indicate positive or negative sentiment in a review. The user dictionary may contain a total of 393 words. These include not only nouns but also non-noun parts of speech such as adjectives and verbs, and may include neologisms (for example, 허니잼 and 예스잼). For noun extraction, the text feature extraction device 120 used all of its in-house noun dictionary, non-noun dictionary, personal-name dictionary, and place-name dictionary, and after the user dictionary was also applied, the total number of extracted words was 2,308.

The key word visualization device 150 may generate the distribution of word counts per feature shown in Table 6. In other words, Table 6 gives, for the third text data, the number of words of each feature that fall in each importance interval. The highest-importance words of the first feature (review_PCA_1) may include 관람 (viewing, 47%), 관람객 (viewer, 47%), 재밌 (fun, 1%), and 영화 (movie, 1%). The key words of the second feature (review_PCA_2) may include 영화 (movie, 74%), 재밌 (fun, 7%), 연기 (acting, 3%), 너무 (too, 2%), and 관람 (viewing, 1%). The key words of the third feature (review_PCA_3) were 재밌 (fun, 39%), 재미 (fun, 36%), 재미있 (is fun, 19%), 영화 (movie, 3%), and 재미없 (not fun, 1%), which shows that words related to 'movie, fun, and viewing' had high importance.

Table 6

| Importance interval (%) | review_PCA_1 | review_PCA_2 | review_PCA_3 |
|---|---|---|---|
| 0 to less than 0.001 | 2120 | 2138 | 2228 |
| 0.001 to less than 0.005 | 139 | 88 | 43 |
| 0.005 to less than 0.01 | 18 | 27 | 12 |
| 0.01 to less than 0.05 | 20 | 26 | 15 |
| 0.05 to less than 0.1 | 4 | 11 | 2 |
| 0.1 to less than 0.3 | 1 | 9 | 1 |
| 0.3 to less than 0.5 | 1 | 2 | 2 |
| 0.5 to less than 1 | 1 | 2 | 0 |
| 1 to less than 10 | 2 | 4 | 2 |
| 10 to less than 20 | 0 | 0 | 1 |
| 20 to less than 30 | 0 | 0 | 0 |
| 30 to less than 40 | 0 | 0 | 2 |
| 40 to less than 50 | 2 | 0 | 0 |
| 50 or more | 0 | 1 | 0 |
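The per-interval word counts of Table 6 can be reproduced by binning each word's importance percentage. The following is a minimal sketch, not the patented implementation: the function name `interval_counts` and the sample values are hypothetical, and it assumes each interval includes its lower bound and excludes its upper bound, as the interval labels suggest.

```python
from bisect import bisect_right

# Lower bounds of the importance intervals (percent), as listed in Table 6.
EDGES = [0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 1, 10, 20, 30, 40, 50]

def interval_counts(importances_pct):
    """Count words per interval [EDGES[i], EDGES[i+1]); the last interval is open-ended."""
    counts = [0] * len(EDGES)
    for value in importances_pct:
        if value < 0:
            raise ValueError("importance percentages must be non-negative")
        # bisect_right finds the first edge greater than value; the interval is one to the left.
        counts[bisect_right(EDGES, value) - 1] += 1
    return counts

# Hypothetical importances (percent) of eight words for a single feature.
sample = [47.0, 47.0, 1.0, 1.0, 0.2, 0.004, 0.0005, 0.0]
counts = interval_counts(sample)
for lower, count in zip(EDGES, counts):
    print(f"from {lower}%: {count} word(s)")
```

Because every word falls in exactly one interval, the counts of a feature always add up to the vocabulary size, which is why each column of Table 6 sums to 2,308.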

For example, FIG. 5A(a) may visualize, as a Venn-diagram-shaped word cloud, the words in the third text data whose feature influence is 0.001% or more. Likewise, FIG. 5B(b) may show the words with a feature influence of 0.005% or more, FIG. 5B(c) those with 0.01% or more, FIG. 5B(d) those with 0.05% or more, FIG. 5C(e) those with 0.1% or more, and FIG. 5C(f) those with 0.5% or more, each rendered in the same way.

FIGS. 6A to 6C visualize the list of words used in the features extracted from fourth text data by the feature key word visualization device according to an embodiment of the present application.

According to an embodiment of the present application, the fourth text data may be Naver news title data. The feature key word visualization device 150 may visualize, as a Venn-diagram-shaped word cloud, which words had the greatest influence on the features extracted by applying principal component analysis to the vectorized words obtained from the fourth text data through morphological analysis, word extraction, and word vectorization. The morphological analysis, word extraction, word vectorization, and principal component analysis of the vectorized words may be performed by the text feature extraction device 120. The text feature extraction device 120 may extract key words from the fourth text data after morphological analysis, vectorize those words using TF-IDF, and then apply principal component analysis to the vectors to extract three features. In this case, the text feature extraction device 120 may extract nouns using only the noun dictionary, non-noun dictionary, person-name dictionary, and place-name dictionary, without using a user dictionary. The fourth text data used here may include data generated by collecting 96,000 Naver news titles. The text feature extraction device 120 may extract a total of 5,324 words after the user dictionary is applied.
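The TF-IDF vectorization step mentioned above can be sketched as follows. The source does not state which TF-IDF variant is used, so this assumes one common form (raw term count multiplied by the natural log of the inverse document frequency); the function name `tfidf_matrix` and the three sample titles are hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, rows), with one TF-IDF row per tokenized document."""
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    idf = {word: math.log(n_docs / doc_freq[word]) for word in vocab}
    rows = []
    for doc in docs:
        term_count = Counter(doc)
        rows.append([term_count[word] * idf[word] for word in vocab])
    return vocab, rows

# Hypothetical news titles, already reduced to key words.
docs = [
    ["날씨", "미세먼지", "서울"],
    ["날씨", "폭염", "전국"],
    ["회담", "정상", "트럼프"],
]
vocab, rows = tfidf_matrix(docs)
print(len(vocab), "words,", len(rows), "documents")
```

Each row can then be passed to principal component analysis; words that occur in every document get an IDF of zero under this variant and therefore cannot contribute to any feature.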

The feature key word visualization device 150 may generate the distribution of word counts for each feature as shown in Table 7. In other words, Table 7 gives, for the fourth text data, the number of words of each feature that fall in each importance interval. The first feature (text_PCA_1) may include the key words 날씨 (weather, 26%), 먼지 (dust, 18%), 미세 (fine, 17%), 미세먼지 (fine dust, 17%), 내일 (tomorrow, 3%), 전국 (nationwide, 3%), 더위 (heat, 2%), and 서울 (Seoul, 1%). The key words of the second feature (text_PCA_2) may include 회담 (talks, 37%), 정상 (summit, 18%), 트럼프 (Trump, 12%), 북미 (North Korea-US, 12%), 남북 (inter-Korean, 5%), and 대통령 (president, 2%). The third feature (text_PCA_3) may include the key words 날씨 (weather, 19%), 미세 (fine, 13%), 먼지 (dust, 13%), 미세먼지 (fine dust, 13%), 서울 (Seoul, 6%), 한국 (Korea, 5%), 전국 (nationwide, 4%), 폭염 (heat wave, 4%), 내일 (tomorrow, 3%), 더위 (heat, 3%), and 한파 (cold wave, 1%). Thus, weather-related words had high importance in text_PCA_1 and text_PCA_3, while words related to the North Korea-US summit and the inter-Korean summit had high importance in text_PCA_2. In addition, the feature key word visualization device 150 may visualize and provide the collections of words whose importance is at least 0.01%, 0.05%, 0.1%, 0.5%, and 1%, respectively, so that, as illustrated in FIGS. 4A to 4C, one can intuitively see which words had a high influence for each reference value of importance.
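Selecting the word collection for each reference value amounts to a simple threshold filter. A minimal sketch follows, using the text_PCA_1 percentages quoted above as input; the function name `words_at_or_above` is hypothetical, and the rendering itself is left out.

```python
# Importance (percent) of the key words of text_PCA_1, as quoted in the description.
text_pca_1 = {
    "날씨": 26, "먼지": 18, "미세": 17, "미세먼지": 17,
    "내일": 3, "전국": 3, "더위": 2, "서울": 1,
}

# Reference values (percent) named in the description.
THRESHOLDS = [0.01, 0.05, 0.1, 0.5, 1]

def words_at_or_above(importance_pct, threshold):
    """Return the words whose importance is at least `threshold`, most important first."""
    selected = [word for word, value in importance_pct.items() if value >= threshold]
    return sorted(selected, key=lambda word: -importance_pct[word])

word_sets = {t: words_at_or_above(text_pca_1, t) for t in THRESHOLDS}
for threshold, words in word_sets.items():
    print(f">= {threshold}%: {len(words)} word(s)")
```

Only the eight top words are listed here, so every reference value keeps all of them; on the full vocabulary, raising the reference value shrinks the word cloud toward the dominant terms.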

Table 7

| Importance interval (%) | text_PCA_1 | text_PCA_2 | text_PCA_3 |
|---|---|---|---|
| 0 to less than 0.001 | 4848 | 4699 | 4849 |
| 0.001 to less than 0.005 | 286 | 383 | 256 |
| 0.005 to less than 0.01 | 52 | 88 | 56 |
| 0.01 to less than 0.05 | 80 | 102 | 84 |
| 0.05 to less than 0.1 | 12 | 20 | 25 |
| 0.1 to less than 0.3 | 25 | 18 | 32 |
| 0.3 to less than 0.5 | 10 | 8 | 5 |
| 0.5 to less than 1 | 3 | 0 | 6 |
| 1 to less than 10 | 4 | 2 | 7 |
| 10 to less than 20 | 3 | 3 | 4 |
| 20 to less than 30 | 1 | 0 | 0 |
| 30 to less than 40 | 0 | 1 | 0 |
| 40 to less than 50 | 0 | 0 | 0 |
| 50 or more | 0 | 0 | 0 |

For example, FIG. 6A(a) may visualize, as a Venn-diagram-shaped word cloud, the words in the fourth text data whose feature influence is 0.001% or more. Likewise, FIG. 6B(b) may show the words with a feature influence of 0.005% or more, FIG. 6B(c) those with 0.01% or more, FIG. 6B(d) those with 0.05% or more, FIG. 5C(e) those with 0.1% or more, and FIG. 5C(f) those with 0.5% or more, each rendered in the same way.

FIGS. 7A to 7C visualize the list of words used in the features extracted from fifth text data by the feature key word visualization device according to an embodiment of the present application. According to an embodiment, the fifth text data may include court judgment data, and may contain 20,000 judgments. The feature key word visualization device 150 may visualize, as a Venn-diagram-shaped word cloud, which words had the greatest influence on the features extracted by applying principal component analysis to the vectorized words obtained from the fifth text data through morphological analysis, word extraction, and word vectorization. The morphological analysis, word extraction, word vectorization, and principal component analysis of the vectorized words may be performed by the text feature extraction device 120. The text feature extraction device 120 may extract key words from the fifth text data after morphological analysis, vectorize those words using TF-IDF, and then apply principal component analysis to the vectors to extract three features.

For the user dictionary, the text feature extraction device 120 may select a dictionary of legal terms compiled from the '2015 한영법령 표준용어집' (a 2015 Korean-English standard glossary of statutory terms) produced by the Korea Legislation Research Institute, and then perform noun extraction. The user dictionary source contains 9,400 statutory terms, 1,965 statute names, and 3,809 institution names and job categories; of these, the 9,400 legal terms may be entered into the user dictionary for feature extraction. For noun extraction, the device's own noun dictionary, non-noun dictionary, person-name dictionary, and place-name dictionary were all used. After the user dictionary was also applied, a total of 13,185 words were extracted. In addition, the feature key word visualization device 150 may visualize and provide the collections of words whose importance is at least 0.01%, 0.05%, 0.1%, 0.5%, and 1%, respectively, so that, as illustrated in FIGS. 7A to 7C, one can intuitively see which words had a high influence for each reference value of importance.

The feature key word visualization device 150 may generate the distribution of word counts for each feature as shown in Table 8. In other words, Table 8 gives, for the fifth text data, the number of words of each feature that fall in each importance interval. The first feature (precedentText_PCA_1) may include the key words 결정 (decision, 34%), 상고 (appeal, 33%), 경매 (auction, 4%), 상고인 (appellant, 3%), 상고이유 (grounds of appeal, 2%), 부담 (burden, 2%), 비용 (cost, 2%), 사실 (fact, 1%), and 경락 (successful auction bid, 1%).

The second feature (precedentText_PCA_2) may include the key words 고의 (intent, 11%), 소송 (litigation, 9%), 소인 (count, 8%), 부담 (burden, 5%), 비용 (cost, 5%), 사실 (fact, 3%), 소송대리 (litigation representation, 3%), 소송대리인 (litigation representative, 3%), 대리인 (representative, 3%), 대리 (representation, 3%), 구금 (detention, 3%), 구금일수 (days of detention, 3%), 산입 (inclusion, 3%), 검사 (prosecutor, 2%), 변론 (pleading, 2%), 변론종결 (close of pleadings, 2%), 소비 (consumption, 2%), 감호 (protective custody, 2%), 민사소송 (civil litigation, 2%), 범죄 (crime, 2%), 공소 (indictment, 2%), 결의 (resolution, 1%), 상고인 (appellant, 1%), and 소송비용 (litigation costs, 1%).

The third feature (precedentText_PCA_3) may include the key words 소인 (count, 15%), 상고 (appeal, 12%), 결정 (decision, 9%), 상고인 (appellant, 4%), 변론종결 (close of pleadings, 4%), 소비 (consumption, 4%), 결의 (resolution, 3%), 감호 (protective custody, 3%), 변론 (pleading, 3%), 경매 (auction, 2%), 기재 (entry, 2%), 감호청구 (request for protective custody, 2%), 본문 (main text, 1%), 구금 (detention, 1%), 구금일수 (days of detention, 1%), 등기 (registration, 1%), 산입 (inclusion, 1%), 소유 (ownership, 1%), 소송비용 (litigation costs, 1%), 공소 (indictment, 1%), 고의 (intent, 1%), and 범죄 (crime, 1%). This shows that few words tied to a specific judgment appear, and that commonly used words had high importance overall.

Table 8

| Importance interval (%) | precedentText_PCA_1 | precedentText_PCA_2 | precedentText_PCA_3 |
|---|---|---|---|
| 0 to less than 0.001 | 12350 | 12155 | 12486 |
| 0.001 to less than 0.005 | 392 | 521 | 362 |
| 0.005 to less than 0.01 | 120 | 157 | 90 |
| 0.01 to less than 0.05 | 201 | 219 | 138 |
| 0.05 to less than 0.1 | 53 | 39 | 30 |
| 0.1 to less than 0.3 | 27 | 51 | 39 |
| 0.3 to less than 0.5 | 15 | 16 | 13 |
| 0.5 to less than 1 | 12 | 12 | 9 |
| 1 to less than 10 | 13 | 12 | 17 |
| 10 to less than 20 | 1 | 3 | 0 |
| 20 to less than 30 | 1 | 0 | 0 |
| 30 to less than 40 | 0 | 0 | 0 |
| 40 to less than 50 | 0 | 0 | 1 |
| 50 or more | 0 | 0 | 0 |

For example, FIG. 7A(a) may visualize, as a Venn-diagram-shaped word cloud, the words in the fifth text data whose feature influence is 0.001% or more. Likewise, FIG. 7B(b) may show the words with a feature influence of 0.005% or more, FIG. 7B(c) those with 0.01% or more, FIG. 7B(d) those with 0.05% or more, FIG. 5C(e) those with 0.1% or more, and FIG. 5C(f) those with 0.5% or more, each rendered in the same way.

FIG. 9 is a schematic diagram outlining the feature extraction process of the text feature extraction device according to an embodiment of the present application.

Referring to FIG. 9 by way of example, the text feature extraction device 120 may receive text data 110 and extract nouns from it. From the extracted nouns, the text feature extraction device 120 may generate Hangul noun candidates of two or more characters. From these candidates, it may extract the nouns that are in the dictionary, that is, nouns contained in its own noun dictionary. It may also filter the candidates that are not in its own dictionary. This filtering may be performed using the non-noun dictionary, the person-name dictionary, and the place-name dictionary: the text feature extraction device 120 removes from the noun candidates any entry contained in one of these three dictionaries. The text feature extraction device 120 may then generate a key noun list from the nouns in the user dictionary, the nouns in its own noun dictionary, and the noun candidates that passed the filtering. In doing so, it may place the nouns (words) from the user dictionary at the highest priority, followed by the nouns (words) from the noun dictionary. The text feature extraction device 120 may vectorize the key noun list by applying a text analysis algorithm (for example, TF-IDF) to it, and may extract a preset number (for example, three) of features from the vectorized nouns.
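The dictionary-based construction of the key noun list can be sketched as below. This is an illustration under assumptions, not the patented implementation: the function name `build_key_noun_list` and all dictionary contents are hypothetical, user-dictionary words are only kept when they occur among the candidates, and "Hangul" is taken to mean precomposed syllables in the range 가 to 힣.

```python
import re

# Two or more precomposed Hangul syllables.
HANGUL_2PLUS = re.compile(r"[가-힣]{2,}")

def build_key_noun_list(nouns, user_dict, noun_dict, non_noun_dict, person_dict, place_dict):
    """Order candidates as: user-dictionary words, noun-dictionary words, filtered unknown words."""
    # Deduplicate while keeping first-seen order, then keep Hangul candidates of 2+ characters.
    candidates = [n for n in dict.fromkeys(nouns) if HANGUL_2PLUS.fullmatch(n)]
    excluded = non_noun_dict | person_dict | place_dict
    from_user = [n for n in candidates if n in user_dict]
    from_noun_dict = [n for n in candidates if n in noun_dict and n not in user_dict]
    # Candidates found in no noun dictionary survive only if no exclusion dictionary lists them.
    remaining = [n for n in candidates
                 if n not in user_dict and n not in noun_dict and n not in excluded]
    return from_user + from_noun_dict + remaining

# Hypothetical extracted nouns and dictionaries.
nouns = ["영화", "허니잼", "영화", "것", "서울", "김철수", "재미", "너무", "관람객", "abc"]
key_nouns = build_key_noun_list(
    nouns,
    user_dict={"허니잼"},
    noun_dict={"영화", "재미"},
    non_noun_dict={"너무"},
    person_dict={"김철수"},
    place_dict={"서울"},
)
print(key_nouns)
```

Putting user-dictionary words first reflects the priority order described above, so domain terms such as review slang or legal vocabulary are never displaced by generic dictionary nouns.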

The operation flow of the present application is briefly described below, based on the details given above.

FIG. 10 is an operation flowchart of a method for visualizing key words of a feature according to an embodiment of the present application.

The method for visualizing key words of a feature shown in FIG. 10 may be performed by the feature key word visualization device 150 described above. Accordingly, even where details are omitted below, the description given for the feature key word visualization device 150 applies equally to the method.

In step S101, the feature key word visualization device 150 may receive importance values, with respect to a target variable, of a plurality of features extracted from text data using principal component analysis.

In step S102, the feature key word visualization device 150 may set a reference value associated with the importance values based on user input information.

In step S103, the feature key word visualization device 150 may extract, for the plurality of features, the degree of influence that each word vector had on each feature.

In step S104, the feature key word visualization device 150 may convert the influence values extracted for each word vector and each feature into importance percentages.

In step S105, the feature key word visualization device 150 may visualize the importance of each word and information on the extracted features in the form of a Venn diagram graph.
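Steps S103 to S105 can be illustrated with a small sketch. The description does not give the exact formulas, so two assumptions are made here: a word's importance for a feature is the absolute value of its principal component loading divided by the sum of absolute loadings of that feature, and a word is placed in the Venn region of exactly those features in which it reaches the reference value. The names `to_percent` and `venn_regions` and the loading values are hypothetical.

```python
def to_percent(loadings):
    """Convert one feature's word loadings to percentages of the total absolute loading."""
    total = sum(abs(value) for value in loadings.values())
    if total == 0:
        return {word: 0.0 for word in loadings}
    return {word: 100 * abs(value) / total for word, value in loadings.items()}

def venn_regions(percent_by_feature, threshold):
    """Group words by the exact set of features in which they reach the threshold."""
    features_of_word = {}
    for feature, percentages in percent_by_feature.items():
        for word, value in percentages.items():
            if value >= threshold:
                features_of_word.setdefault(word, set()).add(feature)
    regions = {}
    for word, features in features_of_word.items():
        regions.setdefault(frozenset(features), []).append(word)
    return regions

# Hypothetical loadings of three words on two features.
loadings = {
    "text_PCA_1": {"날씨": 0.5, "먼지": -0.25, "회담": 0.25},
    "text_PCA_2": {"날씨": 0.125, "먼지": 0.125, "회담": 0.75},
}
percent = {feature: to_percent(values) for feature, values in loadings.items()}
regions = venn_regions(percent, threshold=20)
for features, words in regions.items():
    print(sorted(features), words)
```

A word that reaches the reference value in several features lands in their overlap, which is what the Venn-diagram-shaped word cloud is meant to convey; font size and color could then be derived from the same percentages.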

Although not shown in the drawings, before step S101 the feature key word visualization device 150 may receive information on the plurality of features extracted by the text feature extraction device 120.

According to an embodiment of the present application, the text feature extraction device 120 may receive text data 110, for example from an external server (not shown) or from a user terminal (not shown). The text feature extraction device 120 may then analyze the morphemes in the text data 110 by applying a morphological analysis algorithm to the text data 110, which consists of sentences. Morphological analysis is the first step of natural language analysis and refers to converting an input string into a sequence of morphemes. A morpheme is the minimal unit of meaning, that is, the smallest meaningful element that cannot be analyzed further; it is a word or part of a word that carries a grammatical or relational meaning. Korean morphological analysis may involve preprocessing, grammatical morpheme separation, substantive analysis, predicate analysis, single-morpheme analysis, compound word estimation, handling of omitted particles, handling of contractions, and post-processing. The analysis of the text data may be based on a grammatical morpheme dictionary, a lexical morpheme dictionary, a technical term dictionary, a user-defined dictionary, a pre-analyzed dictionary, and the like. For example, the text feature extraction device 120 may perform morphological analysis using konlpy.

The text feature extraction device 120 may then extract key words from the morphologically analyzed text data, vectorize the extracted key words by applying a text analysis algorithm to them, and extract features by applying principal component analysis to the vectorized key words. Finally, the text feature extraction device 120 may provide the extracted features to the feature key word visualization device 150.
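The final stage of this pipeline, extracting a preset number of features from vectorized key words, can be sketched with principal component analysis computed through a singular value decomposition. This is a minimal illustration that assumes NumPy is available; the function name `pca_features` and the small TF-IDF matrix are hypothetical, and morphological analysis is assumed to have been done beforehand.

```python
import numpy as np

def pca_features(matrix, n_components=3):
    """Project documents onto the top principal components; return (scores, components)."""
    x = np.asarray(matrix, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components

# Hypothetical TF-IDF matrix: 5 documents (rows) by 4 key words (columns).
tfidf = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.0, 0.2, 0.0],
    [0.0, 0.7, 0.0, 0.3],
    [0.1, 0.0, 0.9, 0.0],
    [0.0, 0.2, 0.1, 0.8],
]
scores, components = pca_features(tfidf, n_components=3)
print(scores.shape, components.shape)
```

Each row of `components` holds one loading per key word, and these loadings are what a per-word influence on a feature can be read from; each column of `scores` is one extracted feature value per document.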

In the above description, steps S101 to S105 may be further divided into additional steps or combined into fewer steps, depending on the embodiment of the present application. In addition, some steps may be omitted as needed, and the order of the steps may be changed.

The method for visualizing key words of a feature according to an embodiment of the present application may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

In addition, the method for visualizing key words of a feature described above may also be implemented as a computer program or application that is stored on a recording medium and executed by a computer.

The foregoing description of the present application is illustrative, and those of ordinary skill in the art to which the present application pertains will understand that it can easily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present application is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present application.

110: text data
120: text feature extraction device
130: word extraction processor
140: feature extraction execution unit
150: feature key word visualization device
160: visualization execution unit

Claims

A device for visualizing key words of features, comprising:
a feature information input unit that inputs importance values, with respect to a target variable, of a plurality of features extracted from text data using principal component analysis;
a reference importance input unit that sets a reference value associated with the importance values based on user input information;
a word-specific importance extracting unit that extracts, for the plurality of features, the degree of influence that each word vector has on each feature;
a word-by-word importance calculation unit that converts the values extracted by the word-specific importance extracting unit into importance expressed as a percentage; and
a visualization execution unit that visualizes the importance of each word and information on the extracted features in the form of a Venn diagram graph,
wherein, when the target variable is set, the feature information input unit calculates the degree of influence between the extracted features and the target variable and inputs the importance values for the target variable.
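
As a rough illustration of the word-by-word percentage conversion recited above, the following sketch treats the absolute PCA loading of each word on each principal component as its "degree of influence" and normalizes each component so the words sum to 100. This reading is an assumption; the claim does not fix how influence is measured. The function name `word_importance_percent`, the sample vocabulary, and the count matrix are hypothetical.

```python
import numpy as np

def word_importance_percent(X, n_features=2):
    """Return a (words x features) matrix of percentage contributions.

    X is a documents x words matrix (for example word counts); this input
    layout is an assumption made for illustration only.
    """
    Xc = X - X.mean(axis=0)                  # centre each word column
    # Rows of Vt are the principal axes; their entries are per-word loadings.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = np.abs(Vt[:n_features].T)     # shape: (n_words, n_features)
    # Normalise each feature column so that its words sum to 100 percent.
    return 100.0 * loadings / loadings.sum(axis=0, keepdims=True)

# Hypothetical vocabulary and document-word counts.
words = ["price", "delivery", "quality", "service"]
X = np.array([[3, 0, 1, 0],
              [2, 1, 0, 1],
              [0, 3, 1, 2],
              [1, 2, 0, 3]], dtype=float)

pct = word_importance_percent(X, n_features=2)
for j in range(pct.shape[1]):
    top = words[int(np.argmax(pct[:, j]))]
    print(f"feature {j}: most influential word = {top}")
```

Taking absolute values discards the sign of each loading, which suits a display of magnitude but loses the direction of the word's effect; a different normalization (for example squared loadings) would be equally consistent with the claim language.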

(Deleted)

The device of claim 1,
wherein the visualization execution unit determines an output position, an output size, and an output color according to the importance of each word and the frequency of the extracted features, and visualizes them in the form of a cloud.
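
One possible way to derive output size and color from word importance is sketched below. The linear scaling, the grey-level color scheme, and the name `cloud_style` are all assumptions for illustration; the claim does not specify a mapping, and the placement of words (output position) is left out of this sketch.

```python
def cloud_style(importance_pct, min_size=10, max_size=48):
    """Map word -> percentage importance to (font size, grey-level hex colour)."""
    lo = min(importance_pct.values())
    hi = max(importance_pct.values())
    span = (hi - lo) or 1.0              # avoid division by zero when all equal
    styles = {}
    for word, pct in importance_pct.items():
        t = (pct - lo) / span            # 0.0 for the least, 1.0 for the most important
        size = round(min_size + t * (max_size - min_size))
        level = round(200 * (1 - t))     # darker grey means more important
        styles[word] = (size, f"#{level:02x}{level:02x}{level:02x}")
    return styles

# Hypothetical percentages for three words.
styles = cloud_style({"price": 40.0, "delivery": 35.0, "quality": 25.0})
for word, (size, colour) in styles.items():
    print(word, size, colour)
```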

The device of claim 1, further comprising:
a word extraction processor that extracts key words from input text data using a plurality of dictionaries; and
a feature extraction execution unit that vectorizes the extracted key words by applying them to a text analysis algorithm and then extracts features of the text data using principal component analysis.
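
The extraction chain recited above (dictionary lookup, vectorization, principal component analysis) might be sketched as follows. Whitespace tokenization and raw word counts stand in for the morphological analysis and the unspecified "text analysis algorithm"; both are assumptions, as are the function names and the sample dictionaries and documents.

```python
import numpy as np

def extract_key_words(text, dictionaries):
    """Keep only tokens that appear in at least one of the dictionaries."""
    vocab = set().union(*dictionaries)
    return [tok for tok in text.lower().split() if tok in vocab]

def vectorize(docs_tokens):
    """Build a documents x words count matrix (a simple bag of words)."""
    vocab = sorted({t for toks in docs_tokens for t in toks})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(docs_tokens), len(vocab)))
    for row, toks in enumerate(docs_tokens):
        for t in toks:
            X[row, index[t]] += 1
    return vocab, X

def pca_features(X, k):
    """Project the centred count matrix onto its first k principal axes."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Hypothetical dictionaries and documents.
dictionaries = [{"delivery", "price"}, {"quality", "fast", "good", "poor"}]
docs = ["Fast delivery and good price",
        "Poor quality but fast delivery",
        "Good quality good price"]

tokens = [extract_key_words(d, dictionaries) for d in docs]
vocab, X = vectorize(tokens)
features = pca_features(X, k=2)
print(vocab)
print(features.shape)
```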

The device of claim 4,
wherein, when a variable name of the text data and the target variable are present, the feature information input unit inputs a variable importance value for the target variable based on the features extracted by the feature extraction execution unit.

A method for visualizing key words of features, each step of which is performed by a computer-implemented device for visualizing key words of features, the method comprising:
receiving importance values, with respect to a target variable, of a plurality of features extracted from text data using principal component analysis;
setting a reference value associated with the importance values based on user input information;
extracting, for the plurality of features, the degree of influence that each word vector has on each feature;
converting the extracted degrees of influence of each word vector on each feature into importance expressed as a percentage; and
visualizing the importance of each word and information on the extracted features in the form of a Venn diagram graph,
wherein the receiving of the importance values comprises, when the target variable is set, calculating the degree of influence between the extracted features and the target variable and inputting the importance values for the target variable.
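
The role of the reference value and the Venn diagram grouping can be illustrated with the sketch below. It assumes that the reference value acts as a threshold on feature importance and that a word belongs to a feature's circle when its percentage importance meets a cutoff; neither rule is stated in the claim, and `venn_regions`, `word_cutoff`, and the sample numbers are hypothetical.

```python
def venn_regions(feature_importance, word_pct, reference, word_cutoff=20.0):
    """Group words by the exact set of selected features they belong to."""
    # Keep only features whose importance reaches the user-set reference value.
    selected = [f for f, imp in feature_importance.items() if imp >= reference]
    # A word belongs to a feature if its percentage importance meets the cutoff.
    word_sets = {f: {w for w, p in word_pct[f].items() if p >= word_cutoff}
                 for f in selected}
    regions = {}
    for word in set().union(*word_sets.values()):
        key = frozenset(f for f in selected if word in word_sets[f])
        regions.setdefault(key, set()).add(word)
    return regions

# Hypothetical feature importances and per-feature word percentages.
feature_importance = {"PC1": 0.5, "PC2": 0.3, "PC3": 0.05}
word_pct = {
    "PC1": {"price": 45.0, "delivery": 30.0, "quality": 15.0, "service": 10.0},
    "PC2": {"price": 10.0, "delivery": 35.0, "quality": 40.0, "service": 15.0},
    "PC3": {"price": 25.0, "delivery": 25.0, "quality": 25.0, "service": 25.0},
}

regions = venn_regions(feature_importance, word_pct, reference=0.1)
for feats, ws in regions.items():
    print(sorted(feats), sorted(ws))
```

Each key of `regions` names one region of the diagram: a single-feature key is the part of a circle outside any overlap, and a multi-feature key is an intersection. A plotting library would then draw the circles and place the words accordingly.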