KR20100033585A

KR20100033585A - An automatic clustering method of word senses using a word relation graph

Info

Publication number: KR20100033585A
Application number: KR1020080092530A
Authority: KR
Inventors: 이수원; 민병국
Original assignee: 숭실대학교산학협력단
Priority date: 2008-09-22
Filing date: 2008-09-22
Publication date: 2010-03-31

Abstract

PURPOSE: An automatic clustering method of word senses using a word relation graph. Because the related words are drawn from words used in actual data, newly coined words and loanwords are also included in the sense clusters. CONSTITUTION: Nouns are extracted from web documents to build a word frequency table and a two-word co-occurrence frequency table. The mutual information between every two distinct words is calculated. A word relation graph is built from the related words whose mutual information is at or above a threshold, together with their connection information. The word relation graph is partitioned into many small sub-clusters. A cluster evaluation index is used to select the clustering result for the mutual information threshold. Among the related words, the keyword that most strongly expresses the meaning of each cluster is selected.

Description

An Automatic Clustering Method of Word Senses Using a Word Relation Graph

The present invention belongs to the field of information retrieval. It is an automatic word sense clustering method for resolving the word ambiguity that arises in thesaurus construction and information retrieval.

A word can have more than one meaning depending on the context in which it is used. This is called word ambiguity, and extracting the sense groups of a word first requires work on disambiguation.

Conventional disambiguation methods fall into three broad types: rule-based methods, which use rules derived from deep linguistic insight into natural language; knowledge-based methods, which use large bodies of knowledge such as machine-readable dictionaries; and corpus-based methods, which use statistical information obtained from a large corpus.

Corpus-based methods come in two kinds. Supervised learning extracts statistical information from a corpus that has been sense-tagged in advance, whereas unsupervised learning does not require a sense-tagged corpus.

More specifically, supervised learning performs relatively well, but it is of limited practical use because available sense-tagged corpora are extremely scarce and building them is difficult. Unsupervised learning, by contrast, uses the raw corpus as is, so fine-grained sense distinctions are difficult and accuracy is hard to guarantee. It is nevertheless practical, because it can serve as groundwork for building a sense-tagged corpus where none exists. Some unsupervised methods collect similar words by splitting a large corpus into nouns, verbs, and so on through morphological analysis and then, taking grammatical constraints into account, treating words that appear in the same document as similar in meaning. Other methods ignore grammatical constraints and use only the unordered surrounding words as the representation of a context, collecting similar words by clustering words on the statistical properties of those that appear in the same context.

Word clustering is an unsupervised machine learning method in which a machine, without human intervention, groups words of similar meaning. The features needed for word clustering can be extracted either from usage examples of a word or from its dictionary definition. Using usage examples as features rests on the assumption that "two words that appear in the same context are semantically similar." This method extracts word or arrangement information from a large corpus and uses it as features for clustering. However, words that are used relatively rarely also occur rarely in the corpus, which makes it hard to form feature vectors, and even related words may co-occur with low probability, which makes it hard for them to end up in the same cluster. Using dictionary definitions as features rests on the assumption that "similar words have similar definitions." Work along these lines formed word clusters from dictionary definitions and from the positions of headwords in an ontology, then used the clusters to resolve word ambiguity in information retrieval.

Methods for word clustering include similarity-based clustering and pairwise clustering. Similarity-based clustering places two words w1 and w2 in the same cluster if their distributions of co-occurrence with other words are similar, using a measure such as Jensen-Shannon divergence as the similarity. Pairwise clustering places w1 and w2 in the same cluster if the two words co-occur frequently, using a measure such as mutual information as the similarity.

[Document 1] 신사임, 최기선. "의미 경계의 현실화를 위한 공기정보의 자동 군집화" (Automatic clustering of co-occurrence information for realizing sense boundaries). Korean Institute of Information Scientists and Engineers Fall Conference, 2004, pp. 559-561. This work extracts feature words using TF-IDF, builds word vectors from the words that appear in the same document, and computes the similarity between words in the similarity-based clustering manner. Highly similar words are grouped into related-word groups by the k-means clustering algorithm. However, the number of clusters k must be specified in advance, and an ambiguous word is assigned to only one cluster, so words related to its other senses are pulled into that same cluster.

[Document 2] Tomohiko Sugimachi, Akira Ishino, Masayuki Takeda, Fumihiro Matsuo. A Method of Extracting Related Words Using Standardized Mutual Information. Lecture Notes in Computer Science, 2003, pp. 478-485. This work builds a word relation graph using mutual information and, for each word that is a cut vertex, takes the graph components it separates as the sense groups of that word. However, the lower the mutual information threshold that defines the word relation graph, the more edges there are, so cut vertices become harder to find. In addition, when ambiguous words are present in the graph, connections also form between subgraphs of different meanings, which makes cut vertices even harder to find.

Information retrieval over the Internet has become an indispensable part of daily life, and the sheer growth of the web as an information repository makes it hard to find high-quality information. Search engines such as Naver and Google are used for this purpose, but current text-matching search engines cannot understand the meaning of documents, so they return documents containing the user's query terms ranked by keyword-matching accuracy. Because this returns large numbers of documents that merely contain the same word, results that differ from the user's intent can be ranked at the top, making the desired information hard to find.

The present invention addresses a method of automatically extracting clusters of related words by sense, by clustering words using co-occurrence information obtained from a large corpus. The proposed method quantifies the association between words as mutual information and builds a word relation graph, then generates the initial clusters of the word relation graph using HEM_WRG, a graph partitioning algorithm defined in this invention that allows words to belong to more than one partition. The initial clusters are then formed into sense groups of related words by CHAMELEON, a graph-based clustering algorithm. In addition, a relative cluster evaluation index S_Dbw_WRG is defined and applied to evaluate the quality of the resulting clusters.

In the present invention the related words are drawn from words used in actual data, so newly coined words and loanwords are included automatically. Moreover, related words with strong semantic connections to one another form sense clusters, which reveals a variety of senses that are not listed in dictionaries.

When these results are reflected in search results, the mixed content returned by a search can be sorted by sense, which improves retrieval performance. The results can also be used very efficiently as a guideline when building a thesaurus bottom-up without a base thesaurus.

The present invention comprises seven steps for clustering related words. Step 1 extracts nouns from a large collection of web documents and creates a word frequency table and a two-word co-occurrence frequency table. Step 2 computes and normalizes the mutual information between every pair of distinct words. Step 3 builds, for each word, a word relation graph from the related words whose mutual information is at or above a threshold θ and from their connection information. Step 4 uses a graph partitioning algorithm to divide the word relation graph built in the preceding step into many small sub-clusters. The resulting sub-clusters serve as primitive sense clusters that cannot be divided further. For this purpose the invention claims the HEM_WRG algorithm, a modification of the HEM algorithm that reflects the ambiguity of words [Claim 1]. Step 5 clusters the sub-clusters with the CHAMELEON clustering algorithm. Step 6 uses a cluster evaluation index to select the optimal clustering result with respect to the mutual information threshold. For this purpose the invention claims the relative cluster evaluation index S_Dbw_WRG, which adapts the relative cluster evaluation index S_Dbw to the word relation graph [Claim 2]. Step 7 selects, among the related words in each cluster, the representative word that most strongly expresses the meaning of the cluster [Claim 3]. For the optimal clustering result obtained through these steps, the central word among the related words of each cluster is chosen and used to name the cluster. The overall flow of related-word clustering is shown in the [representative drawing].

Step 1: Creating the word frequency table and the co-occurrence frequency table

In step 1, computing the occurrence frequency of a word and the co-occurrence frequency of two words over the document set requires fixing the scope within which an occurrence is counted. The scope can range from a whole document down to a paragraph or a sentence delimited by a period. Each choice has advantages and disadvantages. With the document as the unit, many words co-occur, which can hinder the calculation of word association. With the sentence as the unit, the assumption that words appearing together are semantically related holds well, but the number of related words is too small to extract associations reliably. The scope should be chosen by examining the characteristics of the raw data, and should be wide enough to include sufficient related words while remaining, as far as possible, limited to a single meaning or topic. In the present invention, given the characteristics of the experimental data, the scope is the document and a word counts once per document. In other words, the frequency of a word equals the number of documents in which it appears.

The words extracted from the raw data are stored in the database as a word frequency table of (word, frequency) records. In addition, a co-occurrence frequency table over pairs of distinct words is created so that the association between words can be measured in the next step. Each record of the co-occurrence frequency table preferably consists of the i-th word w_i and the j-th word w_j among the N words, together with the number of documents f(w_i, w_j) in which both appear.

The co-occurrence frequency table has one record per combination of two words, so with N words in total it can contain at most N(N-1)/2 records. For large raw data the number of records is very large and the computation time grows with the vocabulary size, so the words used to build the co-occurrence frequency table need to be limited.

Measuring the association between words requires that a word appear in enough documents and co-occur with enough other words. The co-occurrence frequency table can therefore be built from only the top n words of the word frequency table. On the other hand, the m most frequent words tend to appear in many documents alongside many different words, so excluding them improves the accuracy of the association calculation. The frequency table and the co-occurrence frequency table can thus be built from n words, excluding the m most frequent. [Table 1] and [Table 2] are examples of the resulting word frequency table and co-occurrence frequency table, respectively.

[Table 1] Example word frequency table

Word                        Frequency
nationwide (전국)            4962
loan (대출)                  1395
...                         ...
same-day loan (당일대출)      210

[Table 2] Example co-occurrence frequency table

Word 1          Word 2                      Frequency
loan (대출)      same-day loan (당일대출)      109
loan (대출)      nationwide (전국)            57
...             ...                         ...
loan (대출)      library (도서관)             87
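Step 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: noun extraction is assumed to have happened already, the function name `build_tables` is hypothetical, and "n words excluding the m most frequent" is read here as "skip the top m, then keep the next n", which is only one possible reading.

```python
from collections import Counter
from itertools import combinations


def build_tables(docs, n, m):
    """Build document-level frequency and co-occurrence tables.

    docs: list of noun lists, one per document (noun extraction is assumed done).
    n:    number of words to keep (assumed reading: counted after skipping m).
    m:    number of top-frequency words to drop.
    """
    # A word counts once per document, so frequency == number of documents.
    freq = Counter()
    for nouns in docs:
        freq.update(set(nouns))

    # Rank by frequency (ties broken alphabetically), skip the top m, keep n.
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    vocab = set(ranked[m:m + n])

    # Count, for each pair of kept words, the documents containing both.
    cooc = Counter()
    for nouns in docs:
        present = sorted(set(nouns) & vocab)
        for a, b in combinations(present, 2):
            cooc[(a, b)] += 1

    word_freq = {w: freq[w] for w in vocab}
    return word_freq, dict(cooc)


# Toy documents (hypothetical data, not taken from the patent's corpus).
docs = [
    ["loan", "same-day loan", "nationwide"],
    ["loan", "library", "nationwide"],
    ["loan", "same-day loan", "nationwide", "loan"],
    ["library", "nationwide"],
]
word_freq, cooc = build_tables(docs, n=3, m=1)
```

In this toy run "nationwide" is the single most frequent word and is dropped, which mirrors the motivation given for excluding the top m words.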

Step 2: Computing and normalizing mutual information

In step 2, the mutual information between words is computed and normalized using the word frequency table and the co-occurrence frequency table created in step 1. The normalized mutual information (SMI) is used as the degree of association between two words and is preferably stored in a table of the form shown in [Table 3]. [Fig. 1] illustrates the process of steps 1 and 2.

[Table 3] Example normalized mutual information table

Word 1          Word 2                      SMI
loan (대출)      same-day loan (당일대출)      3.1583
loan (대출)      library (도서관)             2.4582
...             ...                         ...
loan (대출)      nationwide (전국)            1.3350
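The text does not spell out the mutual information formula or the normalization that yields SMI. As a rough illustration only, the sketch below computes plain pointwise mutual information from document counts. The function name `pointwise_mi` and the corpus size `NUM_DOCS` are assumptions, and the result is not expected to reproduce the SMI values in [Table 3].

```python
import math


def pointwise_mi(f_a, f_b, f_ab, num_docs):
    """Pointwise mutual information (base 2) from document counts.

    f_a, f_b: number of documents containing each word.
    f_ab:     number of documents containing both words.
    """
    if f_ab == 0:
        return float("-inf")  # the words never co-occur
    p_a = f_a / num_docs
    p_b = f_b / num_docs
    p_ab = f_ab / num_docs
    return math.log2(p_ab / (p_a * p_b))


# Counts for ("loan", "same-day loan") as listed in Tables 1 and 2.
# NUM_DOCS is a hypothetical corpus size; the text does not state it.
NUM_DOCS = 100_000
score = pointwise_mi(1395, 210, 109, NUM_DOCS)
```

A positive score means the two words co-occur more often than independence would predict. Any standardization step would be applied on top of this value.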

Step 3: Generating the word relation graph

In step 3, an arbitrary word w in the normalized mutual information table produced in the previous step is taken as the reference word, and a related-word table is built from the words whose association with w is at or above the threshold. The connections between words within the related-word table that are at or above the threshold are also extracted to form a connection information table. Preferably, n related-word tables and n connection information tables are generated, one pair for each of the n words taken as the reference word. [Table 4] is an example of the related-word table (left) and the connection information table (right) for the word "허브" (which can mean either "hub" or "herb") with threshold θ = 2.1.

The related-word table and the connection information table are represented as a word relation graph. The reference word and the related words are its vertices, the connection information gives its edges, and the normalized mutual information is the edge weight. Preferably, words that frequently co-occurred in the same documents are linked by strong connections, and words that appeared in documents of similar meaning are grouped together. [Fig. 2] is the word relation graph G_2.1("허브") of the word "허브" constructed from [Fig. 3].
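The construction in step 3 can be sketched as follows. This is an illustrative reading, not the patent's code: the function name `build_word_relation_graph`, the adjacency-dict representation, and all SMI values are hypothetical. The sketch also assumes that the edges from the reference word to its related words are part of the graph.

```python
def build_word_relation_graph(smi, ref_word, theta):
    """Return {vertex: {neighbor: weight}} for one reference word.

    smi: dict mapping frozenset({w1, w2}) -> normalized mutual information.
    """
    # Related-word table: words tied to the reference word at or above theta.
    related = {}
    for pair, score in smi.items():
        if ref_word in pair and score >= theta:
            (other,) = pair - {ref_word}
            related[other] = score

    # Assumption: reference-to-related edges belong to the graph.
    graph = {ref_word: dict(related)}
    for word, score in related.items():
        graph[word] = {ref_word: score}

    # Connection-information table: links among the related words themselves.
    related_words = set(related)
    for pair, score in smi.items():
        if score >= theta and pair <= related_words:
            a, b = pair
            graph[a][b] = score
            graph[b][a] = score
    return graph


# Hypothetical SMI values; only the threshold 2.1 comes from the text.
smi = {
    frozenset({"허브", "aroma"}): 2.8,
    frozenset({"허브", "network"}): 2.5,
    frozenset({"허브", "router"}): 2.3,
    frozenset({"network", "router"}): 3.0,
    frozenset({"aroma", "network"}): 0.4,
    frozenset({"허브", "airport"}): 1.2,
}
graph = build_word_relation_graph(smi, "허브", theta=2.1)
```

Pairs below the threshold, such as the hypothetical ("허브", "airport") entry, contribute neither a vertex nor an edge.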

Step 4: Graph partitioning

In step 4, the word relation graph G(w) that satisfies the threshold θ for a word w is preferably partitioned into several sub-clusters under conditions that differ from ordinary graph partitioning. Because words are ambiguous, clusters of related words built on the degree of association can share related words. Moreover, even if two clusters share all of their words except one, the distinguishing word gives the two clusters different meanings. For example, if the word "Canon" has the related-word clusters {"Samsung", "Sony", "digital camera"} and {"Samsung", "Sony", "printer"}, the two clusters share "Samsung" and "Sony" yet take on entirely different meanings because of "digital camera" and "printer".

The present invention claims HEM_WRG (heavy edge matching for word relation graph), a modified algorithm that uses the basic idea of the HEM algorithm while taking the characteristics of the word relation graph into account.

The HEM_WRG algorithm consists of two main phases. The first phase converts the initial graph of single vertices into a graph of weighted multi-vertices. A multi-vertex is a vertex rebuilt from an arbitrary vertex v together with all vertices connected to v as its members. From each vertex of the initial graph, a breadth-first search of depth 1 follows the edges to find the connected vertices, and the vertices and edges found are folded into one multi-vertex. The weight of the multi-vertex is the sum of the weights of the edges it contains. In [Fig. 3], six multi-vertices are generated from the six vertices of the initial graph, and an underlined vertex ('_') is the one from which that multi-vertex was generated. Any multi-vertex whose number of member vertices does not satisfy MIN_SIZE is removed from the converted graph. In the example of [Fig. 3], every multi-vertex satisfies MIN_SIZE, so none is removed.

The second phase visits the multi-vertices in order of decreasing weight and, for each member of a multi-vertex, matches that member to the other multi-vertices that contain the same member. A matching here is the set of multi-vertices that contain the same member. If any member cannot be matched to another multi-vertex, the next-heaviest multi-vertex is selected and the same process is carried out. Multi-vertices whose members have all been matched are removed, and the multi-vertices that remain are the result of the matching algorithm. Each remaining multi-vertex becomes a subgraph of the whole graph, and the edge cut between subgraphs is minimized. Member overlap is also kept to a minimum while the topology of the initial graph is preserved. In [Fig. 3], of the six multi-vertices generated in the first phase, only the three generated around vertices b, e, and f remain.

[Fig. 4] shows the HEM_WRG algorithm running on a word relation graph extracted from real data. The algorithm generates 15 multi-vertices from the initial word relation graph. 'Subgraph 1', the heaviest multi-vertex, is selected first. All of its members are matched to 'subgraph 8' and 'subgraph 11', so it is deleted. Likewise, 'subgraphs 2 to 7' are deleted because all of their members are matched to 'subgraph 8', 'subgraphs 9 and 10' are matched to 'subgraph 11', 'subgraph 12' to 'subgraph 13', and 'subgraph 14' to 'subgraph 15'. When the algorithm finishes, only 'subgraph 8', 'subgraph 11', 'subgraph 13', and 'subgraph 15', which contain unmatched members, remain. The word "aroma" (아로마) is placed in both 'subgraph 8' and 'subgraph 11'. The words "network" (네트워크) and "router" (공유기) are likewise placed in both 'subgraph 13' and 'subgraph 15'.

The complete HEM_WRG algorithm is described in [Fig. 5] as two procedures. The first procedure, HEM_WRG_COARSENING, takes the word relation graph G(w) and MIN_SIZE, the minimum number of members of a multi-vertex, as input and outputs MV_all, the set of multi-vertices. The second procedure, HEM_WRG_MATCHING, takes MV_all, the output of the first procedure, as input, performs the matching, and outputs MV_unmatched, the set of remaining multi-vertices.
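Since [Fig. 5] is not reproduced in this text, the two procedures can only be sketched from the prose. The Python below is one possible reading, not the claimed algorithm. It assumes that the weight of a multi-vertex is the sum of the edges incident to its base vertex, and that a multi-vertex is dropped when every member also occurs in some other multi-vertex that is still remaining. The example graph, its weights, and the word "tea" are hypothetical.

```python
def hem_wrg_coarsening(graph, min_size):
    """Fold each vertex and its direct neighbors into one multi-vertex."""
    mv_all = []
    for base, neighbors in graph.items():
        members = frozenset({base, *neighbors})
        if len(members) < min_size:
            continue  # too few members: dropped from the converted graph
        # Assumption: weight = sum of the edges incident to the base vertex.
        weight = sum(neighbors.values())
        mv_all.append({"base": base, "members": members, "weight": weight})
    return mv_all


def hem_wrg_matching(mv_all):
    """Remove multi-vertices whose members all occur in other remaining ones."""
    remaining = sorted(mv_all, key=lambda mv: mv["weight"], reverse=True)
    for mv in list(remaining):  # heaviest first
        others = [o for o in remaining if o is not mv]
        fully_matched = all(
            any(member in o["members"] for o in others)
            for member in mv["members"]
        )
        if fully_matched:
            remaining = others
    return remaining


# Hypothetical word relation graph with two senses around "허브".
edges = [
    ("허브", "aroma", 2.8), ("허브", "tea", 2.4),
    ("허브", "network", 2.5), ("허브", "router", 2.2),
    ("aroma", "tea", 3.0), ("network", "router", 3.1),
]
graph = {}
for a, b, w in edges:
    graph.setdefault(a, {})[b] = w
    graph.setdefault(b, {})[a] = w

mv_unmatched = hem_wrg_matching(hem_wrg_coarsening(graph, min_size=3))
senses = sorted(sorted(mv["members"]) for mv in mv_unmatched)
```

In this toy graph the heaviest multi-vertex, built around "허브", is fully covered by the others and is removed first, and two overlapping sub-clusters that both contain "허브" are left.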

Step 5: Clustering

The present invention uses the existing CHAMELEON algorithm to cluster the word relation graph. The small subgraphs produced in step 4 are the smallest units that describe the meaning of the reference word of the word relation graph. The clustering step treats these subgraphs as minimal sub-clusters and merges them repeatedly. Two sub-clusters are merged only if both their relative interconnectivity and their relative closeness are high.

To merge two sub-clusters, the CHAMELEON algorithm computes their relative interconnectivity RI and relative closeness RC and uses the sum of the two measures as the similarity of the clusters. Whether two clusters are merged is decided by comparing this similarity with the merge threshold T_sim.

In [Fig. 6], graph partitioning has produced four sub-clusters, which are then merged according to their computed similarity. With the merge threshold T_sim set to 0.7, sub-clusters 3 and 4 are merged, while sub-clusters 1 and 2 are not merged because their similarity is below T_sim. That is, from the word relation graph of "허브" ("hub" or "herb"), which consists only of related words with normalized mutual information of 2.1 or more, three clusters are formed: sub-cluster 1, sub-cluster 2, and the merged cluster of sub-clusters 3 and 4. Each of these can serve as a sense cluster of "허브".
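The merge decision can be illustrated with a simplified sketch. It takes RI and RC for each pair as given, whereas CHAMELEON itself derives them from edge cuts and re-evaluates them as clusters grow, and it makes a single pass only. The function name `merge_subclusters` and all RI/RC values are hypothetical; only T_sim = 0.7 and the outcome for sub-clusters 1 to 4 follow the description of [Fig. 6].

```python
def merge_subclusters(labels, scores, t_sim):
    """Single-pass merge: join every pair whose RI + RC reaches t_sim.

    scores: dict mapping (a, b) -> (RI, RC) for pairs of sub-cluster labels.
    """
    parent = {c: c for c in labels}

    def find(c):
        # Follow parent links to the representative of c's merged group.
        while parent[c] != c:
            c = parent[c]
        return c

    for (a, b), (ri, rc) in scores.items():
        if ri + rc >= t_sim:  # similarity is the sum of the two measures
            parent[find(a)] = find(b)

    groups = {}
    for c in labels:
        groups.setdefault(find(c), []).append(c)
    return sorted(groups.values())


# Hypothetical (RI, RC) values for the four sub-clusters.
labels = [1, 2, 3, 4]
scores = {
    (1, 2): (0.2, 0.3),
    (3, 4): (0.5, 0.4),
    (1, 3): (0.1, 0.1),
    (2, 4): (0.1, 0.2),
}
clusters = merge_subclusters(labels, scores, t_sim=0.7)
```

With these values the result is three clusters, matching the outcome described for [Fig. 6].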

Step 6: Selecting the optimal clustering result using the cluster evaluation index

In step 6, the clustering result is evaluated on the basis of the S_Dbw index (scatter, density between clusters), an existing relative cluster evaluation method. To apply it to the word relation graph, the invention defines and claims S_Dbw_WRG (scatter, density between clusters for word relation graph), a relative cluster evaluation index for word relation graphs. This section describes how the defined index is used to adjust the association threshold, that is, the normalized mutual information (SMI) value by which the set of related words is selected.

Several modifications are needed before the relative cluster evaluation index S_Dbw can be used on graph-based data. Because graph-based data consists of vertices and edges, a center point cannot be computed. In addition, the words that correspond to vertices are described only by their direct connections to other words, expressed as edges, so they cannot be placed in a geometric space, and distance measures such as Euclidean or Manhattan distance are difficult to apply. The formula for computing density in [Equation 1] must therefore be modified.

The density of a graph is generally computed as the ratio of the number of edges that exist between its vertices to the number of edges in the complete graph on the same vertices. For weighted edges, it can be computed as the ratio of the sum of the edge weights to the number of edges in the complete graph. The density of an undirected graph G whose vertex set has size |V| = n and whose edge set has size |E| is given by [Equation 1], where w_k denotes the weight of the k-th edge.
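A minimal sketch of this weighted density, reading the description as density(G) = (w_1 + ... + w_|E|) / (n(n-1)/2). The function name `graph_density`, the example weights, and the choice to return 0.0 for graphs with fewer than two vertices are assumptions made for illustration.

```python
def graph_density(num_vertices, edge_weights):
    """Sum of edge weights divided by the edge count of the complete graph."""
    if num_vertices < 2:
        return 0.0  # assumption: density is undefined here, so report 0.0
    complete_edges = num_vertices * (num_vertices - 1) / 2
    return sum(edge_weights) / complete_edges


# Unweighted case: 3 of the 6 possible edges on 4 vertices, each with weight 1.
print(graph_density(4, [1, 1, 1]))        # 0.5
# Weighted case: the same 3 edges with hypothetical weight 2.0 each.
print(graph_density(4, [2.0, 2.0, 2.0]))  # 1.0
```

Because weights are summed rather than counted, a weighted graph can have a density above 1.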

To use Dens_bw, the inter-cluster density of the prior-art S_Dbw index in [Equation 2], on a word relation graph, the present invention defines Dens_bw_WRG by redefining the density at the midpoint u_ij between the center points of two clusters. Let the graphs of clusters C_i and C_j be G_i = (V_i, E_i) and G_j = (V_j, E_j). In the graph G_ij = (V_ij, E_ij), the edge set is defined as E_ij = { e_k = (v_k1, v_k2) | (v_k1 ∈ V_i and v_k2 ∈ V_j) or (v_k1 ∈ V_j and v_k2 ∈ V_i) }, and V_ij is defined as the set of vertices incident to the edges in E_ij. The edge set E_ij defined in this way is a subset of the edge set E of the whole graph G and satisfies |E_ij ∩ E_i| = 0 and |E_ij ∩ E_j| = 0 with respect to the edge sets E_i and E_j of G_i and G_j. This condition is needed because the generated clusters may share vertices. The vertex set V_ij and the edge set E_ij so defined determine the graph G_ij, which consists of the edges, together with their endpoints, that connect G_i and G_j. Using the density of G_ij in place of the density at the midpoint u_ij between the center points of the two clusters, Dens_bw_WRG is defined as in [Equation 3], where c is the number of generated clusters. [Figure 7](a) illustrates how Dens_bw_WRG is computed for a word relation graph G. The whole graph G is divided into three clusters, and clusters C_1 and C_2 both contain the words c and d. In this case, three edges connect the subgraphs G_1 and G_2 of these clusters, and their weights sum to 6. With four connected vertices, the density of G_1,2 is therefore 1.0. In the same way, the density of G_1,3 is 2.0 and the density of G_2,3 is 0.66.
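The construction of G_ij and its density can be sketched as below. This is an illustration under stated assumptions: E_i and E_j are taken to be the edges with both endpoints inside the respective cluster, the graph has no self-loops, and the toy graph is hypothetical (it is not the graph of [Figure 7], which is not reproduced here). It only computes the density of one pair; how [Equation 3] combines the pairwise densities is not shown.

```python
def cross_graph_density(edges, cluster_i, cluster_j):
    """Density of G_ij, built from the edges that run between two clusters.

    edges: dict mapping frozenset({u, v}) -> weight for the whole graph G.
    cluster_i, cluster_j: sets of vertices; they may overlap.
    """
    e_ij = {}
    for edge, weight in edges.items():
        u, v = tuple(edge)
        connects = (u in cluster_i and v in cluster_j) or \
                   (u in cluster_j and v in cluster_i)
        # Exclude edges lying entirely inside one cluster, so that
        # E_ij shares no edge with E_i or E_j even when clusters overlap.
        inside = edge <= cluster_i or edge <= cluster_j
        if connects and not inside:
            e_ij[edge] = weight

    v_ij = set().union(*e_ij) if e_ij else set()
    n = len(v_ij)
    if n < 2:
        return 0.0
    return sum(e_ij.values()) / (n * (n - 1) / 2)


# Hypothetical graph: clusters share the vertices "c" and "d".
c1 = {"a", "b", "c", "d"}
c2 = {"c", "d", "e", "f"}
edges = {
    frozenset({"a", "c"}): 2.0,  # inside c1, excluded
    frozenset({"c", "d"}): 2.0,  # inside both clusters, excluded
    frozenset({"d", "e"}): 2.0,  # inside c2, excluded
    frozenset({"a", "e"}): 2.0,  # cross edge
    frozenset({"b", "e"}): 2.0,  # cross edge
    frozenset({"b", "f"}): 2.0,  # cross edge
}
density_12 = cross_graph_density(edges, c1, c2)
print(density_12)  # 1.0
```

In this toy input, three cross edges with total weight 6 touch four vertices, so the density is 6 / 6 = 1.0, the same arithmetic as in the G_1,2 example above.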

Scat, the intra-cluster variance of the S_Dbw index, measures how compactly the clusters are grouped, so it can be redefined in terms of the graph density function; the result is denoted Scat_WRG. Because density and variance are opposite notions, however, Scat_WRG is defined as in [Equation 4], with the average density of the subgraphs as the denominator and the density of the whole graph G as the numerator. [Figure 7](b) illustrates how Scat_WRG is computed for the word relation graph G. The densities of the graphs G_1, G_2, and G_3 of the three clusters C_1, C_2, and C_3 are 1.1, 0.7, and 0.8, respectively, and the density of the whole graph G is 0.5, so Scat_WRG of the word relation graph G is computed as 0.625.
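A minimal sketch of this ratio, reading the description as Scat_WRG = density(G) / mean(density(G_i)). The exact form of [Equation 4] is not reproduced in the text, so this reading is an assumption, and the function name `scat_wrg` and the input values are hypothetical.

```python
def scat_wrg(total_density, cluster_densities):
    """Density of the whole graph divided by the mean sub-graph density."""
    mean_density = sum(cluster_densities) / len(cluster_densities)
    return total_density / mean_density


# Hypothetical densities: three clusters averaging 1.0, whole graph at 0.4.
value = scat_wrg(0.4, [1.0, 0.5, 1.5])
print(value)  # 0.4
```

Under this reading, a smaller value means the clusters are much denser than the graph as a whole, which corresponds to compact clusters.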

[Table 5] and [Figure 8] show the cluster evaluation index S_Dbw_WRG of the word relation graph G_1.6("허브") as a function of the merge threshold T_sim. The merge threshold is a setting that affects the number of clusters; cluster evaluation was performed while adjusting it in steps of 0.1 over the interval 0.6 to 2.4. The word relation graph G_1.6("허브") had |V| = 38, |E| = 98, and density(G) = 0.3298, and the evaluation showed that the best clustering was obtained for T_sim in the interval 0.4 to 0.7. [Table 6] shows the clustering result for MIN_SIZE = 3, α = 1.5, and T_sim = 1.3. In the result, words emphasized with an underscore ('_') are related words that also appear in other clusters.

[Table 5]

| Reference word | Merge threshold T_sim | Number of clusters | Dens_bw_WRG | Scat_WRG | S_Dbw_WRG |
|---|---|---|---|---|---|
| 허브 | 0.1 | 3 | 0.0000 | 0.2539 | 0.2539 |
| | 0.2 | 4 | 0.0580 | 0.2444 | 0.3024 |
| | 0.3 | 4 | 0.0580 | 0.2444 | 0.3024 |
| | 0.4 | 5 | 0.0621 | 0.1889 | 0.2510 |
| | 0.5 | 5 | 0.0621 | 0.1889 | 0.2510 |
| | 0.6 | 5 | 0.0621 | 0.1889 | 0.2510 |
| | 0.7 | 5 | 0.0621 | 0.1889 | 0.2510 |
| | 0.8 | 6 | 0.1076 | 0.1763 | 0.2839 |
| | 0.9 | 6 | 0.1076 | 0.1763 | 0.2839 |
| | 1.0 | 6 | 0.1076 | 0.1763 | 0.2839 |
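The selection of the merge threshold from these evaluation results can be sketched as follows. The rows are copied from the tabulated values; the sketch assumes that S_Dbw_WRG is the sum of Dens_bw_WRG and Scat_WRG (which matches every tabulated row) and that a lower index indicates a better clustering, as with the conventional S_Dbw index. The variable names are illustrative.

```python
# (T_sim, number of clusters, Dens_bw_WRG, Scat_WRG) for the word "허브".
rows = [
    (0.1, 3, 0.0000, 0.2539), (0.2, 4, 0.0580, 0.2444),
    (0.3, 4, 0.0580, 0.2444), (0.4, 5, 0.0621, 0.1889),
    (0.5, 5, 0.0621, 0.1889), (0.6, 5, 0.0621, 0.1889),
    (0.7, 5, 0.0621, 0.1889), (0.8, 6, 0.1076, 0.1763),
    (0.9, 6, 0.1076, 0.1763), (1.0, 6, 0.1076, 0.1763),
]

# Assumption: the combined index is the sum of its two components.
scores = {t: round(dens_bw + scat, 4) for t, _, dens_bw, scat in rows}

best_score = min(scores.values())
best_thresholds = [t for t, s in scores.items() if s == best_score]
print(best_score, best_thresholds)  # 0.251 [0.4, 0.5, 0.6, 0.7]
```

The minimum falls on the thresholds 0.4 through 0.7, which agrees with the interval reported as giving the best clustering.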

[Table 6]

| Reference word | Cluster number | Related words |
|---|---|---|
| 허브 | 1 | IP sharer (home router), LAN card, router, network, cable, business client, wireless LAN |
| | 2 | connection, seller, buyer, industry sector, anti |
| | 3 | _aroma_, cosmetics, oil, natural, body care, perfume, scented candle, herbal tea, _aromatherapy_, bath products |
| | 4 | type, recipe, etymology, salad, beauty products, fragrance therapy, usage, farm, _aromatherapy_ |
| | 5 | mall, nursing-care supplies, medical supplies, home medical devices, health supplements, _aroma_ |

Step 7: Selecting the Central Words of Each Sense Cluster

A cluster produced by related-word clustering contains many words. When a cluster is large, it can be hard to tell which sense brought its related words together, so a label describing the cluster is needed. A label should be representative of the data in the cluster and intuitively recognizable; for a related-word cluster, it is best composed of the words that most strongly express the meaning of the cluster. The present invention selects the words with high centrality within the graph of the cluster as its label. Centrality is computed by considering both the number of connected edges and the sum of their weights, and the three words (vertices) with the highest centrality are defined as the central words.

Centrality = (number of connected edges) × (sum of the weights of the connected edges)
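The centrality formula above can be sketched as follows: for each word in a cluster, multiply the number of incident edges by the sum of their weights, then keep the top three. The function name `central_words`, the alphabetical tie-break, and the edge weights are assumptions for illustration; the weights are not taken from the source data.

```python
def central_words(edges, top_k=3):
    """Rank words in a cluster by (edge count) * (sum of edge weights)."""
    degree, weight_sum = {}, {}
    for (u, v), w in edges.items():
        for node in (u, v):
            degree[node] = degree.get(node, 0) + 1
            weight_sum[node] = weight_sum.get(node, 0.0) + w
    centrality = {n: degree[n] * weight_sum[n] for n in degree}
    # Highest centrality first; ties broken alphabetically (an assumption).
    ranked = sorted(centrality, key=lambda n: (-centrality[n], n))
    return ranked[:top_k], centrality


# Hypothetical cluster graph with invented edge weights.
edges = {
    ("tuning", "suspension"): 3.0,
    ("tuning", "exhaust"): 2.5,
    ("tuning", "air dam"): 2.2,
    ("suspension", "exhaust"): 2.4,
}
top, scores = central_words(edges)
print(top)  # ['tuning', 'suspension', 'exhaust']
```

Multiplying the two factors favors words that are both widely connected and strongly connected, so a word with one very heavy edge does not outrank a well-connected hub of the cluster.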

[Table 7] shows the clustering result for the word relation graph G_2.1("자동차"), where "자동차" means "car", with MIN_SIZE = 4, α = 1.5, and T_sim = 0.3. The words in bold are the three words with the highest centrality in each cluster.

[Table 7]

| Reference word | Cluster number | Related words |
|---|---|---|
| 자동차 (car) | 1 | **helicopter**, **airplane**, **plastic model**, boat, RC, scale model |
| | 2 | **used cars**, **insurance**, **new cars**, imported cars, maintenance, direct trade, buying guide, Formula, Kia Motors, Matiz, luxury cars, Daewoo Motors, etc. (29 words) |
| | 3 | **expressway**, **rest area**, **road guidance**, amenities, rest space |
| | 4 | **long-term rental**, **rate table**, **car model**, expressway |
| | 5 | **tuning**, **suspension**, **exhaust**, air dam |
| | 6 | **parts**, **joint**, **hose**, rubber parts |

Based on the results of the above process, when a query in information retrieval is too broad, the central words selected for each related topic can be used to query the search results again.

[Figure 1] is a diagram of steps 1 and 2 of the present invention, the process of computing the normalized mutual information.

[Figure 2] is the word relation graph G_2.1("허브") of the word "허브" (θ = 2.1).

[Figure 3] shows the execution of the HEM_WRG algorithm (MIN_SIZE = 3) using the word relation graph of [Figure 2].

[Figure 4] shows the HEM_WRG algorithm being run on the word relation graph G_2.1("허브") extracted from real data.

[Figure 5] describes the HEM_WRG algorithm. The first procedure, HEM_WRG_COARSENING, takes as input the word relation graph G(w) and MIN_SIZE, the minimum number of members of a multi-vertex, and outputs MV_all, the set of multi-vertices. The second procedure, HEM_WRG_MATCHING, takes MV_all, the output of the first procedure, as input, performs the matching process, and outputs MV_unmatched, the set of remaining multi-vertices.

[Figure 6] is an example of the clustering process for the word "허브" (T_sim = 0.7, α = 1.5): graph partitioning produces four sub-clusters, which are then merged according to their computed similarity.

[Figure 7](a) shows how Dens_bw_WRG is computed for the word relation graph G, and [Figure 7](b) shows how Scat_WRG is computed for the word relation graph G.

[Figure 8] shows the cluster evaluation index as a function of the merge threshold for the word relation graph G_1.6("허브").

Claims

The HEM_WRG algorithm, which is based on the HEM algorithm and modified to reflect the ambiguous nature of words, is claimed.

As a first process, the HEM_WRG algorithm comprises: a module that converts an initial graph of single vertices into a graph of weighted multi-vertices by traversing the edges from each vertex of the initial graph with a breadth-first search of depth 1 to find the connected vertices, and by folding the vertices and edges found into one multi-vertex, the weight of a multi-vertex being the sum of the weights of the edges it contains; and a module that removes from the converted graph any multi-vertex, among those produced by this search, whose number of member vertices does not satisfy MIN_SIZE;

and, as a second process, visiting the multi-vertices in descending order of weight and, for each member of a multi-vertex, matching that member to another multi-vertex that contains the same member, and, if any of the members is not matched to another multi-vertex, selecting the multi-vertex with the next-largest weight and performing the same process; the method thereby automatically finding the sense groups of a word using a word relation graph.

The method of claim 1,

comprising: a module in which the density of a graph is computed as the ratio of the number of edges existing between the vertices of the graph to the number of edges of the complete graph; and a module that, for weighted edges, computes the density as the ratio of the sum of the edge weights to the number of edges of the complete graph; the method automatically finding the sense groups of a word using a word relation graph.

The method of claim 2,

A cluster produced by related-word clustering contains many words. When a cluster is large, it can be hard to tell which sense brought its related words together, so a label describing the cluster is needed. A label should be representative of the data in the cluster and intuitively recognizable; for a related-word cluster, it is best composed of the words that most strongly express the meaning of the cluster. The present invention selects the words with high centrality within the graph of the cluster as its label. Centrality is computed by considering both the number of connected edges and the sum of their weights, and the three words (vertices) with the highest centrality are defined as the central words.