KR20100068532A

KR20100068532A - Apparatus and method for keyword extraction and associative word network configuration of document data

Info

Publication number: KR20100068532A
Application number: KR1020080126926A
Authority: KR
Inventors: 임수종; 김현기; 이창기; 최미란; 윤여찬; 황이규; 이충희; 오효정; 허정; 장명길
Original assignee: 한국전자통신연구원
Priority date: 2008-12-15
Filing date: 2008-12-15
Publication date: 2010-06-24
Also published as: KR101060594B1

Abstract

PURPOSE: An apparatus and a method for keyword extraction and associative word network configuration of document data are provided to automatically extract issue keywords from a set of blog documents and to construct an associative network among the extracted keywords, thereby presenting accurate keywords for each document. CONSTITUTION: An issue keyword extractor (104) parses the structure information of the documents in an input web document set and extracts issue keywords based on the analyzed morphemes. An associative word network configurator (106) extracts the relations between the extracted issue keywords. An indexing unit (108) indexes the extracted issue keywords and the configured associative word network. In response to a control command, a presentation unit (114) presents the issue keywords and the associative word network information.

Description

Apparatus and method for keyword extraction and associative word network configuration of document data {APPARATUS AND METHOD FOR KEYWORD EXTRACTION AND ASSOCIATIVE WORD NETWORK CONFIGURATION OF DOCUMENT DATA}

The present invention relates to a technique for extracting keywords from document data, and more particularly to an apparatus and method for keyword extraction and associative word network configuration of document data, suitable for automatically extracting keywords from large volumes of blog documents on the web using the blogs' tag information, constructing an associative word network among the extracted keywords using co-occurrence information, and providing the result to a user.

The present invention is derived from research conducted as part of the IT New Growth Engine Core Technology Development Project of the Ministry of Knowledge Economy and the Institute for Information Technology Advancement [Project management number: 2008-S-020-01, Project title: Development of Web QA Technology].

Recently, blogs, which are websites where users can freely post writing on the Internet according to their interests, have been used to produce various forms of content such as personal publishing, broadcasting, and communities.

Blogs combine the functions of a web bulletin board and a personal homepage, and have the advantage that users can easily create their own space even without knowledge of website production. In other words, a blog is a new kind of medium in which anyone with a blog page can post opinions or stories using text or graphics, and can upload their own image and video data.

Conventionally, keywords extracted from the document data among a blog's various content have been prioritized simply by their frequency of occurrence, which has the limitation of ignoring the importance of the different tags in which a keyword appears. A blog document has a <post tag> that represents the document, a <post title tag> that holds the document's title, and a <post body> tag that holds the document's body, and these tags are generally recognized as differing in importance. Because the importance of the same keyword differs depending on the tag in which it occurs, a keyword that merely appears often cannot automatically be given high priority if it appears in an unimportant tag or in a tag that is generally long. This needs to be taken into account, but the prior art does not do so.

In addition, to identify associations between words, candidate noun-case particle-predicate co-occurrence information is typically extracted from a general document set, and co-occurrence information is then derived using mathematical methods. This, however, yields only association information between a noun and a predicate, such as '배-가-비싸다' (roughly, 'pear'-subject particle-'is expensive'). Associations between nouns are captured by analyzing semantic relations and building a lexical concept network manually or semi-automatically, but because a lexical concept network mainly covers hypernym-hyponym relations and semantic associations between words, it is difficult for it to hold information about proper nouns, which make up a large share of the words in blog documents.

As described above, in the prior art for providing keyword information from document data on websites, keywords are extracted using their frequency of occurrence within a document, and associations between words are represented through analysis of noun-case particle-predicate patterns and analysis between nouns. However, extracting keywords by frequency alone does not take into account the importance of the various tags, and because word associations cover only semantic associations between words, information about proper nouns cannot be captured.

Accordingly, the present invention provides an apparatus and method for keyword extraction and associative word network configuration of document data that can automatically extract the keywords that have become issues within a blog document set, using the tags and co-occurrence information of blog documents on websites, and can construct an associative network among the extracted keywords for presentation to a user.

The present invention also provides an apparatus and method for keyword extraction and associative word network configuration of document data that can extract keywords from a blog document set on websites, rank the keywords based on frequency, blog tag information, and the like, and identify the associations among these keywords to construct an associative network.

The present invention further provides an apparatus and method for keyword extraction and associative word network configuration of document data that can extract keywords from a blog document's title tag, the post tags entered by the user, and the document body, calculate a score for each keyword, extract high-scoring keywords as the issue keywords of the blog documents, and construct an associative network among the keywords using the co-occurrence information of the extracted keywords.

An apparatus according to one embodiment of the present invention includes: an issue keyword extraction unit that parses the structure information of documents in an input web document set, performs morphological analysis on the information identified for each tag, and extracts issue keywords based on the analyzed morphemes; an associative word network configuration unit that extracts associations between the extracted issue keywords; an index unit that indexes and stores the issue keywords extracted by the issue keyword extraction unit and the associative word network constructed by the associative word network configuration unit; and a presentation unit that, in response to an input control command, refers to the index unit and presents the issue keywords and associative word network information.

A method according to one embodiment of the present invention includes: parsing the structure information of documents in an input web document set and performing morphological analysis on the information identified for each tag; extracting issue keywords based on the analyzed morphemes; extracting associations between the extracted issue keywords; indexing and storing the issue keywords and the extracted keyword association information in an index unit; and, when a control command is input, referring to the index unit and presenting the issue keywords and association information.

The effects obtained by representative aspects of the disclosed invention are briefly described as follows.

The present invention automatically extracts issue keywords from a blog document set and constructs an associative network among the extracted keywords, so that accurate keywords can be presented for each document. For a blog document set collected over a specific period, a user can readily grasp the content of the set without browsing every document, because the ranking of frequently used issue keywords and their associated keywords is made visually accessible in network form.

Hereinafter, the operating principle of the present invention is described in detail with reference to the accompanying drawings. In describing the invention below, detailed descriptions of well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or practice of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

The present invention uses the tags and co-occurrence information of blog documents on websites to automatically extract the keywords that have become issues within a blog document set, rank the extracted keywords, and identify the associations among them to construct an associative network. Specifically, keywords are extracted from a blog document's title tag, the post tags entered by the user, and the document body; a score is calculated for each keyword; high-scoring keywords are extracted as the issue keywords of the blog documents; and an associative network among the keywords is constructed using the co-occurrence information of the extracted keywords.

In this way, the present invention automatically extracts issue keywords from a blog document set and constructs an associative network among them, so that a user can grasp the content of the blog document set at a glance from the prioritized keywords and the network among them. The extracted keywords alone may resemble the popular search terms or hot tags offered by conventional websites such as portals, but because the associative network among the keywords is also presented to the user, the relationships among the keywords currently at issue can also be grasped at a glance.

In addition, when the associative network is applied to a keyword advertising system, if a user enters a particular keyword and a keyword associated with it in the associative network is an advertising keyword, the associated advertising keyword can be exposed even though the user did not query the advertising keyword itself.
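The advertising use case can be illustrated with a minimal Python sketch. It assumes the associative network is available as a mapping from a keyword to its scored neighbours; the function name `ads_for_query`, the data layout, and the sample values are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Set, Tuple

# Assumed layout: keyword -> list of (associated keyword, association score).
Network = Dict[str, List[Tuple[str, float]]]


def ads_for_query(query: str, network: Network, ad_keywords: Set[str]) -> List[str]:
    """Return advertising keywords associated with the queried keyword."""
    return [kw for kw, _score in network.get(query, []) if kw in ad_keywords]


# Toy data, invented for illustration only.
network = {"olympics": [("swimming", 3.0), ("final", 2.0)]}
ad_keywords = {"swimming", "goggles"}
matched = ads_for_query("olympics", network, ad_keywords)
```

Here the user queried "olympics", which is not itself an advertising keyword, yet the associated advertising keyword "swimming" is surfaced.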

FIG. 1 is a block diagram illustrating the structure of an apparatus for keyword extraction and associative word network configuration according to a preferred embodiment of the present invention.

Referring to FIG. 1, the apparatus constructs, from a blog document set, the issue keywords likely to interest users and an issue keyword associative network representing the relationships among those issue keywords, and presents them at a user's request. The apparatus includes an issue keyword extraction unit (104), an associative word network configuration unit (106), an issue keyword and associative word network index unit (108), an issue keyword and associative word network presentation unit (110), a user request input unit (112), and a result presentation unit (114).

So that a user can identify the keywords mainly used in the collected large-volume blog document set (102), and how those keywords relate to one another, without reading every document, the issue keyword extraction unit (104) first extracts issue keywords from the blog document set (102), and the associative word network configuration unit (106) organizes the associations among the issue keywords extracted by the issue keyword extraction unit (104) in network form.

The information on issue keywords and associated words extracted by the issue keyword extraction unit (104) and the associative word network configuration unit (106) is stored as data in the issue keyword and associative word network index unit (108) so that it can be presented to the user.

When a request arrives through the user request input unit (112), such as the user accessing the data stored in the issue keyword and associative word network index unit (108) for the first time, clicking a keyword, or querying a keyword, the issue keyword and associative word network presentation unit (110) analyzes the type of request, refers to the issue keyword and associative word network index data, and presents the result for the requested query to the user through the result presentation unit (114).

FIG. 2 is a flowchart illustrating a procedure for extracting and indexing issue keywords from a blog document set according to an embodiment of the present invention.

Referring to FIG. 2, in step 200 the issue keyword extraction unit (104) and the associative word network configuration unit (106) extract issue keywords for each document of the blog document set (102) and construct an associative word network among the keywords. In step 202, the issue keyword and associative word network information is stored in the issue keyword and associative word network index unit (108).

To describe this in more detail, FIG. 3 is a block diagram illustrating how an associative network among issue keywords is constructed from a blog document set according to an embodiment of the present invention.

Referring to FIG. 3, when the blog document set (102) to be processed is input to the issue keyword extraction unit (104), a parsing step (300) identifies the structure information of each document in order to determine the tag information of the web document, so that the document's structure is identified tag by tag. Morphological analysis (302) is then performed on the information identified for each tag; for Korean documents, morphological analysis is performed to identify the basic units of keywords. The sentence information resulting from the per-tag morphological analysis is used in parallel by the issue keyword extraction step (304) and the same-tag keyword co-occurrence extraction step (308).
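The parsing (300) and per-tag analysis (302) steps might be sketched in Python as follows. This is only an illustration under stated assumptions: the field names `post_tag`, `post_title`, and `post_body` are hypothetical markup, and a plain word tokenizer stands in for morphological analysis, which for Korean text would require a real morphological analyzer.

```python
import re
from typing import Dict, List

# Hypothetical field names standing in for the post tag, post title tag, and post body.
TAG_FIELDS = ("post_tag", "post_title", "post_body")


def parse_structure(raw: str) -> Dict[str, str]:
    """Step 300 (sketch): split a toy blog document into its tagged fields."""
    fields = {}
    for name in TAG_FIELDS:
        match = re.search(rf"<{name}>(.*?)</{name}>", raw, flags=re.DOTALL)
        fields[name] = match.group(1).strip() if match else ""
    return fields


def analyze_morphemes(text: str) -> List[str]:
    """Step 302 stand-in: lowercase word tokens, not true morphological analysis."""
    return re.findall(r"\w+", text.lower())


def analyze_document(raw: str) -> Dict[str, List[str]]:
    """Run the stand-in analysis separately on each tagged field."""
    return {tag: analyze_morphemes(text) for tag, text in parse_structure(raw).items()}


doc = (
    "<post_tag>olympics, swimming</post_tag>"
    "<post_title>Olympics swimming final</post_title>"
    "<post_body>The swimming final drew a record crowd.</post_body>"
)
tokens_by_tag = analyze_document(doc)
```

Keeping the tokens grouped by tag is what allows the later steps to weight keywords by the tag they appear in and to count co-occurrence per tag.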

The parsing (300) of blog documents and the per-tag morphological analysis (302) described above may be implemented in the issue keyword extraction unit (104), or in a functional unit that integrates the issue keyword extraction unit (104) and the associative word network configuration unit (106). Alternatively, the keyword extraction and associative word network configuration apparatus may perform them and pass the result to the issue keyword extraction unit (104).

From the sentence information resulting from the per-tag morphological analysis, the issue keyword extraction unit (104) extracts nominals such as nouns and compound nouns from each tag as issue keyword candidates (304), and maintains, together with the occurrence frequency, the tag in which each occurrence took place. The information extracted in this way,

<keyword_1, <frequency_11, tag_11>>, <keyword_1, <frequency_12, tag_12>>, ..., <keyword_1, <frequency_1n, tag_1n>>, ..., <keyword_m, <frequency_m1, tag_m1>>, ..., <keyword_m, <frequency_mn, tag_mn>>

serves as the basic data for the keyword ranking calculation step (306). Whereas existing methods refer to simple frequency alone without considering tag information, the embodiment of the present invention refers to frequency while adjusting a weight for each tag as in Equation 1 below, so that the tag in which a keyword appears is also taken into account when the priority ranking of issue keywords is calculated.

[Equation 1]

$$\mathrm{Score}_M(\mathit{keyword}) = \sum_{i=1}^{n} \alpha_i \, \mathrm{freq}_{tag_i}(\mathit{keyword}, \mathit{frequency}_i)$$

Here, M is the entire blog document set to be processed, i runs from 1 to n, the number of pieces of tag information in a blog document, and α_i is the weight for each tag.
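As a rough Python illustration of Equation 1, the sketch below adds a per-tag weight for every occurrence of a keyword, which is equivalent to multiplying each tag's frequency by its weight and summing over tags. The weight values, tag names, and sample documents are assumptions; the patent does not specify them.

```python
from collections import defaultdict
from typing import Dict, List

# Assumed per-tag weights (the alpha values); the patent gives no concrete numbers.
TAG_WEIGHTS = {"post_tag": 3.0, "post_title": 2.0, "post_body": 1.0}


def keyword_scores(docs: List[Dict[str, List[str]]],
                   weights: Dict[str, float] = TAG_WEIGHTS) -> Dict[str, float]:
    """Tag-weighted frequency score of every keyword over the whole document set M."""
    scores: Dict[str, float] = defaultdict(float)
    for doc in docs:
        for tag, keywords in doc.items():
            for kw in keywords:
                # Each occurrence contributes the weight of the tag it appears in.
                scores[kw] += weights.get(tag, 0.0)
    return dict(scores)


# Toy documents: tag name -> keyword candidates found in that tag.
docs = [
    {"post_tag": ["olympics"], "post_title": ["olympics", "final"],
     "post_body": ["final", "crowd", "final"]},
    {"post_tag": ["election"], "post_title": ["election"],
     "post_body": ["olympics", "crowd"]},
]
scores = keyword_scores(docs)
```

With these assumed weights, "election" (two occurrences, both in high-weight tags) outscores "final" (three occurrences, mostly in the body), which is the behaviour that plain frequency counting cannot produce.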

Based on the per-keyword scores calculated in this way, the top n keywords are selected as issue keywords and a ranking is calculated for the keywords (306). In addition to the issue keywords, keywords that exceed a preset score or ranking threshold are also passed to the issue keyword and associative word network index unit (108) for presentation to the user, and are ultimately stored in the issue keyword and associative word network index data (30).
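The ranking step (306) could look like the following Python sketch, assuming the scores are already available as a dictionary. The function name, the tie-breaking rule (alphabetical), and the use of a score threshold rather than a rank threshold are illustrative choices, not details from the patent.

```python
from typing import Dict, List, Tuple

Ranked = List[Tuple[str, float]]


def rank_keywords(scores: Dict[str, float], top_n: int,
                  threshold: float) -> Tuple[Ranked, Ranked]:
    """Return (issue keywords, further keywords whose score exceeds the threshold)."""
    # Sort by descending score; ties are broken alphabetically (an assumption).
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    issue = ranked[:top_n]
    extra = [(kw, s) for kw, s in ranked[top_n:] if s > threshold]
    return issue, extra


scores = {"olympics": 6.0, "election": 5.0, "final": 4.0, "crowd": 2.0}
issue, extra = rank_keywords(scores, top_n=2, threshold=3.0)
```

Both lists would then be handed to the index unit (108); "crowd" falls below the assumed threshold and is dropped.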

The information resulting from the per-tag morphological analysis (302) is also used by the associative word network configuration unit (106), which extracts keyword co-occurrence relations within the same tag (308) in order to extract associative word network information. Keywords that occur within the same tag as a given keyword are defined as being in a co-occurrence relation with it, and the relation is determined by frequency. That is, the co-occurrence information associated with keyword_i in tag_k is defined as follows.

$$\mathrm{Freq}_{tag_k}(\mathit{keyword}_i, \mathit{keyword}_j) = \mathit{frequency}_{ij} \text{ with which } \mathit{keyword}_i \text{ and } \mathit{keyword}_j \text{ occur together in } tag_k$$

This reflects the fact that keywords used together within a particular tag are associated within the blog document, even if they appear unrelated in terms of linguistic semantic relations.
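Same-tag co-occurrence extraction (308) can be sketched in Python as below. The sketch counts each unordered keyword pair once per tag per document; how repeated occurrences inside one tag should be counted is not stated in the patent, so that rule is an assumption, as are the tag names and sample data.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

SameTagKey = Tuple[str, str, str]  # (tag, keyword_a, keyword_b) with keyword_a < keyword_b


def same_tag_cooccurrence(docs: List[Dict[str, List[str]]]) -> Counter:
    """Count keyword pairs that appear together inside the same tag of a document."""
    freq: Counter = Counter()
    for doc in docs:
        for tag, keywords in doc.items():
            # De-duplicate and sort so every pair has one canonical key.
            for a, b in combinations(sorted(set(keywords)), 2):
                freq[(tag, a, b)] += 1
    return freq


docs = [
    {"post_title": ["olympics", "final"], "post_body": ["final", "crowd", "olympics"]},
    {"post_title": ["olympics", "final"], "post_body": ["crowd"]},
]
same_tag = same_tag_cooccurrence(docs)
```

No dictionary or semantic resource is consulted, so proper nouns that co-occur in titles or tags are captured just like any other keyword.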

In the next step, because information used in different tags of the same document is also partly associated, this association is defined as in Equation 2 below, and extraction of cross-tag keyword co-occurrence relations (310) is performed to reflect it.

[Equation 2]

$$\mathrm{Freq}_{tag_k, tag_l}(\mathit{keyword}_i, \mathit{keyword}_j) = \mathit{frequency}_{ij} \text{ with which } \mathit{keyword}_i \text{ and } \mathit{keyword}_j \text{ occur together in } tag_k \text{ and } tag_l$$
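Cross-tag co-occurrence (310) might be computed along the lines of this Python sketch, which pairs a keyword in one tag with a different keyword in another tag of the same document. It stores each unordered tag pair once and counts a keyword pair at most once per tag pair per document; both conventions are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List


def cross_tag_cooccurrence(docs: List[Dict[str, List[str]]]) -> Counter:
    """Count keyword pairs split across two different tags of the same document."""
    freq: Counter = Counter()
    for doc in docs:
        # sorted(doc) yields the tag names in a fixed order, so each tag pair is seen once.
        for tag_k, tag_l in combinations(sorted(doc), 2):
            pairs = {
                tuple(sorted((a, b)))
                for a in doc[tag_k]
                for b in doc[tag_l]
                if a != b
            }
            for a, b in pairs:
                freq[(tag_k, tag_l, a, b)] += 1
    return freq


docs = [
    {"post_title": ["olympics"], "post_body": ["crowd", "olympics"]},
    {"post_title": ["olympics"], "post_body": ["crowd"]},
]
cross_tag = cross_tag_cooccurrence(docs)
```

In both toy documents "crowd" appears only in the body and "olympics" in the title, so the pair is linked only through this cross-tag count.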

Based on the information extracted in this way, the association score between keywords is finally calculated (312) as in Equation 4 below, determining which keywords are highly associated with each other.

[Equation 4]

$$\mathrm{Score}(\mathit{keyword}_i, \mathit{keyword}_j) = \alpha \sum_{k=1}^{n} \mathrm{Freq}_{tag_k}(\mathit{keyword}_i, \mathit{keyword}_j) + \beta \sum_{k=1}^{n} \sum_{\substack{l=1 \\ l \neq k}}^{n} \mathrm{Freq}_{tag_k, tag_l}(\mathit{keyword}_i, \mathit{keyword}_j)$$

Here, α is the weight of same-tag co-occurrence information, β is the weight of cross-tag co-occurrence information, and k and l run from 1 to n, the number of pieces of tag information in a blog document.
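A small Python sketch of the association score in Equation 4 follows. It assumes the same-tag and cross-tag counts are held in dictionaries keyed by tag(s) and an alphabetically ordered keyword pair; that layout, the α and β values, and the sample counts are all hypothetical. Each unordered tag pair is stored once here, whereas the double sum over k ≠ l visits ordered pairs, a constant factor that β can absorb if the cross-tag frequency is symmetric.

```python
from typing import Dict, Tuple

SameTag = Dict[Tuple[str, str, str], int]        # (tag, kw_a, kw_b) -> count
CrossTag = Dict[Tuple[str, str, str, str], int]  # (tag_k, tag_l, kw_a, kw_b) -> count


def association_score(same_tag: SameTag, cross_tag: CrossTag,
                      kw_i: str, kw_j: str,
                      alpha: float = 1.0, beta: float = 0.5) -> float:
    """alpha * (same-tag counts summed over tags) + beta * (cross-tag counts summed)."""
    pair = tuple(sorted((kw_i, kw_j)))
    same = sum(f for (_tag, a, b), f in same_tag.items() if (a, b) == pair)
    cross = sum(f for (_tk, _tl, a, b), f in cross_tag.items() if (a, b) == pair)
    return alpha * same + beta * cross


# Toy counts, invented for illustration.
same_tag = {
    ("post_title", "final", "olympics"): 2,
    ("post_body", "final", "olympics"): 1,
    ("post_body", "crowd", "final"): 1,
}
cross_tag = {("post_body", "post_title", "final", "olympics"): 3}

strong = association_score(same_tag, cross_tag, "olympics", "final")
weak = association_score(same_tag, cross_tag, "crowd", "final")
```

Setting β below α, as assumed here, makes a co-occurrence inside one tag count for more than a co-occurrence spread across two tags.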

Once the n keywords most highly associated with keyword_i have been determined by Equation 4, the determined information is passed to the issue keyword and associative word network index unit (108), which indexes the information directly.
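Selecting the n most associated keywords for a given keyword can be sketched in Python as follows, assuming pairwise association scores are stored under alphabetically ordered keyword pairs. The function name and the sample scores are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def top_related(pair_scores: Dict[Tuple[str, str], float],
                keyword: str, n: int) -> List[Tuple[str, float]]:
    """Return the n keywords with the highest association score to `keyword`."""
    related = []
    for (a, b), score in pair_scores.items():
        if keyword == a:
            related.append((b, score))
        elif keyword == b:
            related.append((a, score))
    return heapq.nlargest(n, related, key=lambda item: item[1])


pair_scores = {
    ("final", "olympics"): 4.5,
    ("crowd", "olympics"): 2.0,
    ("crowd", "final"): 1.0,
    ("olympics", "swimming"): 3.0,
}
neighbours = top_related(pair_scores, "olympics", n=2)
```

The resulting neighbour list for each issue keyword is one edge set of the associative word network that the index unit (108) stores.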

FIG. 4 is a diagram illustrating the operation of the issue keyword and associative word network index unit according to an embodiment of the present invention.

Referring to FIG. 4, the issue keyword and associative word network index unit (108) takes the keyword information ranked (306) by the issue keyword extraction unit (104) and the inter-keyword association scores calculated (312) by the associative word network configuration unit (106) and, taking into account their scores or the rankings derived from those scores, stores the information in the issue keyword and associative word network index data (108).
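One possible in-memory shape for the index data is sketched below in Python: a ranked keyword list plus an adjacency list for the associative word network, both ordered by score. The patent does not describe the storage format, so the dictionary layout and the function name `build_index` are assumptions.

```python
from typing import Any, Dict, List, Tuple


def build_index(keyword_scores: Dict[str, float],
                pair_scores: Dict[Tuple[str, str], float]) -> Dict[str, Any]:
    """Store ranked issue keywords and, per keyword, its neighbours ordered by score."""
    ranking = sorted(keyword_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    network: Dict[str, List[Tuple[str, float]]] = {}
    for (a, b), score in pair_scores.items():
        # The association is symmetric, so record the edge in both directions.
        network.setdefault(a, []).append((b, score))
        network.setdefault(b, []).append((a, score))
    for neighbours in network.values():
        neighbours.sort(key=lambda item: (-item[1], item[0]))
    return {"ranking": ranking, "network": network}


index = build_index(
    {"final": 4.0, "olympics": 6.0},
    {("final", "olympics"): 4.5, ("crowd", "olympics"): 2.0},
)
```

Sorting at indexing time means a later lookup only has to read the first entries of a list rather than re-rank on every request.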

FIG. 5 is a flowchart illustrating a procedure for presenting keywords and an associative network in response to a user request according to an embodiment of the present invention.

Referring to FIG. 5, in step S500 the user request information input through the user request input unit (112) is analyzed, and in step S502 the corresponding issue keywords and associative word network are presented.

FIG. 6 is a flowchart illustrating a procedure for presenting keywords and an associative network in response to a user request according to an embodiment of the present invention.

Referring to FIG. 6, when the user first accesses the blog site (600) through the user request input unit (112), or enters a query expressing what the user is looking for or clicks an item in a list (602), the issue keyword and associative word network presentation unit (110), as shown in FIG. 7, either searches the issue keyword list in response to the access (700) or searches using the query (702), depending on the type of request passed from the user request input unit (112).

The search is performed efficiently by referring to the issue keyword and associative word network index data (704), and when a search result is obtained (706), it is presented to the user (708) through the result presentation unit (114). That is, when a list search is performed because the user accessed the site, the issue keyword list is provided; when the user enters a keyword or clicks a particular issue keyword, a list of blog documents related to that keyword and the associative word network are provided.
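The request handling described for the presentation unit (110) can be sketched in Python as a simple dispatch on request type. The request dictionaries, the type names `site_access`, `query`, and `click`, and the index layout (including the keyword-to-documents mapping) are hypothetical; the patent only states which result each kind of request receives.

```python
from typing import Any, Dict


def handle_request(index: Dict[str, Any], request: Dict[str, str]) -> Dict[str, Any]:
    """Return the issue keyword list on site access, or documents and network for a keyword."""
    kind = request.get("type")
    if kind == "site_access":
        # First access to the site: present the ranked issue keyword list.
        return {"issue_keywords": index["ranking"]}
    if kind in ("query", "click"):
        keyword = request["keyword"]
        return {
            "documents": index["documents"].get(keyword, []),
            "network": index["network"].get(keyword, []),
        }
    raise ValueError(f"unsupported request type: {kind!r}")


# Toy index, invented for illustration.
index = {
    "ranking": [("olympics", 6.0), ("election", 5.0)],
    "network": {"olympics": [("final", 4.5), ("crowd", 2.0)]},
    "documents": {"olympics": ["post-17", "post-42"]},
}
landing = handle_request(index, {"type": "site_access"})
detail = handle_request(index, {"type": "click", "keyword": "olympics"})
```

A typed query and a click on a listed keyword take the same path, matching the description that both yield the related blog document list together with the associative word network.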

The result presentation unit (114) presents the search result as the list of blog documents related to the keyword together with a visual representation of the associative word network.

As described above, the present invention uses the tags and co-occurrence information of blog documents on websites to automatically extract the keywords that have become issues within a blog document set, rank the extracted keywords, and identify the associations among them to construct an associative network. Specifically, keywords are extracted from a blog document's title tag, the post tags entered by the user, and the document body; a score is calculated for each keyword; high-scoring keywords are extracted as the issue keywords of the blog documents; and an associative network among the keywords is constructed using the co-occurrence information of the extracted keywords.

Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be defined by the claims below and by their equivalents.

FIG. 1 is a block diagram showing the structure of an apparatus for keyword extraction and associative word network configuration according to a preferred embodiment of the present invention;

FIG. 2 is a flowchart showing a procedure for extracting and indexing issue keywords from a blog document set according to an embodiment of the present invention;

FIG. 3 is a block diagram showing a scheme for constructing an association network between issue keywords in a blog document set according to an embodiment of the present invention;

FIG. 4 is a diagram showing the operation of the issue keyword and associative word network indexing unit according to an embodiment of the present invention;

FIG. 5 is a flowchart showing a procedure for presenting keywords and an association network in response to a user request according to an embodiment of the present invention;

FIG. 6 is a block diagram showing the structure of a user request analysis unit according to an embodiment of the present invention;

FIG. 7 is a block diagram showing the structure of the issue keyword and associative word network presentation unit according to an embodiment of the present invention.

<Description of reference numerals for major parts of the drawings>

104: issue keyword extraction unit

106: associative word network construction unit

108: issue keyword and associative word network indexing unit

110: issue keyword and associative word network presentation unit

112: user request input unit

114: result presentation unit

Claims

An apparatus for keyword extraction and associative word network configuration of document data, the apparatus comprising:

an issue keyword extraction unit that parses document structure information in an input web document set, performs morphological analysis on the information identified for each tag, and extracts issue keywords based on the analyzed morphemes;

an associative word network construction unit that extracts associations between the extracted issue keywords;

an indexing unit that indexes and stores the issue keywords extracted by the issue keyword extraction unit and the associative word network constructed by the associative word network construction unit; and

a presentation unit that presents the issue keywords and associative word network information by referring to the indexing unit according to an input control command.

The apparatus of claim 1,

wherein the issue keyword extraction unit

extracts nouns from the morphemes analyzed for each tag as issue keyword candidates, calculates the frequency of each extracted keyword, the information for each tag in which it occurs, and the weight for each tag in which it occurs, and sets priorities in descending order of the calculated value.

The apparatus of claim 1,

wherein the associative word network construction unit

calculates, as the relevance between the extracted issue keywords, the keyword co-occurrence relations within the same tag,

extracts the co-occurrence relations between keywords used in external tags, and

calculates an association score by applying different weights to the same-tag information and the external-tag information.
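The association score described above might be sketched as follows. This is a hedged illustration, not the claimed computation: it assumes that "external tag" co-occurrence means two keywords appearing in different tags of the same document, and the weight values and the function name `association_scores` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical weights; the source only says the two kinds are weighted differently.
SAME_TAG_WEIGHT = 1.0
EXTERNAL_TAG_WEIGHT = 0.5


def association_scores(documents, same_w=SAME_TAG_WEIGHT, ext_w=EXTERNAL_TAG_WEIGHT):
    """Map each alphabetically ordered keyword pair to a weighted co-occurrence score.

    Each document maps a tag name to the issue keywords found in that tag.
    """
    same = Counter()   # pairs co-occurring inside one tag
    cross = Counter()  # pairs split across two different tags (assumed meaning)
    for doc in documents:
        tag_sets = [set(keywords) for keywords in doc.values()]
        for keywords in tag_sets:
            for a, b in combinations(sorted(keywords), 2):
                same[(a, b)] += 1
        for first, second in combinations(tag_sets, 2):
            for a in first:
                for b in second:
                    if a != b:
                        cross[tuple(sorted((a, b)))] += 1
    pairs = set(same) | set(cross)
    return {pair: same_w * same[pair] + ext_w * cross[pair] for pair in pairs}


docs = [{"title": ["election", "debate"], "body": ["debate", "poll"]}]
scores = association_scores(docs)
```

Here `debate` and `election` share the title tag and also co-occur across title and body, so their pair scores higher than `election` and `poll`, which only co-occur across tags.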

The apparatus of claim 1,

wherein the indexing unit indexes and stores

the issue keywords that fall within a preset priority among the issue keywords extracted by the issue keyword extraction unit, and

the keywords that exceed a preset association score in the associative word network information constructed by the associative word network construction unit.
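The selection performed before indexing can be illustrated with a short sketch. It assumes that "a preset priority" means keeping the top `top_n` ranked keywords and that the association threshold is a strict lower bound `min_score`; both parameter names, the function name `build_index`, and the dictionary layout of the index are hypothetical.

```python
def build_index(ranked_keywords, association_scores, top_n, min_score):
    """Keep the top-ranked keywords and the keyword pairs whose score exceeds min_score.

    ranked_keywords: list of (keyword, score) pairs, highest score first.
    association_scores: dict mapping a keyword pair to its association score.
    """
    kept_keywords = [keyword for keyword, _ in ranked_keywords[:top_n]]
    kept_edges = {
        pair: score
        for pair, score in association_scores.items()
        if score > min_score  # "exceeds" is read as strictly greater than
    }
    return {"keywords": kept_keywords, "network": kept_edges}


ranked = [("election", 7.0), ("debate", 6.0), ("weather", 1.0)]
assoc = {("debate", "election"): 1.5, ("election", "poll"): 0.5}
index = build_index(ranked, assoc, top_n=2, min_score=1.0)
```

With these sample values, only `election` and `debate` are indexed as issue keywords, and only the `debate`/`election` pair is stored in the network.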

The apparatus of claim 1,

wherein the presentation unit,

according to the type of information request in the control command input by a user, presents an issue keyword list upon a first access or a simple list request, and,

when the control command is a directly entered query or a click on the keyword list, presents a list of blog documents related to the clicked keyword together with the associative word network.
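The two presentation cases above can be sketched as a simple dispatch. This is an assumption-laden illustration: a request of `None` stands for a first access or simple list request, a string stands for either a typed query or a clicked keyword, and the index layout (`keywords`, `documents`, `network`) and the function name `present` are hypothetical.

```python
def present(index, request=None):
    """Return what to show for a request.

    request is None for a first access or simple list request; otherwise it
    is the keyword that was typed as a query or clicked in the keyword list.
    """
    if request is None:
        return {"keywords": index["keywords"]}
    documents = index["documents"].get(request, [])
    # Only the part of the network that touches the requested keyword.
    related = {pair: score for pair, score in index["network"].items()
               if request in pair}
    return {"documents": documents, "network": related}


sample_index = {
    "keywords": ["election", "debate"],
    "documents": {"election": ["post-1", "post-2"], "debate": ["post-1"]},
    "network": {("debate", "election"): 1.5, ("debate", "poll"): 1.5},
}
first_view = present(sample_index)
keyword_view = present(sample_index, "election")
```

A first access returns only the issue keyword list, while a request for `election` returns its blog documents and the network edges that include it.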

A method for keyword extraction and associative word network configuration of document data, the method comprising:

parsing document structure information in an input web document set and performing morphological analysis on the information identified for each tag;

extracting issue keywords based on the analyzed morphemes;

extracting associations between the extracted issue keywords;

indexing and storing the issue keywords and the extracted keyword association information in an indexing unit; and

when a control command is input, presenting the issue keywords and the association information by referring to the indexing unit.

The method of claim 6,

wherein extracting the issue keywords comprises:

extracting nouns from the morphemes analyzed for each tag as issue keyword candidates;

calculating the frequency of each extracted keyword, the information for each tag in which it occurs, and the weight for each tag in which it occurs; and

setting priorities in descending order of the calculated value.

The method of claim 6,

wherein extracting the associations comprises:

calculating, as the relevance between the extracted issue keywords, the keyword co-occurrence relations within the same tag;

extracting the co-occurrence relations between keywords used in external tags; and

calculating an association score by applying different weights to the same-tag information and the external-tag information.

The method of claim 6,

wherein the indexing and storing

indexes and stores the issue keywords that fall within a preset priority among the issue keywords, and the keywords that exceed a preset association score in the association information.

The method of claim 6,

wherein presenting the issue keywords and the association information comprises:

presenting an issue keyword list upon a first access or a simple list request, according to the type of information request in the control command input by a user; and

when the control command is a directly entered query or a click on the keyword list, presenting a list of blog documents related to the clicked keyword together with the associative word network.