KR101318843B1

KR101318843B1 - Blog category classification method and apparatus using time information

Info

Publication number: KR101318843B1
Application number: KR1020110087110A
Authority: KR
Inventors: 이성우; 이지형
Original assignee: 성균관대학교산학협력단
Priority date: 2011-08-30
Filing date: 2011-08-30
Publication date: 2013-10-17
Also published as: KR20130023977A

Abstract

The present invention provides a method for classifying the category of a blog using time information, the method comprising: a word extraction step of extracting the words used in the blog from the documents in the blog; a calculation step of calculating, based on the frequency with which the extracted words appear in the documents in the blog, a comprehensive time information value that represents the temporal distribution of the word appearances; and a classification step of classifying the category of the blog according to the time information of the words.

Description

Blog category classification method and apparatus using time information {BLOG CATEGORY CLASSIFICATION METHOD AND APPARATUS USING TIME INFORMATION}

The present invention relates to a blog category classification method and apparatus and, more particularly, to a blog category classification method and apparatus that use time information to track how the content of a blog changes over time and categorize the blog accordingly.

With the arrival of the Web 2.0 era, an environment has emerged in which any user can become a producer of information and content. Blogs are a representative Web 2.0 service in which information is produced and shared on the basis of users' voluntary participation. Although a blog is personal in nature, it can at times wield as much influence over the Internet as any established mass media outlet, which is why blogs are also called "one-person media".

The word blog is short for weblog, a compound of web, meaning the Internet, and log, meaning a record, and refers to a personal site on which a person freely posts writing according to his or her interests. Blogs can be examined in terms of their formal characteristics and their content characteristics. The formal characteristics are as follows: first, a blog takes the form of a diary; second, posts are arranged in reverse chronological order, so the most recent post appears at the top; third, a post is published and reflected as soon as it is written; and fourth, posts are categorized and linked to one another like a web through trackbacks.

In the past, no matter how outstanding the writing or artwork of a person not widely recognized, it remained a pearl in the mud because there was no means of making it known to many people. An Internet blog, however, sits in a space that anyone in the world can see, so anyone whose content can move readers gets an immediate response. The share of web documents accounted for by blogs has grown accordingly, and the need to classify blog pages effectively, so that information can be shared and obtained through blogs, has grown with it.

Previous research on blog classification focused on SVM analysis of the content in a blog and on automatic blog classification through tag expansion. With these approaches, classifying the category of a blog becomes very difficult when the user, while continuing to run the blog, posts articles belonging to several categories.

In addition, because new documents are frequently published in a blog over time, there is no assurance that a classification made at an earlier point in time reflects the current category of the blog.

Accordingly, there is a growing need to classify blog categories accurately and automatically in line with the recent tendencies of a blog.

Korean Patent Application Publication No. 10-2005-0107886 ("Method for displaying blogs by category", Daishin Securities Co., Ltd., published November 16, 2005)

An object of the present invention is to provide a blog category classification method and apparatus that use the time information of a blog to improve the accuracy of dynamic blog category classification, by calculating the temporal distribution and the recent mention frequency of the words in the web documents of a blog composed of multiple posts, according to how steadily and how frequently in the recent past those words have been mentioned.

A blog category classification method using time information for solving the above problem is a method for classifying the category of a blog using time information, and may include: a word extraction step of extracting the words used in the blog from the documents in the blog; a calculation step of calculating, based on the frequency with which the extracted words appear in the documents in the blog, a comprehensive time information value that represents the temporal distribution of the word appearances; and a classification step of classifying the category of the blog according to the time information of the words.

The calculation step may include: an appearance frequency calculation step of calculating the appearance frequency of the extracted words; a recent appearance frequency calculation step of calculating the recent appearance frequency of the words using the appearance frequency; and a time information calculation step of calculating the comprehensive time information value based on the recent appearance frequency and a steadiness value indicating how steadily the words appear.

In the appearance frequency calculation step, the appearance frequency may be calculated based on a word frequency, which indicates the number of times the word is used in one document in the blog, and a document frequency, which is the number of documents in the blog divided by the number of documents containing the word.

In the recent appearance frequency calculation step, the recent appearance frequency may be calculated by giving greater weight to the words mentioned most recently in the blog.

In the recent appearance frequency calculation step, the recent appearance frequency may be calculated through an equation in which tf_i × idf_i is the appearance frequency, current time is the current date, time of d is the date on which the word appeared, and D is the total number of documents in which the word appears.

In the time information calculation step, the steadiness value may be a time variance value obtained by dividing the difference between the date on which the word appears and the middle date of the entire period over which the word appears by the total number of times the word appears.

In the time information calculation step, the comprehensive time information value may be calculated as the product of the recent appearance frequency and the steadiness value.

A blog category classification apparatus using time information for solving the above problem is an apparatus for classifying the category of a blog using time information, and may include: a word extraction unit that extracts the words used in the blog from the documents in the blog; a calculation unit that calculates, based on the frequency with which the extracted words appear in the documents in the blog, a comprehensive time information value that represents the temporal distribution of the word appearances; and a classification unit that classifies the category of the blog according to the time information of the words.

The calculation unit may include: an appearance frequency calculation unit that calculates the appearance frequency of the extracted words; a recent appearance frequency calculation unit that calculates the recent appearance frequency of the words using the appearance frequency; and a time information calculation unit that calculates the comprehensive time information value of the words based on the recent appearance frequency and a steadiness value indicating how steadily the words appear.

The appearance frequency calculation unit may calculate the appearance frequency based on a word frequency, which indicates the number of times the word is used in one document in the blog, and a document frequency, which is the number of documents in the blog divided by the number of documents containing the word.

The recent appearance frequency calculation unit may calculate the recent appearance frequency by giving greater weight to the words mentioned most recently in the blog.

The recent appearance frequency calculation unit may calculate the recent appearance frequency through an equation in which tf_i × idf_i is the appearance frequency, current time is the current date, time of d is the date on which the word appeared, and D is the total number of documents in which the word appears.

The steadiness value may be a time variance value obtained by dividing the difference between the date on which the word appears and the middle date of the entire period over which the word appears by the total number of times the word appears.

The time information calculation unit may calculate the comprehensive time information value as the product of the recent appearance frequency and the steadiness value.

According to the blog category classification method and apparatus using time information of the present invention, using the time information of blog posts improves the dynamic category classification of a blog.

In addition, according to the blog category classification method and apparatus using time information of the present invention, because the changing tastes and interests of a blog can be tracked over time, the initially assigned blog category changes automatically to follow those changes. Users can therefore find information more quickly and conveniently.

FIG. 1 is a diagram for explaining the process of extracting the temporal distribution of a specific word in a blog category classification method according to an embodiment of the present invention.
FIG. 2 is a flowchart schematically illustrating a blog category classification method according to an embodiment of the present invention.
FIG. 3 is a detailed flowchart illustrating the calculation step of a blog category classification method according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining the recent appearance frequency calculation step of a blog category classification method according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining the time information calculation step of a blog category classification method according to an embodiment of the present invention.
FIGS. 6A and 6B are diagrams for explaining how the result of the time information calculation step differs with the degree to which the appearance times of a word are spread out, in a blog category classification method according to an embodiment of the present invention.
FIG. 7 is a block diagram schematically illustrating a blog category classification apparatus according to an embodiment of the present invention.
FIG. 8 is a detailed block diagram illustrating the calculation unit of a blog category classification apparatus according to an embodiment of the present invention.
FIG. 9A is a diagram showing the representative words of each blog site according to the present invention.
FIG. 9B is a diagram showing the results of the blog categories classified according to the present invention.
FIG. 9C is a diagram showing the average accuracy in numerical form.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail.

However, this is not intended to limit the present invention to the specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be named a second component and, similarly, a second component may be named a first component. The term "and/or" includes a combination of a plurality of related listed items or any one of a plurality of related listed items.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may be present in between. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no other component is present in between.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant art and, unless explicitly defined in this application, are not to be interpreted in an idealized or excessively formal sense.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To make the description easier to follow, the same reference numerals are used for the same components in the drawings, and duplicate descriptions of the same components are omitted.

The blog category classification method using time information according to the present invention can be applied to blogs, but is not limited to them. It can also be used in Internet communities other than blogs, and the method of the present invention can be applied selectively wherever time information is to be used in general information retrieval.

FIG. 1 is a diagram for explaining the process of extracting the temporal distribution of a specific word in a blog category classification method according to an embodiment of the present invention. As shown in FIG. 1, temporal effects can have a large influence on the result of the classification process. In general, bloggers post web documents on topics that interest them over a span of years. A blog consists of many posts and, as shown in FIG. 1, each blog post carries time information. Using this time information, it is possible to determine how steadily, and how often in the recent past, each word has been mentioned in the blog posts. In the embodiment shown in FIG. 1, the word was mentioned 10 times in total in 2011, 6 of them after May 2011, which suggests that the word has recently become a topic of interest. The following describes in detail how blog categories are classified using this time information.

FIG. 2 is a flowchart schematically illustrating a blog category classification method according to an embodiment of the present invention. As shown in FIG. 2, the blog category classification method according to the present invention may include a step 210 of extracting the words used in the blog, a step 220 of calculating, based on the appearance frequency of the words, a comprehensive time information value that represents the temporal distribution of the word appearances, and a step 230 of classifying the category of the blog according to the time information.

First, in the word extraction step 210, the word extraction unit extracts each word in the blog from the web documents. Because examining the time series of each word requires information to be collected per word rather than per document, the words used in the blog are extracted from each document or blog post.
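The per-word collection described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `extract_words`, the `(date, text)` post representation, and the simple lowercase regular-expression tokenizer are all assumptions. Korean text would need a proper morphological analyzer instead.

```python
import re
from collections import Counter
from datetime import date


def extract_words(posts):
    """Group word counts by word, keeping the date of each post.

    posts: iterable of (post_date, text) pairs (hypothetical representation).
    Returns {word: [(post_date, count_in_post), ...]}.
    """
    occurrences = {}
    for post_date, text in posts:
        # Assumed tokenizer: runs of lowercase ASCII letters and digits.
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        for word, n in counts.items():
            occurrences.setdefault(word, []).append((post_date, n))
    return occurrences


posts = [
    (date(2011, 1, 15), "iPhone review: the iPhone camera"),
    (date(2011, 6, 23), "New iPhone rumors"),
]
occurrences = extract_words(posts)
```

Keeping the post date next to each count is what lets the later steps reason about when a word appeared, not only how often.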

In the calculation step 220, the calculation unit calculates, based on the number and frequency of appearances of the words extracted in the word extraction step 210, a comprehensive time information value showing how the extracted words are distributed over time. Step 220 is the key step that distinguishes the method from conventional blog classification methods; in it, the calculation unit can calculate the recent appearance frequency from the appearance frequency, and the comprehensive time information value from the recent appearance frequency. This is described in detail below with reference to FIG. 3.

In the classification step 230, the classification unit can classify the category of the blog according to the comprehensive time information value obtained in the calculation step 220. The DMOZ directory can be used for this. DMOZ is the largest and most comprehensive human-edited directory of the Web, built and maintained by a global community. For example, a search for the word "iphone" returns the DMOZ categories related to that word. DMOZ shows the top five categories for a query word. The top five categories are extracted for each word, and the category that appears most often among the top five categories of all the words can be chosen as the representative category of the blog.
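The voting over per-word categories can be sketched in Python as below. The directory lookup is passed in as a function because no DMOZ API is described here; `lookup_categories`, `FAKE_DIRECTORY`, and the category names are hypothetical stand-ins used only for illustration.

```python
from collections import Counter


def representative_category(top_words, lookup_categories, per_word=5):
    """Pick the category that occurs most often among each word's top categories.

    lookup_categories: hypothetical callable mapping a word to a ranked
    list of category names (a stand-in for a directory search).
    """
    votes = Counter()
    for word in top_words:
        # Only the first `per_word` categories of each word get a vote.
        votes.update(lookup_categories(word)[:per_word])
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Made-up directory data, for illustration only.
FAKE_DIRECTORY = {
    "iphone": ["Computers/Mobile", "Shopping/Electronics"],
    "android": ["Computers/Mobile", "Computers/Software"],
    "recipe": ["Home/Cooking"],
}
category = representative_category(
    ["iphone", "android", "recipe"],
    lambda word: FAKE_DIRECTORY.get(word, []),
)
```

In this toy data, "Computers/Mobile" is returned because it is the only category suggested by two of the three words.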

FIG. 3 is a detailed flowchart illustrating the calculation step 220 of a blog category classification method according to an embodiment of the present invention. As shown in FIG. 3, the calculation step 220 may include a step 310 of calculating the appearance frequency of the words, a step 320 of calculating the recent appearance frequency using the appearance frequency, and a step 330 of calculating the comprehensive time information value based on the recent appearance frequency and the steadiness value of the words.

In the appearance frequency calculation step 310, the appearance frequency calculation unit calculates the appearance frequency of a word from the word frequency and the document frequency. The appearance frequency is a statistical measure of how important a word is within a collection of documents. The measure is needed because importance increases in proportion to the frequency with which a word appears in a document. The appearance frequency is widely used in search engines that rank documents by their similarity to the words of a user's query.

First, the word frequency (term frequency) is simply the frequency with which a particular word appears in a single document. Because the raw value can be biased for long documents, it is generally normalized before use. The word frequency can be obtained through Equation 1.

$$tf_i = \frac{n_i}{\sum_k n_k} \qquad \text{(Equation 1)}$$

Here, tf is the word frequency, n_i is the number of appearances of the word under consideration, and n_k is the number of appearances of each word, summed over all words.
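As a small illustration of the normalized word frequency defined above, the following Python sketch divides each word's count in one document by the total number of word occurrences in that document. The function name `term_frequencies` and the token-list input are assumptions made for the example.

```python
from collections import Counter


def term_frequencies(tokens):
    """Normalized word frequency: count of each word / count of all words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # An empty document yields an empty result, so no division by zero occurs.
    return {word: n / total for word, n in counts.items()}


tf = term_frequencies(["iphone", "review", "iphone", "camera"])
```

Here "iphone" accounts for two of the four tokens, so its word frequency is 0.5 regardless of how long the document is.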

The document frequency (inverse document frequency) is a measure of the general importance of a word, and can be calculated by dividing the number of all documents by the number of documents containing the word and then taking the logarithm. This can be expressed as Equation 2.

$$idf_i = \log \frac{D}{\left|\{d : t_i \in d\}\right|} \qquad \text{(Equation 2)}$$

Here, idf is the document frequency, D is the total number of documents in the blog, and |{d : t_i ∈ d}| is the number of documents in which the word t_i appears.

The appearance frequency of a word (tf-idf) is obtained by multiplying the word frequency by the document frequency. It serves as a measure of how significant a word is within a particular document. In other words, the appearance frequency (tf-idf) is used to select the words that best represent a particular document while excluding words that are in wide general use.
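The product of word frequency and document frequency can be sketched in Python as follows. The function names and the list-of-token-lists input are assumptions, and the natural logarithm is used only because the text does not state a base.

```python
import math
from collections import Counter


def inverse_document_frequencies(documents):
    """log(number of documents / number of documents containing the word)."""
    n_docs = len(documents)
    doc_counts = Counter(word for doc in documents for word in set(doc))
    # Natural log is an assumption; the source only says "taking the logarithm".
    return {word: math.log(n_docs / df) for word, df in doc_counts.items()}


def tf_idf(documents):
    """Per-document {word: tf * idf} scores for a list of token lists."""
    idf = inverse_document_frequencies(documents)
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())
        scores.append({w: (n / total) * idf[w] for w, n in counts.items()})
    return scores


docs = [["iphone", "camera"], ["iphone", "rumor"], ["recipe", "soup"]]
scores = tf_idf(docs)
```

A word that occurs in every document gets a document frequency of log(1) = 0, which is how widely used words are pushed out of the ranking.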

In the present invention, each appearance frequency (tf-idf) measure is redefined and used on the basis of the information in the image database obtained for calculating the appearance frequency (tf-idf), so that the image closest to the query value entered by the user can be found. The appearance frequency (tf-idf) is an index of how well a particular keyword represents the characteristics of a document and, as described above, the tags assigned to an image are taken as the basis when the appearance frequency (tf-idf) is calculated. That is, because the tags are judged to already be words that represent an image well, the document frequency value may carry little meaning. In such a case, the word frequency values of the image tags are used.

Next, in the recent appearance frequency calculation step 320, the recent appearance frequency calculation unit calculates the recent appearance frequency using the appearance frequency calculated in the appearance frequency calculation step 310.

FIG. 4 is a diagram for explaining the recent appearance frequency calculation step 320 of a blog category classification method according to an embodiment of the present invention. As shown in FIG. 4, the recent appearance frequency is an add-on to the appearance frequency that captures the time information associated with the documents for the purpose of classifying the blog category. Initially, a blogger posts about the topics most likely to interest visitors, and these topics belong to a category. Over time, the blogger's interests change, and the blog category must change with them. In other words, the most recently mentioned words are the most likely to be the words that represent the blog. The recent appearance frequency is therefore calculated by giving greater weight to the words mentioned most recently in the blog.

Based on this, the recent appearance frequency can be expressed as Equation 4.

Here, Recent Topic Score denotes the recent appearance frequency, t_i is a word in the blog, D is the total number of blog posts in which the word t_i appears, and tf_i × idf_i is the appearance frequency of the word t_i. Through Equation 4, the recent appearance frequency of the words in the blog is calculated, after which the comprehensive time information value of each word can be calculated. The top n words in the blog can then be extracted according to the comprehensive time information value, and the category of the blog can be classified from them.
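The exact form of Equation 4 is not reproduced in this text, so the following Python sketch is only one plausible reading of the variables named above. It assumes that each post's tf-idf value is divided by the age of the post in days plus one, and that the result is averaged over the D posts containing the word. The function name, the `(date, tfidf)` input format, and the `+ 1` smoothing term are all assumptions.

```python
from datetime import date


def recent_topic_score(occurrences, current_date):
    """Recency-weighted average of tf-idf values (assumed form, see lead-in).

    occurrences: list of (post_date, tfidf) pairs for the posts
    that contain the word.
    """
    if not occurrences:
        return 0.0
    total = 0.0
    for post_date, tfidf in occurrences:
        # Age of the post in days; future-dated posts are treated as age 0.
        age_days = max((current_date - post_date).days, 0)
        total += tfidf / (age_days + 1)  # "+ 1" avoids division by zero
    return total / len(occurrences)  # divide by D, the number of posts


today = date(2011, 6, 30)
recent = recent_topic_score([(date(2011, 6, 23), 0.4)], today)
old = recent_topic_score([(date(2011, 1, 15), 0.4)], today)
```

With equal tf-idf values, the word mentioned a week ago scores higher than the word mentioned in January, which is the weighting behaviour the text describes.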

Next, in the comprehensive time information calculation step 330, the time information calculator calculates the comprehensive time information value based on the recent appearance frequency calculated in the recent appearance frequency calculation step 320 and the steadiness value of the words.

FIG. 5 is a diagram illustrating the time information calculation step 330 of the blog category classification method according to an embodiment of the present invention. As shown in FIG. 5, in the time information calculation step 330, the time information calculator calculates, for each word, a steadiness value indicating how steadily the word is mentioned in the blog. To calculate it, the present invention uses a time variance value (Variance). This is obtained by dividing the difference between each point in time at which the word appears and the middle date of the entire period in which the word appears by the total number of times the word appears. This can be expressed as an equation as follows.

Here, Variance is the time variance value, that is, the steadiness value; the date variable is a date on which the word appeared in a document; M is the middle date of the entire period in which the word appeared; and K is the number of times the word appeared in the entire period.

According to the embodiment shown in FIG. 5, the date variable takes a total of 10 values, 11/01/15, 11/01/30, ..., 11/06/23; M is 11/05/15, the middle date between 11/01/15 and 11/06/23; and K is 10. The time variance value is therefore obtained by subtracting M from each date value, dividing the result by K, and summing the quotients. This time variance value is an indicator of whether each word has appeared steadily spread out around the middle date over time or has appeared concentrated near the middle date.
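The calculation described above can be sketched in Python as follows. The variance equation is not reproduced in this text, so two points are assumptions: the difference between each date and M is taken as an absolute number of days, and M is taken as the midpoint between the earliest and latest appearance. The function name `steadiness` and the sample dates are hypothetical.

```python
from datetime import date


def steadiness(dates):
    """Time variance of one word's appearance dates.

    M is the midpoint between the earliest and latest appearance and
    K is the number of appearances.

    ASSUMPTION: each (date - M) difference is an absolute number of days;
    the text does not say whether the difference is signed, absolute or squared.
    """
    if not dates:
        return 0.0
    ordinals = [d.toordinal() for d in dates]
    middle = (min(ordinals) + max(ordinals)) / 2
    k = len(ordinals)
    return sum(abs(o - middle) / k for o in ordinals)


# Same first and last date, but different distribution in between.
clustered = [date(2011, 3, 1), date(2011, 4, 3), date(2011, 4, 5), date(2011, 5, 9)]
spread = [date(2011, 3, 1), date(2011, 3, 20), date(2011, 4, 20), date(2011, 5, 9)]
print(steadiness(clustered), steadiness(spread))
```

Under these assumptions, the word whose mentions are spread across the period gets the larger value, and the word whose mentions cluster near the middle date gets the smaller one.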

FIGS. 6A and 6B are diagrams illustrating how the result of the time information calculation step of the blog category classification method according to an embodiment of the present invention differs with the degree of spread of the word appearance times. FIG. 6A shows word appearances concentrated around the middle date, and FIG. 6B shows word appearances spread out around the middle date. As shown in FIGS. 6A and 6B, the blog category classification method of the present invention uses the concept of the standard deviation from probability theory to calculate the time variance value of each word, because the standard deviation indicates how far observations are spread from the mean or median. Therefore, a larger time variance value indicates that the word appears more evenly across all the documents, and the steadiness value is correspondingly larger.

Once the steadiness value has been obtained, the comprehensive time information value can be calculated from the steadiness value and the recent appearance frequency value obtained in the recent appearance frequency calculation step 320. The comprehensive time information value is calculated as the product of the recent appearance frequency and the steadiness value. This can be expressed as an equation as follows.

Here, Total Score is the comprehensive time information value, Variance is the steadiness value, and Recent Topic Score is the recent appearance frequency; that is, Total Score = Variance × Recent Topic Score.

The comprehensive time information value is an index indicating how often each word has appeared recently and how steadily it has appeared in the blog. It is used to select the representative words of the blog. Then, in the classification step 230, the category of the blog can be classified with the DMOZ method described above, based on the words selected in the comprehensive time information calculation step 330.
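As a minimal sketch of this selection step, the Python below multiplies the two per-word values and keeps the top n words. The function name `top_words` and the sample scores are hypothetical, and the two input dictionaries are assumed to have been computed beforehand.

```python
def top_words(recent_scores, steadiness_values, n):
    """Rank words by Total Score = steadiness * recent appearance frequency.

    recent_scores:     dict mapping word -> recent appearance frequency
    steadiness_values: dict mapping word -> steadiness (time variance) value
    n:                 number of representative words to keep
    """
    totals = {
        word: recent_scores[word] * steadiness_values.get(word, 0.0)
        for word in recent_scores
    }
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:n]


# Hypothetical values for three words.
recent = {"patent": 0.5, "phone": 0.4, "travel": 0.9}
steady = {"patent": 25.0, "phone": 20.0, "travel": 2.0}
print(top_words(recent, steady, 2))
```

In this example "travel" has the highest recent score but appears only in a short burst, so the product ranks "patent" and "phone" above it.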

FIG. 7 is a block diagram schematically illustrating a blog category classification apparatus according to an embodiment of the present invention. As shown in FIG. 7, the blog category classification apparatus according to the present invention may include a word extractor 710, a calculator 720, and a classifier 730.

The word extractor 710 extracts the individual words of the blog from its web documents. Because checking the sequence of time information for each word requires collecting information per word rather than per document, the words used in the blog are extracted from each document or blog post.

The calculator 720 calculates a comprehensive time information value that shows how the extracted words are distributed over time, based on the number of appearances, the frequency, and so on of the words extracted by the word extractor 710. The calculator can calculate the recent appearance frequency from the appearance frequency, and the comprehensive time information value from the recent appearance frequency. This is described in detail below with reference to FIG. 8.

The classifier 730 classifies the category of the blog according to the comprehensive time information value obtained by the calculator 720. As described above, the DMOZ method can be used for classification. DMOZ is the largest and most comprehensive human-edited directory of the web, built and maintained by a global community. DMOZ returns the top five categories for a query word. The category that appears most frequently among the top five categories of the individual words can then be determined as the representative category of the blog.
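The voting described above can be sketched in Python as follows. The directory lookup is replaced by a plain dictionary, since no DMOZ query interface is specified in this text; the function name `representative_category`, the lookup table, and the category names are all hypothetical.

```python
from collections import Counter


def representative_category(words, top_categories):
    """Pick the category that occurs most often across the words' top-5 lists.

    words:          representative words of the blog
    top_categories: dict mapping word -> list of categories returned for it
                    (a stand-in for a directory query; hypothetical data)
    """
    votes = Counter()
    for word in words:
        # Only the top five categories per word are counted.
        votes.update(top_categories.get(word, [])[:5])
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical lookup results for three representative words.
lookup = {
    "patent": ["Society/Law", "Computers/Mobile", "Business"],
    "phone": ["Computers/Mobile", "Shopping"],
    "android": ["Computers/Mobile", "Computers/Software"],
}
print(representative_category(["patent", "phone", "android"], lookup))
```

A simple majority vote is used here; how ties between categories should be broken is not stated in the text.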

FIG. 8 is a detailed block diagram of the calculator 720 of the blog category classification apparatus according to an embodiment of the present invention. As shown in FIG. 8, the calculator 720 may include an appearance frequency calculator 810, a recent appearance frequency calculator 820, and a time information calculator 830.

The appearance frequency calculator 810 calculates an appearance frequency indicating how often the words extracted by the word extractor 710 appear in the blog. The appearance frequency can be calculated from the word frequency, which is the number of times a word is used in one document in the blog, and the document frequency, which is the number of documents in the blog divided by the number of documents containing the word.
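A minimal Python sketch of this appearance frequency follows. It takes the definition literally: the document frequency is the plain ratio of documents, with no logarithm, and the two factors are multiplied as stated in the claims. The function name `appearance_frequency`, the tokenized input format, and the sample posts are assumptions.

```python
def appearance_frequency(posts, word):
    """Return one tf * idf value per post for `word`.

    posts: list of token lists, one list per blog post.
    tf  = number of times `word` occurs in the post.
    idf = (number of posts) / (number of posts containing `word`),
          taken literally from the description (no logarithm applied).
    """
    containing = sum(1 for tokens in posts if word in tokens)
    if containing == 0:
        return [0.0] * len(posts)
    idf = len(posts) / containing
    return [tokens.count(word) * idf for tokens in posts]


posts = [
    ["patent", "phone", "patent"],
    ["phone", "review"],
    ["patent", "lawsuit"],
    ["travel", "food"],
]
scores = appearance_frequency(posts, "patent")
print(scores)
```

Here "patent" occurs in two of four posts, so the document frequency is 2.0 and the post that mentions it twice receives the highest value.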

The recent appearance frequency calculator 820 calculates the recent appearance frequency of the words using the appearance frequency calculated by the appearance frequency calculator 810. It is calculated by giving weight to the words most recently mentioned in the blog.

The recent appearance frequency calculator 820 can calculate the recent appearance frequency through an expression in which tf_i × idf_i is the appearance frequency, current time is the current date, time of d is the date on which the word appeared, and D is the total number of documents in which the word appears.

The time information calculator 830 calculates the comprehensive time information value of the words based on the recent appearance frequency and a steadiness value indicating how steadily the words appear. Here, the steadiness value is a time variance value obtained by dividing the difference between the appearance date of a word and the middle date of the entire period in which the word appears by the total number of times the word appears. The time variance value is used because it is obtained in the same way as the standard deviation in probability theory, which is an index of how far observations are spread from the mean or median. Therefore, a larger time variance value indicates that the word appears more evenly across all the documents, and the steadiness value is correspondingly larger.

The comprehensive time information value can be obtained as the product of the recent appearance frequency and the steadiness value. The comprehensive time information value is an index indicating how often each word has appeared recently and how steadily it has appeared in the blog. It is used to select the representative words of the blog. The classifier 730 can then classify the category of the blog with the DMOZ method described above, based on the words selected by the time information calculator 830.

FIGS. 9A, 9B, and 9C are diagrams illustrating the effect of the blog category classification method according to an embodiment of the present invention.

FIG. 9A shows the representative words extracted for each blog site according to the present invention. The representative words are extracted on a relative basis. For sites whose collected data is mostly general in nature, the representative words are less accurate, whereas for blogs that concentrate on a single topic they are highly accurate. In particular, for bgr.com many of the current representative words relate to patents, which can be attributed to the many recent disputes between companies over smartphone patents.

FIGS. 9B and 9C show the results of classifying blog categories according to the present invention. FIG. 9B shows the resulting blog categories, and FIG. 9C shows the average accuracy as numerical values. As shown in FIG. 9B, the blog categories obtained for each blog site are highly varied. As shown in FIG. 9C, the average classification accuracy is high for some blog sites and low for others. As described above, this can be attributed to differences in how strongly each blog concentrates on the topics of a specific category.

Although the invention has been described above with reference to the drawings and embodiments, this does not mean that the scope of protection of the present invention is limited by those drawings or embodiments, and those skilled in the art will understand that various modifications and changes can be made to the present invention without departing from the spirit and scope of the invention as set forth in the following claims.

210: word extraction step 220: calculation step
230: classification step
310: appearance frequency calculation step 320: recent appearance frequency calculation step
330: time information calculation step
710: word extractor 720: calculator
730: classifier
810: appearance frequency calculator 820: recent appearance frequency calculator
830: time information calculator

Claims

A method of classifying the category of a blog using time information, the method comprising:
a word extraction step of extracting words used in the blog from documents in the blog;
a calculation step of calculating a comprehensive time information value representing the temporal distribution of word appearances by distributing, on a time axis, the frequency with which the extracted words appear in the documents in the blog; and
a classification step of classifying the category of the blog according to the comprehensive time information values of the words.

The method of claim 1, wherein the calculation step comprises:
an appearance frequency calculation step of calculating the appearance frequency of the extracted words;
a recent appearance frequency calculation step of calculating a recent appearance frequency of the words in consideration of the appearance frequency, with the current time as the reference point; and
a time information calculation step of calculating the comprehensive time information value by performing an arithmetic operation based on the recent appearance frequency and a steadiness value indicating how evenly the words are distributed on the time axis.

The method of claim 2, wherein, in the appearance frequency calculation step,
the appearance frequency is calculated by multiplying a word frequency, which is the number of times the word is used in one document in the blog, by a document frequency, which is the number of documents in the blog divided by the number of documents containing the word.

The method of claim 2, wherein, in the recent appearance frequency calculation step,
the recent appearance frequency is calculated by giving weight to the words most recently mentioned in the blog.

The method of claim 2, wherein, in the recent appearance frequency calculation step,
the recent appearance frequency is calculated through an expression in which tf_i × idf_i is the appearance frequency, tf_i is the word frequency, idf_i is the document frequency, current time is the current date, time of d is the date on which the word appeared, and D is the total number of documents in which the word appears.

The method of claim 2, wherein, in the time information calculation step,
the steadiness value is a time variance value obtained by dividing the difference between the appearance date of the word and the middle date of the entire period in which the word appears by the total number of times the word appears.

The method of claim 5, wherein, in the time information calculation step,
the comprehensive time information value is calculated as the product of the recent appearance frequency and the steadiness value.

An apparatus for classifying the category of a blog using time information, the apparatus comprising:
a word extractor that extracts words used in the blog from documents in the blog;
a calculator that calculates a comprehensive time information value representing the temporal distribution of word appearances by distributing, on a time axis, the frequency with which the extracted words appear in the documents in the blog; and
a classifier that classifies the category of the blog according to the comprehensive time information values of the words.

The apparatus of claim 8, wherein the calculator comprises:
an appearance frequency calculator that calculates the appearance frequency of the extracted words;
a recent appearance frequency calculator that calculates a recent appearance frequency of the words in consideration of the appearance frequency, with the current time as the reference point; and
a time information calculator that calculates the comprehensive time information value of the words by performing an arithmetic operation based on the recent appearance frequency and a steadiness value indicating how evenly the words are distributed on the time axis.

The apparatus of claim 9, wherein the appearance frequency calculator
calculates the appearance frequency by multiplying a word frequency, which is the number of times the word is used in one document in the blog, by a document frequency, which is the number of documents in the blog divided by the number of documents containing the word.

The apparatus of claim 9, wherein the recent appearance frequency calculator
calculates the recent appearance frequency by giving weight to the words most recently mentioned in the blog.

The apparatus of claim 9, wherein the recent appearance frequency calculator
calculates the recent appearance frequency through an expression in which tf_i × idf_i is the appearance frequency, tf_i is the word frequency, idf_i is the document frequency, current time is the current date, time of d is the date on which the word appeared, and D is the total number of documents in which the word appears.

The apparatus of claim 9,
wherein the steadiness value is a time variance value obtained by dividing the difference between the appearance date of the word and the middle date of the entire period in which the word appears by the total number of times the word appears.

The apparatus of claim 13, wherein the time information calculator
calculates the comprehensive time information value as the product of the recent appearance frequency and the steadiness value.