KR101218575B1

KR101218575B1 - Trackback spam detection system and method thereof

Info

Publication number: KR101218575B1
Application number: KR1020100122448A
Authority: KR
Inventors: 최중민; 김태환; 전혁수
Original assignee: 한양대학교 에리카산학협력단
Priority date: 2010-12-03
Filing date: 2010-12-03
Publication date: 2013-01-21
Also published as: KR20120061239A

Abstract

The invention relates to a system and method for detecting trackback spam in personal media. The trackback spam detection system comprises: a web page input unit that receives web pages relating to the personal media, where the web pages include an original page posted on the personal media and further include at least one of a trackback page linked to the original page and a link page reached through an out-link of the trackback page; a similarity measuring unit that calculates the similarity between the web pages; and a spam determination unit that compares the similarity with a preset threshold to determine whether the trackback page is spam. Such a system can use network resources efficiently, improve search engine performance, and raise the reliability of posts by delivering correct information to visitors of the personal media.

Description

Trackback spam detection system and method thereof

The present invention relates to a system and method for detecting trackback spam in personal media.

With the recent advent of Web 2.0, the World Wide Web has been evolving toward personal media. In particular, the weblog, or blog, is a form of personal media with an Internet bulletin-board format, widely used not only as a means of expressing personal opinions but also for corporate publicity. Personal media such as blogs have spread widely because comment and trackback functions let visitors, not only the owner, post their opinions.

A comment lets a visitor read a post and write a brief opinion or thought. A trackback means that a visitor who considers his or her own post to be related to another person's post links that post to the other person's post. The original post that receives the trackback then shows the address of the trackback sender's post page together with a summary of that page.

Comments do not allow long text or HTML tags, so they are limited to plain text. Trackbacks, by contrast, let visitors express their opinions more freely and create links between personal media, forming a network through which they can communicate.

However, as personal media culture has grown, the trackback, an open function that anyone can use, is being abused by spammers. Spammers create spam pages for their own profit and draw users to pages they do not want, which wastes network resources and degrades search engine performance. Trackback spam also delivers wrong information to visitors of the personal media and lowers trust in the post.

Trackback spam does not aim to rank high in search engine results. Instead, it attaches itself to popular posts to lure users to the spammer's own post, and then uses unnecessary links to lead those users on to unrelated advertising pages. Because of this, techniques for detecting general web spam are difficult to apply, and a new detection technique is needed.

The present invention aims to provide a trackback spam detection system and method that exploit the nature of a trackback, in which an author links his or her own post to another person's post because the two are believed to be related, and that find trackback spam using content similarity and co-occurrence information between the original page and the trackback page and/or the pages reached through the trackback page's out-links.

The present invention also aims to provide a trackback spam detection system and method that can use network resources efficiently, improve search engine performance, and raise the reliability of posts by delivering correct information to visitors of personal media.

Other objects of the present invention will be readily understood from the following description.

According to one aspect of the present invention, there is provided a system for detecting trackback spam in personal media, the system comprising: a web page input unit that receives web pages relating to the personal media, where the web pages include an original page posted on the personal media and further include at least one of a trackback page linked to the original page and a link page reached through an out-link of the trackback page; a similarity measuring unit that calculates the similarity between the web pages; and a spam determination unit that compares the similarity with a preset threshold to determine whether the trackback page is spam.

The similarity measuring unit may represent each web page as a vector of words and calculate, as the similarity between web pages, either the Euclidean distance between the two page vectors or the cosine of the angle between them in the vector space.

The similarity measuring unit may include a keyword extraction module that extracts keywords from the document corresponding to each web page and generates the word set of the document, and a similarity calculation module that uses the word sets to build a word-document matrix whose rows represent words and whose columns represent documents, maps it to a lower-dimensional matrix according to the LSA (Latent Semantic Analysis) algorithm, and calculates the similarity as the inner product of the column vectors corresponding to the documents.

Each value in the word-document matrix may be a weight obtained with the following equation.

$$w_{ij} = tf_{ij} \times \log\frac{N}{df_j}$$

Here, $w_{ij}$ is the weight of word $j$ in document $i$, $tf_{ij}$ is the number of times word $j$ appears in document $i$, $df_j$ is the number of documents in which word $j$ appears, and $N$ is the total number of documents.

The similarity calculation module may decompose the word-document matrix $A$, an $m \times n$ matrix, as

$$A = T S D^{T}$$

where $T$ is an $m \times n$ matrix representing the row components of $A$, $D$ is an $n \times n$ matrix representing the column components of $A$, and $S$ is an $n \times n$ matrix whose diagonal holds the singular values $w_i$ of $A$ (called eigenvalues in the original text) in descending order. After finding the $k$ that satisfies the threshold condition (Equation 6 in the detailed description), the module sets $w_i = 0$ for $i > k$ to obtain the modified matrix $S'$, forms the low-dimensional matrix $A' = T S' D^{T}$, and calculates the similarity as the inner product of column vectors.

According to another aspect of the present invention, there are provided a method for detecting trackback spam in personal media and a recording medium on which a program for performing the method is recorded.

A trackback spam detection method according to one embodiment may include: receiving web pages relating to the personal media, where the web pages include an original page posted on the personal media and further include at least one of a trackback page linked to the original page and a link page reached through an out-link of the trackback page; calculating the similarity between the web pages; and comparing the similarity with a preset threshold to determine whether the trackback page is spam.

The similarity calculation step may represent each web page as a vector of words and calculate, as the similarity between web pages, either the Euclidean distance between the two page vectors or the cosine of the angle between them in the vector space.

The similarity calculation step may include: (a) extracting keywords from the document corresponding to each web page and generating the word set of the document; and (b) using the word sets to build a word-document matrix whose rows represent words and whose columns represent documents, mapping it to a lower-dimensional matrix according to the LSA (Latent Semantic Analysis) algorithm, and calculating the similarity as the inner product of the column vectors corresponding to the documents.

Step (b) may include: (b-1) decomposing the word-document matrix $A$, an $m \times n$ matrix, as

$$A = T S D^{T}$$

where $T$ is an $m \times n$ matrix representing the row components of $A$, $D$ is an $n \times n$ matrix representing the column components of $A$, and $S$ is an $n \times n$ matrix whose diagonal holds the singular values $w_i$ of $A$ (called eigenvalues in the original text) in descending order; (b-2) finding the $k$ that satisfies the threshold condition (Equation 6 in the detailed description), setting $w_i = 0$ for $i > k$ to obtain the modified matrix $S'$, and forming the low-dimensional matrix $A' = T S' D^{T}$; and (b-3) calculating the similarity as the inner product of column vectors.

Other aspects, features, and advantages beyond those described above will become apparent from the following drawings, claims, and detailed description of the invention.

According to an embodiment of the present invention, trackback spam can be found by exploiting the nature of a trackback, in which an author links his or her own post to another person's post because the two are believed to be related, using content similarity and co-occurrence information between the original page and the trackback page and/or the pages reached through the trackback page's out-links.

In addition, network resources can be used efficiently, search engine performance can be improved, and the reliability of posts can be raised by delivering correct information to visitors of personal media.

FIG. 1 is a block diagram showing a schematic configuration of a trackback spam detection system according to an embodiment of the present invention;
FIG. 2 is a diagram showing the structure of a trackback spam detection system according to an embodiment of the present invention;
FIG. 3 is a block diagram showing a schematic configuration of a similarity measuring unit according to an embodiment of the present invention;
FIG. 4 is a diagram showing the matrix A decomposed by the SVD technique during similarity measurement according to an embodiment of the present invention;
FIG. 5 is a flowchart of a trackback spam detection method according to an embodiment of the present invention;
FIG. 6 is a flowchart of a similarity measurement method according to an embodiment of the present invention;
FIG. 7 is a flowchart of a trackback spam detection method according to another embodiment of the present invention;
FIG. 8A is an example of an original page;
FIG. 8B is an example of a trackback page that is not spam;
FIG. 8C is an example of a trackback page that is spam.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the invention to particular embodiments, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope. In describing the invention, detailed descriptions of related known techniques are omitted where they would obscure the gist of the invention.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

The terminology used in this specification is used only to describe particular embodiments and is not intended to limit the invention. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Terms such as "... unit" and "... module" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Likewise, in describing the present invention, detailed descriptions of related known techniques are omitted where they would unnecessarily obscure the gist of the invention.

In this specification, a post written by the owner of any personal media that a visitor visits is called the original page, and the visitor's own post that the visitor links to the owner's post is called the trackback page. A web page reached through an out-link of the trackback page, which leads users to a particular site (for example, an advertising site intended by a spammer), is called a link page. The original page may be, for example, a blog post, a magazine article, or a news article.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

FIG. 1 is a block diagram showing a schematic configuration of a trackback spam detection system according to an embodiment of the present invention, and FIG. 2 is a diagram showing the structure of the trackback spam detection system according to an embodiment of the present invention.

The trackback spam detection system 100 according to this embodiment detects trackback spam by exploiting the nature of a trackback: a visitor links his or her own post (the trackback page) to the post of the owner of some personal media (the original page) because the visitor considers the two posts to be related.

Referring to FIG. 1, the trackback spam detection system 100 includes a web page input unit 110, a similarity measuring unit 120, and a spam determination unit 130.

The web page input unit 110 receives the trackback page to be checked for trackback spam and the original page that is the target of that trackback page. It may additionally receive a link page reached through an out-link of the trackback page (see 10 in FIG. 2).

The similarity measuring unit 120 measures the similarity between the documents corresponding to the web pages received through the web page input unit 110 (see 20 in FIG. 2). It may measure the similarity between the original page and the trackback page (the first similarity) and/or the similarity between the original page and a link page of the trackback page (the second similarity). The similarity measuring unit 120 is described in detail below with reference to FIG. 3 and the subsequent drawings.

The spam determination unit 130 evaluates the similarity measured by the similarity measuring unit 120 to determine whether the trackback page is a spam trackback or a ham (normal) trackback (see 30 in FIG. 2).

Spam trackbacks are generally generated automatically, because creating them by hand takes a lot of effort. As a result, their text is meaningless, disordered, or incoherent, and such a trackback page has markedly low similarity to its target page, that is, the original page. In addition, because spam trackbacks are created for advertising, the link pages reached through their out-links also have markedly low similarity to the original page.

Accordingly, when the similarity measured by the similarity measuring unit 120 (the first similarity and/or the second similarity) is less than or equal to a threshold, the spam determination unit 130 may classify the trackback page as a spam page. The threshold used for the spam decision may be determined experimentally.

FIG. 3 is a block diagram showing a schematic configuration of a similarity measuring unit according to an embodiment of the present invention, and FIG. 4 is a diagram showing the matrix A decomposed by the SVD technique during similarity measurement according to an embodiment of the present invention.

The similarity measuring unit 120 according to this embodiment includes a keyword extraction module 122 and a similarity calculation module 124.

The keyword extraction module 122 extracts keywords that can represent each web page.

First, the HTML tags of each web page are removed and the nouns in the page are extracted through morphological analysis. Nouns are suitable as keywords representing a web page because they make up the subject and/or object that represent a sentence. Next, to put the extracted nouns into a more general form, stopwords are removed and stemming is performed to reduce words to their roots. The set of keywords extracted in this process is taken as the word set of the web page.
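The extraction steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it handles English text only, it skips the noun selection by morphological analysis (every alphabetic token is kept), and the stopword list `STOPWORDS`, the suffix rule in `naive_stem`, and the function name `extract_keywords` are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are", "for", "on"}


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_keywords(html):
    """Return the page's keyword tokens, with repeats kept for later counting."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    tokens = re.findall(r"[a-z]+", text)  # assumption: English-only tokens
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]
```

A real system for Korean pages would replace the regular expression and `naive_stem` with a morphological analyzer that returns nouns; repeats are kept here so that term frequencies can still be counted downstream.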

The similarity calculation module 124 calculates the similarity between two web pages using their word sets, that is, the sets of keywords extracted by the keyword extraction module 122. Either the VSM (Vector Space Model) algorithm or the LSA (Latent Semantic Analysis) algorithm may be applied in the similarity calculation module 124 to compare two web pages.

The VSM algorithm represents each document as a vector of words and computes the similarity between documents mathematically.

[Equation 1]

$$d_i = (w_{i1}, w_{i2}, \ldots, w_{im})$$

$w_{ij}$ may be set to 1 if the $i$-th document (document $i$) contains the $j$-th word (word $j$) and 0 otherwise, or it may be the integer frequency with which the word appears in the document. In current information retrieval practice, the weight computed by tf·idf, the method of Equation 2 below, is used.

[Equation 2]

$$w_{ij} = tf_{ij} \times \log\frac{N}{df_j}$$

$w_{ij}$ is the weight of word $j$ in document $i$, $tf_{ij}$ is the number of times word $j$ appears in document $i$, $df_j$ is the number of documents in which word $j$ appears, and $N$ is the total number of documents.
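As a rough illustration of the tf·idf weighting, the sketch below builds one weight vector per document from token lists. It assumes the common form $w_{ij} = tf_{ij} \cdot \log(N / df_j)$ with a natural logarithm; the exact form and log base used in the patent are not confirmed by the surviving text, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of weight vectors)."""
    n_docs = len(docs)  # N: total number of documents
    vocab = sorted({t for doc in docs for t in doc})
    # df_j: number of documents that contain word j
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: occurrences of word j in document i
        # Assumed weighting: tf * ln(N / df)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors
```

Note that a word appearing in every document receives weight 0 under this form, so with only two documents the shared vocabulary contributes nothing; a practical system would compute $df_j$ over a larger collection.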

The similarity between documents $d_1$ and $d_2$ represented as vectors can be obtained either from the Euclidean distance between the two vectors or, as in Equation 3 below, from the cosine similarity, in which the similarity is the cosine of the angle between the vectors in the vector space.

[Equation 3]

$$\mathrm{sim}(d_1, d_2) = \cos\theta = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$$
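Both VSM measures are short to write down. The sketch below is a plain-Python illustration; the function names are hypothetical, and returning 0 for a zero vector is an assumption made here to avoid division by zero, not something the source specifies.

```python
import math


def euclidean_distance(u, v):
    """Distance between two document vectors; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_u * norm_v)
```

The two measures point in opposite directions: a larger cosine means more similar pages, while a larger Euclidean distance means less similar pages, so a threshold comparison has to be oriented accordingly.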

The LSA algorithm, conceptually, uses co-occurrence information to measure the similarity between web pages from the meaning of words as well as their surface form. For example, the Korean word '사과' (sagwa) can appear in a sentence together with '나무' (tree) or '용서' (forgiveness), and its meaning differs between the two cases. Alongside 'tree' it means the fruit, an apple; alongside 'forgiveness' it means an apology, that is, asking forgiveness for a wrong.

The LSA algorithm uses the SVD (Singular Value Decomposition) technique to map a high-dimensional word-document frequency matrix into a low-dimensional semantic space, from which the associations between words and documents can be obtained.

According to SVD, an $m \times n$ word-document frequency matrix $A$ ($m$: number of words, $n$: number of documents, $m > n$) can be decomposed into three matrices as in Equation 4 below.

[Equation 4]

$$A = T S D^{T}$$

$T$ is an $m \times n$ matrix representing the row components (words) of $A$, and $D$ is an $n \times n$ matrix representing the column components (documents) of $A$. Both matrices are orthogonal in the sense that their columns are orthonormal. $S$ is an $n \times n$ diagonal matrix whose diagonal holds the singular values of $A$ (called eigenvalues in the original text) in descending order. FIG. 4 illustrates the SVD result for the word-document matrix $A$.

By ignoring the insignificant singular values in the diagonal matrix $S$, a matrix $A'$ similar to $A$ can be constructed. $A'$ is the matrix reduced to $k$ dimensions, retaining only the $k$ largest singular values.

The matrix $A'$ produced through SVD can be used to compare two documents. Because each column of $A'$ is a vector characterizing a document, the similarity between two documents is calculated as the inner product of the corresponding two column vectors of $A'$.

Multiplying $A'^{T}$ by $A'$ gives the document-to-document similarity matrix $A'^{T} A' = (D S'^{T} T^{T})(T S' D^{T})$, and because the columns of $T$ are orthonormal, $T^{T} T$ is the identity matrix $I$. The similarity of documents $i$ and $j$ therefore appears in cell $(i, j)$ of the document similarity matrix in Equation 5 below.

[Equation 5]

$$A'^{T} A' = D S'^{T} S' D^{T} = D S'^{2} D^{T}$$
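For reference, the algebra behind the document similarity matrix can be written out step by step. The derivation below is a worked sketch that assumes the thin SVD convention used in the text ($T^{T}T = I$, $S'$ diagonal with entries $w_1, \ldots, w_k, 0, \ldots, 0$); the entrywise notation $D_{il}$ is introduced here for illustration.

```latex
\begin{aligned}
A'^{T} A' &= (T S' D^{T})^{T} (T S' D^{T}) \\
          &= D\, S'^{T}\, (T^{T} T)\, S'\, D^{T} \\
          &= D\, S'^{T} S'\, D^{T} \qquad (T^{T} T = I) \\
          &= D\, S'^{2}\, D^{T} \qquad (S' \text{ is diagonal}) \\[4pt]
\bigl(A'^{T} A'\bigr)_{ij} &= \sum_{l=1}^{k} w_l^{2}\, D_{il}\, D_{jl}
\end{aligned}
```

The last line shows that only the $k$ retained singular values contribute, and that the similarity of documents $i$ and $j$ depends only on rows $i$ and $j$ of $D$, so $T$ never needs to be multiplied out to compare documents.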

Referring again to FIG. 3, the similarity calculation module 124 may calculate the similarity using the Euclidean distance or the cosine similarity of Equation 3 when the VSM algorithm is used, or according to Equation 5 when the LSA algorithm is used.

The trackback spam detection system has been described above; the trackback spam detection method is described in detail below with reference to FIG. 5 and the subsequent drawings.

FIG. 5 is a flowchart of a trackback spam detection method according to an embodiment of the present invention, and FIG. 6 is a flowchart of a similarity measurement method according to an embodiment of the present invention. Each step of the trackback spam detection method according to an embodiment of the present invention may be performed by the corresponding component of the trackback spam detection system 100 shown in FIG. 1.

The similarity measuring unit 120 measures the similarity between the trackback page to be checked for trackback spam, received through the web page input unit 110, and the original page that is its target (step S200).

Referring to FIG. 6, to measure the similarity, the keyword extraction module 122 first extracts from each of the original page and the trackback page the keywords that can represent the document, and generates the set of extracted keywords as the word set of that document (step S202).

As described above, keywords may be extracted by removing the HTML tags of each web page and extracting nouns through morphological analysis. Stopword removal and stemming may additionally be applied to put the nouns into a more general form.

이후 유사도 계산 모듈(124)은 각 문서의 단어 집합을 이용하여 유사도를 계산한다(단계 S204). 유사도를 계산함에 있어서 전술한 것과 같이 VSM 알고리즘을 이용하거나 LSA 알고리즘을 이용할 수 있다. Thereafter, the similarity calculation module 124 calculates the similarity using the word set of each document (step S204). In calculating the similarity, the VSM algorithm or the LSA algorithm may be used as described above.

우선 VSM 알고리즘을 이용하는 경우에는 수학식 1과 같이 나타내어지는 각 문서의 벡터 d₁, d₂에 대하여 유클리드 거리를 계산하거나 수학식 3에 나타난 것과 같이 코사인 유사도 방법을 이용함으로써 원본 페이지와 트랙백 페이지 간의 유사도를 계산할 수 있다. First, when using the VSM algorithm, the similarity between the original page and the trackback page is calculated by calculating Euclidean distance for the vectors d ₁ and d ₂ of each document represented by Equation 1, or by using a cosine similarity method as shown in Equation 3. Can be calculated.

Next, when the LSA algorithm is used, a word-document matrix A can be constructed from the word sets generated by the keyword extraction module 122 (step S205). Each entry of the matrix may be the weight obtained according to Equation 2 above.
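Constructing the word-document matrix can be sketched as below. Equation 2 is not reproduced in this passage, so the weight used here, tf · log(N / df), is an assumed standard TF-IDF form built from the quantities the claims name (term frequency, document frequency, and total number of documents); the function name build_term_document_matrix is likewise hypothetical.

```python
import math
from collections import Counter


def build_term_document_matrix(docs):
    """Return (vocab, matrix) with one row per word and one column per document.

    docs is a list of word lists. Each entry is tf * log(N / df), an assumed
    TF-IDF weighting that stands in for Equation 2.
    """
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: number of documents containing each word.
    df = {w: sum(1 for c in counts if w in c) for w in vocab}
    matrix = [
        [c[w] * math.log(n_docs / df[w]) for c in counts]
        for w in vocab
    ]
    return vocab, matrix
```

Under this assumed form, a word that occurs in every document receives weight zero, so with very few documents most of the signal comes from words that are not shared by all of them.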

A lower-dimensional matrix A’ is then obtained from the word-document matrix A (step S207).

To this end, the given matrix A is first decomposed into TSD^T through the SVD process (see Equation 4). Then k satisfying Equation 6 below is obtained.

[Equation 6]

Here, θ is a preset threshold and may, for example, take a value between 0 and 1.

After k is obtained, the modified matrix S’ is obtained by setting w_i = 0 for i = k+1, k+2, …, n.

The new matrix A’ = TS’D^T is then computed (see Equation 4), and the values of A’ are normalized to lie between 0 and 1.

The similarity between the two documents is then calculated by taking the dot product of the column vectors of A’ (see Equation 5) (step S209).
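The LSA steps S205 to S209 can be illustrated with a minimal NumPy sketch. Several points are assumptions rather than the patented method: Equation 6 is not reproduced in this text, so k is chosen here as the smallest number of leading singular values whose sum reaches a fraction θ of the total; the normalization to [0, 1] is taken to be min-max scaling over the whole matrix; and the function name lsa_similarity is hypothetical.

```python
import numpy as np


def lsa_similarity(A, theta=0.9):
    """Return (S, k): S[i, j] is the dot product of columns i and j of A'."""
    A = np.asarray(A, dtype=float)
    # SVD: A = T @ diag(w) @ Dt, with w sorted in descending order.
    T, w, Dt = np.linalg.svd(A, full_matrices=False)
    total = w.sum()
    if total == 0:
        n = A.shape[1]
        return np.zeros((n, n)), 0
    # ASSUMED stand-in for Equation 6: smallest k whose leading singular
    # values account for at least a fraction theta of their total.
    ratio = np.cumsum(w) / total
    k = min(int(np.searchsorted(ratio, theta)) + 1, len(w))
    # Set w_i = 0 for i > k to obtain the modified S'.
    w_reduced = w.copy()
    w_reduced[k:] = 0.0
    A_reduced = T @ np.diag(w_reduced) @ Dt  # A' = T S' D^T
    # ASSUMED normalization: min-max scaling of all entries into [0, 1].
    lo, hi = A_reduced.min(), A_reduced.max()
    if hi > lo:
        A_reduced = (A_reduced - lo) / (hi - lo)
    # Column-vector dot products give the document-by-document similarities.
    return A_reduced.T @ A_reduced, k
```

For example, with columns ordered as (original page, trackback page, unrelated page), the entry S[0, 1] would be the value compared against the threshold for the trackback page.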

Referring again to FIG. 5, the similarity measured between the original page and the trackback page is evaluated (step S210). That is, it is determined whether the calculated similarity is greater than a threshold c determined through experiments.

If the similarity is greater than the threshold c, the trackback page is determined not to be a spam page, that is, to be normal (step S220); if the similarity is less than or equal to the threshold c, the trackback page is determined to be a spam page (step S230).
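The decision rule of steps S210 to S230 reduces to a single comparison, sketched below. The function name is_spam is hypothetical, and the threshold passed in is whatever value of c has been determined experimentally; no particular value is implied here.

```python
def is_spam(similarity, threshold):
    """Return True when the trackback page is judged to be spam.

    A similarity strictly greater than the threshold means normal (S220);
    a similarity less than or equal to the threshold means spam (S230).
    """
    return similarity <= threshold
```

Note that a similarity exactly equal to the threshold is classified as spam, following the "less than or equal to" wording of step S230.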

Through the process described above, the similarity between the original page and the trackback page is used to determine whether the trackback page is spam, so that trackback spam can be detected.

FIG. 7 is a flowchart of a trackback spam detection method according to another embodiment of the present invention. Each step of the trackback spam detection method according to this other embodiment may be performed by the corresponding component of the trackback spam detection system 100 shown in FIG. 1.

According to the trackback spam detection method of this other embodiment, the original page is compared with the link page for an out-link of the trackback page, and their similarity is calculated to determine whether the trackback page is spam.

Compared with the trackback spam detection method according to the embodiment shown in FIG. 5, the only difference is that, in step S300, the page whose similarity to the original page is calculated is not the trackback page itself but a link page of the trackback page. The similarity measurement method and the spam determination method otherwise apply in the same way.

That is, the similarity measurement method shown in FIG. 6 may be performed on the original page and the link page to measure the similarity between them (step S300).

The similarity measured between the original page and the link page is then evaluated (step S310). That is, it is determined whether the calculated similarity is greater than a threshold c determined through experiments.

If the similarity is greater than the threshold c, the trackback page that has the link page as an out-link is determined not to be a spam page, that is, to be normal (step S320); if the similarity is less than or equal to the threshold c, the trackback page that has the link page as an out-link is determined to be a spam page (step S330).

Through the process described above, the similarity between the original page and the link page is used to determine whether the trackback page that has the link page as an out-link is spam, so that trackback spam can be detected.

According to yet another embodiment of the present invention, whether the trackback page is spam may be determined using both the similarity between the original page and the trackback page and the similarity between the original page and the link page.

In this case, the value of cell (i,j) of the inter-document similarity matrix according to Equation 6 represents the similarity between document i and document j, and Equation 7 below is used to express the similarities among the documents as a single value.

[Equation 7]

Here, o denotes the target page, that is, the original page; t denotes the trackback page; and l_i denotes an out-link page of the trackback. n is the number of out-links, and sim(a,b) denotes the LSA document similarity between document a and document b. α is a damping factor.

It is apparent that the trackback spam detection method described above may be performed as an automated procedure, in time-series order, by a software program or the like embedded in the trackback spam detection system 100. The codes and code segments constituting the program can be readily inferred by a computer programmer skilled in the art. The program is stored in a computer-readable information storage medium and implements the method by being read and executed by a computer. The information storage medium includes magnetic recording media, optical recording media, and carrier wave media.

FIG. 8A shows an example of an original page, FIG. 8B shows an example of a trackback page that is not spam, and FIG. 8C shows an example of a trackback page that is spam.

When the similarity between the original page illustrated in FIG. 8A (a web page about mobile augmented reality) and the trackback page illustrated in FIG. 8B (a web page about the growth of augmented reality) was calculated by the method described above, the value was about 0.9. This indicates high similarity and confirmed that the trackback page is not spam.

When the similarity between the original page illustrated in FIG. 8A and the trackback page illustrated in FIG. 8C (a web page unrelated to augmented reality) was calculated, the value was about 0.3. This indicates low similarity and confirmed that the trackback page is spam.

That is, by using the trackback spam detection system and method according to the present invention, the trackbacks of personal media that constitute spam can be detected and then removed. This makes it possible to use network resources efficiently, improve the performance of search engines, and increase the reliability of posts by delivering correct information to visitors of the personal media.

Although the present invention has been described above with reference to its embodiments, those of ordinary skill in the art will understand that the invention may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

100: Trackback spam detection system
110: Web page input unit
120: Similarity measuring unit
130: Spam determination unit
122: Keyword extraction module
124: Similarity calculation module

Claims

A system for detecting trackback spam in personal media, comprising:
a web page input unit for receiving web pages relating to the personal media, wherein the web pages include an original page posted on the personal media and further include at least one of a trackback page linked to the original page and a link page for an out-link of the trackback page;
a similarity measuring unit for calculating a similarity between the web pages; and
a spam determination unit for determining whether the trackback page is spam by comparing the similarity with a preset threshold,
wherein the similarity measuring unit calculates the similarity between the original page and the trackback page by expressing the words of the original page and of the trackback page each in vector form.

The system of claim 1,
wherein the similarity measuring unit calculates, as the similarity, the Euclidean distance between the two vectors of the original page and the trackback page, or the cosine of the angle the two vectors form in the vector space.

A system for detecting trackback spam in personal media, comprising:
a web page input unit for receiving web pages relating to the personal media, wherein the web pages include an original page posted on the personal media and further include at least one of a trackback page linked to the original page and a link page for an out-link of the trackback page;
a similarity measuring unit for calculating a similarity between the web pages; and
a spam determination unit for determining whether the trackback page is spam by comparing the similarity with a preset threshold,
wherein the similarity measuring unit comprises:
a keyword extraction module for extracting keywords from a document corresponding to each web page and generating a word set of the document; and
a similarity calculation module for constructing, using the word sets, a word-document matrix in which the rows represent words and the columns represent documents, and calculating the similarity according to a Latent Semantic Analysis (LSA) algorithm.

The system of claim 3,
wherein the similarity calculation module maps the word-document matrix to a lower-dimensional matrix according to the LSA algorithm and then calculates the similarity by taking the dot product of the column vectors corresponding to the documents, and
wherein each value in the word-document matrix is a weight obtained using the following equation:

where w_ij denotes the weight of word j in document i, tf_ij denotes the number of times word j appears in document i, df_j denotes the number of documents in which word j appears, and N denotes the total number of documents.

The system of claim 4,
wherein the similarity calculation module
decomposes the word-document matrix A, which is an m x n matrix, using a matrix T, which is an m x n matrix representing the row components of A, a matrix D, which is an n x n matrix representing the column components of A, and a matrix S, which is an n x n matrix whose diagonal holds the eigenvalues of A sorted in descending order, as

A = TSD^T,

obtains k satisfying

and then sets w_i = 0 to form a modified matrix S’, uses S’ to obtain the lower-dimensional matrix A’ = TS’D^T,
and calculates the similarity by taking the dot product of the column vectors.

A method for detecting trackback spam in personal media, comprising:
receiving web pages relating to the personal media, wherein the web pages include an original page posted on the personal media and further include at least one of a trackback page linked to the original page and a link page for an out-link of the trackback page;
calculating a similarity between the web pages; and
determining whether the trackback page is spam by comparing the similarity with a preset threshold,
wherein calculating the similarity comprises expressing the words of the original page and of the trackback page each in vector form and calculating the similarity between the original page and the trackback page.

The method of claim 6,
wherein calculating the similarity comprises calculating, as the similarity, the Euclidean distance between the two vectors of the original page and the trackback page, or the cosine of the angle the two vectors form in the vector space.

A method for detecting trackback spam in personal media, comprising:
receiving web pages relating to the personal media, wherein the web pages include an original page posted on the personal media and further include at least one of a trackback page linked to the original page and a link page for an out-link of the trackback page;
calculating a similarity between the web pages; and
determining whether the trackback page is spam by comparing the similarity with a preset threshold,
wherein calculating the similarity comprises:
(a) extracting keywords from a document corresponding to each web page and generating a word set of the document; and
(b) constructing, using the word sets, a word-document matrix in which the rows represent words and the columns represent documents, and calculating the similarity according to a Latent Semantic Analysis (LSA) algorithm,
thereby providing a trackback spam prevention method.

The method of claim 8,
wherein step (b) maps the word-document matrix to a lower-dimensional matrix according to the LSA algorithm and then calculates the similarity by taking the dot product of the column vectors corresponding to the documents, and
wherein each value in the word-document matrix is a weight obtained using the following equation:

where w_ij denotes the weight of word j in document i, tf_ij denotes the number of times word j appears in document i, df_j denotes the number of documents in which word j appears, and N denotes the total number of documents.

The method of claim 9,
wherein step (b) comprises:
(b-1) decomposing the word-document matrix A, which is an m x n matrix, using a matrix T, which is an m x n matrix representing the row components of A, a matrix D, which is an n x n matrix representing the column components of A, and a matrix S, which is an n x n matrix whose diagonal holds the eigenvalues of A sorted in descending order, as

A = TSD^T;

(b-2) obtaining k satisfying

and then setting w_i = 0 to form a modified matrix S’ and using S’ to obtain the lower-dimensional matrix A’ = TS’D^T; and
(b-3) calculating the similarity by taking the dot product of the column vectors.

A computer-readable recording medium tangibly embodying a program of instructions executable by a computer to perform the trackback spam detection method according to any one of claims 6 to 10.