KR20100092145A

KR20100092145A - System and method for search modeling using relation dictionary

Info

Publication number: KR20100092145A
Application number: KR1020090011371A
Authority: KR
Inventors: 최지훈; 김광현
Original assignee: 엔에이치엔(주)
Priority date: 2009-02-12
Filing date: 2009-02-12
Publication date: 2010-08-20
Also published as: JP2010186474A; JP5450135B2; KR100994349B1

Abstract

PURPOSE: A system and a method for search modeling using a relation dictionary are provided to prevent site abuse through queries by refining the queries for a site based on a relation dictionary composed of keywords that indicate the characteristics of the site. CONSTITUTION: A data collection unit (202) collects queries for a site and the click frequency of each query. A component determination unit (203) determines whether a query or the title of the site exists in the relation dictionary for the site. A site indexing unit (204) adjusts the index term weight applied to the query or the title according to the result of that determination and indexes the site. The relation dictionary consists of keywords that are derived from the directory structure and the anchor text structure of the site and that are highly related to the site.

Description

Search modeling system and method using a relation dictionary {SYSTEM AND METHOD FOR SEARCH MODELING USING RELATION DICTIONARY}

One embodiment of the present invention relates to search modeling and, more particularly, to search modeling that reflects the searcher's intent and eliminates site abuse.

A conventional search system first determined whether the query entered by a user exactly matched the title of a site, and provided the matching site to the user. However, when a keyword of interest to users is used as the title of a site, the site matching the query may be provided even though it is entirely unrelated to the user's search intent.

In addition, when a specific site contains web pages about various products, that site may be retrieved if a user enters the name of any one of those products as a query. In site search, however, the representativeness of a site matters, so an official site, such as the site of the product's representative manufacturer, should be retrieved. In practice, shopping malls selling products of various brands were retrieved instead, which resulted in site abuse.

Finally, a conventional search system exposed sites sorted by popularity, and popularity was often determined by the click frequency of a site. When popularity is determined by click frequency, a site administrator can inflate the click frequency through fraudulent clicks and thereby lock in a high rank, which is a form of abuse.

The present invention can provide a search modeling system and method that prevent site abuse through queries by refining the queries for a site using a relation dictionary composed of keywords that represent the characteristics of the site.

The present invention can provide a search modeling system and method that prevent site abuse through titles by using the relation dictionary to extract meaningful keywords from the title of a site and indexing the site with them.

The present invention can provide a search modeling system and method that prevent click abuse by site administrators and improve the objectivity of popularity by determining the popularity of a site from PageRank, toolbar visit frequency, and site dwell time in addition to site click frequency.

A search modeling system according to an embodiment of the present invention may include: a data collection unit that collects at least one query for a site and the click frequency of each query; a component determination unit that uses a relation dictionary for the site to determine whether the query and the title of the site exist in the relation dictionary; and a site indexing unit that indexes the site by adjusting the index term weight to be applied to the query and the title according to whether they exist in the relation dictionary.

The search modeling system according to an embodiment of the present invention may further include a relation dictionary generation unit that generates the relation dictionary for the site using the directory structure, site keywords, or anchor text structure of the site.

A search modeling method according to an embodiment of the present invention may include: collecting, by a data collection unit, at least one query for a site and the click frequency of each query; determining, by a component determination unit using a relation dictionary for the site, whether the query or the title of the site exists in the relation dictionary; and indexing the site, by a site indexing unit, by adjusting the index term weight to be applied to the query or the title according to whether it exists in the relation dictionary.

The search modeling method according to an embodiment of the present invention may further include generating, by a relation dictionary generation unit, the relation dictionary for the site using the directory structure, site keywords, or anchor text structure of the site.

According to an embodiment of the present invention, a search modeling system and method are provided that prevent site abuse through queries by refining the queries for a site using a relation dictionary composed of keywords that represent the characteristics of the site.

According to an embodiment of the present invention, a search modeling system and method are provided that prevent site abuse through titles by using the relation dictionary to extract meaningful keywords from the title of a site and indexing the site with them.

According to an embodiment of the present invention, a search modeling system and method are provided that prevent click abuse by site administrators and improve the objectivity of popularity by determining the popularity of a site from PageRank, toolbar visit frequency, and site dwell time in addition to site click frequency.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. However, the present invention is not restricted or limited by the embodiments. Like reference numerals in the drawings denote like elements.

FIG. 1 is a diagram illustrating a site search process using a search modeling system according to an embodiment of the present invention.

FIG. 1 illustrates a case in which a user (101) enters a query AB. The search modeling system (102) then retrieves the sites matching the query AB and provides them to the user (101). Here, the sites matching the query AB are assumed to be site X (103-1), site Y (103-2), and site Z (103-3).

A conventional search modeling system first determined whether the query AB exactly matched the title of a previously stored site, and provided the matching site to the user (101). However, when a keyword widely used by users is used as the title of a site, the site matching the query entered by the user (101) may be provided even though it is entirely unrelated to the search intent.

For example, suppose that the title of site X (103-1) is 'luxury goods' while the content of the site's web pages concerns 'toys'. Even if the user (101) enters the query 'luxury goods' to find a luxury shopping mall, a conventional search modeling system simply retrieves sites whose titles exactly match that query, and therefore provides the user (101) with site X (103-1), which is entirely unrelated to the user's information need for 'luxury goods'.

In addition, when a specific site contains web pages about various products, that site may be retrieved if a user enters the name of any one of those products as a query. In site search, however, the representativeness of a site matters, so it may be preferable to retrieve an official site, such as the site of the product's representative manufacturer.

For example, when a specific site handles various brands (a, b, c) and the user (101) enters the query a, a conventional search modeling system provides the user with a general shopping mall site that sells brand a, and thus fails to provide the site of the representative manufacturer of brand a or an officially recognized site.

Further, when a plurality of sites match the query AB, a conventional search modeling system exposes the sites sorted by popularity, and popularity was often determined by the click frequency of a site. When popularity is determined by click frequency, a site administrator can inflate the click frequency through fraudulent clicks and thereby lock in a high rank, which is a form of abuse.

To address these problems, the search modeling system (102) according to an embodiment of the present invention may eliminate site abuse by refining the queries and the title of a site. For example, the search modeling system (102) may refine a query or a site title using a relation dictionary based on the anchor text structure and the directory structure of the site. The search modeling system (102) may then index the site by applying a high index term weight to the query or title refined through the relation dictionary. As a result, site results can be provided that reflect the user's needs for the query actually entered by the user (101) and that prevent site abuse.

In addition, the search modeling system (102) may prevent site abuse based on click frequency by assigning a site score for popularity that reflects not only the click frequency of the site but also indicators that objectively capture how users actually interact with the site.

The overall configuration of the search modeling system (102) is described in detail with reference to FIG. 2.

FIG. 2 is a block diagram showing the overall configuration of a search modeling system according to an embodiment of the present invention.

Referring to FIG. 2, the search modeling system (102) may include a relation dictionary generation unit (201), a data collection unit (202), a component determination unit (203), a site indexing unit (204), and a popularity determination unit (205).

The relation dictionary generation unit (201) may generate a relation dictionary for a site using the directory structure, site keywords, or anchor text structure of the site. Here, the relation dictionary may mean a set of keywords that are extracted based on the directory structure and anchor text structure of the site and that are highly related to the site.

Referring to FIG. 2, the relation dictionary generation unit (201) may include a site data extraction unit (206), a keyword determination unit (207), and a list generation unit (208).

The site data extraction unit (206) may extract site data including the directory structure, site keywords, or anchor text for a site. Here, the directory structure may mean the criteria used to classify a site by topic. A site keyword may mean a keyword that a user entered when accessing the site. Anchor text may mean the text contained in the link (located on site X) that a user clicks to move from site X to site Y. Even when links lead to the same site Y, the anchor text contained in each link may differ.

The keyword determination unit (207) may determine keywords by analyzing the extracted site data. For example, when the site data is a directory structure or site keywords, the keyword determination unit (207) may determine keywords by taking into account the spaces, commas, and periods contained in the directory structure or site keywords. When the site data is anchor text, the text remaining after morphological analysis of the link's hypertext may be determined as keywords.

The list generation unit (208) may generate a list from the determined keywords. The relation dictionary may be generated by combining the lists generated in this way.

The data collection unit (202) may collect at least one query for a site and the click frequency of each query. Here, a query may mean the keyword that led a user to click and access the site.

The component determination unit (203) may use the relation dictionary for a site to determine whether a query or the title of the site exists in the relation dictionary.

For example, the component determination unit (203) may extract keywords from a query and, depending on whether the extracted keywords exist in the relation dictionary, classify the query into a query group to which an index term weight is applied.

Here, when the keywords exist in the relation dictionary, the component determination unit (203) may classify the query into a first query group to which a high index term weight is applied; when the keywords do not exist in the relation dictionary, it may classify the query into a second query group to which a low index term weight is applied. In addition, a query in which only some of the keywords exist in the relation dictionary may receive a lower index term weight than a query in which all of the keywords exist in the relation dictionary.

In this way, the component determination unit (203) may refine queries by determining whether the keywords constituting each query exist in the relation dictionary and adjusting the index term weight according to the result.
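The grouping just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `classify_query`, the match labels, and the numeric weights are hypothetical, and whitespace splitting stands in for keyword extraction.

```python
# Hypothetical weights: the text only says "high", "low", and that a
# partial match is weighted lower than a full match.
FULL_WEIGHT = 1.0
PARTIAL_WEIGHT = 0.5
LOW_WEIGHT = 0.1


def classify_query(query: str, relation_dictionary: set[str]) -> tuple[str, float]:
    """Return (match label, index term weight) for one query."""
    keywords = query.split()  # assumption: keywords are whitespace-separated
    matched = sum(1 for kw in keywords if kw in relation_dictionary)
    if keywords and matched == len(keywords):
        return ("full", FULL_WEIGHT)        # first query group
    if matched > 0:
        return ("partial", PARTIAL_WEIGHT)  # weighted lower than a full match
    return ("none", LOW_WEIGHT)             # second query group


relation_dictionary = {"used", "motorcycle", "sales", "scooter"}
for q in ("used motorcycle", "used honda", "honda"):
    print(q, classify_query(q, relation_dictionary))
```

Whether a partially matching query belongs to the first or the second group is not stated in the text, so the sketch reports it under a separate label.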

For example, the component determination unit (203) may define a click threshold for the at least one query, and determine whether the queries whose click frequency exceeds the click threshold exist in the relation dictionary.

For example, the component determination unit (203) may define the click threshold as a preset ratio of the second-highest click frequency among the queries. The component determination unit (203) thereby filters out queries whose click frequency is below the click threshold so that they are not used for site indexing, which may improve the accuracy of the site index.
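The threshold rule can be illustrated with a short sketch. The function names, the example queries, and the ratio value are hypothetical; the text specifies only that the threshold is a preset ratio of the second-highest click frequency. How ties and a single-query collection are handled is an assumption made here.

```python
def click_threshold(click_counts, ratio: float) -> float:
    """Preset ratio of the second-highest click frequency."""
    ordered = sorted(click_counts, reverse=True)
    if len(ordered) < 2:
        return 0.0  # assumption: with fewer than two queries nothing is filtered
    return ratio * ordered[1]


def filter_queries(query_clicks: dict[str, int], ratio: float) -> dict[str, int]:
    """Keep only queries whose click frequency exceeds the threshold."""
    threshold = click_threshold(query_clicks.values(), ratio)
    return {q: c for q, c in query_clicks.items() if c > threshold}


# Hypothetical <query, click frequency> pairs for one site.
clicks = {"motorcycle": 500, "used motorcycle": 200, "scooter": 30, "typo qeury": 5}
print(click_threshold(clicks.values(), ratio=0.5))
print(filter_queries(clicks, ratio=0.5))
```

Using the second-highest value rather than the maximum presumably keeps one outlier query from setting the threshold on its own, although the text does not give the rationale.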

For example, the component determination unit (203) may extract keywords from the title of a site and, depending on whether each extracted keyword exists in the relation dictionary, classify the keyword into a title group to which an index term weight is applied. Here, when a keyword exists in the relation dictionary, the component determination unit (203) may classify it into a first title group to which a high index term weight is applied; when a keyword does not exist in the relation dictionary, it may classify it into a second title group to which a low index term weight is applied.

In this way, the component determination unit (203) may refine a title by determining whether the keywords constituting the title of the site exist in the relation dictionary and adjusting the index term weight applied to them. That is, the component determination unit (203) may use the relation dictionary to extract, from the keywords constituting the title, the keywords that are meaningful for the site (the meaningful title).

The site indexing unit (204) may index a site by adjusting the index term weight to be applied to a query or title according to whether the keywords extracted from the query or from the title exist in the relation dictionary. Specifically, the site indexing unit (204) may set a high index term weight for a query that contains keywords existing in the relation dictionary, thereby raising the probability that the site is retrieved for that query. Likewise, the site indexing unit (204) may set a high index term weight for a title that contains keywords existing in the relation dictionary, thereby raising the probability that the site is retrieved when the title is entered as a query.
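One way to picture the effect of weighted indexing is the sketch below. It is an assumption-laden illustration, not the disclosed indexer: the index layout (term to site to weight), the function names, the weights, and the site "Honda official site" are all hypothetical.

```python
from collections import defaultdict

HIGH_WEIGHT = 1.0  # hypothetical values for "high" and "low"
LOW_WEIGHT = 0.1


def index_site(index, site, terms, relation_dictionary):
    """Register each query or title term for a site with a dictionary-based weight."""
    for term in terms:
        in_dictionary = term in relation_dictionary
        index[term][site] = HIGH_WEIGHT if in_dictionary else LOW_WEIGHT


def search(index, term):
    """Return the sites indexed under a term, highest weight first."""
    hits = index.get(term, {})
    return sorted(hits, key=hits.get, reverse=True)


index = defaultdict(dict)
# A shopping mall whose relation dictionary does not contain the brand name.
index_site(index, "Bike Boom Boom", ["motorcycle", "honda"], {"motorcycle", "scooter"})
# A manufacturer site whose relation dictionary does contain it.
index_site(index, "Honda official site", ["honda"], {"honda", "motorcycle"})
print(search(index, "honda"))
```

In this toy index the brand query ranks the manufacturer site above the shopping mall, which is the behavior the text aims for.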

The popularity determination unit (205) may determine the popularity of one or more sites indexed for a query using at least one popularity factor among the PageRank, click frequency, toolbar visit frequency, and site dwell time of each site. That is, when a specific query is entered, the search modeling system (102) may sort the sites by the determined popularity and provide them to the user in that order.

For example, PageRank may mean a weight reflecting the relative importance of a document within a hyperlinked structure such as the World Wide Web (WWW). Click frequency (ClickCount) may mean the number of times a site was clicked through a hyperlink, and toolbar visit frequency (Toolbar VisitCount) may mean how often the site was visited through a toolbar. Site dwell time (Site DwellTime) may mean the average time users stayed on the site after visiting it. These definitions of the popularity factors are only examples, and the detailed definitions may vary with the configuration of the system.
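The text does not say how the four factors are combined. Purely as an illustration, the sketch below assumes each factor has already been normalized to the range 0 to 1 and combines them with an equally weighted sum; the class name, field names, weights, and sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SiteSignals:
    """Popularity factors, each assumed to be pre-normalized to [0, 1]."""
    page_rank: float
    click_count: float
    toolbar_visit_count: float
    dwell_time: float


# Hypothetical equal weights; the source gives no formula.
WEIGHTS = {
    "page_rank": 0.25,
    "click_count": 0.25,
    "toolbar_visit_count": 0.25,
    "dwell_time": 0.25,
}


def popularity(signals: SiteSignals) -> float:
    """Weighted sum of the four factors (an assumed combination rule)."""
    return sum(weight * getattr(signals, name) for name, weight in WEIGHTS.items())


inflated_clicks = SiteSignals(page_rank=0.2, click_count=1.0,
                              toolbar_visit_count=0.1, dwell_time=0.1)
balanced = SiteSignals(page_rank=0.8, click_count=0.6,
                       toolbar_visit_count=0.7, dwell_time=0.8)
print(popularity(inflated_clicks), popularity(balanced))
```

Under these assumptions a site with inflated clicks but weak values elsewhere scores below a site with balanced signals, which is the anti-abuse effect the text describes.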

FIG. 3 is an example illustrating the criteria for generating a relation dictionary from site data according to an embodiment of the present invention.

The site data extraction unit (206) may extract site data including the directory structure, site keywords, or anchor text for a site.

FIG. 3 shows the directory structure, site keywords, and anchor text for the site Naver.

Here, the directory structure may mean the criteria used to classify a site by topic. That is, referring to FIG. 3, the directory structure may indicate that the site 'Naver' is an Internet-related portal site. The directory structure is associated with the characteristics of the site, and one or more directory structures may be determined for each site.

A site keyword may mean a keyword that a user entered when accessing the site. Referring to FIG. 3, users accessed the site 'Naver' through the keywords '검색 포털, 포탈, 포털, 지식인, 정보검색, nhn' (search portal, two spellings of "portal", Jisikin, information search, nhn).

Anchor text may mean the text contained in the link (located on site X) that a user clicks to move from site X to site Y. Referring to FIG. 3, the link on site A that leads to the site 'Naver' contains 'Knowledge portal Naver', and the link on site B contains 'The best portal site for information search'.

In FIG. 3, analyzing Naver's directory structure "컴퓨터, 인터넷 > 포털사이트 > 네이버" (Computers, Internet > Portal sites > Naver) in units delimited by spaces, commas, and periods may yield the keywords "컴퓨터 인터넷 포털사이트 네이버". Likewise, analyzing the site keywords "검색포털, 포탈, 포털, 지식인, 정보검색, nhn" in units delimited by spaces, commas, and periods may yield the keywords "검색 포털 포탈 포털 지식인 정보검색".
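The delimiter-based analysis of a directory structure can be sketched as below. The function name `split_keywords` is hypothetical, and treating the ">" directory separator as a delimiter is an assumption: the text names only spaces, commas, and periods, but the resulting keywords omit ">".

```python
import re

# Delimiters named in the text: whitespace, commas, periods.
# The ">" directory separator is an added assumption.
_DELIMITERS = re.compile(r"[\s,.>]+")


def split_keywords(site_material: str) -> list[str]:
    """Split a directory path or a site-keyword string into keyword tokens."""
    return [token for token in _DELIMITERS.split(site_material) if token]


directory = "컴퓨터, 인터넷 > 포털사이트 > 네이버"
print(split_keywords(directory))
```

Anchor text is handled differently in the text (morphological analysis that keeps nouns), so this sketch does not apply to it.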

Further, from the anchor text "<a href=url> 지식포털 네이버 </a>" (knowledge portal Naver), the nouns remaining after morphological analysis, "지식 포털 네이버", may be determined as keywords. For the anchor text "<a href=url> NO1 대한민국의 지식 창고 </a>" (No. 1 knowledge storehouse of Korea), the nouns remaining after morphological analysis, "대한민국 지식 창고" (Korea knowledge storehouse), may be determined as keywords.

The list generation unit (208) may generate a list from the determined keywords. That is, the relation dictionary for a site may be generated by combining the keywords determined from the directory structure, site keywords, or anchor text into a list.

FIG. 4 is an example illustrating the process of generating a relation dictionary for a site according to an embodiment of the present invention.

Specifically, FIG. 4 shows a concrete example of a relation dictionary (404) generated from the anchor text structure (401) and the directory structure (403) of a site (402).

In FIG. 4, the site (402) named 'Bike Boom Boom' is assumed to be a motorcycle-related site. The site (402) may then have an anchor text structure (401) composed of anchor texts such as 'used motorcycles, motorcycle sales, motorcycle accessories, motorcycle brokerage, bike accessories, scooter trading, brokerage'.

In addition, the site (402) named 'Bike Boom Boom' may have a directory structure (403) composed of directories such as 'Business > Shopping mall > Motorcycle'.

The relation dictionary generation unit (201) may then generate a relation dictionary for the site using the directory structure, site keywords, or anchor text structure of the site. For example, the relation dictionary generation unit (201) may extract site data including the directory structure, site keywords, or anchor text for the site, analyze the extracted site data to determine keywords, and generate a list from the determined keywords.

The relation dictionary generation unit (201) may generate the relation dictionary (404) by parsing each anchor text into morphological units (for example, 'used motorcycle' into 'used' and 'motorcycle'), by parsing the site title (for example, 'Bike Boom Boom' into 'Bike' and 'Boom Boom'), or by parsing the directory keywords (for example, 'Business > Trading > Motorcycle accessories' into 'trading', 'motorcycle', and 'accessories'). The relation dictionary (404) thus contains keywords that reflect the characteristics of the site (402) and consists of keywords that are highly related to the site.
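A rough sketch of assembling such a dictionary from the three sources follows. It is not the disclosed parser: plain whitespace splitting and lowercasing stand in for morphological analysis, the function name is hypothetical, and the inputs are English renderings of the Fig. 4 example.

```python
def build_relation_dictionary(anchor_texts, site_title, directory_path) -> set[str]:
    """Collect keywords from anchor texts, the site title, and the directory path.

    Assumption: whitespace splitting replaces real morphological analysis.
    """
    sources = list(anchor_texts) + [site_title] + directory_path.split(">")
    return {
        word.lower()
        for text in sources
        for word in text.replace(",", " ").split()
    }


dictionary = build_relation_dictionary(
    anchor_texts=["used motorcycles", "motorcycle sales", "scooter trading"],
    site_title="Bike Boom Boom",
    directory_path="Business > Shopping mall > Motorcycle",
)
print(sorted(dictionary))
```

Brand names such as 'honda' never appear in these sources, so they stay out of the dictionary, which is what later lets brand queries be down-weighted for this site.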

FIG. 5 is an example illustrating the process of refining the queries of a site using the relation dictionary according to an embodiment of the present invention.

The data collection unit (202) may collect at least one query for a site and the click frequency of each query. FIG. 5 shows a site collection (501) containing <query, click frequency> pairs (503-1 to 503-11) for a site (502). That is, the data collection unit (202) may collect the site collection (501).

Here, a query means the set of keywords a user entered when visiting the site (502), and the click frequency means the number of times users clicked the site (502) matching the query. For example, 'Daelim (38)' in the site collection (501) means that users who entered the query 'Daelim' clicked the resulting site (502), 'Bike Boom Boom', 38 times.

The component determination unit 203 may use the relevance dictionary for the site to determine whether a query exists in the relevance dictionary. For example, the component determination unit 203 may apply the relevance dictionary to the site collection 501 to classify the queries into query groups.

Assuming that the site 502, 'Bike Boomboom', has the anchor text structure and directory structure shown in FIG. 4, the entries Daelim (38) 503-1, Honda (203) 503-3, and Hyosung (116) 503-11 of the site collection 501 may not be included in the relevance dictionary. That is, these queries relate to the characteristics of the products sold by 'Bike Boomboom' and have low relevance to the characteristics of the site itself.

Because Daelim (38) 503-1, Honda (203) 503-3, and Hyosung (116) 503-11 are not included in the relevance dictionary, the component determination unit 203 may classify them into query group 2 505-2, for which a low index term weight is set. The remaining queries are included in the relevance dictionary and may therefore be classified into query group 1 505-1, for which a high index term weight is set.
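The grouping described above might be sketched as follows, purely as an illustration. The rule that a query joins group 1 when any of its keywords is in the dictionary, the whitespace tokenization, the dictionary contents, and the 'used motorcycle' entry with its click count are assumptions; only the Daelim, Honda, and Hyosung counts come from the example.

```python
def classify_queries(site_collection, relevance_dictionary):
    """Split <query, click frequency> pairs into group 1 (in dictionary) and group 2 (not)."""
    group1, group2 = [], []
    for query, clicks in site_collection:
        keywords = query.lower().split()  # stand-in for keyword extraction
        # Assumption: one matching keyword is enough to place the query in group 1.
        if any(keyword in relevance_dictionary for keyword in keywords):
            group1.append((query, clicks))
        else:
            group2.append((query, clicks))
    return group1, group2


relevance_dictionary = {"used", "motorcycle", "bike", "boomboom", "sales", "supplies"}
site_collection = [
    ("Daelim", 38),
    ("Honda", 203),
    ("Hyosung", 116),
    ("used motorcycle", 50),  # hypothetical entry and count
]
group1, group2 = classify_queries(site_collection, relevance_dictionary)
print(group1)
print(group2)
```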

As a result, when a user enters a query from query group 1 505-1, which fits the characteristics of the site 502 'Bike Boomboom', the site 502 may be displayed near the top of the results. Conversely, when a user enters a query from query group 2 505-2, which fits the characteristics of the products on sale rather than those of the site 502, the site 502 is not displayed near the top, which prevents abuse of the site. In particular, for a site that sells a wide variety of products, this appropriately prevents the abuse in which the site is displayed near the top whenever a query related to product characteristics (for example, a product name) is entered.

FIG. 6 illustrates an example of a process of refining the title of a site using a relevance dictionary, according to an embodiment of the present invention.

The data collection unit 202 may collect, for a site, at least one query and the click frequency of each query. FIG. 6 shows a site collection 601 containing <query, click frequency> pairs 603-1 to 603-4 for a site 602. That is, the data collection unit 202 may collect the site collection 601.

The component determination unit 203 may use the relevance dictionary for the site to determine whether the title of the site exists in the relevance dictionary. For example, the component determination unit 203 may apply the relevance dictionary to the site collection 601 to classify the keywords that make up the site title into query groups.

For example, the component determination unit 203 may extract keywords from the site title and, depending on whether each extracted keyword exists in the relevance dictionary, classify the keyword into a title group to which an index term weight will be applied. That is, the component determination unit 203 may extract the keywords '대한' (Korean), '치과' (dental), '의사' (doctor), and '협회' (association) from the site title '대한치과의사협회' (Korean Dental Association) and determine whether each keyword exists in the relevance dictionary.

If '치과' (dental) and '의사' (doctor) are included in the relevance dictionary of the site '대한치과의사협회', the component determination unit 203 may classify these two title keywords into title group 1 604-1, for which a high index term weight is set. Conversely, if '대한' (Korean) and '협회' (association) do not exist in that relevance dictionary, the component determination unit 203 may classify them into title group 2 604-2, for which a low index term weight is set.
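As an illustrative sketch only, the title keyword grouping could look like the following. It assumes the title has already been segmented into keywords (the morphological analysis itself is not shown), and the function name is hypothetical.

```python
def classify_title_keywords(title_keywords, relevance_dictionary):
    """Split title keywords into title group 1 (in dictionary) and title group 2 (not)."""
    title_group1 = [k for k in title_keywords if k in relevance_dictionary]
    title_group2 = [k for k in title_keywords if k not in relevance_dictionary]
    return title_group1, title_group2


# Keywords extracted from the title '대한치과의사협회', as in the example above.
title_keywords = ["대한", "치과", "의사", "협회"]
relevance_dictionary = {"치과", "의사"}
title_group1, title_group2 = classify_title_keywords(title_keywords, relevance_dictionary)
print(title_group1, title_group2)
```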

As a result, when a user enters the query '치과' (dental) or '의사' (doctor), the site '대한치과의사협회', whose title contains both keywords, is displayed near the top, so that the official site widely recognized by users for the entered query is provided preferentially in site search.

FIG. 7 illustrates an example of a process of determining the popularity of sites for a query, according to an embodiment of the present invention.

Referring to FIG. 7, assume that the search page 703 for a query Q 701 presents site X, site Y, and site Z, which are indexed to the query Q 701. For example, the popularity determination unit 205 may determine the popularity of site X, site Y, and site Z indexed to the query Q 701 using at least one popularity factor 702 among the page rank, click frequency, toolbar visit frequency, and site dwell time of each site.

The page rank, toolbar visit frequency, and site dwell time of a site are popularity factors 702 that reflect actual user behavior beyond a mere click on the site. Setting the site score from these popularity factors can therefore prevent site abuse through fraudulent clicks.

As a result, when a user enters the query Q 701, site X, site Y, and site Z indexed to the query Q 701 may be arranged by popularity on the search result page and provided to the user.
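The specification does not state how the popularity factors are combined. The sketch below assumes, purely for illustration, a weighted sum over factor values already normalized to the range 0 to 1; the weights, the factor values for sites X, Y, and Z, and the function names are all hypothetical.

```python
def popularity_score(factors, weights):
    """Weighted sum of the popularity factors that have a configured weight."""
    return sum(weights[name] * value for name, value in factors.items() if name in weights)


def rank_sites(site_factors, weights):
    """Return site names ordered from most to least popular."""
    return sorted(
        site_factors,
        key=lambda site: popularity_score(site_factors[site], weights),
        reverse=True,
    )


# Hypothetical, pre-normalized factor values for the sites indexed to query Q.
site_factors = {
    "X": {"page_rank": 0.2, "dwell_time": 0.9},
    "Y": {"page_rank": 0.8, "dwell_time": 0.1},
    "Z": {"page_rank": 0.5, "dwell_time": 0.5},
}
weights = {"page_rank": 0.5, "dwell_time": 0.5}  # hypothetical weights
print(rank_sites(site_factors, weights))
```

Leaving click frequency out of `weights`, as done here, is one way such a score could rely on behavior-based factors rather than raw clicks.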

FIG. 8 is a flowchart illustrating the overall process of refining the queries for a site using a relevance dictionary, according to an embodiment of the present invention.

The data collection unit 202 may collect at least one query for a site and the click frequency of each query (S801).

The component determination unit 203 may extract keywords from each query (S802) and then determine whether the keywords extracted from the query exist in the relevance dictionary (S803). If a keyword exists in the relevance dictionary, the component determination unit 203 may classify the query into query group 1 (S804). Conversely, if the keyword does not exist in the relevance dictionary, the component determination unit 203 may classify the query into query group 2 (S805). Through this process, the queries for the site are refined using the relevance dictionary.

The site indexing unit 204 may apply a high index term weight to the queries in query group 1 and a low index term weight to the queries in query group 2 (S806). The site indexing unit 204 may then index the site by indexing each weighted query to the site that it matches (S807).
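Steps S806 and S807 might be sketched as follows, by way of example only. The concrete weight values, the dictionary-based index layout, the second site 'Honda Official', and the function names are assumptions; the specification only distinguishes a high weight from a low weight.

```python
from collections import defaultdict

HIGH_WEIGHT = 1.0  # hypothetical value for query group 1
LOW_WEIGHT = 0.1   # hypothetical value for query group 2


def index_site(index, site, group1, group2):
    """Index the site under each of its queries with the weight of the query's group."""
    for query, _clicks in group1:
        index[query][site] = HIGH_WEIGHT
    for query, _clicks in group2:
        index[query][site] = LOW_WEIGHT


def search(index, query):
    """Return (site, weight) pairs for the query, highest weight first."""
    postings = index.get(query, {})
    return sorted(postings.items(), key=lambda item: item[1], reverse=True)


index = defaultdict(dict)
index_site(index, "Bike Boomboom", group1=[("used motorcycle", 50)], group2=[("Honda", 203)])
index_site(index, "Honda Official", group1=[("Honda", 10)], group2=[])  # hypothetical site
print(search(index, "Honda"))
```

With these assumed weights, a shop that is frequently clicked for a product name still ranks below a site whose own characteristics match that query.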

FIG. 9 is a flowchart illustrating the overall process of refining the title of a site using a relevance dictionary, according to an embodiment of the present invention.

The component determination unit 203 may extract keywords from the title of the site (S901) and determine whether each extracted keyword exists in the relevance dictionary (S902). If a keyword exists in the relevance dictionary, the component determination unit 203 may classify the keyword into title group 1 (S903). Conversely, if the keyword does not exist in the relevance dictionary, the component determination unit 203 may classify the keyword into title group 2 (S904). Through this process, the keywords in the title that are meaningful to the site are extracted, and the title is thereby refined.

The site indexing unit 204 may then apply a high index term weight to the keywords classified into title group 1 and a low index term weight to the keywords classified into title group 2 (S905). The site indexing unit 204 may then index the weighted keywords to the site (S906).

For details not described with reference to FIGS. 8 and 9, refer to the descriptions of FIGS. 1 to 7.

In addition, the search modeling method using a relevance dictionary according to an embodiment of the present invention includes a computer-readable medium containing program instructions for performing various computer-implemented operations. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions on the medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like.

Although the present invention has been described above with reference to limited embodiments and drawings, it is not limited to those embodiments, and those of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description. Accordingly, the spirit of the present invention should be understood only from the claims set forth below, and all equal or equivalent modifications thereof fall within the scope of the spirit of the present invention.

<Explanation of reference numerals for the main parts of the drawings>

101: User

102: Search modeling system

103-1 to 103-3: Sites

Claims

A search modeling system comprising:

a data collection unit that collects, for a site, at least one query and a click frequency of each query;

a component determination unit that determines, using a relevance dictionary for the site, whether the query or a title of the site exists in the relevance dictionary; and

a site indexing unit that indexes the site by adjusting an index term weight to be applied to the query or the title according to whether the query or the title exists in the relevance dictionary.

The search modeling system of claim 1,

wherein the relevance dictionary

consists of keywords that are extracted based on a directory structure and an anchor text structure of the site and that are highly relevant to the site.

The search modeling system of claim 1, further comprising

a relevance dictionary generation unit that generates the relevance dictionary for the site using a directory structure, site keywords, or an anchor text structure of the site.

The search modeling system of claim 3,

wherein the relevance dictionary generation unit comprises:

a site material extraction unit that extracts site material including a directory structure, site keywords, or anchor text for the site;

a keyword determination unit that determines keywords by analyzing the extracted site material; and

a list generation unit that generates a list using the determined keywords.

The search modeling system of claim 1,

wherein the component determination unit

extracts keywords from the query and, according to whether the extracted keywords exist in the relevance dictionary, classifies the query into a query group to which an index term weight is to be applied.

The search modeling system of claim 5,

wherein the component determination unit

defines a click threshold for the at least one query and determines, for a query whose click frequency is greater than the click threshold, whether the query exists in the relevance dictionary.

The search modeling system of claim 1,

wherein the component determination unit

extracts keywords from the title of the site and, according to whether each extracted keyword exists in the relevance dictionary, classifies the keyword into a title group to which an index term weight is to be applied.

The search modeling system of claim 1,

wherein the site indexing unit

increases the index term weight and applies it to the query or the title when the query or the title exists in the relevance dictionary.

The search modeling system of claim 1, further comprising

a popularity determination unit that determines the popularity of one or more sites indexed to the query using at least one popularity factor among page rank, click frequency, toolbar visit frequency, and site dwell time for the site.

A search modeling method performed by a search modeling system, the method comprising:

collecting, by a data collection unit, for a site, at least one query and a click frequency of each query;

determining, by a component determination unit, using a relevance dictionary for the site, whether the query or a title of the site exists in the relevance dictionary; and

indexing, by a site indexing unit, the site by adjusting an index term weight to be applied to the query or the title according to whether the query or the title exists in the relevance dictionary.

The search modeling method of claim 10,

wherein the relevance dictionary

consists of keywords that are extracted based on a directory structure and an anchor text structure of the site and that are highly relevant to the site.

The search modeling method of claim 10, further comprising

generating, by a relevance dictionary generation unit, the relevance dictionary for the site using a directory structure, site keywords, or an anchor text structure of the site.

The search modeling method of claim 12,

wherein generating the relevance dictionary comprises:

extracting site material including a directory structure, site keywords, or anchor text for the site;

determining keywords by analyzing the extracted site material; and

generating a list using the determined keywords.

The search modeling method of claim 10,

wherein, in determining whether the query or the title exists in the relevance dictionary,

the component determination unit extracts keywords from the query and, according to whether the extracted keywords exist in the relevance dictionary, classifies the query into a query group to which an index term weight is to be applied.

The search modeling method of claim 14,

wherein the component determination unit defines a click threshold for the at least one query and determines, for a query whose click frequency is greater than the click threshold, whether the query exists in the relevance dictionary.

The search modeling method of claim 10,

wherein the component determination unit extracts keywords from the title of the site and, according to whether each extracted keyword exists in the relevance dictionary, classifies the keyword into a title group to which an index term weight is to be applied.

The search modeling method of claim 10,

wherein, in indexing the site,

the site indexing unit increases the index term weight and applies it to the query or the title when the query or the title exists in the relevance dictionary.

The search modeling method of claim 10, further comprising

determining, by a popularity determination unit, the popularity of one or more sites indexed to the query using at least one popularity factor among page rank, click frequency, toolbar visit frequency, and site dwell time for the site.

A computer-readable recording medium on which a program for executing the method of any one of claims 10 to 18 is recorded.