KR100490442B1

KR100490442B1 - Apparatus for clustering same and similar product using vector space model and method thereof

Info

Publication number: KR100490442B1
Application number: KR10-2002-0014270A
Authority: KR
Inventors: 김기응; 서홍석; 전태수; 김태형; 조문옥
Original assignee: 삼성에스디에스 주식회사
Priority date: 2002-03-16
Filing date: 2002-03-16
Publication date: 2005-05-17
Also published as: KR20030075219A

Abstract

The present invention relates to an apparatus and method for clustering identical/similar products using a vector document model. When building a shopping portal or price comparison agent that compares the prices of identical or similar products, the apparatus clusters identical/similar products that are labeled differently from one shopping mall site to another. The apparatus includes: a web page collecting unit that collects web pages corresponding to a query search term from shopping mall sites; a single-cluster forming unit that extracts, for each web page, the words describing the query search term and forms a cluster; a vector conversion unit that forms a vector from the words extracted for each web page; a similarity calculation unit that calculates the similarity between the per-page vectors; and a cluster combining unit that combines clusters according to the similarity. Identical/similar products can thus be classified and their prices compared, and keyword search groups identical/similar product names together for comparison, which greatly improves user convenience.

Description

Apparatus and method for clustering identical/similar products using a vector document model {Apparatus for clustering same and similar product using vector space model and method thereof}

The present invention relates to the field of clustering words extracted from web pages, and more particularly to an apparatus and method that receive a query search term and perform clustering according to that term using a vector document model.

Conventional product price comparison agents fall into two types according to how they handle identical/similar products. The first type gives up on identical-product recognition: it uses simple keyword search, lists every product that matches the keyword in arbitrary order, and leaves it to the user to judge whether two listings are the same product. The second type performs identical-product recognition but relies on manual work by the shopping mall sites subscribed to the price comparison agent. That is, when a shopping mall site registers a new product, it manually assigns the product to an identical-product group through an interface provided by the price comparison agent. With the first approach, the user must personally identify identical products among all keyword-matched products in the search results, so the outcome depends on the user's knowledge. With the second approach, the price comparison service is passive rather than active: the price comparison agent must recruit subscribers (that is, shopping mall sites), and the subscribers must do a great deal of manual work.

When searching for relevant products among the collected product pages, users face the following burdens. First, because keyword search is used, the results include pages for products that are related to, but not identical to, what the user intended. Second, because the intended product is sold across several shopping mall sites, the same product is reported repeatedly in the results. Finally, even the same product is described with similar but different expressions on different shopping mall sites. For example, "삼성 M5670" and "Samsung M5670" are the same product, but one site writes the manufacturer name in Korean and the other in English.

The technical problem to be solved by the present invention is to provide, in order to resolve the above problems, an apparatus and method for clustering identical/similar products using a vector document model, which cluster the words that denote products in web pages collected using a search term.

Another technical problem to be solved by the present invention is to provide a computer-readable recording medium on which a program for executing the method on a computer is recorded.

To achieve the above object, an apparatus for clustering identical/similar products using a vector document model according to the present invention includes: a web page collecting unit that collects web pages corresponding to a query search term from shopping mall sites; a single-cluster forming unit that extracts, for each web page, the words describing the query search term and forms a cluster; a vector conversion unit that forms a vector from the words extracted for each web page; a similarity calculation unit that calculates the similarity between the per-page vectors; and a cluster combining unit that combines clusters according to the similarity.

To achieve the above object, a method for clustering identical/similar products using a vector document model according to the present invention includes: (a) collecting web pages corresponding to a query search term from shopping mall sites; (b) extracting, for each web page, the words describing the query search term and forming a cluster; (c) forming a vector from the words extracted for each web page; (d) calculating the similarity between the per-page vectors; and (e) combining clusters according to the similarity.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of the apparatus for clustering identical/similar products using a vector document model according to the present invention.

FIG. 2 is a flowchart of the method for clustering identical/similar products using a vector document model according to the present invention.

FIG. 1 and FIG. 2 are described together below.

The web page collecting unit 110 of the clustering apparatus 100 crawls the shopping mall sites 101 at the administrator's request and collects web pages for product names and the like (step 210). Web pages for product names can be collected in two ways. In the first, product pages are collected from the shopping mall sites at intervals specified by the administrator and stored in a local DB. In the second, when the user 103 enters a search query, the query is sent to each shopping mall site so that the relevant product pages are collected in real time and the results are reported.

The clustering algorithm used in the present invention is a hierarchical clustering algorithm. Before the clustering algorithm is applied, the following preprocessing steps are performed.

The character string of a product web page is converted into a bag of words. The character replacing unit 120 replaces special characters in the string (for example, characters other than alphabetic letters, Hangul, digits, or ".") with spaces to produce a plain string (step 220). The single-cluster forming unit 130 extracts the space-separated words (step 230) and converts the plain string into a vector by recording only the occurrence frequency of each word (step 240). The single-cluster forming unit 130 thus forms a cluster containing a single item (a singleton cluster). This process constitutes the preprocessing step. Each word that appears in the string becomes an element of the vector.

For example, given the product name string "삼성 P4 1.5GHz M5060-SS + 삼성 17인치 (일반:703S)", replacing the special characters with spaces yields "삼성 P4 1.5GHz M5060 SS 삼성 17인치 일반 703S". Extracting the words to represent this as a vector yields {"삼성", "P4", "1.5GHz", "M5060", "SS", "17인치", "일반", "703S"} (the word "삼성" appears twice but is extracted only once). Converting the extracted words into a sparse vector yields ["삼성"=2, "P4"=1, "1.5GHz"=1, "M5060"=1, "SS"=1, "17인치"=1, "일반"=1, "703S"=1]. Here "삼성" is Samsung written in Hangul, "17인치" means 17-inch, and "일반" means general.
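The preprocessing of steps 220 to 240 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, and the character class assumes that "ordinary" characters are ASCII letters, digits, precomposed Hangul syllables, and ".".

```python
import re
from collections import Counter
from typing import Dict

# Assumption: everything except ASCII letters, digits, Hangul syllables and "."
# counts as a special character.
_SPECIAL = re.compile(r"[^0-9A-Za-z가-힣.]")


def to_plain_string(text: str) -> str:
    """Replace special characters with spaces and collapse repeated spaces (step 220)."""
    return " ".join(_SPECIAL.sub(" ", text).split())


def to_sparse_vector(text: str) -> Dict[str, int]:
    """Split the plain string on spaces and count each word (steps 230 and 240)."""
    return dict(Counter(to_plain_string(text).split()))


name = "삼성 P4 1.5GHz M5060-SS + 삼성 17인치 (일반:703S)"
plain = to_plain_string(name)
vector = to_sparse_vector(name)
```

Storing only the words that actually occur, as a dictionary, keeps the vector sparse, which suits product names of a few words each.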

The vector conversion unit 140 calculates the center vector of the vectors contained in each cluster of product web pages (step 250). The center vector is calculated as the average of the vectors contained in the cluster. For example, suppose a cluster contains the following three vectors:

v1 = ["Rinnai"=1, "gas"=1, "range"=1, "RHS"=1, "430S"=1]

v2 = ["Rinnai"=1, "gas"=1, "stove"=1, "RHS"=1, "430S"=1]

v3 = ["Rinnai"=1, "RHS"=1, "430S"=1]

The center vector c of the cluster is then calculated as follows.

c = ["Rinnai"=1, "gas"=0.67, "range"=0.33, "stove"=0.33, "RHS"=1, "430S"=1]
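The averaging in step 250 can be sketched as follows. The helper name `centroid` is hypothetical, and a word missing from a vector is assumed to count as 0.

```python
from typing import Dict, List


def centroid(vectors: List[Dict[str, float]]) -> Dict[str, float]:
    """Average sparse vectors element-wise; a missing word counts as 0."""
    if not vectors:
        raise ValueError("a cluster must contain at least one vector")
    total: Dict[str, float] = {}
    for vec in vectors:
        for word, value in vec.items():
            total[word] = total.get(word, 0.0) + value
    return {word: value / len(vectors) for word, value in total.items()}


v1 = {"Rinnai": 1, "gas": 1, "range": 1, "RHS": 1, "430S": 1}
v2 = {"Rinnai": 1, "gas": 1, "stove": 1, "RHS": 1, "430S": 1}
v3 = {"Rinnai": 1, "RHS": 1, "430S": 1}
c = centroid([v1, v2, v3])
```

Rounded to two decimals, `c` has the same values as the center vector shown in the text.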

The similarity calculation unit 150 calculates the inner product between the center vectors (step 260) and forms a single cluster from the web pages that yield the largest inner product value; forming a single cluster means expressing the two clusters as their union (step 270). The cosine of the angle between center vectors is obtained by calculating the inner product of the vectors and then normalizing it. For example, suppose the two center vectors are as follows:

c1 = ["Rinnai"=1, "gas"=1, "range"=1, "RHS"=1, "430S"=1]

c2 = ["Rinnai"=1, "gas"=1, "range"=1, "RHS"=1, "430W"=1]

The inner product c1·c2 of the two vectors is 4 (because only "Rinnai", "gas", "range", and "RHS" overlap), and the magnitudes of both vectors are ∥c1∥ = ∥c2∥ = √5, so the cosine value is (c1·c2)/(∥c1∥·∥c2∥) = 4/(√5·√5) = 0.8. The cluster combining unit 160 therefore merges the two clusters whose center vectors have the largest cosine value, and the merging ends once that cosine value falls below a predetermined threshold e. The threshold e may be set to any value chosen by whoever runs the algorithm.
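The cosine calculation and the merge loop of steps 260 and 270 can be sketched as follows. This is an illustrative reading of the description, not the patented code: the function names are hypothetical, the loop recomputes all center vectors on every pass for clarity rather than speed, and the third page in the usage example is invented test data.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two sparse vectors."""
    return sum(value * b.get(word, 0.0) for word, value in a.items())


def cosine(a: Vector, b: Vector) -> float:
    """Inner product normalized by both vector magnitudes."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    """Element-wise average of the vectors in one cluster."""
    total: Vector = {}
    for vec in vectors:
        for word, value in vec.items():
            total[word] = total.get(word, 0.0) + value
    return {word: value / len(vectors) for word, value in total.items()}


def merge_clusters(vectors: List[Vector], threshold: float) -> List[List[Vector]]:
    """Start from singleton clusters and merge the most similar pair until
    the best cosine value drops below the threshold."""
    clusters = [[vec] for vec in vectors]
    while len(clusters) > 1:
        centers = [centroid(members) for members in clusters]
        best, pair = -1.0, (0, 1)
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                score = cosine(centers[i], centers[j])
                if score > best:
                    best, pair = score, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i].extend(clusters[j])  # union of the two clusters
        del clusters[j]
    return clusters


c1 = {"Rinnai": 1, "gas": 1, "range": 1, "RHS": 1, "430S": 1}
c2 = {"Rinnai": 1, "gas": 1, "range": 1, "RHS": 1, "430W": 1}
similarity = cosine(c1, c2)  # approximately 0.8

pages = [c1, {"Rinnai": 1, "gas": 1, "stove": 1, "RHS": 1, "430S": 1}, {"Samsung": 1, "TV": 1}]
groups = merge_clusters(pages, threshold=0.7)
```

With a threshold of 0.7, the two Rinnai pages (cosine 0.8) end up in one cluster and the unrelated page stays in its own singleton cluster.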

The synonym dictionary unit 170 stores the vector values of the merged clusters (step 280). When a user later searches for a product with a query search term, the stored clusters allow the results for that term to be provided to the user through the cluster providing unit 180. That is, when the user searches with some word, the products containing that word are found and the clusters to which those products belong are provided. For example, if the user searches with the words "gas range" and only the cluster for the Rinnai gas range contains both "gas" and "range", the cluster of Rinnai gas range RHS-430S product pages is displayed. (If other clusters also contain the words "gas" and "range", they are displayed as well, with the clusters shown separately so that the user can easily tell whether items are the same product or different products.)
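The lookup described above can be sketched as follows. The function name `find_clusters` is hypothetical, and the matching rule is an assumption: a cluster is returned when at least one of its product pages contains every query word.

```python
from typing import Dict, List

Page = Dict[str, int]      # sparse word-count vector of one product page
Cluster = List[Page]


def find_clusters(query: str, clusters: List[Cluster]) -> List[Cluster]:
    """Return the clusters holding at least one page that contains all query words."""
    words = query.split()
    if not words:
        return []
    return [
        cluster
        for cluster in clusters
        if any(all(word in page for word in words) for page in cluster)
    ]


rinnai = [
    {"Rinnai": 1, "gas": 1, "range": 1, "RHS": 1, "430S": 1},
    {"Rinnai": 1, "RHS": 1, "430S": 1},
]
television = [{"Samsung": 1, "TV": 1}]
hits = find_clusters("gas range", [rinnai, television])
```

Because whole clusters are returned, the page labeled only "Rinnai RHS 430S" is shown together with the page that matched the query.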

The hierarchical clustering method of the present invention can be applied to identical/similar product recognition, but a synonym dictionary is used to address the loss of accuracy caused by short product name notations. The synonym dictionary used for product clustering stores, for example, the fact that "삼성" and "Samsung" are words with the same meaning. The method for building the synonym dictionary unit 170 is described below.

Existing algorithms for automatic synonym dictionary construction in the field of information retrieval are mostly statistical. That is, to decide whether two words are synonyms, they compute a correlation from how often the two words appear together in documents and treat the words as synonyms if the correlation exceeds a certain threshold. However, such methods cannot compute correlations effectively for very short data (for example, 5 to 10 words) such as product notations. The automatic synonym dictionary construction of the present invention therefore assumes that a small amount of identical-product data is already stored, automatically builds a synonym dictionary from it, and uses that dictionary to cluster other products, which is the bootstrapping technique of machine learning.

The synonym dictionary construction derives from the observation that vocabulary in product notations is mutually exclusive. For example, suppose a cluster contains the two product notations "삼성 TV" and "Samsung TV". "삼성" and "Samsung" are synonyms, but "삼성" and "TV", or "Samsung" and "TV", are not. In other words, product notation data uses either the word "Samsung" alone or the word "삼성" alone, and very rarely uses both at once. The automatic synonym dictionary construction algorithm is therefore based on a heuristic that exploits the exclusivity of the usage patterns of the main words in product notations: the more exclusive two words are, the better suited they are as synonyms. The synonym fitness used by the algorithm is defined as in Equation 1.

[Equation 1: synonym fitness of word i and word j; the formula image is missing from this text]

The quantities in Equation 1 are as follows: the set of product names in a cluster in which word i does not appear and word j appears; the set of product names in which word i appears and word j does not appear; the k-th cluster among the product clusters in whose product notations word i or word j appears; and i and j, each of which denotes a word used in a product name.

Equation 1 can be interpreted as follows. If the fitness is close to 0, word i and word j always appear together. Such a pair is poorly suited as a synonym candidate, because words that are related but have different meanings often appear together in product notations, as with word i = "삼성" and word j = "TV", or word i = "소니" (Sony) and word j = "워크맨" (Walkman). If the fitness is close to 1, words i and j are consistently used to the exclusion of each other in strings that denote the same product, which makes the pair the kind of synonym candidate being sought. The present invention therefore sets a threshold in advance, runs the automatic synonym dictionary construction algorithm, and adds word pairs whose fitness is at or above the threshold to the synonym dictionary as synonyms.
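The exclusivity heuristic can be illustrated with a stand-in score. The function below is NOT Equation 1, whose exact form is not available in this text. It is a hypothetical score built only from the two sets the description names (product names with word i but not word j, and product names with word j but not word i), chosen so that it lies between 0 and 1 and is 0 when the two words always appear together.

```python
from typing import List, Set

ProductName = Set[str]            # the words of one product notation
Cluster = List[ProductName]       # product names known to be the same product


def exclusivity_score(word_i: str, word_j: str, clusters: List[Cluster]) -> float:
    """Hypothetical stand-in for the synonym fitness (not the patent's Equation 1).

    Within each cluster, names using only word_i are paired with names using
    only word_j; the score is the paired fraction of all names that use
    either word, so it lies between 0 and 1.
    """
    paired = 0
    total = 0
    for names in clusters:
        only_i = sum(1 for n in names if word_i in n and word_j not in n)
        only_j = sum(1 for n in names if word_j in n and word_i not in n)
        either = sum(1 for n in names if word_i in n or word_j in n)
        paired += 2 * min(only_i, only_j)
        total += either
    return paired / total if total else 0.0


clusters = [[{"삼성", "TV"}, {"Samsung", "TV"}]]
synonym_like = exclusivity_score("삼성", "Samsung", clusters)
not_synonyms = exclusivity_score("삼성", "TV", clusters)
```

On the two-name cluster from the text, "삼성" and "Samsung" score 1.0, while "삼성" and "TV" score 0.0. A pair whose score reaches a preset threshold would then be added to the synonym dictionary.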

FIG. 3 is a flowchart of converting a search term entered by the user using the synonym dictionary. It shows that, for the vectors of all product pages collected as the result of a query search term, if word i is present and words i and j are in the synonym dictionary, word j is added to the product page vector.

The web pages collected as the result of a query search term are converted into character vectors (step 310), and each word present in a converted web page is looked up in the synonym dictionary to check whether it has a synonym (step 320). If a synonym exists in the synonym dictionary, the vector element value of the synonym is added to the character vector according to Equation 2 (step 330).

That is, because the synonym fitness between word i and word j lies between 0 and 1, the element value of word i is multiplied by this fitness and the product is added to the element value of word j. Clustering is then performed on the product page vectors converted through this process.
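The expansion of steps 310 to 330 can be sketched as follows. Equation 2 itself is missing from this text, so the update rule here (the element of synonym j is increased by the fitness times the element of word i) is an assumption drawn from the surrounding description. The dictionary layout and the fitness value 0.9 are hypothetical.

```python
from typing import Dict

Vector = Dict[str, float]
SynonymDict = Dict[str, Dict[str, float]]   # word -> {synonym: fitness in [0, 1]}


def expand_vector(vector: Vector, synonyms: SynonymDict) -> Vector:
    """Add fitness-weighted synonym elements to a product page vector."""
    expanded = dict(vector)
    # Iterate over the original words only, so added synonyms do not cascade.
    for word, value in vector.items():
        for synonym, fitness in synonyms.get(word, {}).items():
            expanded[synonym] = expanded.get(synonym, 0.0) + fitness * value
    return expanded


synonyms = {"삼성": {"Samsung": 0.9}, "Samsung": {"삼성": 0.9}}
page = {"삼성": 1, "TV": 1}
expanded = expand_vector(page, synonyms)
```

After expansion, a page labeled "삼성 TV" shares a "Samsung" element with a page labeled "Samsung TV", which raises their cosine similarity during clustering.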

FIG. 4 is a flowchart of an embodiment that converts a search term entered by the user into an expanded search term.

The synonym dictionary can also be used for query expansion when the user searches clusters that have already been built. When the user enters the search term "지펠 냉장고" (Zipel refrigerator) (step 410), the synonym dictionary is searched (step 420) to check whether it contains the search term (step 430). If "지펠" and "Zipel" are stored in the synonym dictionary, the synonym "Zipel" is extracted (step 441) and the query is expanded to "지펠 Zipel 냉장고" before searching (step 450), so Zipel refrigerators labeled "Zipel" are not missed and search accuracy improves. If the synonym dictionary contains no synonym, the search is run with "지펠 냉장고", and Zipel refrigerators labeled "Zipel" may not be found.
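The query expansion of steps 410 to 450 can be sketched as follows. The function name and the dictionary layout (each word mapped to a list of its synonyms) are assumptions made for illustration.

```python
from typing import Dict, List


def expand_query(query: str, synonyms: Dict[str, List[str]]) -> str:
    """Insert each word's synonyms right after it, skipping words already present."""
    expanded: List[str] = []
    for word in query.split():
        for candidate in [word] + synonyms.get(word, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return " ".join(expanded)


synonyms = {"지펠": ["Zipel"], "Zipel": ["지펠"]}
expanded_query = expand_query("지펠 냉장고", synonyms)
unchanged_query = expand_query("지펠 냉장고", {})
```

With the dictionary entry present, the query becomes "지펠 Zipel 냉장고"; with an empty dictionary it is passed through unchanged.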

FIG. 5 shows an embodiment in which product pages are collected from each shopping mall site and stored in a database, and identical/similar products are listed in priority order from the stored data. It is a window dialog listing the products stored in the price comparison site's database for which the "simple model number match algorithm" failed and which require direct correction by the administrator.

FIG. 6 shows an embodiment in which the similar products of FIG. 5 are sorted in order. When "샤프전자수첩" (Sharp electronic organizer) is clicked, the products are re-sorted in order of similarity to assist the administrator's manual work. The products are listed by similarity as determined by comparing cosine values.

FIG. 7 shows an embodiment in which the search results for a query entered by the user are reported to the user without clustering. When the user enters a query using a keyword (for example, artist name = 서태지 (Seo Taiji)), the search results are first reported to the user in unclustered form.

FIG. 8 shows the search results of FIG. 7 after clustering, with identical albums grouped together.

In album search, some music sales sites do not write the album title and instead use notations such as "1집" (first album) or "2집" (second album). To handle this, when the "[number]집" notations match, the album titles are ignored and the items are grouped as the same album; FIG. 8 is the result of clustering in this way.
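The "[number]집" rule can be sketched as a grouping key. This is a hypothetical illustration: the function name, the fallback to a normalized title, and the sample titles are all assumptions, not details from the patent.

```python
import re
from typing import Tuple

# Matches notations such as "1집" or "2 집" (the number followed by 집).
_VOLUME = re.compile(r"(\d+)\s*집")


def album_key(artist: str, title: str) -> Tuple[str, str]:
    """Group by the "[number]집" notation when present, ignoring the rest of the title."""
    match = _VOLUME.search(title)
    if match:
        return (artist, f"{int(match.group(1))}집")
    # Assumption: without a volume notation, fall back to the normalized title.
    return (artist, " ".join(title.split()).lower())


key_a = album_key("서태지", "서태지 2집")
key_b = album_key("서태지", "2집 (재발매)")
key_c = album_key("서태지", "1집")
```

Listings that share a key would be placed in the same cluster before or instead of the cosine comparison.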

The present invention can also be embodied as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, hard disks, floppy disks, flash memory, and optical data storage devices, as well as media embodied in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium can also be distributed over computer systems connected by a network, so that the computer-readable code is stored and executed in a distributed manner.

As described above, according to the present invention, from the standpoint of a price comparison agent, identical/similar products can be classified and their prices compared without the active cooperation of shopping mall sites. From the user's standpoint, identical/similar products are automatically grouped and compared instead of being presented as disordered keyword search results, which greatly improves user convenience.

In addition, automatic construction of the synonym dictionary yields more accurate clustering, and the dictionary is built automatically by machine learning from a small amount of existing identical-product information without special intervention by the administrator. Furthermore, as the identical-product information collected using the synonym dictionary grows, the vocabulary of the synonym dictionary becomes richer, so that identical-product recognition can be performed for a wider variety of products.

FIG. 1 is a block diagram of the apparatus for clustering identical/similar products using a vector document model according to the present invention.

FIG. 3 is a flowchart of converting a search term entered by the user into an expanded search term using the synonym dictionary.

FIG. 5 shows an embodiment in which product pages are collected from each shopping mall site and stored in a database, and identical/similar products are listed in priority order from the stored data.

FIG. 6 shows an embodiment in which the similar products of FIG. 5 are sorted in order.

FIG. 7 shows an embodiment in which the search results for a query entered by the user are reported to the user without clustering.

Claims

An apparatus for clustering identical/similar products using a vector document model, comprising: a web page collecting unit that collects, from shopping mall sites, web pages containing character strings corresponding to product information;

a single-cluster forming unit that, for each web page, extracts words from the character string contained in the web page and forms a cluster, a cluster being a classification unit for product information;

a vector conversion unit that forms a vector from the words extracted from the character string contained in each web page;

a similarity calculation unit that calculates the similarity between the vectors formed for the respective web pages; and

a cluster combining unit that combines the clusters formed by the single-cluster forming unit based on the calculated similarity.

The apparatus of claim 1,

further comprising a character replacing unit that removes special characters contained in the web pages and forms character strings consisting only of ordinary characters.

The apparatus of claim 1,

wherein the single-cluster forming unit forms a set of words representing the product name extracted for each web page.

The apparatus of claim 1,

wherein the vector conversion unit forms a sparse matrix whose vector element values are the counts of the words extracted for each web page.

The apparatus of claim 1,

wherein the similarity calculation unit forms a center vector for each cluster converted into vectors and calculates the inner products between the center vectors.

The apparatus of claim 5,

wherein the cluster combining unit combines the two clusters whose center vectors have the largest inner product into a single cluster.

The apparatus of claim 6,

wherein the cluster combining unit continues combining clusters until the inner product between the center vectors reaches a predetermined threshold.

The apparatus of claim 1,

further comprising a synonym storage unit that calculates the fitness between the extracted words and registers them as synonyms.

The apparatus of claim 1,

further comprising an information providing unit that provides a client with the cluster corresponding to a query search term received from a user.

A method for clustering identical/similar products using a vector document model, the method comprising:

(a) collecting, from shopping mall sites, web pages that contain strings corresponding to product information;

(b) extracting, for each web page, words from the strings contained in that web page to form a cluster, which is a classification unit for product information;

(c) forming a vector from the words extracted from the strings contained in each web page;

(d) calculating the similarity between the vectors formed for each of the web pages; and

(e) combining the clusters formed by the single cluster forming unit based on the calculated similarity.

The method of claim 10,

further comprising, before step (b), removing special characters contained in the web pages and forming strings consisting only of ordinary characters.
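
As an illustration of the character-removal step recited above, the following Python sketch strips special characters from a product string. The claim does not define which characters count as special; treating everything other than letters, digits, and whitespace as special is an assumption, and the function name `strip_special_characters` and the sample strings are hypothetical.

```python
import re


def strip_special_characters(text: str) -> str:
    """Return text with special characters replaced by single spaces.

    Assumption: a "special character" is anything that is not a letter,
    a digit, or whitespace (the underscore is also treated as special).
    """
    # \w matches Unicode letters and digits (including Hangul) plus "_",
    # so "[^\w\s]|_" matches every remaining symbol and the underscore.
    cleaned = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return " ".join(cleaned.split())


if __name__ == "__main__":
    # Hypothetical product string as it might appear on a shopping page.
    print(strip_special_characters('[ACME] X-100 (17")'))
```

Under this assumption a hyphenated model code such as `X-100` splits into two tokens; an implementation that needs to keep model codes intact would need a narrower definition of special characters.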

The method of claim 10,

wherein step (b) forms a set of words representing the product name extracted for each web page.

The method of claim 10,

wherein step (c) forms a sparse matrix in which the counts of the words extracted for each web page are the element values of the vectors.
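
A minimal sketch of how word counts per web page could be arranged as a sparse matrix, assuming one row per page and one column per distinct word. The row-of-dictionaries layout, the function name `build_sparse_matrix`, and the sample words are illustrative assumptions, not details taken from the claims.

```python
from collections import Counter


def build_sparse_matrix(word_lists):
    """Build a sparse word-count matrix from per-page word lists.

    Returns (vocab, rows): vocab maps each word to a column index, and
    rows[p] maps column index -> count for page p. Zero entries are
    simply absent, which is what makes the representation sparse.
    """
    vocab = {}
    rows = []
    for words in word_lists:
        row = {}
        for word, count in Counter(words).items():
            # Assign the next free column to a word seen for the first time.
            col = vocab.setdefault(word, len(vocab))
            row[col] = count
        rows.append(row)
    return vocab, rows


if __name__ == "__main__":
    # Hypothetical words extracted from two product pages.
    pages = [["acme", "x100", "monitor"], ["acme", "x100", "x100"]]
    vocab, rows = build_sparse_matrix(pages)
    print(vocab)
    print(rows)
```

Storing only the nonzero counts keeps each row proportional to the handful of words in a product name rather than to the size of the whole vocabulary.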

The method of claim 10,

wherein step (d) forms a center vector for each cluster converted into the vectors and calculates the inner product between the center vectors.

The method of claim 10,

wherein step (e) combines the two clusters whose center vectors have the largest inner product into a single cluster.

The method of claim 15,

wherein step (e) continues combining clusters until the magnitude of the inner product between the center vectors reaches a predetermined threshold.
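
The merging procedure described in the preceding claims (center vectors, inner products, merging the closest pair, stopping at a threshold) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: the center vector is taken to be the unnormalized mean of a cluster's vectors, merging is assumed to stop once the largest inner product falls below the threshold, and all function names and sample data are hypothetical.

```python
def center_vector(cluster):
    """Mean of the sparse vectors (dicts of word -> count) in a cluster."""
    total = {}
    for vec in cluster:
        for key, value in vec.items():
            total[key] = total.get(key, 0.0) + value
    return {key: value / len(cluster) for key, value in total.items()}


def inner_product(a, b):
    """Inner product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(value * b.get(key, 0.0) for key, value in a.items())


def merge_clusters(vectors, threshold):
    """Merge the closest pair of clusters until no pair reaches threshold."""
    clusters = [[vec] for vec in vectors]  # start with one cluster per page
    while len(clusters) > 1:
        centers = [center_vector(c) for c in clusters]
        # Find the pair of clusters whose center vectors have the
        # largest inner product.
        best, (i, j) = max(
            (inner_product(centers[i], centers[j]), (i, j))
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:  # assumed stopping rule
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Hypothetical word-count vectors for three product pages.
    pages = [
        {"acme": 1, "x100": 1},
        {"acme": 1, "x100": 2},
        {"zeta": 1, "q7": 1},
    ]
    for cluster in merge_clusters(pages, threshold=1.0):
        print(cluster)
```

Because the inner product here is not length-normalized, pages with more words tend to score higher; whether the center vectors are normalized is not stated in the claims.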

The method of claim 10,

further comprising calculating the fitness between the extracted words and registering them as synonyms.

The method of claim 17,

wherein the fitness between the extracted words is calculated by the following equation,

[Equation: not reproduced in the source text]

where the terms of the equation denote, respectively: the k-th cluster among the product clusters in which word i or word j appears in the product notation; the set of product names in that cluster in which word i appears in the product notation and word j does not; and the set of product names in that cluster in which word i does not appear in the product notation and word j does; and where i and j each denote a word used to write a product name.

A computer-readable recording medium on which is recorded a program for causing a computer to execute the method of any one of claims 10 to 18.