KR20020015838A

KR20020015838A - Method for re-adjusting ranking of document to use user's profile and entropy

Info

Publication number: KR20020015838A
Application number: KR1020000048963A
Authority: KR
Inventors: 최준혁
Original assignee: 전홍건; 학교법인 통진학원
Priority date: 2000-08-23
Filing date: 2000-08-23
Publication date: 2002-03-02
Also published as: KR100378240B1

Abstract

PURPOSE: A method for adjusting document rankings using entropy and a user profile is provided to increase search accuracy by using a user profile together with an entropy value computed from probability information between the user's query terms and the subject terms of each document. CONSTITUTION: A user enters the query to be searched (S10). The probability-vector information content of each word of the query is calculated (S20). A user profile composed of keywords and their usage frequencies is designed so that the user's preferences are reflected when document rankings are adjusted (S30). An entropy value is calculated between the user profile and the index terms extracted from the search results for the user's subject terms (S40). The documents are then re-ranked and output in ascending order of the calculated entropy value (S50).

Description

Document rank adjustment method applying entropy and a user profile {Method for re-adjusting ranking of document to use user's profile and entropy}

The present invention relates to a document rank adjustment method applying entropy and a user profile, and more particularly to a method that improves search accuracy in an information retrieval system by using a user profile together with entropy values derived from probability information between the user's query terms and the subject terms of each document.

The invention further relates to such a method in which the entropy value is computed from mutual information based on co-occurrence information between words, and in which a weighting scheme based on Bayesian learning is combined with inverse-document-frequency weighting when the entropy is computed, so as to improve the accuracy of the retrieved documents.

The invention still further relates to such a method in which a user profile, consisting of keywords used in previous searches and their usage frequencies, is designed and reflected in the entropy calculation so as to capture the recent interests and search behavior of users of the information retrieval system and the semantic information of the user's query; in which, to limit the search delay caused by time complexity, only a set number of the previous query terms with the highest frequencies in the user profile take part in the actual entropy calculation; and in which the final ranking of the retrieved documents is readjusted in order of increasing total entropy.

With the recent spread of computers and the growth of the Internet, a vast amount of information exists on the Internet in the form of web documents. These documents are distributed across many sites, and the information they present changes dynamically. It is therefore not easy for users to find the material they want among such scattered information.

In general, an information retrieval system collects the needed information, analyzes its content, processes it into an easily searchable form, and returns the relevant information as search results when a user's information need arises. One of the important functions of such a system is not merely to retrieve the set of documents that satisfy the user's query, but to rank the retrieved documents by how well they satisfy the query, thereby minimizing the time users need to obtain the information they require.

Among the various types of information retrieval models, conceptual models can be divided by retrieval technique into exact match and inexact match methods. Exact match methods include text pattern search and the Boolean model; inexact match methods include the probabilistic model, the vector space model, and the clustering model. These categories are not mutually exclusive, so two or more models can be combined in one design.

Among the conceptual models of information retrieval, content retrieval in particular has been studied extensively. The approaches fall broadly into full text scanning, inverted index files, signature files, and clustering.

FIG. 1 shows one form of a general web information retrieval system. First, a document identifier is assigned to each web document collected by a web robot. Noun-phrase analysis based on morphological analysis is then performed on all collected documents to extract index term candidates. Each index term candidate is assigned a term weight based on inverse document frequency, and an inverted index file is built on that basis.
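The indexing pipeline described for FIG. 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the morphological analysis has already produced a list of index terms per document, and it uses the common weight log(N / df) for inverse document frequency, which the text does not spell out. The names `build_inverted_index`, `postings`, and `idf` are hypothetical.

```python
import math
from collections import defaultdict


def build_inverted_index(docs):
    """docs maps a document identifier to its extracted index terms.

    Returns (postings, idf), where postings maps each term to
    {doc_id: term frequency} and idf maps each term to log(N / df).
    The log(N / df) form is an assumption, not taken from the patent.
    """
    postings = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    n_docs = len(docs)
    idf = {term: math.log(n_docs / len(entry)) for term, entry in postings.items()}
    return dict(postings), idf


# Toy collection standing in for documents gathered by a web robot.
docs = {
    1: ["entropy", "ranking", "profile"],
    2: ["entropy", "index"],
    3: ["ranking", "index", "index"],
}
postings, idf = build_inverted_index(docs)
```

A term that occurs in a single document (here `profile`) receives the largest weight, while a term present in every document would receive a weight of zero.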

In most commercial information retrieval systems, which are designed on the Boolean model, each document is represented by a list of index terms made up of its subject terms. A user's information need is expressed as a query, that is, an expression used to test for the presence or absence of the subject terms that represent each document.

In the Boolean model, most systems use a common evaluation function and selection criterion for documents that satisfy a user query. Most queries expressing a user's information need are written in Boolean logic, and a document is judged relevant or not according to whether the index terms contained in the Boolean query are present in it.

Most Boolean models use an inverted index file. An information retrieval model based on an inverted index file builds, over all documents collected by the web robot, an inverted index list containing the subject terms and the list identifiers of the documents, and searches a file in which this list is sorted by the alphabetical order of the key terms or by the order of the Korean consonants and vowels. Search results are obtained according to whether the query terms reflecting the user's information need are present in that file.
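Boolean lookup against a sorted inverted file can be sketched as below. The file contents, the binary-search lookup, and the function names `lookup`, `boolean_and`, and `boolean_or` are illustrative assumptions; the text only states that the file is sorted by term and that matching depends on term presence.

```python
from bisect import bisect_left

# Hypothetical inverted file: (term, document ids), kept sorted by term.
INVERTED_FILE = [
    ("entropy", [1, 2]),
    ("index", [2, 3]),
    ("profile", [1]),
    ("ranking", [1, 3]),
]
TERMS = [term for term, _ in INVERTED_FILE]


def lookup(term):
    """Binary-search the sorted term list and return the matching document ids."""
    i = bisect_left(TERMS, term)
    if i < len(TERMS) and TERMS[i] == term:
        return set(INVERTED_FILE[i][1])
    return set()


def boolean_and(*terms):
    """Documents that contain every query term."""
    sets = [lookup(t) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []


def boolean_or(*terms):
    """Documents that contain at least one query term."""
    return sorted(set().union(*(lookup(t) for t in terms)))
```

Note that the result is an unweighted set: every matching document is equally "relevant", which is the limitation the following paragraphs discuss.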

A Boolean model that uses an inverted index file has difficulty expressing and reflecting the user's information need accurately, and the number of documents in the result is determined by the number of documents that contain the query terms. Such systems generally do not consider weights reflecting the importance of the user query or of each document's index terms. In addition, results are returned in the order of the inverted index file designed in advance by the system designer, regardless of the user's intent, and so do not accurately reflect the semantic information of the query. In the Boolean model, therefore, the set of documents to be searched can be controlled only through the limited means the system provides; most results fail to satisfy the user's intent and are presented in an order unrelated to it. The Boolean model provides very powerful online search capabilities to librarians and expert users who are familiar with the system, but most casual users, who use the system infrequently, cannot expect that level of service.

Most casual users know the terms in the data set they want to search but, lacking training and practice, are not accustomed to the compound queries that Boolean systems require, so good results are hard to expect.

It follows that the information need of a user of a web search engine should be reflected accurately in the query, the web documents containing the query terms should be retrieved, and the results should then be ranked in the order that most accurately reflects the user's intent. In practice, however, most web search engines often place results that are entirely unrelated to the user's intent at the top. This appears to stem from insufficient effort to reflect the user's information need accurately. To design a web search engine that does reflect it accurately, the system must be designed and implemented more intelligently.

The present invention was devised to solve the above problems. Its object is to provide a document rank adjustment method applying entropy and a user profile that improves search accuracy in an information retrieval system by using a user profile together with entropy values derived from probability information between the user's query terms and the subject terms of each document.

Another object of the invention is to provide such a method in which the entropy value is computed from mutual information based on co-occurrence information between words, and in which a weighting scheme based on Bayesian learning is combined with inverse-document-frequency weighting when the entropy is computed, so as to improve the accuracy of the retrieved documents.

Yet another object of the invention is to provide such a method in which a user profile, consisting of keywords used in previous searches and their usage frequencies, is designed and reflected in the entropy calculation so as to capture the recent interests and search behavior of users of the information retrieval system and the semantic information of the user's query; in which, to limit the search delay caused by time complexity, only a set number of the previous query terms with the highest frequencies in the user profile take part in the actual entropy calculation; and in which the final ranking of the retrieved documents is readjusted in order of increasing total entropy.

FIG. 1 is a diagram of one form of a general web information retrieval system.

FIG. 2 is a flowchart of the document rank adjustment method applying entropy and a user profile according to the present invention.

FIG. 3 is a schematic diagram of a web information retrieval system to which the present invention is applied.

FIG. 4 is a graph of one embodiment of the entropy function of the present invention.

FIG. 5 shows one embodiment of the mutual information module according to the present invention.

FIG. 6 shows one embodiment of the mutual information calculation module according to the present invention.

FIG. 7 shows one embodiment of a user profile reflecting the usage frequency of user query terms.

FIG. 8 shows one embodiment of entropy-based document rank adjustment according to the present invention.

FIG. 9 shows one embodiment of the morphological analyzer module for index term extraction.

FIG. 10 is a graph of one embodiment of the discriminating power of index terms and documents.

FIG. 11 shows one embodiment of the term frequency and document frequency module according to the present invention.

FIG. 12 shows one embodiment of the inverted index file generation module according to the present invention.

FIG. 13 shows one embodiment of the inverted file data structure of the present invention.

FIG. 14 is a comparison table of index term data structures.

FIG. 15 shows one embodiment of the extended inverted file data structure including semantic information according to the present invention.

FIG. 16 is a conceptual diagram of the entropy calculation between the user query and user profile and the index terms of each document according to the present invention.

FIG. 17 shows one embodiment of the algorithm for calculating entropy values according to the present invention.

To achieve the above objects, the document rank adjustment method applying entropy and a user profile according to the present invention comprises:

recording a query that the user wants to search;

calculating, by means of a given information theory, the probability-vector information content of each word of the query, reflecting the semantic information of the user's query and based on co-occurrence information between the words of the query;

designing a user profile consisting of the keywords used in the most recent searches and their usage frequencies, so that the user's preferences are reflected when document rankings are adjusted;

performing an entropy calculation between the user profile and query terms and the index terms of each document extracted from the search results for the user's subject terms; and

re-ranking the documents in ascending order of the entropy value calculated in the preceding step and outputting them,

wherein the user profile takes the form of a circular linked list of the existing keywords and their usage frequencies, kept according to each user's ID and password, and is sorted by usage frequency when the user logs in, and

wherein, to increase accuracy, the entropy calculation computes and applies weights obtained from the inverse document frequency of each of the user's subject terms and from Bayesian learning, and is performed on that basis.

In a preferred embodiment of the present invention, the information content in the given information theory represents the degree of uncertainty about the message to be transmitted: the greater the uncertainty, the greater the information content. The uncertainty is expressed as the average information content obtained by taking the reciprocal of each message's probability of occurrence, taking the logarithm of that value, and weighting the result by the message's probability of occurrence.
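Written out, the verbal definition above corresponds to the standard Shannon quantities. The notation is assumed for illustration: $m_i$ denotes a message, $p_i$ its probability of occurrence, and $n$ the number of possible messages; the text does not state the base of the logarithm.

```latex
I(m_i) = \log \frac{1}{p_i},
\qquad
H = \sum_{i=1}^{n} p_i \log \frac{1}{p_i} = -\sum_{i=1}^{n} p_i \log p_i
```

For example, with two equally likely messages and a base-2 logarithm, $H = 1$ bit; as one of the two probabilities approaches 1, $H$ falls toward 0, reflecting the loss of uncertainty.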

In a preferred embodiment of the present invention, the entropy calculation uses a form in which the logarithm of the average information content is taken.

Hereinafter, a preferred embodiment of the document rank adjustment method applying entropy and a user profile according to the present invention is described in detail with reference to the accompanying drawings.

FIG. 2 is a flowchart of the document rank adjustment method applying entropy and a user profile according to the present invention; FIG. 3 is a schematic diagram of a web information retrieval system to which the invention is applied; FIG. 4 is a graph of one embodiment of the entropy function; FIG. 5 shows one embodiment of the mutual information module; FIG. 6 shows one embodiment of the mutual information calculation module; and FIG. 7 shows one embodiment of a user profile reflecting the usage frequency of user query terms. FIG. 8 shows one embodiment of entropy-based document rank adjustment; FIG. 9 shows one embodiment of the morphological analyzer module for index term extraction; FIG. 10 is a graph of one embodiment of the discriminating power of index terms and documents; FIG. 11 shows one embodiment of the term frequency and document frequency module; FIG. 12 shows one embodiment of the inverted index file generation module; and FIG. 13 shows one embodiment of the inverted file data structure. FIG. 14 is a comparison table of index term data structures; FIG. 15 shows one embodiment of the extended inverted file data structure including semantic information; FIG. 16 is a conceptual diagram of the entropy calculation between the user query and user profile and the index terms of each document; and FIG. 17 shows one embodiment of the algorithm for calculating entropy values.

Referring to FIG. 2, the document rank adjustment method applying entropy and a user profile according to the present invention comprises: recording a query that the user wants to search (S10); calculating, by means of a given information theory, the probability-vector information content of each word of the query, reflecting the semantic information of the user's query and based on co-occurrence information between the words of the query (S20); designing a user profile consisting of the keywords used in the most recent searches and their usage frequencies, so that the user's preferences are reflected when document rankings are adjusted (S30); performing an entropy calculation between the user profile and query terms and the index terms of each document extracted from the search results for the user's subject terms (S40); and re-ranking the documents in ascending order of the entropy value calculated in step S40 and outputting them (S50). The user profile of step S30 takes the form of a circular linked list of the existing keywords and their usage frequencies, kept according to each user's ID and password, and is sorted by usage frequency when the user logs in. To increase accuracy, the entropy calculation of step S40 computes and applies weights obtained from the inverse document frequency of each of the user's subject terms and from Bayesian learning, and is performed on that basis.
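The control flow of steps S10 to S50 can be sketched as follows, under several loudly stated assumptions. The profile is held in a `collections.Counter` rather than the circular linked list the method specifies; step S20 and the weighting are omitted; and the entropy function is passed in as a parameter, with `toy_entropy` serving only as a placeholder that is not the patent's entropy measure. All function and variable names are hypothetical.

```python
from collections import Counter


def top_profile_keywords(profile, k):
    """Return the k most frequently used past keywords from the profile."""
    return [keyword for keyword, _ in profile.most_common(k)]


def rerank(query_terms, profile, results, entropy_fn, k=3):
    """Re-rank initial search results in ascending order of entropy.

    results maps a document id to its index terms.
    entropy_fn(terms, index_terms) returns a float; lower means more similar.
    """
    # S10 + S30: query terms plus the k highest-frequency profile keywords.
    terms = list(dict.fromkeys(list(query_terms) + top_profile_keywords(profile, k)))
    # S40: one entropy value per retrieved document.
    scored = [(entropy_fn(terms, index_terms), doc_id)
              for doc_id, index_terms in results.items()]
    # S50: smallest entropy first.
    return [doc_id for _, doc_id in sorted(scored)]


def toy_entropy(terms, index_terms):
    """Placeholder only: counts query/profile terms missing from the document."""
    return sum(1 for term in terms if term not in index_terms)


profile = Counter({"java": 5, "coffee": 1, "island": 2})
results = {
    "d1": ["java", "island", "travel"],
    "d2": ["java", "language", "compiler"],
    "d3": ["coffee", "bean"],
}
ranking = rerank(["java"], profile, results, toy_entropy, k=2)
```

Limiting the profile to its `k` most frequent keywords mirrors the stated aim of bounding the cost of the entropy calculation; here the past interest in `island` pushes the travel document ahead of the programming one for the ambiguous query `java`.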

The operation of the document rank adjustment method applying entropy and a user profile according to the present invention, configured as above, is described below.

Before that operation is described, the technical background closely related to the method is reviewed.

Using document ranking when searching for documents makes a retrieval system more user-oriented. In such a system the user enters a simple query such as a sentence or phrase rather than Boolean connectives, and retrieves a list of documents ranked in order of relevance to the user's intent. One representative model is the vector space model.

In the vector space model, each document and each user query is represented as a vector in an N-dimensional space, where N is the number of subject terms present in each document. The matching function between the user query and each document is measured by the semantic distance given by their similarity. In Salton's SMART system, the similarity between a query and a document is calculated from the cosine of the angle between their vectors, and the results are returned to the user in order of decreasing similarity.
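Cosine ranking of this kind can be illustrated with sparse term-weight vectors. The sketch below is generic and does not reproduce SMART's weighting; the dictionaries, weights, and the function name `cosine` are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


query = {"entropy": 1.0, "ranking": 1.0}
docs = {
    "d1": {"entropy": 2.0, "ranking": 1.0},
    "d2": {"index": 1.0},
    "d3": {"ranking": 3.0},
}
# Documents are returned in order of decreasing similarity to the query.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

Because the cosine normalizes by vector length, `d3` is not favored merely for carrying a large weight on a single matching term.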

Computing the similarity for every document delays the search because of the computational complexity. Two approaches address this problem: consulting the inverted index file so that only documents containing subject terms that satisfy the user query are examined, and clustering all documents in advance by semantic similarity so that only the cluster semantically closest to the user query is searched. The latter searches only the document cluster related to the topic of the information need instead of the entire collection, and is expected to save search time and improve retrieval efficiency.

Document clustering forms clusters by using the index terms assigned to documents, or mechanically extracted keywords, as identifiers of document content. Each cluster formed in this way has a cluster profile that represents it; at search time the information need is compared with each cluster profile, and the cluster most similar to the information need is selected.
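Matching a query against cluster profiles might look like the sketch below. Both design choices are assumptions made for illustration, since the text does not define them: the cluster profile is taken to be the few terms that occur in the most member documents, and the comparison uses Jaccard overlap. The names `cluster_profile`, `jaccard`, and `select_cluster` are hypothetical.

```python
from collections import Counter


def cluster_profile(member_docs, size=3):
    """Assumed profile: the `size` terms found in the most member documents."""
    counts = Counter(term for terms in member_docs for term in dict.fromkeys(terms))
    return {term for term, _ in counts.most_common(size)}


def jaccard(a, b):
    """Overlap of two term collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def select_cluster(query_terms, clusters):
    """Return the name of the cluster whose profile best matches the query."""
    profiles = {name: cluster_profile(docs) for name, docs in clusters.items()}
    return max(profiles, key=lambda name: jaccard(query_terms, profiles[name]))


clusters = {
    "programming": [["java", "compiler", "language"], ["java", "language", "syntax"]],
    "travel": [["java", "island", "travel"], ["island", "travel", "beach"]],
}
```

Only the documents of the selected cluster then need to be scored individually, which is where the saving in search time comes from.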

Applying document clustering to web information retrieval rests on the hypothesis that closely related documents tend to be relevant to the same information needs. Under this hypothesis, documents of similar content that belong to the same cluster are likely to be relevant to the same query, so a clustering technique that gathers similar documents into the same cluster can partition the whole collection into several clusters.

Document clustering has been studied quite actively; representative results include studies of sequential cluster search and cluster document search. Cluster-based retrieval methods are generally reported to be superior in terms of physical disk use and retrieval efficiency. However, most clustering algorithms take a long time to form clusters, and they suffer from several problems, such as poor performance in retrieval efficiency or search time and undesirable properties of the resulting clusters. In practice it is hard to apply such algorithms effectively to large collections, and most systems apply them only experimentally to a few hundred documents. Research is therefore moving toward applying the clustering algorithm only to the documents that satisfy the user query rather than to the whole collection, which removes the clustering-time problem, and toward clustering the candidate documents according to the meaning of the user query so that the cluster properties are satisfied.

Most existing research on improving the accuracy of Korean information retrieval systems concentrates on the processing of nouns and compound nouns for accurate index term extraction. One such study, rather than building the retrieval system only from the subject terms that represent a document, took into account the ambiguity caused by homonyms and derived words, one of the characteristics of Korean, and introduced the concept of a keyfact, which includes noun phrases and simple sentences in addition to individual subject terms. A keyfact denotes a fact that the user wants to find in a document. Keyfact extraction, however, requires large verb and adjective dictionaries in addition to a noun dictionary, and therefore demands considerable time and effort.

Another study uses a thesaurus-based ranking algorithm to express the degree to which documents satisfy a user query in a Boolean retrieval system. A thesaurus is a kind of lexical classification dictionary that expresses words as conceptual relations according to their meaning, representing the relations between concepts as a hierarchy using specific relations such as hierarchical, whole-part, and associative relations. A thesaurus is used during indexing to select appropriate index headings and to control index terms, and during retrieval to select appropriate search terms. Thesaurus-based retrieval thus contributes to retrieval efficiency through search term expansion, beyond its purpose of vocabulary control.

In a thesaurus-based information retrieval system, the index terms of a document are selected from the thesaurus, so documents with the same content can be retrieved through the same index terms regardless of the particular wording of each document, and the association information between index terms can be used to increase the recall of the system. However, thesaurus-style lexical hierarchies are built according to word meaning, so they sometimes differ from the usage actually found in a corpus. Using similarity derived from the lexical hierarchy as-is during retrieval can therefore raise recall at the cost of accuracy.

One thesaurus-based information retrieval system proposes a two-stage document ranking model that uses mutual information to improve accuracy in natural-language information retrieval. In that study, the second-stage ranking adjustment computes the mutual information between the search terms in the query and the subject terms of each document, and re-ranks the documents on that basis.

To adjust the ranking of the search results for a user query, the present invention uses an entropy-based similarity measure built on mutual information, which quantifies the association between words. Here, entropy is computed from the average amount of information, that is, the degree of uncertainty carried by a message; the smaller the entropy value, the greater the similarity between two documents.

The entropy calculation in the present invention takes the logarithm of the average information value. As the nature of the logarithm function suggests, taking the logarithm of a set of values makes the result less sensitive to changes in those values. For example, for the four values 10, 100, 1000, and 10000, the difference between the largest and the smallest is 9,990. After taking the logarithm they become 1 (= log 10), 2 (= log 100), 3 (= log 1000), and 4 (= log 10000), and the difference between the largest and the smallest is only 3. Thus, although mutual information and an entropy that uses mutual information as its information quantity rest on the same notion of similarity, they respond with different sensitivity to changes in the amount of information.
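The compression effect described above can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are not from the source, and base-10 logarithms are assumed because the example values are powers of ten.

```python
import math

# The four example values from the text.
values = [10, 100, 1000, 10000]

# Spread of the raw values: largest minus smallest.
raw_spread = max(values) - min(values)

# Spread after taking base-10 logarithms (assumed base).
log_values = [math.log10(v) for v in values]
log_spread = max(log_values) - min(log_values)

print(raw_spread)   # spread of the raw values
print(log_spread)   # spread of the log-transformed values
```

The raw spread is several orders of magnitude larger than the spread of the logarithms, which is the insensitivity the paragraph refers to.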

General search engines cannot understand queries written in natural language, and cannot accurately process document content using semantic knowledge of the language or subject-domain knowledge. Most search engines also lack inference capability and cannot exploit prior information about the user, among other problems. To address these problems, research has been attempted on intelligent information retrieval systems that use relevance feedback based on mutual information.

To give a search engine intelligence, it must be able to use structured knowledge in addition to raw data or information, and it must have natural-language understanding and the inference capability needed for problem solving. In other words, an intelligent search engine should be a knowledge-based system that draws on various knowledge bases and can perform appropriate inference using the knowledge it has built. The inference function can be described in the following three steps.

1) Inferring the relevance between an information need and documents using index knowledge

2) Performing appropriate inference using knowledge about the user

3) Inferring new query terms using subject knowledge

An example of the overall configuration of the Korean web information retrieval system used in the present invention is shown in FIG. 3.

Unlike existing Korean web information retrieval systems, the present invention computes mutual information, the degree of association between words, from a corpus in order to give the system intelligence.

In general, identifying what kind of information a user wants is very important, but actually modeling and implementing this involves many practical technical difficulties. It calls for an interface that infers the user's interests indirectly by analyzing the user's behavior and input, rather than relying on the conventional query-entry approach. Furthermore, to implement information filtering effectively by learning user preferences, several components are needed: a way to represent the user's information-use preferences and update them through learning, a way to represent information on the web effectively, and a way to perform filtering through learning.

In an information retrieval system, it is important to increase users' actual satisfaction with the system by ranking the documents that best match the user's information need at the top of the results, without degrading the retrieval and selection of user query terms or the recall. The purpose and scope of the present invention for increasing this satisfaction can be summarized as follows.

Existing Korean information retrieval systems built around inverted files reflect only whether the user's query terms are present in a document. Because this neglects context information, such systems are limited in their ability to accurately retrieve documents that are semantically related to the query. Most users of information retrieval systems want high accuracy rather than high recall, and in particular they want the documents that match their information need to be ranked at the top. To meet this requirement, the present invention computes the entropy between the user query terms and the subject terms of each document from probability information based on the mutual information between words. For accurate entropy calculation, weights based on a Bayesian estimate and on inverse document frequency are used, and a user profile composed of the user's recent query terms is designed and used. In addition, a formula for applying mutual information to the probability-information calculation is derived in order to obtain accurate entropy values. The entropy values computed in this way are used to adjust the ranking of the retrieved documents: the final results for a user query are re-ranked in ascending order of entropy. This ranking adjustment method may also be used as a standalone model.

The present invention proposes a document ranking adjustment algorithm that uses entropy and a user profile to improve the accuracy of information retrieval. The entropy calculation for ranking adjustment uses mutual information values based on co-occurrence information between words; these values serve to compute the probability information needed to measure the similarity between documents among the results retrieved for a user query. In addition, to reflect the semantic information of the user query in the entropy calculation, a user profile composed of the user's recent query terms is designed, and a weighting method based on Sparck Jones's inverse document frequency and Bayesian learning is proposed; these weights are incorporated into the entropy calculation to improve the accuracy of the ranking adjustment.
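The text names Sparck Jones's inverse document frequency but does not reproduce its formula. As a minimal sketch, assuming the commonly cited form idf = log(N / df), the weight could be computed as below. The function name and the choice of the natural logarithm are assumptions made for illustration; the Bayesian-learning weight is not sketched because the text gives no detail about it.

```python
import math

def inverse_document_frequency(total_docs: int, docs_with_term: int) -> float:
    """Assumed form idf = ln(N / df): rarer terms receive larger weights."""
    if total_docs <= 0 or not 0 < docs_with_term <= total_docs:
        raise ValueError("need 0 < docs_with_term <= total_docs")
    return math.log(total_docs / docs_with_term)

# A term found in every document carries no weight; a rare term carries more.
common = inverse_document_frequency(100, 100)
frequent = inverse_document_frequency(100, 10)
rare = inverse_document_frequency(100, 1)
print(common, frequent, rare)
```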

The operation of the document ranking adjustment method using entropy and a user profile according to the present invention will now be described in connection with the related technical background described above.

The present invention is preferably used to design a search ranking adjustment system based on entropy and a user profile in order to improve the accuracy of a Korean web information retrieval system. Unlike existing retrieval systems, which reflect only whether the user's query terms are present in the inverted file, the system to which the present invention is applied uses information theory to reflect the semantic information contained in the user query (S10, S20).

To implement the document ranking adjustment system based on entropy and a user profile, the present invention computes the probability-vector information of each word using information theory. This is done by computing mutual information based on co-occurrence information between words, for which a new formula is derived and used (S20).

In addition, to reflect the user's preferences when adjusting document rankings, a user profile is designed that consists of the keywords most recently used for searching and their usage frequencies (S30). The user profile takes the form of a circular linked list of the existing keywords associated with each user's ID and password together with their usage frequencies, and it is sorted by usage frequency when the user logs in.

Entropy is computed between the user profile and query terms on the one hand and, on the other, the index terms of each document extracted from the results retrieved for the user's subject terms. To improve accuracy, weights based on the inverse document frequency of each subject term and on Bayesian learning are computed and applied, and the entropy calculation is performed on that basis (S40). Finally, the documents are re-ranked in ascending order of entropy (S50).
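The final re-ranking step (S50) can be sketched as follows. This is a hedged illustration, not the patented procedure: it assumes each document has already been reduced to a normalized probability vector, and it does not show how that vector is built from the profile, the mutual information, and the weights. The names `entropy`, `rerank_by_entropy`, and the sample document ids are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def rerank_by_entropy(doc_probs):
    """Return document ids in ascending order of entropy (lowest first)."""
    scored = [(entropy(probs), doc_id) for doc_id, probs in doc_probs.items()]
    scored.sort()  # ties are broken by document id
    return [doc_id for _, doc_id in scored]

# Hypothetical probability vectors for three retrieved documents.
doc_probs = {
    "doc_a": [0.5, 0.5],
    "doc_b": [0.9, 0.1],
    "doc_c": [1.0],
}
order = rerank_by_entropy(doc_probs)
print(order)
```

Sorting on the pair (entropy, id) keeps the ordering deterministic when two documents have the same entropy.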

The information theory used in step S20 of the present invention is described below.

In information theory, the amount of information expresses the degree of uncertainty about the message to be transmitted. The greater the uncertainty, the greater the amount of information. Uncertainty is expressed as the average amount of information: the reciprocal of each message's probability of occurrence is taken, its logarithm is computed, and the results are averaged using each message's probability of occurrence as the weight. In other words, the average amount of information of a message quantifies the freedom of choice available when one message is selected from the various messages an information source can produce. The degree of uncertainty about which message will be selected is expressed as entropy, the average amount of information of the messages, and the uncertainty about a message can be determined from the information gained when the message is transmitted. The amount of information conveyed by a message is therefore measured by entropy; that is, information is expressed as uncertainty, or entropy.

Entropy H, which expresses the amount of uncertainty, can be defined as in Formula 1 below.

(Formula 1) $H = -\sum_{i=1}^{n} p_i \log p_i$

In Formula 1, the amount of information carried by message i, -log p_i (the logarithm of the reciprocal of p_i), is determined by the probability p_i that the message is selected. H is thus the average amount of information carried by the n messages, and p_i is the probability of occurrence of the i-th message.
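The probability-weighted average described in the text is the standard Shannon entropy, and a small sketch makes the definition concrete. The function name, the base-2 default, and the input checks are assumptions added for illustration; the text does not state the logarithm base.

```python
import math

def shannon_entropy(probs, base=2.0):
    """Average information -sum(p_i * log p_i) over n messages.

    Terms with p_i == 0 contribute nothing, following the usual convention.
    """
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p) / math.log(base) for p in probs if p > 0)

certain = shannon_entropy([1.0])                      # no uncertainty
fair_coin = shannon_entropy([0.5, 0.5])               # two equally likely messages
four_way = shannon_entropy([0.25, 0.25, 0.25, 0.25])  # four equally likely messages
print(certain, fair_coin, four_way)
```

Entropy grows as the choice among messages becomes less predictable, which matches the "freedom of choice" reading given in the text.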

In general, entropy H has the following mathematical properties.

FIG. 4 shows the entropy function H(p, 1-p). H(p, 1-p) takes its maximum value when p is uniformly distributed, that is, when p = 1/2.
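The claim that the two-message entropy peaks at p = 1/2 can be verified directly. The short derivation below is a sketch that assumes the natural logarithm; any other base only rescales H by a constant and leaves the location of the maximum unchanged.

```latex
\begin{aligned}
H(p,\,1-p) &= -p \ln p - (1-p)\ln(1-p), \qquad 0 < p < 1,\\[4pt]
\frac{dH}{dp} &= -\ln p - 1 + \ln(1-p) + 1 = \ln\frac{1-p}{p},\\[4pt]
\frac{dH}{dp} = 0 &\iff \frac{1-p}{p} = 1 \iff p = \tfrac{1}{2},\\[4pt]
\frac{d^{2}H}{dp^{2}} &= -\frac{1}{p} - \frac{1}{1-p} < 0
\quad\Longrightarrow\quad p = \tfrac{1}{2}\ \text{is a maximum},\\[4pt]
H\!\left(\tfrac12,\tfrac12\right) &= \ln 2 \ \text{nats} = 1\ \text{bit}.
\end{aligned}
```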

Mutual information and entropy, as used with the entropy and user profile according to the present invention, are described below.

Mutual information can be computed from the co-occurrence information of words, where co-occurrence means that two words appear together in a document. This appearance information is used to build the co-occurrence data.

FIG. 5 shows the co-occurrence information module used in the present invention. Given two words and a window size as input, it returns the number of documents containing the words, the number of co-occurrences in those documents, and the mutual information computed from this information.

FIG. 5 shows the module determining, for the keywords "engine" and "search" with a window size of 1,000, the document locations, the frequencies of occurrence, and the number of co-occurrences, and computing from these the mutual information that reflects the association between the two words. For each index term extracted from the test documents, co-occurring words are extracted within a window of fixed size (100 eojeol, that is, space-delimited word units) and their frequencies of occurrence are counted; once the list of co-occurring words has been built from this information, the mutual information between the index term and each co-occurring word is computed.
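Windowed co-occurrence counting of the kind described here can be sketched in a few lines. This is an assumption-laden illustration rather than the module of FIG. 5: the function name is hypothetical, the window is taken to extend on both sides of each occurrence of the index term, and the input is assumed to be an already tokenized document.

```python
from collections import Counter

def cooccurrence_counts(tokens, index_term, window):
    """Count the words found within `window` tokens of each occurrence of index_term."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != index_term:
            continue
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # skip the index term occurrence itself
                counts[tokens[j]] += 1
    return counts

tokens = ["search", "engine", "ranks", "documents", "search", "engine"]
counts = cooccurrence_counts(tokens, "search", window=1)
print(counts)
```

With a window of one token, "engine" is counted twice and "documents" once; widening the window pulls in more distant words.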

Mutual information is the amount of information about random variable X that is provided by random variable Y; that is, it quantifies the joint relationship between X and Y. Substituting the words under investigation for the values of X and Y makes it possible to express the dependency between two words quantitatively.

Mutual information is given by Formula 2, which expresses the association between two words quantitatively. Let f(x) and f(y) be the numbers of times words x and y occur in a corpus of size N, and let freq(x,y) be the relative frequency with which x and y are used together within a sentence. If N is sufficiently large, the exact mutual information of Formula 2 can be approximated by Formula 3, which uses relative frequencies in the corpus of size N.

(Formula 2)

(Formula 3)

The meaning and properties of Formula 3 can be summarized in the following six points.

First, if words x and y are related, the probability p(x,y) will be greater than p(x)*p(y), and consequently MI(x,y) will be greater than 0.

Second, if words x and y are not strongly related, p(x,y) and p(x)*p(y) will be nearly equal, and consequently MI(x,y) will be close to 0.

Third, if words x and y are entirely unrelated, p(x,y) will be smaller than p(x)*p(y), and consequently MI(x,y) will be less than 0.

Fourth, mutual information is in general symmetric, satisfying MI(x,y) = MI(y,x). However, the mutual information applied in the present invention does not assume symmetry when extracting the co-occurring words of an index term. This is because the frequency of co-occurring word y counted within a fixed window after index term x has been selected by weight calculation, and the frequency of co-occurring word x counted within a fixed window after index term y has been selected by weight calculation, are computed as separate and different values.

Fifth, because the mutual information formula expresses the dependency between words probabilistically, it can express the information content of a compound word quantitatively.

Sixth, because words with a low frequency of occurrence do not provide sufficient evidence to confirm the dependency between words objectively, the mutual information formula should be applied to relatively large documents.
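The first two properties above can be illustrated with a frequency-based estimate. Because the formulas themselves are not reproduced in the text, the sketch below assumes the widely used pointwise form MI(x,y) = log2(N * f(x,y) / (f(x) * f(y))), where f(x,y) is taken to be a raw co-occurrence count. The function name, the base-2 logarithm, and the sample counts are all hypothetical.

```python
import math

def mutual_information(f_x, f_y, f_xy, n):
    """Assumed pointwise estimate log2(N * f(x,y) / (f(x) * f(y)))."""
    if min(f_x, f_y, f_xy, n) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2(f_xy * n / (f_x * f_y))

# Related words: they co-occur far more often than chance predicts.
related = mutual_information(f_x=10, f_y=10, f_xy=10, n=100)

# Independent words: the co-occurrence count matches the chance expectation.
independent = mutual_information(f_x=10, f_y=20, f_xy=2, n=100)

print(related, independent)
```

With small counts the estimate swings widely when a single co-occurrence is added or removed, which is the concern behind the sixth point.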

FIG. 6 shows the module that, for the user query term "facsimile", lists the words that co-occur with the query term and computes mutual information from them.

When computing mutual information as in FIG. 6, the relationship between random variables in a document is a value describing the relationship between one word and another, so averaging over word occurrence frequencies is not meaningful. Formula 4 is therefore used to obtain the mutual information.

(Formula 4)

Formula 4 can be rewritten in an equivalent form, which can in turn be expressed as Formula 5.

(Formula 5)

In Formula 5, p(y) refers to the set of index terms in the documents containing query y, so it can always be regarded as 1. That is, since log p(y) = log 1 = 0, the probability that x appears in document y is proportional to the mutual information between x and y.

(Formula 6)

In Formula 6, the probability-vector information values used for the entropy calculation can be replaced by the mutual information between words.

Because the probability information values p_i used in the entropy calculation must satisfy the condition of Formula 7, they go through a normalization step.

(Formula 7) $\sum_{i=1}^{n} p_i = 1$

If the information values of the probability vectors are not normalized, that is, if the individual values are not on the same scale, the overall entropy value is dominated by the values with the larger scale. Values on a relatively small scale then cannot influence the overall entropy, and documents reflecting the information the user wants cannot be retrieved. Once normalization puts the probability information values of all documents on the same scale, however, all of the user's information is taken into account evenly and accurate retrieval becomes possible.
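A minimal sketch of the normalization step, assuming the condition to be met is that the values sum to 1 and that the raw information values are non-negative. The function name is hypothetical, and how negative mutual information values would be handled is not stated in the text and is not addressed here.

```python
def normalize(values):
    """Scale non-negative information values so that they sum to 1."""
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    total = sum(values)
    if total <= 0:
        raise ValueError("at least one value must be positive")
    return [v / total for v in values]

# Two documents whose raw values sit on very different scales.
small_scale = normalize([1, 1, 2])
large_scale = normalize([100, 100, 200])
print(small_scale, large_scale)
```

After normalization both vectors are identical, so the difference in scale no longer affects an entropy computed from them.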

The design of the user profile used together with entropy in the present invention is described below.

A user profile describes an individual user's areas of interest. If the preferences or interests of each user are known in advance when designing an information retrieval system, the system can serve users better by giving priority to the results that fall within each user's areas of interest.

In an information retrieval system, the user profile is not a core component, since most systems can present the results users want without one. However, most result sets also contain results the user does not want, so users must check many documents one by one or revise the query to get what they are looking for. Introducing a user profile resolves these problems partially, though not completely.

User profile systems studied so far include systems using word lists and relevance feedback, systems using user models, systems using genetic algorithms or co-occurrence information between words, and systems using a thesaurus. These are information filtering systems. In information filtering, however, user interest can be regarded as covering a fairly broad range, whereas in an information retrieval system user interest is an information need confined to a specific field. Applying such systems directly to information retrieval is therefore problematic.

Research on user profiles in information retrieval systems includes methods using Bayesian probability, the nearest neighbor (NN) method, decision trees, Rocchio's algorithm, and neural networks. Most of this research was conducted as a way of learning user interests in agent systems.

FIG. 7 shows the user profile designed in the present invention, which consists of the query terms most recently used for searching and their usage frequencies. The most frequently used word is considered to best reflect the user's interest, so the profile is kept sorted in descending order of usage frequency.
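A profile of this shape can be sketched as a small class that tracks recent query terms and their usage counts. This is an illustrative stand-in, not the structure of FIG. 7: the text describes a circular linked list, whereas the sketch uses a `Counter` plus a recency list for brevity, and the class name, the `capacity` parameter, and the eviction rule (drop the least recently used term) are assumptions.

```python
from collections import Counter

class UserProfile:
    """Recent query terms with usage counts, reported most frequent first."""

    def __init__(self, capacity=5):
        self.capacity = capacity   # hypothetical limit on stored terms
        self._freq = Counter()     # term -> usage count
        self._recency = []         # terms ordered oldest to newest

    def record(self, term):
        """Register one use of a query term, evicting the oldest term if full."""
        self._freq[term] += 1
        if term in self._recency:
            self._recency.remove(term)
        self._recency.append(term)
        if len(self._recency) > self.capacity:
            oldest = self._recency.pop(0)
            del self._freq[oldest]

    def ranked(self):
        """Return (term, count) pairs in descending order of count."""
        return sorted(self._freq.items(), key=lambda item: -item[1])

profile = UserProfile(capacity=2)
for term in ["entropy", "thesaurus", "entropy", "ranking"]:
    profile.record(term)
ranking = profile.ranked()
print(ranking)
```

In this run "thesaurus" is evicted when "ranking" arrives, because it is the least recently used term once "entropy" has been used again.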

In the user profile of FIG. 7, the query terms used in past searches and their frequency information are used, together with the current user query, in the entropy calculation. Some accumulated frequency information is expected to be needed before the user profile reflects the user's search intent. Experiments showed, however, that as time passes the profile of the user's interests acts, together with the query terms, as a weight in the entropy calculation, and thus has a considerable effect on improving accuracy by reflecting the user's preferences.

Entropy-based document ranking adjustment in the present invention is described below.

FIG. 8 shows the entropy-based ranking adjustment system used in the present invention. The system can be broadly divided into a morphological analysis module for extracting index terms, an index-term construction module, a co-occurrence information construction module that builds the mutual information between words, and finally a ranking adjustment module based on entropy calculation.

Morphological analysis for index-term extraction in the entropy-based document ranking adjustment of the present invention is described below.

To improve the retrieval efficiency of a web information retrieval system, appropriate index terms must be extracted from each web document, which requires removing unnecessary words. English-language retrieval systems use stemming to strip unnecessary function words, but Korean requires morphological analysis to identify the unnecessary words. The first step in designing a morphological analyzer is to identify the units that make up an eojeol (a space-delimited word unit). Linguistics defines the morpheme as "the smallest unit that carries meaning" and offers six principles for identifying morphemes, but these are difficult to apply directly in machine processing. The present invention therefore defines a morpheme, from the machine-processing standpoint, as "a constituent of an eojeol as recognized by the morphological analyzer", and each such constituent serves directly as a dictionary headword.

In an agglutinative language such as Korean, where the grammatical function of a lexical morpheme is expressed by attaching grammatical morphemes, it is very difficult to define precisely the conditions under which morphemes combine. Combinations of morphemes of the same kind are therefore treated as a single morpheme: a suffix combined with a suffix, a particle combined with a particle, and a sequence of pre-final endings are classified as one suffix, one particle, and one pre-final ending, respectively. In addition, a non-separable pre-final ending that combines with only a few endings is treated, together with its ending, as an ending, and a final ending to which a particle attaches is also treated as an ending. The combination patterns of these forms are very difficult to express as rules, and even if the connection patterns were written as rules they could admit incorrect combinations. Handling them as merged units is therefore more efficient, since it prevents incorrect analyses and reduces the burden of checking combination order and combination conditions.

FIG. 9 shows the morphological analyzer designed in the present invention for index-term extraction. It takes a text sentence as input at the top of the screen and displays the result of the morphological analysis algorithm at the bottom.

In general, the functions of a morphological analyzer vary with the application, so which functions an analyzer should have depends on the purpose for which it is used. Because the morphological analyzer used in the present invention was designed to implement a web information retrieval system, it differs from analyzers used for tasks such as spell checking in that it attempts accurate analysis of noun phrases above all else.

Morphological analysis methods for Korean fall broadly into the longest-match method, the head-tail division method, and the tabular parsing method using left-right connectivity information. When an eojeol (a space-delimited word unit) can be analyzed in several ways, the longest-match and head-tail methods cannot produce every possible combination, which triggers backtracking at higher stages. The tabular parsing method, by contrast, obtains every possible combination, but to exclude invalid combinations it requires each dictionary entry to carry connectivity information that determines whether the entry can combine with other morphemes, which makes the dictionary difficult to build, extend, and maintain.

The morphological analyzer used in the present invention obtains every possible combination using only general part-of-speech information. If the morphemes extracted from an eojeol can reconstitute the original eojeol, the combination is presumed possible, and invalid combinations are then removed by the analyzer's filtering capability. In other words, the analyzer uses a multi-stage filtering technique.
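The generate-then-filter idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analyzer of FIG. 9: the toy lexicon, the part-of-speech tags, and the two filter rules are hypothetical and were chosen only to show candidate enumeration followed by successive filtering stages.

```python
from typing import Callable, Dict, List, Set, Tuple

Analysis = List[Tuple[str, str]]  # sequence of (morpheme, part of speech)

# Hypothetical toy lexicon: surface form -> possible parts of speech.
LEXICON: Dict[str, Set[str]] = {
    "학교": {"NOUN"},
    "학": {"NOUN"},
    "교": {"NOUN"},
    "에서": {"PARTICLE"},
    "에": {"PARTICLE"},
    "서": {"PARTICLE", "VERB_STEM"},
}


def candidates(eojeol: str) -> List[Analysis]:
    """Stage 1: every way to rebuild the eojeol from lexicon entries."""
    if not eojeol:
        return [[]]
    results: List[Analysis] = []
    for end in range(1, len(eojeol) + 1):
        piece = eojeol[:end]
        for pos in sorted(LEXICON.get(piece, ())):
            for rest in candidates(eojeol[end:]):
                results.append([(piece, pos)] + rest)
    return results


def starts_with_noun(analysis: Analysis) -> bool:
    """Hypothetical filter: keep analyses whose first morpheme is a noun."""
    return bool(analysis) and analysis[0][1] == "NOUN"


def no_bare_stem_at_end(analysis: Analysis) -> bool:
    """Hypothetical filter: a verb stem may not end the eojeol."""
    return bool(analysis) and analysis[-1][1] != "VERB_STEM"


FILTERS: List[Callable[[Analysis], bool]] = [starts_with_noun, no_bare_stem_at_end]


def analyze(eojeol: str) -> List[Analysis]:
    """Stage 2 onward: apply each filter in turn to the surviving candidates."""
    survivors = candidates(eojeol)
    for keep in FILTERS:
        survivors = [a for a in survivors if keep(a)]
    return survivors
```

Calling `analyze("학교에서")` keeps the noun-plus-particle reading among its survivors and drops the readings that end in a bare verb stem. A real filter chain would encode far richer linguistic constraints than these two rules.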

With regard to the entropy-based document ranking adjustment of the present invention, the selection and classification of index terms are described below.

Indexing analyzes the text of an input document and then extracts the words or phrases that can represent the subject matter of that document. The text to be analyzed is the full text of the document or the heading of its abstract. In general, an index term can be defined as any word permitted for interpreting a particular document and for formulating queries to search the document file, in accordance with rules governing term usage and the relationships among terms.

Among the candidate words, the high-frequency words that appear most often are general words with no value as subject terms, and low-frequency words likewise carry no meaning as subject terms. The words that remain after these uninformative words are excluded become the subject terms that are candidates for index terms.

In the present invention, as shown in FIG. 10, an upper cutoff frequency and a lower cutoff frequency are set on the word frequency distribution. The mid-frequency words that fall between the two cutoffs are regarded as having high discriminating power for document content, and these mid-frequency words are selected as index terms.
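The cutoff-based selection above can be illustrated with a short sketch. The function name, the sample words, and the choice to treat both cutoffs as inclusive are assumptions made for illustration; the text does not state how the cutoff values themselves are chosen.

```python
from collections import Counter
from typing import Iterable, List


def select_index_terms(words: Iterable[str], lower: int, upper: int) -> List[str]:
    """Keep the words whose frequency lies between the two cutoffs (inclusive)."""
    counts = Counter(words)
    return sorted(w for w, c in counts.items() if lower <= c <= upper)


# Hypothetical word stream: one very common word, two mid-frequency words,
# and one word that appears only once.
sample = ["the"] * 10 + ["engine"] * 4 + ["valve"] * 3 + ["misprint"]
index_terms = select_index_terms(sample, lower=2, upper=5)
```

With these cutoffs, `index_terms` holds the two mid-frequency words, while the very common word and the single-occurrence word are dropped.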

An important part of index term determination is assigning a weight to each selected index term, which supports efficient retrieval. Because the selected index terms are used when a user searches for information, the index term determination technique can be said to govern the performance of the information retrieval system.

Techniques for selecting index terms generally include statistical techniques, linguistic techniques, document-structure techniques, and controlled-vocabulary techniques. The linguistic techniques applied to automatic indexing can be further divided into lexical, syntactic, and semantic levels.

With recent advances in natural language processing, indexing with natural language can be more effective than indexing with a controlled vocabulary: a natural-language index is easy to keep current, is inexpensive to input, and is easy to exchange between databases.

In the present invention, to reflect the characteristics of Korean, post-processing based on morpheme structure rules is applied to the parts of speech extracted by morphological analysis, so that a structural method reflecting the characteristics of Korean is used alongside the statistical method. The statistical criterion for index term selection in the present invention is based mainly on word occurrence frequency.

Criteria that use occurrence frequency directly are divided into simple frequency and relative frequency, according to how the word frequency is computed. Simple frequency is in turn divided into term frequency and document frequency, according to where the word occurs.

The word frequency ($T_i$) is the number of documents in which a particular word k appears among the documents i to be indexed, that is, the sum of $b_{ik}$ over the documents i, where $b_{ik} = 1$ when $f_{ik} \geq 1$ and $b_{ik} = 0$ when $f_{ik} = 0$.
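As a rough illustration of the two simple-frequency counts, the sketch below tallies, for every word, its total number of occurrences and the number of documents that contain it. It assumes that $f_{ik}$ is the number of times word k occurs in document i; the function name and the sample documents are hypothetical.

```python
from typing import Dict, List


def simple_frequencies(docs: List[List[str]]) -> Dict[str, Dict[str, int]]:
    """Return total occurrences and containing-document counts for every word."""
    occurrences: Dict[str, int] = {}
    documents: Dict[str, int] = {}
    for words in docs:
        f_i: Dict[str, int] = {}  # f_ik for the current document i
        for w in words:
            f_i[w] = f_i.get(w, 0) + 1
        for w, f_ik in f_i.items():
            occurrences[w] = occurrences.get(w, 0) + f_ik
            b_ik = 1 if f_ik >= 1 else 0  # indicator defined in the text
            documents[w] = documents.get(w, 0) + b_ik
    return {"occurrences": occurrences, "documents": documents}


sample_docs = [["engine", "engine", "valve"], ["engine"], ["valve", "pump"]]
freqs = simple_frequencies(sample_docs)
```

Here `freqs["occurrences"]` counts every appearance of a word, whereas `freqs["documents"]` counts each document at most once per word, which is the role the indicator $b_{ik}$ plays.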

Simple frequency takes no account of the size of the document collection, the length of the text being analyzed, or how commonly a word is used, so in practice it is difficult to set index term selection criteria on this basis alone. Relative frequency, which takes these factors into account, is therefore regarded as the more suitable criterion.

FIG. 11 shows a module that computes the term frequency and document frequency for the keyword "engine" according to the statistical criteria. The top of the screen shows the document number, document location, and document content, and the bottom of the screen shows the result of morphological analysis of the document.

The index term file is the file that stores information about the index terms extracted by the index term determination technique. It is shared by indexing and retrieval, so it must offer fast access and use storage space efficiently.

The methods mainly used to organize the information file can be classified into the inverted file, which stores references for each index term; the bitmap index, which represents the index term information of each document as a bit stream; and the signature file. For the inverted file, the important factors are dynamic construction, retrieval speed, and storage size. For the bitmap and signature file methods, the important factors are the retrieval technique and storage size.

FIG. 12 shows the inverted index file generation module used in the present invention. The URLs of home pages appear at the top of the screen, and the content of each home page is shown at the bottom.

The indexing technique used in the present invention must take into account not only the position and frequency of an index term but also its meaning, and merely adding further data structures would make the system more complicated. It therefore uses an extended inverted file structure, rather than a bitmap index, as its data structure.

With regard to the entropy-based document ranking adjustment of the present invention, generation of the inverted index file is described below.

In the inverted index file technique, for each key term extracted from all stored documents, a list of the documents containing that term, that is, an inverted list, is constructed. Retrieval then uses a file in which the inverted lists are sorted in alphabetical order of the key terms.
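A minimal sketch of building such inverted lists is shown below. The document identifiers and terms are hypothetical, and an in-memory dictionary stands in for the on-disk file.

```python
from typing import Dict, List, Set


def build_inverted_lists(docs: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """Map each key term to the sorted list of documents that contain it."""
    inverted: Dict[str, Set[int]] = {}
    for doc_id, terms in docs.items():
        for term in terms:
            inverted.setdefault(term, set()).add(doc_id)
    # Sort the terms so the inverted lists are stored in key-term order.
    return {term: sorted(inverted[term]) for term in sorted(inverted)}


sample_docs = {1: ["search", "index"], 2: ["index", "entropy"], 3: ["search"]}
inverted_lists = build_inverted_lists(sample_docs)
```

Looking up `inverted_lists["index"]` then gives the documents containing that term without scanning the documents themselves.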

FIG. 13 shows an example of the inverted file data structure used in the present invention.

As shown in FIG. 13, the inverted file designed in the present invention stores information without any special encoding of the index terms, unlike the bitmap and signature file methods. First, the per-character reference file records, for each first syllable, the corresponding position in the per-term reference file, which is sorted by the first syllable of the index terms. The per-term reference file records the occurrence frequency and weight of each index term and the related position in the document identifier reference file. Finally, using the document information obtained from the document identifier reference file, the list of documents containing the index term is obtained through the reference document array.

FIG. 13 illustrates the process of finding documents that contain the index term "정보" (information). To find "정보", the per-character reference file is first consulted to obtain the starting position, in the per-term reference file, of the terms beginning with the syllable '정'. The per-term reference file then yields, for the term "정보", the number of documents, the occurrence frequency of the term across all documents, and the starting position in the document identifier reference file. From the document identifier reference file, the reference documents and the term frequency within each reference document are obtained, the weight is calculated, and the content of the matching documents is presented to the user.
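The three-step lookup just described can be sketched with in-memory stand-ins for the three files. The record layout, the field names, and the sample data below are assumptions for illustration and are not the layout of FIG. 13; the weight calculation is omitted.

```python
from typing import Dict, List, NamedTuple, Tuple


class TermEntry(NamedTuple):
    term: str
    doc_count: int       # number of documents containing the term
    total_freq: int      # occurrences of the term across all documents
    postings_start: int  # starting position in the document identifier file


# Per-term reference file, sorted by term (hypothetical data).
TERMS: List[TermEntry] = [
    TermEntry("검색", 2, 5, 0),
    TermEntry("정렬", 1, 1, 2),
    TermEntry("정보", 2, 4, 3),
]

# Per-character reference file: first syllable -> position of its first term.
CHAR_REF: Dict[str, int] = {"검": 0, "정": 1}

# Document identifier reference file: (document id, frequency in that document).
POSTINGS: List[Tuple[int, int]] = [(1, 3), (2, 2), (2, 1), (1, 1), (3, 3)]


def find_documents(term: str) -> List[Tuple[int, int]]:
    """Follow first syllable -> term entry -> postings for one index term."""
    start = CHAR_REF.get(term[:1])
    if start is None:
        return []
    i = start
    # Scan only the run of terms that share the first syllable.
    while i < len(TERMS) and TERMS[i].term[:1] == term[:1]:
        entry = TERMS[i]
        if entry.term == term:
            begin = entry.postings_start
            return POSTINGS[begin:begin + entry.doc_count]
        i += 1
    return []
```

For example, `find_documents("정보")` jumps to the '정' block, finds the matching entry, and returns its postings together with the per-document frequencies.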

This approach has the drawback that storage updates and reorganization are costly, as is merging the lists when they are numerous or long. However, in addition to allowing retrieval directly without a separate search step, it can represent weights for key terms and information about documents, and it makes it easy to compute query similarity and to search for homonyms at retrieval time. Given these points, the inverted file technique is generally suited to a static environment, that is, one in which queries are frequent and updates are rare.

The table in FIG. 14 compares the characteristics of the bitmap index, the signature file index, and the inverted file index. As the table shows, apart from ease of system implementation, the bitmap index and the inverted file index are suitable as data structures for intelligent index term extraction and for storing additional information, and the inverted file index is suitable for ease of ranking and for dictionary-based natural language processing.

Despite the advantages shown in the table in FIG. 14, the inverted file has the drawback of degrading system performance through storage updates and reorganization and when its lists grow long. To accommodate additional information about index terms, extension of the inverted file structure is therefore essential, and the extension must be designed carefully so that it does not affect system performance.

Web information retrieval systems generally provide their service through CGI (Common Gateway Interface), and the issue that arises is how to raise server performance. To handle requests arriving from client browsers, the server runs the CGI program for each request, creating a new process every time. The data received from the web browser is then passed to the CGI program through environment variables and standard input (stdin), and the information gathered by the CGI program is ultimately delivered to the client through standard output (stdout). The problem lies in the processes created for every client request: each CGI execution creates a new process, so the number of processes grows with the number of client requests, which in turn reduces server performance. In particular, an inverted file must index a very large number of documents and its size is proportional to the number of indexed documents, so for every client request each CGI process loads the same inverted file data structure into its own memory area.

To solve this problem, the Internet Server Application (ISA), which works together with the HTTP server, was proposed. Because an Internet Server Application is implemented as an executable in DLL (Dynamic Linking Library) form, no separate process is created each time it runs. Unlike a conventional standalone executable, a DLL runs within the process that called it, and once loaded into memory it remains there until it is no longer in use, so subsequent client requests can be handled more quickly.

Even in a system using Internet Server Applications, the inverted file data structure must be loaded into memory once, which still wastes memory unnecessarily. In addition, because the system holds the file open while the HTTP server is running, the inverted file data structure cannot be added to or modified.

To solve these problems, the present invention organizes the inverted file data structure as a dynamic trie keyed on the first syllable of the index term. In addition, each index term entry is designed so that it can continue to accommodate additional information beyond position and frequency within the document.

FIG. 15 shows the extended inverted file data structure, containing semantic information, designed in the present invention.

Because the dynamic-trie inverted file of FIG. 15 stores only information about the first syllables and child nodes of the inverted file, the file that must be loaded initially is only 0.5% of the size of an inverted file that is not organized dynamically. Furthermore, by splitting some 2,000 index terms, extracted on the basis of a dictionary of some 40,000 words, into 644 files, rarely used nodes are not loaded into memory, so system resources are used efficiently. An inverted file organized as a dynamic trie loads only the nodes the system needs, and a node once loaded into memory can be shared by all Internet Server Applications running in the same process, which is advantageous over conventional CGI in both efficient use of system resources and performance.
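The load-on-demand behaviour described above can be illustrated with a small sketch. It is not the structure of FIG. 15: the class name, the loader callback, and the in-memory dictionary standing in for the per-syllable files are all assumptions, and only one level of the trie (the first syllable) is modelled.

```python
from typing import Callable, Dict, List

Postings = Dict[str, List[int]]  # index term -> document ids


class LazyInvertedFile:
    """Loads the node for a first syllable only when a lookup needs it."""

    def __init__(self, load_node: Callable[[str], Postings]) -> None:
        self._load_node = load_node
        self._nodes: Dict[str, Postings] = {}  # nodes already in memory
        self.loads = 0                         # how many nodes were loaded

    def lookup(self, term: str) -> List[int]:
        if not term:
            return []
        first = term[0]
        if first not in self._nodes:
            # First use of this syllable: load its node once and keep it.
            self._nodes[first] = self._load_node(first)
            self.loads += 1
        return self._nodes[first].get(term, [])


# Hypothetical stand-in for the per-syllable files on disk.
NODE_FILES: Dict[str, Postings] = {
    "정": {"정보": [1, 3], "정렬": [2]},
    "검": {"검색": [1, 2]},
}


def load_from_memory(first_syllable: str) -> Postings:
    return NODE_FILES.get(first_syllable, {})
```

Two lookups for terms that share a first syllable trigger a single load, which is the property that lets repeated requests within one server process reuse nodes already in memory.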

Thanks to these advantages, the extended inverted file structure maximizes the performance of the web information retrieval system. In terms of server resource sharing, it also makes performance easier to improve than with the conventional static CGI method, and it markedly reduces the time and effort needed to add new information or perform maintenance, which makes it economical as well.

With regard to the entropy-based document ranking adjustment of the present invention, document ranking adjustment using entropy and Bayesian estimates is described below.

To reflect contextual information about the words in the user query, the present invention calculates the entropy between an input vector, composed of the user query and the user profile vector, and the index terms of the dynamically retrieved documents.

FIG. 16 shows an example of calculating the entropy between the input vector and the index terms of the dynamically retrieved documents.

In FIG. 16, if the user query is "disease" and the subject terms of [Document #1] are ["uterine cancer", "clinical", "research", "patient"], the entropy is calculated by Formula 9.

(Formula 9)

In Formula 9, the value $n = 4$ means that [Document #1] has four index terms, and each $p_i$ is obtained by Formula 6 from the mutual information between "disease" and each of ["uterine cancer", "clinical", "research", "patient"]. The same procedure is carried out for $H_2$ (virus), $H_3$ (illness), and $H_4$ (patient). The final entropy between the user query and user profile vector and [Document #1] is calculated by Formula 10.
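The per-term entropy described above can be sketched as follows, assuming the standard form $H = -\sum_{i=1}^{n} p_i \log p_i$ that the claims describe. The base-2 logarithm, the function name, and the four probability values are assumptions; in the method itself each $p_i$ comes from Formula 6, which is not reproduced in this sketch.

```python
import math
from typing import Sequence


def entropy(probabilities: Sequence[float]) -> float:
    """Shannon entropy of a set of probabilities, using log base 2."""
    # Terms with p = 0 contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical p_i values between the query "disease" and the four index
# terms of [Document #1]; real values would be derived from mutual information.
p_doc1 = [0.4, 0.3, 0.2, 0.1]
h1_disease = entropy(p_doc1)
```

The same function would be called once for each profile word (virus, illness, patient) to obtain the remaining per-term entropies for the document.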

(Formula 10)

In calculating the entropy between the user query and user profile vector and the index terms of the dynamically retrieved documents, the present invention uses weights based on inverse document frequency together with estimates obtained by Bayesian learning. The weight for the user query is 1 multiplied by its inverse document frequency. The weight for each of the three highest-frequency words in the user profile is its word frequency divided by the sum of the frequencies of all profile words, multiplied by its inverse document frequency.

The weights for the user query and the user profile are determined using the Bayesian posterior probability, which is defined as in Formula 11.

(Formula 11)

In Formula 11, $t$ denotes the text and $d_1, \ldots, d_n$ denote the documents. That is, the Bayesian posterior probability can be expressed as the product of the likelihood function of the text in the documents and the prior probability.
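Read literally, the description above corresponds to the usual proportional form of Bayes' rule. The expression below is a reconstruction from that description, not the published Formula 11, and the normalizing constant is left implicit as an assumption.

```latex
p(t \mid d_1, \ldots, d_n) \;\propto\; p(d_1, \ldots, d_n \mid t)\, p(t)
```

Here $p(d_1, \ldots, d_n \mid t)$ plays the role of the likelihood of the text in the documents and $p(t)$ is the prior probability.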

When no information about the prior probability $p(t)$ is available, a value of 1 is generally used. In the present invention, however, the inverse document frequency of the text is used to make the prior more informative, and the likelihood value uses the occurrence frequency of each word in the corpus.

The posterior information obtained in this way, from the inverse document frequency as concrete prior information and from the frequency of each word in the corpus, is used as the weight for each word of the user query and the user profile.

For example, the Bayesian estimates in Formula 10 can be obtained from the user profile. If the user profile shows that the query "illness" was used 959 times, "patient" 9 times, and "virus" 17 times, the initial Bayesian estimates are 959/1085, 9/1085, and 17/1085 respectively, and each is multiplied by its inverse document frequency weight to give the final weight. Formula 10 is thus recalculated as Formula 12.
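The weight calculation in this example can be sketched as below. The three frequencies and the denominator 1085 come from the text; the sketch assumes that 1085 is the summed frequency of all words in the profile (the three listed words alone sum to 985), and the inverse document frequency values are hypothetical placeholders.

```python
from typing import Dict

# Frequencies of the three highest-frequency profile words, from the text.
PROFILE_TOP3: Dict[str, int] = {"illness": 959, "patient": 9, "virus": 17}
PROFILE_TOTAL = 1085  # assumed to be the total frequency of all profile words

# Hypothetical inverse document frequency values, for illustration only.
IDF: Dict[str, float] = {"disease": 1.5, "illness": 1.2, "patient": 2.0, "virus": 2.5}


def profile_weights(freqs: Dict[str, int], total: int,
                    idf: Dict[str, float]) -> Dict[str, float]:
    """Initial Bayesian estimate (frequency / total) times the IDF weight."""
    return {word: (count / total) * idf[word] for word, count in freqs.items()}


def query_weight(query: str, idf: Dict[str, float]) -> float:
    """The user query is weighted by 1 times its inverse document frequency."""
    return 1.0 * idf[query]


weights = profile_weights(PROFILE_TOP3, PROFILE_TOTAL, IDF)
```

Because the estimates are ratios against the running profile total, they shift as the user issues more queries, which matches the stated motivation for using Bayesian weights.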

(Formula 12)

The present invention uses Bayesian weights so that, when the frequencies of the words in the user profile are reflected in the weights, the changing nature of the population is also reflected. Experiments confirm that this can be used to model the user's search behavior effectively.

FIG. 17 shows, for reference, an example of an algorithm that uses Bayesian estimates and inverse document frequency to calculate the entropy value between the user query and the user profile.

As described above, the document ranking adjustment method applying entropy and a user profile according to the present invention improves retrieval accuracy in an information retrieval system by using the user profile together with entropy values derived from probability information between the user query and the subject terms of each document.

In addition, the present invention calculates the entropy values from mutual information based on co-occurrence information between words and, when calculating the entropy values, combines weighting by inverse document frequency with weighting by Bayesian learning, thereby improving the accuracy of the retrieved documents.

Furthermore, to reflect the recent interests and search behavior of users of the information retrieval system and the semantic information of the user query, the present invention designs a user profile consisting of the keywords used in previous searches and their usage frequencies, and reflects it in the entropy calculation. In the actual entropy calculation, to limit the retrieval delay caused by time complexity, only a set number of the highest-frequency previous queries in the user profile take part in the calculation, and the final ranking of the retrieved documents is readjusted in order of increasing total entropy.
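The two steps in this paragraph, keeping only the highest-frequency profile queries and ordering documents by increasing total entropy, can be sketched as follows. The function names, the tie-breaking by name, and the sample entropy totals are assumptions for illustration.

```python
from typing import Dict, List


def top_profile_terms(profile: Dict[str, int], k: int) -> List[str]:
    """Return the k most frequently used previous queries in the profile."""
    return sorted(profile, key=lambda term: (-profile[term], term))[:k]


def rerank(total_entropy: Dict[str, float]) -> List[str]:
    """Order documents from the lowest total entropy to the highest."""
    return sorted(total_entropy, key=lambda doc: (total_entropy[doc], doc))


# Hypothetical profile and per-document entropy totals.
profile = {"illness": 959, "patient": 9, "virus": 17, "clinic": 3}
selected = top_profile_terms(profile, k=3)
ranking = rerank({"doc1": 2.4, "doc2": 1.1, "doc3": 1.9})
```

Limiting the calculation to `k` profile terms bounds the number of per-term entropies computed for each document, while the final sort places the lowest-entropy document first.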

Although preferred embodiments of the present invention have been described in detail above, a person of ordinary skill in the art to which the present invention pertains will appreciate that the present invention can be practiced with various modifications and changes without departing from the spirit and scope of the invention as defined in the appended claims. Accordingly, future modifications of the embodiments of the present invention will not depart from the technology of the present invention.

Claims

A document ranking adjustment method applying entropy and a user profile, characterized in that, to improve accuracy, the entropy calculation is performed on the basis of weight values computed and applied from the inverse document frequency of each of the user's subject terms and from Bayesian learning.

The method of claim 1, wherein the amount of information in the predetermined information theory indicates the degree of uncertainty about the message to be conveyed, the amount of information being greater as the degree of uncertainty is greater, and wherein the uncertainty is expressed as the average amount of information obtained by taking the reciprocal of each message's occurrence probability, taking the logarithm of that value, and weighting the result by the occurrence probability of each message.

The method of claim 1,

wherein the entropy calculation uses the logarithmic form of the average amount of information,

the amount of information indicates the degree of uncertainty about the message to be conveyed,

the entropy H, which represents the amount of this uncertainty, is defined by the formula

$H = -\sum_{i=1}^{n} p_i \log p_i$,

in which the amount of information $\log p_i$ of message i is determined by the probability $p_i$ that the message is selected,

the entropy H is the average amount of information of the n messages,

$p_i$ denotes the occurrence probability of the i-th message, and

the entropy H has, expressed mathematically, a characteristic given by a formula (not reproduced here).

The method of claim 1, wherein Bayesian probability is used in designing the user profile.

The method of claim 1, wherein the readjustment of the document ranking is achieved by a morphological analysis module for extracting index terms, an index term construction module, a co-occurrence information construction module that builds the mutual information between words, and a ranking adjustment module based on entropy calculation.