KR100659370B1

KR100659370B1 - Method for constructing a document database and method for searching information by matching thesaurus

Info

Publication number: KR100659370B1
Application number: KR1020060014749A
Authority: KR
Inventors: 정한민; 성원경; 김평; 이미경; 구희관; 구남앙; 박동인
Original assignee: 한국과학기술정보연구원
Priority date: 2006-02-15
Filing date: 2006-02-15
Publication date: 2006-12-19

Abstract

A method for constructing a document database and an information retrieval method using thesaurus matching are provided. Because a program constructs the document database automatically, the database is easy to build, and because its contents are regenerated automatically when the field classification system and the thesaurus are updated, it can be updated quickly when new fields and terms emerge. A document is read and its contents are extracted (S120). Index terms of the document are extracted by performing morphological analysis on the extracted contents (S140). The index terms are analyzed by matching them against the thesaurus (S150). The analysis result is stored in the document database (S160). Thesaurus matching is analyzed by comparing one or more conditions selected from the term frequency, document frequency, field classification code frequency, and concept depth of the index terms.

Description

Method for constructing a document database and method for searching information by matching thesaurus

FIG. 1 is a block diagram of an information retrieval system that links a field classification system and a thesaurus, as used in the document DB construction method of the present invention.

FIG. 2 shows an example of a field classification system in which field classification codes are recorded.

FIG. 3 shows an example of a thesaurus based on the field classification system of FIG. 2.

FIG. 4 is a flowchart of the method for constructing a document DB by thesaurus matching according to the present invention.

FIG. 5 shows the structure of an index term DB, composed of index terms and document numbers, for input documents.

FIG. 6 shows the structure in which the results of matching index terms against the thesaurus for input documents are stored in the document DB by the document DB construction method according to the present invention.

FIG. 7 shows the structure of a document DB formed by recommending subjects using the document DB construction method according to the present invention.

FIG. 8 is a flowchart of a document DB management method according to an embodiment of the present invention.

FIG. 9 is a flowchart of an information retrieval method according to an embodiment of the present invention.

* Explanation of reference numerals for main parts of the drawings *

10: user computer    50: network

100: web server    110: user interface

200: information retrieval server    300: database

The present invention relates to a method for constructing a document DB by thesaurus matching and to an information retrieval method, and more particularly to such methods that can retrieve documents effectively by using a thesaurus based on a field classification system.

In recent years, Korean natural language processing technology in the information retrieval field has become able, through morphological and syntactic analysis, to identify parts of speech, base forms, and roles within a sentence to a considerable degree.

However, as advances in information and communication technology cause new fields and terms to increase rapidly, applying conventional information retrieval methods requires the related information retrieval system to be updated frequently as new fields and terms appear.

In conventional information retrieval methods, the rapid growth of information requires a system administrator or information manager to change the index term DB, the document DB, and so on from time to time. Because these DBs have conventionally had to be changed by hand, maintenance requires considerable cost and manpower.

In particular, modifying a document DB that stores information on a vast number of documents takes so much time and money that it is difficult to do frequently.

Accordingly, the present invention has been made to solve the above problems of the prior art. An object of the present invention is to provide a method for constructing a document DB by thesaurus matching that can be carried out easily, because the document DB is formed automatically by executing a program.

The present invention also provides a document DB management method in which the document DB is regenerated automatically when the field classification system and the thesaurus are updated, so that the document DB can be updated quickly even when rapidly growing information gives rise to new fields and terms.

The present invention further provides an information retrieval method that receives a keyword or a document as the search input and provides not only related documents but also information on the related fields and subjects, so that the user can accurately understand the search results.

To achieve the above objects, a method for constructing a document DB by thesaurus matching according to an embodiment of the present invention, in an information retrieval system that links a field classification system and a thesaurus, comprises: reading a document; extracting the contents of the document; performing morphological analysis on the extracted document to extract index terms of the document; analyzing the index terms by matching them against the thesaurus; and storing the analysis result in a document DB.

In a preferred embodiment, the thesaurus matching is analyzed by comparing one or more conditions selected from the group consisting of the term frequency, document frequency, field classification code frequency, and concept depth of the index terms.

The analysis result is information in which probability values for items are assigned to the document in order according to the conditions, and the items are field classification codes or subjects.

For the items, the probability values of the field classification codes or subjects selected in descending order of probability value are stored.

The document DB management method of the present invention, in an information retrieval system that links a field classification system and a thesaurus, comprises: extracting a new field or a new term from a document; updating the field classification system for the new field or the new term; updating the thesaurus according to the updated field classification system or the new term; applying the updated thesaurus to analyze the index terms of stored documents by thesaurus matching; and storing the analysis result in the document DB.

An information retrieval method according to an embodiment of the present invention, in an information retrieval system that links a field classification system and a thesaurus, comprises: reading a query; performing morphological analysis on the query to extract index terms of the query; searching by comparing the index terms with a document DB formed by thesaurus matching; and providing the search results.

In a preferred embodiment, the search results are provided together with field classification codes or subjects.

An information retrieval method according to another embodiment of the present invention, in an information retrieval system that links a field classification system and a thesaurus, comprises: reading a document; extracting the contents of the document; performing morphological analysis on the extracted document to extract index terms of the document; assigning to the document the field classification codes or subjects to which the index terms are matched through the thesaurus; searching by comparing the field classification codes or subjects with the document DB; and providing the search results.

In a preferred embodiment, the document and the document DB are compared for similarity, or checked for duplication, with respect to one or more conditions selected from the group consisting of field classification code, probability value, index term distribution, index term frequency, and author name, and the search results are provided as documents clustered by similarity.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of an information retrieval system that links a field classification system and a thesaurus, as used in the document DB construction method of the present invention.

As shown in FIG. 1, the information retrieval system in which the document DB construction method of the present invention is used includes a user computer 10, a network 50, a web server 100, an information retrieval server 200, and a database 300.

The user computer 10, connected to the network 50, provides an environment in which a user can access the web server 100 through the network 50 and perform information retrieval. Instead of the user computer 10, a user terminal, a wireless portable communication device, or the like may be connected to the network 50 so that the user can access the web server 100.

The web server 100 includes a user interface 110 and a user DB 120, and receives queries, search conditions, and documents input from the user computer 10 through the network 50. The web server 100 sends a search request to the information retrieval server 200 through the network 50, receives the result from the information retrieval server 200, and provides it to the user computer 10. The web server 100 can also search web documents or information on other web sites 20 through the network 50.

The user interface 110 allows the user to enter, through the user computer 10, one or more search conditions for the query or document to be searched. The user interface 110 also works with the information retrieval server 200 to search for documents or web documents that match the query, search conditions, and documents. The user DB 120 stores detailed information about a plurality of users, and stores information such as each user's search tendencies based on the information the user has entered.

The information retrieval server 200 includes a document extraction module 210, a morphological analysis module 220, an inverted file generation module 230, a thesaurus matching module 240, a search engine 250, and the like, and searches for information that matches the data input from the user computer 10. Each module of the information retrieval server 200 may be a node storing a program that performs its function independently, or the modules may be a plurality of nodes storing partially integrated programs.

In detail, the document extraction module 210 receives a document input through the user computer 10, the network 50, or the like, reads its contents, and stores the result of extracting the document contents. The morphological analysis module 220 performs morphological analysis on the input query or document and extracts the index terms corresponding to the query or document.

The inverted file generation module 230 generates the index term DB 310 by associating documents with the index terms produced by the morphological analysis module 220, or stores the index term information of an input document in the index term DB 310. The thesaurus matching module 240 analyzes a document by performing thesaurus matching between the thesaurus 330 and the index terms corresponding to the document, and stores the matching result in the document DB 320.
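The inverted file produced by module 230 can be pictured with a minimal Python sketch. The function name `build_inverted_file` and the sample documents are hypothetical illustrations, not taken from the patent, and the index terms are assumed to have been extracted already by morphological analysis.

```python
from collections import defaultdict

def build_inverted_file(doc_terms):
    """Map each index term to the sorted list of document numbers that contain it."""
    inverted = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            inverted[term].add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}

# Hypothetical index terms per document number (illustrative data only).
doc_terms = {
    1: ["multilingual translation system", "middleware"],
    5: ["multilingual translation system", "parser"],
    7: ["multilingual translation system"],
}
index_db = build_inverted_file(doc_terms)
print(index_db["multilingual translation system"])
```

A set is used while building so that a term repeated within one document still yields a single posting for that document.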

The search engine 250 performs information retrieval for a query or document on the basis of the index term DB 310 and the document DB 320 corresponding to that query or document.

The database 300 is connected to the information retrieval server 200 and, in association with each module of the server, includes the index term DB 310, the document DB 320, the thesaurus 330, the field classification system 340, and the like. The thesaurus 330 has a hierarchical structure in which concept terms are organized on the basis of the field classification system 340, and is used for thesaurus matching. Each DB of the database 300 may be an information storage device or medium that stores its own information independently, may be an information storage device or medium included in the information retrieval server 200 and linked with each module, or may be a single information storage device or medium in which all the information is integrated.

The document DB construction method of the present invention, which is needed to carry out the information retrieval method of the present invention by linking the field classification system and the thesaurus in the information retrieval system configured as above, will now be described.

A thesaurus can be defined as a vocabulary table with a tree or lattice structure that organizes concept terms. Conventional information retrieval methods have used a thesaurus for automatic query expansion or concept-based retrieval. In the present invention, field classification codes from the field classification system are assigned to the concept terms of the thesaurus, yielding an extended thesaurus in which each concept term carries a field.
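One way to picture such an extended thesaurus is the sketch below, which models a tree-shaped thesaurus only. The class name `Concept`, its attribute names, the non-preferred term, and the code "5000" are assumptions made for illustration; only the code 5140 appears in the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Concept:
    preferred: str                         # preferred term for the concept
    field_code: str                        # field classification code attached to the concept
    non_preferred: List[str] = field(default_factory=list)
    broader: Optional["Concept"] = None    # parent concept in the tree, if any

    def depth(self) -> int:
        """Number of broader-concept links between this concept and the root."""
        steps = 0
        node = self
        while node.broader is not None:
            steps += 1
            node = node.broader
        return steps

# Hypothetical three-level hierarchy; "5000" is an invented code.
root = Concept("information technology", "5000")
language = Concept("language processing and translation technology", "5140", broader=root)
mt = Concept("multilingual translation system", "5140",
             non_preferred=["multilingual translator"], broader=language)
print(mt.field_code, mt.depth())
```

A lattice-shaped thesaurus would need a list of broader concepts rather than a single parent.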

In the present invention, such a thesaurus is applied to applications that need field information and serves as a tool for providing that information.

FIG. 2 shows an example of a field classification system in which field classification codes are recorded, FIG. 3 shows an example of a thesaurus based on the field classification system of FIG. 2, and FIG. 4 is a flowchart of the method for constructing a document DB by thesaurus matching according to the present invention.

Referring to FIGS. 2 to 4, in an information retrieval system that links a field classification system and a thesaurus, a document to be stored in the document DB is read through the user computer, the web server, or the information retrieval server in order to form a document DB by thesaurus matching (S110).

When the document is input, its contents are read and stored so that they are extracted in a fixed format (S120). The text of the extracted contents is morphologically analyzed (S130), and index terms of the document, mainly nouns, are extracted from the result (S140).

The extracted index terms are matched against the preferred terms or non-preferred terms of the thesaurus based on the field classification system (S150); as needed, concept terms from narrower to broader may be used. When a match succeeds, the field classification code of the matching concept term is returned.

For example, when the index term "multilingual translation system" is extracted from an input document, a match is attempted against the preferred and non-preferred terms of the thesaurus. If the match succeeds, the field classification code of the matching concept term, "5140 (language processing and translation technology)", is returned.
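The lookup in this example can be sketched in Python as follows. The function `match_thesaurus`, the dictionary layout, and the non-preferred term are hypothetical; exact string comparison is assumed, since the description does not specify how terms are compared.

```python
def match_thesaurus(index_term, thesaurus):
    """Return the field classification code of the first matching concept, or None."""
    for concept in thesaurus:
        if index_term == concept["preferred"] or index_term in concept["non_preferred"]:
            return concept["field_code"]
    return None

# Illustrative thesaurus entries; only the codes 5140 and 4220 come from the description.
thesaurus = [
    {"preferred": "multilingual translation system",
     "non_preferred": ["multilingual translator"],
     "field_code": "5140"},
    {"preferred": "middleware", "non_preferred": [], "field_code": "4220"},
]

print(match_thesaurus("multilingual translation system", thesaurus))
print(match_thesaurus("unknown term", thesaurus))
```

Returning `None` on a failed match lets the caller simply skip index terms that have no concept in the thesaurus.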

When this matching is performed for all index terms in a document, field classification codes are assigned to the document. The result is stored in the document DB (S160).

Here, the document is analyzed by comparing conditions selected from the term frequency, the document frequency (inverse document frequency), the field classification code frequency, the concept depth, and the like of the index terms that matched the thesaurus, and the analysis result is used to calculate a probability value for each field classification code according to the selected conditions. The probability values for the top n field classification codes, assigned in order of magnitude, are stored in the document DB (S160).
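The description does not give a formula for the probability values, so the following Python sketch rests on an assumed weighting: each match contributes its term frequency multiplied by (1 + concept depth), and the scores are normalized so that they sum to 1 before the top n codes are kept. The function name and the input tuples are hypothetical.

```python
from collections import Counter

def field_code_probabilities(matches, top_n=2):
    """matches: (field_code, term_frequency, concept_depth) tuples for one document."""
    scores = Counter()
    for code, term_frequency, depth in matches:
        # Assumed weighting, not taken from the patent: deeper (more specific)
        # concepts and more frequent terms count for more.
        scores[code] += term_frequency * (1 + depth)
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]
    return [(code, score / total) for code, score in ranked]

# Illustrative matches for a single document.
matches = [("5140", 3, 2), ("4220", 1, 2), ("5140", 1, 1)]
for code, probability in field_code_probabilities(matches):
    print(code, round(probability, 3))
```

With this sample input, code 5140 scores 11 and code 4220 scores 3, giving probabilities of 11/14 and 3/14. Any other monotone combination of the listed conditions could be substituted for the weighting line.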

FIG. 5 shows an index term DB that stores a structure composed of index terms and document numbers for input documents, and FIG. 6 shows the structure in which the results of matching index terms against the thesaurus for input documents are stored in the document DB by the document DB construction method according to the present invention.

Thesaurus matching can also be used to recommend subjects instead of field classification codes. FIG. 7 shows the structure of a document DB formed by recommending subjects.

As shown in FIGS. 5 to 7, the information stored in the document DB is preferably the field classification codes or subjects selected in descending order of probability value, stored together with those probability values.

As new fields and terms increase rapidly with advances in information and communication technology, applying conventional information retrieval methods requires a system administrator or information manager to change the document DB by hand, so maintenance requires considerable cost and manpower. With the document DB construction method of the present invention, however, the document DB can be managed far more easily.

FIG. 8 is a flowchart of a document DB management method according to an embodiment of the present invention.

Referring to FIG. 8, when a new field or new term is extracted, through morphological analysis or the like, from documents at the time the document DB is to be updated (S210), the field classification system is updated with the new field or new term (S220).

The thesaurus is then updated according to the updated field classification system (S230). The field classification system and the thesaurus are updated by hand, but this is easy because only the new field or the related terms need to be added or changed.

The updated thesaurus is applied to match the index terms of the stored documents, and the analysis result is stored in the document DB (S240), forming a new document DB. Thesaurus matching with the updated thesaurus is carried out in the same way as in the document DB construction method described above.
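A minimal Python sketch of this re-matching step follows. It assumes the index terms of each stored document are kept and that the thesaurus is simplified to a flat term-to-code mapping; the function name, the term "semantic web", and the code "5999" are invented for illustration.

```python
def rebuild_document_db(stored_index_terms, thesaurus):
    """Re-run thesaurus matching over the stored index terms of every document."""
    document_db = {}
    for doc_id, terms in stored_index_terms.items():
        codes = sorted({thesaurus[term] for term in terms if term in thesaurus})
        document_db[doc_id] = codes
    return document_db

stored_index_terms = {1: ["middleware", "semantic web"], 2: ["semantic web"]}
thesaurus = {"middleware": "4220"}
before = rebuild_document_db(stored_index_terms, thesaurus)

# Manual step (S220, S230): an editor adds the new term with a hypothetical code.
thesaurus["semantic web"] = "5999"

# Automatic step (S240): the document DB is regenerated without touching the documents.
after = rebuild_document_db(stored_index_terms, thesaurus)
print(before)
print(after)
```

Only the thesaurus entry is edited by hand; every affected document picks up the new code on the next rebuild.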

Because modifying the document DB through the updated thesaurus is carried out automatically by a program, it can be done easily without the cost and manpower that were conventionally required, which greatly improves the maintainability of the information retrieval system.

The information retrieval method of the present invention, used in an information retrieval system that links a field classification system and a thesaurus and relies on the document DB described above, will now be described.

FIG. 9 is a flowchart of an information retrieval method according to an embodiment of the present invention. Components having substantially the same configuration and function are denoted by the same reference numerals.

First, the information retrieval method using a query is described. The query entered by the user is read (S310). The query may be entered as keywords or in natural language, and any search conditions requested by the user are entered with it and passed to the information retrieval server.

Morphological analysis is performed on the query (S330) to extract its index terms (S340), and the extracted index terms are compared with the index term DB 310 and the document DB 320 to carry out the search. Because, unlike the prior art, a document DB 320 formed by thesaurus matching is used, field classification codes or subjects that have a probability for the index terms are retrieved (S360).

Accordingly, with the information retrieval method of the present invention, the search results for a query are provided not only as documents but also as field classification codes or subjects. Referring to FIGS. 5 and 6, for example, when the query is "multilingual translation system", documents 1, 5, and 7 are presented as search results, and for document 5 the fields "5140 (language processing and translation technology)" and "4220 (middleware)" are recommended as suitable fields. The user can therefore grasp the field of each document and judge accurately whether it is the desired result.
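The query flow in this example can be sketched in Python as below. The function `search`, the two dictionary layouts, and the probability values 0.6 and 0.3 are assumptions for illustration; the document numbers and field codes follow the example in the text.

```python
def search(query_terms, index_db, document_db):
    """Return (document number, field codes with probabilities) for matching documents."""
    doc_ids = set()
    for term in query_terms:
        doc_ids.update(index_db.get(term, []))
    return [(doc_id, document_db.get(doc_id, [])) for doc_id in sorted(doc_ids)]

# Index term DB: index term -> document numbers.
index_db = {"multilingual translation system": [1, 5, 7]}
# Document DB: document number -> (field code, probability); probabilities are invented.
document_db = {5: [("5140", 0.6), ("4220", 0.3)]}

results = search(["multilingual translation system"], index_db, document_db)
for doc_id, fields in results:
    print(doc_id, fields)
```

Documents 1 and 7 come back with empty field lists here only because the sample document DB holds an entry for document 5 alone.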

The information retrieval method of the present invention for the case where a document is input instead of a query is described next.

When a document is input through the user computer, the purpose of the search is to find documents similar or related to the input document. The procedure for searching on a document is, as shown in FIG. 9, the same as the document DB construction method described above up to the step in which the index terms of the input document are matched through the thesaurus.

A document is input to the information retrieval server (S320), and its contents are extracted. When morphological analysis of the input document is complete (S330), index terms are extracted (S340).

The index terms of the input document are matched through the thesaurus, and the matching field classification codes or subjects are then assigned to the input document (S350). The probability values for the field classification codes or subjects are preferably calculated in the same way as described above.

Document matching is then performed by using the field classification codes or subjects to search the document DB 320 for similar entries (S370). Comparing the probability values of the top n field classification codes or subjects with the probability values assigned to the field classification codes or subjects in the document DB makes it possible to find highly similar documents more accurately.
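The description does not name a similarity measure, so the Python sketch below assumes cosine similarity over the probability values of the field classification codes. The function names and all sample probabilities are hypothetical.

```python
import math

def code_similarity(a, b):
    """Cosine similarity between two {field_code: probability} mappings."""
    dot = sum(p * b.get(code, 0.0) for code, p in a.items())
    norm_a = math.sqrt(sum(p * p for p in a.values()))
    norm_b = math.sqrt(sum(p * p for p in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_similar(input_codes, document_db):
    """Rank stored documents by similarity of their code probabilities to the input."""
    scored = [(doc_id, code_similarity(input_codes, codes))
              for doc_id, codes in document_db.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Illustrative probabilities only.
input_codes = {"5140": 0.7, "4220": 0.3}
document_db = {
    5: {"5140": 0.6, "4220": 0.3},
    9: {"4220": 0.9},
}
for doc_id, score in rank_similar(input_codes, document_db):
    print(doc_id, round(score, 3))
```

Document 5 ranks first because it shares both codes with the input in similar proportions, while document 9 shares only the weaker code.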

In addition, comparing the input document with the document DB for similarity, or checking for duplication, with respect to one or more conditions selected from the group consisting of field classification code, probability value, index term distribution, index term frequency, and author name is a preferred search method that improves retrieval accuracy.

The search results obtained by the information retrieval method of the present invention are preferably provided as documents clustered by similarity (S380).
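No clustering algorithm is specified for step S380. As a simple stand-in, the Python sketch below groups result documents by their highest-probability field classification code; the function name and the sample data are hypothetical, and a real system might use a proper clustering method instead.

```python
from collections import defaultdict

def cluster_by_top_code(results):
    """results: (doc_id, {field_code: probability}) pairs; group by the most probable code."""
    clusters = defaultdict(list)
    for doc_id, codes in results:
        top_code = max(codes, key=codes.get) if codes else None
        clusters[top_code].append(doc_id)
    return dict(clusters)

# Illustrative search results with invented probabilities.
results = [
    (1, {"5140": 0.8, "4220": 0.1}),
    (5, {"5140": 0.6, "4220": 0.3}),
    (9, {"4220": 0.9}),
]
print(cluster_by_top_code(results))
```

Documents with no assigned code fall into a group keyed by `None`, so they can still be shown to the user.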

When information retrieval is performed by interworking the field classification system (340) and the thesaurus (330) in this way, fields and subjects can be recommended. Furthermore, by taking a document as input and accurately finding similar or related documents for the user, the method extends to a variety of areas that conventional information retrieval methods cannot cover, such as report duplication checking and task similarity checking.

Although the configuration and operation of the present invention have been described and illustrated above with reference to the description and drawings, these are merely examples, and various changes and modifications are of course possible without departing from the technical spirit of the present invention and the scope of the claims.

As described above, the present invention achieves the following effects.

When a document DB is formed by the document DB forming method using thesaurus matching of the present invention, the formation is carried out automatically by the thesaurus matching method, unlike the prior art, so the document DB can be formed easily.

In addition, with the document DB management method of the present invention, the document DB can be changed automatically when the field classification system and the thesaurus are updated, so the document DB can be updated quickly even when new fields and terms arise from rapidly growing information.
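The update behavior can be pictured with the sketch below, which re-runs thesaurus matching over the stored index words after the thesaurus gains a new term. The data layout, the function name `rematch_document_db`, and the codes are assumptions for illustration only.

```python
def rematch_document_db(stored_index_words: dict, thesaurus: dict) -> dict:
    """Recompute each stored document's field codes against the current thesaurus."""
    document_db = {}
    for doc_id, words in stored_index_words.items():
        codes = sorted({code for w in words for code in thesaurus.get(w, [])})
        document_db[doc_id] = codes
    return document_db

stored = {"doc-1": ["thesaurus", "ontology"]}
thesaurus = {"thesaurus": ["IR01"]}
before = rematch_document_db(stored, thesaurus)

# A new term and a new field code (both hypothetical) are added to the thesaurus.
thesaurus["ontology"] = ["KR03"]
after = rematch_document_db(stored, thesaurus)

print(before, after)
```

Because only the thesaurus changes, the stored documents do not need to be read or analyzed again; their saved index words are simply matched once more.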

In addition, for a query, the information retrieval method of the present invention provides not only related documents but also information on related fields and subjects, so the results of information retrieval can be understood accurately.
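A query search that returns related fields alongside the matching documents might look like the following sketch. The document DB layout, the overlap-based matching rule, and the ordering of related fields by how many hits carry them are assumptions, not details given in the text.

```python
from collections import Counter

# Hypothetical document DB: each entry keeps index words and assigned field codes.
DOCUMENT_DB = {
    "doc-1": {"index_words": {"thesaurus", "matching"}, "field_codes": ["IR01"]},
    "doc-2": {"index_words": {"protein", "folding"}, "field_codes": ["BIO7"]},
    "doc-3": {"index_words": {"thesaurus", "index"}, "field_codes": ["IR01", "DB02"]},
}

def search(query_words: set, document_db: dict) -> dict:
    """Return documents sharing an index word with the query, plus their field codes."""
    hits = sorted(doc_id for doc_id, doc in document_db.items()
                  if query_words & doc["index_words"])
    field_counts = Counter(code for doc_id in hits
                           for code in document_db[doc_id]["field_codes"])
    # Most frequent field first; ties broken by code for a stable order.
    related = [code for code, _ in sorted(field_counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    return {"documents": hits, "related_fields": related}

result = search({"thesaurus"}, DOCUMENT_DB)
print(result)
```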

In addition, the information retrieval method of the present invention makes it easy to search by inputting a document and easy to grasp the similarity between documents.

In addition, searching by document input can be applied to a variety of areas that conventional information retrieval methods cannot cover, such as report duplication checking and task similarity checking.

Claims

A method for forming a document DB by thesaurus matching in an information retrieval system that interworks a field classification system and a thesaurus, the method comprising:

reading a document;

extracting the content of the document;

performing morphological analysis on the extracted document to extract index words of the document;

analyzing the index words by matching them against the thesaurus; and

storing the analysis result in a document DB.

The method of claim 1,

wherein the thesaurus matching is analyzed by comparing one or more conditions selected from the group consisting of term frequency, document frequency, field classification code frequency, and concept word depth of the index words.

The method of claim 1 or 2,

wherein the analysis result is information in which probability values for items are assigned to the document, in order, according to the conditions.

The method of claim 3,

wherein the items are field classification codes or subjects.

The method of claim 3,

wherein, for the items, the probability values of the field classification codes or subjects selected in descending order of probability value are stored.

A method for managing a document DB by thesaurus matching, the method comprising:

extracting a new field or a new term from a document;

updating a field classification system for the new field or the new term;

updating the thesaurus according to the updated field classification system or the new term;

analyzing the index words of stored documents by matching them against the updated thesaurus; and

storing the analysis result in a document DB.

The method of claim 6,

wherein the thesaurus matching is analyzed by comparing one or more conditions selected from the group consisting of term frequency, document frequency, field classification code frequency, and concept word depth of the index words.

An information retrieval method comprising:

reading a query;

performing morphological analysis on the query to extract index words of the query;

searching by comparing the index words with a document DB formed by thesaurus matching; and

providing the search results.

The information retrieval method of claim 8,

wherein the search results are provided together with field classification codes or subjects.

An information retrieval method comprising:

reading a document;

assigning to the document the field classification codes or subjects with which the index words are matched through the thesaurus;

searching by comparing the field classification codes or the subjects with a document DB; and

The information retrieval method of claim 10,

wherein similarity comparison or duplication checking is performed between the document and the document DB on one or more conditions selected from the group consisting of field classification code, probability value, index word distribution, index word frequency, and author name.

The information retrieval method of claim 11,

wherein the search results are provided as documents clustered by the similarity.