JP2008123095A

JP2008123095A - Retrieval terminal device, retrieval system, and program

Info

Publication number: JP2008123095A
Application number: JP2006304055A
Authority: JP
Inventors: Tetsuaki Daihi; 哲明大朏
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2006-11-09
Filing date: 2006-11-09
Publication date: 2008-05-29
Anticipated expiration: 2026-11-09
Also published as: JP5135766B2

Abstract

PROBLEM TO BE SOLVED: To generate a query not from a keyword but from a file. SOLUTION: A retrieval terminal device includes: a reference data inputting means for receiving the inputting of one or more reference data, each of which includes one or more words; a matrix generating means for generating a matrix, which comprises a plurality of reference vectors, based on one or more reference data which are input by the reference data inputting means, wherein the reference vectors include attribute vectors as the components and the attribute vectors include, as the components, the attributes of words included in the reference data; a statistic vector generating means for generating a plurality of statistic vectors based on the matrix generated by the matrix generating means, wherein each statistic vector has, as a component, a scalar obtained by performing statistic processing to the components of the attribute vectors; a query-generating means for generating a query, which includes one or more words, based on the plurality of statistic vectors generated by the statistic vector generating means; and a query-outputting means for outputting the query generated by the query-generating means. COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to search technology.

In search systems currently in use, keywords are entered directly as a query, and the search result is a ranked list of the documents that contain those keywords. In a web search, for example, the search engine extracts web pages that are likely to be relevant by a text search of its index, based on a popularity measure determined from keyword frequency and the like, and displays a list of the extracted pages. One known search algorithm is PageRank, which is used by Google (registered trademark). Another is the result clustering method disclosed in Patent Document 1. Techniques for index creation and search that use text and metadata are described, for example, in Patent Documents 2 to 4. An overview of search engines is given, for example, in Non-Patent Document 1.

Patent Document 1: JP 2006-018843 A
Patent Document 2: JP 2004-157826 A
Patent Document 3: JP 2005-209214 A
Patent Document 4: JP 2005-243012 A
Non-Patent Document 1: "Search Engine 2005: The Way to the Web", Information Processing Society of Japan, IPSJ Magazine, Vol. 46, No. 9, September 2005

The techniques described above have the problem that the desired result cannot be obtained unless appropriate keywords or phrases (sets of keywords) are used. For example, if a passage containing a large number of words is used as the keyword, the range of documents returned by the search becomes too broad, or no results are returned at all.
To address this, the present invention provides a technique for generating a query from a file (a document or content) rather than from keywords, and a technique for performing a search using that query.

To solve the above problem, the present invention provides a search terminal device comprising: reference data input means for receiving input of one or more pieces of reference data, each including one or more words; matrix generation means for generating, based on the one or more pieces of reference data input through the reference data input means, a matrix composed of a plurality of reference vectors, wherein each reference vector includes attribute vectors as components and each attribute vector includes, as components, attributes of a word included in the reference data; statistical vector generation means for generating a plurality of statistical vectors based on the matrix generated by the matrix generation means, wherein each statistical vector has, as a component, a scalar obtained by applying statistical processing to the components of an attribute vector; query generation means for generating a query including one or more words based on the plurality of statistical vectors generated by the statistical vector generation means; and query output means for outputting the query generated by the query generation means.
According to this search terminal device, a query can be generated from a file (reference data) rather than from keywords.

In a preferred aspect, the search terminal device may include clustering means for classifying the plurality of statistical vectors generated by the statistical vector generation means into a plurality of clusters, and the query generation means may generate the query by extracting words corresponding to one or more statistical vectors included in a cluster, among the plurality of clusters, that satisfies a certain condition.
According to this search terminal device, a query can be generated from words extracted by means of clustering.

In another preferred aspect, the query generation means of the search terminal device may calculate the importance of each cluster using a threshold function and the statistical vectors belonging to that cluster, and may generate the query using words extracted from at least the cluster with the highest importance.
According to this search terminal device, a query can be generated from words extracted from a cluster of high importance.

In yet another preferred aspect, the search terminal device may include importance calculation means for calculating, from the plurality of statistical vectors generated by the statistical vector generation means, the importance of the word corresponding to each statistical vector, and may generate a query including one or more words corresponding to those statistical vectors whose importance, as calculated by the importance calculation means, is equal to or greater than a threshold.
According to this search terminal device, a query including words of high importance can be generated.

In yet another preferred aspect, the search terminal device may include: search result acquisition means for acquiring search results corresponding to the query; category designation means for designating a category for the search results acquired by the search result acquisition means; a database that records the category designated by the category designation means, the search results, and the query; and narrowing means for narrowing down search results based on the contents recorded in the database when a new query is generated by the query generation means.
According to this search terminal device, search results can be narrowed down in accordance with user feedback.

In yet another preferred aspect, the attributes of the word may include at least one of: the file type of the reference data, the frequency of the word in the reference data, the relative position of the word in the reference data, the part of speech of the word, the type of font in which the word is displayed, the number of pages of the reference data, the color in which the word is displayed, the size of the font in which the word is displayed, link information in the reference data, and meta information attached to the reference data.

In yet another preferred aspect, the search terminal device may include word extraction means for extracting words from the reference data, the matrix generation means may generate the matrix based on the words extracted by the word extraction means and on the reference data, and the word extraction means may support a plurality of languages.
According to this search terminal device, a query can be generated from files written in a plurality of languages.

The present invention also provides a search system comprising: reference data input means for receiving input of one or more pieces of reference data, each including one or more words; matrix generation means for generating, based on the one or more pieces of reference data input through the reference data input means, a matrix composed of a plurality of reference vectors, wherein each reference vector includes attribute vectors as components and each attribute vector includes, as components, attributes of a word included in the reference data; statistical vector generation means for generating a plurality of statistical vectors based on the matrix generated by the matrix generation means, wherein each statistical vector has, as a component, a scalar obtained by applying statistical processing to the components of an attribute vector; query generation means for generating a query including one or more words based on the plurality of statistical vectors generated by the statistical vector generation means; and query output means for outputting the query generated by the query generation means.
According to this search system, a query can be generated from a file (reference data) rather than from keywords.

The present invention further provides a program that causes a computer device to execute: a reference data input step of receiving input of one or more pieces of reference data, each including one or more words; a matrix generation step of generating, based on the one or more pieces of reference data, a matrix composed of a plurality of reference vectors, wherein each reference vector includes attribute vectors as components and each attribute vector includes, as components, attributes of a word included in the reference data; a statistical vector generation step of generating a plurality of statistical vectors based on the matrix, wherein each statistical vector has, as a component, a scalar obtained by applying statistical processing to the components of an attribute vector; a query generation step of generating a query including one or more words based on the plurality of statistical vectors; and a query output step of outputting the query.
According to this program, a query can be generated from a file (reference data) rather than from keywords.

1. First embodiment
1-1. Configuration
FIG. 1 is a block diagram showing the functional configuration of a search system 1 according to a first embodiment of the present invention. A reference data input unit 10 receives input of reference data from the user. Here, “reference data” means the information from which a query (a search request including one or more keywords) used in a search is generated. The reference data is designated by, for example, a file name, drag and drop, a URL (Uniform Resource Locator), or a link. The reference data is not one or more words (keywords) as such, but a document file, a text file, an HTML (HyperText Markup Language) file, or a link to such a file. A “link” is information indicating where a file is located. The reference data input unit 10 outputs the designated file to an analyzer unit 11. The reference data input unit 10 also generates metadata from the reference data. “Metadata” is information indicating attributes of a document, or of characters or character strings included in the document. The metadata may be attached to the document in advance; alternatively, the reference data input unit 10 may generate it from the document (reference data) according to a predetermined algorithm. The reference data input unit 10 outputs the generated metadata to a query generation unit 12.

The analyzer unit 11 analyzes the file input as reference data and extracts words, using a similar dictionary 13 (also called a “lexicon”) as needed. The query generation unit 12 generates a multidimensional matrix M based on the words output from the analyzer unit 11 and the metadata output from the reference data input unit 10. The matrix M is composed of a plurality of reference vectors. Each reference vector is one row vector of the matrix M and corresponds to a word included in the reference data. A “reference vector” is a vector that includes attribute vectors as its components. An “attribute vector” is a vector whose components are attributes of a word. For each reference vector in the matrix M, the query generation unit 12 applies statistical processing to at least some of its elements (attribute vectors), thereby converting each attribute vector into a scalar. A vector whose elements are these scalars is called a “statistical vector”. Each statistical vector corresponds to one word. The query generation unit 12 then classifies, that is, groups, the statistical vectors (in other words, the words) into a plurality of clusters, selects at least one of the clusters, and generates a query based on the words included in the selected cluster. A search unit 14 transmits the generated query to a search server 200 and acquires from the search server 200 the search results corresponding to the query. A known search device is used as the search server 200.

A search result organizing unit 16 limits the search results acquired by the search unit 14 according to a predetermined algorithm and outputs the outcome to a user interface 19. The user interface 19 presents the search results to the user. A feedback unit 17 receives the user's feedback on the search results, input through the user interface 19. An extraction/classification unit 15 updates a category database 18 based on the feedback. The category database 18 contains search results (link information), the queries used to obtain them, and the corresponding categories. When a search is performed after the category database 18 has been updated, the extraction/classification unit 15 can use the search results and queries corresponding to a category designated by the user as search conditions. The categories, queries, and linked documents are not limited to a single language. A “category” is a kind, concept, or classification of one or more words. For example, a category is assigned to a combination of nouns, or to a combination of an adjective and a noun. Specifically, the category “car”, for example, includes words (or word groups) such as “sports car” and “racing car”.
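
The role of the category database 18 can be illustrated with a small in-memory sketch. The class name `CategoryDatabase`, its methods, and the narrowing rule (keep only links already filed under the chosen category, and fall back to the full list when none match) are assumptions made for illustration; the description states only which items the database records and that they can be used as search conditions.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryDatabase:
    # Each record is a (category, query, result link) triple, mirroring the
    # three kinds of content the description lists for the database.
    records: list = field(default_factory=list)

    def add_feedback(self, category, query, link):
        """Store one piece of user feedback."""
        self.records.append((category, query, link))

    def narrow(self, category, results):
        """Assumed rule: keep results already filed under `category`.

        If no result is known for the category, return the list unchanged.
        """
        known = {link for cat, _, link in self.records if cat == category}
        kept = [r for r in results if r in known]
        return kept or list(results)
```

In this sketch a call such as `db.narrow("car", results)` would be made after a new query has returned `results`.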

FIG. 2 is a diagram showing the hardware configuration of a search terminal 100 according to the present embodiment. A CPU (Central Processing Unit) 110 is a control device that controls each component of the search terminal 100. A ROM (Read Only Memory) 120 is a storage device that stores the data and programs needed to start the search terminal 100. A RAM (Random Access Memory) 130 is a storage device that serves as a work area when the CPU 110 executes a program. An I/F (Interface) 140 is an interface for exchanging data and control signals with various input/output devices and storage devices. An HDD (Hard Disk Drive) 150 is a storage device that stores various programs and data; in the present embodiment, the HDD 150 stores a program for generating query information. A keyboard/mouse 160 is an input device with which the user gives instructions to the search terminal 100. A display 170 is an output device that displays data contents, processing status, and the like. A network IF 180 is an interface for sending and receiving data to and from other devices connected via a network (not shown). The search terminal 100 can, for example, receive a document (more precisely, electronic data representing a document) via the network and the network IF 180, and can likewise communicate with the search server 200. The CPU 110, ROM 120, RAM 130, and I/F 140 are connected via a bus 190.
When the CPU 110 executes the program, the search terminal 100 provides the functions of the reference data input unit 10, the analyzer unit 11, the query generation unit 12, the similar dictionary 13, and the search unit 14 shown in FIG. 1. In the following description, a statement with a functional element as its subject, such as “the reference data input unit 10 performs ...”, means that the CPU 110 executing the program performs the processing. The same applies to the other elements. The search terminal 100 may be any device that has the functional configuration shown in FIG. 1 and the hardware configuration shown in FIG. 2. For example, the search terminal 100 may be a so-called personal computer, or it may be an information display device having a display that includes a liquid crystal layer with a memory property.

1-2. Operation
FIG. 3 is a flowchart showing the operation of the search system 1. In step S100, the reference data input unit 10 receives, in response to a user operation, input of reference data, that is, information identifying a file (content).

FIG. 4 is a diagram illustrating a user interface provided by the reference data input unit 10. The window 403 shows the reference data (more precisely, information indicating the location of the reference data). The window 401 shows candidates for reference data. The user can designate reference data by entering a URL or a file path in the window 402, or by dragging and dropping an item from the window 401 to the window 403. The user may also specify search options. For example, clicking the button 404 displays a user interface for entering search options. Search options are information that restricts the search conditions, such as author, creation date and time, and file size. Search options are acquired as metadata of the file search conditions. The reference data input unit 10 outputs the file identified by the input reference data to the analyzer unit 11 and outputs the metadata to the query generation unit 12.

The description continues with reference to FIG. 3. In step S110, the analyzer unit 11 analyzes the file and extracts words. The analyzer unit 11 has two main functions. One is an analyzer for languages written with spaces between words (English, German, French, and so on); the other is an analyzer for languages written without such spaces (Japanese, Chinese, Korean, and so on). The former is hereinafter called the standard analyzer (Standard Analyzer) and the latter the CJK analyzer (CJK Analyzer). The standard analyzer uses an ordinary parser. Here, a parser means processing that separates the words one at a time. Each separated piece is called a token. The standard analyzer can obtain tokens from the spaces (blanks) between words. Punctuation marks are removed by a filter. The CJK analyzer obtains tokens using the similar dictionary 13.
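
The two analyzers can be sketched as follows. This is only an illustration under stated assumptions, not the analyzer of the embodiment: the function names `standard_tokens` and `cjk_tokens`, the regular expression used as the punctuation filter, and the greedy longest-match strategy are all hypothetical. The description says only that the standard analyzer splits on spaces and filters punctuation, and that the CJK analyzer consults the similar dictionary 13.

```python
import re


def standard_tokens(text):
    """Split space-delimited text into tokens and filter out punctuation."""
    tokens = []
    for raw in text.split():
        # Punctuation filter: strip non-word characters at either end.
        word = re.sub(r"^\W+|\W+$", "", raw)
        if word:
            tokens.append(word)
    return tokens


def cjk_tokens(text, lexicon):
    """Greedy longest-match segmentation against a word list (assumed strategy)."""
    max_len = max((len(w) for w in lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens
```

Characters that are not covered by the word list come out as single-character tokens, so the quality of the segmentation depends entirely on the dictionary supplied.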

The analyzer unit 11 outputs the extracted tokens to the query generation unit 12 as a list for each document. In addition to extracting tokens, the analyzer unit 11 may extract word information (token information). The word information includes, for example, one or more of the font type, font size, frequency of occurrence, position information, and color of a word in each file.

In step S120, the query generation unit 12 generates the matrix M based on the tokens and word information output from the analyzer unit 11 and the metadata output from the reference data input unit 10. The matrix M contains information such as, for each word (token), the font in which it is set, its font size, its frequency, and its relative position. More specifically, an element M_n of the matrix M can be expressed as a vector, as shown in the following equation. The vector M_n is a reference vector (row vector) constituting the matrix M.
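
The equation itself does not appear in this text. Judging only from the component definitions in the paragraphs that follow, it plausibly lists those components in the order in which they are defined. The form below is a reconstruction under that assumption, not the original equation:

```latex
M_n = \left( W,\; F_i,\; f_i,\; P_i,\; \mathit{Fnt}_i,\; \mathit{Fnts}_i,\; C_i,\; U_i,\; T_i,\; Q_i \right)
```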

W is an integer that corresponds to, that is, is mapped to, a word (token) occurring in the files. The component of the vector W serves as the identifier of the word. The vector F_i has as its components the integers that correspond to (are mapped to) the individual files. The components of the vector F_i serve as file identifiers. The mapping between these integers and the file locations, specifically file paths or URLs, is held in a management table separate from the matrix M.

The vector f_i has as its components integers indicating the frequency of the word in each file. The higher the frequency of a word, the more closely the file and the word are considered to be related. Certain specific words (sometimes called “stop words”; English examples are “the”, “of”, “and”, and “for”) carry no meaning even when their frequency is high, so these words may be excluded.
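
A per-file frequency count with stop-word removal can be sketched as below. The function name `word_frequencies`, the lower-casing of tokens, and the output layout (one list of counts per word, indexed by file) are assumptions made for this example; the stop-word set contains only the four examples given in the text.

```python
from collections import Counter

STOPWORDS = {"the", "of", "and", "for"}  # the examples named in the text


def word_frequencies(tokens_per_file, stopwords=STOPWORDS):
    """Return {word: [count in file 0, count in file 1, ...]} without stop words."""
    counts = [
        Counter(t.lower() for t in tokens if t.lower() not in stopwords)
        for tokens in tokens_per_file
    ]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words that are absent from a file.
    return {word: [c[word] for c in counts] for word in vocabulary}
```

Each list in the result plays the role of the frequency components for one word across the input files.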

The vector P_i has as its components integers indicating the absolute position of the word where it occurs in each file. From this information, information on phrases made up of several words (collocations and idioms; hereinafter “phrase information”) can be obtained. The phrase information is used at the stage where the query information is generated.
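
One simple way to derive phrase information from absolute positions is to look for words that occupy consecutive positions. The sketch below assumes that positions are token indices within a single file and that a phrase is a pair of adjacent words; both the function name `adjacent_pairs` and this definition of a phrase are illustrative assumptions.

```python
def adjacent_pairs(positions_in_file):
    """Find word pairs that occur back to back in one file.

    positions_in_file maps each word to the list of its absolute positions.
    Returns a set of (first word, second word) tuples.
    """
    by_position = {p: w for w, ps in positions_in_file.items() for p in ps}
    return {
        (word, by_position[pos + 1])
        for pos, word in by_position.items()
        if pos + 1 in by_position
    }
```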

The vector Fnt_i has as its components integers indicating the type of font (font type) of the word in each file. The components of the vector Fnt_i serve as font identifiers. The mapping between the integers and the font types is held in a management table. The font type information is used as follows. For example, when one type of font occurs frequently in a file and a different type of font is used for a particular word, that word is considered to be emphasized.
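
The emphasis heuristic described above can be sketched in a few lines. The sketch assumes one font identifier per word in a single file and treats the most frequent identifier as the file's ordinary font; the function name `emphasized_words` and this simplification are assumptions, not part of the description.

```python
from collections import Counter


def emphasized_words(font_by_word):
    """Flag words whose font differs from the most common font in the file.

    font_by_word maps each word to an integer font identifier.
    """
    if not font_by_word:
        return set()
    dominant, _ = Counter(font_by_word.values()).most_common(1)[0]
    return {word for word, font in font_by_word.items() if font != dominant}
```

The same comparison could be applied to font sizes, which the description also treats as a sign of emphasis.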

The vector Fnts_i has as its components integers indicating the font size of the word in each file. The mapping between the integers and the font sizes is held in a management table. Like the font type, the font size is used as information indicating that the word is emphasized.

The vector C_i has as its components integers indicating the color of the word in each file. The mapping between the integers and the font colors is held in a management table. When a single word contains several colors, the color of the word may be decided according to the frequency of each color. Alternatively, another color, such as an intermediate color between them, may be taken as the color of the word.

The vector U_i has as its components integers indicating what type or kind of word the word in each file is in its language, that is, the part of speech of the word. The mapping between the integers and the parts of speech is held in a management table. For example, nouns and adjectives are often the parts of speech that express the concept of a file, so their importance is high. Proper nouns are more important than common nouns.

The vector T_i has as its components flags indicating whether the word in each file is used for a specific purpose, such as a title, a link, or a footnote. The flag values are integers and are mapped by a management table.

The vector Q_i has as its components parameters indicating the category or subject of the word in each file. The subject of a word is mapped to an integer by the similar dictionary 13. Each category is mapped to an integer.

The components of the matrix M are not necessarily limited to the vectors described above. The components of the matrix M may include at least one of: the file type of the reference data (HTML, text, document, and so on), the frequency of the word in the reference data, the relative position of the word in the reference data, the part of speech of the word, the type of font in which the word is displayed, the number of pages of the reference data, the color in which the word is displayed, the size of the font in which the word is displayed, link information in the reference data, and metadata (meta information) attached to the reference data. In the vectors above, the subscript i is a parameter indicating the file (that is, a symbol bearing the subscript i denotes a vector).

In step S130, the query generation unit 12 analyzes the matrix M. More specifically, the query generation unit 12 performs statistical processing on at least some of the attribute vectors and reduces each attribute vector to a scalar. The scalars calculated in this way become the elements of a statistical vector. In other words, a reference vector (whose components include vectors) is converted into a statistical vector (whose components are scalars). The query generation unit 12 then classifies the statistical vectors into a plurality of clusters. This process is called the "clustering process".

In step S140, the query generation unit 12 selects at least one cluster from the plurality of clusters using a predetermined algorithm. The query generation unit 12 generates a query based on the group of statistical vectors (a vector space, that is, a collection of words) belonging to the selected cluster.

In step S150, the search result organizing unit 16 narrows down the search results obtained with the query, and the user interface 19 provides those search results to the user. In step S160, the extraction/classification unit 15 updates the category database 18 according to the feedback result.

1-2-1. Clustering Process
FIG. 5 is a diagram for explaining the clustering process. As described above, the query generation unit 12 analyzes the matrix M and calculates a plurality of parameters by statistical processing. The statistical vector includes these parameters as its components. FIG. 5 shows the calculated parameters. The occurrence commonness T_θ is a probability indicating how commonly a word is present across all files; T_θ is a real number satisfying 0 ≦ T_θ ≦ 1. The occurrence frequency commonness T_f indicates, for a word that is common to all files, how similar its frequency is across them; T_f is a real number satisfying 0 ≦ T_f ≦ 1. The word type T_t is a real number to which the kind of word is mapped, and satisfies 0 ≦ T_t ≦ 1. The word type T_t is obtained from U_i of M_β, that is, from U_i, which is a component of the vector M_β relating to a given word. Specifically, T_t is obtained by converting the vector U_i into a scalar (for example, a real number between 0 and 1). The link commonness L indicates whether the word is included in a link; L is an integer, for example 0 or 1. The title commonness S indicates whether the word is included in a title; S is an integer, for example 0 or 1. The category commonness K indicates whether the word is included in a query in the existing category database; K is an integer, for example 0 or 1. The subject Q indicates the concept of the word. Word concepts are mapped to real numbers between 0 and 1. The mapping may be based on the similarity dictionary 13, which is preferably editable by the user. Note that the statistical vector need not include all of the above parameters as its components; it is sufficient that it includes at least one of them.

Using the above parameters, the query generation unit 12 performs the clustering process according to a clustering algorithm. Known algorithms such as k-means clustering, hierarchical clustering, eigenvalue clustering, and PCA clustering can be used as the clustering algorithm. A clustering algorithm requires an operation called a distance measure. Known measures such as the Euclidean distance, the squared Euclidean distance, the Manhattan distance, the Chebyshev distance, and the Pearson distance can be used as the distance measure.
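Four of the distance measures named above can be written in a few lines each. The following is a minimal sketch, assuming each statistical vector is represented as a tuple of numbers of equal length; the function names are illustrative and do not appear in the specification, and the Pearson distance is omitted here.

```python
import math

def euclidean(a, b):
    # Square root of the summed squared component differences.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def squared_euclidean(a, b):
    # Same as Euclidean but without the square root.
    return sum((p - q) ** 2 for p, q in zip(a, b))

def manhattan(a, b):
    # Sum of absolute component differences.
    return sum(abs(p - q) for p, q in zip(a, b))

def chebyshev(a, b):
    # Largest absolute component difference.
    return max(abs(p - q) for p, q in zip(a, b))
```

Any of these could be passed to a clustering routine as its distance function; which one suits the parameters of FIG. 5 best is not stated in the text.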

Specifically, using the above parameters, the query generation unit 12 associates a vector x with each word as in the following equation.

Next, the query generation unit 12 classifies each of the vectors x_1 to x_n into one of a predetermined number of clusters C_1 to C_m.

FIG. 6 is a diagram illustrating vectors classified into clusters. Next, the query generation unit 12 calculates the average vectors V_1 to V_m of the vectors included in each cluster. The query generation unit 12 then identifies a vector to be processed from among the vectors x_1 to x_n and calculates the distance between that vector and each of the average vectors V_1 to V_m. The query generation unit 12 reclassifies the vector being processed into the cluster corresponding to the average vector at the shortest calculated distance. Once all vectors have been reclassified, the query generation unit 12 recalculates the average vectors and repeats the reclassification iteratively. The number of iterations may be settable by the user. When the Euclidean distance is used, for example, the distance D between two vectors is calculated as follows, where a_k and b_k denote the k-th components of the two vectors.

$$D = \sqrt{\sum_{k} (a_k - b_k)^2}$$
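The reclassification loop described above corresponds to the usual k-means iteration. The sketch below is one possible reading under stated assumptions: vectors are tuples, the initial cluster assignment is supplied by the caller, the Euclidean distance is used, and a cluster that becomes empty is simply skipped. The names `k_means`, `mean_vector`, and `initial_labels` are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def mean_vector(vectors):
    # Component-wise average of a non-empty list of vectors.
    n = len(vectors)
    return tuple(sum(v[k] for v in vectors) / n for k in range(len(vectors[0])))

def k_means(vectors, initial_labels, num_clusters, iterations=10):
    labels = list(initial_labels)
    for _ in range(iterations):
        # Average vector of each cluster (None for an empty cluster).
        means = []
        for j in range(num_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            means.append(mean_vector(members) if members else None)
        # Reassign every vector to the cluster with the nearest average vector.
        new_labels = []
        for v in vectors:
            candidates = [(euclidean(v, m), j)
                          for j, m in enumerate(means) if m is not None]
            new_labels.append(min(candidates)[1])
        if new_labels == labels:
            break  # assignments are stable, so further iterations change nothing
        labels = new_labels
    return labels
```

The `iterations` argument plays the role of the user-settable iteration count mentioned in the text; the early exit on a stable assignment is an added convenience, not something the specification requires.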

The query generation unit 12 generates a query based on the set of words obtained by the above method. For example, the query generation unit 12 selects one or more clusters according to a predetermined algorithm, and determines as the query a word included in the selected cluster, or a phrase or clause containing that word. The query generation unit 12 may limit the number of clusters by filtering.

1-2-2. Specific Example of Processing
FIG. 7 is a diagram illustrating files used as reference data. The process of generating a query from these files is described concretely below. File 1 is an HTML file with the title "Sports car" and the body text "A sports car is a type of automobile designed primarily for performance driving." File 2 is a document file with the title "Super fast sports car" and the body text "From the royally elegant Bentley Continental GT to the legendary but Spartan Enzo Ferrari, the content has been chosen carefully." In addition to the data representing these character strings, the HTML file and the document file contain information indicating the font type, font size, character color, and so on.

The query generation unit 12 generates the matrix M from file 1 and file 2. To keep the explanation simple, an example using only three parameters (word frequency f, font type Fnt, and font size Fnts) is described here. Likewise, although more than 30 words appear in file 1 and file 2 combined, only six of them are described.

FIG. 8 is a diagram illustrating the matrix M. To aid understanding, the components of the matrix M are shown here by their meaning rather than as numbers; in practice, numbers representing these contents are stored. The file index indicates the files in which the word appears. The word frequency indicates how many times the word appears in each file. The font type indicates which font types the word is displayed in. The font size indicates which font sizes the word is displayed in.

Take the word "sports" as an example: it appears twice in file 1 and once in file 2. The word frequency f is therefore a vector whose components are "2" and "1". The font of "sports" is "type 1" and "type 2" in file 1 (where it appears twice) and "type 2" in file 2. The font type Fnt is therefore a vector whose components are "type 1" and "type 2" for file 1 and "type 2" for file 2. Similarly, the font size of "sports" is "12" and "10" in file 1 and "12" in file 2. The font size Fnts is therefore a vector whose components are "12" and "10" for file 1 and "12" for file 2.
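To make the shape of one row of the matrix M concrete, the entry for "sports" might be held as follows. This is only an illustrative sketch: the dictionary layout, the key names, and the helper `occurrences_consistent` are assumptions, and the specification says that in practice numbers rather than labels such as "type 1" are stored.

```python
# One hypothetical row of the matrix M for the word "sports".
# Each attribute holds one entry per file listed in "file_index".
sports_row = {
    "word": "sports",
    "file_index": [1, 2],                        # files containing the word
    "f": [2, 1],                                 # occurrences per file
    "Fnt": [["type 1", "type 2"], ["type 2"]],   # font type per occurrence
    "Fnts": [[12, 10], [12]],                    # font size per occurrence
}

def occurrences_consistent(row):
    # The per-occurrence attributes must list exactly f entries for each file.
    return all(
        len(fonts) == count and len(sizes) == count
        for count, fonts, sizes in zip(row["f"], row["Fnt"], row["Fnts"])
    )
```

The attribute vectors are nested (one list per file), which is what the later statistical processing flattens into scalars.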

Next, the query generation unit 12 performs statistical processing on the attribute vectors that are (at least some of) the components of the matrix M, converting each attribute vector into a scalar. In other words, the statistical vectors are calculated. The statistical vectors are denoted x_1 to x_n. Since n = 6 here, the statistical vectors x_1 to x_6 corresponding to the words in FIG. 8 are calculated.

FIG. 9 is a diagram illustrating the statistical processing parameters for each word. The occurrence commonness T_θ indicates the proportion of all files that contain the word. For example, if the number of files is m and the number of files containing the word is t, the occurrence commonness T_θ of the word is calculated by the following equation.

$$T_\theta = \frac{t}{m}$$
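The ratio described above (files containing the word, t, over all files, m) can be sketched directly. The function name and the representation of a word as a list of per-file counts, with 0 for a file that lacks the word, are assumptions made for illustration.

```python
def occurrence_commonness(frequencies):
    # frequencies[i] is the number of times the word appears in file i.
    m = len(frequencies)                       # number of files
    t = sum(1 for f in frequencies if f > 0)   # files that contain the word
    return t / m
```

For "sports", which appears in both files of the example, this gives 1.0; a word that appears in only one of the two files would give 0.5.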

The occurrence frequency commonness T_f is calculated using the coefficient of variation. First, the query generation unit 12 calculates the average frequency x^ of the word across all files according to the following equation ("x^" stands for x with a bar over it, a notation that cannot be used in the running text of this specification). Here m is the number of files and x_i is the frequency of the word in file i.

$$\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i$$

For example, for the word "sports", x^ = 1.5, as shown in the following equation.

$$\bar{x} = \frac{2 + 1}{2} = 1.5$$

Next, the query generation unit 12 calculates the standard deviation s according to the following equation.

$$s = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (x_i - \bar{x})^2}$$

For example, for the word "sports", s = 0.5, as shown in the following equation.

$$s = \sqrt{\frac{(2 - 1.5)^2 + (1 - 1.5)^2}{2}} = 0.5$$

Next, the query generation unit 12 calculates the coefficient of variation (relative standard deviation) RSD according to the following equation. For example, for the word "sports", RSD = 0.5 / 1.5 ≈ 0.33. An RSD closer to 0 indicates that the word is used with more similar frequency across the files, and an RSD closer to 1 indicates that its frequency is less similar. Here, the coefficient of variation RSD is used as the occurrence frequency commonness T_f.

$$\mathrm{RSD} = \frac{s}{\bar{x}}$$
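The three quantities above (average frequency, standard deviation, and RSD) can be checked numerically for "sports". The sketch below assumes the population standard deviation, since that is the form that reproduces the stated value s = 0.5 for frequencies 2 and 1; the function name is illustrative.

```python
import statistics

def relative_standard_deviation(frequencies):
    # frequencies[i] is the number of times the word appears in file i.
    mean = statistics.mean(frequencies)
    # Population standard deviation (divides by m, not m - 1).
    s = statistics.pstdev(frequencies)
    return mean, s, s / mean

mean, s, rsd = relative_standard_deviation([2, 1])
print(mean, s, round(rsd, 2))
```

With the sample standard deviation (dividing by m - 1) the result for "sports" would be about 0.71 rather than 0.5, so the choice matters when comparing against FIG. 9.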

Next, the font rarity T_fnt is described. The query generation unit 12 calculates the font rarity T_fnt according to the following equation. Here, "font frequency" means the ratio of the number of times a given font is used for a word to the total number of occurrences of that word.

For example, for the word "a" and the word "sports", the font rarity T_fnt is calculated as in the following equations. The closer T_fnt is to 0, the rarer (less frequent, that is, more emphasized) the font type in which the word is displayed. The font size rarity T_fnts can be calculated in the same manner as the font rarity.
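The "font frequency" defined in the text is a plain ratio and can be sketched as below. Note that this computes only the font frequency, not the font rarity T_fnt itself, because the rarity equation is not reproduced in this text. The function name and the flat list of per-occurrence font labels are assumptions.

```python
from collections import Counter

def font_frequencies(fonts_per_occurrence):
    # Share of a word's occurrences that use each font type.
    counts = Counter(fonts_per_occurrence)
    total = len(fonts_per_occurrence)
    return {font: n / total for font, n in counts.items()}

# "sports": type 1 once and type 2 twice across its three occurrences.
sports_fonts = font_frequencies(["type 1", "type 2", "type 2"])
```

The same helper applies unchanged to font sizes, in line with the statement that T_fnts is calculated in the same manner.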

Although the above example limits the number of words to six and the number of parameters to three, the numbers of words and parameters are not limited to these. The query generation unit 12 may calculate all required parameters for all or some of the words. The statistical processing used to reduce an attribute vector to a scalar is also not limited to that described above. Any statistic may be used, such as the mean, mode, median, variance, standard deviation, the number of elements satisfying a given condition, or a combination of these.

FIG. 10 shows an example in which the clustering process is applied to the words illustrated in FIG. 9. Here, for example, the k-means method is used. The six words are classified into two clusters, C_1 and C_2. In this example, cluster C_2 is larger than cluster C_1. The size of a cluster means its spatial extent when the words are plotted in the parameter space used. For example, when two parameters are used, it is the area of the cluster (the set of words) when the words are plotted in a two-dimensional coordinate space.

The query generation unit 12 generates a query based on the clusters generated as described above. For example, the query generation unit 12 uses a cluster that satisfies predetermined conditions. When the three parameters described above are used, conditions such as the following are applied, for example.
(1) The occurrence commonness of the cluster is high.
(2) The occurrence frequency of the cluster is high.
(3) The font type of the words belonging to the cluster is rare.
(4) The font size of the words belonging to the cluster is rare.
Based on these conditions, the query generation unit 12 selects cluster C_1. The query generation unit 12 determines the words "sports" and "car", which belong to cluster C_1, as the query.

When other parameters are used, conditions relating to those parameters may be applied. For example, when the part of speech is used as a parameter, the condition may be that the proportion of nouns and proper nouns is high. The query generation unit 12 may select a cluster not only by comparing parameter values between clusters but also by comparing them with threshold values. Alternatively, the query generation unit 12 may select a cluster by judging the contributions of several parameters together, using a threshold function with a different weighting factor for each parameter. Furthermore, the query generation unit 12 may shrink or expand a cluster so that its size satisfies a predetermined condition.

FIG. 11 is a diagram illustrating a threshold function. The threshold function Sum_i is expressed, for example, by the following equation (13). In FIG. 11, the parameter w denotes a weighting factor. For S_Fi, if the number of shared files exceeds the threshold J, the weighting factor is set to w_0; otherwise it is set to zero. For S_fi, in the case where the word is a stop word and in the case where the word frequency f_i exceeds the threshold α, the weighting factor is set to w_1; otherwise it is set to zero. For S_Fnti, if Fnt_i is smaller than the threshold μ, that is, if the font type is a rare one with a low frequency, the weighting factor is set to w_2; otherwise it is set to zero. For S_Fntsi, if Fnts_i is larger than the threshold β, that is, if the font size is large, the weighting factor is set to w_3; otherwise it is set to zero. For S_Ci, if C_i is smaller than the threshold ω, that is, if the color is a rare one with a low frequency, the weighting factor is set to w_4; otherwise it is set to zero. For S_Ui, the weighting factor is set to w_5 if U_i indicates a proper noun, to w_6 if U_i indicates a common noun, and to w_7 if U_i indicates an adjective; otherwise it is set to zero. For S_Ti, if it indicates that the word is used in a title, a chapter heading, or the like, the weighting factor is set to w_8; otherwise it is set to zero. If the threshold function satisfies Sum_i ≧ Sum_t, the statistical vector is selected. If Sum_i does not satisfy this condition, the statistical vector is not selected. The specific values of the weighting factors may be determined by trial and error. The same applies to the cases described above as giving a weighting factor of zero. Each term may also be assigned a priority, and the value of the threshold function Sum_i may be calculated from only some of the parameters according to this priority.
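The per-term rules above suggest a weighted score that is compared against Sum_t. Since equation (13) is not reproduced in this text, the sketch below assumes that Sum_i is simply the sum of the individual terms; that assumption, the dictionary keys, and the function name are all illustrative. The S_fi term is left out because its condition is not clear from the description.

```python
def threshold_sum(word, w, th):
    # Assumed form: Sum_i is the plain sum of the per-parameter terms.
    total = 0.0
    if word["shared_files"] > th["J"]:          # S_Fi: shared by many files
        total += w["w0"]
    if word["font_type_freq"] < th["mu"]:       # S_Fnti: rare font type
        total += w["w2"]
    if word["font_size"] > th["beta"]:          # S_Fntsi: large font size
        total += w["w3"]
    if word["color_freq"] < th["omega"]:        # S_Ci: rare color
        total += w["w4"]
    pos_weight = {"proper_noun": w["w5"],       # S_Ui: part of speech
                  "common_noun": w["w6"],
                  "adjective": w["w7"]}
    total += pos_weight.get(word["pos"], 0.0)
    if word["in_title"]:                        # S_Ti: used in a title or heading
        total += w["w8"]
    return total

weights = {f"w{k}": 1.0 for k in range(9)}
thresholds = {"J": 1, "mu": 0.5, "beta": 11, "omega": 0.5}
sports = {"shared_files": 2, "font_type_freq": 1 / 3, "font_size": 12,
          "color_freq": 1.0, "pos": "common_noun", "in_title": True}
selected = threshold_sum(sports, weights, thresholds) >= 4.0  # 4.0 stands in for Sum_t
```

All numeric values here are made up for the example; the text states only that the weights may be found by trial and error.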

1-2-3. Search Process
The search server 200 performs a search according to the query transmitted from the search terminal 100. As already mentioned, the search server 200 is a well-known search engine. The search terminal 100 generates a query compatible with such a search engine. The search server 200 transmits the search results for the query to the search terminal 100.

1-2-4. Organizing Search Results
The search result organizing unit 16 compares the search results obtained for the query with the search options and narrows down the search results according to the comparison. That is, the search result organizing unit 16 keeps only the search results that meet the conditions given by the search options and discards those that do not. Alternatively, the search result organizing unit 16 may narrow down the search results so that their number does not exceed a predetermined threshold.

1-2-5. Providing Search Results and Feedback
The user interface 19 provides the narrowed-down search results to the user. The search results may be ranked by the search server 200. The user checks the search results and enters feedback on them. That is, the user enters a category for the search results via the user interface 19. At this point, the user may specify that only particular search results among those provided are to be recorded in the category database 18. The extraction/classification unit 15 adds the entered category, the search results (link information), and the query used to obtain those search results to the category database 18. The next time a search is performed, the extraction/classification unit 15 can narrow down the search results based on the category. The category may also be specified as a search option.

1-2-6. Searching after a Database Update
When a new search is performed after the category database 18 has been updated, the extraction/classification unit 15 can narrow down the search results based on the contents stored in the category database 18. For example, from the search results recorded in the category database 18, the extraction/classification unit 15 may provide the user with only those whose corresponding query and category (or at least part of them) match the query and category (search option) used for the new search. Alternatively, the extraction/classification unit 15 may extract from the category database 18 the search results whose category matches any one of a plurality of categories specified by the user, and provide them to the user.

2. Second Embodiment
Next, a second embodiment of the present invention is described. Elements common to the first embodiment are described using the same reference numerals, and descriptions of matters common to the first embodiment are omitted. In the second embodiment, a heuristic method is used in place of the clustering process. The "heuristic method" is a method of selecting words based on an importance calculated by a formula. Specifically, words whose importance is at or below a predetermined threshold are removed by a filter, and only words whose importance is above the threshold are adopted.

The query generation unit 12 calculates the importance of each word using the components of the statistical vector (that is, the parameters obtained by statistical processing) and a predetermined function. The function is, for example, one that multiplies each parameter by a weighting factor and sums the results, but any function may be used. The query generation unit 12 narrows down the words based on importance; that is, it extracts only the words whose importance is above the threshold. The query generation unit 12 then searches each piece of reference data for combinations of two or more of the extracted words (or combinations similar to them). This is done using a known approximate search technique. When a combination of words is detected in each piece of reference data, the query generation unit 12 adopts that combination as the query. Note that the query generation unit 12 may adopt words used as titles in the reference data as the query regardless of the above processing.
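The heuristic method can be sketched as a weighted sum followed by a filter and a co-occurrence check. This is a simplified illustration under stated assumptions: the parameter names, weights, and threshold are invented, only pairs of words are considered, and exact word matching stands in for the approximate search technique mentioned in the text.

```python
from itertools import combinations

def importance(params, weights):
    # Weighted sum of a word's statistical-vector components.
    return sum(weights[name] * value for name, value in params.items())

def select_words(stat_vectors, weights, threshold):
    # Keep only words whose importance is above the threshold.
    return [word for word, params in stat_vectors.items()
            if importance(params, weights) > threshold]

def cooccurring_pairs(words, documents):
    # Pairs of selected words that appear together in every reference text.
    docs = [set(d.lower().split()) for d in documents]
    return [pair for pair in combinations(words, 2)
            if all(set(pair) <= d for d in docs)]

stat_vectors = {
    "sports": {"T_theta": 1.0, "S": 1},
    "car": {"T_theta": 1.0, "S": 1},
    "a": {"T_theta": 0.5, "S": 0},
}
weights = {"T_theta": 0.5, "S": 0.5}
kept = select_words(stat_vectors, weights, threshold=0.6)
pairs = cooccurring_pairs(kept, ["Sports car", "Super fast sports car"])
```

Here the two titles from the earlier example serve as the reference texts, and the surviving pair would become the query.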

If the number of words whose importance exceeds the threshold is at or below a predetermined number, the query generation unit 12 may update the importance threshold and narrow down the words again.

The query generation unit 12 may also use a so-called fuzzy search technique. That is, even if the spelling differs slightly, the combination of words may be treated as having been detected in the reference data. When the number of words adopted for the query exceeds a threshold, the query generation unit 12 may use the logical product (AND) of the words as the query. The query generation unit 12 may also limit the number of words so that the number of words used in the query does not exceed a threshold.
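The two rules at the end of this paragraph (cap the number of words, and join them with AND when there are many) might be combined as follows. The function name, the default limits, and the textual `AND` operator are assumptions; the actual operator syntax depends on the search engine that receives the query.

```python
def build_query(words, max_words=5, and_threshold=1):
    # Cap the number of words used in the query.
    limited = words[:max_words]
    # Above the threshold, require all words (logical product).
    if len(limited) > and_threshold:
        return " AND ".join(limited)
    return " ".join(limited)
```

For the example words this yields the string `sports AND car`, while a single word is passed through unchanged.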

3. Other Embodiments
The present invention is not limited to the above-described embodiments, and various modifications are possible.
In the above-described embodiments, the functional components shown in FIG. 1 (excluding the search server 200) were described as functions of the search terminal 100. However, a search system composed of a plurality of computer devices may, as a whole, have the functional configuration shown in FIG. 1. In that case, one or more of the extraction/classification unit 15, the search result organizing unit 16, the feedback unit 17, the category database 18, and the user interface 19 may be omitted.

In the first embodiment, the parameters used for the clustering process may be determined by user specification. The conditions for selecting one or more clusters from the plurality of clusters after the clustering process may also be determined by user specification. The threshold in the second embodiment may likewise be determined by user specification.

The functional components shown in FIG. 1 may each be provided as a separate program module. In this case, each program module may be individually installable and uninstallable. Each program module, or a set of program modules, may be provided on a storage medium such as a CD-ROM (Compact Disk Read Only Memory).

Note that the search terminal 100 need not have all of the components shown in FIG. 2. The search terminal 100 may be an information display device that does not have the HDD 150, the keyboard/mouse 160, or the network IF 180.

FIG. 1 is a block diagram showing the functional configuration of the search system 1 according to the first embodiment.
FIG. 2 is a diagram showing the hardware configuration of the search terminal 100.
FIG. 3 is a flowchart showing the operation of the search system 1.
FIG. 4 is a diagram illustrating a user interface provided by the reference data input unit 10.
FIG. 5 is a diagram for explaining the clustering process.
FIG. 6 is a diagram illustrating vectors classified into clusters.
FIG. 7 is a diagram illustrating files used as reference data.
FIG. 8 is a diagram illustrating the matrix M.
FIG. 9 is a diagram illustrating the statistical processing parameters for each word.
FIG. 10 shows an example in which the clustering process is applied to the words illustrated in FIG. 9.
FIG. 11 is a diagram illustrating a threshold function.

Explanation of symbols

1 ... search system, 10 ... reference data input unit, 11 ... analyzer unit, 12 ... query generation unit, 13 ... similarity dictionary, 14 ... search unit, 15 ... extraction/classification unit, 16 ... search result organizing unit, 17 ... feedback unit, 18 ... category database, 19 ... user interface, 100 ... search terminal, 110 ... CPU, 120 ... ROM, 130 ... RAM, 140 ... I/F, 150 ... HDD, 160 ... keyboard/mouse, 170 ... display, 180 ... network IF, 190 ... bus, 200 ... search server, 401 ... window, 402 ... window, 403 ... window, 404 ... button

Claims

A search terminal device comprising:
reference data input means for receiving input of one or more pieces of reference data, each including one or more words;
matrix generating means for generating, based on the one or more pieces of reference data input by the reference data input means, a matrix composed of a plurality of reference vectors, each reference vector including attribute vectors as components, and each attribute vector including, as components, attributes of a word included in the reference data;
statistical vector generating means for generating a plurality of statistical vectors based on the matrix generated by the matrix generating means, each statistical vector having, as a component, a scalar obtained by applying statistical processing to the components of the attribute vectors;
query generating means for generating a query including one or more words based on the plurality of statistical vectors generated by the statistical vector generating means; and
query output means for outputting the query generated by the query generating means.
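
As an illustration only, the chain of means recited in this claim (matrix, statistical vectors, query) can be sketched in Python. The claim does not fix the attributes, the statistical processing, or the selection rule, so everything concrete below is an assumption: the two numeric attributes (frequency, relative position), the use of the mean as the statistic, the top-k selection, and the names `build_matrix`, `statistical_vectors`, and `generate_query`.

```python
from statistics import mean

# Hypothetical input: each piece of reference data maps a word to its
# attribute vector [frequency, relative position]. Illustrative values only.
reference_data = [
    {"printer": [5.0, 0.1], "ink": [2.0, 0.4]},
    {"printer": [3.0, 0.2], "ink": [4.0, 0.6], "paper": [1.0, 0.9]},
]

def build_matrix(docs):
    """Return (words, matrix). Each row is one reference vector whose
    components are the attribute vectors of the words (None if absent)."""
    words = sorted({w for d in docs for w in d})
    matrix = [[d.get(w) for w in words] for d in docs]
    return words, matrix

def statistical_vectors(words, matrix):
    """For each word, reduce every attribute to one scalar. Assumption:
    the statistic is the mean over the reference data containing the word."""
    stats = {}
    for j, w in enumerate(words):
        present = [row[j] for row in matrix if row[j] is not None]
        stats[w] = [mean(col) for col in zip(*present)]
    return stats

def generate_query(stats, k=2):
    """Assumed rule: keep the k words with the largest mean frequency."""
    ranked = sorted(stats, key=lambda w: stats[w][0], reverse=True)
    return " ".join(ranked[:k])

words, matrix = build_matrix(reference_data)
stats = statistical_vectors(words, matrix)
query = generate_query(stats)
print(query)  # prints: printer ink
```

Other statistics (variance, maximum) or other selection rules would fit the same structure; the mean and top-k are used here only because they are the simplest choices.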

The search terminal device according to claim 1, further comprising clustering means for classifying the plurality of statistical vectors generated by the statistical vector generating means into a plurality of clusters,
wherein the query generating means generates the query by extracting, from among the plurality of clusters, the words corresponding to one or more statistical vectors included in a cluster that satisfies a certain condition.

The search terminal device according to claim 2, wherein the query generating means calculates an importance of each cluster using a threshold function and the statistical vectors belonging to each cluster, and generates the query using words extracted from at least the cluster having the highest importance.
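
A minimal sketch of cluster-based query generation follows, under stated assumptions. The claims do not specify the clustering algorithm, the shape of the threshold function, or how it is combined with the statistical vectors, so this example assumes the clusters are already given, uses a step-shaped threshold function, and defines a cluster's importance as the sum of the thresholded first components. The names `threshold` and `cluster_importance` and all values are hypothetical.

```python
# Hypothetical, already-clustered statistical vectors:
# cluster id -> {word: statistical vector}.
clusters = {
    0: {"printer": [4.0, 0.15], "ink": [3.0, 0.5]},
    1: {"paper": [1.0, 0.9]},
}

def threshold(x, theta=2.0):
    """Assumed step-shaped threshold function: 1 if x >= theta, else 0."""
    return 1.0 if x >= theta else 0.0

def cluster_importance(vectors):
    """Assumed importance: sum of the thresholded first components of the
    statistical vectors that belong to the cluster."""
    return sum(threshold(v[0]) for v in vectors.values())

importance = {cid: cluster_importance(vs) for cid, vs in clusters.items()}

# Use the words of the cluster with the highest importance as the query.
best = max(importance, key=importance.get)
query = " ".join(sorted(clusters[best]))
print(query)  # prints: ink printer
```

Because the claim says "at least" the most important cluster, an implementation could also append words from lower-ranked clusters; this sketch uses only the top one.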

The search terminal device according to claim 1, further comprising importance calculating means for calculating, from the plurality of statistical vectors generated by the statistical vector generating means, an importance of the word corresponding to each statistical vector,
wherein a query is generated that includes one or more words corresponding to those statistical vectors, among the plurality of statistical vectors, whose importance calculated by the importance calculating means is equal to or greater than a threshold value.
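
The per-word variant can be sketched in the same way. The claim does not state how importance is derived from a statistical vector, so this example assumes a weighted sum of its components; the weights, the threshold value, and the name `word_importance` are hypothetical.

```python
# Hypothetical statistical vectors, one per word.
stats = {
    "printer": [4.0, 0.15],
    "ink": [3.0, 0.5],
    "paper": [1.0, 0.9],
}

WEIGHTS = (1.0, 0.5)   # assumed weight of each component
THRESHOLD = 2.0        # assumed importance threshold

def word_importance(vector):
    """Assumed importance: weighted sum of the statistical vector's components."""
    return sum(w * x for w, x in zip(WEIGHTS, vector))

# Keep every word whose importance is at or above the threshold.
query_words = [w for w, v in stats.items() if word_importance(v) >= THRESHOLD]
print(" ".join(query_words))  # prints: printer ink
```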

The search terminal device according to claim 1, further comprising:
search result acquiring means for acquiring a search result corresponding to the query;
category specifying means for specifying a category of the search result acquired by the search result acquiring means;
a database that records the category specified by the category specifying means, the search result, and the query; and
narrowing means for narrowing down search results based on the recorded contents of the database when a new query is generated by the query generating means.

The search terminal device according to claim 1, wherein the attributes of the word include at least one of: the file type of the reference data, the frequency of the word in the reference data, the relative position of the word in the reference data, the part of speech of the word, the type of font in which the word is displayed, the number of pages of the reference data, the color in which the word is displayed, the font size in which the word is displayed, link information in the reference data, and meta information added to the reference data.

The search terminal device according to claim 1, further comprising word extracting means for extracting words from the reference data,
wherein the matrix generating means generates the matrix based on the words extracted by the word extracting means and the reference data, and
the word extracting means supports a plurality of languages.

A search system comprising:
reference data input means for receiving input of one or more pieces of reference data, each including one or more words;
matrix generating means for generating, based on the one or more pieces of reference data input by the reference data input means, a matrix composed of a plurality of reference vectors, each reference vector including attribute vectors as components, and each attribute vector including, as components, attributes of a word included in the reference data;
statistical vector generating means for generating a plurality of statistical vectors based on the matrix generated by the matrix generating means, each statistical vector having, as a component, a scalar obtained by applying statistical processing to the components of the attribute vectors;
query generating means for generating a query including one or more words based on the plurality of statistical vectors generated by the statistical vector generating means; and
query output means for outputting the query generated by the query generating means.

A program for causing a computer device to execute:
a reference data input step of receiving input of one or more pieces of reference data, each including one or more words;
a matrix generating step of generating, based on the one or more pieces of reference data, a matrix composed of a plurality of reference vectors, each reference vector including attribute vectors as components, and each attribute vector including, as components, attributes of a word included in the reference data;
a statistical vector generating step of generating a plurality of statistical vectors based on the matrix, each statistical vector having, as a component, a scalar obtained by applying statistical processing to the components of the attribute vectors;
a query generating step of generating a query including one or more words based on the plurality of statistical vectors; and
a query output step of outputting the query.