JPH10307837A

JPH10307837A - Retrieval device and recording medium recording retrieval program

Info

Publication number: JPH10307837A
Application number: JP9119403A
Authority: JP
Inventors: Kenichi Kuromushiya; 健一黒武者; Ikuo Karashi; 育雄芥子
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1997-05-09
Filing date: 1997-05-09
Publication date: 1998-11-17

Abstract

PROBLEM TO BE SOLVED: To shorten the time needed to find target WWW (World Wide Web) data by displaying, when WWW data is retrieved, a summary sentence of the WWW data together with the uniform resource locator (URL) of the retrieved data. SOLUTION: Retrieval starts when a retrieval request sentence is input from an input means 1. The request sentence is sent to a retrieval means 2, which searches the URLs stored in a URL storage means 3 by a method such as keyword retrieval or semantic retrieval. The retrieved URLs are sent to an output means 7, the summary sentences of the WWW data corresponding to each URL are read from a summary sentence storage means 6, and each URL is output together with the summary sentences of its WWW data. In addition, because sentences containing high-frequency words are selected according to the frequency of the nouns contained in each sentence, summary sentences matched to the retrieval request sentence are generated.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a device for searching WWW (World Wide Web) data on the Internet.

[0002]

2. Description of the Related Art

With the spread of the Internet, the amount of WWW data on the Internet has become enormous, and finding the data one needs among it has become a difficult task. So-called WWW search engines have therefore appeared for searching this WWW data. A WWW search engine collects WWW data on the Internet and holds the URL (Uniform Resource Locator) of each item, together with part of the WWW data, in a storage device on the search engine, which makes it possible to search a large amount of WWW data at once.

[0003] Japanese Patent Application Laid-Open No. Hei 7-146875 discloses an information retrieval apparatus that displays search results in summarized form. When image data is searched, the apparatus performs summary display by showing only a specific page of the image data as the search result. That is, after retrieving the image data files that match the search condition, it reduces only a specific page, for example the first page, of each retrieved image data item, creates abstract image pages on which a predetermined number of these reduced pages are arranged per page, and prints the abstract image data. Because the apparatus creates and prints, as an abstract of the search results, image data that combines a plurality of reduced specific pages, unnecessary image data files are no longer printed in full.

[0004] As a document summarizing apparatus, there is Japanese Patent Application Laid-Open No. Hei 6-215049, invented by inventors including the present inventors. It uses context vectors to find the paragraph or sentence closest in meaning to the document as a whole, or the sentence closest in meaning to each paragraph, and presents these as the summary. That is, a document input from a document input unit is divided into paragraphs, sentences, and words by a document analysis unit; feature vectors of the words, sentences, paragraphs, and the input document are generated using a word dictionary; the distances between the feature vectors of the input document, the paragraphs, and the sentences are calculated from these feature vectors; and a summary of the input document is generated based on the distances between the feature vectors. Because the summary is generated from the result of analyzing the input document with feature vectors in this way, a high-quality important part of the document can be extracted as a summary by simple processing, without assuming the format or context of the document. The apparatus calculates the feature-vector distance between the input document and each paragraph, between the input document and the sentences of each paragraph, between each paragraph and the sentences within that paragraph, or between the input document and each sentence. It then extracts, as the summary of the input document, one or more of the following: the paragraph closest to the input document, the sentence in each paragraph closest to the input document, the sentence within each paragraph closest to that paragraph, and the several sentences closest to the input document. A high-quality important part of the document can thus be extracted by simple processing.

[0005]

PROBLEMS TO BE SOLVED BY THE INVENTION

However, even when WWW data is searched with a WWW search engine, the only information displayed on the screen or printed as output at search time is the URL of the WWW data, together with the anchor strings of links to that WWW data, frequently occurring words contained in the WWW data, and the like. In many cases, therefore, the content cannot be understood without actually looking at the data at that URL. Actually looking at it requires fetching the data over the network, which takes time, so a great deal of time is still spent finding the target WWW data even after the search.

[0006] After the target WWW data has been found in this way, Japanese Patent Application Laid-Open No. Hei 7-146875 could be used to reduce only a specific page of the retrieved data, for example the first page, and create an abstract image page. In this case too, however, the important information is not necessarily on the specific page, for example the first page, and the overall content may not be understandable from the specific page alone. A summary could also be created using Japanese Patent Application Laid-Open No. Hei 6-215049, but in this case sentences cannot be extracted directly from HTML data. Even if the HTML tags are removed by some method, the text is formatted by means of HTML tags, so punctuation marks are not always present and sentence extraction may fail.

[0007]

MEANS FOR SOLVING THE PROBLEMS

The search device according to claim 1 of the present invention is a search device having URL storage means that holds a large number of URLs of WWW data existing on the Internet, input means for inputting a search request sentence, and search means for searching the URLs held in the URL storage means. The device is characterized by further comprising: sentence extraction means that, for the HTML data specified by a URL stored in the URL storage means, acquires the HTML data from the Internet, recognizes the punctuation marks and HTML tags in the HTML data, and extracts the sentences contained in the HTML data; summary sentence selection means that selects, from among the sentences extracted by the sentence extraction means, summary sentences that summarize the HTML data; summary sentence storage means that holds the summary sentences selected by the summary sentence selection means as the summary sentences for the HTML data; and output means that reads from the summary sentence storage means the summary sentences corresponding to the search data retrieved by the search means and displays them.

[0008] The search device according to claim 2 of the present invention is the search device according to claim 1, characterized in that the summary sentence selection means selects sentences containing high-frequency words according to the frequency of the nouns contained in each sentence.

[0009] The search device according to claim 3 of the present invention is the search device of claim 1, characterized in that the summary sentence selection means comprises: sentence input means for inputting a plurality of sentences; a word dictionary that holds words in association with word vectors representing the features of those words; word extraction means that uses the word dictionary to extract the words contained in the input sentences; vector generation means that uses the word dictionary and the word extraction means to generate, from the words contained in an input text, a data vector representing the features of that text; distance calculation means that calculates the distance between two data vectors; and sentence selection means that uses the vector generation means to generate a vector for the entire text input from the sentence input means and a vector for each sentence, uses the distance calculation means to calculate the distance between the vector of the entire text and the vector of each sentence, and stores several of the sentences closest to the entire text in the summary sentence storage means.

[0010] The recording medium recording a search program according to claim 4 of the present invention is a medium recording a search program for causing a computer to function as URL storage means that holds a large number of URLs of WWW data existing on the Internet, input means for inputting a search request sentence, and search means for searching the URLs held in the URL storage means. It is characterized in that the program further causes the computer to function as: sentence extraction means that, for the HTML data specified by a URL stored in the URL storage means, acquires the HTML data from the Internet, recognizes the punctuation marks and HTML tags in the HTML data, and extracts the sentences contained in the HTML data; summary sentence selection means that selects, from among the sentences extracted by the sentence extraction means, summary sentences that summarize the HTML data; summary sentence storage means that holds the summary sentences selected by the summary sentence selection means as the summary sentences for the HTML data; and output means that reads from the summary sentence storage means the summary sentences corresponding to the search data retrieved by the search means and displays them.

[0011] The recording medium recording a search program according to claim 5 of the present invention is the medium recording the search program according to claim 4, characterized in that the summary sentence selection means comprises: sentence input means for inputting a plurality of sentences; a word dictionary that holds words in association with word vectors representing the features of those words; word extraction means that uses the word dictionary to extract the words contained in the input sentences; vector generation means that uses the word dictionary and the word extraction means to generate, from the words contained in an input text, a data vector representing the features of that text; distance calculation means that calculates the distance between two data vectors; and sentence selection means that uses the vector generation means to generate a vector for the entire text input from the sentence input means and a vector for each sentence, uses the distance calculation means to calculate the distance between the vector of the entire text and the vector of each sentence, and stores several of the sentences closest to the entire text in the summary sentence storage means.

[0012] According to the invention of claim 1 described above, when WWW data is searched, summary sentences of the WWW data are displayed together with the URL of the retrieved WWW data. The content of the data can therefore be known without actually looking at the WWW data, and the time needed to find the target WWW data can be shortened. Moreover, rather than merely extracting high-frequency words as in conventional methods, the summary sentences are generated by automatically selecting suitable sentences from those extracted using the punctuation marks and HTML tags in the WWW data. The content of the WWW data can thus be understood from sentences, which is easier to follow than an explanation consisting only of words.

[0013] According to the invention of claim 2, sentences containing high-frequency words are selected according to the frequency of the nouns contained in each sentence, so summary sentences matched to the search request sentence can be generated.

[0014] According to the invention of claim 3, summary sentences can be generated automatically through semantic processing: the distance between the vector of the entire text and the vector of each sentence is calculated, and the sentences closest to the entire text are selected. This can produce more accurate summary sentences than frequency analysis and similar methods.

[0015] According to the invention of claim 4, a computer can be made to function as the search device according to claim 1.

[0016] According to the invention of claim 5, a computer can be made to function as the search device according to claim 3.

[0017]

BEST MODE FOR CARRYING OUT THE INVENTION

(Embodiment 1) The search device operates as described below. These operations are carried out according to a search program. The search program is held on a storage medium such as magnetically recorded tape or a floppy disk, a ROM recorded in semiconductor memory, or an optically recorded CD-ROM, and is used after being loaded into the memory of the search device, which is a computer system.

[0018] FIG. 1 shows the schematic configuration of the search device according to the first invention. The search device is a computer system comprising: input means 1 for inputting a search request sentence, composed of a keyboard, an optical character reader (OCR), a communication line, a removable external storage device, or the like; search means 2 for searching URLs; URL storage means 3, such as a hard disk, that stores URLs; sentence extraction means 4 that acquires the WWW data at a specified URL from the Internet 8 and extracts sentences from the acquired data using the punctuation marks and HTML (HyperText Markup Language) tags in that data; summary sentence selection means 5 that selects, from the sentences extracted by the sentence extraction means 4, the sentences that will serve as summary sentences; summary sentence storage means 6, such as a hard disk, that stores the summary sentences; and output means 7, such as a display, that simultaneously displays each URL in the search results and the summary sentences of the WWW data specified by that URL. The search program controls the operation of the input means 1, the search means 2, the sentence extraction means 4, the summary sentence selection means 5, and the output means 7.

[0019] Next, the search operation of the search device according to the first invention is described. A search starts when a search request sentence is input from the input means 1. The input search request sentence is sent to the search means 2, which searches the URLs stored in the URL storage means 3 using an existing search method such as keyword search or semantic search. The retrieved URLs are sent to the output means 7, the summary sentences of the WWW data corresponding to those URLs are read from the summary sentence storage means 6, and each URL is output together with the summary sentences of its WWW data.

[0020] FIG. 3 shows an example of the URL data stored in the URL storage means 3. In this example, the data is saved as a text file in which the first line of each entry holds the URL and the second line holds a brief description of that URL, so that two lines form one set of data. FIG. 3 thus shows three sets of URL data. FIG. 4 shows an example of the summary sentences of WWW data stored in the summary sentence storage means 6. In this example too, the data is saved as a text file: the first line holds the URL and the second to fourth lines hold the summary sentences, so that four lines form one set of data. In this form, the summary sentences of the WWW data corresponding to the URLs stored in the URL storage means 3 are stored in the summary sentence storage means 6. FIG. 4 shows the summary sentences of the WWW data corresponding to three URLs.
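The two text-file layouts described above (two lines per URL entry, four lines per summary entry) could be read with a routine like the following. This is only a minimal sketch: the function name `parse_records` and the in-memory sample strings are assumptions made for illustration, and only the two entries whose contents appear in the description are included, since the third entry of FIG. 3 and FIG. 4 is not reproduced in the text.

```python
def parse_records(text, lines_per_record):
    """Group non-empty lines into fixed-size records keyed by their first line (the URL)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % lines_per_record != 0:
        raise ValueError("file does not contain a whole number of records")
    records = {}
    for start in range(0, len(lines), lines_per_record):
        url, *rest = lines[start:start + lines_per_record]
        records[url] = rest
    return records


# Sample contents in the layout of FIG. 3: URL line, then a one-line description.
URL_FILE = """\
http://www.foo.co.jp/
Travel agency
http://www.boo.or.jp/html1/test.html
Domestic travel
"""

# Sample contents in the layout of FIG. 4: URL line, then three summary lines.
SUMMARY_FILE = """\
http://www.foo.co.jp/
We offer a wide variety of travel plans.
Hawaii, 3 nights / 4 days, super cheap!
Gold Coast and Sydney, 7 nights / 9 days
http://www.boo.or.jp/html1/test.html
For domestic travel, come to our shop!
Cherry blossom season, now at its best!
Okinawa resort plan
"""

url_data = parse_records(URL_FILE, 2)       # URL -> [description]
summaries = parse_records(SUMMARY_FILE, 4)  # URL -> [three summary sentences]
```

Keying both tables by URL is a design choice of this sketch; it makes it straightforward to look up the stored summary for any URL that a search returns.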

[0021] For example, suppose that the search request sentence input through the input means 1 is "travel" and that the search means 2 performs a keyword search, which outputs the data whose keywords match. Among the data shown in FIG. 3, the URLs output as search results are the two entries that contain "travel":

    http://www.foo.co.jp/
    Travel agency

    http://www.boo.or.jp/html1/test.html
    Domestic travel

These two URLs are sent to the output means 7, and the summary sentences of the WWW data corresponding to these URLs are read from the summary sentence storage means 6. The retrieved URLs and the summary sentences of the corresponding WWW data are then displayed together on the output means 7, such as a display, as follows:

    http://www.foo.co.jp/
    We offer a wide variety of travel plans.
    Hawaii, 3 nights / 4 days, super cheap!
    Gold Coast and Sydney, 7 nights / 9 days

    http://www.boo.or.jp/html1/test.html
    For domestic travel, come to our shop!
    Cherry blossom season, now at its best!
    Okinawa resort plan
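The keyword search and combined output in this example can be sketched as follows. This is an illustrative sketch rather than the patented implementation: the function names, the case-insensitive substring match, and the third, non-matching entry (`http://www.example.com/`) are assumptions added for the example, since the description does not reproduce the third entry of FIG. 3.

```python
# URL data as (URL, description) pairs, following the layout of FIG. 3.
URL_DATA = [
    ("http://www.foo.co.jp/", "Travel agency"),
    ("http://www.boo.or.jp/html1/test.html", "Domestic travel"),
    ("http://www.example.com/", "Cooking recipes"),  # hypothetical non-matching entry
]

# Stored summary sentences keyed by URL, following the layout of FIG. 4.
SUMMARIES = {
    "http://www.foo.co.jp/": [
        "We offer a wide variety of travel plans.",
        "Hawaii, 3 nights / 4 days, super cheap!",
        "Gold Coast and Sydney, 7 nights / 9 days",
    ],
    "http://www.boo.or.jp/html1/test.html": [
        "For domestic travel, come to our shop!",
        "Cherry blossom season, now at its best!",
        "Okinawa resort plan",
    ],
}


def keyword_search(url_data, query):
    """Return the URLs whose entry (URL or description) contains the query string."""
    needle = query.casefold()
    return [
        url
        for url, description in url_data
        if needle in url.casefold() or needle in description.casefold()
    ]


def format_results(urls, summaries):
    """Render each retrieved URL followed by its stored summary sentences."""
    blocks = ["\n".join([url, *summaries.get(url, [])]) for url in urls]
    return "\n\n".join(blocks)


hits = keyword_search(URL_DATA, "travel")
report = format_results(hits, SUMMARIES)
```

Because the summaries are looked up from storage rather than fetched at query time, displaying them adds no network access to the search itself.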

[0022] The following describes how the summary sentences of WWW data stored in the summary sentence storage means 6 are generated. The sentence extraction means 4 reads a URL stored in the URL storage means 3 and acquires the WWW data corresponding to that URL from the Internet 8. As an example, consider generating summary sentences for the WWW data corresponding to http://www.foo.co.jp/, one of the URLs shown in FIG. 3. First, the WWW data corresponding to this URL is acquired from the Internet 8. The WWW data indicated by the URL can be obtained by opening a socket connection to the host machine on the Internet 8 indicated by the URL (www.foo.co.jp), sending the URL to that host in accordance with HTTP (HyperText Transfer Protocol), and receiving the response. FIG. 5 shows an example of the acquired WWW data. The sentence extraction means 4 extracts sentences from the WWW data acquired in this way. Punctuation marks and HTML tags are used as sentence delimiters during extraction. Examples of punctuation marks used as delimiters are "。", "．", "！", and "？". FIG. 6 shows examples of HTML tags used as delimiters.
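The acquisition step (socket connection to the host, HTTP request, response) might look like the sketch below. It assumes a plain `http://` URL and an HTTP/1.0-style request, which are choices made here for simplicity rather than details given in the description; the function names `build_request` and `fetch` are likewise hypothetical.

```python
import socket
from urllib.parse import urlsplit


def build_request(url):
    """Build a minimal HTTP/1.0 GET request for the path part of the URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"GET {path} HTTP/1.0\r\nHost: {parts.hostname}\r\n\r\n".encode("ascii")


def fetch(url, timeout=10.0):
    """Open a socket to the URL's host, send the request, and return the response body."""
    parts = urlsplit(url)
    address = (parts.hostname, parts.port or 80)
    with socket.create_connection(address, timeout=timeout) as sock:
        sock.sendall(build_request(url))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when the response is complete
                break
            chunks.append(data)
    # Split the response headers from the body at the first blank line.
    _headers, _separator, body = b"".join(chunks).partition(b"\r\n\r\n")
    return body
```

A call such as `fetch("http://www.foo.co.jp/")` would return the raw bytes of the page; decoding them requires knowing the page's character encoding, which this sketch leaves to the caller.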

[0023] Using these punctuation marks and HTML tags as sentence delimiters, the text with the HTML tags removed is output as the sentence extraction result. Extracting sentences from the WWW data of FIG. 5 according to this rule gives the result shown in FIG. 7. The sentences extracted in this way are sent to the summary sentence selection means 5. The summary sentence selection means 5 uses an existing summary sentence selection method to select, from the input sentences, those suitable as a summary of the whole, and stores them in the summary sentence storage means 6.
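The extraction rule (split at punctuation marks and at delimiter tags, then drop all tags) can be sketched with regular expressions. The set of delimiter tags below is an assumption, because the actual list in FIG. 6 is not reproduced in the text, and the sample page is likewise invented for illustration rather than taken from FIG. 5.

```python
import re

# Punctuation marks named in the description as sentence delimiters.
SENTENCE_PUNCTUATION = "。．！？"

# Assumed block-level tags that end a sentence; the real list is given in FIG. 6.
DELIMITER_TAGS = {"title", "p", "br", "hr", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

TAG_PATTERN = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def extract_sentences(html):
    """Split HTML text into sentences at delimiter tags and punctuation, removing all tags."""

    def replace_tag(match):
        # Delimiter tags become sentence breaks; every other tag is simply dropped.
        return "\n" if match.group(1).lower() in DELIMITER_TAGS else ""

    text = TAG_PATTERN.sub(replace_tag, html)
    # Insert a break after each sentence-ending punctuation mark, keeping the mark.
    text = re.sub(f"([{SENTENCE_PUNCTUATION}])", r"\1\n", text)
    return [piece.strip() for piece in text.split("\n") if piece.strip()]


SAMPLE_PAGE = (
    "<html><head><title>旅行代理店</title></head><body>"
    "<h1>ようこそ</h1>"
    "<p>さまざまな旅行プランを取り揃えております。ハワイ３泊４日激安！"
    "<br>ゴールドコースト、<b>シドニー</b>７泊９日"
    "</body></html>"
)

sentences = extract_sentences(SAMPLE_PAGE)
```

Treating inline tags such as `<b>` as removable rather than as breaks keeps a phrase like "ゴールドコースト、シドニー７泊９日" intact, while the `<br>` before it supplies the sentence boundary that the missing punctuation does not.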

[0024] As an example, consider selecting sentences that contain high-frequency words according to the frequency of the nouns contained in each sentence. Counting the frequency of the nouns contained in the sentences shown in FIG. 7 gives the result shown in FIG. 8. Selecting three sentences that contain high-frequency words from FIG. 8 gives the result shown in FIG. 4. Summary sentences are generated in this way. FIG. 4 consists of search data and summary sentences, but it may contain only the summary sentences if required.
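One way to realize this frequency-based selection is sketched below. Several points are assumptions, not details from the description: nouns are recognized with a small fixed vocabulary (`TOY_NOUNS`) instead of a real morphological analyzer, a sentence is scored by summing the overall frequencies of the distinct nouns it contains, and the sample sentences are invented for the example.

```python
import re
from collections import Counter

# Hypothetical noun vocabulary standing in for a morphological analyzer.
TOY_NOUNS = {"travel", "plan", "hawaii", "sydney"}


def toy_nouns(sentence):
    """Return the vocabulary nouns that occur in a sentence (toy noun extractor)."""
    return [word for word in re.findall(r"[a-z]+", sentence.lower()) if word in TOY_NOUNS]


def select_by_noun_frequency(sentences, nouns_of, count=3):
    """Pick the `count` sentences whose nouns are most frequent across all sentences."""
    nouns_per_sentence = [nouns_of(sentence) for sentence in sentences]
    frequency = Counter(noun for nouns in nouns_per_sentence for noun in nouns)
    # Assumed scoring rule: sum of the overall frequencies of the distinct nouns present.
    scores = [sum(frequency[noun] for noun in set(nouns)) for nouns in nouns_per_sentence]
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    # Return the winners in their original document order.
    return [sentences[i] for i in sorted(ranked[:count])]


SAMPLE_SENTENCES = [
    "Welcome to our page",
    "We offer a travel plan for everyone",
    "Hawaii travel is cheap",
    "Call us today",
    "Sydney plan",
]

selected = select_by_noun_frequency(SAMPLE_SENTENCES, toy_nouns, count=3)
```

Returning the selected sentences in document order, rather than in score order, keeps the summary readable as a condensed version of the page.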

[0025] Other methods are also available. For example, Japanese Patent Application Laid-Open No. Hei 2-297157 discloses a text summarization method that selects the important sentences in an input text using, as selection criteria, lexical information, syntactic information, and information on the connective relations between sentences, specifically at least one of auxiliary verbs and conjunctions. Japanese Patent Application Laid-Open No. Hei 3-191475 discloses a method that summarizes a document by dividing it into paragraphs with headings, selecting for each paragraph a group of important-sentence extraction rules corresponding to the type of the paragraph, and extracting the important sentences of each paragraph by referring both to this rule group and to a group of important-sentence extraction rules that does not depend on the paragraph. Japanese Patent Application Laid-Open No. Hei 4-90055 discloses a method that analyzes a natural-language text, represents the connections between sentences as a tree structure in units of sentences based on the structure of the argument, and extracts the important sentences to be used as summary sentences by repeatedly rejecting one or both of the sentences joined in the tree structure according to selection rules specific to each type of connection between sentences.

[0026] (Embodiment 2) The schematic configuration of the search device of the second invention is the same as in FIG. 1, except that the summary sentence selection means 5 of FIG. 1 is configured as shown in FIG. 2. The summary sentence selection means 5a of FIG. 2 comprises: sentence input means 9 for inputting the sentences extracted by the sentence extraction means 4 of FIG. 1; a word dictionary 11 that holds words in association with word vectors representing the features of those words; word extraction means 10 that uses the word dictionary 11 to extract the words contained in the sentences input from the sentence input means 9; vector generation means 12 that uses the word extraction means 10 and the word dictionary 11 to generate data vectors representing the features of the sentences input from the sentence input means 9; distance calculation means 13 that calculates the distance between two vectors generated by the vector generation means 12; and sentence selection means 14 that, based on the distance calculation results of the distance calculation means 13, selects summary sentences from among the sentences input from the sentence input means 9 and stores them in the summary sentence storage means 6.

[0027] In the second invention, summary sentences are generated using feature vectors. The feature vector is the context vector proposed in "Associative Retrieval from a Large-Scale Database," IEICE Technical Report AI92-99 (1993), published by the Institute of Electronics, Information and Communication Engineers on January 22, 1993, and presented by inventors including the present inventors. That is, a "feature vector" in the present invention corresponds directly to a "context vector" in that report. Japanese Patent Application Laid-Open No. Hei 6-195388, filed by the present inventors and others, describes a document retrieval apparatus that uses this feature vector. The present invention applies this technique to the creation of summary sentences for WWW data.

【００２８】特徴ベクトルとは、文章中の単語が持つ概
念と文脈との関係の程度を示したものであり、多数の特
徴単語との意味的な結合関係の程度をベクトル表現した
ものである。Ｎ個の概念分類を特徴単語とすると、Ｎ次
元ベクトルの各要素の値を一つ一つの特徴単語に対応さ
せることになる。単語ｉの特徴ベクトルＸｉ＝（ｘｉ
１，ｘｉ２，…，ｘｉＮ）の各要素の値は、０≦ｘｉｊ
≦Ｅｍとなる。Ｅｍは、正の定数である。単語ｉと特徴
単語ｊとの間に関係がない場合には、ｘｉｊ＝０にな
り、関係がある場合にはその関係の程度に応じて大きい
値をとる。例えば、特徴ベクトルが５つの特徴単語（自然，都会，騒音，動物，緑）から成り立っているとし、それぞれの要素の値が０か１
の２値である場合には、単語「山」の特徴ベクトルを、（１，０，０，１，１）等と表すことができる。A feature vector indicates the degree to which the concept carried by a word in a text relates to context; it expresses, as a vector, the strength of the word's semantic association with each of a number of feature words. If N concept categories are used as feature words, each element of an N-dimensional vector corresponds to one feature word. Each element of the feature vector Xi = (xi1, xi2, ..., xiN) of word i satisfies 0 ≤ xij ≤ Em, where Em is a positive constant. When word i has no relation to feature word j, xij = 0; when a relation exists, xij takes a larger value the stronger the relation. For example, suppose a feature vector consists of five feature words (nature, city, noise, animal, green) and each element takes one of the two values 0 or 1. The feature vector of the word "mountain" can then be expressed as, for example, (1, 0, 0, 1, 1).
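The binary example above can be written out as a short Python sketch. The names `FEATURE_WORDS` and `feature_vector`, and the choice of which feature words count as related to "mountain", are illustrative assumptions taken only from the example in the text; a real dictionary would hold graded values between 0 and Em.

```python
# Feature words from the example in the text; order fixes the vector layout.
FEATURE_WORDS = ("nature", "city", "noise", "animal", "green")


def feature_vector(related, feature_words=FEATURE_WORDS):
    """Return a binary feature vector: 1 where the word is related to the
    feature word, 0 where it is not (hypothetical helper for illustration)."""
    return tuple(1 if f in related else 0 for f in feature_words)


# "mountain" is taken to relate to nature, animal and green, as in the text.
mountain = feature_vector({"nature", "animal", "green"})
print(mountain)
```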

【００２９】第２の発明の検索装置において、要約文選
択手段５の要約文選択動作について説明する。図１の文
章抽出手段４より文章入力手段９に入力された文章は、
単語抽出手段１０に送られる。単語抽出手段１０では、
単語辞書１１に含まれている単語を、入力された文章か
ら抽出を行い、ベクトル生成手段１２に送る。ベクトル
生成手段１２では、送られてきた単語に対応する単語の
文脈ベクトルである単語ベクトルを単語辞書１１から読
み出し、各文章に対する文章の特徴を表すデータベクト
ルを生成する。データベクトルは、文章に含まれている
単語の単語ベクトルの和を正規化することによって生成
される。また、同時に入力された文章全体のデータベク
トルも生成する。生成された各文章のデータベクトルと
文章全体のデータベクトルは、距離計算手段１３に送ら
れる。距離計算手段１３では、文章全体のデータベクト
ルと各文章のデータベクトルとのベクトル間の距離を計
算し、その距離計算結果を文章選択手段１４の送る。デ
ータベクトル間の距離の計算は、例えば、２つのベクト
ルの内積を取ることによって行われる。文章選択手段１
４では、距離計算手段１３から送られてきた文章全体と
の距離計算結果に基づいて、文章全体との距離が近かっ
た文章のいくつかを文章入力手段９から受け取り、それ
らを文章全体の要約文として要約文記憶手段６に格納す
る。The summary sentence selection operation of the summary sentence selection means 5 in the search device of the second invention will now be described. The sentences passed from the sentence extraction means 4 of FIG. 1 to the sentence input means 9 are sent to the word extraction means 10. The word extraction means 10 extracts from the input sentences the words contained in the word dictionary 11 and sends them to the vector generation means 12. The vector generation means 12 reads from the word dictionary 11 the word vector (the context vector of the word) for each word received and generates, for each sentence, a data vector representing the features of that sentence. A data vector is generated by normalizing the sum of the word vectors of the words contained in the sentence. A data vector for the entire input text is generated at the same time. The data vector of each sentence and the data vector of the entire text are sent to the distance calculation means 13, which calculates the distance between the data vector of the entire text and the data vector of each sentence and sends the results to the sentence selection means 14. The distance between data vectors is calculated, for example, by taking the inner product of the two vectors. Based on these results, the sentence selection means 14 receives from the sentence input means 9 several of the sentences closest to the entire text and stores them in the summary sentence storage means 6 as the summary of the entire text.
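The selection procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the regex tokenization, the toy dictionary, building the whole-text vector from all extracted words, and returning the chosen sentences in their original order are all hypothetical choices. The text itself specifies only dictionary lookup, normalized sums of word vectors, and the inner product as one example of the distance measure.

```python
import math
import re


def extract_words(text, dictionary):
    """Keep only tokens that appear in the word dictionary (assumed tokenizer)."""
    return [w for w in re.findall(r"\w+", text.lower()) if w in dictionary]


def data_vector(words, dictionary, dim, magnitude=1.0):
    """Sum the word vectors, then scale the sum to the given magnitude."""
    total = [0.0] * dim
    for w in words:
        for j, x in enumerate(dictionary[w]):
            total[j] += x
    norm = math.sqrt(sum(x * x for x in total))
    if norm == 0:
        return total  # no dictionary words found: leave the zero vector
    return [magnitude * x / norm for x in total]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_summary(sentences, dictionary, dim, k=3, magnitude=1.0):
    """Return the k sentences whose data vectors are closest to the whole text."""
    words_per_sentence = [extract_words(s, dictionary) for s in sentences]
    vectors = [data_vector(w, dictionary, dim, magnitude) for w in words_per_sentence]
    all_words = [w for words in words_per_sentence for w in words]
    whole = data_vector(all_words, dictionary, dim, magnitude)
    scores = [inner_product(v, whole) for v in vectors]
    # A larger inner product means a closer sentence.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


# Toy two-dimensional dictionary, invented for this example.
toy_dictionary = {"travel": [1, 0], "plan": [1, 0], "shop": [0, 1]}
toy_sentences = ["We offer travel plan options.", "Visit our shop.", "Hello there."]
print(select_summary(toy_sentences, toy_dictionary, dim=2, k=2))
```

Sentences that contain no dictionary words receive a zero vector and therefore the lowest score, which is one simple way to handle them; the text does not say how such sentences are treated.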

【００３０】図７に示された７つの文章が文章入力手段
９に入力された場合を例にして説明を行う。図９は、単
語辞書１１の一例を示したものである。単語とその単語
に対応する単語ベクトルとが組になって格納されてい
る。この例では、文脈ベクトルは５次元ベクトルであ
る。単語抽出手段１０では、この単語辞書１１に含まれ
ている単語を各文章から抽出する。単語抽出を行った結
果を図１０に示す。この単語抽出結果を元に、ベクトル
生成手段１２では、抽出された単語の単語ベクトルを単
語辞書１１から読み出し、各文章のデータベクトルと文
章全体のデータベクトルを生成する。データベクトル
は、文章に含まれている単語の単語ベクトルの和を正規
化することによって生成される。ここでは、ベクトルの
大きさ（ベクトルの各要素の平方和の平方根を取ったも
の）が１０になるように正規化を行っている。データベ
クトルの生成結果を図１１に示す。データベクトルの各
要素の値は、小数第２位を四捨五入してある。距離計算
手段１３では、データベクトルの生成結果に基づいて、
各文章のデータベクトルと文章全体のデータベクトルと
の距離計算を行う。距離計算は２つのベクトルの内積を
取ることによって行われ、データベクトルの大きさは１
０に正規化されているので、その内積値は０から１００
の値を取り、内積値が大きいほど２つのデータベクトル
間の距離が近いということになる。図１１に示した文章
のデータベクトルと文章全体のデータベクトルとのベク
トル間の距離計算結果を図１２に示す。この距離計算結
果は文章選択手段１４に送られる。文章選択手段では、
文章全体との距離の近かったいくつかの文章を要約文記
憶手段６に格納する。ここでは、距離の近かった３つの
文章を選択するとすると、図１２の距離計算結果から、さまざまな旅行プランを取り揃えております。以下のリンクを参照してください。取り揃え豊富な当店で！の３文が選択され要約文記憶手段６に格納されることに
なる。The case in which the seven sentences shown in FIG. 7 are input to the sentence input means 9 is described as an example. FIG. 9 shows an example of the word dictionary 11, in which each word is stored paired with its word vector. In this example, the context vectors are five-dimensional. The word extraction means 10 extracts from each sentence the words contained in the word dictionary 11; FIG. 10 shows the result. Based on this result, the vector generation means 12 reads the word vectors of the extracted words from the word dictionary 11 and generates the data vector of each sentence and the data vector of the entire text. A data vector is generated by normalizing the sum of the word vectors of the words contained in the sentence. Here, normalization is performed so that the magnitude of the vector (the square root of the sum of the squares of its elements) is 10. FIG. 11 shows the resulting data vectors, with each element rounded to one decimal place. Based on these data vectors, the distance calculation means 13 calculates the distance between the data vector of each sentence and the data vector of the entire text. The distance is calculated by taking the inner product of the two vectors; because each data vector is normalized to magnitude 10, the inner product takes a value from 0 to 100, and a larger inner product means the two data vectors are closer. FIG. 12 shows the distances calculated between the sentence data vectors of FIG. 11 and the data vector of the entire text. These results are sent to the sentence selection means 14, which stores several of the sentences closest to the entire text in the summary sentence storage means 6. If the three closest sentences are selected here, the results in FIG. 12 lead to the three sentences "We offer a wide variety of travel plans.", "Please refer to the links below.", and "At our well-stocked shop!" being selected and stored in the summary sentence storage means 6.
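The 0-to-100 range and the use of the inner product as a closeness measure can be checked with a short derivation. This is a sketch under two assumptions: that all vector elements are non-negative (as the bound 0 ≤ xij ≤ Em on word vector elements implies), and that closeness is compared against the ordinary Euclidean distance. The symbols a, b and θ (two data vectors and the angle between them) are introduced here for illustration only.

```latex
\[
\|\mathbf{a}\| = \|\mathbf{b}\| = 10
\quad\Longrightarrow\quad
\mathbf{a}\cdot\mathbf{b} = \|\mathbf{a}\|\,\|\mathbf{b}\|\cos\theta = 100\cos\theta ,
\]
\[
a_j \ge 0,\; b_j \ge 0
\;\;\Longrightarrow\;\; 0 \le \cos\theta \le 1
\;\;\Longrightarrow\;\; 0 \le \mathbf{a}\cdot\mathbf{b} \le 100 ,
\]
\[
\|\mathbf{a}-\mathbf{b}\|^{2}
= \|\mathbf{a}\|^{2} + \|\mathbf{b}\|^{2} - 2\,\mathbf{a}\cdot\mathbf{b}
= 200 - 2\,\mathbf{a}\cdot\mathbf{b} .
\]
```

Under these assumptions, a larger inner product corresponds to a smaller Euclidean distance, so ranking sentences by descending inner product is equivalent to ranking them by ascending distance.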

【００３１】このように、実施例１は意味に関係なく、
頻度の高いものを選択したが、実施例２では特徴ベクト
ルを用いて意味的に近い文章を選択しているので、単語
辞書が精度よく作成されていれば、精度よく要約文が選
択されることになる。As described above, Embodiment 1 selects sentences containing high-frequency words regardless of meaning, whereas Embodiment 2 uses feature vectors to select semantically close sentences. Therefore, provided the word dictionary is constructed accurately, summary sentences are selected accurately.

【００３２】[0032]

【発明の効果】第１の発明によれば、ＷＷＷデータを検
索したときに、検索されたＷＷＷデータのＵＲＬと共
に、そのＷＷＷデータの要約文が表示されるために、実
際にそのＷＷＷデータの中身を見なくてもデータの内容
を知る事ができ、目的のＷＷＷデータを探す時間を短く
することができる。また、要約文の生成方法も、従来に
あるような単に頻度の高い単語だけを抽出するのではな
く、ＷＷＷデータ内の句読点とＨＴＭＬタグを利用して
文章を抽出したものに対して、要約文として適当なもの
を自動的に抽出しているので、文章でＷＷＷデータの内
容を知る事ができ、単語だけよりも分かりやすくなる。According to the first invention, when WWW data is retrieved, a summary sentence of the WWW data is displayed together with the URL of the retrieved WWW data, so the content can be understood without actually opening the WWW data, and the time needed to find the target WWW data is shortened. Moreover, rather than merely extracting frequently occurring words as in conventional methods, the summary generation method extracts sentences using the punctuation marks and HTML tags in the WWW data and automatically selects those suitable as a summary. The content of the WWW data is therefore conveyed in sentences, which is easier to understand than words alone.
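Sentence extraction from punctuation marks and HTML tags, as referred to above, might look like the following Python sketch built on the standard library `html.parser` module. The set `BREAK_TAGS` and the punctuation characters are assumptions chosen for illustration; the tags actually treated as sentence delimiters are those listed in FIG. 6, which is not reproduced here, and the names `SentenceExtractor` and `extract_sentences` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed delimiter tags (FIG. 6 defines the real list).
BREAK_TAGS = {"p", "br", "li", "title", "h1", "h2", "h3", "h4", "h5", "h6", "td", "tr", "div"}
# A run of non-terminator characters followed by an optional terminator.
SENTENCE = re.compile(r"[^。！？.!?]+[。！？.!?]?")


class SentenceExtractor(HTMLParser):
    """Collect text chunks, starting a new chunk at every delimiter tag."""

    def __init__(self):
        super().__init__()
        self.chunks = [""]

    def handle_starttag(self, tag, attrs):
        if tag in BREAK_TAGS:
            self.chunks.append("")

    def handle_endtag(self, tag):
        if tag in BREAK_TAGS:
            self.chunks.append("")

    def handle_data(self, data):
        self.chunks[-1] += data


def extract_sentences(html):
    """Split HTML text into sentences at delimiter tags and punctuation."""
    parser = SentenceExtractor()
    parser.feed(html)
    parser.close()
    sentences = []
    for chunk in parser.chunks:
        for piece in SENTENCE.findall(chunk):
            piece = " ".join(piece.split())  # collapse whitespace
            if piece:
                sentences.append(piece)
    return sentences


sample = ("<html><title>Travel Shop</title><body>"
          "<p>We offer many plans. See the links below<br>Great stock!</p>"
          "</body></html>")
print(extract_sentences(sample))
```

The sketch treats every period as a sentence end, so decimals and abbreviations would be split incorrectly; a production extractor would need more careful rules.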

【００３３】第２の発明によれば、要約文の生成を意味
処理を行うことによって自動的に行うことができ、頻度
解析等に比べ精度の良い要約文を生成することができ
る。According to the second invention, summary sentences can be generated automatically by semantic processing, yielding more accurate summaries than frequency analysis or similar methods.

【図面の簡単な説明】[Brief description of the drawings]

【図１】本発明の実施例１の検索装置の構成図である。FIG. 1 is a configuration diagram of a search device according to a first embodiment of the present invention.

【図２】本発明の実施例２の検索装置の要約文選択手段
の構成図である。FIG. 2 is a configuration diagram of a summary sentence selection unit of a search device according to a second embodiment of the present invention.

【図３】本発明の実施例１の検索装置のＵＲＬ記憶手段
に記憶されているデータの一例である。FIG. 3 is an example of data stored in a URL storage unit of the search device according to the first embodiment of the present invention.

【図４】本発明の実施例１の検索装置の要約文記憶手段
に記憶されているデータの一例である。FIG. 4 is an example of data stored in a summary sentence storage unit of the search device according to the first embodiment of the present invention.

【図５】本発明の実施例１の検索装置の文章抽出手段に
よって取得されたＷＷＷデータの一例である。FIG. 5 is an example of WWW data obtained by a text extracting unit of the search device according to the first embodiment of the present invention.

【図６】本発明の実施例１の検索装置の文章抽出手段に
よって文章を抽出する際に、文章の区切りとして認識さ
れるＨＴＭＬタグの一例である。FIG. 6 is an example of the HTML tags recognized as sentence delimiters when sentences are extracted by the sentence extraction means of the search device according to the first embodiment of the present invention.

【図７】本発明の実施例１の検索装置の文章抽出手段に
よって文章を抽出した一例である。FIG. 7 is an example of sentences extracted by the sentence extraction means of the search device according to the first embodiment of the present invention.

【図８】本発明の実施例１の検索装置の要約文選択手段
によって文章に含まれる単語の頻度を計算した一例であ
る。FIG. 8 is an example of the word frequencies calculated for the words contained in the sentences by the summary sentence selection means of the search device according to the first embodiment of the present invention.

【図９】本発明の実施例２の検索装置の要約文選択手段
の単語辞書の一例である。FIG. 9 is an example of a word dictionary of a summary sentence selection unit of the search device according to the second embodiment of the present invention.

【図１０】本発明の実施例２の検索装置の要約文選択手
段の単語抽出手段の抽出単語の一例である。FIG. 10 is an example of an extracted word of a word extracting unit of a summary sentence selecting unit of the search device according to the second embodiment of the present invention.

【図１１】本発明の実施例２の検索装置の要約文選択手
段のベクトル生成手段のデータベクトル生成の一例であ
る。FIG. 11 is an example of data vector generation of a vector generation unit of a summary sentence selection unit of the search device according to the second embodiment of the present invention.

【図１２】本発明の実施例２の検索装置の要約文選択手
段の距離計算手段の距離計算結果の一例である。FIG. 12 is an example of a distance calculation result of the distance calculation unit of the summary sentence selection unit of the search device according to the second embodiment of the present invention.

【符号の説明】[Explanation of symbols]

１ 入力手段、２ 検索手段、３ ＵＲＬ記憶手段、４ 文章抽出手段、５ 要約文選択手段、６ 要約文記憶手段、７ 出力手段、８ インターネット、９ 文章入力手段、１０ 単語抽出手段、１１ 単語辞書、１２ ベクトル生成手段、１３ 距離計算手段、１４ 文章選択手段
1 Input means
2 Search means
3 URL storage means
4 Sentence extraction means
5 Summary sentence selection means
6 Summary sentence storage means
7 Output means
8 Internet
9 Sentence input means
10 Word extraction means
11 Word dictionary
12 Vector generation means
13 Distance calculation means
14 Sentence selection means

Continuation of front page. (51) Int.Cl.⁶, identification symbol, FI: G06F 15/403 380D; 15/419 320

Claims

【特許請求の範囲】[Claims]

【請求項１】インターネット上に存在するＷＷＷ（Ｗ
ｏｒｌｄＷｉｄｅＷｅｂ）データの多数のＵＲＬ（Ｕｎ
ｉｆｏｒｍＲｅｓｏｕｒｃｅＬｏｃａｔｏｒ）を保
持するＵＲＬ記憶手段と、検索要求文を入力するための
入力手段と、前記ＵＲＬ記憶手段内に保持されているＵ
ＲＬの検索を行う検索手段を持つ検索装置において、前記ＵＲＬ記憶手段に記憶されているＵＲＬによって指
定されるＨＴＭＬ（ＨｙｐｅｒＴｅｘｔＭａｒｋｕｐ
Ｌａｎｇｕａｇｅ）データに対して、該ＨＴＭＬデー
タをインターネット上から取得し、該ＨＴＭＬデータ内
の句読点とＨＴＭＬのタグの認識を行い、ＨＴＭＬデー
タ内に含まれている文章を抽出する文章抽出手段と、前記文章抽出手段によって抽出された文章の中から、該
ＨＴＭＬデータに対する要約になっている要約文を選択
する要約文選択手段と、前記要約文選択手段によって選択された要約文を該ＨＴ
ＭＬデータに対する要約文として保持する要約文記憶手
段と、前記検索手段によって検索された検索データに対応する
要約文を前記要約文記憶手段から読み出し、表示する出
力手段を持つことを特徴とする検索装置。1. A search device having URL storage means for holding a large number of URLs (Uniform Resource Locators) of WWW (World Wide Web) data existing on the Internet, input means for inputting a search request sentence, and search means for searching the URLs held in the URL storage means, the search device comprising: sentence extraction means for acquiring from the Internet the HTML (HyperText Markup Language) data specified by a URL stored in the URL storage means, recognizing punctuation marks and HTML tags in the HTML data, and extracting the sentences contained in the HTML data; summary sentence selection means for selecting, from the sentences extracted by the sentence extraction means, summary sentences that summarize the HTML data; summary sentence storage means for holding the summary sentences selected by the summary sentence selection means as the summary of the HTML data; and output means for reading from the summary sentence storage means, and displaying, the summary sentences corresponding to the search data retrieved by the search means.

【請求項２】請求項１に記載の検索装置において、前記要約文選択手段は、各文章に含まれている名詞の単
語の頻度によって、頻度の高い単語を含む文章を選択す
ることを特徴とする検索装置。2. The search device according to claim 1, wherein the summary sentence selection means selects sentences containing high-frequency words, based on the frequency of the nouns contained in each sentence.

【請求項３】請求項１に記載の検索装置において、前記要約文選択手段は、複数の文章を入力するための文章入力手段と、単語と該単語の特徴を表す単語ベクトルとを対応づけて
保持する単語辞書と、前記単語辞書を用いて、入力された文章中に含まれてい
る単語を抽出する単語抽出手段と、前記単語辞書と、前記単語抽出手段を使って、入力され
た文章に含まれている単語から該文章の特徴を表すデー
タベクトルを生成するベクトル生成手段と、２つのデータベクトルのベクトル間の距離を計算する距
離計算手段と、前記ベクトル生成手段を用いて、前記文章入力手段から
入力された文章全体のベクトルと、各文章のベクトルと
を生成し、前記距離計算手段を用いて、文章全体のベク
トルと各文章のベクトル間の距離を計算し、最も文章全
体との距離の近かった文章のいくつかを前記要約文記憶
手段に格納する文章選択手段を持つことを特徴とする検
索装置。3. The search device according to claim 1, wherein the summary sentence selection means comprises: sentence input means for inputting a plurality of sentences; a word dictionary that holds words in association with word vectors representing the features of the words; word extraction means for extracting, using the word dictionary, the words contained in the input sentences; vector generation means for generating, using the word dictionary and the word extraction means, a data vector representing the features of a sentence from the words contained in the input sentence; distance calculation means for calculating the distance between two data vectors; and sentence selection means for generating, using the vector generation means, a vector of the entire text input from the sentence input means and a vector of each sentence, calculating, using the distance calculation means, the distance between the vector of the entire text and the vector of each sentence, and storing in the summary sentence storage means several of the sentences closest to the entire text.

【請求項４】コンピュータを、インターネット上に存
在するＷＷＷデータの多数のＵＲＬを保持するＵＲＬ記
憶手段と、検索要求文を入力するための入力手段と、前
記ＵＲＬ記憶手段内に保持されているＵＲＬの検索を行
う検索手段として機能させるための検索プログラムを記
録した媒体において、コンピュータを、前記ＵＲＬ記憶手段に記憶されている
ＵＲＬによって指定されるＨＴＭＬデータに対して、該
ＨＴＭＬデータをインターネット上から取得し、該ＨＴ
ＭＬデータ内の句読点とＨＴＭＬのタグの認識を行い、
ＨＴＭＬデータ内に含まれている文章を抽出する文章抽
出手段と、前記文章抽出手段によって抽出された文章の中から、該
ＨＴＭＬデータに対する要約になっている要約文を選択
する要約文選択手段と、前記要約文選択手段によって選択された要約文を該ＨＴ
ＭＬデータに対する要約文として保持する要約文記憶手
段と、前記検索手段によって検索された検索データに対応する
要約文を前記要約文記憶手段から読み出し、表示する出
力手段として機能させることを特徴とする検索プログラ
ムを記録した記録媒体。4. A recording medium recording a search program for causing a computer to function as URL storage means for holding a large number of URLs of WWW data existing on the Internet, input means for inputting a search request sentence, and search means for searching the URLs held in the URL storage means, the search program further causing the computer to function as: sentence extraction means for acquiring from the Internet the HTML data specified by a URL stored in the URL storage means, recognizing punctuation marks and HTML tags in the HTML data, and extracting the sentences contained in the HTML data; summary sentence selection means for selecting, from the sentences extracted by the sentence extraction means, summary sentences that summarize the HTML data; summary sentence storage means for holding the summary sentences selected by the summary sentence selection means as the summary of the HTML data; and output means for reading from the summary sentence storage means, and displaying, the summary sentences corresponding to the search data retrieved by the search means.

【請求項５】請求項４に記載の検索プログラムを記録
した媒体において、前記要約文選択手段は、複数の文章を入力するための文章入力手段と、単語と該単語の特徴を表す単語ベクトルとを対応づけて
保持する単語辞書と、前記単語辞書を用いて、入力された文章中に含まれてい
る単語を抽出する単語抽出手段と、前記単語辞書と、前記単語抽出手段を使って、入力され
た文章に含まれている単語から該文章の特徴を表すデー
タベクトルを生成するベクトル生成手段と、２つのデータベクトルのベクトル間の距離を計算する距
離計算手段と、前記ベクトル生成手段を用いて、前記文章入力手段から
入力された文章全体のベクトルと、各文章のベクトルと
を生成し、前記距離計算手段を用いて、文章全体のベク
トルと各文章のベクトル間の距離を計算し、最も文章全
体との距離の近かった文章のいくつかを前記要約文記憶
手段に格納する文章選択手段を持つことを特徴とする検
索プログラムを記録した記録媒体。5. The recording medium recording the search program according to claim 4, wherein the summary sentence selection means comprises: sentence input means for inputting a plurality of sentences; a word dictionary that holds words in association with word vectors representing the features of the words; word extraction means for extracting, using the word dictionary, the words contained in the input sentences; vector generation means for generating, using the word dictionary and the word extraction means, a data vector representing the features of a sentence from the words contained in the input sentence; distance calculation means for calculating the distance between two data vectors; and sentence selection means for generating, using the vector generation means, a vector of the entire text input from the sentence input means and a vector of each sentence, calculating, using the distance calculation means, the distance between the vector of the entire text and the vector of each sentence, and storing in the summary sentence storage means several of the sentences closest to the entire text.