JPH03125265A - Key word extracting device

Info

Publication number: JPH03125265A
Application number: JP1263617A
Authority: JP
Inventors: Shiyou Imasato; 詔今郷
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1989-10-09
Filing date: 1989-10-09
Publication date: 1991-05-28

Abstract

PURPOSE: To achieve appropriate keyword extraction and rejection by treating a compound-word keyword candidate as an unnecessary word, and not adopting it as a keyword, when all of the constituent words obtained by dividing the compound word are registered in an unnecessary word dictionary. CONSTITUTION: An unnecessary word rejection means 5 rejects keyword candidates that are registered in an unnecessary word dictionary 2, as well as keyword candidates whose constituent words are all registered in the dictionary 2. A keyword candidate is therefore judged to be an unnecessary word, and is not adopted as a keyword, not only when the candidate itself is registered in the dictionary 2 but also when it is a compound word and all of the constituent words obtained by dividing it are registered there. A compound word that contains a word not registered in the dictionary 2, on the other hand, is extracted as a keyword. Appropriate keyword extraction and rejection can thus be performed for compound words as well, using a dictionary 2 of practical size.

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of Industrial Application

The present invention relates to a keyword extraction device that automatically assigns keywords to a document at the document registration stage of an information retrieval system, that is, a system which stores document information in association with keywords and outputs the corresponding document information in response to a search condition containing keywords.

Prior Art

When desired information is to be retrieved from a large body of information, the usual practice is to store each item with keywords attached in advance, to enter a conditional expression containing keywords at search time, and to output the items whose keywords match it. In such systems a person normally assigns the keywords when a document is registered, which has the following drawbacks. First, because assigning keywords is tedious, a sufficient number of keywords is rarely assigned. Second, in most cases keywords are assigned from only one point of view, and keywords reflecting other points of view are almost never assigned. Third, large numbers of documents, or documents of many kinds, cannot be registered in a short time.

To solve these problems, automatic keyword assignment has long been studied.

A typical method for automatic keyword extraction is described in "Automatic Keyword Extraction in a Newspaper Article Database" (情報管理 [Information Management], Vol. 32, No. 4, July 1989). It is outlined below.

(1) Character strings that are likely to become keywords (keyword candidates) are extracted from the text by, for example, looking at differences in character type or referring to a word table.

(2) Each keyword candidate extracted in step (1) is checked against an unnecessary word dictionary, which collects words that are not to be used as keywords.

(3) A candidate found in the unnecessary word dictionary is not adopted as a keyword; a candidate not found there is assigned to the target document as a keyword.

Problems to Be Solved by the Invention

The unnecessary word dictionary is prepared in advance, but it is difficult to register every word that should not become a keyword, because the number (variety) of compound words is extremely large. Compound words are effectively unlimited in number and new ones are constantly being coined, so registering every compound word that should not become a keyword is impossible in principle.

Conventionally, however, every word not contained in the unnecessary word dictionary is treated as a keyword. Consequently, even if the words 「対応」 ("correspondence") and 「関係」 ("relation") are each registered in the unnecessary word dictionary, the compound word 「対応関係」 ("correspondence relation") is inappropriately adopted as a keyword unless it is also registered in its compound form.
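The shortcoming of the conventional method can be illustrated with a minimal Python sketch. The function name `conventional_filter` and the sample candidate 「文書」 are assumptions made for illustration only; the two dictionary entries are the ones named in the text.

```python
def conventional_filter(candidates, unnecessary_words):
    """Conventional rule: keep every candidate that is not itself
    listed in the unnecessary word dictionary."""
    return [c for c in candidates if c not in unnecessary_words]


# Dictionary entries taken from the example in the text.
unnecessary_words = {"対応", "関係"}

# Illustrative candidates (the third one is an assumed sample).
candidates = ["対応", "対応関係", "文書"]

# The compound 「対応関係」 survives because only the whole string
# is looked up, never its constituent words.
print(conventional_filter(candidates, unnecessary_words))
```

Only 「対応」 is dropped here. The compound 「対応関係」 passes the filter even though both of its parts are in the dictionary, which is the inappropriate result described above.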

Means for Solving the Problems

The device comprises keyword candidate extraction means that extracts keyword candidate character strings from text; an unnecessary word dictionary in which words that are not to become keywords are registered; compound word division means that divides a compound-word keyword candidate into its constituent words; and unnecessary word rejection means that rejects keyword candidates registered in the unnecessary word dictionary and keyword candidates whose divided constituent words are all registered in the unnecessary word dictionary.

Operation

A keyword candidate is judged as a whole to be an unnecessary word, and is not adopted as a keyword, not only when the candidate itself is registered in the unnecessary word dictionary but also when it is a compound word and all of the constituent words obtained by dividing it are registered there. A compound word that includes a word not registered in the unnecessary word dictionary, on the other hand, can be extracted as a keyword. Appropriate keyword extraction and rejection can therefore be performed for compound words as well, using an unnecessary word dictionary of realistic size.

Embodiment

An embodiment of the present invention will be described with reference to the drawings.

First, keyword candidate extraction means 1 is provided, which extracts from the target text the character strings that may become keywords (keyword candidates).

Any well-known technique may be used for this. In this embodiment, every continuous run of characters other than hiragana is extracted from the document as a keyword candidate; that is, candidates are delimited at the points where the character type changes. The extraction is not limited to this method, however.
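The candidate extraction used in this embodiment can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: the function name `extract_candidates` is hypothetical, and treating whitespace and a few punctuation marks as delimiters alongside hiragana is an added assumption that the text does not state.

```python
import re

# Characters that end a candidate: the hiragana range U+3041-U+309F,
# plus whitespace and some punctuation (the latter two are assumptions).
_DELIMITERS = r"\u3041-\u309f\s、。，．・「」（）()"

# A candidate is a maximal run of characters outside the delimiter set.
_RUN = re.compile(f"[^{_DELIMITERS}]+")


def extract_candidates(text):
    """Return every continuous run of non-hiragana characters,
    in order of appearance (duplicates are not removed here)."""
    return _RUN.findall(text)


print(extract_candidates("文書の対応関係を調べる"))
print(extract_candidates("キーワードを自動的に付与する"))
```

Because the split happens purely at character-type boundaries, a verb stem such as 「調」 in 「調べる」 also comes out as a candidate; filtering such strings is left to the later dictionary stages.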

An unnecessary word dictionary 2, in which words that are not to become keywords are registered, is also provided. FIG. 2 shows an example of the unnecessary word dictionary 2, which is updated by the unnecessary word registration means described later.

Compound word division means 4 is further provided, which divides a compound word among the keyword candidates into its individual constituent words using a word table 3 in which, as shown in FIG. 3, pairs of surface form and part of speech are registered for each constituent word. The division is subject to the following rules:

- The first element of a compound word is a noun or a prefix.

- The last element of a compound word is a noun or a suffix.

- A suffix does not immediately follow a prefix.

When more than one division pattern is possible, the one with the fewest constituent words is adopted.
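The division rules above can be sketched as a small search over the word table. This is one possible reading, written in Python under stated assumptions: the function name `split_compound`, the part-of-speech labels `"noun"`, `"prefix"` and `"suffix"`, and the sample table entries are all hypothetical (the actual contents of FIG. 3 are not reproduced in the text), and a string is treated as a compound only if it divides into two or more table entries.

```python
def split_compound(candidate, table):
    """Divide `candidate` into words from `table` (surface -> part of speech).

    Returns the division with the fewest constituents that satisfies the
    three rules, or None if no division into two or more words exists.
    """
    best = None

    def search(start, parts):
        nonlocal best
        if best is not None and len(parts) >= len(best):
            return  # this branch cannot beat the best division found so far
        if start == len(candidate):
            # Rule: the last element must be a noun or a suffix.
            if len(parts) >= 2 and table[parts[-1]] in ("noun", "suffix"):
                best = list(parts)
            return
        for end in range(start + 1, len(candidate) + 1):
            word = candidate[start:end]
            pos = table.get(word)
            if pos is None:
                continue  # not a registered word
            # Rule: the first element must be a noun or a prefix.
            if not parts and pos not in ("noun", "prefix"):
                continue
            # Rule: a suffix may not immediately follow a prefix.
            if parts and table[parts[-1]] == "prefix" and pos == "suffix":
                continue
            parts.append(word)
            search(end, parts)
            parts.pop()

    search(0, [])
    return best


# Hypothetical word table entries for illustration.
word_table = {
    "対応": "noun", "関係": "noun",
    "情報": "noun", "検索": "noun", "情報検索": "noun", "システム": "noun",
    "非": "prefix", "的": "suffix",
}

print(split_compound("対応関係", word_table))          # two nouns
print(split_compound("情報検索システム", word_table))  # fewest-words division
print(split_compound("非的", word_table))              # prefix + suffix is refused
```

An exhaustive search with pruning is used here only because it keeps the three rules visible; for long candidates a dynamic-programming formulation would be the more usual choice.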

Unnecessary word rejection means 5 is connected to the keyword candidate extraction means 1, the compound word division means 4, and the unnecessary word dictionary 2. Basically, the unnecessary word rejection means 5 checks whether an extracted keyword candidate is registered in the unnecessary word dictionary 2 and, if it is, rejects that candidate.

If the candidate is not registered, it would conventionally have been adopted as a keyword as it is. In this embodiment, however, the candidate is divided into its constituent words by the compound word division means 4, and each constituent word is checked against the unnecessary word dictionary 2. If all of the constituent words are registered in the unnecessary word dictionary 2, the compound word itself is judged to be an unnecessary word and is rejected as a keyword candidate. If even one constituent word is not contained in the unnecessary word dictionary 2, the compound word is not rejected.
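The rejection rule just described reduces to a short predicate. The Python sketch below is illustrative only: the name `is_unnecessary` is hypothetical, and the constituents are assumed to be supplied by some division step as a list, or as `None` when the candidate cannot be divided.

```python
def is_unnecessary(candidate, constituents, unnecessary_words):
    """Return True if the candidate should be rejected.

    constituents: list of constituent words, or None if the candidate
    could not be divided (i.e. it is not a compound word).
    """
    # Case 1: the candidate itself is registered.
    if candidate in unnecessary_words:
        return True
    # A candidate that cannot be divided is kept.
    if not constituents:
        return False
    # Case 2: every constituent word is registered.
    return all(word in unnecessary_words for word in constituents)


unnecessary_words = {"対応", "関係"}

print(is_unnecessary("対応関係", ["対応", "関係"], unnecessary_words))  # all parts registered
print(is_unnecessary("対応装置", ["対応", "装置"], unnecessary_words))  # 「装置」 is not registered
```

The second call uses an assumed compound, 「対応装置」, to show that a single unregistered constituent is enough to keep the candidate.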

Unnecessary word registration means 6 is also provided. When a word rejected by the unnecessary word rejection means 5, in particular a compound word, is not yet registered in the unnecessary word dictionary 2, this means registers it there, thereby updating the dictionary.

With this configuration, the overall processing is outlined with reference to FIG. 4. First, every character string in the given text that may become a keyword is extracted as a keyword candidate. If the extracted strings contain duplicates (the same string appearing in more than one place), one is kept and the others are deleted. Next, each keyword candidate is checked against the unnecessary word dictionary 2. A candidate that is registered there is judged not to be a keyword. For a candidate that is not registered, it is determined whether the candidate can be divided into individual constituent words (that is, whether it is a compound word). If it cannot be divided, the candidate is judged to be a keyword as it is. If it can be divided, it is checked whether all of its constituent words are registered in the unnecessary word dictionary 2. If they all are, the candidate as a whole is judged not to be a keyword and is registered in the unnecessary word dictionary 2. Because compound-word unnecessary words that newly appear in this way are registered in the unnecessary word dictionary 2, when the same compound word appears later in the processing of another document it can be judged an unnecessary word merely by consulting the dictionary 2, without the series of steps of dividing it into constituent words, which shortens the processing time. If, on the other hand, even one constituent word is not registered, the candidate as a whole is judged to be a keyword.
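The flow described above can be sketched end to end in Python. This is a sketch under assumptions, not a transcription of FIG. 4: `extract_keywords` is a hypothetical name, the division step is passed in as a callable that returns a list of constituents or `None`, and the lookup table `SPLITS` stands in for a real compound word division means.

```python
def extract_keywords(candidates, unnecessary_words, split):
    """Return the keywords among `candidates`.

    unnecessary_words: a mutable set; rejected compounds are added to it.
    split: callable returning the constituent words of a candidate,
           or None if the candidate cannot be divided.
    """
    keywords = []
    # Remove duplicates while keeping first-seen order.
    for cand in dict.fromkeys(candidates):
        if cand in unnecessary_words:
            continue  # registered as is: not a keyword
        parts = split(cand)
        if parts is None:
            keywords.append(cand)  # not a compound: keyword as is
        elif all(p in unnecessary_words for p in parts):
            unnecessary_words.add(cand)  # register the rejected compound
        else:
            keywords.append(cand)  # at least one part is unregistered
    return keywords


# Stand-in for the division step (assumed sample data).
SPLITS = {"対応関係": ["対応", "関係"], "文書検索": ["文書", "検索"]}

unnecessary_words = {"対応", "関係", "検索"}
candidates = ["対応関係", "文書検索", "対応関係", "装置"]

print(extract_keywords(candidates, unnecessary_words, SPLITS.get))
print("対応関係" in unnecessary_words)  # the compound has been registered
```

After the first run, 「対応関係」 is in the set, so a later document containing it is handled by the first dictionary lookup alone and `split` is never called for it. That is the time saving the text attributes to registration.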

As an example, assume that the unnecessary word dictionary 2 and the word table 3 have the contents shown in FIG. 2 and FIG. 3, respectively, and consider the processing when the keyword candidate 「対応関係」 ("correspondence relation") is obtained. The character string 「対応関係」 is not registered in the unnecessary word dictionary 2. Referring to the word table 3, however, it can be divided into the constituent words 「対応」 ("correspondence") and 「関係」 ("relation").

The constituent words 「対応」 and 「関係」 are both registered in the unnecessary word dictionary 2. The candidate 「対応関係」 itself is therefore judged not to be a keyword, and the word 「対応関係」 is registered in the unnecessary word dictionary 2.

Effects of the Invention

Because the present invention is configured as described above, a keyword candidate is judged as a whole to be an unnecessary word, and is not adopted as a keyword, not only when it is registered as it is in the unnecessary word dictionary but also when it is a compound word that is not registered there while all of the constituent words obtained by dividing it are. Appropriate keyword extraction and rejection can therefore be performed for compound words as well, while continuing to use an unnecessary word dictionary of realistic size.

Brief Description of the Drawings

The drawings show an embodiment of the present invention. FIG. 1 is a block diagram, FIG. 2 is an explanatory diagram showing an example of the contents registered in the unnecessary word dictionary, FIG. 3 is an explanatory diagram showing an example of the contents registered in the word table, and FIG. 4 is a flowchart showing the processing.

1: keyword candidate extraction means; 2: unnecessary word dictionary; 4: compound word division means; 5: unnecessary word rejection means

Claims

A keyword extraction device comprising: keyword candidate extraction means for extracting keyword candidate character strings from text; an unnecessary word dictionary in which words that are not to become keywords are registered; compound word division means for dividing a compound-word keyword candidate into constituent words; and unnecessary word rejection means for rejecting keyword candidates registered in the unnecessary word dictionary and keyword candidates whose divided constituent words are all registered in the unnecessary word dictionary.