JPH06162092A

JPH06162092A - Information retrieval device

Info

Publication number: JPH06162092A
Application number: JP4308355A
Authority: JP
Inventors: Hide Fuji; 秀富士; Toshihiro Kakimoto; 俊博柿元; Makoto Yoshioka; 誠吉岡
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1992-11-18
Filing date: 1992-11-18
Publication date: 1994-06-10

Abstract

PURPOSE: To reduce dictionary maintenance labor and index capacity while still retrieving text data, by retrieving the matching character string from a text by means of an n-character index and a word index. CONSTITUTION: A word division processing unit 6 divides the text into words, and a word index creation unit 8 creates a word index 3 that uses each word as a word heading linked to the text. An n-character index creation unit 9 creates an n-character index 4 whose n-character headings, taken from the beginning (or elsewhere) of each word heading, link to those word headings. An unregistered word processing unit 7 groups character strings not registered in a word dictionary 12 on the basis of character type information and makes them word headings. In response to a search instruction specifying a keyword, the matching n-character heading in the n-character index 4 is used to find the linked word heading in the word index 3, and the matching character string in the text 2 is then retrieved through that heading's link and output.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an information retrieval device for searching documents. In recent years, as large volumes of electronic documents have come into circulation, retrieval techniques for extracting needed information from these documents have become necessary.

[0002] Searches can be sped up by maintaining an index file, and building a retrieval system requires keeping this index file as small as possible. At the same time, users want to be able to search for any character string, whether or not it appears as an entry in the index file. In both cases, the maintenance work on word dictionaries and similar resources should also be kept small.

[0003]

[Prior Art] Conventionally, the following methods have been used to search for words and other strings in documents. (1) Keyword index method using morphological analysis: the document is morphologically analyzed and only the extracted keywords are indexed, so the index file stays small. Because each keyword carries dictionary attributes and the like, processing that exploits knowledge about words is also possible. However, a word dictionary must be maintained, and if extraction of an unregistered word fails, the character string of that unregistered word can no longer be searched.

[0004] (2) n-character index method: every character in the document is indexed, so no word knowledge such as a dictionary is needed, and dictionary maintenance costs are reduced as a result. Because every character string that appears is a search target, nothing is missed in a search. However, the index becomes enormous.

[0005]

[Problems to Be Solved by the Invention] The keyword index method of (1), which is based on morphological analysis using a word dictionary, requires the word dictionary to be maintained in order to preserve accuracy, and that maintenance takes considerable labor. With the n-character index method of (2), creating an inverted file in units of n characters (for example, one character) increases the amount of index data until it exceeds the original text, so the index capacity becomes enormous.

[0006] To solve these problems, the present invention aims to eliminate the labor of dictionary maintenance and to reduce index capacity while still enabling retrieval of text data.

[0007]

[Means for Solving the Problems] The means for solving the problems are described with reference to FIGS. 1 and 2. In FIGS. 1 and 2, the word division processing unit 6 divides text into words.

[0008] The unregistered word processing unit 7 groups character strings that are not registered in the word dictionary 12 on the basis of character type information and the like. The word index creation unit 8 creates the word index 3, which uses each word obtained by dividing the text as a word heading and links it to the corresponding position in the text.

[0009] The n-character index creation unit 9 creates the n-character index 4, in which n-character headings taken from the beginning (or elsewhere) of the word headings of the word index 3 link to those word headings.

[0010]

[Operation] As shown in FIGS. 1 and 2, in the present invention the word division processing unit 6 divides the text into words, the word index creation unit 8 creates the word index 3 that links each word, as a word heading, to the text, and the n-character index creation unit 9 creates the n-character index 4 that links n-character headings, taken from the beginning (or elsewhere) of these word headings, to the word headings. In doing so, the unregistered word processing unit 7 groups character strings not registered in the word dictionary 12 on the basis of character type information and makes them word headings.

[0011] In response to a search instruction specifying a keyword, the matching n-character heading in the n-character index 4 is used to find the matching word heading in the linked word index 3, and the corresponding character string in the text 2 is then retrieved through that word heading's link and output.

[0012] In these operations, a one-character index is used as the n-character index 4. Retrieving the matching character string from the text 2 by means of the n-character index 4 and the word index 3 therefore eliminates the labor of maintaining the word dictionary 12 and reduces index capacity while still allowing text data to be searched. In particular, creating the word index 3 means that index entries are assigned to words registered in the word dictionary 12 (two, three, four characters long, and so on) and to unregistered words grouped by character type information, which reduces the amount of index data. Moreover, linking to the word headings of the word index 3 from the n-character index 4, in particular a one-character index, makes it possible to avoid missing character strings in the text.

[0013]

[Embodiment] Next, the configuration and operation of an embodiment of the present invention are described in detail, in order, with reference to FIGS. 1 to 3.

[0014] FIG. 1 shows the configuration of one embodiment of the present invention. In FIG. 1, text data 1 is the text data to be searched and stores a plurality of texts 2.

[0015] The word index 3 consists of multiple pairs of a word heading and a link. Its word headings are the words obtained by dividing the text 2 and the unregistered words grouped by character type information, and each word heading is linked to the corresponding character string in the text 2. For example, as illustrated:

    Word heading    Link
    "情報"          link to the corresponding character string "情報" in text 2

That is, for the word heading "情報" ("information"), a link (pointer) is set to the corresponding character string "情報" in the text 2.

[0016] The n-character index 4 takes n characters from the word headings of the word index 3 (for example, from the beginning, or from the whole heading) as n-character headings, and links each n-character heading to the corresponding words. For example, as illustrated:

    n-character heading    Link
    "情"                   link to "情報", link to "情勢", link to "情報検索"

That is, for the n-character heading (here a one-character heading) "情", links are set to the corresponding character strings "情報", "情勢", "情報検索", and so on in the word index 3. This allows the first n characters of a keyword, in particular the first character, to be looked up quickly. Once the one-character index 4 shows that the first character of the keyword exists, the character strings containing that first character are taken from the linked headings of the word index 3 and can be checked against the keyword quickly. When one matches, the link of that word heading in the word index 3 is followed to retrieve the corresponding character string in the text 2.
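
The two-level structure described here can be pictured with plain Python dictionaries. This is only an illustrative sketch: the variable names are hypothetical, the location lists for "情報" and "情勢" follow the truncated examples given for FIG. 3, and the location "T5" for "情報検索" is invented for the example.

```python
# Word index 3: word heading -> locations in the text data (lists truncated as in the source).
word_index = {
    "情報": ["T2", "T3", "T10"],
    "情勢": ["T8", "T9", "T23"],
    "情報検索": ["T5"],  # hypothetical location, not given in the source
}

# One-character index 4: one-character heading -> word headings of word index 3.
char_index = {
    "情": ["情報", "情勢", "情報検索"],
}

# Following the links: character heading -> word headings -> text locations.
linked = {heading: word_index[heading] for heading in char_index["情"]}
```

Because the one-character index points only at word headings and not at text positions, it stays small while the text locations are stored once, in the word index.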

[0017] The keyword is the character string to be searched for. In a search, the first character of the keyword, "情", is looked up in the one-character index 4 (the n-character index 4), and the links of the one-character heading found are followed to find words in the word index 3 that contain the keyword. When such a word is found, the corresponding character string is taken from the text through that word's link and displayed or otherwise output.

[0018] Next, with reference to FIG. 2, the operation of the index creation system 11, which creates the word index 3 and the n-character index 4 of FIG. 1, and of the search system 21, which searches for character strings using the created n-character index 4 and word index 3, is described in detail in order.

[0019] (1) The index creation system 11 is described first. In FIG. 2, in S1 the text data 1 is read in. In S2 the preprocessing unit 5 performs preprocessing: line breaks and the like are removed from the text data 1 so that each line holds one sentence.
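
Step S2 can be sketched in Python as follows. The sketch assumes that a sentence ends at the full stop "。", which the source does not state, and the function name `preprocess` is hypothetical.

```python
def preprocess(raw):
    """Remove line breaks, then split so that each item holds one sentence."""
    joined = "".join(raw.splitlines())  # drop the line breaks of the original layout
    sentences = []
    current = ""
    for ch in joined:
        current += ch
        if ch == "。":  # assumed sentence terminator
            sentences.append(current)
            current = ""
    if current:
        sentences.append(current)  # keep a trailing fragment without a terminator
    return sentences

lines = preprocess("ある事柄についての情報を集め、こ\nれをファイルに蓄える。そして必要に応じ")
```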

[0020] In S3 the word division processing unit 6 divides the text data 1 into words. Referring to the word dictionary 12, it divides the text with the delimiter "/", for example "/に/ついて/の/情報/を/集め/...", as shown in the word division result of FIG. 3 described later.
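
The source does not say how the word division processing unit 6 consults the word dictionary 12. As one possible illustration, the Python sketch below uses greedy longest-match segmentation, a common dictionary-based technique that is assumed here rather than taken from the source. The function name `divide_words` and the small dictionary are hypothetical.

```python
def divide_words(text, dictionary):
    """Greedy longest-match segmentation (an assumed technique, not from the source)."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                tokens.append(text[i:i + length])
                i += length
                break
        else:
            # No dictionary word starts here: emit one character and leave
            # it for the unregistered word processing of step S4.
            tokens.append(text[i])
            i += 1
    return tokens

dictionary = {"ある", "事柄", "に", "ついて", "の", "情報", "を", "集め"}  # hypothetical entries
tokens = divide_words("ある事柄についての情報を集め", dictionary)
segmented = "/" + "/".join(tokens) + "/"
```

With this dictionary the result ends in "/に/ついて/の/情報/を/集め/", matching the division shown for FIG. 3.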

[0021] In S4 the unregistered word processing unit 7 processes unregistered words. For portions of the text data 1 that are not in the word dictionary 12, it uses character type information: for example, a run of consecutive katakana characters is treated as one unregistered word, and a run of consecutive kanji is treated as one unregistered word.
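
Grouping by character type can be sketched in Python as below. Classifying characters by Unicode block is an assumption made for this example (the source only speaks of character type information), and the names `char_type` and `group_by_char_type` are hypothetical.

```python
from itertools import groupby

def char_type(ch):
    """Classify a character by Unicode block (assumed classification scheme)."""
    code = ord(ch)
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    return "other"

def group_by_char_type(unregistered):
    """Merge consecutive characters of the same type into one unregistered word."""
    return ["".join(run) for _, run in groupby(unregistered, key=char_type)]

words = group_by_char_type("データベース検索")
```

Here the katakana run and the kanji run each become a single word heading, so an unregistered string contributes a few index entries instead of one per character.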

[0022] In S5, for the words divided by the word division processing unit 6 and the unregistered words divided by the unregistered word processing unit 7, the word index creation unit 8 makes these words and unregistered words headings and links each to the corresponding positions in the text data 1. The word index 3 of S6 is thereby created.
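
A minimal Python sketch of step S5 follows. It assumes that each text unit is identified by a label such as "T2" and has already been divided into words. The function name `build_word_index` and the second sample unit "T3" are hypothetical.

```python
def build_word_index(segmented_units):
    """Map each word heading to the text units in which it occurs.

    segmented_units: iterable of (location_id, list_of_words) pairs.
    """
    index = {}
    for location, words in segmented_units:
        for word in words:
            locations = index.setdefault(word, [])  # one heading per distinct word
            if location not in locations:
                locations.append(location)
    return index

segmented_units = [
    ("T2", ["に", "ついて", "の", "情報", "を", "集め"]),
    ("T3", ["情報", "を"]),  # hypothetical second unit
]
word_index = build_word_index(segmented_units)
```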

[0023] In S7 the n-character index creation unit 9 extracts the first n characters (for example, the first character) of each word heading of the word index 3, makes the extracted n characters an n-character heading, and creates the n-character index 4 that links it to the word heading. The word index 3 and the n-character index 4 are thus both created, and the text data 1 is ready to be searched.
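
Step S7 can be sketched in Python as follows, taking only the leading n characters of each heading. The function name `build_n_char_index` and the sample locations are hypothetical.

```python
def build_n_char_index(word_index, n=1):
    """Map the first n characters of each word heading to the headings sharing them."""
    index = {}
    for heading in word_index:  # headings are unique dictionary keys
        index.setdefault(heading[:n], []).append(heading)
    return index

# Sample word index; the location lists are placeholders for the example.
sample_word_index = {"情報": ["T2"], "情勢": ["T8"], "情報検索": ["T5"], "検索": ["T5"]}
one_char_index = build_n_char_index(sample_word_index, n=1)
two_char_index = build_n_char_index(sample_word_index, n=2)
```

A larger n gives fewer candidate headings per lookup at the cost of more n-character headings, which is the trade-off behind choosing a one-character index in this embodiment.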

[0024] The dictionary editor 13 tunes the word dictionary 12 to improve the efficiency of the word index 3, for example by registering new words, correcting entries, or registering unregistered words as new words.

[0025] The n-character index 4 and the word index 3 for retrieving arbitrary words and unregistered words from the text data 1 are thus created.

[0026] (2) The search system 21 is described next. Here the n-character index 4 is a one-character index. In FIG. 2, in S11 a keyword is input: the operator enters on the screen the keyword to be searched for.

[0027] In S12 the first character of the keyword is compared with the one-character headings. For the keyword "情報", for example, its first character "情" is compared with the one-character headings of the one-character index 4 to find a match. If there is none, a message to that effect is displayed on the screen. If there is one, the process proceeds to S13.

[0028] In S13 the word headings of the word index 3 are compared with the keyword. Since S12 found that the first character of the keyword, for example "情", exists among the one-character headings of the one-character index 4, the words of the word index 3 linked from that one-character heading are compared with the keyword. If none is found, that is, if no word matches the keyword or contains a matching part, a message to that effect is displayed on the screen. If one is found, the process proceeds to S14.

[0029] In S14 the text is compared with the keyword. Since S13 found that the keyword matches a word heading, or that part of the keyword is contained in a word heading, the linked text is compared with the keyword. If there is no match, that is, if the keyword does not match the text, a message to that effect is displayed on the screen. If there is a match, the process proceeds to S15.

[0030] In S15, since S14 found that the keyword matches a character string in the text, the content at that text position (the position linked from the word heading) is displayed on the screen. The text containing the specified keyword (for example, a sentence, paragraph, or page of text) is thereby displayed on the screen.

[0031] As described above, when a keyword is entered on the screen, the one-character heading of the one-character index 4 that matches the first character of the keyword is found; among the word headings of the word index 3 linked from that one-character heading, a matching heading is found; and when the text linked from that matching word heading also matches the keyword, that range of text is displayed on the screen as the search result. The character string in the text where the keyword occurs is thereby displayed.
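
The flow of steps S12 to S15 can be sketched in Python as below. The sketch reads "matches or contains a matching part" in S13 as a prefix relation in either direction between keyword and heading, which is an interpretation rather than something the source spells out. The function name `search`, the sample text for "T8", and the small indexes are hypothetical.

```python
def search(keyword, char_index, word_index, texts):
    """Return the locations whose text contains the keyword, or [] if none."""
    hits = []
    # S12: look up the first character of the keyword among the one-character headings.
    for heading in char_index.get(keyword[:1], []):
        # S13: keep headings that equal the keyword or overlap it from the start.
        if not (heading.startswith(keyword) or keyword.startswith(heading)):
            continue
        # S14: confirm against the linked text itself.
        for location in word_index[heading]:
            if keyword in texts[location] and location not in hits:
                hits.append(location)
    return hits  # S15: the caller displays texts[location]; [] means "not found"

texts = {
    "T2": "ある事柄についての情報を集め、これをファイルに蓄える。",
    "T8": "世界の情勢を調べる。",  # hypothetical text unit
}
word_index = {"情報": ["T2"], "情勢": ["T8"]}
char_index = {"情": ["情報", "情勢"]}

found = search("情報", char_index, word_index, texts)
missing = search("検索", char_index, word_index, texts)
```

An empty result corresponds to the "not found" message that S12, S13, or S14 would display on the screen.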

[0032] FIG. 3 illustrates the creation of the word index of the present invention. The text data 1 is a document; the unit T2 shown in the figure (a sentence, paragraph, page, or the like) reads "ある事柄についての情報を集め、これをファイルに蓄える。そして必要に応じ・・・・・" ("Information about a certain matter is collected and stored in a file. Then, as necessary ...").

[0033] The preprocessing result here is the text arranged in sentence units. The word division result is obtained by dividing the preprocessing result into words with reference to the word dictionary 12 and further dividing unregistered words on the basis of character type information. Here the text is divided into words with the delimiter "/" as follows.

[0034]

    T2 /に/ついて/の/情報/を/集め/...

Listing these divided words and their links in an easy-to-read form gives the following.

[0035]

    Word heading    Link
    について        T2
    の              T2
    情報            T2
    を              T2

The word division results are stored in the word index 3 as word headings and links, with no duplicate headings, as follows.

[0036]

    Word heading    Link
    情報            T2, T3, T10, ...
    情勢            T8, T9, T23, ...

The first character of each word heading in the word index 3 is then extracted and made a one-character heading of the one-character index 4, with a link pointing to the position of the word heading.

[0037] As described above, when text data is specified, the preprocessing result is obtained, then the word division result, and the headings and links of the word index 3 are registered on the basis of that word division result. The first character of each word heading is then made a one-character heading of the one-character index 4 and linked to the word heading. The word index 3 and the one-character index 4 can thus be created automatically from the text data.

[0038]

[Effects of the Invention] As described above, the present invention divides the text data 1 into words to create the word index 3, sets n characters of each word heading of the word index 3 in the n-character index 4 together with a link, and thereby creates both the word index 3 and the n-character index 4. Words of arbitrary length and unregistered words grouped by character type information are taken from the text data 1 as the word headings of the word index 3, which reduces the number of entries and the memory capacity required, while the first n characters (for example, the first character) of each word heading are set in the n-character index 4 so that no word is missed. This eliminates the labor of maintaining the word dictionary and reduces index capacity while still allowing text data to be searched. In particular, creating the word index 3 means that each word registered in the word dictionary (two, three, four characters long, and so on) and each unregistered word grouped by character type information receives a single index entry, which reduces the amount of index data. Moreover, linking to the word headings of the word index 3 from the n-character index 4, in particular a one-character index, makes it possible to avoid missing character strings in the text.

[Brief Description of the Drawings]

FIG. 1 is a configuration diagram of one embodiment of the present invention.

FIG. 2 is a diagram explaining the operation of the present invention.

FIG. 3 is a diagram explaining the creation of the word index of the present invention.

[Explanation of Reference Numerals]

1: Text data
2: Text
3: Word index
4: n-character index
5: Preprocessing unit
6: Word division processing unit
7: Unregistered word processing unit
8: Word index creation unit
9: n-character index creation unit
12: Word dictionary
13: Dictionary editor

Claims

[Claims]

[Claim 1] An information retrieval device for searching documents, configured to create: a word index (3) in which text is divided into words and each word is used as a word heading with a link set to the corresponding text; and an n-character index (4) in which, for these word headings, links are set from n-character headings taken from the beginning (or elsewhere) of the word headings to the corresponding word headings.

[Claim 2] The information retrieval device according to claim 1, configured so that, when the text is divided into words, unregistered words not registered in a word dictionary (12) are grouped by character type information and made word headings.

[Claim 3] The information retrieval device according to claim 1, configured so that, in response to a search instruction specifying a keyword, the matching word heading of the word index (3) linked from the matching n-character heading of the n-character index (4) is found, and the character string of the corresponding text is retrieved through the link of that word heading and output.

[Claim 4] The information retrieval device according to any one of claims 1 to 3, wherein the n-character index (4) is a one-character index.