JP5245908B2

JP5245908B2 - Search method and apparatus

Info

Publication number: JP5245908B2
Application number: JP2009042098A
Authority: JP
Inventors: 友哉岩倉; 青史岡本
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2009-02-25
Filing date: 2009-02-25
Publication date: 2013-07-24
Anticipated expiration: 2029-02-25
Also published as: JP2010198288A

Description

The present technology relates to a search technique for finding texts similar to an input text.

An example of a conventional search technique will be described with reference to FIGS. 1 to 3. First, the processing for generating an index will be described with reference to FIGS. 1 and 2. The text “Taro, Hanako, Jiro, and Saburo cry.” with the text ID “1” is input (FIG. 1A). The independent words (nouns and verbs) are then extracted from the input text using a well-known word segmentation technique. In this example, “Taro”, “Hanako”, “Jiro”, “Saburo”, and “cry” are extracted (FIG. 1B). In an index DB, which stores, in association with each word, the IDs of the texts in which that word appears, the ID “1” is registered in association with each of the extracted words “Taro”, “Hanako”, “Jiro”, “Saburo”, and “cry” (FIG. 1C). Furthermore, in a text size DB, in which the number of words is registered in association with each text, the word count “5” is registered in association with the ID “1” of the current input text (FIG. 1D).

Thereafter, the text “Taro, Hanako, and Jiro cry.” with the text ID “2” is input (FIG. 2A). The independent words “Taro”, “Hanako”, “Jiro”, and “cry” are extracted from the input text (FIG. 2B). In the index DB, the text ID “2” is added to the record of each of the extracted words “Taro”, “Hanako”, “Jiro”, and “cry” (FIG. 2C). Furthermore, in the text size DB, the word count “4” is registered in association with the ID “2” of the current input text (FIG. 2D).

After this preprocessing has been performed, a text serving as the search key is input. For example, assume that the text “Taro and Jiro cry.” is input (FIG. 3A). The independent words are extracted using a well-known word segmentation technique, yielding “Taro”, “Jiro”, and “cry” (FIG. 3B). When the index DB shown in FIG. 2C is searched for “Taro”, the text IDs “1” and “2” are obtained; searching for “Jiro” yields the text IDs “1” and “2”; and searching for “cry” also yields the text IDs “1” and “2”. For each text ID obtained, the number of matching words is counted and stored in a common appearance word number storage unit. In this example, the number of matching words is 3 for the text with the text ID “1”, and also 3 for the text with the text ID “2” (FIG. 3C).

Here, for example, the cosine similarity is used as the similarity between two texts. Text A is represented by a binary vector A over the words contained in text A, and text B is represented by a binary vector B over the words contained in text B. A binary vector is a vector in which the value of a dimension is “1” when the corresponding word appears and “0” otherwise. For example, with the dimensions ordered as “Taro”, “Hanako”, “Jiro”, “cry”, “Saburo”, the binary vector of the text ID “1” is (1, 1, 1, 1, 1), and the binary vector of the text ID “2” is (1, 1, 1, 1, 0). The text serving as the search key is (1, 0, 1, 1, 0).

In this case, the cosine similarity between text A and text B is calculated as follows.

$$\cos(A, B) = \frac{A \cdot B}{(|A|\,|B|)^{1/2}}$$

A and B are the binary vectors of the words contained in the respective texts, and A · B is the inner product of A and B. For binary vectors, A · B is the number of words contained in both A and B, and it matches the value stored in the common appearance word number storage unit. |A| is the length of A, which here is the number of words (independent words) contained in A, and |B| is the length of B, which here is the number of words contained in B. |A||B| is the product of the length of A and the length of B.

When this cosine similarity is used, the denominator is calculated from the number of words in the input text serving as the search key and the number of words registered in the text size DB, and the numerator is the value stored in the common appearance word number storage unit.

Specifically, the cosine similarity between the input text serving as the search key and the text with the text ID “1” is cos(input, 1) = 3 / (3 × 5)^(1/2) = 0.774, and the cosine similarity between the input text and the text with the text ID “2” is cos(input, 2) = 3 / (3 × 4)^(1/2) = 0.866 (FIG. 3D).

Therefore, if the condition is set that the similarity must be equal to or greater than the similarity threshold s = 0.85, the text ID “2” is output as the text similar to the input text serving as the search key (FIG. 3E).
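The worked example above can be checked numerically. The following is a minimal sketch and not part of the original disclosure; the function name `binary_cosine` and the variable names are illustrative assumptions, and the vectors are assumed to be non-empty. It uses the dimension order given above (“Taro”, “Hanako”, “Jiro”, “cry”, “Saburo”).

```python
import math

def binary_cosine(a, b):
    """Cosine similarity of two binary word vectors, following the formula above.

    For binary vectors the inner product is the number of shared words, and
    |A| and |B| are word counts, so cos = A.B / (|A| * |B|) ** 0.5.
    """
    shared = sum(x * y for x, y in zip(a, b))   # A . B
    size_a, size_b = sum(a), sum(b)             # |A| and |B| as word counts
    return shared / math.sqrt(size_a * size_b)

# Dimensions: Taro, Hanako, Jiro, cry, Saburo
texts = {1: (1, 1, 1, 1, 1), 2: (1, 1, 1, 1, 0)}
query = (1, 0, 1, 1, 0)
threshold = 0.85

# Conventional approach: score every text, then keep those at or above s.
scores = {tid: binary_cosine(query, vec) for tid, vec in texts.items()}
similar = [tid for tid, score in scores.items() if score >= threshold]
print(scores, similar)
```

Every registered text is scored here, which is the cost that the narrowing described later is meant to avoid.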

With such a technique, the similarity must be calculated for every text that is registered in the index DB and matches even a single word, so the search time becomes very long as the number of texts registered in the index DB grows. Conventional techniques for narrowing down the search targets do exist, but they set the narrowing conditions statistically, so the narrowing may not be appropriate in some cases.

Japanese Laid-open Patent Publication No. 11-66086 (JP 11-66086 A)

As described above, narrowing down the search targets is effective for speeding up search processing, but with the conventional techniques the narrowing may not be performed appropriately in some cases.

Accordingly, an object of the present technology is to provide a novel technique for appropriately narrowing down search targets.

This search method includes: a step of extracting independent words from an input text stored in a storage device; a range specifying step of specifying, with the extracted independent words as a condition, the range of the number of independent words in an existing text for which the similarity to the input text is equal to or greater than a similarity threshold stored in the storage device; a similarity calculation step of calculating, only for existing texts whose number of independent words is within the specified range, the similarity between the existing text and the input text using the independent words in the existing text, which are stored in the storage device, and the extracted independent words, and storing the similarity in the storage device; and a step of identifying the existing texts whose similarity stored in the storage device is equal to or greater than the similarity threshold.

The search targets can be narrowed down appropriately.

FIG. 1 is a diagram for explaining conventional index generation processing.
FIG. 2 is a diagram for explaining conventional index generation processing.
FIG. 3 is a diagram for explaining conventional search processing.
FIG. 4 is a functional block diagram of the search device.
FIG. 5 is a diagram illustrating the main processing flow of the first embodiment.
FIG. 6 is a diagram illustrating an example of data stored in the index DB.
FIG. 7 is a diagram illustrating an example of data stored in the text size DB.
FIG. 8 is a diagram illustrating an example of data stored in the index DB after the index conversion processing.
FIG. 9 is a diagram illustrating the processing flow of the index generation processing.
FIG. 10 is a diagram illustrating the processing flow of the index generation processing.
FIG. 11 is a diagram illustrating the processing flow of the index conversion processing.
FIG. 12 is a diagram illustrating the processing flow of the common appearance word count calculation processing.
FIG. 13 is a diagram illustrating the processing flow of the common appearance word count calculation processing.
FIGS. 14A to 14E are diagrams for specifically explaining the processing of the first embodiment.
FIG. 15 is a diagram illustrating the processing flow of the similar text selection processing.
FIG. 16 is a diagram illustrating the main processing flow of the second embodiment.
FIG. 17 is a diagram illustrating an example of the size-specific index DB.
FIG. 18 is a diagram illustrating an example of the size-specific index DB.
FIG. 19 is a diagram illustrating the processing flow of the size-specific index generation processing.
FIG. 20 is a diagram illustrating the processing flow of the size-specific index generation processing.
FIGS. 21A and 21B are diagrams illustrating a specific example for explaining the size-specific index generation processing.
FIGS. 22A and 22B are diagrams illustrating a specific example for explaining the size-specific index generation processing.
FIG. 23 is a diagram illustrating the processing flow of the second common appearance word count calculation processing.
FIG. 24 is a diagram illustrating the processing flow of the second common appearance word count calculation processing.
FIGS. 25A to 25E are diagrams for specifically explaining the processing of the second embodiment.
FIG. 26 is a diagram illustrating the main processing flow of the third embodiment.
FIG. 27 is a diagram illustrating the processing flow of the third common appearance word count calculation processing.
FIG. 28 is a diagram illustrating the processing flow of the size range determination processing for comparison target texts.
FIGS. 29A to 29E are diagrams for specifically explaining the processing of the third embodiment.
FIG. 30 is a diagram illustrating the processing flow of the fourth common appearance word count calculation processing.
FIG. 31 is a functional block diagram of a computer.

[Embodiment 1]
A first embodiment will be described with reference to FIGS. 4 to 15.

First, FIG. 4 shows a functional block diagram of the search device according to the present embodiment. The search device 100 includes: an input unit 11; an index target text storage unit 12 that stores index target texts input from the input unit 11; an index generation unit 13 that performs index generation processing using the data stored in the index target text storage unit 12; an index DB 14 that stores the index data generated by the index generation unit 13; a text size DB 15 that stores the text size data generated by the index generation unit 13; an index conversion unit 16 that performs conversion processing on the index data stored in the index DB 14 based on the data in the text size DB 15; a search input text storage unit 18 that stores an input text, which is a search key, input from the input unit 11; a similarity threshold storage unit 20 that stores a similarity threshold input from the input unit 11; a common appearance word number calculation unit 17 that performs processing using the data stored in the index DB 14, the text size DB 15, the search input text storage unit 18, and the similarity threshold storage unit 20; a common appearance word number storage unit 19 that stores the processing results of the common appearance word number calculation unit 17; a similar text selection processing unit 21 that performs processing using the data stored in the common appearance word number storage unit 19, the text size DB 15, the search input text storage unit 18, and the similarity threshold storage unit 20; a text ID storage unit 22 that stores the processing results of the similar text selection processing unit 21; and an output unit 23 that outputs the data stored in the text ID storage unit 22. When the common appearance word number calculation unit 17 determines that the target document (that is, text) cannot be obtained even if the search processing is performed, it outputs a no-solution notification to the output unit 23.

Next, the processing performed by the search device 100 will be described with reference to FIGS. 5 to 15. First, the index generation unit 13 performs index generation processing on the index target texts that have been input from the input unit 11 and stored in the index target text storage unit 12 (FIG. 5: step S1). In the present embodiment the index generation processing is the same as in the conventional technique, and it will be described in detail later. In the index generation processing, data such as that shown in FIG. 6 is stored in the index DB 14. The data structure is the same as that described for the conventional technique. Furthermore, in the index generation processing, data such as that shown in FIG. 7 is stored in the text size DB 15. This data structure is also the same as that described for the conventional technique.

When new data has been accumulated in the index DB 14, the index conversion unit 16 performs index conversion processing (step S3). The index conversion processing will be described in detail later. Briefly, for each word stored in the index DB 14, the text IDs are rearranged in ascending order of the number of words contained in the corresponding texts. That is, when an index DB 14 such as that in FIG. 6 exists, the text IDs are sorted in ascending order according to the word count of each text stored in the text size DB 15 such as that in FIG. 7. In the example of FIG. 7, the IDs in ascending order of word count are “2”, “1”, “3”, “4”, so the text IDs for each word are rearranged as shown in FIG. 8.

Then, for the input text serving as the search key, which has been input through the input unit 11 and stored in the search input text storage unit 18, the common appearance word number calculation unit 17 performs common appearance word count calculation processing, which calculates the number of common appearance words, the data needed for the similarity calculation, while narrowing down the search target texts using the data stored in the index DB 14, the text size DB 15, and the similarity threshold storage unit 20 (step S5). The number of common appearance words corresponds to the value of the inner product A · B that is needed when the cosine similarity described for the conventional technique is used. Because the search target texts have been narrowed down, the number of text IDs stored in the common appearance word number storage unit 19 is smaller in the present embodiment than in the conventional technique. The common appearance word count calculation processing will be described in detail later.

Then, based on the data stored in the search input text storage unit 18, the similarity threshold storage unit 20, the text size DB 15, and the common appearance word number storage unit 19, the similar text selection processing unit 21 performs similar text selection processing, in which the similarity is calculated for each text ID stored in the common appearance word number storage unit 19 and is stored in the text ID storage unit 22 together with the text ID (step S7). The similar text selection processing also includes the processing of the output unit 23; the details will be described later.

By performing the processing described above, texts whose similarity to the input text serving as the search key is equal to or greater than the similarity threshold can be extracted at high speed.

Next, the index generation processing will be described with reference to FIGS. 9 and 10. The index generation unit 13 identifies one unprocessed text among the index target texts stored in the index target text storage unit 12 (step S11). It then selects, for the identified text, an ID that is unused in the index DB 14 (step S15). In the text size DB 15, it initializes the value corresponding to the selected ID to 0 (step S17).

Thereafter, the index generation unit 13 extracts the independent words (nouns and verbs) from the identified text by a well-known method and stores them in a storage device such as a main memory (step S19). It then identifies one unprocessed word (step S21). The processing proceeds to the processing of FIG. 10 via terminal A.

Turning to the processing of FIG. 10, the index generation unit 13 registers the selected ID in the index DB 14 in association with the identified word (step S23). Furthermore, in the text size DB 15, it adds “1” to the value corresponding to the selected ID (step S25). It then determines whether an unprocessed word exists (step S27); if one exists, the processing returns to step S21 via terminal B. If no unprocessed word exists, it determines whether an unprocessed text exists in the index target text storage unit 12 (step S29). If an unprocessed text exists, the processing returns to step S11 via terminal C. If no unprocessed text exists, the processing returns to the calling processing.

By performing the processing described above, the data of the index DB 14 as shown in FIG. 6 is generated, and the data of the text size DB 15 as shown in FIG. 7 is also generated.
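The flow of FIGS. 9 and 10 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the function name `build_index` is hypothetical, word segmentation is assumed to have been done already, text IDs are assumed to be assigned sequentially from 1, and each word is assumed to occur at most once per text.

```python
from collections import defaultdict

def build_index(tokenized_texts):
    """Build the word -> text IDs index and the text ID -> word count table.

    tokenized_texts: iterable of word lists, one per text, already reduced
    to independent words (segmentation itself is outside this sketch).
    """
    index_db = defaultdict(list)   # word -> list of text IDs (index DB 14)
    text_size_db = {}              # text ID -> number of words (text size DB 15)
    next_id = 1                    # assumption: IDs are handed out sequentially
    for words in tokenized_texts:
        text_id = next_id                      # step S15: pick an unused ID
        next_id += 1
        text_size_db[text_id] = 0              # step S17: initialize to 0
        for word in words:
            index_db[word].append(text_id)     # step S23: register the ID
            text_size_db[text_id] += 1         # step S25: add 1
    return dict(index_db), text_size_db

index_db, text_size_db = build_index([
    ["Taro", "Hanako", "Jiro", "Saburo", "cry"],   # text 1
    ["Taro", "Hanako", "Jiro", "cry"],             # text 2
])
print(index_db, text_size_db)
```

With the two example texts from the conventional-technique description, every word except “Saburo” maps to both IDs, and the word counts are 5 and 4.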

Next, the index conversion processing will be described with reference to FIG. 11. The index conversion unit 16 identifies one unprocessed word in the index DB 14 (step S31). It then obtains from the text size DB 15 the word count corresponding to each ID registered in association with the identified word, and sorts the IDs in ascending order of word count (step S33). Furthermore, it registers the sorting result (that is, the sorted ID sequence) in the index DB 14 in association with the identified word (step S35).

Thereafter, the index conversion unit 16 determines whether an unprocessed word exists in the index DB 14 (step S37); if one exists, the processing returns to step S31. When all the words have been processed, the processing returns to the calling processing.

By performing this processing, the index DB 14 as shown in FIG. 6 is converted into the index DB 14 as shown in FIG. 8. Because the text IDs are arranged in order of word count, when they are processed in ascending order, the IDs from a certain position onward fall outside the size range described below and are excluded from processing. The search target texts can therefore be narrowed down easily and at high speed.
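The conversion of FIG. 11 amounts to sorting each ID list by the word count of the corresponding text. A minimal sketch follows; the function name `convert_index` and the sample data (only texts 1 and 2 from the earlier example) are illustrative assumptions.

```python
def convert_index(index_db, text_size_db):
    """Return a copy of the index with each ID list sorted by ascending word count."""
    return {
        word: sorted(ids, key=lambda text_id: text_size_db[text_id])  # step S33
        for word, ids in index_db.items()
    }

index_db = {"Taro": [1, 2], "Hanako": [1, 2], "Jiro": [1, 2],
            "Saburo": [1], "cry": [1, 2]}
text_size_db = {1: 5, 2: 4}   # text 2 is shorter, so it should come first
converted = convert_index(index_db, text_size_db)
print(converted)
```

Python's `sorted` is stable, so IDs with equal word counts keep their original relative order.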

Next, the common appearance word count calculation processing will be described with reference to FIGS. 12 to 15. The common appearance word number calculation unit 17 reads the input text serving as the search key from the search input text storage unit 18 (step S41). It also initializes the common appearance word number storage unit 19 (step S43). It then extracts the independent words (verbs and nouns) from the input text by a well-known method and stores them, together with the number of words, in, for example, the search input text storage unit 18 (step S45). From the number of words in the input text and the similarity threshold s stored in the similarity threshold storage unit 20, it then determines the size range of the comparison target texts and stores the range in a storage device such as a main memory (step S47).

In the case of cosine similarity, for example, the size range can be calculated by the following expression.

$$s^{2}\,|A| \le |B_i| \le \frac{|A|}{s^{2}} \qquad (1)$$

Here, the number of words in the input text is denoted by |A|, and |B_i| denotes the number of words in a comparison target text. The reason why expression (1) is obtained will be described in detail later; it is derived from the formula for the cosine similarity and, given the similarity threshold s, it serves as a function that calculates an upper limit and a lower limit with the number of words in the input text as the variable. Outside this range, the condition of the similarity threshold s cannot be satisfied given the number of words in the input text. This expression involves no probabilistic considerations. When the size range of the comparison target texts is narrowed in this way, the comparison target texts are narrowed down analytically, so every text that should be compared is processed without omission and the processing is also made faster. Note that in some cases the upper limit and the lower limit do not yield an integer size range. For example, when a range with an upper limit of 2.8 and a lower limit of 2.5 is calculated, no integer solution (that is, no size range) is obtained, so the target document cannot be identified even if the search processing is performed. In that case, the common appearance word number calculation unit 17 sends a no-solution notification to the output unit 23, the output unit 23 outputs a message indicating that the search has no solution (for example, “No document matching the conditions was found.”) to an output device such as a display device or a printing device, and the processing ends.
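As a minimal sketch of step S47, the size range can be computed from the bounds s²·|A| and |A|/s² that the worked example of this embodiment uses for the cosine similarity. These bounds are consistent with the fact that the number of shared words cannot exceed the smaller of the two word counts. The function name `size_range` and the tolerance `eps` are assumptions introduced here for illustration.

```python
import math

def size_range(num_query_words, s, eps=1e-9):
    """Integer range of word counts that can still reach cosine similarity s.

    Returns (low, high), both inclusive, or None when no integer lies between
    the bounds s**2 * |A| and |A| / s**2 (the "no solution" case in the text).
    eps is an assumed tolerance against floating-point rounding.
    """
    lower = s * s * num_query_words        # lower limit: s^2 * |A|
    upper = num_query_words / (s * s)      # upper limit: |A| / s^2
    low = math.ceil(lower - eps)           # smallest integer >= lower limit
    high = math.floor(upper + eps)         # largest integer <= upper limit
    return (low, high) if low <= high else None

print(size_range(3, 0.85))
```

For three query words and s = 0.85 the bounds are about 2.1675 and 4.1522, so only texts with 3 or 4 words remain candidates. The `None` branch mirrors the no-solution notification described above.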

Thereafter, the common appearance word number calculation unit 17 identifies an unprocessed word among the words extracted from the input text (step S49). It then determines whether the identified word is registered in the index DB 14 (step S51). If the word is not registered, it determines whether an unprocessed word exists (step S53); if one exists, the processing returns to step S49. If no unprocessed word exists, this processing ends via terminal G and the processing returns to the calling processing.

On the other hand, if the identified word is registered in the index DB 14, the common appearance word number calculation unit 17 identifies, among the IDs associated with the identified word in the index DB 14, one unprocessed ID, starting from the one with the smallest word count (step S55). The processing proceeds to the processing of FIG. 13 via terminal F.

Turning to the processing of FIG. 13, the common appearance word number calculation unit 17 reads the word count of the identified ID from the text size DB 15 and determines whether that word count is within the size range (step S57). If the word count is not within the size range, the processing proceeds to step S61. When the word count is below the lower limit, the word counts of the subsequent IDs increase and may still fall within the size range, so the processing proceeds to step S61 to confirm that the upper limit has not been exceeded before the next processing is decided.

On the other hand, if the word count is within the size range, the common appearance word number calculation unit 17 increments by 1 the value associated with the identified ID in the common appearance word number storage unit 19 (step S59). It then determines whether the word count of the identified ID exceeds the upper limit of the size range (step S61). If the word count of the identified ID exceeds the upper limit of the size range, no further processing is needed for this word, so the processing proceeds to step S65. If the word count of the identified ID is equal to or less than the upper limit of the size range, it determines whether an unprocessed ID still exists for the word being processed (step S63). If an unprocessed ID exists, the processing returns to step S55 via terminal E.

On the other hand, if no unprocessed ID exists, the common appearance word number calculation unit 17 determines whether an unprocessed word exists among the words extracted from the input text (step S65). If an unprocessed word exists, the processing returns to step S49 via terminal D. If no unprocessed word exists, all the words extracted from the input text have been processed, so the processing returns to the calling processing.

The processing of FIGS. 12 and 13 will be described specifically with reference to FIG. 14. For example, when the input text “Taro and Jiro cry.” is obtained as the search key (FIG. 14A), it is divided in step S45 into the three independent words “Taro”, “Jiro”, and “cry”. Assume that the similarity threshold s is set to 0.85. From expression (1), the size range of the comparison target texts is then 2.1675 (= s² × |A| = 0.85² × 3) ≤ number of words in the comparison target text ≤ 4.1522 (= |A| / s² = 3 / 0.85²), so the word count, being an integer, must be “3” or “4”.

Then, when the index DB 14 of FIG. 8 is searched for “Taro”, a matching record exists and IDs “2”, “1”, “3”, and “4” are obtained. The text size DB 15 of FIG. 7 shows that the word count of ID “2” is 4, so the common appearance word number storage unit 19 registers a common appearance word count of 1 in association with ID “2”. Next, the text size DB 15 of FIG. 7 shows that the word count of ID “1” is 5. Because 5 is outside the size range and exceeds its upper limit, processing for “Taro” ends. Next, when the index DB 14 of FIG. 8 is searched for “Jiro”, a matching record exists and IDs “2” and “1” are obtained. As with “Taro”, only ID “2”, with a word count of 4, is within the size range, so the common appearance word count registered for ID “2” becomes 2. Finally, when the index DB 14 of FIG. 8 is searched for “cry”, a matching record exists and IDs “2” and “1” are obtained. As with “Taro” and “Jiro”, only ID “2” is within the size range, so the common appearance word count registered for ID “2” becomes 3 (FIG. 13(c)). The subsequent processing will be explained after the processing flow has been described.
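The walk-through above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name and the dictionary layout are assumptions, the ID lists are assumed to be pre-sorted by ascending word count, and the word counts for IDs 3 and 4 are made-up placeholders because their values are not given in this passage.

```python
def count_common_words(query_words, index, text_size, lo, hi):
    """Count shared words per text ID, visiting only texts whose size is in [lo, hi]."""
    counts = {}
    for word in query_words:
        # index[word] is assumed to be sorted by ascending text word count.
        for text_id in index.get(word, []):
            n = text_size[text_id]
            if n > hi:
                break  # every later ID is at least as long, so stop for this word
            if n >= lo:
                counts[text_id] = counts.get(text_id, 0) + 1
    return counts

index = {"taro": [2, 1, 3, 4], "jiro": [2, 1], "cry": [2, 1]}
text_size = {1: 5, 2: 4, 3: 6, 4: 7}  # sizes of IDs 3 and 4 are assumed values
print(count_common_words(["taro", "jiro", "cry"], index, text_size, 3, 4))
```

Because the ID lists are sorted by size, the first ID above the upper limit ends the scan for that word, which is where the saving comes from.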

Next, the similar text selection process will be described with reference to FIG. 15. The similar text selection processing unit 21 identifies an unprocessed ID among the IDs registered in the common appearance word number storage unit 19 (step S251). It then calculates the similarity for the identified ID and stores it in a storage device such as main memory (step S252). For cosine similarity, for example, it reads the number of common appearance words (= A·B) associated with the identified ID from the common appearance word number storage unit 19, reads the word count associated with the identified ID from the text size DB 15, reads the word count of the input text from, for example, the search input text storage unit 18, and calculates the cosine similarity as (number of common appearance words) / {(word count associated with the identified ID)^1/2 * (word count of the input text)^1/2}.

図１４の例では、コサイン類似度ｃｏｓ（入力，２）は、３／｛３＊４｝^1/2＝０．８６６と算出される（図１４（ｄ））。 In the example of FIG. 14, the cosine similarity cos (input, 2) is calculated as 3 / {3 * 4} ^1/2 = 0.866 (FIG. 14 (d)).
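As a quick check of this figure, the cosine similarity can be computed directly from the three counts. The function name is an illustrative assumption.

```python
import math

def cosine_similarity(common: int, n_text: int, n_query: int) -> float:
    """Shared-word count divided by the square roots of the two word counts."""
    return common / (math.sqrt(n_text) * math.sqrt(n_query))

# FIG. 14 example: 3 shared words, a 4-word text, and a 3-word query.
print(round(cosine_similarity(3, 4, 3), 3))
```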

Then, the similar text selection processing unit 21 determines whether the calculated similarity is equal to or greater than the similarity threshold s (step S253). If the calculated similarity is less than s, the process proceeds to step S249. If it is equal to or greater than s, the identified ID and the similarity are stored in the text ID storage unit 22 (step S254). The similarity calculated in FIG. 14(d) is at least the similarity threshold s = 0.85, so it is stored in the text ID storage unit 22.

Thereafter, after step S247 or when the calculated similarity is less than the similarity threshold s, the similar text selection processing unit 21 determines whether all IDs in the common appearance word number storage unit 19 have been processed (step S255). If an unprocessed ID remains, the process returns to step S241. If none remains, the output unit 23 outputs the text IDs, or the text IDs and their similarities, stored in the text ID storage unit 22 to an output device such as a display or printer (step S256). The result may instead be transmitted, for example, to another computer connected to the search device 100 via a network. In the example of FIG. 14, ID “2” is output (FIG. 14(e)).
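The selection loop described above can be sketched as below. The function name, argument names, and the list-of-tuples return value are assumptions made for illustration; the patent only specifies that IDs whose similarity reaches the threshold are stored and output.

```python
import math

def select_similar(common_counts, text_size, n_query, s):
    """Return (text_id, similarity) pairs whose cosine similarity is at least s."""
    selected = []
    for text_id, common in common_counts.items():
        sim = common / math.sqrt(text_size[text_id] * n_query)
        if sim >= s:
            selected.append((text_id, sim))
    return selected

# FIG. 14 example: only ID 2 survived the size filter, with 3 shared words.
result = select_similar({2: 3}, {1: 5, 2: 4}, n_query=3, s=0.85)
print(result)
```

Without the size filter, ID 1 would also reach this step and be rejected only after its similarity had been computed.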

With a technique like the prior art, two IDs are identified, as shown in FIG. 3(c). Here, as shown in FIG. 14(c), the comparison target texts are narrowed down and only one ID is identified. Fewer IDs mean fewer similarity calculations, which speeds up the overall processing.

[Embodiment 2]
This embodiment speeds up processing by making it easier to identify the texts that remain after narrowing down. Specifically, the data stored in the index DB is generated separately for each word count.

本実施の形態に係る検索装置の構成は、インデックス変換部１６を有しない部分を除き、図４で示した機能ブロック図と同じである。従って本実施の形態では、図４をベースに説明する。但し、各処理部は以下で述べるような異なる処理を実施する。 The configuration of the search device according to the present embodiment is the same as the functional block diagram shown in FIG. 4 except for the part that does not have the index conversion unit 16. Therefore, this embodiment will be described based on FIG. However, each processing unit performs different processing as described below.

図１６に、第２の実施の形態に係るメインの処理フローを示す。まず、インデックス生成部１３は、入力部１１から入力され且つインデックス対象テキスト格納部１２に格納されているインデックス対象テキストに対してサイズ別インデックス生成処理を実施する（ステップＳ２６１）。サイズ別インデックス生成処理については、後に詳しく述べる。なお、図１７及び図１８に示すようなインデックスデータが、インデックスＤＢ１４に格納される。図１７のインデックスデータは単語数「４」のインデックスデータであり、図１８のインデックスデータは単語数「５」のインデックスデータである。このように、単語数毎に、インデックスデータが生成されるようになる。なお、サイズ別インデックス生成処理では、例えば図７に示すようなデータがテキストサイズＤＢ１５に格納される。データ構造については従来技術で説明したものと同じである。 FIG. 16 shows a main processing flow according to the second embodiment. First, the index generation unit 13 performs size-specific index generation processing on the index target text input from the input unit 11 and stored in the index target text storage unit 12 (step S261). The size-specific index generation process will be described in detail later. Note that index data as shown in FIGS. 17 and 18 is stored in the index DB 14. The index data in FIG. 17 is index data with the number of words “4”, and the index data in FIG. 18 is index data with the number of words “5”. In this way, index data is generated for each number of words. In the size-specific index generation process, for example, data as shown in FIG. 7 is stored in the text size DB 15. The data structure is the same as described in the prior art.

The common appearance word number calculation unit 17 then performs a second common appearance word number calculation process on the input text that serves as the search key, which was input through the input unit 11 and stored in the search input text storage unit 18 (step S263). Using the data stored in the index DB 14, the text size DB 15, and the similarity threshold storage unit 20, this process narrows down the comparison target texts while calculating the number of common appearance words, which is the data needed for the similarity calculation. The comparison target texts are narrowed down by narrowing down the index data. The second common appearance word number calculation process is described in detail later.

そして、類似テキスト選択処理部２１は、検索入力テキスト格納部１８と類似度閾値格納部２０とテキストサイズＤＢ１５と共通出現単語数格納部１９とに格納されているデータに基づき、共通出現単語数格納部１９に格納されているテキストＩＤ毎に類似度を算出して、テキストＩＤと共にテキストＩＤ格納部２２に格納する類似テキスト選択処理を実施する（ステップＳ２６５）。類似テキスト選択処理については、図１５で述べたものと同一である。従って、ここでは説明は省略する。 The similar text selection processing unit 21 stores the number of common appearance words based on data stored in the search input text storage unit 18, the similarity threshold storage unit 20, the text size DB 15, and the common appearance word number storage unit 19. A similarity is calculated for each text ID stored in the unit 19, and a similar text selection process is performed in which the similarity is stored in the text ID storage unit 22 together with the text ID (step S265). The similar text selection processing is the same as that described in FIG. Therefore, the description is omitted here.

以上のような処理を実施することによって、検索キーとなる入力テキストに対して類似度閾値以上の類似度となるテキストを、さらに高速に抽出することができるようになる。 By performing the processing as described above, it becomes possible to extract text having a similarity higher than the similarity threshold with respect to the input text serving as the search key at a higher speed.

次に、サイズ別インデックス生成処理について図１９乃至図２２を用いて説明する。インデックス生成部１３は、インデックス対象テキスト格納部１２に格納されているインデックス対象のテキストのうち未処理のテキストを１つ特定する（ステップＳ７１）。そして、インデックスＤＢ１４において未使用のＩＤを、特定されたテキスト用に選択する（ステップＳ７３）。また、テキストサイズＤＢ１５において、選択されたＩＤに対応する値を０に初期化する（ステップＳ７５）。 Next, the size-specific index generation processing will be described with reference to FIGS. The index generation unit 13 identifies one unprocessed text among the index target texts stored in the index target text storage unit 12 (step S71). Then, an unused ID in the index DB 14 is selected for the specified text (step S73). In the text size DB 15, a value corresponding to the selected ID is initialized to 0 (step S75).

After that, the index generation unit 13 extracts independent words (nouns and verbs) from the identified text by a well-known method and stores them in a storage device such as main memory (step S77). The number of words is counted at this point. One unprocessed word is then identified (step S79). The process moves to the processing of FIG. 20 via terminal H.

Turning to FIG. 20, the index generation unit 13 selects the size-specific index DB corresponding to the word count of the identified text (step S81). The size-specific index DBs are provided within the index DB 14. In the selected size-specific index DB, the selected ID is registered in association with the identified word (step S83). In addition, 1 is added to the value corresponding to the selected ID in the text size DB 15 (step S85). The unit then determines whether an unprocessed word remains (step S87); if one does, the process returns to step S79 via terminal J. If none remains, it determines whether an unprocessed text remains in the index target text storage unit 12 (step S89). If one does, the process returns to step S71 via terminal K. If none remains, the process returns to the calling process.

例えば、図２１（ａ）のようにＩＤ「１」のテキスト「太郎と花子と二郎が泣く。」から、自立語を抽出すると図２１（ｂ）のように「太郎」「花子」「二郎」「泣く」という単語が得られる。そして上で述べたような処理を実施することによって、単語数４のためのサイズ別インデックスＤＢ（図１７）が得られる。さらに、図２２（ａ）のようにＩＤ「２」のテキスト「太郎と花子と二郎と三郎が泣く。」から、自立語を抽出すると図２２（ｂ）のように「太郎」「花子」「二郎」「三郎」「泣く」という単語が得られる。そして上で述べたような処理を実施することによって、単語数５のためのサイズ別インデックスＤＢ（図１８）が得られる。さらに図７のようなテキストサイズＤＢ１５のデータが生成される。 For example, when independent words are extracted from the text “Taro, Hanako and Jiro cry” with ID “1” as shown in FIG. 21A, “Taro” “Hanako” “Jiro” as shown in FIG. The word “cry” is obtained. Then, by executing the processing as described above, an index DB for each size (FIG. 17) for the number of words 4 is obtained. Further, when independent words are extracted from the text “Taro, Hanako, Jiro, and Saburo cry” with ID “2” as shown in FIG. 22A, “Taro” “Hanako” “ The words "Jiro", "Saburo" and "Cry" are obtained. Then, by executing the processing as described above, an index DB for each size (FIG. 18) for the number of words 5 is obtained. Further, data of the text size DB 15 as shown in FIG. 7 is generated.
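A minimal sketch of size-specific index generation, assuming the texts have already been split into distinct independent words. The function name, the nested-dictionary layout, and the romanized words are illustrative assumptions rather than the patent's data structures.

```python
from collections import defaultdict

def build_size_indexes(texts):
    """texts maps text ID -> list of distinct independent words.

    Returns (size_indexes, text_size), where size_indexes[n][word] lists the IDs
    of n-word texts containing word, and text_size[text_id] is the word count.
    """
    size_indexes = defaultdict(lambda: defaultdict(list))
    text_size = {}
    for text_id, words in texts.items():
        n = len(words)
        text_size[text_id] = n
        for word in words:
            size_indexes[n][word].append(text_id)  # one index per word count
    return size_indexes, text_size

texts = {
    1: ["taro", "hanako", "jiro", "cry"],            # FIG. 21 example
    2: ["taro", "hanako", "jiro", "saburo", "cry"],  # FIG. 22 example
}
size_indexes, text_size = build_size_indexes(texts)
print(dict(size_indexes[4]), text_size)
```

Keeping one index per word count means a later search can pick the relevant indexes by size alone.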

Next, the second common appearance word number calculation process will be described with reference to FIGS. 23 to 25. The common appearance word number calculation unit 17 reads the input text that serves as the search key from the search input text storage unit 18 (FIG. 23: step S91) and initializes the common appearance word number storage unit 19 (step S93). It then extracts independent words (verbs and nouns) from the input text by a well-known method and stores them, together with the word count, in, for example, the search input text storage unit 18 (step S95). From the word count of the input text and the similarity threshold s stored in the similarity threshold storage unit 20, it determines the size range of the comparison target text and stores it in a storage device such as main memory (step S97). The size range is calculated, for example, according to expression (1) above. Note that the upper and lower limits may not yield an integer size range. For example, if the calculated upper limit is 2.8 and the lower limit is 2.5, there is no integer solution (that is, no size range), so the target document cannot be identified even if the search is carried out. In that case, the common appearance word number calculation unit 17 notifies the output unit 23 that there is no solution, the output unit 23 outputs a no-result message (for example, “No documents matching the conditions were found.”) to an output device such as a display or printer, and the process ends.

その後、共通出現単語数算出部１７は、入力テキストから抽出された単語のうち未処理の単語を特定する（ステップＳ９９）。そして、上で決定されたサイズ範囲内における未処理のサイズに係るサイズ別インデックスＤＢを１つ選択する（ステップＳ１０１）。そして、選択されたサイズ別インデックスＤＢを、特定された単語で検索して、当該サイズ別インデックスＤＢに、特定された単語が登録されているか判断する（ステップＳ１０３）。登録されていない場合には、端子Ｍを介してステップＳ１１１に移行する。 Then, the common appearance word number calculation part 17 specifies an unprocessed word among the words extracted from the input text (step S99). Then, one size-specific index DB relating to an unprocessed size within the size range determined above is selected (step S101). Then, the selected size-specific index DB is searched with the specified word, and it is determined whether the specified word is registered in the size-specific index DB (step S103). If not registered, the process proceeds to step S111 via the terminal M.

On the other hand, when the identified word is registered in the selected size-specific index DB, the common appearance word number calculation unit 17 identifies one unprocessed ID among the IDs associated with the identified word in that size-specific index DB (step S105). The process moves to the processing of FIG. 24 via terminal L.

図２４の処理の説明に移行して、共通出現単語数算出部１７は、共通出現単語数格納部１９において、特定されたＩＤに対応付けられている値を１インクリメントする（ステップＳ１０７）。そして、処理に係る単語について未処理のＩＤがまだ存在するか判断する（ステップＳ１０９）。未処理のＩＤが存在する場合には、端子Ｎを介してステップＳ１０５に戻る。 Shifting to the description of the processing in FIG. 24, the common appearance word number calculation unit 17 increments the value associated with the identified ID in the common appearance word number storage unit 19 by 1 (step S107). Then, it is determined whether there is still an unprocessed ID for the word related to the process (step S109). If there is an unprocessed ID, the process returns to step S105 via the terminal N.

一方、未処理のＩＤが存在しない場合には、共通出現単語数算出部１７は、ステップＳ１０１で選択されたサイズ別インデックスＤＢのうち、未処理のサイズに係るサイズ別インデックスＤＢが存在するか判断する（ステップＳ１１１）。未処理のサイズに係るサイズ別インデックスＤＢが存在する場合には、端子Ｐを介してステップＳ１０１に戻る。一方、ステップＳ１０１で選択されたサイズ別インデックスＤＢを全て処理した場合には、入力テキストから抽出された単語のうち未処理の単語が存在しているか判断する（ステップＳ１１３）。未処理の単語が存在している場合には、端子Ｑを介してステップＳ９９に戻る。未処理の単語が存在しない場合には、入力テキストから抽出された単語を全て処理したことになるので、元の処理に戻る。 On the other hand, when there is no unprocessed ID, the common appearance word number calculation unit 17 determines whether there is an index DB by size related to the unprocessed size among the index DBs by size selected in step S101. (Step S111). If there is a size-specific index DB related to an unprocessed size, the process returns to step S101 via the terminal P. On the other hand, when all the size-specific index DBs selected in step S101 are processed, it is determined whether or not there is an unprocessed word among the words extracted from the input text (step S113). If there is an unprocessed word, the process returns to step S99 via the terminal Q. If there are no unprocessed words, all the words extracted from the input text have been processed, and the process returns to the original process.

The processing of FIGS. 23 and 24 will now be described concretely with reference to FIG. 25. For example, when the input text “Taro and Jiro cry.” is obtained as the search key (FIG. 25(a)), it is divided in step S95 into three independent words: “Taro”, “Jiro”, and “cry” (FIG. 25(b)). Assume that the similarity threshold s is set to 0.85. Then, from expression (1), the size range of the comparison target text is 2.1675 (= s² * |A| = 0.85² * 3) ≤ number of words in the comparison target text ≤ 4.1522 (= |A| / s² = 3 / 0.85²), so the integer word counts are 3 and 4.

従って、単語数「３」のサイズ別インデックスＤＢと単語数「４」のサイズ別インデックスＤＢとを選択する。但し、本例では、単語数「４」のサイズ別インデックスＤＢ（図１７）が選択される。 Accordingly, the size-specific index DB with the word count “3” and the size-specific index DB with the word count “4” are selected. However, in this example, the size-specific index DB (FIG. 17) with the number of words “4” is selected.

そして、「太郎」で図１７のサイズ別インデックスＤＢを検索すると、該当レコードが存在し、ＩＤ「１」が得られるので、共通出現単語数格納部１９には、ＩＤ「１」に対応付けて共通出現単語数「１」を登録する。次に、「二郎」で図１７のサイズ別インデックスＤＢを検索すると、該当レコードが存在し、ＩＤ「１」が得られる。「太郎」と同様に、共通出現単語数格納部１９には、ＩＤ「１」に対応付けて共通出現単語数「２」を登録する。さらに、「泣く」で図１７のサイズ別インデックスＤＢを検索すると、該当レコードが存在し、ＩＤ「１」が得られる。「太郎」「二郎」と同様に、共通出現単語数格納部１９には、ＩＤ「１」に対応付けて共通出現単語数「３」を登録する（図２５（ｃ））。 Then, when searching for the size-specific index DB of FIG. 17 with “Taro”, the corresponding record exists and the ID “1” is obtained, so the common appearance word count storage unit 19 associates it with the ID “1”. The number of common appearance words “1” is registered. Next, when “Jiro” searches the index DB by size in FIG. 17, the corresponding record exists and ID “1” is obtained. Similarly to “Taro”, the common appearance word number storage unit 19 registers the common appearance word number “2” in association with the ID “1”. Further, when the index DB classified by size in FIG. 17 is searched by “crying”, the corresponding record exists and the ID “1” is obtained. Similarly to “Taro” and “Jiro”, the common appearance word number storage unit 19 registers the common appearance word number “3” in association with the ID “1” (FIG. 25C).
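The lookup described above can be sketched as follows, assuming the size-specific indexes are held as nested dictionaries keyed by word count. The names and the data layout are illustrative assumptions.

```python
def count_common_by_size(query_words, size_indexes, sizes):
    """Count shared words per text ID, consulting only the indexes for the given sizes."""
    counts = {}
    for word in query_words:
        for n in sizes:
            # A missing size or a missing word simply contributes nothing.
            for text_id in size_indexes.get(n, {}).get(word, []):
                counts[text_id] = counts.get(text_id, 0) + 1
    return counts

size_indexes = {
    4: {"taro": [1], "hanako": [1], "jiro": [1], "cry": [1]},                 # FIG. 17
    5: {"taro": [2], "hanako": [2], "jiro": [2], "saburo": [2], "cry": [2]},  # FIG. 18
}
print(count_common_by_size(["taro", "jiro", "cry"], size_indexes, [3, 4]))
```

No per-ID size lookup is needed here, because the choice of index already guarantees that every ID found is within the size range.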

Then, the same similar text selection process as in the first embodiment is performed. In the example of FIG. 25, the cosine similarity cos(input, 1) is calculated as 3 / {3 * 4}^1/2 = 0.866 (FIG. 25(d)). This similarity is at least the similarity threshold s = 0.85, so it is stored in the text ID storage unit 22. The output unit 23 then outputs ID “1” (FIG. 25(e)).

このように、サイズ別インデックスＤＢを用いることによって、テキストサイズＤＢ１５へのアクセス回数が減少していることが分かる。従って、その分検索時における処理速度が向上する。そのほか、比較対象テキストが絞り込まれる点については第１の実施の形態と同様である。 Thus, it can be seen that the number of accesses to the text size DB 15 is reduced by using the size-specific index DB. Therefore, the processing speed at the time of searching is improved accordingly. In addition, it is the same as in the first embodiment in that the comparison target text is narrowed down.

[Embodiment 3]
This embodiment adopts a method that narrows the size range of the comparison target text still further. The configuration of the search device according to this embodiment is the same as the functional block diagram shown in FIG. 4, so the description below is based on FIG. 4. However, each processing unit performs different processing, as described below.

図２６に本実施の形態に係るメイン処理フローを示す。まず、インデックス生成部１３は、入力部１１から入力され且つインデックス対象テキスト格納部１２に格納されているインデックス対象テキストに対してインデックス生成処理を実施する（ステップＳ１２１）。インデックス生成処理については、図９及び図１０に示したものと同じであるから、説明を省略する。 FIG. 26 shows a main processing flow according to the present embodiment. First, the index generation unit 13 performs an index generation process on the index target text input from the input unit 11 and stored in the index target text storage unit 12 (step S121). The index generation process is the same as that shown in FIGS. 9 and 10 and will not be described.

Then, when new data is accumulated in the index DB 14, the index conversion unit 16 performs an index conversion process (step S123). This index conversion process is also the same as that shown in FIG. 11, so its description is omitted.

The common appearance word number calculation unit 17 then performs a third common appearance word number calculation process on the input text that serves as the search key, which was input through the input unit 11 and stored in the search input text storage unit 18 (step S125). Using the data stored in the index DB 14, the text size DB 15, and the similarity threshold storage unit 20, this process narrows down the comparison target texts while calculating the number of common appearance words, which is the data needed for the similarity calculation. Because the comparison target texts are narrowed down further, fewer text IDs are stored in the common appearance word number storage unit 19 in this embodiment than in the first embodiment. The third common appearance word number calculation process is described in detail later.

そして、類似テキスト選択処理部２１は、検索入力テキスト格納部１８と類似度閾値格納部２０とテキストサイズＤＢ１５と共通出現単語数格納部１９とに格納されているデータに基づき、共通出現単語数格納部１９に格納されているテキストＩＤ毎に類似度を算出して、テキストＩＤと共にテキストＩＤ格納部２２に格納する類似テキスト選択処理を実施する（ステップＳ１２７）。類似テキスト選択処理については、図１５で述べたものと同一である。従って、ここでは説明は省略する。 The similar text selection processing unit 21 stores the number of common appearance words based on data stored in the search input text storage unit 18, the similarity threshold storage unit 20, the text size DB 15, and the common appearance word number storage unit 19. A similarity is calculated for each text ID stored in the unit 19, and a similar text selection process is performed in which the similarity is stored in the text ID storage unit 22 together with the text ID (step S127). The similar text selection processing is the same as that described in FIG. Therefore, the description is omitted here.

Next, the third common appearance word number calculation process will be described with reference to FIGS. 27 to 29. The common appearance word number calculation unit 17 reads the input text that serves as the search key from the search input text storage unit 18 (step S131) and initializes the common appearance word number storage unit 19 (step S133). It then extracts independent words (verbs and nouns) from the input text by a well-known method and stores them, together with the word count, in, for example, the search input text storage unit 18 (step S135). It then performs a size range determination process for the comparison target text (step S137). This process narrows the range of comparison target texts further than in the first embodiment and is described with reference to FIG. 28.

比較対象テキストのサイズ範囲決定処理を図２８を用いて説明する。まず、共通出現単語数算出部１７は、単語カウンタＺを０に初期化する（ステップＳ１５１）。また、入力テキストの未処理の単語を１つ特定する（ステップＳ１５３）。そして、特定された単語で、インデックスＤＢ１４を検索して、インデックスＤＢ１４内に、特定された単語が登録されているか判断する（ステップＳ１５５）。登録されていれば、単語カウンタＺを１インクリメントする（ステップＳ１５７）。そしてステップＳ１５９に移行する。一方、登録されていなければステップＳ１５９に移行する。 Processing for determining the size range of the comparison target text will be described with reference to FIG. First, the common appearance word number calculation unit 17 initializes the word counter Z to 0 (step S151). Also, one unprocessed word in the input text is specified (step S153). Then, the index DB 14 is searched with the specified word, and it is determined whether or not the specified word is registered in the index DB 14 (step S155). If registered, the word counter Z is incremented by 1 (step S157). Then, control goes to a step S159. On the other hand, if not registered, the process proceeds to step S159.

ステップＳ１５９では、共通出現単語数算出部１７は、入力テキストから抽出された単語に未処理の単語が存在しているか判断する（ステップＳ１５９）。未処理の単語が存在している場合にはステップＳ１５３に戻る。一方、全ての単語について処理した場合には、サイズ範囲の下限値を、（１）式に従って、入力テキストの単語数を用いて算出する（ステップＳ１６１）。サイズ範囲の下限値については、変更はない。 In step S159, the common appearance word number calculation unit 17 determines whether an unprocessed word exists in the words extracted from the input text (step S159). If an unprocessed word exists, the process returns to step S153. On the other hand, when all the words are processed, the lower limit value of the size range is calculated using the number of words of the input text according to the equation (1) (step S161). There is no change in the lower limit of the size range.

Meanwhile, the common appearance word number calculation unit 17 calculates the upper limit of the size range from the value of the word counter Z, according to expression (2) (step S163).

そして、共通出現単語数算出部１７は、ステップＳ１６１で算出した下限値と、ステップＳ１６３で算出した上限値とから整数の解（すなわちサイズ範囲）が得られるか判断する（ステップＳ１６５）。単語カウンタＺの値が小さい場合には、上限値と下限値が逆転する場合もある。また、例えば上限値が２．８で下限値が２．５というような範囲が算出されても、使用可能なサイズ範囲は整数にならない。もし、整数のサイズ範囲が得られないような場合には、これ以上処理を実施しても条件を満たすような文書は得られない。従って、共通出現単語数算出部１７は、出力部２３に解無し通知を行い、出力部２３は、検索の解無し（例えば「条件に合致するような文書は存在しませんでした。」というようなメッセージ）を表示装置や印刷装置などの出力装置に出力して（ステップＳ１６７）、処理を終了する。一方、上で述べた下限値と上限値で整数の解が得られる場合には、元の処理に戻る。 Then, the common appearance word number calculation unit 17 determines whether an integer solution (that is, a size range) can be obtained from the lower limit value calculated in step S161 and the upper limit value calculated in step S163 (step S165). When the value of the word counter Z is small, the upper limit value and the lower limit value may be reversed. For example, even if a range in which the upper limit value is 2.8 and the lower limit value is 2.5 is calculated, the usable size range is not an integer. If an integer size range cannot be obtained, a document that satisfies the conditions cannot be obtained even if processing is performed further. Accordingly, the common appearance word number calculation unit 17 notifies the output unit 23 that there is no solution, and the output unit 23 indicates that there is no search solution (for example, “There is no document that matches the condition”). Message) to an output device such as a display device or a printing device (step S167), and the process is terminated. On the other hand, when an integer solution is obtained with the above-described lower limit value and upper limit value, the processing returns to the original processing.

ステップＳ１６３で（２）式を使用できるのは、入力テキストの単語のうち、インデックスＤＢ１４に登録されている単語の数が内積（Ａ・Ｂ）の上限値となるという条件を利用すると、入力テキストの単語数ではなく、入力テキストに含まれる単語のうち実際にインデックスＤＢ１４に登録されている単語の数によって上限値を決定できるためである。詳細については、後に述べる。 The expression (2) can be used in step S163 when the condition that the number of words registered in the index DB 14 among the words of the input text is the upper limit value of the inner product (A · B) is used. This is because the upper limit value can be determined not by the number of words but by the number of words actually registered in the index DB 14 among the words included in the input text. Details will be described later.
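Expression (2) itself does not survive in this text. The following is a reconstruction inferred from the worked example for FIG. 29 (upper limit = Z² / (|A| * s²)) and the cosine similarity definition used earlier, so it should be read as a sketch rather than the patent's exact wording. Here |A| and |B| are the word counts of the input text and the comparison target text, and Z is the number of input-text words registered in the index DB 14.

```latex
\begin{align*}
\cos(A,B) = \frac{A \cdot B}{\sqrt{|A|\,|B|}} \ge s
  &\;\Longrightarrow\; A \cdot B \ge s\sqrt{|A|\,|B|} \\
A \cdot B \le |B|
  &\;\Longrightarrow\; |B| \ge s^{2}\,|A|
  && \text{(lower limit, unchanged)} \\
A \cdot B \le Z
  &\;\Longrightarrow\; |B| \le \frac{Z^{2}}{s^{2}\,|A|}
  && \text{(upper limit, expression (2))}
\end{align*}
```

Setting Z = |A| recovers the upper limit |A| / s² of expression (1), so the new bound is never looser than the one used in the first embodiment.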

Since the value of the word counter Z is usually smaller than the number of words in the input text, the upper limit is lower than in the first embodiment. The size range of the comparison target text is therefore narrowed further, which speeds up processing.

A concrete example will be described using FIG. 29. For example, when the input text “Taro, Jiro, and Goro cry.” is obtained as the search key (FIG. 29(a)), it is divided in step S135 into four independent words: “Taro”, “Jiro”, “Goro”, and “cry”. When the index DB 14 (FIG. 8) is searched for these words according to FIG. 28, it turns out that “Goro” is not registered and that the three words “Taro”, “Jiro”, and “cry” are registered. The value of the word counter Z therefore becomes 3 (FIG. 29(c)).

Assume that the similarity threshold s is set to 0.85. Then the lower limit of the size range of the comparison target text is 2.89 (= s² * |A| = 0.85² * 4) and, from expression (2), the upper limit is 3.11 (= Z² / (|A| * s²) = 3² / (4 * 0.85²)). In this example, the only integer in the size range is 3 (FIGS. 29(d) and (e)). If the upper limit were instead calculated from the input text's word count of 4, it would be 5.53, so the integer upper limit would be 5 and the size range would be 3, 4, and 5. The range has thus been narrowed.
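The tightened range in this example can be reproduced with a short sketch. The function name, the dictionary-based index, and the romanized words are assumptions for illustration, and the input words are assumed to be distinct.

```python
import math

def tightened_size_range(query_words, index, s):
    """Integer sizes n with s^2 * |A| <= n <= Z^2 / (|A| * s^2).

    Z is the number of query words that are registered in the index.
    """
    n_query = len(query_words)
    z = sum(1 for word in query_words if word in index)  # word counter Z
    lower = s * s * n_query
    upper = z * z / (n_query * s * s)
    # An empty list means no integer size fits, i.e. no matching text can exist.
    return list(range(math.ceil(lower), math.floor(upper) + 1))

index = {"taro": [2, 1], "hanako": [2, 1], "jiro": [2, 1], "saburo": [1], "cry": [2, 1]}
print(tightened_size_range(["taro", "jiro", "goro", "cry"], index, 0.85))
```

When every query word is registered, Z equals the query word count and the result is the same range that expression (1) gives.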

Thereafter, the common appearance word number calculation unit 17 identifies an unprocessed word among the words extracted from the input text (step S139) and determines whether the identified word is registered in the index DB 14 (step S141). If it is not registered, the unit determines whether any unprocessed word remains (step S143); if one remains, the processing returns to step S139. If no unprocessed word remains, this processing ends via terminal G and control returns to the original processing.

On the other hand, when the identified word is registered in the index DB 14, the common appearance word number calculation unit 17 identifies one unprocessed ID among the IDs associated with the identified word in the index DB 14, starting from the ID with the smallest number of words (step S145). The processing then proceeds to the processing of FIG. 13 via terminal F.

FIG. 13 has already been described and the processing is the same, so its description is omitted here.

As described above, depending on which words of the input text are registered in the index DB 14, the size range of the comparison target texts is narrowed further where possible, and processing is faster.

[Embodiment 4]
In the third embodiment, the index DB 14 is searched in advance in step S137 with the words extracted from the input text, which increases the number of searches of the index DB 14. Processing such as that shown in FIG. 30 may therefore be adopted instead.

First, the common appearance word number calculation unit 17 reads the input text serving as the search key from the search input text storage unit 18 (step S171) and initializes the common appearance word number storage unit 19 (step S173). It then extracts the independent words (verbs and nouns) from the input text by a well-known method and stores them, together with the number of words, in, for example, the search input text storage unit 18 (step S175). Next, it determines the size range of the comparison target texts from the number of words in the input text and the similarity threshold s stored in the similarity threshold storage unit 20, and stores the range in a storage device such as a main memory (step S177). This processing, including the handling of exceptions, is the same as step S47.

The common appearance word number calculation unit 17 also initially sets the word counter Z to the number of words in the input text (step S179). It then identifies an unprocessed word among the words extracted from the input text (step S181) and determines whether the identified word is registered in the index DB 14 (step S183). If it is not registered, the unit determines whether any unprocessed word remains (step S185). If one remains, it sets the word counter Z = Z − 1 and, as in step S163, recalculates the upper limit of the size range from the word counter Z according to expression (2) (step S187).

The common appearance word number calculation unit 17 then determines whether an integer solution (that is, a size range) is obtained from the lower limit calculated in step S177 and the upper limit calculated in step S187 (step S191). When the value of the word counter Z is small, the upper limit may fall below the lower limit. Moreover, even when a range such as a lower limit of 2.5 and an upper limit of 2.8 is calculated, it contains no integer and therefore no usable size. If no integer size range is obtained, no document satisfying the condition can be found however much further processing is performed. In that case, the common appearance word number calculation unit 17 notifies the output unit 23 that there is no solution, the output unit 23 outputs a no-result message (for example, "No document matching the condition was found.") to an output device such as a display device or a printing device (step S193), and the processing ends. If, on the other hand, an integer solution is obtained from the lower and upper limits, the processing returns to step S181. In this way, the number of searches of the index DB 14 can be reduced, and the size range can be changed dynamically. If no unprocessed word remains, this processing ends via terminal G and control returns to the original processing.

On the other hand, when the identified word is registered in the index DB 14, the common appearance word number calculation unit 17 identifies one unprocessed ID among the IDs associated with the identified word in the index DB 14, starting from the ID with the smallest number of words (step S189). The processing then proceeds to the processing of FIG. 13 via terminal F.
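The dynamic narrowing of steps S179 to S193 can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `narrow_size_range` is hypothetical, the index DB is modelled as a plain set of registered words, Z is decremented for every unregistered word, and the candidate collection of FIG. 13 is omitted.

```python
import math


def narrow_size_range(input_words, indexed_words, s):
    """Shrink the upper limit each time an input word is missing from the index.

    Returns the list of admissible integer sizes, or None as soon as no
    integer lies between the lower and upper limits (the "no solution" case).
    """
    n = len(input_words)
    lower = math.ceil(s * s * n)          # lower limit, fixed once (step S177)
    z = n                                 # word counter Z starts at |A| (step S179)
    upper = math.floor(z * z / (n * s * s))
    for word in input_words:
        if word in indexed_words:
            continue                      # registered: candidates would be gathered here
        z -= 1                            # one fewer word can contribute to A.B
        upper = math.floor(z * z / (n * s * s))   # recalculation (step S187)
        if lower > upper:                 # no integer size remains (step S191)
            return None
    return list(range(lower, upper + 1))


if __name__ == "__main__":
    indexed = {"Taro", "Hanako", "Jiro", "Saburo", "cry"}
    print(narrow_size_range(["Taro", "Jiro", "Goro", "cry"], indexed, 0.85))
```

Stopping as soon as the range becomes empty is what lets this variant avoid looking up the remaining words.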

[Other embodiments]
For example, even when size-specific index DBs are adopted as in the second embodiment, the size range may be changed according to whether the words extracted from the input text are registered in the index DB, as in the third embodiment. The fourth embodiment may likewise be applied to the second embodiment.

[Detailed explanation of expression (1)]
This section explains how expression (1), shown in the description of step S47, is obtained. Here |A| denotes the number of words in the input text, and |B_i| denotes the number of words in a comparison target text.

(Condition 1) When |B_i| ≤ |A|, the upper limit of A·B_i, the number of words common to A and B_i, is B_i·B_i, so the following expression is obtained.

By further transforming expression (3) obtained in this way, the following expression is obtained for calculating the lower limit of the size range.

Note that B_i·B_i is replaced with |B_i| because B_i·B_i is the number of words common to B_i and B_i, that is, the number of words |B_i| contained in B_i.
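Expression (3) and the lower-limit expression derived from it are not reproduced in this text. Assuming the binary-vector cosine similarity A·B_i / (√|A| √|B_i|) used in the embodiments, a reconstruction consistent with condition 1 and with the later squaring step would be:

```latex
% Expression (3): under condition 1, A.B_i is bounded above by B_i.B_i
\[
  s \;\le\; \frac{A \cdot B_i}{\sqrt{|A|}\,\sqrt{|B_i|}}
    \;\le\; \frac{B_i \cdot B_i}{\sqrt{|A|}\,\sqrt{|B_i|}}
  \tag{3}
\]
% Substituting B_i.B_i = |B_i| and rearranging gives the lower limit
\[
  s \;\le\; \frac{|B_i|}{\sqrt{|A|}\,\sqrt{|B_i|}} = \frac{\sqrt{|B_i|}}{\sqrt{|A|}}
  \quad\Longrightarrow\quad
  s\,\sqrt{|A|} \;\le\; \sqrt{|B_i|}
\]
```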

(Condition 2) When |A| ≤ |B_i|, the upper limit of A·B_i, the number of words common to A and B_i, is A·A, so the following expression holds.

By further transforming expression (4) obtained in this way, the following expression is obtained for calculating the upper limit of the size range.
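Expression (4) and the upper-limit expression are likewise missing from this text. Under the same assumption about the cosine similarity, and using A·A = |A|, one consistent reconstruction is:

```latex
% Expression (4): under condition 2, A.B_i is bounded above by A.A
\[
  s \;\le\; \frac{A \cdot B_i}{\sqrt{|A|}\,\sqrt{|B_i|}}
    \;\le\; \frac{A \cdot A}{\sqrt{|A|}\,\sqrt{|B_i|}}
  \tag{4}
\]
% Substituting A.A = |A| and rearranging gives the upper limit
\[
  s \;\le\; \frac{|A|}{\sqrt{|A|}\,\sqrt{|B_i|}} = \frac{\sqrt{|A|}}{\sqrt{|B_i|}}
  \quad\Longrightarrow\quad
  \sqrt{|B_i|} \;\le\; \frac{\sqrt{|A|}}{s}
\]
```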

From conditions 1 and 2 above, when data satisfying the similarity threshold s is extracted from the existing text set B = {B_i} (1 ≤ i ≤ N) with the input text A as the condition, only the texts B_i that satisfy the following expression need to be compared.

Squaring each side of expression (5) gives expression (1).
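Expressions (5) and (1) are not reproduced here. Combining the lower limit from condition 1 with the upper limit from condition 2, a reconstruction that agrees with the worked example (4 / 0.85² ≈ 5.53) would be:

```latex
% Expression (5): combined bounds on the square root of the text size
\[
  s\,\sqrt{|A|} \;\le\; \sqrt{|B_i|} \;\le\; \frac{\sqrt{|A|}}{s}
  \tag{5}
\]
% Expression (1): expression (5) with every side squared
\[
  s^{2}\,|A| \;\le\; |B_i| \;\le\; \frac{|A|}{s^{2}}
  \tag{1}
\]
```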

[Other examples of similarity]
In the embodiments described above, the similarity is calculated as the cosine similarity of binary vectors. Other calculation methods can also be adopted. For example, instead of using binary vectors, the cosine similarity may be calculated taking into account the number of times each word appears. The case where the number of appearances is taken into account is described below, starting from the preconditions.

1. Preconditions
For example, when "Taro" appears once, "Hanako" twice, and "cry" once in a text A, this is written as A = {Taro: 1, Hanako: 2, cry: 1}, where the number after ":" is the number of appearances.

In this case, the size |A| of A is calculated from the words contained in A and their numbers of appearances. Specifically, it is the sum of the squares of the number of appearances of each word. For the text A above, |A| = 1² + 2² + 1² = 6.

If B_i = {Taro: 1, Hanako: 3, Jiro: 1, Saburo: 1, cry: 1}, the inner product A·B_i is calculated as follows. Since the words contained in both texts are {Taro, Hanako, cry}, the inner product is the sum of the products of their respective numbers of appearances.
A·B_i = (number of appearances of "Taro" in A) × (number of appearances of "Taro" in B_i) + (number of appearances of "Hanako" in A) × (number of appearances of "Hanako" in B_i) + (number of appearances of "cry" in A) × (number of appearances of "cry" in B_i) = (1 × 1) + (2 × 3) + (1 × 1) = 8

Accordingly, the cosine similarity when the number of appearances is taken into account is calculated as follows.
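The formula itself is not reproduced in this text. Assuming it has the usual cosine form A·B_i / (√|A| √|B_i|), with |A| and |B_i| defined as the sums of squared appearance counts given above, the quantities in this example can be computed with the sketch below. The function names `size`, `inner`, and `cosine` are illustrative.

```python
import math


def size(text):
    """Size |A|: sum of the squared appearance counts of the words in the text."""
    return sum(count * count for count in text.values())


def inner(a, b):
    """Inner product A.B: sum over shared words of the product of their counts."""
    return sum(count * b[word] for word, count in a.items() if word in b)


def cosine(a, b):
    """Count-weighted cosine similarity (assumed form, see the lead-in)."""
    return inner(a, b) / math.sqrt(size(a) * size(b))


if __name__ == "__main__":
    A = {"Taro": 1, "Hanako": 2, "cry": 1}
    B_i = {"Taro": 1, "Hanako": 3, "Jiro": 1, "Saburo": 1, "cry": 1}
    print(size(A), inner(A, B_i), size(B_i), cosine(A, B_i))
```

For the texts in the example, `size(A)` is 6 and `inner(A, B_i)` is 8, as stated above; `size(B_i)` works out to 13.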

Next, a method for calculating the size range (here, the upper limit) is described. First, let W be the set of all words appearing in the texts of B = {B_i}. That is, each text belonging to B contains words in W.

Specifically, when B = {B₁, B₂}, B₁ = {Taro: 1, Hanako: 3, Jiro: 1, Saburo: 1, cry: 1}, and B₂ = {Taro: 1, Hanako: 2, Jiro: 1, cry: 1}, then W = {Taro, Hanako, Jiro, Saburo, cry}.

The maximum number of appearances in B of a word w in W is written MAX(w). In this example, MAX(Taro) = 1, MAX(Hanako) = 3, MAX(Jiro) = 1, MAX(Saburo) = 1, and MAX(cry) = 1.

Furthermore, let Wmax be the set consisting of each word in W together with its maximum value. In the example above, Wmax = {Taro: 1, Hanako: 3, Jiro: 1, Saburo: 1, cry: 1}.

Under these preconditions, the upper limit of the inner product of the input text A and any text in B is the inner product A·Wmax of A and Wmax.
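A minimal sketch of how Wmax and the bound A·Wmax could be computed for the example texts is shown below. The names `max_counts` and `inner` are illustrative and do not come from the specification.

```python
def max_counts(texts):
    """Build Wmax: for every word in any text, its largest appearance count."""
    wmax = {}
    for text in texts:
        for word, count in text.items():
            wmax[word] = max(wmax.get(word, 0), count)
    return wmax


def inner(a, b):
    """Inner product: sum over shared words of the product of their counts."""
    return sum(count * b[word] for word, count in a.items() if word in b)


if __name__ == "__main__":
    B1 = {"Taro": 1, "Hanako": 3, "Jiro": 1, "Saburo": 1, "cry": 1}
    B2 = {"Taro": 1, "Hanako": 2, "Jiro": 1, "cry": 1}
    A = {"Taro": 1, "Hanako": 2, "cry": 1}
    wmax = max_counts([B1, B2])
    # No text in B can have a larger inner product with A than Wmax does.
    print(wmax, inner(A, wmax), [inner(A, b) for b in (B1, B2)])
```

Because every count in a text of B is at most the corresponding count in Wmax, and all counts are non-negative, each A·B_i is at most A·Wmax.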

Therefore, when searching the texts in B for a text whose similarity to the input text A is equal to or greater than the similarity threshold s, the following condition holds.

Rearranging gives the following relationship.

Squaring both sides gives the condition on |B_i|.
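The three expressions referred to above are not reproduced in this text. Assuming the count-weighted cosine similarity has the form A·B_i / (√|A| √|B_i|) and using A·B_i ≤ A·Wmax, a reconstruction consistent with the surrounding steps would be:

```latex
% Condition: the similarity is bounded using A.W_max in place of A.B_i
\[
  s \;\le\; \frac{A \cdot B_i}{\sqrt{|A|}\,\sqrt{|B_i|}}
    \;\le\; \frac{A \cdot W_{\max}}{\sqrt{|A|}\,\sqrt{|B_i|}}
\]
% Rearranged for the square root of the text size
\[
  \sqrt{|B_i|} \;\le\; \frac{A \cdot W_{\max}}{s\,\sqrt{|A|}}
\]
% Squared: upper limit on |B_i|
\[
  |B_i| \;\le\; \frac{\left(A \cdot W_{\max}\right)^{2}}{s^{2}\,|A|}
\]
```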

[Why Z may be used]
The word counter Z is the maximum value that the inner product of the input text A and an indexed text B_i can take.

Therefore, using Z in place of A·A in expression (4), the condition can be defined as follows.

This is because A·B_i ≤ Z ≤ |A|, so the inequality still holds when A·A in the numerator of expression (4) is replaced with Z. Expression (7), the upper limit using Z, is then derived from expression (6). Finally, squaring both sides of expression (7) gives the part of expression (2) that calculates the upper limit.
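Expressions (6) and (7) are not reproduced in this text. Substituting Z for A·A in the reconstructed form of expression (4), the following would be consistent with the upper limit Z² / (|A| × s²) used in the worked example of FIG. 29:

```latex
% Expression (6): expression (4) with the numerator A.A replaced by Z
\[
  s \;\le\; \frac{A \cdot B_i}{\sqrt{|A|}\,\sqrt{|B_i|}}
    \;\le\; \frac{Z}{\sqrt{|A|}\,\sqrt{|B_i|}}
  \tag{6}
\]
% Expression (7): rearranged for the square root of the text size
\[
  \sqrt{|B_i|} \;\le\; \frac{Z}{s\,\sqrt{|A|}}
  \tag{7}
\]
% Upper-limit part of expression (2): expression (7) squared
\[
  |B_i| \;\le\; \frac{Z^{2}}{s^{2}\,|A|}
\]
```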

The embodiments of the present technology have been described above, but the present technology is not limited to them. For example, the functional block diagram shown in FIG. 4 is only an example and does not necessarily match the actual program module configuration. As for the processing flows, the order of steps may be changed, or steps may be executed in parallel, as long as the processing result does not change. In addition, although the examples above extract independent words, the processing described above may also be performed on independent words together with attached words.

The search device described above is a computer device in which, as shown in FIG. 31, a memory 2501, a CPU 2503, a hard disk drive (HDD) 2505, a display control unit 2507 connected to a display device 2509, a drive device 2513 for a removable disk 2511, an input device 2515, and a communication control unit 2517 for connecting to a network are connected by a bus 2519. The operating system (OS) and the application program for performing the processing in this embodiment are stored in the HDD 2505 and are read from the HDD 2505 into the memory 2501 when executed by the CPU 2503. As necessary, the CPU 2503 controls the display control unit 2507, the communication control unit 2517, and the drive device 2513 to perform the required operations. Data in the middle of processing is stored in the memory 2501 and, if necessary, in the HDD 2505. In the embodiment of the present technology, the application program for performing the processing described above is stored on and distributed with the computer-readable removable disk 2511 and is installed from the drive device 2513 onto the HDD 2505. It may also be installed on the HDD 2505 via a network such as the Internet and the communication control unit 2517. In such a computer device, hardware such as the CPU 2503 and the memory 2501 cooperates organically with the OS and the necessary application programs to realize the various functions described above.

The embodiments described above can be summarized as follows.

Narrowing down the existing texts for which the similarity is calculated in this way improves the search speed.

The range specifying step described above may include a range calculation step of calculating the range of the number of independent words in existing texts for which the similarity is equal to or greater than the similarity threshold, by inputting the number of extracted independent words into a formula that is defined in advance on the basis of the similarity calculation formula for the input text and an existing text and that takes the number of independent words in the input text as a variable. Preparing such a formula in advance makes it possible to calculate, for example, the upper and lower limits of the range, or only the upper limit, from the number of independent words extracted from the input text.

Furthermore, in the range calculation step described above, the number of only those independent words in the input text that appear in some existing text may additionally be used as the input for calculating the upper limit of the range when the range of the number of independent words in the existing texts is calculated. In this way, when the input text contains independent words that do not appear in any existing text, the range is limited further and the search speed can be improved further.

The similarity calculation step described above may include a step of searching, with the extracted independent words, an index storage unit in the storage device in which, for each independent word appearing in the existing texts, the identifiers of the existing texts containing that independent word are listed in order of the number of independent words in those texts, and extracting in order, for each matching independent word, the identifiers of the existing texts whose number of independent words is within the range. Preparing such an index storage unit makes it possible to extract the identifiers of the existing texts within the range at high speed.
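One way such a size-ordered list could be queried is sketched below. It assumes, purely for illustration, that the entry for one word is a Python list of `(size, text_id)` tuples sorted by size and that sizes are integers; the function name `ids_in_size_range` is hypothetical. Binary search from the standard `bisect` module locates the slice of identifiers whose sizes fall within the range.

```python
from bisect import bisect_left


def ids_in_size_range(postings, lowest, highest):
    """Return the text IDs whose size lies in [lowest, highest].

    postings must be a list of (size, text_id) tuples sorted by size.
    A one-element tuple (n,) sorts before every (n, text_id), so
    bisect_left(postings, (n,)) is the index of the first entry with
    size >= n.
    """
    start = bisect_left(postings, (lowest,))
    end = bisect_left(postings, (highest + 1,))  # first entry with size > highest
    return [text_id for _, text_id in postings[start:end]]


if __name__ == "__main__":
    # Entry for one word: text 2 has 4 words, text 1 has 5 words.
    postings = [(4, 2), (5, 1)]
    print(ids_in_size_range(postings, 3, 4))
```

Because the list is ordered by size, the scan can start and stop at the range boundaries instead of visiting every identifier registered for the word.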

Alternatively, the similarity calculation step described above may include a step of selecting, from among index storage units in the storage device that are provided for each number of independent words in the existing texts and in each of which, for each independent word appearing in the existing texts, the identifiers of the existing texts containing that independent word are listed, the index storage units for the numbers of independent words included in the range, and a step of searching the selected index storage units with the extracted independent words and extracting the identifiers of the existing texts containing a matching independent word. Providing an index storage unit for each number of independent words in this way makes it possible to extract the identifiers of the existing texts within the range at high speed.

The similarity calculation step described above may further include a step of counting, for each extracted identifier of an existing text, the number of matching independent words, and a step of reading, from a text size storage unit in which the number of independent words in each existing text is stored in association with the identifier of that text, the number of independent words associated with the extracted identifier, and calculating the similarity between the existing text and the input text from the number of independent words in the existing text, the number of independent words extracted from the input text, and the number of matching independent words. When, for example, a cosine value is adopted as the similarity, such processing allows the similarity to be calculated even faster.
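Putting these pieces together, the overall search could be sketched as follows. This is an illustrative simplification, not the processing flow of the figures: the index and the text size table are modelled as plain dictionaries, the query words are assumed to be distinct, Z is computed up front as in the third embodiment, and the name `search` is hypothetical. The similarity is the binary cosine value, the number of matching words divided by √(|A| × |B_i|).

```python
import math
from collections import Counter


def search(query_words, index, sizes, s):
    """Return {text_id: similarity} for texts with similarity >= s.

    index maps a word to the IDs of the texts containing it;
    sizes maps a text ID to its number of independent words.
    """
    n = len(query_words)
    z = sum(1 for word in query_words if word in index)   # registered query words
    lowest = math.ceil(s * s * n)                         # lower limit of the size range
    highest = math.floor(z * z / (n * s * s))             # upper limit using Z
    common = Counter()
    for word in query_words:
        for text_id in index.get(word, ()):
            if lowest <= sizes[text_id] <= highest:       # skip texts outside the range
                common[text_id] += 1                      # one more matching word
    hits = {}
    for text_id, matches in common.items():
        similarity = matches / math.sqrt(n * sizes[text_id])
        if similarity >= s:
            hits[text_id] = similarity
    return hits


if __name__ == "__main__":
    index = {"Taro": [2, 1], "Hanako": [2, 1], "Jiro": [2, 1],
             "Saburo": [1], "cry": [2, 1]}
    sizes = {1: 5, 2: 4}
    print(search(["Taro", "Jiro", "Goro", "cry"], index, sizes, 0.7))
```

With the two indexed texts from the background example and a threshold of 0.7, only text 2 (size 4, three matching words) falls inside the size range and passes the threshold; text 1 is never scored.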

Furthermore, the range calculation step described above may include a step of searching the index storage unit with the independent words extracted from the input text and specifying the number of matching independent words, and a step of calculating the range of the number of independent words in existing texts for which the similarity is equal to or greater than the similarity threshold, by inputting the specified number of independent words into a formula that is defined in advance on the basis of the similarity calculation formula for the input text and an existing text and that takes the number of independent words in the input text as a variable. When the input text contains independent words that are not contained in any existing text, the range of the number of independent words can thus be limited further. The registration status may therefore be checked in advance, as described above.

Furthermore, the similarity calculation step described above may further include a step of resetting, when searching the index storage unit with the independent words extracted from the input text detects that no matching independent word is registered, the range of the number of independent words in existing texts for which the similarity to the input text is equal to or greater than the similarity threshold stored in the storage device, with the independent words extracted from the input text minus the unregistered independent words as the condition. In this way, the range of the number of independent words may be changed dynamically whenever such a search detects that no matching independent word is registered.

A program for causing hardware to perform the processing described above can be created, and the program is stored in a computer-readable storage medium or storage device such as a flexible disk, a CD-ROM, a magneto-optical disk, a semiconductor memory, or a hard disk. Data in the middle of processing is temporarily stored in a storage device such as the memory of a computer.

The following appendices are further disclosed with respect to the embodiments including the examples above.

(Appendix 1)
A search method executed by a computer, the method including:
a step of extracting independent words from an input text stored in a storage device;
a range specifying step of specifying a range of the number of independent words in existing texts for which, with the extracted independent words as the condition, the similarity to the input text is equal to or greater than a similarity threshold stored in the storage device;
a similarity calculation step of calculating, only for existing texts whose number of independent words is within the range, the similarity between the existing text and the input text by using the independent words in the existing text, which are stored in the storage device, and the extracted independent words, and storing the similarity in the storage device; and
a step of specifying the existing texts whose similarity stored in the storage device is equal to or greater than the similarity threshold.

(Appendix 2)
The search method according to appendix 1, wherein the range specifying step includes
a range calculation step of calculating the range of the number of independent words in the existing texts for which the similarity is equal to or greater than the similarity threshold, by inputting the number of the extracted independent words into a formula that is defined in advance on the basis of a similarity calculation formula for calculating the similarity between the input text and the existing text and that calculates, with the number of independent words in the input text as a variable, the range of the number of independent words in the existing texts for which the similarity is equal to or greater than the similarity threshold.

(Appendix 3)
The search method according to appendix 2, wherein, in the range calculation step,
the number of only those independent words in the input text that appear in any of the existing texts is further used as an input for calculating the upper limit of the range of the number of independent words, and the range of the number of independent words in the existing texts is calculated.

(Appendix 4)
The search method according to appendix 1, wherein the similarity calculation step includes
a step of searching, with the extracted independent words, an index storage unit in the storage device in which, for each independent word appearing in the existing texts, identifiers of the existing texts containing that independent word are listed in order of the number of independent words in those existing texts, and extracting in order, for each matching independent word, the identifiers of the existing texts whose number of independent words is within the range of the number of independent words.

(Appendix 5)
The search method according to appendix 1, wherein the similarity calculation step includes:
a step of selecting, from among index storage units in the storage device that are provided for each number of independent words in the existing texts and in each of which, for each independent word appearing in the existing texts, identifiers of the existing texts containing that independent word are listed, the index storage units for the numbers of independent words included in the range of the number of independent words; and
a step of searching the selected index storage units with the extracted independent words and extracting the identifiers of the existing texts containing a matching independent word.

(Appendix 6)
The search method according to appendix 4 or 5, wherein the similarity calculation step further includes:
a step of counting, for each extracted identifier of an existing text, the number of matching independent words; and
a step of reading, from a text size storage unit in which the number of independent words in each existing text is stored in association with the identifier of that existing text, the number of independent words in the existing text associated with the extracted identifier, and calculating the similarity between the existing text and the input text from the number of independent words in the existing text, the number of independent words extracted from the input text, and the number of matching independent words.

（付記７）
前記範囲算出ステップが、
前記インデックス格納部を、前記入力テキストから抽出された前記自立語で検索して一致する自立語の語数を特定するステップと、
前記入力テキストと既存テキストとの類似度を算出するための類似度算出式に基づき予め規定され且つ前記入力テキスト中の自立語の語数を変数として類似度が類似度閾値以上となる前記既存テキスト中の自立語語数の範囲を算出する算式を用いて、特定された前記自立語の語数を入力として前記類似度閾値以上となる前記既存テキスト中の自立語語数の範囲を算出するステップと、
を含む付記４乃至６のいずれか１つ記載の検索方法。
(Appendix 7)
The search method according to any one of appendices 4 to 6, wherein the range calculation step includes:
searching the index storage unit with the independent words extracted from the input text to identify the number of matching independent words; and
calculating, with the identified number of independent words as input, the range of the number of independent words in the existing text for which the similarity is equal to or greater than the similarity threshold, using a formula that is defined in advance based on the similarity calculation formula for the input text and the existing text and that gives, with the number of independent words in the input text as a variable, the range of the number of independent words in the existing text for which the similarity is equal to or greater than the similarity threshold.
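The range formula in appendix 7 is derived from whichever similarity formula is in use. As a sketch only, assume a Dice-style similarity 2c / (n + m), where n is the number of independent words in the input, m the number in an existing text, and c the number of matches. Because c cannot exceed min(n, m), requiring the similarity to reach threshold t gives t·n / (2 − t) ≤ m ≤ (2 − t)·n / t. The formula, the function name `word_count_range`, and the use of `Fraction` to avoid rounding at the boundaries are all assumptions of this example.

```python
import math
from fractions import Fraction


def word_count_range(input_count: int, threshold: Fraction) -> tuple[int, int]:
    """Return (lower, upper) bounds on the existing-text word count m.

    Assumes similarity = 2c / (n + m) with c <= min(n, m); this formula
    is an illustrative choice, not one stated in the text.
    """
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    n = input_count
    lower = math.ceil(threshold * n / (2 - threshold))   # from c <= m
    upper = math.floor((2 - threshold) * n / threshold)  # from c <= n
    return lower, upper


# Input with 4 independent words and a threshold of 0.8.
print(word_count_range(4, Fraction(4, 5)))  # (3, 6)
```

Existing texts with fewer than 3 or more than 6 independent words cannot reach the threshold under this assumed formula, so they can be skipped without computing a similarity.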

（付記８）
前記類似度算出ステップが、
前記インデックス格納部を、前記入力テキストから抽出された前記自立語で検索して一致する自立語が登録されていないことを検出した場合に、前記入力テキストから抽出された前記自立語から、登録されていない自立語を除いたものを条件として前記入力テキストとの類似度が、前記記憶装置に格納されている類似度閾値以上となる、前記既存テキスト中の自立語語数の範囲を再設定するステップ
をさらに含む付記４又は５記載の検索方法。
(Appendix 8)
The search method according to appendix 4 or 5, wherein the similarity calculation step further includes:
when searching the index storage unit with the independent words extracted from the input text reveals that a matching independent word is not registered, resetting the range of the number of independent words in the existing text for which the similarity to the input text is equal to or greater than the similarity threshold stored in the storage device, using as the condition the independent words extracted from the input text excluding the unregistered independent words.

（付記９）
付記１乃至８のいずれか１つ記載の検索方法をコンピュータに実行させるためのプログラム。
(Appendix 9)
A program for causing a computer to execute the search method according to any one of appendices 1 to 8.

（付記１０）
記憶装置に格納されている入力テキストから自立語を抽出する手段と、
抽出された前記自立語を条件として前記入力テキストとの類似度が、前記記憶装置に格納されている類似度閾値以上となる、既存テキスト中の自立語語数の範囲を特定する範囲特定手段と、
自立語語数が前記自立語語数の範囲内である既存テキストに限定して、前記記憶装置に格納されている、当該既存テキスト中の自立語と抽出された前記自立語とを用いて当該既存テキストと前記入力テキストの類似度を算出し、前記記憶装置に格納する類似度算出手段と、
前記記憶装置に格納された前記類似度が前記類似度閾値以上となる前記既存テキストを特定する手段と、
を有する検索装置。
(Appendix 10)
A search device comprising:
means for extracting independent words from input text stored in a storage device;
range specifying means for specifying, with the extracted independent words as the condition, a range of the number of independent words in existing text for which the similarity to the input text is equal to or greater than a similarity threshold stored in the storage device;
similarity calculation means for calculating, only for existing text whose number of independent words is within the range, the similarity between that existing text and the input text using the independent words in the existing text stored in the storage device and the extracted independent words, and storing the similarity in the storage device; and
means for identifying the existing text whose similarity stored in the storage device is equal to or greater than the similarity threshold.

１１　入力部　　１２　インデックス対象テキスト格納部
１３　インデックス生成部　　１４　インデックスＤＢ
１５　テキストサイズＤＢ　　１６　インデックス変換部
１７　共通出現単語数算出部　　１８　検索入力テキスト格納部
１９　共通出現単語数格納部　　２０　類似度閾値格納部
２１　類似テキスト選択処理部　　２２　テキストＩＤ格納部
２３　出力部

DESCRIPTION OF SYMBOLS
11 Input unit
12 Index target text storage unit
13 Index generation unit
14 Index DB
15 Text size DB
16 Index conversion unit
17 Common appearance word count calculation unit
18 Search input text storage unit
19 Common appearance word count storage unit
20 Similarity threshold storage unit
21 Similar text selection processing unit
22 Text ID storage unit
23 Output unit

Claims

記憶装置に格納されている入力テキストから自立語を抽出するステップと、
抽出された前記自立語を条件として前記入力テキストとの類似度が、前記記憶装置に格納されている類似度閾値以上となる、既存テキスト中の自立語語数の範囲を特定する範囲特定ステップと、
自立語語数が前記自立語語数の範囲内である既存テキストに限定して、前記記憶装置に格納されている、当該既存テキスト中の自立語と抽出された前記自立語とを用いて当該既存テキストと前記入力テキストの類似度を算出し、前記記憶装置に格納する類似度算出ステップと、
前記記憶装置に格納された前記類似度が前記類似度閾値以上となる前記既存テキストを特定するステップと、
を含み、コンピュータに実行される検索方法。
A search method executed by a computer, the method comprising:
extracting independent words from input text stored in a storage device;
a range specifying step of specifying, with the extracted independent words as the condition, a range of the number of independent words in existing text for which the similarity to the input text is equal to or greater than a similarity threshold stored in the storage device;
a similarity calculation step of calculating, only for existing text whose number of independent words is within the range, the similarity between that existing text and the input text using the independent words in the existing text stored in the storage device and the extracted independent words, and storing the similarity in the storage device; and
identifying the existing text whose similarity stored in the storage device is equal to or greater than the similarity threshold.

前記範囲特定ステップが、
前記入力テキストと前記既存テキストとの類似度を算出するための類似度算出式に基づき予め規定され且つ前記入力テキスト中の自立語の語数を変数として類似度が類似度閾値以上となる前記既存テキスト中の自立語語数の範囲を算出する算式を用いて、抽出された前記自立語の語数を入力として前記類似度閾値以上となる前記既存テキスト中の自立語語数の範囲を算出する範囲算出ステップ
を含む請求項１記載の検索方法。
The search method according to claim 1, wherein the range specifying step includes:
a range calculation step of calculating, with the number of extracted independent words as input, the range of the number of independent words in the existing text for which the similarity is equal to or greater than the similarity threshold, using a formula that is defined in advance based on the similarity calculation formula for the input text and the existing text and that gives, with the number of independent words in the input text as a variable, the range of the number of independent words in the existing text for which the similarity is equal to or greater than the similarity threshold.

前記範囲算出ステップにおいて、
いずれかの前記既存テキストに出現する、前記入力テキスト内の自立語のみの語数を、前記自立語語数の範囲の上限値算出のための入力としてさらに用いて、前記既存テキスト中の自立語語数の範囲を算出する
請求項２記載の検索方法。
The search method according to claim 2, wherein, in the range calculation step, the range of the number of independent words in the existing text is calculated by further using, as an input for calculating the upper limit of the range, the number of only those independent words in the input text that appear in any of the existing texts.
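To illustrate why counting only the input words that occur in some existing text can tighten the upper limit, the sketch below again assumes a Dice-style similarity 2c / (n + m), which the claims do not specify. If only k of the n input words appear in any existing text, then c ≤ k, and 2k / (n + m) ≥ t gives m ≤ (2k − t·n) / t. The function name `upper_limit_with_known_words` and its parameters are hypothetical.

```python
import math
from fractions import Fraction


def upper_limit_with_known_words(input_count: int, known_count: int,
                                 threshold: Fraction) -> int:
    """Upper limit on the existing-text word count when at most `known_count`
    input words can match (assumed Dice-style similarity, for illustration)."""
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    limit = math.floor((2 * known_count - threshold * input_count) / threshold)
    return max(limit, 0)  # a limit of 0 means no existing text can qualify


# 4 input words, threshold 0.8: all 4 words known, then only 3 known.
print(upper_limit_with_known_words(4, 4, Fraction(4, 5)))  # 6
print(upper_limit_with_known_words(4, 3, Fraction(4, 5)))  # 3
```

In this example a single input word missing from every existing text lowers the upper limit from 6 to 3, which reduces the number of candidate texts that need a similarity calculation.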

前記類似度算出ステップが、
前記既存テキストに出現する自立語毎に当該自立語を含む前記既存テキストの識別子が当該自立語を含む前記既存テキスト中の自立語語数順に列挙されている、前記記憶装置内のインデックス格納部を、抽出された前記自立語で検索して、一致する前記自立語について前記自立語語数が前記自立語語数の範囲内である前記既存テキストの識別子を順に抽出するステップ、
を含む請求項１記載の検索方法。
The search method according to claim 1, wherein the similarity calculation step includes:
searching, with the extracted independent words, an index storage unit in the storage device in which, for each independent word appearing in the existing texts, the identifiers of the existing texts containing that independent word are listed in order of the number of independent words in those existing texts, and extracting in order, for each matching independent word, the identifiers of the existing texts whose number of independent words is within the range.
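One way to picture the index described in this claim is a posting list per word, sorted by the word count of each text, so that the texts inside the range form one contiguous slice. The sketch below is an assumption about layout, not the patented implementation: the `(word_count, text_id)` tuples, the function name `ids_in_range`, and the use of binary search are illustrative. The sample data reuses the two example texts from the description (text 1 with five independent words, text 2 with four).

```python
from bisect import bisect_left, bisect_right

# Hypothetical index: word -> list of (word_count, text_id), sorted by word_count.
index = {
    "Taro": [(4, 2), (5, 1)],
    "Hanako": [(4, 2), (5, 1)],
    "Jiro": [(4, 2), (5, 1)],
    "Saburo": [(5, 1)],
    "cry": [(4, 2), (5, 1)],
}


def ids_in_range(index, words, lower, upper):
    """Count matching words per text ID, reading only postings whose
    word count lies in [lower, upper]."""
    matches = {}
    for word in words:
        postings = index.get(word, [])
        start = bisect_left(postings, (lower,))               # first count >= lower
        end = bisect_right(postings, (upper, float("inf")))   # last count <= upper
        for _, text_id in postings[start:end]:
            matches[text_id] = matches.get(text_id, 0) + 1
    return matches


result = ids_in_range(index, ["Taro", "Hanako", "cry"], lower=2, upper=4)
print(result)  # {2: 3}
```

Because the postings are ordered by word count, text 1 (five words) is never visited for a range of 2 to 4.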

前記類似度算出ステップが、
前記既存テキストに出現する自立語毎に当該自立語を含む前記既存テキストの識別子が列挙されており且つ前記既存テキスト中の自立語語数毎に設けられている、前記記憶装置内のインデックス格納部のうち、前記自立語語数の範囲に含まれる自立語語数についてのインデックス格納部を選択するステップと、
選択された前記インデックス格納部を、抽出された前記自立語で検索して、一致する前記自立語を含む前記既存テキストの識別子を抽出するステップと、
を含む請求項１記載の検索方法。
The search method according to claim 1, wherein the similarity calculation step includes:
selecting, from among index storage units in the storage device that are provided for each number of independent words in the existing texts and in each of which the identifiers of the existing texts containing an independent word are listed for each independent word appearing in the existing texts, the index storage units for the numbers of independent words included in the range; and
searching the selected index storage units with the extracted independent words to extract the identifiers of the existing texts containing the matching independent words.
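The alternative in this claim keeps a separate index for each word count and consults only the indexes whose word count falls inside the range. The following sketch assumes a nested dictionary layout and a function name, `search_selected_indexes`, that are both hypothetical. The sample data again reuses the two example texts from the description.

```python
# Hypothetical layout: word_count -> (word -> list of text IDs).
indexes_by_size = {
    4: {"Taro": [2], "Hanako": [2], "Jiro": [2], "cry": [2]},
    5: {"Taro": [1], "Hanako": [1], "Jiro": [1], "Saburo": [1], "cry": [1]},
}


def search_selected_indexes(indexes_by_size, words, lower, upper):
    """Count matching words per text ID using only the per-size indexes
    whose word count lies in [lower, upper]."""
    matches = {}
    for size in range(lower, upper + 1):
        index = indexes_by_size.get(size)
        if index is None:
            continue  # no existing text has this many independent words
        for word in words:
            for text_id in index.get(word, []):
                matches[text_id] = matches.get(text_id, 0) + 1
    return matches


found = search_selected_indexes(indexes_by_size, ["Taro", "Saburo"], lower=5, upper=6)
print(found)  # {1: 2}
```

Compared with a single sorted index, this layout trades extra storage structures for a lookup that never touches postings outside the range.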

請求項１乃至５のいずれか１つ記載の検索方法をコンピュータに実行させるためのプログラム。
A program for causing a computer to execute the search method according to any one of claims 1 to 5.

記憶装置に格納されている入力テキストから自立語を抽出する手段と、
抽出された前記自立語を条件として前記入力テキストとの類似度が、前記記憶装置に格納されている類似度閾値以上となる、既存テキスト中の自立語語数の範囲を特定する範囲特定手段と、
自立語語数が前記自立語語数の範囲内である既存テキストに限定して、前記記憶装置に格納されている、当該既存テキスト中の自立語と抽出された前記自立語とを用いて当該既存テキストと前記入力テキストの類似度を算出し、前記記憶装置に格納する類似度算出手段と、
前記記憶装置に格納された前記類似度が前記類似度閾値以上となる前記既存テキストを特定する手段と、
を有する検索装置。
A search device comprising:
means for extracting independent words from input text stored in a storage device;
range specifying means for specifying, with the extracted independent words as the condition, a range of the number of independent words in existing text for which the similarity to the input text is equal to or greater than a similarity threshold stored in the storage device;
similarity calculation means for calculating, only for existing text whose number of independent words is within the range, the similarity between that existing text and the input text using the independent words in the existing text stored in the storage device and the extracted independent words, and storing the similarity in the storage device; and
means for identifying the existing text whose similarity stored in the storage device is equal to or greater than the similarity threshold.