JP5284761B2

JP5284761B2 - Document search apparatus and method, program, and recording medium recording program

Info

Publication number: JP5284761B2
Application number: JP2008299988A
Authority: JP
Inventors: 克人別所; 俊郎内山; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-11-25
Filing date: 2008-11-25
Publication date: 2013-09-11
Anticipated expiration: 2028-11-25
Also published as: JP2010128598A

Description

The present invention relates to a document search apparatus, method, and program, and to a recording medium on which the program is recorded, and in particular to an apparatus, method, program, and recording medium for retrieving, with high accuracy, documents that match a search input sentence.

Keyword search, which retrieves documents containing the entered keywords, is the mainstream document search method (see, for example, Patent Document 1).
Patent Document 1: Japanese Patent Laid-Open No. 01-201723

Keyword search, however, has the problem that documents which do not contain an input word but are related to it are not retrieved.

A word that is a subordinate concept of an input word falls under that input word. For example, the word "depression", a subordinate concept of the input word "mental illness", falls under the input word "mental illness". Likewise, a document containing a word that is a subordinate concept of an input word falls under that input word: a document containing the word "depression" falls under the input word "mental illness".

On the other hand, a word that is a sibling concept or a superordinate concept of an input word does not necessarily fall under that input word. For example, the word "heart disease", a sibling concept of the input word "mental illness", and the word "disease", a superordinate concept, do not necessarily fall under the input word "mental illness". Likewise, a document containing a sibling or superordinate concept of an input word does not necessarily fall under that input word: a document containing "heart disease" or "disease" does not necessarily fall under the input word "mental illness".

Thus, to retrieve documents that do not contain an input word but are related to it, it is necessary to retrieve documents that contain words that are subordinate concepts of the input word.

One way to achieve this, as shown in Patent Document 1, is to use a thesaurus that defines superordinate/subordinate relations between words and to obtain from it the words that are subordinate concepts of the input word. However, because a thesaurus is created by hand, it is incomplete and cannot express the superordinate/subordinate relations between words accurately and exhaustively. In addition, these relations differ from field to field and change over time, so the cost of rebuilding the thesaurus is high.

The present invention has been made in view of the above points. Its object is to provide a document search apparatus, method, and program, and a recording medium on which the program is recorded, that can acquire, without a cost problem, superordinate/subordinate relations between words suited to the application area (such as field or period) and can use those relations to retrieve documents matching a search input sentence with high accuracy.

FIG. 1 is a diagram showing the principle configuration of the present invention.

The present invention (claim 1) comprises:
a word vector database 15 that associates each word A with a word vector in which the value of each component is the relative value of the co-occurrence frequency between word A and the word or word semantic attribute corresponding to that component;
morphological analysis means 11 for morphologically analyzing a search input sentence;
first related word acquisition means 12 for, with respect to a word B in the morphological analysis result obtained by the morphological analysis means 11, calculating the α-divergence distance between the word vector of word B in the word vector database 15 and the word vector of an arbitrary word C in the word vector database 15, with the threshold of α in the formula expressing that distance (formula omitted in this text) set to less than 0.5, and acquiring one or more words C having a small distance as related words of word B;
first morphological analysis result acquisition means 13 for acquiring a post-replacement morphological analysis result obtained by replacing word B in the morphological analysis result obtained by the morphological analysis means 11 with a related word acquired for word B by the first related word acquisition means 12; and
first search means 14 for executing a document search after adding the post-replacement morphological analysis result acquired by the first morphological analysis result acquisition means 13 to the morphological analysis result obtained by the morphological analysis means 11.

The present invention (claim 2) comprises:
an inter-word relation database obtained by referring to a word vector database that associates each word A with a word vector in which the value of each component is the relative value of the co-occurrence frequency between word A and the word or word semantic attribute corresponding to that component, and calculating, for an arbitrary word B, the α-divergence distance between the word vector of word B in the word vector database and the word vector of an arbitrary word C in the word vector database, with the threshold of α in the formula expressing that distance (formula omitted in this text) set to less than 0.5, the inter-word relation database either storing the distance, or a degree of relevance obtained from the calculation, together with the ordered word pair (B, C), or storing for each word one or more words having a small distance or a high degree of relevance;
morphological analysis means for morphologically analyzing a search input sentence;
second related word acquisition means for, with respect to a word D in the morphological analysis result obtained by the morphological analysis means, acquiring from the inter-word relation database one or more words E having a small distance from, or a high degree of relevance to, word D as related words of word D;
second morphological analysis result acquisition means for acquiring a post-replacement morphological analysis result obtained by replacing word D in the morphological analysis result obtained by the morphological analysis means with a related word acquired for word D by the second related word acquisition means; and
first search means for executing a document search after adding the post-replacement morphological analysis result acquired by the second morphological analysis result acquisition means to the morphological analysis result obtained by the morphological analysis means.

The present invention (claim 3) comprises:
a word vector database that associates each word A with a word vector in which the value of each component is the relative value of the co-occurrence frequency between word A and the word or word semantic attribute corresponding to that component;
morphological analysis means for morphologically analyzing a search input sentence;
first related word acquisition means for, with respect to a word B in the morphological analysis result obtained by the morphological analysis means, calculating the α-divergence distance between the word vector of word B in the word vector database and the word vector of an arbitrary word C in the word vector database, with the threshold of α in the formula expressing that distance (formula omitted in this text) set to less than 0.5, and acquiring one or more words C having a small distance as related words of word B;
first morphological analysis result acquisition means for acquiring a post-replacement morphological analysis result obtained by replacing word B in the morphological analysis result obtained by the morphological analysis means with a related word acquired for word B by the first related word acquisition means;
display means for displaying the post-replacement morphological analysis result acquired by the first morphological analysis result acquisition means; and
second search means for executing a document search after adding the post-replacement morphological analysis result selected by the user to the morphological analysis result obtained by the morphological analysis means.

The present invention (claim 4) comprises:
an inter-word relation database obtained by referring to a word vector database that associates each word A with a word vector in which the value of each component is the relative value of the co-occurrence frequency between word A and the word or word semantic attribute corresponding to that component, and calculating, for an arbitrary word B, the α-divergence distance between the word vector of word B in the word vector database and the word vector of an arbitrary word C in the word vector database, with the threshold of α in the formula expressing that distance (formula omitted in this text) set to less than 0.5, the inter-word relation database either storing the distance, or a degree of relevance obtained from the calculation, together with the ordered word pair (B, C), or storing for each word one or more words having a small distance or a high degree of relevance;
morphological analysis means for morphologically analyzing a search input sentence;
second related word acquisition means for, with respect to a word D in the morphological analysis result obtained by the morphological analysis means, acquiring from the inter-word relation database one or more words E having a small distance from, or a high degree of relevance to, word D as related words of word D;
second morphological analysis result acquisition means for acquiring a post-replacement morphological analysis result obtained by replacing word D in the morphological analysis result obtained by the morphological analysis means with a related word acquired for word D by the second related word acquisition means;
display means for displaying the post-replacement morphological analysis result acquired by the second morphological analysis result acquisition means; and
second search means for executing a document search after adding the post-replacement morphological analysis result selected by the user to the morphological analysis result obtained by the morphological analysis means.

FIG. 2 is a diagram for explaining the principle of the present invention.

In the present invention (claim 5), the following steps are performed:
a morphological analysis step (step 1) in which morphological analysis means morphologically analyzes a search input sentence;
a first related word acquisition step (step 2) in which first related word acquisition means, with respect to a word B in the morphological analysis result obtained in the morphological analysis step, calculates the α-divergence distance between the word vector of word B and the word vector of an arbitrary word C, both taken from a word vector database that associates each word A with a word vector in which the value of each component is the relative value of the co-occurrence frequency between word A and the word or word semantic attribute corresponding to that component, with the threshold of α in the formula expressing that distance (formula omitted in this text) set to less than 0.5, and acquires one or more words C having a small distance as related words of word B;
a first morphological analysis result acquisition step (step 3) in which first morphological analysis result acquisition means acquires a post-replacement morphological analysis result obtained by replacing word B in the morphological analysis result obtained in the morphological analysis step with a related word acquired for word B in the first related word acquisition step; and
a first search step in which first search means executes a document search after adding the post-replacement morphological analysis result acquired in the first morphological analysis result acquisition step (step 3) to the morphological analysis result obtained in the morphological analysis step.

In the present invention (claim 6), the following steps are performed:
a morphological analysis step in which morphological analysis means morphologically analyzes a search input sentence;
a second related word acquisition step in which second related word acquisition means, with respect to a word D in the morphological analysis result obtained in the morphological analysis step, acquires one or more words E having a small distance from, or a high degree of relevance to, word D as related words of word D from an inter-word relation database, the inter-word relation database being obtained by referring to a word vector database that associates each word A with a word vector in which the value of each component is the relative value of the co-occurrence frequency between word A and the word or word semantic attribute corresponding to that component, and calculating, for an arbitrary word B, the α-divergence distance between the word vector of word B and the word vector of an arbitrary word C in the word vector database, with the threshold of α in the formula expressing that distance (formula omitted in this text) set to less than 0.5, and either storing the distance, or a degree of relevance obtained from the calculation, together with the ordered word pair (B, C), or storing for each word one or more words having a small distance or a high degree of relevance;
a second morphological analysis result acquisition step in which second morphological analysis result acquisition means acquires a post-replacement morphological analysis result obtained by replacing word D in the morphological analysis result obtained in the morphological analysis step with a related word acquired for word D by the second related word acquisition means; and
a first search step in which first search means executes a document search after adding the post-replacement morphological analysis result acquired in the second morphological analysis result acquisition step to the morphological analysis result obtained in the morphological analysis step.

In the present invention (claim 7), the following steps are performed:
a morphological analysis step in which morphological analysis means morphologically analyzes a search input sentence;
a first related word acquisition step in which first related word acquisition means, with respect to a word B in the morphological analysis result obtained in the morphological analysis step, calculates the α-divergence distance between the word vector of word B and the word vector of an arbitrary word C, both taken from a word vector database that associates each word A with a word vector in which the value of each component is the relative value of the co-occurrence frequency between word A and the word or word semantic attribute corresponding to that component, with the threshold of α in the formula expressing that distance (formula omitted in this text) set to less than 0.5, and acquires one or more words C having a small distance as related words of word B;
a first morphological analysis result acquisition step in which first morphological analysis result acquisition means acquires a post-replacement morphological analysis result obtained by replacing word B in the morphological analysis result obtained in the morphological analysis step with a related word acquired for word B in the first related word acquisition step;
a display step in which display means displays the post-replacement morphological analysis result acquired in the first morphological analysis result acquisition step; and
a second search step in which second search means executes a document search after adding the post-replacement morphological analysis result selected by the user to the morphological analysis result obtained in the morphological analysis step.

In the present invention (claim 8), the following steps are performed:
a morphological analysis step in which morphological analysis means morphologically analyzes a search input sentence;
a second related word acquisition step in which second related word acquisition means, with respect to a word D in the morphological analysis result obtained in the morphological analysis step, acquires one or more words E having a small distance from, or a high degree of relevance to, word D as related words of word D from an inter-word relation database, the inter-word relation database being obtained by referring to a word vector database that associates each word A with a word vector in which the value of each component is the relative value of the co-occurrence frequency between word A and the word or word semantic attribute corresponding to that component, and calculating, for an arbitrary word B, the α-divergence distance between the word vector of word B and the word vector of an arbitrary word C in the word vector database, with the threshold of α in the formula expressing that distance (formula omitted in this text) set to less than 0.5, and either storing the distance, or a degree of relevance obtained from the calculation, together with the ordered word pair (B, C), or storing for each word one or more words having a small distance or a high degree of relevance;
a second morphological analysis result acquisition step in which second morphological analysis result acquisition means acquires a post-replacement morphological analysis result obtained by replacing word D in the morphological analysis result obtained in the morphological analysis step with a related word acquired for word D by the second related word acquisition means;
a display step in which display means displays the post-replacement morphological analysis result acquired in the second morphological analysis result acquisition step; and
a second search step in which second search means executes a document search after adding the post-replacement morphological analysis result selected by the user to the morphological analysis result obtained in the morphological analysis step.

The present invention (claim 9) is a document search program for causing a computer to function as each of the means constituting the document search apparatus according to any one of claims 1 to 4.

The present invention (claim 10) is a computer-readable recording medium storing the program according to claim 9.

As described above, the present invention regards a word vector in the word vector database as a probability distribution in which each component is a random variable and each component value is a probability, and calculates the distance between probability distributions by the α-divergence. Consider the α-divergence from the probability distribution of word B to the probability distribution of word C (the symbol and formula are omitted in this text). With the distribution of B fixed, a distribution of C that makes this divergence small tends to be contained in the distribution of B when α is small and, conversely, tends to contain the distribution of B when α is large.

In general, a word or word semantic attribute that co-occurs with a word of a subordinate concept also tends to co-occur with the word of its superordinate concept. The word vector of a subordinate concept therefore tends to be contained in the word vector of its superordinate concept. Hence, with word B fixed, a word C that makes the α-divergence from B to C small tends to be a subordinate concept of B when α is small and, conversely, tends to be a superordinate concept of B when α is large.

Therefore, when α is set to a small value, a word C with a small α-divergence from B is a subordinate concept of word B, so word C falls under word B. A keyword obtained by replacing word B in the input keyword with word C thus also falls under the input keyword, and a document containing the replaced keyword falls under the input keyword. In this way, documents that do not contain the input word but fall under it can be retrieved, and search accuracy improves.
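The asymmetry described above can be checked numerically. The sketch below is not the patent's formula, which is not reproduced in this text. It assumes one common parameterization of the α-divergence, (Σ p_i^α q_i^(1−α) − 1) / (α(α − 1)) for 0 < α < 1. The function name `alpha_divergence` and the three toy distributions are likewise assumptions made for illustration.

```python
def alpha_divergence(p, q, alpha):
    """Assumed form: (sum_i p_i^alpha * q_i^(1-alpha) - 1) / (alpha * (alpha - 1)).

    p is the fixed distribution (word B), q the candidate (word C).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("this sketch only handles 0 < alpha < 1")
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return (s - 1.0) / (alpha * (alpha - 1.0))


b = [0.5, 0.5, 0.0]        # fixed word B
narrow = [1.0, 0.0, 0.0]   # mass only where B has mass (contained in B)
broad = [0.25, 0.25, 0.5]  # mass everywhere B has mass and beyond (contains B)

for alpha in (0.2, 0.8):
    print(alpha,
          alpha_divergence(b, narrow, alpha),
          alpha_divergence(b, broad, alpha))
```

Under this assumed form, the narrow distribution is the closer one at α = 0.2 and the broad distribution is the closer one at α = 0.8, which matches the tendency the text describes for small and large α.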

Embodiments of the present invention are described below with reference to the drawings.

[First Embodiment]
FIG. 3 shows the configuration of the search device according to the first embodiment of the present invention.

The search device shown in the figure comprises a morphological analysis unit 11, a first related word acquisition unit 12, a first morphological analysis result acquisition unit 13, a first search unit 14, a word vector database (DB) 15, and a display unit 16.
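How the query-side units cooperate can be sketched as follows: each analyzed word is replaced by each of its related words, and the replaced results are added to the original analysis result before searching. This is only an illustration under assumptions. The function name `expand_query`, the `related` lookup table, and the English tokens are hypothetical, and the search itself is left out.

```python
def expand_query(tokens, related):
    """Return the original token list plus one copy per (word, related word) replacement."""
    expanded = [list(tokens)]  # the original morphological analysis result is kept
    for i, word in enumerate(tokens):
        for substitute in related.get(word, []):
            replaced = list(tokens)
            replaced[i] = substitute  # post-replacement analysis result
            expanded.append(replaced)
    return expanded


# Hypothetical related-word table, e.g. produced by a related word acquisition unit.
related = {"mental illness": ["depression"]}
queries = expand_query(["mental illness", "research"], related)
print(queries)
```

Each element of `queries` would then be passed to the search unit, so that documents containing only the related word can still be matched.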

The word vector DB 15 associates each word A with a word vector in which the value of each component is the relative value of the co-occurrence frequency between word A and the word or word semantic attribute corresponding to that component. The word vector DB 15 is generated, for example, by the method described in Reference 1 (K. Bessho, T. Uchiyama, R. Kataoka, "A concept-base extension method based on co-occurrence between words and semantic attributes," IPSJ SIG Technical Report, vol. 2006-ICS-144, pp. 29-34, Jul. 2006) and Reference 2 (K. Bessho, T. Uchiyama, R. Kataoka, "Extraction of hierarchical relations between words based on co-occurrence between words and semantic attributes," IEICE Technical Report, vol. NLC2006-92, pp. 31-36, Jan. 2007).

In this method, the corpus is morphologically analyzed, and the words needed for processing are identified by referring to, for example, a list of the required parts of speech (nouns, declinable words, and so on). Either an inter-word co-occurrence matrix, in which each row and each column corresponds to a word, or a word/semantic-attribute co-occurrence matrix, in which each row corresponds to a word and each column to a word semantic attribute, is then generated. From the morphological analysis result, for each pair of words, or each pair of a word and a word semantic attribute, the frequency with which the pair co-occurs within a predetermined range of the corpus (typically one sentence) is calculated; these frequencies are summed over the entire corpus, and the total is written to the component of the co-occurrence matrix for that pair.
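The counting described above can be sketched for the inter-word case as follows. This is a minimal illustration, not the method of References 1 and 2. It assumes the corpus is already segmented into sentences of content words, uses one sentence as the co-occurrence range, and counts each distinct pair at most once per sentence. The function name `cooccurrence_counts` and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_counts(sentences):
    """Sum, over the corpus, how often each ordered word pair shares a sentence."""
    counts = Counter()
    for tokens in sentences:
        # Assumption: a pair is counted once per sentence, however often it repeats.
        for row_word, col_word in permutations(sorted(set(tokens)), 2):
            counts[(row_word, col_word)] += 1
    return counts


corpus = [
    ["depression", "treatment", "research"],
    ["mental illness", "treatment"],
    ["mental illness", "research"],
]
counts = cooccurrence_counts(corpus)
print(counts[("mental illness", "treatment")])
```

The row of the co-occurrence matrix for a word is then the vector of `counts[(word, c)]` over all column words `c`.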

FIG. 4 shows an example of the co-occurrence matrix generated in the first embodiment of the present invention.

Each row vector of the co-occurrence matrix represents the pattern in which the corresponding word co-occurs with words or word semantic attributes. Words with similar meanings tend to co-occur with the same words or word semantic attributes, so their patterns also tend to be similar. The vector of a word can therefore be regarded as the concept of that word, and the relatedness between words can be calculated quantitatively from the similarity between the corresponding vectors.

Suppose that the generated vector of a word is $(a_1, a_2, \ldots, a_n)$ with $a_i \ge 0$ $(1 \le i \le n)$. This vector is converted to
$(x_1, x_2, \ldots, x_n)$,
where

$$x_i = \frac{a_i}{\sum_{j=1}^{n} a_j} \quad (1 \le i \le n).$$

Each $x_i$ $(1 \le i \le n)$ is the relative frequency of the corresponding component value in the pre-conversion vector with respect to the sum of all component values. Since

$$x_i \ge 0 \quad \text{and} \quad \sum_{i=1}^{n} x_i = 1,$$

the converted vector can be regarded as a probability distribution in which each component is a random variable and each component value is a probability. The co-occurrence matrix after this conversion is the word vector DB 15.
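The conversion to relative frequencies can be illustrated with a short sketch. The function name `to_distribution` and the handling of an all-zero row are assumptions; the source does not say what happens to a word with no co-occurrences.

```python
def to_distribution(row):
    """Divide each co-occurrence count by the row total so the row sums to 1."""
    total = sum(row)
    if total == 0:
        # Assumption: a word with no co-occurrences cannot be normalized.
        raise ValueError("row has no co-occurrences to normalize")
    return [a / total for a in row]


x = to_distribution([2, 1, 1])
print(x, sum(x))
```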

The morphological analysis unit 11 morphologically analyzes the search input sentence. For the search input sentence IN below, either OUT1 or OUT2 is obtained as the morphological analysis result.

・Search input sentence
IN) 精神病の研究 ("research on mental illness")
・Morphological analysis result
OUT1) 精神病／の／研究
OUT2) 精神病，研究
OUT1 keeps the word の ("of"), which is not a content word such as a noun or a declinable word, whereas OUT2 removes everything except content words.
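The difference between the two output styles can be sketched with pre-tagged morphemes. No real morphological analyzer is called here. The part-of-speech labels, the `CONTENT_POS` set, and the function name `content_words` are hypothetical stand-ins for whatever the analyzer actually returns.

```python
# Assumed part-of-speech labels treated as content words.
CONTENT_POS = {"noun", "verb", "adjective"}


def content_words(morphemes):
    """Keep only the surface forms whose part of speech marks a content word."""
    return [surface for surface, pos in morphemes if pos in CONTENT_POS]


# Hand-written analysis of the example input 精神病の研究.
analysis = [("精神病", "noun"), ("の", "particle"), ("研究", "noun")]

out1 = "／".join(surface for surface, _ in analysis)  # all morphemes kept
out2 = "，".join(content_words(analysis))             # content words only
print(out1)
print(out2)
```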

For each content word X in the morphological analysis result obtained by the morphological analysis unit 11, the first related word acquisition unit 12 calculates the α-divergence distance between the word vector of X in the word vector DB 15 and the word vector of an arbitrary word Y in the word vector DB 15, and acquires one or more words Y with a small distance as related words of X.

FIG. 5 is a flowchart of the operation of the first related word acquisition unit in the first embodiment of the present invention.

The operation shown in the figure acquires, for each content word X in the morphological analysis result, the related words Y of X. The first related word acquisition unit 12 uses one of the following two methods.

(1) First method:
As the first generation method, the flowchart of FIG. 5 consists only of steps 501, 502, and 503, and if no word Y remains to be processed in step 502, the process returns to step 501. In step 503, each word Y for which the calculated α divergence P_α(X‖Y) is at or below a certain threshold is acquired as a related word of the word X being processed.

(2) Second method:
As the second generation method, the flowchart of FIG. 5 consists only of steps 501, 502, 503, 504, and 506, and the process proceeds to step 506 after the processing of step 504 is completed.

以下、図５のフローチャートの各処理内容を説明する。 Hereinafter, each processing content of the flowchart of FIG. 5 will be described.

ステップ５０１）これまでに処理していない単語の中で、処理対象とする単語Ｘを一つ決定する。あればステップ５０２に移行し、なければ本処理を終了する。 Step 501) Among the words not processed so far, one word X to be processed is determined. If there is, the process proceeds to step 502, and if not, this process is terminated.

ステップ５０２）これまでに処理していない単語の中で、処理対象とする単語Ｙを一つ決定する。あればステップ５０３に移行し、なければステップ５０４に移行する。任意の単語対Ｘ，Ｙに対しＸ，Ｙ間の関連度を算出する処理の計算量を低減するために、単語Ｙの集合を、例えば、コーパス中の高頻度語集合に限定してもよい。 Step 502) Among the words that have not been processed so far, one word Y to be processed is determined. If there is, the process proceeds to step 503, and if not, the process proceeds to step 504. In order to reduce the amount of calculation for calculating the degree of association between X and Y for an arbitrary word pair X and Y, the set of words Y may be limited to, for example, a high-frequency word set in a corpus .

Step 503) For the word pair X, Y, when the vectors v(X) and v(Y) of X and Y are

v(X) := (x_1, x_2, ..., x_n)
v(Y) := (y_1, y_2, ..., y_n),

the α divergence from X to Y, denoted P_α(X‖Y), is calculated from the components x_i and y_i.

Here, so that the value is always finite, a constant D is introduced by definition, and special-case values are used in the calculation when α < 0 and x_i = 0, and when 1 − α < 0 and y_i = 0.

For specific values of α, P_α(X‖Y) is expressed in terms of KL(X‖Y), the Kullback-Leibler distance from X to Y, which is given by

KL(X‖Y) = Σ_{i=1..n} x_i log(x_i / y_i).

For convenience, by setting P_0(X‖Y) = KL(Y‖X) and P_1(X‖Y) = KL(X‖Y), the α divergence can be regarded as an extension of the Kullback-Leibler distance.

When α is set to a value less than 0.5, a word Y for which P_α(X‖Y) is small is a subordinate concept of the word X, so the word Y corresponds to the word X.
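The defining equation of the α divergence is not reproduced in this text, so the following Python sketch substitutes a commonly used textbook form, (1 − Σ_i x_i^α y_i^(1−α)) / (α(1−α)). It is chosen only because its limits at α = 1 and α = 0 are KL(X‖Y) and KL(Y‖X), which agrees with the convention P_1(X‖Y) = KL(X‖Y), P_0(X‖Y) = KL(Y‖X) stated in the description. The function names, the restriction to 0 ≤ α ≤ 1, and the requirement of strictly positive components are assumptions; the constant D and the special cases for zero components are not modeled.

```python
import math


def kl(p, q):
    """Kullback-Leibler distance KL(p||q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def alpha_divergence(p, q, alpha):
    """Assumed textbook alpha divergence from p to q (not the patent's own equation)."""
    if alpha == 1:
        return kl(p, q)   # convention P_1(X||Y) = KL(X||Y)
    if alpha == 0:
        return kl(q, p)   # convention P_0(X||Y) = KL(Y||X)
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return (1.0 - s) / (alpha * (1.0 - alpha))


x = [0.5, 0.3, 0.2]
y = [0.2, 0.3, 0.5]
print(alpha_divergence(x, y, 0.3))
print(alpha_divergence(y, x, 0.3))
```

Under this assumed form the divergence is zero for identical vectors and symmetric only at α = 0.5, so choosing α below 0.5 gives a directed measure, which is the property the description relies on.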

After the processing of step 503 is completed, the process returns to step 502.

Step 504) The words Y are ranked in ascending order of P_α(X‖Y). After the processing of step 504 is completed, the process proceeds to step 505.

Step 505) For an arbitrary word Y, letting m be the rank of Y, the degree of association from X to Y, denoted E_α(X‖Y), is calculated, as one example, from the rank m.

単語Ｘ毎のランキング結果上位におけるαダイバージェンスの大きさは異なる。しかし、ランキング結果の上位は、Ｘ毎の距離の大きさの違いに関わらず、αに応じた概念レベルの単語が常に占める。したがって、ランキングにおける順位は、αに応じたレベルの概念である度合いを表す。よって、この順位により算出される関連度は、ＸからＹへの関連度を的確に表す。ステップ５０５の処理が終了した後、ステップ５０６に移行する。 The magnitude of α divergence at the top of the ranking results for each word X is different. However, the highest ranking results are always occupied by words at the concept level corresponding to α, regardless of the difference in distance for each X. Therefore, the rank in the ranking represents a degree that is a concept of a level according to α. Therefore, the degree of association calculated by this rank accurately represents the degree of association from X to Y. After the processing in step 505 is completed, the process proceeds to step 506.

ステップ５０６）ランキングにおいて、ある順位までの単語Ｙを、処理中の単語Ｘの関連単語として取得する。ステップ５０６の処理が終了した後、ステップ５０１に移行する。 Step 506) In the ranking, the words Y up to a certain ranking are acquired as related words of the word X being processed. After the processing of step 506 is completed, the routine proceeds to step 501.
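Steps 504 to 506 can be sketched in Python as follows. The example formula that converts a rank m into a degree of association is not preserved in this text, so the sketch uses 1/m purely as a hypothetical stand-in; the function name `rank_related` and the sample distances are likewise illustrative assumptions.

```python
def rank_related(distances, top_k):
    """Rank candidate words Y by ascending distance from X and keep the top_k.

    distances: dict mapping word Y -> alpha divergence from X to Y.
    Returns a list of (word, relevance) pairs, where relevance = 1 / rank
    (a hypothetical formula, not the one in the patent).
    """
    ranked = sorted(distances.items(), key=lambda item: item[1])  # step 504
    scored = [(word, 1.0 / m)                                     # step 505
              for m, (word, _) in enumerate(ranked, start=1)]
    return scored[:top_k]                                         # step 506


candidates = {"鬱病": 0.1, "躁病": 0.2, "論文": 0.9}
print(rank_related(candidates, 2))  # [('鬱病', 1.0), ('躁病', 0.5)]
```

Because the score depends only on the rank, it is comparable across different words X even when their raw distances differ in scale.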

なお、第１関連単語取得部１２において、単語Ｘに対し、複数の異なるα毎に関連単語Ｙを取得し、それらの関連単語集合をマージしたものを、最終的な単語Ｘの関連単語群としてもよい。 The first related word acquisition unit 12 acquires a related word Y for each of a plurality of different αs from the word X, and merges those related word sets as a related word group of the final word X. Also good.

By the above processing, the following subordinate-concept words are obtained for the content words "mental illness" (精神病) and "research" (研究) in OUT1 (OUT2).

・Mental illness (精神病): depression (鬱病), mania (躁病), PTSD, auditory hallucination (幻聴)
・Research (研究): paper (論文), observation (観察)

The first morpheme analysis result acquisition unit 13 acquires a post-replacement morpheme analysis result obtained by replacing a word X in the morpheme analysis result obtained by the morpheme analysis unit 11 with the related words of the word X acquired by the first related word acquisition unit 12.

The morpheme analysis results after replacement for OUT1 (OUT2) are as follows. "∨" is the symbol for logical OR.

・OUT1: (depression ∨ mania ∨ PTSD ∨ auditory hallucination) / の (of) / (paper ∨ observation)
・OUT2: (depression ∨ mania ∨ PTSD ∨ auditory hallucination), (paper ∨ observation)

The first search unit 14 executes a document search after adding the post-replacement morpheme analysis result acquired by the first morpheme analysis result acquisition unit 13 to the morpheme analysis result obtained by the morpheme analysis unit 11.

The final search keys corresponding to OUT1 (OUT2) are as follows. "∧" is the symbol for logical AND.

・OUT1: (research on mental illness) ∨ (research on depression) ∨ … ∨ (paper on mental illness) ∨ … ∨ (observation of auditory hallucination)
・OUT2: (mental illness ∨ depression ∨ mania ∨ PTSD ∨ auditory hallucination) ∧ (research ∨ paper ∨ observation)

In this way, it becomes possible to retrieve documents that do not contain the words "mental illness" and "research" obtained by the morpheme analysis unit 11 but that correspond to "mental illness" and "research".
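The construction of the final search keys can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the related words are supplied by hand rather than computed, and the OUT1-style expansion assumes a single function word joining exactly the content words, which is a simplification of a general morpheme sequence.

```python
from itertools import product


def expand_terms(content_words, related):
    """For each content word, build the group [word] + its related words."""
    return [[w] + related.get(w, []) for w in content_words]


def boolean_key(groups):
    """OUT2-style key: logical AND of OR groups."""
    return " ∧ ".join("(" + " ∨ ".join(g) + ")" for g in groups)


def phrase_keys(groups, joiner):
    """OUT1-style key: every combination of one term per group, joined by a function word."""
    return [joiner.join(combo) for combo in product(*groups)]


related = {"精神病": ["鬱病", "躁病", "PTSD", "幻聴"], "研究": ["論文", "観察"]}
groups = expand_terms(["精神病", "研究"], related)
print(boolean_key(groups))
print(phrase_keys(groups, "の")[:3])  # ['精神病の研究', '精神病の論文', '精神病の観察']
```

Including the original word in each group corresponds to adding the post-replacement result to the original morpheme analysis result before searching.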

表示部１６は、第１形態素解析結果取得部１３で取得した置換後形態素解析結果を、ディスプレイ等に表示する。このようにして、ユーザに、どの置換後形態素解析結果が適切であるかを選択できるようにする。 The display unit 16 displays the post-replacement morpheme analysis result acquired by the first morpheme analysis result acquisition unit 13 on a display or the like. In this way, the user can select which post-substitution morpheme analysis result is appropriate.

[Second Embodiment]
FIG. 6 shows the configuration of the search device according to the second embodiment of the present invention.

同図に示す検索装置は、形態素解析部２１、第２関連単語取得部２２、第２形態素解析結果取得部２３、第１検索部２４、単語間関連データベース２５、表示部２６から構成される。 The search device shown in the figure includes a morpheme analysis unit 21, a second related word acquisition unit 22, a second morpheme analysis result acquisition unit 23, a first search unit 24, an inter-word relationship database 25, and a display unit 26.

The inter-word relation database 25 is built by calculating, for an arbitrary word X in the word vector database 15, the α divergence distance between the word vector of X and the word vector of an arbitrary word Y in the word vector database 15. It stores either the distance, or the degree of association obtained by the calculation, together with the ordered word pair X, Y, or it stores, for each word, one or more words with a small distance or a large degree of association. FIG. 7 is an example of the inter-word relation database 25 in the former storage format, and FIG. 8 is an example in the latter storage format. In FIG. 7, when α is small, X is the word corresponding to the superordinate concept and Y is the word corresponding to the subordinate concept; when α is large, X is the word corresponding to the subordinate concept and Y is the word corresponding to the superordinate concept. In FIG. 8, V is the word corresponding to the superordinate concept and W is the word corresponding to the subordinate concept.
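The two storage formats can be pictured with plain Python dictionaries. The sketch below is only an illustration: the variable names and all numeric values are hypothetical, and the actual layout of FIG. 7 and FIG. 8 is not visible in this text.

```python
# FIG. 7-style storage: ordered word pair (X, Y) -> distance (or degree of association).
# The pair is ordered, so (X, Y) and (Y, X) are separate entries with different values.
pair_db = {
    ("精神病", "鬱病"): 0.12,   # hypothetical value
    ("鬱病", "精神病"): 0.85,   # hypothetical value
    ("精神病", "研究"): 1.40,   # hypothetical value
}

# FIG. 8-style storage: word V -> list of related words W.
related_db = {
    "精神病": ["鬱病", "躁病", "PTSD", "幻聴"],
    "研究": ["論文", "観察"],
}

print(pair_db[("精神病", "鬱病")])
print(related_db["研究"])  # ['論文', '観察']
```

The first format keeps every score and defers selection to query time, while the second stores only the already selected related words, which makes lookup a single dictionary access.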

以下に単語間関連データベース２５を生成する動作を説明する。 The operation for generating the inter-word relation database 25 will be described below.

(1) First method:
The first method of generating the inter-word relation database 25 is described with reference to the flowchart of FIG. 5. The first method consists only of steps 501, 502, and 503 in FIG. 5, and if no word Y remains to be processed in step 502, the process returns to step 501. In this way, P_α(X‖Y) is calculated for every word pair X, Y, and the inter-word relation database 25 in the format of FIG. 7 is generated.

When α is small, each word Y for which P_α(X‖Y) calculated in step 503 is at or below a certain threshold may be acquired as a related word of the word X, and the inter-word relation database 25 in the format of FIG. 8 may be generated with X as V and Y as W.

(2) Second method:
As the second method of generating the inter-word relation database 25, when α is small, the process consists only of steps 501, 502, 503, 504, and 506 of the flowchart of FIG. 5, and proceeds to step 506 after the processing of step 504 is completed. In this way, for each word X, one or more words Y with small P_α(X‖Y) are acquired as related words, and the inter-word relation database 25 in the format of FIG. 8 is generated with X as V and Y as W.

(3) Third method:
As the third method of generating the inter-word relation database 25, the process consists only of steps 501, 502, 503, 504, and 505 of the flowchart of FIG. 5, and returns to step 501 after the processing of step 505 is completed. In this way, the degree of association E_α(X‖Y) is calculated for every word pair X, Y, and the inter-word relation database 25 in the format of FIG. 7 is generated.

(4) Fourth method:
As the fourth method of generating the inter-word relation database 25, when α is small, the process consists of steps 501, 502, 503, 504, 505, and 506 of the flowchart of FIG. 5. In this way, for each word X, one or more words Y with large E_α(X‖Y) are acquired as related words, and the inter-word relation database 25 in the format of FIG. 8 is generated with X as V and Y as W.

また、複数の異なるαに対して図７の形式の単語間関連データベース２５を生成した後、それらのデータベース群をマージした図７の形式の単語間関連データベース２５を生成してもよい。具体的には以下のようにする。 Further, after the inter-word relation database 25 in the format of FIG. 7 is generated for a plurality of different αs, the inter-word relation database 25 in the form of FIG. 7 may be generated by merging these database groups. Specifically:

Suppose that, for each of α_1, α_2, ..., α_h, the distance P_{α_g}(X‖Y) or the degree of association E_{α_g}(X‖Y) (1 ≤ g ≤ h) has been calculated for every word pair X, Y.

As one example, the final distance P(X‖Y) or degree of association E(X‖Y) for the word pair X, Y is calculated as a linear combination of these values:

P(X‖Y) = Σ_{g=1..h} β_g P_{α_g}(X‖Y),   E(X‖Y) = Σ_{g=1..h} β_g E_{α_g}(X‖Y).

β_g is a weight representing the degree to which the degree of association under the parameter α_g is reflected. A larger weight is given to a parameter that should be reflected more strongly.

As another example, the final distance P(X‖Y) or degree of association E(X‖Y) for the word pair X, Y is calculated as the minimum of the distances P_{α_g}(X‖Y), or as the maximum of the degrees of association E_{α_g}(X‖Y), over 1 ≤ g ≤ h.

The inter-word relation database 25 in the format of FIG. 7 obtained by merging in this way may be used as the final inter-word relation database 25.
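The two merging strategies for FIG. 7-style tables, a weighted linear combination and a minimum or maximum over the α values, can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that every per-α table contains exactly the same word pairs.

```python
def merge_linear(tables, weights):
    """Weighted sum over per-alpha tables: value(X, Y) = sum_g beta_g * table_g[(X, Y)]."""
    return {pair: sum(b * t[pair] for b, t in zip(weights, tables))
            for pair in tables[0]}


def merge_extreme(tables, pick):
    """Use pick=min for distances and pick=max for degrees of association."""
    return {pair: pick(t[pair] for t in tables) for pair in tables[0]}


# Hypothetical distances for two values of alpha.
table_a1 = {("精神病", "鬱病"): 0.2, ("精神病", "研究"): 1.0}
table_a2 = {("精神病", "鬱病"): 0.6, ("精神病", "研究"): 1.4}

print(merge_linear([table_a1, table_a2], [0.5, 0.5]))
print(merge_extreme([table_a1, table_a2], min))
```

The weights play the role of β_g: raising the weight of one table makes the merged value follow that α more closely.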

また、図７の形式の単語間関連データベース２５を生成した後、以下のようにして、図８の形式の単語間関連データベース２５を生成し、生成した図８の形式の単語間関連データベース２５を最終的な単語間関連データベース２５として使用してもよい。 Further, after generating the inter-word relation database 25 in the format of FIG. 7, the inter-word relation database 25 in the form of FIG. 8 is generated as follows, and the generated inter-word relation database 25 in the format of FIG. The final inter-word relation database 25 may be used.

When P(X‖Y) and E(X‖Y) are constructed from small α, for each word X, the words Y for which P(X‖Y) is at or below a certain threshold, or E(X‖Y) is at or above a certain threshold, are acquired as related words of the word X. Alternatively, for each word X, the words Y are ranked in ascending order of P(X‖Y) or in descending order of E(X‖Y), and the words Y up to a certain rank are acquired as related words of the word X. The inter-word relation database 25 in the format of FIG. 8 is then generated with X as V and Y as W.

When P(X‖Y) and E(X‖Y) are constructed from large α, for each word Y, the words X for which P(X‖Y) is at or below a certain threshold, or E(X‖Y) is at or above a certain threshold, are acquired as related words of the word Y. Alternatively, for each word Y, the words X are ranked in ascending order of P(X‖Y) or in descending order of E(X‖Y), and the words X up to a certain rank are acquired as related words of the word Y. The inter-word relation database 25 in the format of FIG. 8 is then generated with Y as V and X as W.
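The threshold-based conversion from the FIG. 7 format to the FIG. 8 format can be sketched in Python as follows. The function name, the `small_alpha` flag, and the sample distances are assumptions for illustration; only the distance variant is shown, and the rank-based variant would sort the candidates instead of filtering them.

```python
def to_related_db(pair_db, threshold, small_alpha=True):
    """Convert (X, Y) -> distance into V -> [W, ...], keeping pairs at or below the threshold.

    small_alpha=True : X becomes V and Y becomes W.
    small_alpha=False: Y becomes V and X becomes W.
    """
    related = {}
    for (x, y), dist in pair_db.items():
        if dist <= threshold:
            v, w = (x, y) if small_alpha else (y, x)
            related.setdefault(v, []).append(w)
    return related


pair_db = {("精神病", "鬱病"): 0.1, ("精神病", "研究"): 0.9}  # hypothetical distances
print(to_related_db(pair_db, 0.5))                     # {'精神病': ['鬱病']}
print(to_related_db(pair_db, 0.5, small_alpha=False))  # {'鬱病': ['精神病']}
```

Swapping the roles of X and Y for large α keeps V as the superordinate word in both cases, which matches the convention described for FIG. 8.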

Further, after the inter-word relation database 25 in the format of FIG. 8 has been generated for each of a plurality of different α, either by way of the format of FIG. 7 or directly, an inter-word relation database 25 in the format of FIG. 8 may be generated by merging these databases. Specifically, for each word V, the related word groups obtained for the individual α are merged, and the result is taken as the related word group of the word V. The inter-word relation database 25 in the format of FIG. 8 obtained by merging in this way may be used as the final inter-word relation database 25.
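Merging several FIG. 8-style databases amounts to taking, for each word V, the union of its related word groups. A minimal Python sketch follows; the function name and the choice to preserve first-seen order while dropping duplicates are assumptions for illustration.

```python
def merge_related_dbs(dbs):
    """Merge V -> [W, ...] dictionaries, keeping each related word once per V."""
    merged = {}
    for db in dbs:
        for v, words in db.items():
            bucket = merged.setdefault(v, [])
            for w in words:
                if w not in bucket:
                    bucket.append(w)
    return merged


db_alpha_1 = {"精神病": ["鬱病", "躁病"]}               # hypothetical result for one alpha
db_alpha_2 = {"精神病": ["躁病", "PTSD"], "研究": ["論文"]}  # hypothetical result for another alpha
print(merge_related_dbs([db_alpha_1, db_alpha_2]))
# {'精神病': ['鬱病', '躁病', 'PTSD'], '研究': ['論文']}
```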

形態素解析部２１の処理内容は、形態素解析部１１と同一である。 The processing content of the morpheme analysis unit 21 is the same as that of the morpheme analysis unit 11.

第２関連単語取得部２２は、形態素解析部２１で得られた形態素解析結果中の単語Ｖに対し、単語間関連データベース２５から、単語Ｖとの距離の小さいまたは関連度の大きい一つまたは複数の単語Ｗを、単語Ｖの関連単語として取得する。 The second related word acquisition unit 22 selects one or a plurality of the words V in the morpheme analysis result obtained by the morpheme analysis unit 21 from the inter-word relationship database 25 with a small distance or a high degree of relevance. Are acquired as related words of the word V.

When the inter-word relation database 25 is in the format of FIG. 7, the following processing is performed.

When P(X‖Y) and E(X‖Y) are constructed from small α, V is regarded as X and W as Y, and the words Y for which P(X‖Y) is at or below a certain threshold, or E(X‖Y) is at or above a certain threshold, are acquired as related words. Alternatively, the words Y are ranked in ascending order of P(X‖Y) or in descending order of E(X‖Y), and the words Y up to a certain rank are acquired as related words.

When P(X‖Y) and E(X‖Y) are constructed from large α, V is regarded as Y and W as X, and the words X for which P(X‖Y) is at or below a certain threshold, or E(X‖Y) is at or above a certain threshold, are acquired as related words. Alternatively, the words X are ranked in ascending order of P(X‖Y) or in descending order of E(X‖Y), and the words X up to a certain rank are acquired as related words.

単語間関連データベース２５が図８の形式である場合は、単語Ｖに対する一つまたは複数の単語Ｗを関連単語として取得する。 If the inter-word relation database 25 is in the format of FIG. 8, one or more words W for the word V are acquired as related words.
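The query-time lookup performed by the second related word acquisition unit 22 on a FIG. 7-style database can be sketched in Python as follows, using the rank-based variant with distances. The function name, the `small_alpha` flag, and the sample values are hypothetical. For a FIG. 8-style database the lookup reduces to reading the stored list for the word V.

```python
def lookup_related(pair_db, v, top_k, small_alpha=True):
    """Return up to top_k related words W for the query word V from (X, Y) -> distance.

    small_alpha=True : V is matched against X, and the Y words are candidates.
    small_alpha=False: V is matched against Y, and the X words are candidates.
    """
    if small_alpha:
        candidates = [(d, y) for (x, y), d in pair_db.items() if x == v]
    else:
        candidates = [(d, x) for (x, y), d in pair_db.items() if y == v]
    return [w for _, w in sorted(candidates)[:top_k]]  # ascending distance


pair_db = {  # hypothetical distances
    ("精神病", "鬱病"): 0.1,
    ("精神病", "躁病"): 0.2,
    ("精神病", "研究"): 0.9,
    ("研究", "論文"): 0.3,
}
print(lookup_related(pair_db, "精神病", 2))                   # ['鬱病', '躁病']
print(lookup_related(pair_db, "論文", 1, small_alpha=False))  # ['研究']
```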

第２形態素解析結果取得部２３の処理内容は、第１形態素解析結果取得部１３と同一である。 The processing content of the second morpheme analysis result acquisition unit 23 is the same as that of the first morpheme analysis result acquisition unit 13.

第１検索部２４の処理内容は、第１検索部１４と同一である。 The processing content of the first search unit 24 is the same as that of the first search unit 14.

表示部２６の処理内容は、表示部１６と同一である。 The processing content of the display unit 26 is the same as that of the display unit 16.

The second search means of claim 4 executes a document search after adding, from among the displayed post-replacement morpheme analysis results, the one selected by the user through an operation such as clicking to the morpheme analysis result obtained by the morpheme analysis unit 11 or 21. The processing after the user's selection is the same as that of the first search units 14 and 24.

上記の図３及び図６に示す装置の各構成要素の動作をプログラムとして構築し、文書検索装置として利用されるコンピュータにインストールし、ＣＰＵ等の手段で実施する、または、ネットワークを介して流通させることが可能である。 The operation of each component of the apparatus shown in FIGS. 3 and 6 is constructed as a program, installed in a computer used as a document search apparatus, and implemented by means such as a CPU, or distributed via a network. It is possible.

また、構築されたプログラムをハードディスクや、フレキシブルディスク・ＣＤ−ＲＯＭ等の可搬記録媒体に格納し、コンピュータにインストールする、または、配布することが可能である。 Further, the constructed program can be stored in a portable recording medium such as a hard disk, a flexible disk, or a CD-ROM, and installed in a computer or distributed.

なお、本発明は、上記の実施の形態に限定されることなく、特許請求の範囲内において種々変更・応用が可能である。 The present invention is not limited to the above-described embodiment, and various modifications and applications can be made within the scope of the claims.

本発明は、文書の検索技術に適用可能である。 The present invention is applicable to a document search technique.

FIG. 1 is a diagram showing the principle configuration of the present invention.
FIG. 2 is a diagram for explaining the principle of the present invention.
FIG. 3 is a configuration diagram of the search device according to the first embodiment of the present invention.
FIG. 4 is an example of the co-occurrence matrix generated in the first embodiment of the present invention.
FIG. 5 is a flowchart of the operation of the first related word acquisition unit in the first embodiment of the present invention.
FIG. 6 is a configuration diagram of the search device according to the second embodiment of the present invention.
FIG. 7 is an example (part 1) of the inter-word relation database in the second embodiment of the present invention.
FIG. 8 is an example (part 2) of the inter-word relation database in the second embodiment of the present invention.

Explanation of symbols

11 morpheme analysis means, morpheme analysis unit
12 first related word acquisition means, first related word acquisition unit
13 first morpheme analysis result acquisition means, first morpheme analysis result acquisition unit
14 first search means, first search unit
15 word vector database
16 display unit
21 morpheme analysis unit
22 second related word acquisition unit
23 second morpheme analysis result acquisition unit
24 first search unit
25 inter-word relation database
26 display unit

Claims

A document search apparatus comprising:
a word vector database in which an arbitrary word A is associated with a word vector whose component values are each a relative value of the co-occurrence frequency of the word A with the word or word semantic attribute corresponding to that component;
morpheme analysis means for performing morphological analysis on a search input sentence;
first related word acquisition means for calculating, with respect to a word B in the morpheme analysis result obtained by the morpheme analysis means, the α divergence distance between the word vector of the word B in the word vector database and the word vector of an arbitrary word C in the word vector database, with the threshold of α in the equation expressing that distance set to less than 0.5, and for acquiring one or more words C with a small distance as related words of the word B;
first morpheme analysis result acquisition means for acquiring a post-replacement morpheme analysis result obtained by replacing the word B in the morpheme analysis result obtained by the morpheme analysis means with the related words acquired for the word B by the first related word acquisition means; and
first search means for executing a document search after adding the post-replacement morpheme analysis result acquired by the first morpheme analysis result acquisition means to the morpheme analysis result obtained by the morpheme analysis means.

A document search apparatus comprising:
an inter-word relation database that is generated by referring to a word vector database, in which an arbitrary word A is associated with a word vector whose component values are each a relative value of the co-occurrence frequency of the word A with the word or word semantic attribute corresponding to that component, and calculating, for an arbitrary word B, the α divergence distance between the word vector of the word B in the word vector database and the word vector of an arbitrary word C in the word vector database, with the threshold of α in the equation expressing that distance set to less than 0.5, and that stores either the distance or the degree of association obtained by the calculation together with the ordered word pair B, C, or, for each word, one or more words with a small distance or a large degree of association;
morpheme analysis means for performing morphological analysis on a search input sentence;
second related word acquisition means for acquiring from the inter-word relation database, with respect to a word D in the morpheme analysis result obtained by the morpheme analysis means, one or more words E with a small distance to, or a large degree of association with, the word D as related words of the word D;
second morpheme analysis result acquisition means for acquiring a post-replacement morpheme analysis result obtained by replacing the word D in the morpheme analysis result obtained by the morpheme analysis means with the related words acquired for the word D by the second related word acquisition means; and
first search means for executing a document search after adding the post-replacement morpheme analysis result acquired by the second morpheme analysis result acquisition means to the morpheme analysis result obtained by the morpheme analysis means.

A document search apparatus comprising:
a word vector database in which an arbitrary word A is associated with a word vector whose component values are each a relative value of the co-occurrence frequency of the word A with the word or word semantic attribute corresponding to that component;
morpheme analysis means for performing morphological analysis on a search input sentence;
first related word acquisition means for calculating, with respect to a word B in the morpheme analysis result obtained by the morpheme analysis means, the α divergence distance between the word vector of the word B in the word vector database and the word vector of an arbitrary word C in the word vector database, with the threshold of α in the equation expressing that distance set to less than 0.5, and for acquiring one or more words C with a small distance as related words of the word B;
first morpheme analysis result acquisition means for acquiring a post-replacement morpheme analysis result obtained by replacing the word B in the morpheme analysis result obtained by the morpheme analysis means with the related words acquired for the word B by the first related word acquisition means;
display means for displaying the post-replacement morpheme analysis result acquired by the first morpheme analysis result acquisition means; and
second search means for executing a document search after adding the post-replacement morpheme analysis result selected by the user to the morpheme analysis result obtained by the morpheme analysis means.

A document search apparatus comprising:
an inter-word relation database that is generated by referring to a word vector database, in which an arbitrary word A is associated with a word vector whose component values are each a relative value of the co-occurrence frequency of the word A with the word or word semantic attribute corresponding to that component, and calculating, for an arbitrary word B, the α divergence distance between the word vector of the word B in the word vector database and the word vector of an arbitrary word C in the word vector database, with the threshold of α in the equation expressing that distance set to less than 0.5, and that stores either the distance or the degree of association obtained by the calculation together with the ordered word pair B, C, or, for each word, one or more words with a small distance or a large degree of association;
morpheme analysis means for performing morphological analysis on a search input sentence;
second related word acquisition means for acquiring from the inter-word relation database, with respect to a word D in the morpheme analysis result obtained by the morpheme analysis means, one or more words E with a small distance to, or a large degree of association with, the word D as related words of the word D;
second morpheme analysis result acquisition means for acquiring a post-replacement morpheme analysis result obtained by replacing the word D in the morpheme analysis result obtained by the morpheme analysis means with the related words acquired for the word D by the second related word acquisition means;
display means for displaying the post-replacement morpheme analysis result acquired by the second morpheme analysis result acquisition means; and
second search means for executing a document search after adding the post-replacement morpheme analysis result selected by the user to the morpheme analysis result obtained by the morpheme analysis means.

A morpheme analysis step in which morpheme analysis means performs morpheme analysis on a search input sentence;
A first related word acquisition step in which first related word acquisition means, for a word B in the morpheme analysis result obtained in the morpheme analysis step, uses a word vector database that associates with an arbitrary word A a word vector whose value in each component is the relative value of the co-occurrence frequency between the word A and the word or word semantic attribute corresponding to that component, and, where the α-divergence distance between the word vector of the word B in the word vector database and the word vector of an arbitrary word C in the word vector database is expressed as

calculates the α-divergence distance with the threshold value of α in that expression set to less than 0.5, and acquires one or more words C having a small distance as related words of the word B;
A first morpheme analysis result acquisition step in which first morpheme analysis result acquisition means acquires a post-replacement morpheme analysis result obtained by replacing the word B in the morpheme analysis result obtained in the morpheme analysis step with the related word of the word B acquired in the first related word acquisition step;
A first search step in which first search means executes a document search after adding the post-replacement morpheme analysis result acquired in the first morpheme analysis result acquisition step to the morpheme analysis result obtained in the morpheme analysis step;
A document search method characterized by performing the above steps.
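
The first related word acquisition, replacement, and search-preparation steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the formula for the α-divergence distance is not reproduced in this text, so the sketch assumes one commonly used form, D_α(p‖q) = (1 − Σ_i p_i^α q_i^(1−α)) / (α(1−α)), for normalized vectors with 0 < α < 1. The function names, the default α = 0.3, and the toy word vectors are all hypothetical.

```python
def alpha_divergence(p, q, alpha=0.3):
    """Assumed form: (1 - sum_i p_i^alpha * q_i^(1-alpha)) / (alpha * (1-alpha))."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("this sketch only supports 0 < alpha < 1")
    overlap = sum((pi ** alpha) * (qi ** (1.0 - alpha)) for pi, qi in zip(p, q))
    return (1.0 - overlap) / (alpha * (1.0 - alpha))


def related_words(word_b, vectors, alpha=0.3, k=1):
    """Return the k words C whose vectors are closest to the vector of word B."""
    vec_b = vectors[word_b]
    scored = sorted(
        (alpha_divergence(vec_b, vec_c, alpha), word_c)
        for word_c, vec_c in vectors.items()
        if word_c != word_b
    )
    return [word_c for _, word_c in scored[:k]]


def expand_query(morphemes, word_b, vectors, alpha=0.3, k=1):
    """Original morpheme list plus one copy per related word substituted for B."""
    queries = [list(morphemes)]
    for related in related_words(word_b, vectors, alpha, k):
        queries.append([related if m == word_b else m for m in morphemes])
    return queries


# Made-up relative co-occurrence frequencies (each vector sums to 1).
VECTORS = {
    "psychosis": [0.50, 0.40, 0.10],
    "depression": [0.45, 0.45, 0.10],
    "weather": [0.05, 0.05, 0.90],
}

expanded = expand_query(["psychosis", "treatment"], "psychosis", VECTORS)
```

Each list in `expanded` would then be passed to whatever search backend is in use; the claim only requires that the substituted result be added to the original before the search is run.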

A morpheme analysis step in which morpheme analysis means performs morpheme analysis on a search input sentence;
A second related word acquisition step in which second related word acquisition means acquires, for a word D in the morpheme analysis result obtained in the morpheme analysis step, one or more words E having a small distance from, or a high degree of relevance to, the word D as related words of the word D from an inter-word relation database, the inter-word relation database being one for which a word vector database that associates with an arbitrary word A a word vector whose value in each component is the relative value of the co-occurrence frequency between the word A and the word or word semantic attribute corresponding to that component is referred to and, where the α-divergence distance between the word vector of an arbitrary word B in the word vector database and the word vector of an arbitrary word C in the word vector database is expressed as

the α-divergence distance is calculated with the threshold value of α in that expression set to less than 0.5, and which either stores the distance, or a degree of relevance obtained by the calculation, together with the ordered word pair B, C, or stores for each word one or more words having a small distance or a high degree of relevance;
A second morpheme analysis result acquisition step in which second morpheme analysis result acquisition means acquires a post-replacement morpheme analysis result obtained by replacing the word D in the morpheme analysis result obtained in the morpheme analysis step with the related word of the word D acquired by the second related word acquisition means;
A first search step in which first search means executes a document search after adding the post-replacement morpheme analysis result acquired in the second morpheme analysis result acquisition step to the morpheme analysis result obtained in the morpheme analysis step;
A document search method characterized by performing the above steps.
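
The difference from the preceding method is that the distances are computed ahead of time and stored in an inter-word relation database, so that the search-time step is only a lookup. A minimal sketch of the "store, for each word, the words with a small distance" variant follows. The α-divergence form is the same assumed one, (1 − Σ_i p_i^α q_i^(1−α)) / (α(1−α)), since the formula is not reproduced in this text; the dictionary layout, function names, and toy vectors are hypothetical.

```python
def alpha_divergence(p, q, alpha=0.3):
    """Assumed alpha-divergence form for normalized vectors, 0 < alpha < 1."""
    overlap = sum((pi ** alpha) * (qi ** (1.0 - alpha)) for pi, qi in zip(p, q))
    return (1.0 - overlap) / (alpha * (1.0 - alpha))


def build_relation_db(vectors, alpha=0.3, k=2):
    """Precompute, for every word B, its k nearest words C with their distances."""
    db = {}
    for word_b, vec_b in vectors.items():
        scored = sorted(
            (alpha_divergence(vec_b, vec_c, alpha), word_c)
            for word_c, vec_c in vectors.items()
            if word_c != word_b
        )
        db[word_b] = [(word_c, dist) for dist, word_c in scored[:k]]
    return db


def lookup_related(db, word_d, max_words=1):
    """Search-time step: read the stored related words E for word D."""
    return [word_e for word_e, _ in db.get(word_d, [])[:max_words]]


# Made-up relative co-occurrence frequencies (each vector sums to 1).
VECTORS = {
    "psychosis": [0.50, 0.40, 0.10],
    "depression": [0.45, 0.45, 0.10],
    "weather": [0.05, 0.05, 0.90],
}

relation_db = build_relation_db(VECTORS, k=1)
related = lookup_related(relation_db, "psychosis")
```

Because this assumed divergence is not symmetric for α ≠ 0.5, the entry for B listing C does not imply that the entry for C lists B, which is consistent with the claim speaking of ordered word pairs.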

A morpheme analysis step in which morpheme analysis means performs morpheme analysis on a search input sentence;
A first related word acquisition step in which first related word acquisition means, for a word B in the morpheme analysis result obtained in the morpheme analysis step, uses a word vector database that associates with an arbitrary word A a word vector whose value in each component is the relative value of the co-occurrence frequency between the word A and the word or word semantic attribute corresponding to that component, and, where the α-divergence distance between the word vector of the word B in the word vector database and the word vector of an arbitrary word C in the word vector database is expressed as

calculates the α-divergence distance with the threshold value of α in that expression set to less than 0.5, and acquires one or more words C having a small distance as related words of the word B;
A first morpheme analysis result acquisition step in which first morpheme analysis result acquisition means acquires a post-replacement morpheme analysis result obtained by replacing the word B in the morpheme analysis result obtained in the morpheme analysis step with the related word of the word B acquired in the first related word acquisition step;
A display step in which display means displays the post-replacement morpheme analysis result acquired in the first morpheme analysis result acquisition step;
A second search step in which second search means executes a document search after adding the post-replacement morpheme analysis result selected by the user to the morpheme analysis result obtained in the morpheme analysis step;
A document search method characterized by performing the above steps.
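
The display step and the second search step above differ from the automatic variant only in that the user chooses which substituted results are added to the query. A minimal sketch of that flow, assuming the related words have already been acquired, is shown below. The function names, the 1-based numbering of displayed candidates, and the example words are hypothetical.

```python
def build_candidates(morphemes, target, related_words):
    """One post-replacement morpheme list per related word of the target word."""
    return [[r if m == target else m for m in morphemes] for r in related_words]


def format_candidates(candidates):
    """Lines a display means could show, numbered from 1 for user selection."""
    return [f"{i}: {' '.join(c)}" for i, c in enumerate(candidates, start=1)]


def queries_from_selection(morphemes, candidates, selected_numbers):
    """Original morpheme list plus only the candidates the user selected."""
    chosen = [
        candidates[n - 1] for n in selected_numbers if 1 <= n <= len(candidates)
    ]
    return [list(morphemes)] + chosen


original = ["psychosis", "treatment"]
candidates = build_candidates(original, "psychosis", ["depression", "neurosis"])
shown = format_candidates(candidates)
queries = queries_from_selection(original, candidates, [2])
```

Out-of-range selections are ignored here so that a stray number does not abort the search; a real interface might instead report the error to the user.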

A morpheme analysis step in which morpheme analysis means performs morpheme analysis on a search input sentence;
A second related word acquisition step in which second related word acquisition means acquires, for a word D in the morpheme analysis result obtained in the morpheme analysis step, one or more words E having a small distance from, or a high degree of relevance to, the word D as related words of the word D from an inter-word relation database, the inter-word relation database being one for which a word vector database that associates with an arbitrary word A a word vector whose value in each component is the relative value of the co-occurrence frequency between the word A and the word or word semantic attribute corresponding to that component is referred to and, where the α-divergence distance between the word vector of an arbitrary word B in the word vector database and the word vector of an arbitrary word C in the word vector database is expressed as

the α-divergence distance is calculated with the threshold value of α in that expression set to less than 0.5, and which either stores the distance, or a degree of relevance obtained by the calculation, together with the ordered word pair B, C, or stores for each word one or more words having a small distance or a high degree of relevance;
A second morpheme analysis result acquisition step in which second morpheme analysis result acquisition means acquires a post-replacement morpheme analysis result obtained by replacing the word D in the morpheme analysis result obtained in the morpheme analysis step with the related word of the word D acquired by the second related word acquisition means;
A display step in which display means displays the post-replacement morpheme analysis result acquired in the second morpheme analysis result acquisition step;
A second search step in which second search means executes a document search after adding the post-replacement morpheme analysis result selected by the user to the morpheme analysis result obtained in the morpheme analysis step;
A document search method characterized by performing the above steps.

A document search program for causing a computer to function as each of the means constituting the document search apparatus according to any one of claims 1 to 4.

A computer-readable recording medium storing the program according to claim 9.