JP6773585B2

JP6773585B2 - Document processing equipment, document processing methods and programs

Info

Publication number: JP6773585B2
Application number: JP2017031043A
Authority: JP
Inventors: 原　正巳; 正巳原; 松永　務; 務松永
Original assignee: NTT Data Corp
Current assignee: NTT Data Corp
Priority date: 2017-02-22
Filing date: 2017-02-22
Publication date: 2020-10-21
Anticipated expiration: 2037-02-22
Also published as: JP2018136760A

Description

本発明は、文書ベクトルを生成するための装置、方法及びプログラムに関する。 The present invention relates to devices, methods and programs for generating document vectors.

従来、自然言語処理の分野において、文書の内容を定量的に表す表現として文書ベクトルが知られている。例えば、非特許文献１には、文書検索技術として、検索対象となる各文書をＴＦ−ＩＤＦ法を用いて文書ベクトルとして表現し、質問文書の文書ベクトルとのコサイン類似度を算出することによって類似度の高い文書を検索するという方法が記載されている。文書ベクトルは、文書検索以外にも文書分類や質問応答等の他の自然言語処理技術においても利用されている。 Conventionally, in the field of natural language processing, a document vector is known as an expression that quantitatively expresses the contents of a document. For example, in Non-Patent Document 1, as a document retrieval technique, each document to be searched is expressed as a document vector using the TF-IDF method, and the cosine similarity with the document vector of the question document is calculated to be similar. It describes how to search for high-quality documents. In addition to document retrieval, document vectors are also used in other natural language processing technologies such as document classification and question answering.

梅澤香矢乃、小林一郎、「文書ベクトルの次元削減に基づく有効な類似文書判定への取り組み」、情報処理学会全国大会講演論文集、情報処理学会、2011年3月2日、第73巻、第2号、p.2,393-2,394Kayano Umezawa, Ichiro Kobayashi, "Efforts to Determine Effective Similar Documents Based on Dimensional Reduction of Document Vectors", IPSJ National Conference Proceedings, IPSJ, March 2, 2011, Volume 73, No. 2, p.2,393-2,394

しかし、従来の文書ベクトルは、文書中に出現する個々の単語の頻度情報しか考慮しないことから、出現する単語が共通する文書間では、単語間の係り受け構造が異なっていたとしても同一の文書ベクトルになってしまうという問題があった。 However, since the conventional document vector considers only the frequency information of individual words appearing in the document, the same document is used even if the dependency structure between the words is different between the documents having the same appearing words. There was a problem that it became a vector.

本発明は、このような事情に鑑みてなされたものであり、文書中に出現する単語間の係り受け構造を反映した文書ベクトルを生成することを目的とする。 The present invention has been made in view of such circumstances, and an object of the present invention is to generate a document vector that reflects a dependency structure between words appearing in a document.

上記の課題を解決するため、本発明に係る文書処理装置は、文書集合を構成する各文書を解析して単語を抽出する解析部と、前記文書集合を構成する各文書について、前記解析部により抽出された各単語の係り受け関係を抽出する係り受け関係抽出部と、前記文書集合を構成する一の対象文書において、前記係り受け関係抽出部により抽出された各係り受け関係が出現する出現頻度を算出する出現頻度算出部と、前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、前記文書集合において係り受け関係が出現する文書数を算出する出現文書数算出部と、前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、前記文書集合を構成する文書数を、前記出現文書数算出部により算出された文書の数で除して逆文書頻度を算出する逆文書頻度算出部と、前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、前記出現頻度算出部により算出された頻度に、前記逆文書頻度算出部により算出された逆文書頻度を乗じて重要度を算出する重要度算出部と、前記解析部により抽出された各単語が行及び列に対応付けられた行列であって、前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、係り受け関係を構成する係り元の単語と係り先の単語のうち一方の行と他方の列に対応する成分として当該係り受け関係について前記重要度算出部により算出された重要度が割り当てられた行列を生成する重要度行列生成部と、前記重要度行列生成部により生成された行列の各成分を連結して前記対象文書の文書ベクトルを生成する文書ベクトル生成部とを備える。 In order to solve the above problems, the document processing apparatus according to the present invention has an analysis unit that analyzes each document that constitutes the document set and extracts words, and the analysis unit that analyzes each document that constitutes the document set. In the dependency relationship extraction unit that extracts the dependency relations of each extracted word and the one target document that constitutes the document set, the frequency of occurrence of each dependency relation extracted by the dependency relation extraction unit appears. And the appearance frequency calculation unit that calculates the number of documents in which the dependency relationship appears in the document set for each dependency relationship extracted by the dependency relationship extraction unit for the target document. For each dependency relationship extracted by the dependency relationship extraction unit for the target document, the reverse document is obtained by dividing the number of documents constituting the document set by the number of documents calculated by the appearance document number calculation unit. For each dependency relationship extracted by the dependency relationship extraction unit for the target document and the inverse document frequency calculation unit for calculating the frequency, the frequency calculated by the appearance frequency calculation unit is added to the frequency calculated by the inverse document frequency calculation unit. The importance calculation unit that calculates the importance by multiplying the calculated inverse document frequency, and the matrix in which each word extracted by the analysis unit is associated with rows and columns, and the dependency on the target document. For each dependency relationship extracted by the relationship extraction unit, the importance of the dependency relationship as a component corresponding to one row and the other column of the original word and the destination word constituting the dependency relationship. A document vector of the target document is generated by concatenating each component of the matrix generated by the importance matrix generation unit and the importance matrix generation unit that generates the matrix to which the importance calculated by the calculation unit is assigned. It includes a document vector generator.

好ましい態様において、上記の文書処理装置は、前記重要度行列生成部により生成された行列の次元を、所定の次元圧縮方法を用いて圧縮する次元圧縮部をさらに備え、前記文書ベクトル生成部は、前記次元圧縮部により次元を圧縮された行列の各成分を連結して前記対象文書の文書ベクトルを生成する。 In a preferred embodiment, the document processing apparatus further comprises a dimension compression unit that compresses the dimensions of the matrix generated by the importance matrix generation unit using a predetermined dimension compression method, and the document vector generation unit A document vector of the target document is generated by concatenating each component of the matrix whose dimensions are compressed by the dimension compression unit.

さらに好ましい態様において、上記の文書処理装置は、前記文書ベクトル生成部により生成された前記対象文書の文書ベクトルと、他の文書について前記文書ベクトル生成部により生成された文書ベクトルとに基づいて、所定の類似度計算方法を用いて、前記対象文書と前記他の文書の類似度を判定する類似度判定部をさらに備える。 In a more preferred embodiment, the document processing apparatus determines the document vector of the target document generated by the document vector generation unit and the document vector generated by the document vector generation unit for other documents. A similarity determination unit for determining the similarity between the target document and the other document is further provided by using the similarity calculation method of.

また、本発明に係る文書処理方法は、文書処理装置により実行される文書処理方法であって、文書集合を構成する各文書を解析して単語を抽出する解析ステップと、前記文書集合を構成する各文書について、前記解析ステップにより抽出された各単語の係り受け関係を抽出する係り受け関係抽出ステップと、前記文書集合を構成する一の対象文書において、前記係り受け関係抽出ステップにより抽出された各係り受け関係が出現する出現頻度を算出する出現頻度算出ステップと、前記対象文書について前記係り受け関係抽出ステップにより抽出された各係り受け関係について、前記文書集合において係り受け関係が出現する文書数を算出する出現文書数算出ステップと、前記対象文書について前記係り受け関係抽出ステップにより抽出された各係り受け関係について、前記文書集合を構成する文書数を、前記出現文書数算出ステップにより算出された文書の数で除して逆文書頻度を算出する逆文書頻度算出ステップと、前記対象文書について前記係り受け関係抽出ステップにより抽出された各係り受け関係について、前記出現頻度算出ステップにより算出された頻度に、前記逆文書頻度算出ステップにより算出された逆文書頻度を乗じて重要度を算出する重要度算出ステップと、前記解析ステップにより抽出された各単語が行及び列に対応付けられた行列であって、前記対象文書について前記係り受け関係抽出ステップにより抽出された各係り受け関係について、係り受け関係を構成する係り元の単語と係り先の単語のうち一方の行と他方の列に対応する成分として当該係り受け関係について前記重要度算出ステップにより算出された重要度が割り当てられた行列を生成する重要度行列生成ステップと、前記重要度行列生成ステップにより生成された行列の各成分を連結して前記対象文書の文書ベクトルを生成する文書ベクトル生成ステップとを備える。 Further, the document processing method according to the present invention is a document processing method executed by a document processing apparatus, and constitutes an analysis step of analyzing each document constituting the document set and extracting words, and the document set. For each document, each of the dependency relationship extraction step for extracting the dependency relationship of each word extracted by the analysis step and the dependency relationship extraction step for one target document constituting the document set. For the appearance frequency calculation step for calculating the appearance frequency at which the dependency relationship appears, and for each dependency relationship extracted by the dependency relationship extraction step for the target document, the number of documents in which the dependency relationship appears in the document set is calculated. For each dependency relationship extracted by the dependency relationship extraction step for the target document and the appearance document number calculation step to be calculated, the number of documents constituting the document set is calculated by the appearance document number calculation step. The reverse document frequency calculation step for calculating the reverse document frequency by dividing by the number of, and the frequency calculated by the appearance frequency calculation step for each dependency relationship extracted by the dependency relationship extraction step for the target document. , The importance calculation step of multiplying the reverse document frequency calculated by the reverse document frequency calculation step to calculate the importance, and the matrix in which each word extracted by the analysis step is associated with rows and columns. As a component corresponding to one row and the other column of the dependency word and the dependency word constituting the dependency relationship for each dependency relationship extracted by the dependency relationship extraction step for the target document. The importance matrix generation step for generating a matrix to which the importance calculated by the importance calculation step is assigned to the dependency relationship and each component of the matrix generated by the importance matrix generation step are connected to each other. It includes a document vector generation step for generating a document vector of the target document.

また、本発明に係るプログラムは、コンピュータを、文書集合を構成する各文書を解析して単語を抽出する解析部と、前記文書集合を構成する各文書について、前記解析部により抽出された各単語の係り受け関係を抽出する係り受け関係抽出部と、前記文書集合を構成する一の対象文書において、前記係り受け関係抽出部により抽出された各係り受け関係が出現する出現頻度を算出する出現頻度算出部と、前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、前記文書集合において係り受け関係が出現する文書数を算出する出現文書数算出部と、前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、前記文書集合を構成する文書数を、前記出現文書数算出部により算出された文書の数で除して逆文書頻度を算出する逆文書頻度算出部と、前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、前記出現頻度算出部により算出された頻度に、前記逆文書頻度算出部により算出された逆文書頻度を乗じて重要度を算出する重要度算出部と、前記解析部により抽出された各単語が行及び列に対応付けられた行列であって、前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、係り受け関係を構成する係り元の単語と係り先の単語のうち一方の行と他方の列に対応する成分として当該係り受け関係について前記重要度算出部により算出された重要度が割り当てられた行列を生成する重要度行列生成部と、前記重要度行列生成部により生成された行列の各成分を連結して前記対象文書の文書ベクトルを生成する文書ベクトル生成部として機能させる。 Further, in the program according to the present invention, the computer has an analysis unit that analyzes each document constituting the document set and extracts words, and each word extracted by the analysis unit for each document constituting the document set. Frequency of occurrence of calculating the frequency of appearance of each dependency extracted by the dependency extraction unit in the dependency relationship extraction unit that extracts the dependency relationship and one target document that constitutes the document set. Regarding the calculation unit, the appearance document number calculation unit that calculates the number of documents in which the dependency relationship appears in the document set, and the target document for each dependency relationship extracted by the dependency relationship extraction unit for the target document. For each dependency relationship extracted by the dependency relationship extraction unit, the inverse document frequency is calculated by dividing the number of documents constituting the document set by the number of documents calculated by the appearance document number calculation unit. For the document frequency calculation unit and each dependency relationship extracted by the dependency relationship extraction unit for the target document, the frequency calculated by the appearance frequency calculation unit is the reverse document calculated by the reverse document frequency calculation unit. An importance calculation unit that calculates the importance by multiplying the frequency, and a matrix in which each word extracted by the analysis unit is associated with rows and columns, and the target document is extracted by the dependency relationship extraction unit. For each of the dependency relationships, the importance calculation unit calculates the dependency relationships as components corresponding to one row and the other column of the original word and the destination word that make up the dependency relationship. As a document vector generation unit that generates a document vector of the target document by concatenating each component of the matrix generated by the importance matrix generation unit and the importance matrix generation unit that generates a matrix to which the importance is assigned. Make it work.

本発明によれば、文書中に出現する単語間の係り受け構造を反映した文書ベクトルを生成することができる。 According to the present invention, it is possible to generate a document vector that reflects the dependency structure between words appearing in the document.

文書処理装置１の構成の一例を示す図である。It is a figure which shows an example of the structure of the document processing apparatus 1. 文書ベクトル生成処理の一例を示すフロー図である。It is a flow chart which shows an example of the document vector generation processing. 形態素解析部１０２による解析結果の一例を示す図である。It is a figure which shows an example of the analysis result by the morphological analysis unit 102. 係り受け解析部１０３による解析結果の一例を示す図である。It is a figure which shows an example of the analysis result by the dependency analysis unit 103. 係り受け関係記憶部１０５に記録されるテーブルの一例を示す図である。It is a figure which shows an example of the table recorded in the dependency relation storage part 105. 文書集合Ｄを構成する単語群の一例を示す図である。It is a figure which shows an example of the word group which comprises the document set D. 重要度行列の一例を示す図である。It is a figure which shows an example of the importance matrix. 重要度行列を構成する各行ベクトルの一例を示す図である。It is a figure which shows an example of each row vector which constitutes the importance matrix.

１．実施形態
１−１．構成
図１は、本発明の一実施形態に係る文書処理装置１の構成の一例を示す図である。文書処理装置１は、ＣＰＵ等の演算処理装置と、ＨＤＤ等の記憶装置と、ＮＩＣ等の通信装置を備え、文書データベース２からインターネット等の通信回線３を介して文書を取得して、当該文書について文書ベクトルを生成するためのコンピュータである。文書処理装置１は、文書入力部１０１と、形態素解析部１０２と、係り受け解析部１０３と、係り受け関係抽出部１０４と、係り受け関係記憶部１０５と、出現頻度算出部１０６と、出現文書数算出部１０７と、重要度算出部１０８と、重要度行列生成部１０９と、文書ベクトル生成部１１０と、文書ベクトル記憶部１１１という機能を備える。これらの機能のうち、文書入力部１０１は通信装置により実現され、係り受け関係記憶部１０５と文書ベクトル記憶部１１１は記憶装置により実現され、その他の機能は、記憶装置に記憶されるプログラムを演算処理装置が実行することにより実現される。なお、文書処理装置１は、通信回線３により相互に接続される複数のサーバ装置により構成されてもよい。 1. 1. Embodiment 1-1. Configuration FIG. 1 is a diagram showing an example of the configuration of the document processing apparatus 1 according to the embodiment of the present invention. The document processing device 1 includes an arithmetic processing unit such as a CPU, a storage device such as an HDD, and a communication device such as a NIC, obtains a document from a document database 2 via a communication line 3 such as the Internet, and obtains the document. Is a computer for generating document vectors about. The document processing device 1 includes a document input unit 101, a morphological analysis unit 102, a dependency analysis unit 103, a dependency relationship extraction unit 104, a dependency relationship storage unit 105, an appearance frequency calculation unit 106, and an appearance document. It has functions of a number calculation unit 107, an importance calculation unit 108, an importance matrix generation unit 109, a document vector generation unit 110, and a document vector storage unit 111. Among these functions, the document input unit 101 is realized by the communication device, the dependency relationship storage unit 105 and the document vector storage unit 111 are realized by the storage device, and the other functions calculate the program stored in the storage device. It is realized by the processing device executing. The document processing device 1 may be composed of a plurality of server devices connected to each other by a communication line 3.

文書処理装置１が備える機能のうち、文書入力部１０１は、文書データベース２に記憶される文書データを、通信回線３を介して取得する。 Among the functions provided in the document processing device 1, the document input unit 101 acquires the document data stored in the document database 2 via the communication line 3.

形態素解析部１０２は、文書入力部１０１により取得された文書データにより表される文書を形態素解析して、当該文書を構成する単語を抽出する。形態素解析には、例えば、MeCab（http://taku910.github.io/mecab/）や、KyTea（http://www.phontron.com/kytea/）等の周知の形態素解析器を使用してよい。形態素解析部１０２は、本発明に係る「解析部」の一例である。 The morphological analysis unit 102 performs morphological analysis on a document represented by the document data acquired by the document input unit 101, and extracts words constituting the document. For morphological analysis, for example, a well-known morphological analyzer such as MeCab (http://taku910.github.io/mecab/) or KyTea (http://www.phontron.com/kytea/) is used. Good. The morphological analysis unit 102 is an example of the “analysis unit” according to the present invention.

係り受け解析部１０３は、形態素解析部１０２による解析結果に基づいて係り受け解析を行い、処理対象の文書を構成する各文節の係り受け構造（言い換えると係り受け木）を抽出する。係り受け解析には、例えば、CaboCha（https://taku910.github.io/cabocha/）や、KNP（http://nlp.ist.i.kyoto-u.ac.jp/index.php?KNP）等の周知の係り受け解析器を使用してよい。 The dependency analysis unit 103 performs a dependency analysis based on the analysis result by the morphological analysis unit 102, and extracts the dependency structure (in other words, the dependency tree) of each clause constituting the document to be processed. For dependency analysis, for example, CaboCha (https://taku910.github.io/cabocha/) and KNP (http://nlp.ist.i.kyoto-u.ac.jp/index.php?KNP) ) Etc., a well-known dependency analyzer may be used.

係り受け関係抽出部１０４は、係り受け解析部１０３による解析結果を参照して、処理対象の文書について、特定の品詞の単語からなる係り受け関係を抽出する。ここで特定の品詞とは、名詞、形容詞又は動詞である。係り受け関係を抽出すると、当該係り受け関係を構成する各単語に単語ＩＤを付与する。その際、同じ単語には同じ単語ＩＤが付与される。各単語に単語ＩＤを付与すると、係り受け関係を構成する係り元と係り先の単語と、それらの単語に付与された単語ＩＤの組とを対応付けて係り受け関係記憶部１０５に記録する。 The dependency relationship extraction unit 104 refers to the analysis result by the dependency analysis unit 103, and extracts the dependency relationship consisting of words of a specific part of speech for the document to be processed. Here, the specific part of speech is a noun, an adjective, or a verb. When the dependency relationship is extracted, a word ID is given to each word constituting the dependency relationship. At that time, the same word ID is given to the same word. When a word ID is assigned to each word, the dependency source and destination words constituting the dependency relationship and the set of word IDs assigned to those words are associated and recorded in the dependency relationship storage unit 105.

出現頻度算出部１０６は、係り受け関係記憶部１０５を参照して、処理対象の文書において特定された各係り受け関係が当該文書において出現する出現頻度（言い換えると出現回数）を算出する。係り受け関係について出願頻度を算出すると、算出した出現頻度を当該係り受け関係と対応付けて係り受け関係記憶部１０５に記録する。なお、出現頻度の算出後、係り受け関係記憶部１０５に重複して登録されている係り受け関係については、そのレコードを削除してよい。 The appearance frequency calculation unit 106 refers to the dependency relationship storage unit 105, and calculates the appearance frequency (in other words, the number of occurrences) in which each dependency relationship specified in the document to be processed appears in the document. When the application frequency is calculated for the dependency relationship, the calculated appearance frequency is recorded in the dependency relationship storage unit 105 in association with the dependency relationship. After calculating the frequency of appearance, the record of the dependency relationship registered in the dependency relationship storage unit 105 may be deleted.

出現文書数算出部１０７は、係り受け関係記憶部１０５を参照して、処理対象の文書において特定された各係り受け関係について、文書集合において係り受け関係が出現する文書数を算出する。係り受け関係について文書数を算出すると、算出した文書数を当該係り受け関係と対応付けて係り受け関係記憶部１０５に記録する。 The appearance document number calculation unit 107 calculates the number of documents in which the dependency relationship appears in the document set for each dependency relationship specified in the document to be processed with reference to the dependency relationship storage unit 105. When the number of documents is calculated for the dependency relationship, the calculated number of documents is recorded in the dependency relationship storage unit 105 in association with the dependency relationship.

重要度算出部１０８は、係り受け関係記憶部１０５を参照して、処理対象の文書において特定された各係り受け関係について、数１の式を用いて重要度を算出する。

The importance calculation unit 108 calculates the importance of each dependency relationship specified in the document to be processed by using the equation of Equation 1 with reference to the dependency relationship storage unit 105.

数１の式において、Ｗは重要度を表し、ｍは、出現頻度算出部１０６により算出される出現頻度を表し、Ｎは文書集合を構成する文書数を表し、ｎは、出現文書数算出部１０７により算出される文書数を表す。なおここでｌｏｇは常用対数である。重要度算出部１０８は、数１の式に表されるように、文書集合を構成する文書数を、出現文書数算出部１０７により算出される文書数で除して逆文書頻度を算出し、この逆文書頻度に「１」を加算して対数をとって得た値に、出現頻度算出部１０６により算出される出現頻度を乗ずることで重要度を算出する。係り受け関係について重要度を算出すると、算出した重要度を当該係り受け関係と対応付けて係り受け関係記憶部１０５に記録する。重要度算出部１０８は、本発明に係る「逆文書頻度算出部」と「重要度算出部」の一例である。 In the formula of Equation 1, W represents the importance, m represents the appearance frequency calculated by the appearance frequency calculation unit 106, N represents the number of documents constituting the document set, and n is the appearance document number calculation unit. Represents the number of documents calculated by 107. Here, log is a common logarithm. The importance calculation unit 108 calculates the inverse document frequency by dividing the number of documents constituting the document set by the number of documents calculated by the appearance document number calculation unit 107, as expressed by the equation of Equation 1. The importance is calculated by multiplying the value obtained by adding "1" to the inverse document frequency and taking a logarithm by the appearance frequency calculated by the appearance frequency calculation unit 106. When the importance of the dependency relationship is calculated, the calculated importance is recorded in the dependency relationship storage unit 105 in association with the dependency relationship. The importance calculation unit 108 is an example of the “reverse document frequency calculation unit” and the “importance calculation unit” according to the present invention.

重要度行列生成部１０９は、係り受け関係記憶部１０５を参照して、処理対象の文書について重要度行列を生成する。この重要度行列は、文書集合から抽出された係り受け関係を構成する各単語が行及び列に対応付けられた行列であって、処理対象の文書において特定された各係り受け関係について、係り受け関係を構成する係り元の単語の行と係り先の列に対応する成分として当該係り受け関係の重要度が割り当てられた行列である。重要度が割り当てられない要素には「０」が割り当てられる。 The importance matrix generation unit 109 generates an importance matrix for the document to be processed with reference to the dependency relationship storage unit 105. This importance matrix is a matrix in which each word constituting the dependency relationship extracted from the document set is associated with rows and columns, and is a dependency for each dependency relationship specified in the document to be processed. It is a matrix in which the importance of the dependency relationship is assigned as a component corresponding to the row of the dependency word and the column of the dependency that compose the relationship. "0" is assigned to the element to which the importance is not assigned.

文書ベクトル生成部１１０は、重要度行列生成部１０９により生成された重要度行列の各行ベクトルを連結して文書ベクトルを生成する。文書ベクトルを生成すると、当該文書ベクトルが生成された文書の文書ＩＤと対応付けて文書ベクトル記憶部１１１に記録する。 The document vector generation unit 110 generates a document vector by concatenating each row vector of the importance matrix generated by the importance matrix generation unit 109. When the document vector is generated, the document vector is recorded in the document vector storage unit 111 in association with the document ID of the generated document.

１−２．動作
文書処理装置１により実行される文書処理方法について説明する。具体的には文書ベクトル生成処理について説明する。図２は、文書ベクトル生成処理の一例を示すフロー図である。なお、本動作例の説明では、２０件の文書からなる文書集合Ｄを処理する場合を想定する。 1-2. The document processing method executed by the operation document processing apparatus 1 will be described. Specifically, the document vector generation process will be described. FIG. 2 is a flow chart showing an example of the document vector generation process. In the description of this operation example, it is assumed that a document set D composed of 20 documents is processed.

文書入力部１０１により文書集合Ｄのデータが取得されると、形態素解析部１０２は、文書集合Ｄのうち、処理対象の文書を形態素解析して、当該文書を構成する単語を抽出する（Ｓ１）。図３は、処理対象の文書が、「季節に関係なく花が咲く植物を教えて。」という一文のみにより構成される文書（以下、「文書Ａ」）であった場合に形態素解析部１０２により出力される解析結果の一例を示す図である。同図に示す解析結果の各行は、単語の表層形、品詞、品詞細分類１、品詞細分類２、品詞細分類３、活用型、活用形、原形、読み及び発音により構成される。アスタリスクは、その情報が辞書に登録されていないことを示している。 When the data of the document set D is acquired by the document input unit 101, the morphological analysis unit 102 morphologically analyzes the document to be processed from the document set D and extracts the words constituting the document (S1). .. FIG. 3 shows the case where the document to be processed is a document composed of only one sentence "Tell me about plants that bloom regardless of the season" (hereinafter, "Document A") by the morphological analysis unit 102. It is a figure which shows an example of the output analysis result. Each line of the analysis result shown in the figure is composed of the surface form of the word, the part of speech, the part of speech subclassification 1, the part of speech subclassification 2, the part of speech subclassification 3, the inflected type, the inflected form, the original form, the reading and the pronunciation. An asterisk indicates that the information is not registered in the dictionary.

次に、係り受け解析部１０３は、ステップＳ１による解析結果に基づいて係り受け解析を行い、処理対象の文書を構成する各文節の係り受け構造を抽出する（Ｓ２）。図４は、処理対象の文書が上記の文書Ａであった場合に係り受け解析部１０３により出力される解析結果の一例を示す図である。同図に示す解析結果において、アスタリスクから始まる第１、４、６、８、１１、１３及び１６行は、文節番号と、係り先の文節番号（係り先なしの場合は「−１」）とにより構成され、その他の行は、単語の表層形、品詞、品詞細分類１、品詞細分類２、品詞細分類３、活用型、活用形、原形、読み及び発音により構成される。その他の行におけるアスタリスクは、その情報が辞書に登録されていないことを示している。同図に示す解析結果は、例えば、文節「季節に」が文節「関係」に係っていることを示している。 Next, the dependency analysis unit 103 performs a dependency analysis based on the analysis result in step S1 and extracts the dependency structure of each clause constituting the document to be processed (S2). FIG. 4 is a diagram showing an example of an analysis result output by the dependency analysis unit 103 when the document to be processed is the above-mentioned document A. In the analysis results shown in the figure, the 1st, 4th, 6th, 8th, 11th, 13th and 16th lines starting with the asterisk are the clause number and the clause number of the conjugation (“-1” in the case of no conjugation). The other lines are composed of the surface form of the word, the part of speech, the part of speech subclassification 1, the part of speech subclassification 2, the part of speech subclassification 3, the inflected type, the inflected form, the original form, the reading and the pronunciation. Asterisks in the other lines indicate that the information is not registered in the dictionary. The analysis results shown in the figure show, for example, that the phrase "seasonally" is related to the phrase "relationship".

次に、係り受け関係抽出部１０４は、ステップＳ２の解析結果を参照して、処理対象の文書について、名詞、形容詞又は動詞の単語からなる係り受け関係を抽出する（Ｓ３）。係り受け関係を抽出すると、当該係り受け関係を構成する各単語に単語ＩＤを付与し、係り受け関係を構成する係り元と係り先の単語と、それらの単語に付与された単語ＩＤの組とを対応付けて係り受け関係記憶部１０５に記録する。図５は、処理対象の文書が上記の文書Ａであった場合に係り受け関係記憶部１０５に記録されるテーブルの一例を示す図である。係り受け関係記憶部１０５に記録されるテーブルは文書ごとに作成され、同図に示すテーブルでは、例えば、係り受け関係「季節−関係」に対応付けて単語ＩＤの組「１，３」が記録されている。なお、本動作例の説明では、文書集合Ｄを構成する各文書は、その文書に含まれる名詞、形容詞又は動詞の原形について、図６に例示する「季節」、「鉢植え」、「関係」、「ない」、「花」、「植物」、「咲く」、「花壇」、「枯れる」及び「教える」の１０単語のうちの１以上の単語により構成されているものとする。また、これら１０単語には、図６に例示するように、「１」〜「１０」のうちのいずれかの単語ＩＤが付与されるものとする。 Next, the dependency relationship extraction unit 104 extracts the dependency relationship consisting of nouns, adjectives, or verb words from the document to be processed with reference to the analysis result in step S2 (S3). When the dependency relationship is extracted, a word ID is given to each word that constitutes the dependency relationship, the words of the dependency source and the dependency that constitute the dependency relationship, and the set of word IDs assigned to those words. Are associated with each other and recorded in the dependency relationship storage unit 105. FIG. 5 is a diagram showing an example of a table recorded in the dependency relationship storage unit 105 when the document to be processed is the above-mentioned document A. The table recorded in the dependency relationship storage unit 105 is created for each document, and in the table shown in the figure, for example, a set of word IDs "1, 3" is recorded in association with the dependency relationship "season-relationship". Has been done. In the explanation of this operation example, each document constituting the document set D has the "season", "potted plant", "relationship", illustrated in FIG. 6 for the original forms of the noun, adjective, or verb contained in the document. It shall be composed of one or more of the ten words "not", "flower", "plant", "blooming", "flowerbed", "withering" and "teaching". Further, as illustrated in FIG. 6, these 10 words are given a word ID of any one of "1" to "10".

上記のステップＳ１〜Ｓ３は、文書集合Ｄを構成するすべての文書について実行され、すべての文書について実行が完了すると（Ｓ４のＹＥＳ）、出現頻度算出部１０６は、係り受け関係記憶部１０５を参照して、文書集合Ｄのうち、処理対象の文書において特定された各係り受け関係が当該文書において出現する出現頻度を算出する（Ｓ５）。係り受け関係について出願頻度を算出すると、算出した出現頻度を当該係り受け関係と対応付けて係り受け関係記憶部１０５に記録する。仮に処理対象の文書が上記の文書Ａであったとすると、例えば、係り受け関係「季節−関係」は当該文書において１度しか出現しないため、図５に示す通り、当該係り受け関係に対応付けて出現頻度「１」が記録される。 The above steps S1 to S3 are executed for all the documents constituting the document set D, and when the execution is completed for all the documents (YES in S4), the occurrence frequency calculation unit 106 refers to the dependency relationship storage unit 105. Then, in the document set D, the appearance frequency in which each dependency relationship specified in the document to be processed appears in the document is calculated (S5). When the application frequency is calculated for the dependency relationship, the calculated appearance frequency is recorded in the dependency relationship storage unit 105 in association with the dependency relationship. Assuming that the document to be processed is the above document A, for example, the dependency relationship "season-relationship" appears only once in the document, and therefore, as shown in FIG. 5, it is associated with the dependency relationship. The appearance frequency "1" is recorded.

次に、出現文書数算出部１０７は、係り受け関係記憶部１０５を参照して、処理対象の文書において特定された各係り受け関係について、文書集合Ｄにおいて係り受け関係が出現する文書数を算出する（Ｓ６）。係り受け関係について文書数を算出すると、算出した文書数を当該係り受け関係と対応付けて係り受け関係記憶部１０５に記録する。仮に処理対象の文書が上記の文書Ａであって、例えば、係り受け関係「季節−関係」が、文書集合Ｄを構成する２０件の文書のうち３件に出現する場合には、図５に示す通り、当該係り受け関係「季節−関係」に対応付けて文書数「３」が記録される。 Next, the appearance document number calculation unit 107 calculates the number of documents in which the dependency relationship appears in the document set D for each dependency relationship specified in the document to be processed with reference to the dependency relationship storage unit 105. (S6). When the number of documents is calculated for the dependency relationship, the calculated number of documents is recorded in the dependency relationship storage unit 105 in association with the dependency relationship. If the document to be processed is the above document A, and for example, the dependency relationship "season-relationship" appears in 3 out of 20 documents constituting the document set D, FIG. 5 shows. As shown, the number of documents "3" is recorded in association with the dependency relationship "season-relationship".

次に、重要度算出部１０８は、係り受け関係記憶部１０５を参照して、処理対象の文書において特定された各係り受け関係について、上記の数１の式を用いて重要度を算出する（Ｓ７）。係り受け関係について重要度を算出すると、算出した重要度を当該係り受け関係と対応付けて係り受け関係記憶部１０５に記録する。仮に図５に示す係り受け関係「季節−関係」について重要度を算出したとすると、同図に示す通り、当該係り受け関係「季節−関係」に対応付けて重要度「０．８８（＝１＊ｌｏｇ（２０／３＋１））」が記録される。 Next, the importance calculation unit 108 calculates the importance of each dependency relationship specified in the document to be processed by referring to the dependency relationship storage unit 105 using the above equation of equation (1). S7). When the importance of the dependency relationship is calculated, the calculated importance is recorded in the dependency relationship storage unit 105 in association with the dependency relationship. Assuming that the importance of the dependency relationship "season-relationship" shown in FIG. 5 is calculated, the importance "0.88 (= 1)" is associated with the dependency relationship "season-relationship" as shown in the figure. * Log (20/3 + 1)) ”is recorded.

次に、重要度行列生成部１０９は、係り受け関係記憶部１０５を参照して、処理対象の文書について重要度行列を生成する（Ｓ８）。図７は、図５に示す各係り受け関係の重要度に基づいて重要度行列生成部１０９により生成される重要度行列の一例を示す図である。図７に示す重要度行列は、文書集合Ｄから抽出された各係り受け関係を構成する各単語（図６参照）が行及び列に対応付けられた正方行列であって、処理対象の文書において特定された各係り受け関係について、係り受け関係を構成する係り元の単語の行と係り先の列に対応する成分として当該係り受け関係の重要度が割り当てられた行列である。本重要度行列においては、例えば、係り受け関係「季節−関係」を構成する係り元の単語「季節」の行「１」と係り先の単語「関係」の列「３」に対応する成分として当該係り受け関係の重要度「０．８８」が割り当てられている。 Next, the importance matrix generation unit 109 generates an importance matrix for the document to be processed with reference to the dependency relationship storage unit 105 (S8). FIG. 7 is a diagram showing an example of an importance matrix generated by the importance matrix generation unit 109 based on the importance of each dependency relationship shown in FIG. The importance matrix shown in FIG. 7 is a square matrix in which each word (see FIG. 6) constituting each dependency relationship extracted from the document set D is associated with rows and columns, and is a square matrix in which the documents to be processed are associated with each other. For each of the specified dependency relationships, it is a matrix to which the importance of the dependency relationship is assigned as a component corresponding to the row of the word of the dependency source and the column of the dependency destination that constitute the dependency relationship. In this importance matrix, for example, as a component corresponding to the row "1" of the dependency word "season" and the column "3" of the dependency word "relationship" constituting the dependency relationship "season-relationship". The importance of the dependency relationship "0.88" is assigned.

次に、文書ベクトル生成部１１０は、ステップＳ８において生成された重要度行列の各行ベクトルを連結して文書ベクトルを生成する（Ｓ９）。文書ベクトルを生成すると、当該文書ベクトルが生成された文書の文書ＩＤと対応付けて文書ベクトル記憶部１１１に記録する。仮に図７に示す重要度行列に基づいて文書ベクトルを生成したとすると、図８に例示するように１０本の行ベクトルにより構成される当該重要度行列は、数２の式のように表現される。

Next, the document vector generation unit 110 generates a document vector by concatenating the row vectors of the importance matrix generated in step S8 (S9). When the document vector is generated, the document vector is recorded in the document vector storage unit 111 in association with the document ID of the generated document. Assuming that the document vector is generated based on the importance matrix shown in FIG. 7, the importance matrix composed of 10 row vectors is expressed as the equation of Equation 2 as illustrated in FIG. To.

上記のステップＳ５〜Ｓ９は、文書集合Ｄを構成するすべての文書について実行され、すべての文書について実行が完了すると（Ｓ１０のＹＥＳ）、本文書ベクトル生成処理は終了する。 The above steps S5 to S9 are executed for all the documents constituting the document set D, and when the execution is completed for all the documents (YES in S10), the document vector generation process ends.

以上説明した文書処理装置１によれば、文書中に出現する単語間の係り受け構造を反映した文書ベクトルを生成することができる。そのため、出現する単語が共通する文書同士であっても単語間の係り受け構造が相違していれば、その相違を文書ベクトルに表現することができる。 According to the document processing device 1 described above, it is possible to generate a document vector that reflects the dependency structure between words appearing in the document. Therefore, even if the documents that appear in the same word are different, if the dependency structure between the words is different, the difference can be expressed in the document vector.

２．変形例
上記の実施形態は以下に記載するように変形してもよい。以下に記載する１以上の変形例は互いに組み合わせてもよい。 2. Modification Example The above embodiment may be modified as described below. One or more modifications described below may be combined with each other.

２−１．変形例１
係り受け関係抽出部１０４は、名詞、形容詞又は動詞以外の自立語からなる係り受け関係を抽出してもよい。例えば、名詞、形容詞又は動詞の単語だけでなく副詞の単語からなる係り受け関係を抽出してもよい。 2-1. Modification 1
The dependency relationship extraction unit 104 may extract a dependency relationship composed of independent words other than nouns, adjectives, or verbs. For example, a dependency relationship consisting of not only noun, adjective or verb words but also adverb words may be extracted.

また、係り受け関係抽出部１０４は、係り元と係り先の２つの単語からなる係り受け関係に代えて、３以上の単語からなる係り受け関係を抽出するようにしてもよい。例えば、３つの単語からなる係り受け関係を抽出する場合には、係り元の第１の単語と、係り先の第２の単語と、第２の単語の係り先である第３の単語からなる係り受け関係を抽出する。３以上の単語からなる係り受け関係が抽出された場合には、重要度行列生成部１０９は、行と列からなる２次元配列である行列に代えて、３次元以上の配列を生成する。例えば、３次元配列を生成する場合には、文書集合から抽出された係り受け関係を構成する各単語が行、列及び奥行（又はページ）に対応付けられた配列であって、処理対象の文書において特定された各係り受け関係について、係り受け関係を構成する係り元の第１の単語の行と、係り先の第２の単語の列と、第２の単語の係り先である第３の単語の奥行（又はページ）に対応する成分として当該係り受け関係の重要度が割り当てられた配列を生成する。重要度が割り当てられない要素には「０」が割り当てられる。 Further, the dependency relationship extraction unit 104 may extract the dependency relationship consisting of three or more words instead of the dependency relationship consisting of two words, the dependency source and the dependency destination. For example, when extracting a dependency relationship consisting of three words, it consists of the first word of the dependency, the second word of the dependency, and the third word of the second word. Extract the dependency relationship. When a dependency relationship consisting of three or more words is extracted, the importance matrix generation unit 109 generates an array of three or more dimensions instead of a matrix which is a two-dimensional array consisting of rows and columns. For example, when a three-dimensional array is generated, each word constituting the dependency relationship extracted from the document set is an array associated with rows, columns, and depths (or pages), and the document to be processed. For each dependency identified in, the row of the first word of the dependency that constitutes the dependency, the column of the second word of the dependency, and the third word that is the dependency of the second word. Generates an array to which the importance of the dependency relationship is assigned as a component corresponding to the depth (or page) of the word. "0" is assigned to the element to which the importance is not assigned.

２−２．変形例２
係り受け関係の重要度を算出するための上記の数１の式では、（Ｎ／ｎ＋１）の常用対数をとっているが、自然対数をとってもよい。または、そもそも対数をとらなくてもよい。 2-2. Modification 2
In the above equation of Equation 1 for calculating the importance of the dependency relationship, the common logarithm of (N / n + 1) is taken, but the natural logarithm may be taken. Alternatively, it is not necessary to take the logarithm in the first place.

２−３．変形例３
重要度算出部１０８が係り受け関係の重要度を算出する際に用いる上記の数１の式において、出現頻度ｍの乗算は省略されてもよい。この変形が採用される場合には、出現頻度算出部１０６は省略されてもよい。 2-3. Modification 3
In the above equation of Equation 1 used by the importance calculation unit 108 to calculate the importance of the dependency relationship, the multiplication of the appearance frequency m may be omitted. When this modification is adopted, the appearance frequency calculation unit 106 may be omitted.

また、係り受け関係の重要度を算出するための式は上記の数１の式に限定されるものではなく、出現頻度算出部１０６により算出される出現頻度と、文書集合を構成する文書数を出現文書数算出部１０７により算出される文書数で除した値とを乗じているものであればよい。例えば、数３〜数６の式のうちのいずれかの式であってもよい。

Further, the formula for calculating the importance of the dependency relationship is not limited to the above formula of equation 1, and the appearance frequency calculated by the appearance frequency calculation unit 106 and the number of documents constituting the document set are used. Any document may be used as long as it is multiplied by a value divided by the number of documents calculated by the number of appearing documents calculation unit 107. For example, it may be any of the equations of Equations 3 to 6.

数３〜数６の式において、Ｗは重要度を表し、ｍは、出現頻度算出部１０６により算出される出現頻度を表し、Ｎは文書集合を構成する文書数を表し、ｎは、出現文書数算出部１０７により算出される文書数を表し、Ｍは、処理対象の文書において最も出現頻度の高い単語の出現頻度を表す。なおここでｌｏｇは常用対数である。 In the equations of Equations 3 to 6, W represents the importance, m represents the appearance frequency calculated by the appearance frequency calculation unit 106, N represents the number of documents constituting the document set, and n represents the appearance document. The number of documents calculated by the number calculation unit 107 is represented, and M represents the frequency of occurrence of the word having the highest frequency of occurrence in the document to be processed. Here, log is a common logarithm.

また、重要度算出部１０８は、係り受け関係の重要度を算出する際に、文書の長さ（総単語数）を考慮してもよい。具体的には、より短い文書において出現頻度が高い係り受け関係の重要度がより高くなるように重要度を算出するようにしてもよい。一例として数７の式を用いて重要度を算出してもよい。

In addition, the importance calculation unit 108 may consider the length of the document (total number of words) when calculating the importance of the dependency relationship. Specifically, the importance may be calculated so that the dependency relationship, which appears frequently in a shorter document, becomes more important. As an example, the importance may be calculated using the formula of Equation 7.

数７の式において、Ｗは重要度を表し、ｍは、出現頻度算出部１０６により算出される出現頻度を表し、ｋ及びｂは、任意に設定される定数を表し、Ｌｅｎ（ｄ）は、処理対象の文書の長さ（総単語数）を表し、ａｖｇｄＬは、文書集合を構成する文書の平均の長さ（総単語数）を表し、Ｎは文書集合を構成する文書数を表し、ｎは、出現文書数算出部１０７により算出される文書数を表す。なおここでｌｏｇは常用対数である。 In the equation of Equation 7, W represents the importance, m represents the appearance frequency calculated by the appearance frequency calculation unit 106, k and b represent a constant set arbitrarily, and Len (d) is The length of the document to be processed (total number of words) is represented, avgdL represents the average length of the documents constituting the document set (total number of words), N represents the number of documents constituting the document set, and n Represents the number of documents calculated by the number of appearing documents calculation unit 107. Here, log is a common logarithm.

２−４．変形例４
重要度行列生成部１０９は、係り元の単語の行と係り先の単語の列に対応する成分として係り受け関係の重要度を割り当てるのに代えて、係り元の単語の列と係り先の単語の行に対応する成分として係り受け関係の重要度を割り当ててもよい。すなわち、係り元を行ではなく列に対応させ、係り先を列ではなく行に対応させてもよい。 2-4. Modification 4
Instead of assigning the importance of the dependency relationship as a component corresponding to the row of the elemental word and the column of the elemental word, the importance matrix generation unit 109 assigns the importance of the dependency relationship to the string of the elemental word and the word to be associated with. The importance of the dependency relationship may be assigned as the component corresponding to the line of. That is, the source may correspond to a column instead of a row, and the destination may correspond to a row instead of a column.

また、重要度行列生成部１０９が生成する重要度行列において、重要度が割り当てられない要素に、「０」以外の一定の値を割り当てるようにしてもよい。 Further, in the importance matrix generated by the importance matrix generation unit 109, a constant value other than "0" may be assigned to an element to which the importance is not assigned.

また、重要度行列生成部１０９は、重要度算出部１０８により算出された重要度に代えて、出現頻度算出部１０６により係り受け関係について算出された出現頻度を、当該係り受け関係を構成する係り元の単語の行と係り先の単語の列に対応する成分として割り当てた行列を生成してもよい。すなわち、重要度算出部１０８により算出された重要度に代えて、出現頻度算出部１０６により算出された出現頻度を係り受け関係の重要度として扱ってもよい。この変形が採用される場合には、出現文書数算出部１０７と重要度算出部１０８は省略されてよい。 Further, the importance matrix generation unit 109 uses the appearance frequency calculated for the dependency relationship by the appearance frequency calculation unit 106 instead of the importance calculated by the importance calculation unit 108 to form the dependency relationship. You may generate a matrix assigned as a component corresponding to the row of the original word and the column of the associated word. That is, instead of the importance calculated by the importance calculation unit 108, the appearance frequency calculated by the appearance frequency calculation unit 106 may be treated as the importance of the dependency relationship. When this modification is adopted, the appearance document number calculation unit 107 and the importance calculation unit 108 may be omitted.

２−５．変形例５
文書ベクトル生成部１１０は、重要度行列の各行ベクトルを連結して文書ベクトルを生成するのに代えて、重要度行列の各列ベクトルを連結して文書ベクトルを生成してもよい。要するに文書ベクトル生成部１１０は、重要度行列の各成分を所定の規則に従って連結して文書ベクトルを生成すればよい。 2-5. Modification 5
The document vector generation unit 110 may generate the document vector by concatenating the column vectors of the importance matrix instead of concatenating the row vectors of the importance matrix to generate the document vector. In short, the document vector generation unit 110 may generate a document vector by connecting each component of the importance matrix according to a predetermined rule.

２−６．変形例６
文書処理装置１は、演算処理装置がプログラムを実行することにより実現される機能として、次元圧縮部をさらに備えてもよい。この次元圧縮部は、文書ベクトル生成部１１０により生成された重要度行列の次元を、所定の次元圧縮方法を用いて圧縮（言い換えると削減）するための機能である。ここで所定の次元圧縮方法とは、例えば、二次元主成分分析（２ＤＰＣＡ）や、特異値分解（ＳＶＤ）等の方法である。次元圧縮部により重要度行列の次元が圧縮されると、文書ベクトル生成部１１０は、当該行列の各行ベクトルを連結して文書ベクトルを生成する。この変形によれば、例えば、文書ベクトルを用いて類似文書の検索をする際の計算量を削減することができる。 2-6. Modification 6
The document processing device 1 may further include a dimension compression unit as a function realized by the arithmetic processing unit executing a program. This dimension compression unit is a function for compressing (in other words, reducing) the dimension of the importance matrix generated by the document vector generation unit 110 by using a predetermined dimension compression method. Here, the predetermined dimensional compression method is, for example, a method such as two-dimensional principal component analysis (2DPCA) or singular value decomposition (SVD). When the dimension of the importance matrix is compressed by the dimension compression unit, the document vector generation unit 110 concatenates each row vector of the matrix to generate a document vector. According to this modification, for example, it is possible to reduce the amount of calculation when searching for similar documents using the document vector.

２−７．変形例７
文書処理装置１は、演算処理装置がプログラムを実行することにより実現される機能として、類似度判定部をさらに備えてもよい。この類似度判定部は、２件以上の文書の各々について文書ベクトル生成部１１０により生成された文書ベクトルに基づいて、所定の類似度計算方法を用いて、それらの文書の類似度を判定するための機能である。ここで所定の類似度計算方法を用いて算出される類似度としては、例えば、コサイン類似度や、ピアソンの相関係数や、偏差パターン類似度等がある。類似度判定部により判定された類似度は、図示せぬ表示装置に表示されてもよいし、通信回線３を介して送信されてもよい。または、文書ベクトル記憶部１１１に登録されている文書を、利用者により入力された質問文書に類似する順に並べて表示装置に表示させる際に参照されてもよい。 2-7. Modification 7
The document processing device 1 may further include a similarity determination unit as a function realized by the arithmetic processing unit executing a program. This similarity determination unit determines the similarity of two or more documents by using a predetermined similarity calculation method based on the document vector generated by the document vector generation unit 110. It is a function of. Here, as the similarity calculated by using the predetermined similarity calculation method, for example, there are a cosine similarity, a Pearson correlation coefficient, a deviation pattern similarity, and the like. The similarity determined by the similarity determination unit may be displayed on a display device (not shown) or may be transmitted via the communication line 3. Alternatively, the documents registered in the document vector storage unit 111 may be referred to when the documents are arranged in an order similar to the question document input by the user and displayed on the display device.

２−８．変形例８
上記の実施形態に係る文書処理装置１は、日本語の文書を処理させることを想定しているが、日本語以外の言語を処理可能としてもよい。例えば、英語の文書を処理させる場合には、形態素解析部１０２は、例えばTreeTagger（http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/）を使用して英語の文書を解析し、単語を抽出するようにしてよい。また、係り受け解析部１０３は、例えばMaltparse（http://www.maltparser.org/）を使用して、英語の文書を構成する各単語の係り受け構造を抽出するようにしてよい。 2-8. Modification 8
The document processing device 1 according to the above embodiment is supposed to process a Japanese document, but may be capable of processing a language other than Japanese. For example, when processing an English document, the morphological analysis unit 102 uses, for example, TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) to process the English document. May be parsed and words extracted. Further, the dependency analysis unit 103 may use, for example, Maltparse (http://www.maltparser.org/) to extract the dependency structure of each word constituting the English document.

２−９．変形例９
文書処理装置１が備える各機能を実現するためのプログラムは、コンピュータ装置が読み取り可能な記録媒体を介して提供されてもよい。ここで記録媒体とは、例えば、磁気テープや磁気ディスクなどの磁気記録媒体や、光ディスクなどの光記録媒体や、光磁気記録媒体や、半導体メモリ等である。また、このプログラムは、インターネット等のネットワークを介して提供されてもよい。 2-9. Modification 9
The program for realizing each function included in the document processing device 1 may be provided via a recording medium readable by the computer device. Here, the recording medium is, for example, a magnetic recording medium such as a magnetic tape or a magnetic disk, an optical recording medium such as an optical disk, an optical magnetic recording medium, a semiconductor memory, or the like. In addition, this program may be provided via a network such as the Internet.

１…文書処理装置、２…文書データベース、３…通信回線、１０１…文書入力部、１０２…形態素解析部、１０３…係り受け解析部、１０４…係り受け関係抽出部、１０５…係り受け関係記憶部、１０６…出現頻度算出部、１０７…出現文書数算出部、１０８…重要度算出部、１０９…重要度行列生成部、１１０…文書ベクトル生成部、１１１…文書ベクトル記憶部 1 ... Document processing device, 2 ... Document database, 3 ... Communication line, 101 ... Document input unit, 102 ... Morphological analysis unit, 103 ... Dependency analysis unit, 104 ... Dependency relationship extraction unit, 105 ... Dependency relationship storage unit , 106 ... Appearance frequency calculation unit, 107 ... Appearance document number calculation unit, 108 ... Importance calculation unit, 109 ... Importance matrix generation unit, 110 ... Document vector generation unit, 111 ... Document vector storage unit

Claims

文書集合を構成する各文書を解析して単語を抽出する解析部と、
前記文書集合を構成する各文書について、前記解析部により抽出された各単語の係り受け関係を抽出する係り受け関係抽出部と、
前記文書集合を構成する一の対象文書において、前記係り受け関係抽出部により抽出された各係り受け関係が出現する出現頻度を算出する出現頻度算出部と、
前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、前記文書集合において係り受け関係が出現する文書数を算出する出現文書数算出部と、
前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、前記文書集合を構成する文書数を、前記出現文書数算出部により算出された文書の数で除して逆文書頻度を算出する逆文書頻度算出部と、
前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、前記出現頻度算出部により算出された頻度に、前記逆文書頻度算出部により算出された逆文書頻度を乗じて重要度を算出する重要度算出部と、
前記解析部により抽出された各単語が行及び列に対応付けられた行列であって、前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、係り受け関係を構成する係り元の単語と係り先の単語のうち一方の行と他方の列に対応する成分として当該係り受け関係について前記重要度算出部により算出された重要度が割り当てられた行列を生成する重要度行列生成部と、
前記重要度行列生成部により生成された行列の各成分を連結して前記対象文書の文書ベクトルを生成する文書ベクトル生成部と
を備える文書処理装置。 An analysis unit that analyzes each document that makes up a document set and extracts words,
For each document constituting the document set, a dependency relationship extraction unit that extracts the dependency relationship of each word extracted by the analysis unit, and a dependency relationship extraction unit.
An appearance frequency calculation unit that calculates the appearance frequency of each dependency extracted by the dependency extraction unit in one target document that constitutes the document set, and an appearance frequency calculation unit.
For each dependency relationship extracted by the dependency relationship extraction unit for the target document, an appearance document number calculation unit that calculates the number of documents in which the dependency relationship appears in the document set, and
For each dependency relationship extracted by the dependency relationship extraction unit for the target document, the reverse document frequency is obtained by dividing the number of documents constituting the document set by the number of documents calculated by the appearance document number calculation unit. Inverse document frequency calculation unit that calculates
For each dependency relationship extracted by the dependency relationship extraction unit for the target document, the importance is multiplied by the frequency calculated by the appearance frequency calculation unit and the inverse document frequency calculated by the reverse document frequency calculation unit. Importance calculation unit to calculate
Each word extracted by the analysis unit is a matrix associated with rows and columns, and each dependency relationship extracted by the dependency relationship extraction unit for the target document constitutes a dependency relationship. Severity matrix generation that generates a matrix to which the importance calculated by the importance calculation unit is assigned to the dependency relationship as a component corresponding to one row and the other column of the original word and the related word. Department and
A document processing apparatus including a document vector generation unit that concatenates each component of a matrix generated by the importance matrix generation unit to generate a document vector of the target document.

前記重要度行列生成部により生成された行列の次元を、所定の次元圧縮方法を用いて圧縮する次元圧縮部をさらに備え、
前記文書ベクトル生成部は、前記次元圧縮部により次元を圧縮された行列の各成分を連結して前記対象文書の文書ベクトルを生成する
ことを特徴とする請求項１に記載の文書処理装置。 A dimensional compression unit that compresses the dimensions of the matrix generated by the importance matrix generation unit using a predetermined dimensional compression method is further provided.
The document processing apparatus according to claim 1, wherein the document vector generation unit generates a document vector of the target document by concatenating each component of a matrix whose dimensions are compressed by the dimension compression unit.

前記文書ベクトル生成部により生成された前記対象文書の文書ベクトルと、他の文書について前記文書ベクトル生成部により生成された文書ベクトルとに基づいて、所定の類似度計算方法を用いて、前記対象文書と前記他の文書の類似度を判定する類似度判定部をさらに備えることを特徴とする請求項１又は２に記載の文書処理装置。 Based on the document vector of the target document generated by the document vector generation unit and the document vector generated by the document vector generation unit for other documents, the target document is used by a predetermined similarity calculation method. The document processing apparatus according to claim 1 or 2, further comprising a similarity determination unit for determining the similarity between the document and the other document.

文書処理装置により実行される文書処理方法であって、
文書集合を構成する各文書を解析して単語を抽出する解析ステップと、
前記文書集合を構成する各文書について、前記解析ステップにより抽出された各単語の係り受け関係を抽出する係り受け関係抽出ステップと、
前記文書集合を構成する一の対象文書において、前記係り受け関係抽出ステップにより抽出された各係り受け関係が出現する出現頻度を算出する出現頻度算出ステップと、
前記対象文書について前記係り受け関係抽出ステップにより抽出された各係り受け関係について、前記文書集合において係り受け関係が出現する文書数を算出する出現文書数算出ステップと、
前記対象文書について前記係り受け関係抽出ステップにより抽出された各係り受け関係について、前記文書集合を構成する文書数を、前記出現文書数算出ステップにより算出された文書の数で除して逆文書頻度を算出する逆文書頻度算出ステップと、
前記対象文書について前記係り受け関係抽出ステップにより抽出された各係り受け関係について、前記出現頻度算出ステップにより算出された頻度に、前記逆文書頻度算出ステップにより算出された逆文書頻度を乗じて重要度を算出する重要度算出ステップと、
前記解析ステップにより抽出された各単語が行及び列に対応付けられた行列であって、前記対象文書について前記係り受け関係抽出ステップにより抽出された各係り受け関係について、係り受け関係を構成する係り元の単語と係り先の単語のうち一方の行と他方の列に対応する成分として当該係り受け関係について前記重要度算出ステップにより算出された重要度が割り当てられた行列を生成する重要度行列生成ステップと、
前記重要度行列生成ステップにより生成された行列の各成分を連結して前記対象文書の文書ベクトルを生成する文書ベクトル生成ステップと
を備える文書処理方法。 A document processing method executed by a document processing device.
An analysis step that analyzes each document that makes up a document set and extracts words,
For each document constituting the document set, a dependency relationship extraction step for extracting the dependency relationship of each word extracted by the analysis step, and a dependency relationship extraction step.
An appearance frequency calculation step for calculating the appearance frequency of each dependency extracted by the dependency extraction step in one target document constituting the document set, and an appearance frequency calculation step.
For each dependency relationship extracted by the dependency relationship extraction step for the target document, an appearance document number calculation step for calculating the number of documents in which the dependency relationship appears in the document set, and a step of calculating the number of appearance documents.
For each dependency relationship extracted by the dependency relationship extraction step for the target document, the reverse document frequency is obtained by dividing the number of documents constituting the document set by the number of documents calculated by the appearance document number calculation step. Inverse document frequency calculation step to calculate
For each dependency relationship extracted by the dependency relationship extraction step for the target document, the importance is multiplied by the frequency calculated by the appearance frequency calculation step and the reverse document frequency calculated by the reverse document frequency calculation step. The importance calculation step to calculate
Each word extracted by the analysis step is a matrix associated with rows and columns, and the dependency relationship is formed for each dependency relationship extracted by the dependency relationship extraction step for the target document. Severity matrix generation that generates a matrix to which the importance calculated by the importance calculation step is assigned to the dependency relationship as a component corresponding to one row and the other column of the original word and the associated word. Steps and
A document processing method including a document vector generation step of connecting each component of the matrix generated by the importance matrix generation step to generate a document vector of the target document.

コンピュータを、
文書集合を構成する各文書を解析して単語を抽出する解析部と、
前記文書集合を構成する各文書について、前記解析部により抽出された各単語の係り受け関係を抽出する係り受け関係抽出部と、
前記文書集合を構成する一の対象文書において、前記係り受け関係抽出部により抽出された各係り受け関係が出現する出現頻度を算出する出現頻度算出部と、
前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、前記文書集合において係り受け関係が出現する文書数を算出する出現文書数算出部と、
前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、前記文書集合を構成する文書数を、前記出現文書数算出部により算出された文書の数で除して逆文書頻度を算出する逆文書頻度算出部と、
前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、前記出現頻度算出部により算出された頻度に、前記逆文書頻度算出部により算出された逆文書頻度を乗じて重要度を算出する重要度算出部と、
前記解析部により抽出された各単語が行及び列に対応付けられた行列であって、前記対象文書について前記係り受け関係抽出部により抽出された各係り受け関係について、係り受け関係を構成する係り元の単語と係り先の単語のうち一方の行と他方の列に対応する成分として当該係り受け関係について前記重要度算出部により算出された重要度が割り当てられた行列を生成する重要度行列生成部と、
前記重要度行列生成部により生成された行列の各成分を連結して前記対象文書の文書ベクトルを生成する文書ベクトル生成部
として機能させるためのプログラム。 Computer,
An analysis unit that analyzes each document that makes up a document set and extracts words,
For each document constituting the document set, a dependency relationship extraction unit that extracts the dependency relationship of each word extracted by the analysis unit, and a dependency relationship extraction unit.
An appearance frequency calculation unit that calculates the appearance frequency of each dependency extracted by the dependency extraction unit in one target document that constitutes the document set, and an appearance frequency calculation unit.
For each dependency relationship extracted by the dependency relationship extraction unit for the target document, an appearance document number calculation unit that calculates the number of documents in which the dependency relationship appears in the document set, and
For each dependency relationship extracted by the dependency relationship extraction unit for the target document, the reverse document frequency is obtained by dividing the number of documents constituting the document set by the number of documents calculated by the appearance document number calculation unit. Inverse document frequency calculation unit that calculates
For each dependency relationship extracted by the dependency relationship extraction unit for the target document, the importance is multiplied by the frequency calculated by the appearance frequency calculation unit and the inverse document frequency calculated by the reverse document frequency calculation unit. Importance calculation unit to calculate
Each word extracted by the analysis unit is a matrix associated with rows and columns, and each dependency relationship extracted by the dependency relationship extraction unit for the target document constitutes a dependency relationship. Severity matrix generation that generates a matrix to which the importance calculated by the importance calculation unit is assigned to the dependency relationship as a component corresponding to one row and the other column of the original word and the related word. Department and
A program for connecting each component of the matrix generated by the importance matrix generation unit to function as a document vector generation unit for generating a document vector of the target document.