JP2005190284A

JP2005190284A - Information classification device and method

Info

Publication number: JP2005190284A
Application number: JP2003432458A
Authority: JP
Inventors: Yoshimi Takemoto; 義美竹元
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2003-12-26
Filing date: 2003-12-26
Publication date: 2005-07-14

Abstract

PROBLEM TO BE SOLVED: To precisely classify text data into categories, and to efficiently and precisely define each category used for the classification. SOLUTION: This information classification device for classifying text data into categories includes a text analysis means 11 that creates word information and dependency (modification) information from the captured text data, and a data classification means 12 that classifies the text data into categories on the basis of the word information and dependency information contained in the text data and classification dictionary data stored in a classification dictionary storage means. The classification dictionary data is registered using a classification dictionary data creation means 30: items designated from among the word information and dependency information created for all of the text data are registered as the conditions for classification into the designated categories. COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to an information classification device and an information classification method used for information mining, that is, for obtaining desired, useful information from a large amount of digitized text data.

In recent years, companies and other organizations have come to accumulate large amounts of information, such as customer inquiries received at contact centers, questionnaire results, and daily sales reports, as digitized text data. Such text data contains the voice of the customer, for example requests concerning products, which is useful for setting management policy such as guidelines for product improvement and development, and it is desirable to put this data to effective use in customer relationship management (CRM) and the like. Furthermore, if useful, desired information can be extracted efficiently from digitized text data, it may become possible to offer effective data analysis consulting services.

To carry out such customer management or data analysis consulting, it is desirable to extract, from a large amount of text data, information that lets trends be grasped quantitatively on useful matters such as what complaints are being received about a specific product and what improvements are desired, for example in the form of the number of occurrences per type of complaint or the number of inquiries per requested item. In recent years, text mining has attracted attention as a technique for extracting such quantitative information from large amounts of text data.

One such technique is document clustering, which sorts data items into groups of similar meaning. In conventional document clustering, each data item is usually sorted according to whether it contains a word specified by the user, such as "printer" or "display", and is given subject information characterized by the specified word, for example "printer". The number of data items in each group can then be totaled and displayed in descending order, which makes it possible to see, for example, whether printers or displays attract the most inquiries.

In such document clustering, a technique is also known that uses a thesaurus to assign data items containing synonyms to the same group, thereby allowing for variation in wording. Patent Document 1 describes a technique for performing document clustering that also takes into account dependency information, such as the relationship between a modifier and the word it modifies or between a subject and its predicate.

Patent documents (in the order listed): JP 2000-172691 A; JP 2000-181926 A; JP 2002-304401 A; JP 2001-312501 A; JP 2001-84250 A; JP 2001-266060 A; JP 2001-1199 A; JP H09-265476 A; JP H11-25121 A.
Non-patent documents (in the order listed): "Memory of Japanese dictionary and automatic segmentation of Japanese sentences" (Makoto Nagao et al., Information Processing, Vol. 19, No. 6, 1978); "Method of dividing compound words using dependency analysis" (Masahiro Miyazaki, Transactions of Information Processing Society of Japan, Vol. 25, No. 6, 1984).

However, as described above, classifying each data item solely by whether it contains a specific word specified by the user, or contains a specific word in a specific dependency relationship, makes it difficult to classify as the user intends; for example, text data about a printer may not directly contain the word "printer". Consequently, the user usually has to check the classified data and assign the final classification information.

Moreover, for information mining to be effective in practice, it is often desirable to classify data into categories that are hard to define simply by the presence or absence of specific words, such as data concerning complaints versus data concerning requests. Most text data that has simply been digitized and accumulated is plain data in which, for example, information about complaints and information about requests are mixed together at random. Giving each data item to be analyzed category information, for example whether it concerns a complaint or a request, is one of the key points in obtaining useful results from information mining. The document clustering processing described above is not sufficient by itself for assigning such category information.

As a technique for classifying text data into categories, Patent Document 2 describes a technique for extracting data belonging to the category of complaints. That technique uses a complaint dictionary in which patterns such as words and sentence-ending expressions likely to appear in complaint sentences are registered, and judges whether data concerns a complaint by whether it contains any of the patterns in this complaint dictionary.

In a technique that assigns categories by such simple patterns, text data containing the word "breakdown" (故障) is classified into the category "fault" (障害), and text data containing the sentence-ending expression "is a problem" (困る) is classified into the category "complaint", for example. However, simple patterns alone sometimes cannot give an accurate classification. Take the word "high" (高い): "the price is high" should go into the "complaint" category, whereas "the performance is high" should not be treated as a complaint and belongs instead in a category such as "favorable comment" or "praise". Making this distinction requires determining that the word "price" or "performance" is in a dependency relationship with the word "high", which is difficult to do merely by checking for the presence of simple patterns.
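
The contrast between pattern matching and dependency-aware matching can be illustrated with a minimal Python sketch. The rule tables `KEYWORD_RULES` and `DEPENDENCY_RULES`, the function names, and the English tokens are hypothetical illustrations and are not taken from the cited documents; the dependency pairs are assumed to have been produced already by some analysis step.

```python
# Hypothetical rule tables; the category labels are illustrative only.
KEYWORD_RULES = {"high": "complaint"}
DEPENDENCY_RULES = {
    ("price", "high"): "complaint",
    ("performance", "high"): "praise",
}


def classify_by_keyword(words):
    """Return the categories triggered by the mere presence of a word."""
    return {KEYWORD_RULES[w] for w in words if w in KEYWORD_RULES}


def classify_by_dependency(pairs):
    """Return the categories triggered by (dependent, head) word pairs."""
    return {DEPENDENCY_RULES[p] for p in pairs if p in DEPENDENCY_RULES}


# "The performance is high": the keyword rule labels it a complaint,
print(classify_by_keyword(["performance", "high"]))
# whereas the dependency rules separate it from "the price is high".
print(classify_by_dependency([("performance", "high")]))
print(classify_by_dependency([("price", "high")]))
```

The keyword rule cannot tell the two sentences apart because both contain "high"; only the pairing with "price" or "performance" carries the distinction.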

Patent Document 3 describes a technique in which a person first performs classification, a processing device is made to learn classification rules from the result, and classification is then executed on the basis of the learned rules. In that document, phrases such as noun-phrase expressions and sentence-ending expressions are extracted, on the basis of part-of-speech sequence patterns, from text data that a person has classified into a given category, and the device learns in what patterns each phrase appears across a large amount of text data classified into that category.

However, such automatic learning does not necessarily give accurate classification results for open text data, which is written freely by many different people and therefore varies widely in wording. Raising the accuracy requires a large amount of training data, and preparing it takes so much effort that the approach can become inefficient.

Patent Document 4 describes a technique that analyzes the dependency relationships between words and classifies text data into predetermined categories while also taking the resulting dependency information into account. Such a technique can improve classification accuracy.

To classify data into categories, each category must be defined; that is, the conditions that text data must satisfy to be classified into each category must be set. When dependency information is also taken into account, these category definitions become complicated. In the technique described in Patent Document 4, the definition of each category is obtained by automatic learning from training data prepared by a person performing classification. However, as noted above in connection with Patent Document 3, this technique can produce inappropriate classifications, and raising the classification accuracy requires a great deal of labor to prepare a large amount of training data.

Accordingly, an object of the present invention is to provide an information classification device and an information classification method that can classify text data accurately into categories and that allow each category used for the classification to be defined efficiently and accurately.

To achieve the above object, the information classification device of the present invention is an information classification device that classifies text data to be classified into categories, comprising: data input means for capturing text data; sentence analysis means for performing morphological analysis on the sentences of the text data captured by the data input means and creating word information in which the sentences are divided into words with part-of-speech information attached; dependency analysis means for determining, on the basis of the word information created by the sentence analysis means, dependency relationships between words, namely subject-predicate relationships or modifier-modified relationships, and creating dependency information; classification dictionary storage means for storing classification dictionary data in which, for each category, the word information and dependency information serving as conditions for classification into that category are registered; and data classification means for classifying each item of text data into categories on the basis of the word information and dependency information contained in that text data and the classification dictionary data stored in the classification dictionary storage means. As means for registering the classification dictionary data in the classification dictionary storage means, the device further comprises classification dictionary data creation means that registers, in the classification dictionary data, items designated from among the word information created by the sentence analysis means and the dependency information created by the dependency analysis means for sample text data, as conditions for classification into a designated category.

With this configuration, each item of text data can be classified into categories with its dependency relationships taken into account. In addition, by using the classification dictionary data creation means, the user can create classification dictionary data while referring to the word information created by the sentence analysis means and the dependency information created by the dependency analysis means, and can therefore create the classification dictionary data efficiently and accurately.

The classification dictionary data creation means may further be provided with ranking means that uses a statistical method to rank the word information and dependency information contained in all of the text data to be classified. Using this ranking means, the user can assign word information or dependency information to categories in order starting from the highest-ranked items, which further improves the efficiency and accuracy of creating the classification dictionary data.
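
One way such a ranking could work is sketched below in Python. The text does not specify which statistical method is used, so ranking by the number of records that contain each item is an assumption made here for illustration; the function name `rank_items` and the record layout are likewise hypothetical.

```python
from collections import Counter


def rank_items(records):
    """Rank words and dependency pairs by how many records contain them.

    Each record is a dict with "words" (a list of str) and
    "dependencies" (a list of (dependent, head) tuples).
    """
    counts = Counter()
    for record in records:
        # Count each item at most once per record (document frequency).
        counts.update(set(record["words"]))
        counts.update(set(record["dependencies"]))
    return counts.most_common()


# Made-up sample records for illustration only.
records = [
    {"words": ["response", "bad"], "dependencies": [("response", "bad")]},
    {"words": ["response", "bad", "clerk"], "dependencies": [("response", "bad")]},
    {"words": ["attitude", "bad"], "dependencies": [("attitude", "bad")]},
]
for item, count in rank_items(records):
    print(count, item)
```

A user working through the ranked list from the top would see the most widespread words and dependency pairs first and could assign them to categories before handling rarer items.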

The information classification device of the present invention may also be provided with category-specific data extraction means that, on the basis of the results produced by the data classification means, extracts only the text data classified into a designated category, with the text data so extracted being used to create new classification dictionary data. The user can then take the text data extracted by the category-specific data extraction means as a new classification target and perform a more detailed classification; repeating this allows progressively finer classification to be carried out efficiently and accurately.

The information classification method according to the present invention is an information classification method for automatically classifying text data to be classified into categories, comprising the steps of: performing morphological analysis on the sentences of the text data and creating word information in which the sentences are divided into words with part-of-speech information attached; determining, on the basis of the word information, dependency relationships between words, namely subject-predicate relationships or modifier-modified relationships, and creating dependency information; creating classification dictionary data in which, for each category, the word information and dependency information serving as conditions for classification into that category are registered; and classifying each item of text data into categories on the basis of the word information and dependency information contained in that text data and the classification dictionary data stored in classification dictionary storage means. The step of creating the classification dictionary data includes a step of registering, in the classification dictionary data, items designated from among the word information and dependency information created for sample text data, as conditions for classification into a designated category.

The information classification device of the present invention can be implemented by a computer operating in accordance with a predetermined program, and the present invention includes a program for causing a computer to execute the information classification method described above.

According to the present invention, text data can be classified into user-defined categories with the dependency relationships in its sentences taken into account, which enables accurate classification. Furthermore, because the classification dictionary data that defines the categories can be created with reference to the word information and dependency information of the sentences in the text data to be classified, the user can define each category efficiently and accurately.

Next, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram schematically showing the configuration of the information classification device of this embodiment. Each means in the figure may be an individual device, such as a magnetic disk reader or a control device with a predetermined control circuit, or may be realized by a computer executing processing based on a predetermined program.

The data classification operation of this information classification device is described below.

First, the text data to be classified is captured via the data input means 10. The data may be a document file of any format or a text file containing sentences; in this embodiment, as an example, the data is captured as a comma-separated text file, known as CSV.

FIG. 2 shows a concrete example of the captured data. This is an example of data accumulated at a contact center (customer service desk) set up within a company; each data item is a record in tabular file data with the fields date and time, product name, customer name, contact details, and inquiry content. The file data may also include other fields, such as the content of the reply to the inquiry.
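
Reading such a comma-separated file can be sketched with Python's standard `csv` module. The column names in `FIELDS`, the function name `load_records`, and the sample row are hypothetical; the text only names the fields (date and time, product name, customer name, contact details, inquiry content), and the sketch assumes a file without a header row.

```python
import csv
import io

# Hypothetical column names corresponding to the fields described above.
FIELDS = ["datetime", "product", "customer", "contact", "inquiry"]


def load_records(csv_file):
    """Read comma-separated rows into dicts keyed by FIELDS."""
    reader = csv.DictReader(csv_file, fieldnames=FIELDS)
    return list(reader)


# A made-up row standing in for one contact-center record.
sample = io.StringIO(
    "2003-12-01 10:15,Printer A,Customer X,000-0000,"
    "The clerk's response was bad\n"
)
records = load_records(sample)
print(records[0]["inquiry"])
```

In practice the same function would be given an open file object instead of the in-memory `io.StringIO` used here.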

The contact center receives questions, complaints, requests, and other inquiries about the company's products by telephone, fax, e-mail, and so on. In FIG. 2, "date and time" is the date and time at which the inquiry was received, and "inquiry content" is the content of the inquiry received by telephone or other means, written as sentences. Such data is accumulated as digitized data when, for example, an operator handling calls at the contact center uses some application running on the operator's terminal to select the product name and type in the inquiry content as appropriate. The accumulated data is then captured over any data line, or stored temporarily on a storage medium such as a magnetic disk and read back from it, and is converted to a suitable format as necessary when it is taken into the information classification device of this embodiment.

Besides data obtained at a contact center in this way, the captured data may be, for example, opinions submitted in questionnaires or a database of daily reports written by sales staff.

Next, the sentences contained in the data captured by the data input means 10 are analyzed by the sentence analysis means 11. The sentence analysis means 11 may analyze the data in all of the fields described above, or it may be arranged so that the fields containing the data needed for classification can be selected and only those fields analyzed.

The sentence analysis means 11 performs morphological analysis, which divides a sentence into words and attaches part-of-speech information to each word. The combinations of word and part of speech output by the sentence analysis means 11 are referred to below as word information. Morphological analysis is a technique commonly applied when a computer processes a language written without spaces between words, such as Japanese; for example, the technique described in Non-Patent Document 1 can be used.

The word information obtained by the sentence analysis means 11 is sent to the dependency analysis means 21. On the basis of the input word information, the dependency analysis means 21 performs syntactic analysis and the like, and determines dependency relationships such as the subject-predicate relationships and modifier-modified relationships in a sentence. For dependency analysis, the techniques described in Patent Document 1, Patent Document 5, Non-Patent Document 2, and elsewhere can be used. To reduce the processing load, the dependency analysis may skip strict syntactic analysis and instead treat the nearest noun (phrase) and verb (phrase) or adjective (phrase) as being in a subject-predicate relationship, or treat the nearest adjective (phrase) and noun (phrase) as being in a modifier-modified relationship.

In this embodiment, sentence analysis is performed on the sentences in the "inquiry content" field of the data in FIG. 2. FIG. 3 schematically shows an example of the result of morphological analysis followed by dependency analysis. In this example, "clerk" (店員) and "response" (対応) are nouns joined by the particle "no" (の), so they are judged to be in a modifier-modified relationship, and "response" (対応) and "bad" (悪い) are the nearest noun and adjective, so they are judged to be in a subject-predicate relationship. The dependency determination results output by the dependency analysis means 21 are referred to below as dependency information.
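
The two simplified rules used in this example (nouns joined by the particle "no", and the nearest noun and adjective or verb) can be sketched in Python as follows. This is an illustrative sketch, not the analysis method of the cited documents: the token format, the part-of-speech labels, the relation names, and the function `find_dependencies` are assumptions, the tokens are English glosses of already segmented words, and "nearest" is taken to mean the nearest preceding noun.

```python
def find_dependencies(tokens):
    """Apply two simplified rules to a list of (surface, pos) tokens.

    Returns a list of (relation, dependent, head) tuples.
    """
    deps = []
    last_noun = None
    for i, (surface, pos) in enumerate(tokens):
        if pos == "noun":
            # Rule 1: noun + particle "no" + noun -> modifier relation.
            if (i >= 2 and tokens[i - 1] == ("no", "particle")
                    and tokens[i - 2][1] == "noun"):
                deps.append(("modifier", tokens[i - 2][0], surface))
            last_noun = surface
        elif pos in ("adjective", "verb") and last_noun is not None:
            # Rule 2: the nearest preceding noun is taken as the subject.
            deps.append(("subject-predicate", last_noun, surface))
    return deps


# Hypothetical word information for a sentence like the one described.
tokens = [
    ("clerk", "noun"), ("no", "particle"), ("response", "noun"),
    ("ga", "particle"), ("bad", "adjective"),
]
for dep in find_dependencies(tokens):
    print(dep)
```

Because the rules look only at adjacency and part of speech, they are cheap to evaluate but will misjudge sentences whose subject is not the closest noun; that is the trade-off accepted when strict syntactic analysis is skipped.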

Next, the word information obtained by the sentence analysis means 11 and the dependency information obtained by the dependency analysis means 21 are input to the data classification means 12. The data classification means 12 compares the input word information and dependency information with the information stored in the classification dictionary storage means 22, determines which category the data item falls into, and attaches category information to each data item.

The classification dictionary storage means 22 stores classification dictionary data, a set of information specifying which category a data item is to be classified into when it contains which word information or dependency information. FIG. 4 schematically shows an example of the classification dictionary data stored in the classification dictionary storage means 22 and an example of the category information that the data classification means 12 attaches on the basis of this data.

The example classification dictionary data in FIG. 4 indicates that, when a data item to be classified contains the word information or dependency information listed in the "phrase" column, the character string listed in the "category information" column to its left is attached as category information. In the example of FIG. 4, "part of speech" and "weight" are also stored in the classification dictionary data as additional information. Using part-of-speech information in the classification processing can improve its accuracy. The weight is a value indicating the importance of each phrase, and can be adjusted when a particular phrase is to be given more, or less, emphasis. For example, the category information can include this weight, or the accumulated weights of the phrases contained in each data item, as an indicator of the strength of a complaint, or only data items whose accumulated weight is at or above a certain value can be classified into the "complaint" category. Other additional information may also be included in the classification dictionary data. Additional information such as "part of speech" and "weight" is not essential to the classification dictionary data, however, and is omitted from the description below unless specifically needed. The classification dictionary data must be registered in the classification dictionary storage means 22 before classification processing is executed; the registration method is described later.
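
The dictionary lookup with accumulated weights described above can be sketched in Python. The `Entry` class, the entries in `DICTIONARY`, the weight values, and the threshold are all hypothetical; FIG. 4 is not reproduced here, so the actual entries and weights may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    category: str
    phrase: object          # a word (str) or a dependency pair (tuple)
    weight: float = 1.0


# Hypothetical entries; the weights are made-up values.
DICTIONARY = [
    Entry("fault", "breakdown", 1.0),
    Entry("complaint", ("response", "bad"), 2.0),
    Entry("complaint", ("attitude", "bad"), 2.0),
    Entry("complaint", "troubled", 0.5),
]


def score_categories(words, dependencies, dictionary=DICTIONARY):
    """Accumulate the weights of matching entries per category."""
    present = set(words) | set(dependencies)
    scores = {}
    for entry in dictionary:
        if entry.phrase in present:
            scores[entry.category] = scores.get(entry.category, 0.0) + entry.weight
    return scores


def assign(words, dependencies, threshold=1.0):
    """Keep the categories whose accumulated weight reaches the threshold."""
    scores = score_categories(words, dependencies)
    return {category for category, s in scores.items() if s >= threshold}


print(score_categories(["troubled"], [("attitude", "bad")]))
print(assign(["breakdown"], [("response", "bad")]))
```

With these made-up values, a record containing only the low-weight word "troubled" stays below the threshold and receives no category, which mirrors the idea of classifying only data items whose accumulated weight is at or above a certain value.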

As for the processing by which the data classification means 12 attaches category information on the basis of this classification dictionary data, in the concrete example of FIG. 4 the category information "fault" (障害) is attached to data containing the word "breakdown" (故障), and the category information "complaint" is attached to data containing the dependency relationships "response → bad" or "attitude → bad". A single data item may contain, for example, both "breakdown" and "is a problem" (困る), that is, phrases that belong to different categories in the classification dictionary data. In that case, several items of category information may be attached to the one data item; alternatively, the category information with the higher frequency of occurrence may be attached, or a score may be calculated from the weights described above and the category information with the higher score attached.
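
The alternatives for a data item that matches several categories can be sketched as a small policy function in Python. The function name `resolve`, the policy labels, and the example scores are hypothetical; the scores are assumed to be either occurrence counts or accumulated weights computed beforehand.

```python
def resolve(scores, policy="all"):
    """Pick categories from a {category: score} mapping.

    policy="all"  keeps every matched category;
    policy="best" keeps only the highest-scoring category.
    """
    if not scores:
        return []
    if policy == "all":
        return sorted(scores)
    if policy == "best":
        return [max(scores, key=scores.get)]
    raise ValueError(f"unknown policy: {policy}")


# Made-up scores for a record matching phrases from two categories.
scores = {"fault": 1.0, "complaint": 2.5}
print(resolve(scores, "all"))
print(resolve(scores, "best"))
```

Keeping all categories preserves information for later counting, while the "best" policy gives each data item exactly one label, which is simpler to tabulate.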

The data with category information finally attached in this way is sent to the classification result output means 13. FIG. 5 shows an example display of the classification results, in a format in which a category field has been added to the original tabular data. As shown in FIG. 6, the classification result output means 13 may also be able to display a graph of the frequency distribution of the number of data items in each category. It may further be able to display, according to the user's settings, graphs such as the frequency distribution of the number of data items in each category for each product or each customer.
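
The frequency distribution of data items per category can be computed with a simple counter, as in the Python sketch below. The record layout with a "categories" list, the function names, and the text-bar rendering are assumptions made for illustration; the actual display of FIG. 6 is not reproduced.

```python
from collections import Counter


def category_counts(records):
    """Count how many records carry each category label."""
    counts = Counter()
    for record in records:
        counts.update(record["categories"])
    return counts


def text_bar_chart(counts, width=20):
    """Render the counts as text bars, most frequent category first."""
    if not counts:
        return []
    top = max(counts.values())
    lines = []
    for category, n in counts.most_common():
        bar = "#" * max(1, round(width * n / top))
        lines.append(f"{category:<10} {bar} {n}")
    return lines


# Made-up classified records for illustration only.
classified = [
    {"categories": ["complaint"]},
    {"categories": ["complaint", "fault"]},
    {"categories": ["request"]},
    {"categories": ["complaint"]},
]
for line in text_bar_chart(category_counts(classified)):
    print(line)
```

Grouping the records by product or customer before calling `category_counts` would give the per-product or per-customer distributions mentioned above.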

次に、分類用辞書記憶手段２２への分類用辞書データの登録方法について説明する。 Next, a method for registering classification dictionary data in the classification dictionary storage means 22 will be described.

分類用辞書データの登録は、分類用辞書データ編集手段２３を用いて実行することができる。分類用辞書データ編集手段２３は、分類用辞書記憶手段２２に記憶された分類用辞書データに、カテゴリ情報や、あるカテゴリに分類するための条件となる語句を新たに追加したり、逆に削除したりといった編集作業をユーザなどが実行するための手段である。 Classification dictionary data can be registered using the classification dictionary data editing means 23. The classification dictionary data editing means 23 allows a user or other operator to edit the classification dictionary data stored in the classification dictionary storage means 22, for example by adding new category information or new phrases that serve as conditions for classification into a category, or conversely by deleting them.

分類用辞書データ編集手段２３は、本実施形態では、コンピュータの表示装置上に所定の画面を表示し、ユーザがこの画面を見ながらキーボードやポインティングデバイスなどの入力手段を用いて入力操作を実行可能な構成を有しているものとする。図７に、分類用辞書データ編集手段２３によって出力される登録画面の例を示す。この画面において、カテゴリ情報の入力欄と語句の入力欄にそれぞれ文字情報を入力し、登録ボタンをクリックすることによって、入力したカテゴリに分類する条件となる語句として、入力した語句が登録される。この際、語句として係り受け情報を登録する場合は、図７に示す例のように、例えば「価格が高い」という文字情報を入力する。分類用辞書データ編集手段２３は、この文字情報のデータを、文書解析手段２４および係り受け解析手段２５に送ることによって、「価格→高い」という係り受け情報に変換し、分類用辞書記憶手段２２に記憶する。これによって、ユーザは、語句として単純に文章を入力することによって、適切な形式で分類用辞書記憶手段２２に分類用辞書データを登録することができる。この際、文書解析手段２４、係り受け解析手段２５の処理内容は、前述の文書解析手段１１、係り受け解析手段２１と基本的に同様であり、図１では便宜上別の手段として記載しているが、文書解析手段２４、係り受け解析手段２５としては、文書解析手段１１、係り受け解析手段２１を流用して用いることができる。 In this embodiment, the classification dictionary data editing means 23 displays a predetermined screen on the display device of a computer, and the user performs input operations with input means such as a keyboard or a pointing device while viewing this screen. FIG. 7 shows an example of the registration screen output by the classification dictionary data editing means 23. On this screen, the user enters text in the category information input field and in the phrase input field and clicks the registration button, whereby the entered phrase is registered as a condition for classification into the entered category. To register dependency information as a phrase, the user enters text such as “price is high”, as in the example of FIG. 7. The classification dictionary data editing means 23 sends this text to the document analysis means 24 and the dependency analysis means 25, which convert it into the dependency information “price → high”, and stores the result in the classification dictionary storage means 22. The user can thus register classification dictionary data in the classification dictionary storage means 22 in the appropriate format simply by entering a sentence as the phrase. The processing of the document analysis means 24 and the dependency analysis means 25 is basically the same as that of the document analysis means 11 and the dependency analysis means 21 described above; although they are shown as separate means in FIG. 1 for convenience, the document analysis means 11 and the dependency analysis means 21 can be reused as the document analysis means 24 and the dependency analysis means 25.

また、本実施形態の情報分類装置は、より効率的に分類用辞書データを作成可能とするために、分類用辞書データ作成手段３０をさらに有している。 In addition, the information classification apparatus according to the present embodiment further includes classification dictionary data creation means 30 in order to enable creation of classification dictionary data more efficiently.

分類用辞書データ作成手段３０を用いて分類用辞書データを作成する際には、まず、データ入力手段３０１を介して、分類対象とするデータに相応するサンプルデータを装置に取り込む。取り込むサンプルデータとしては、分類を実行する場合と同様に、図４に模式的に示すような内容のＣＳＶファイルのデータを取り込むことができる。このサンプルデータは、分類対象とするデータから抜粋したものを用いてもよいし、その時点で蓄積された、分類対象とする全データであってもよい。 When creating the classification dictionary data using the classification dictionary data creation means 30, first, sample data corresponding to the data to be classified is taken into the apparatus via the data input means 301. As sample data to be imported, data of a CSV file having a content as schematically shown in FIG. 4 can be acquired as in the case of performing classification. The sample data may be extracted from the data to be classified, or may be all the data to be classified and accumulated at that time.

次に、分析条件設定手段３０２を用いて分析条件を設定する。分析条件としては、例えば図４に示すようなデータに関して、「商品Ａ」という条件と「問い合わせ内容」という条件を設定する。これによって、設定した条件に合致するデータのみが抽出されて、以後の処理に用いられる。このように分析条件の設定を可能とすることによって、この例では、商品Ａに対する意見の内容を深く分析というように、注目したいデータだけを重点的に分析することが可能となる。また、扱うデータを絞り込むことによって、以降の分類用辞書データの作成にかかる労力を低減することができる。もちろん、分析条件を設定せずに、全てのデータについて分析を行なうことも可能である。 Next, analysis conditions are set using the analysis condition setting means 302. As analysis conditions, for example, a condition “product A” and a condition “inquiry content” are set for data as shown in FIG. As a result, only data that matches the set conditions is extracted and used for the subsequent processing. By enabling the setting of analysis conditions in this way, in this example, it is possible to focus on only the data that is desired to be focused, such as deep analysis of the content of opinion on the product A. Further, by narrowing down the data to be handled, it is possible to reduce the labor required for the subsequent creation of classification dictionary data. Of course, it is possible to analyze all data without setting analysis conditions.
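The narrowing of sample data by analysis conditions can be sketched as a filter over CSV rows. The column names (`product`, `type`, `text`), the sample rows, and the function names are hypothetical; the actual layout of the data in FIG. 4 is not reproduced here.

```python
import csv
import io

# Hypothetical sample data; column names are assumptions for this sketch.
SAMPLE_CSV = """product,type,text
Product A,inquiry,The price is high
Product B,inquiry,No answer to my question
Product A,sales report,Visited the customer
"""


def load_records(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def filter_records(records, conditions=None):
    """Keep records whose fields equal every condition; no conditions keeps all."""
    conditions = conditions or {}
    return [
        record
        for record in records
        if all(record.get(column) == value for column, value in conditions.items())
    ]
```

Passing `{"product": "Product A", "type": "inquiry"}` keeps only the matching rows for the later analysis steps, and passing no conditions analyzes all of the data.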

設定条件に従って抽出されたデータは、次に、文書解析手段３０３および係り受け解析手段３０４に送られる。文書解析手段３０３、係り受け解析手段３０４の処理は、基本的に、前述の文書解析手段１１、係り受け解析手段２１と同様であり、各データについて単語情報と係り受け情報を作成する。 The data extracted in accordance with the setting conditions is then sent to the document analysis unit 303 and the dependency analysis unit 304. The processing of the document analysis unit 303 and the dependency analysis unit 304 is basically the same as that of the document analysis unit 11 and the dependency analysis unit 21 described above, and word information and dependency information are created for each data.

作成された単語情報と係り受け情報は、次に、順位付け手段３０５に渡される。順位付け手段３０５は、全てのデータについて得られた単語情報および係り受け情報を、統計情報に基づき、例えば頻度の高い順に順位付けする。他の統計的な処理の手法として、特許文献１に記載されているように確率的コンプレキシティを用いる手法や、に記載されているようにカイ二乗を用いる手法などがあり、順位付け手段３０５は、このような統計的な処理手法を用いて順位付けを実行するようにしてもよい。 The created word information and dependency information are then passed to the ranking means 305. The ranking means 305 ranks the word information and dependency information obtained for all the data on the basis of statistical information, for example in descending order of frequency. Other statistical techniques include the method using stochastic complexity described in Patent Document 1 and a method using the chi-square statistic, and the ranking means 305 may perform the ranking using such a technique.
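Ranking by frequency can be sketched with a counter over all words and dependency pairs. The input format (one `(words, dependencies)` pair per data item) and the function name `rank_by_frequency` are assumptions for this example, which counts every occurrence; the stochastic complexity and chi-square alternatives mentioned in the text are not shown.

```python
from collections import Counter


def rank_by_frequency(analyzed_records):
    """Rank words and dependency pairs in descending order of occurrence count.

    analyzed_records: iterable of (words, dependencies), where words is a list
    of str and dependencies is a list of (modifier, head) tuples.
    """
    counts = Counter()
    for words, dependencies in analyzed_records:
        counts.update(words)
        counts.update(dependencies)
    # most_common() returns (item, count) pairs, highest count first.
    return counts.most_common()
```

The returned list can be displayed directly as the ranked list that the user works through when building the dictionary.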

次に、順位付け手段３０５によって順位付けした結果は、順位付け結果出力手段３０６によって、例えば、図８に示すように、リストとグラフといった形式で表示される。 Next, the ranking result output unit 306 displays the ranking result by the ranking unit 305, for example, in the form of a list and a graph as shown in FIG.

分類用辞書入力手段３０７は、ユーザが順位付け結果出力手段３０６によって表示された結果を見ながら、分類用辞書記憶手段２２に分類用辞書データを登録する処理をするための手段であり、例えば、表示画面において、順位付けの順に表示された語句を、ポインティングデバイスを用いてクリックするなどの簡単な操作による登録処理を可能とする。 The classification dictionary input means 307 is a means by which the user registers classification dictionary data in the classification dictionary storage means 22 while viewing the results displayed by the ranking result output means 306. For example, it allows registration through a simple operation, such as clicking with a pointing device on a phrase shown in ranked order on the display screen.

図８には、このような処理の例も併せてしめされている。この例では、順位付けの表示画面において、「回答→ない」という語句をクリックすると、「原文参照」、「分類用辞書へ登録」といった選択項目を含むポップアップメニューが表示される。そこで、「原文参照」を選択すると、その語句を含む元の文章が表示され、ユーザは、これを見て、どのようなカテゴリに分類すべきかを検討することができる。そして、次に、ポップアップメニューの「分類用辞書へ登録」を選択すると、分類用辞書の登録画面に移り、語句の入力欄には、順位付け結果出力手段３０６による結果表示画面におういて選択した語句が表示される。この例では、「回答→ない」という語句に対して、ユーザは、この語句を含むデータを、既にカテゴリ情報として登録されている「苦情」というカテゴリに分類すべきだと判断したとする。この場合、ユーザが、登録画面に表示されているカテゴリ情報と語句の対応付けの表の中から「苦情」の欄をクリックすることによって、カテゴリ情報の入力欄に「苦情」という文字情報が入る。そこで、登録ボタンをクリックすることによって、「苦情」のカテゴリに分類する条件となる語句として「回答→ない」が登録される。この際、カテゴリ情報の入力欄に「苦情」という文字情報を直接入力できるようにしてもよいし、また、新たなカテゴリ名を入力して、カテゴリを追加登録できるようにしてもよい。 FIG. 8 also shows an example of such processing. In this example, when the phrase “answer → no” is clicked on the ranking display screen, a pop-up menu including selection items such as “reference to original text” and “register in dictionary for classification” is displayed. Therefore, when “reference to the original text” is selected, the original sentence including the word is displayed, and the user can see what category it should be classified by looking at this. Then, when “Register to classification dictionary” is selected from the pop-up menu, the screen moves to the classification dictionary registration screen, and the phrase input field is selected on the result display screen by the ranking result output means 306. The phrase is displayed. In this example, it is assumed that the user determines that the data including the phrase should be classified into the category “complaint” that is already registered as category information for the phrase “answer → no”. In this case, when the user clicks on the “complaint” field from the category information / phrase correspondence table displayed on the registration screen, the text information “complaint” is entered in the category information input field. . Therefore, by clicking the registration button, “answer → no” is registered as a word that is a condition for classification into the category of “complaint”. 
At this time, the character information “complaint” may be directly input in the category information input field, or the category may be additionally registered by inputting a new category name.

以上のようにして、分類用辞書登録画面において登録ボタンをクリックすると、画面は、順位付けの表示画面に戻る。この際、登録した語句はリストから削除される。あるいは、リスト上で、表示色を変えたり、登録されたカテゴリ情報をリストに表示したりして、その語句が、既にカテゴリ分類用に登録済みであることを確認できるようにしてもよい。 When the registration button is clicked on the classification dictionary registration screen as described above, the screen returns to the ranking display screen. At this time, the registered word / phrase is deleted from the list. Alternatively, the display color may be changed on the list, or the registered category information may be displayed on the list so that it can be confirmed that the word has already been registered for category classification.
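The registration step and the subsequent removal of the registered phrase from the ranked list can be sketched as follows. The data structures (a dict from category to a list of phrases, and a ranked list of `(phrase, count)` pairs) and the function name `register_phrase` are assumptions for this example; the on-screen interaction itself is not modeled.

```python
def register_phrase(dictionary, ranked, category, phrase):
    """Register phrase under category and return the ranked list without it.

    dictionary: dict mapping category name -> list of phrases (modified in place).
    ranked: list of (phrase, count) pairs in display order.
    A category that does not exist yet is created, mirroring the option of
    adding a new category at registration time.
    """
    phrases = dictionary.setdefault(category, [])
    if phrase not in phrases:
        phrases.append(phrase)
    # Drop the registered phrase so the user moves on to the next candidate.
    return [(p, n) for p, n in ranked if p != phrase]
```

Instead of dropping the phrase, a variant could keep it in the list and attach the registered category for display, as the text also allows.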

このようにして、ユーザは、分類対象のデータにおいて、高い頻度で使用されている語句から順に次々に、カテゴリとの対応付けを実行し、効率的に分類用辞書データを作成することができる。また、分類用辞書データ作成手段３０を用いることによって、ユーザは、どのようなデータが含まれているかを見ながら分類用辞書データを作成できるので、適切な分類を想起するのも容易であり、的確な分類処理が可能となる。 In this manner, the user can efficiently create the classification dictionary data by sequentially associating with the categories in order from the frequently used words and phrases in the data to be classified. Further, by using the classification dictionary data creation means 30, the user can create classification dictionary data while looking at what kind of data is included, so it is easy to recall an appropriate classification, Accurate classification processing is possible.

なお、分類用辞書データ作成手段３０において、データ入力手段３０１、文書解析手段３０３、係り受け解析手段３０４としては、前述のデータ入力手段１０、文書解析手段１１、係り受け解析手段２１を適宜流用して用いることができる。 In the classification dictionary data creation means 30, the data input means 10, the document analysis means 11, and the dependency analysis means 21 described above can be reused as appropriate as the data input means 301, the document analysis means 303, and the dependency analysis means 304.

本実施形態の変形例として、情報分類装置は、図９に示すように、カテゴリ別データ抽出手段４０をさらに含んでいてもよい。カテゴリ別データ抽出手段４０は、ユーザが、分類結果出力手段１３によって表示された結果を見て、所望のカテゴリを指定することによって、そのカテゴリに分類されたデータのみを抽出する働きをする。ユーザは、抽出したデータをデータ分類手段１２の入力データとして用い、それによって、前にカテゴリ別に分けたデータのうちの、所望のカテゴリに含まるデータをより詳細に分類して、より深く分析を実施することができる。すなわち、例えば、最初の分類に用いた「苦情」というカテゴリに含まれるデータについてさらに詳細な分類を実行することによって、例えば、苦情の内容の傾向（苦情としてどんな言葉が用いられているかなど）を分析することができる。 As a modification of this embodiment, the information classification device may further include category-specific data extraction means 40, as shown in FIG. 9. When the user views the results displayed by the classification result output means 13 and designates a desired category, the category-specific data extraction means 40 extracts only the data classified into that category. The user can then feed the extracted data to the data classification means 12 as input data, classifying the data in the desired category in finer detail and analyzing it in greater depth. For example, performing a more detailed classification of the data in the “complaint” category used in the first classification makes it possible to analyze trends in the content of complaints (such as which words are used in complaints).

この際、分類用辞書データとしては、前に分類に用いたカテゴリに含まれるデータをさらに詳細に分類するための新たなカテゴリを定義したものを用いる必要がある。このような分類用辞書データの登録は、分類用辞書データ編集手段２３を用いて実行してもよいが、カテゴリ別データ抽出手段４０によって抽出したデータを分類用辞書データ作成手段３０に入力し、分類用辞書データ作成手段３０を用いて分類用辞書データを作成することによって、より的確かつ効率的に新たな分類用辞書データを作成することができる。 At this time, as the classification dictionary data, it is necessary to use data defining a new category for further classifying data included in the category previously used for classification. Such registration of classification dictionary data may be executed by using the classification dictionary data editing means 23, but the data extracted by the category-specific data extraction means 40 is input to the classification dictionary data creation means 30, By creating the classification dictionary data using the classification dictionary data creation means 30, new classification dictionary data can be created more accurately and efficiently.

このように新たに定義したカテゴリ別に分類したデータは、これに対して、再び、カテゴリ別データ抽出手段４０を用いて特定のカテゴリに分類されたデータを抽出し、抽出したデータをさらに詳細なカテゴリに分類するのに用いることもできる。このように、前に定義したカテゴリ内をより詳細なカテゴリに分類する作業を繰り返し実行することによって、分類を段階的に的確に精緻化していくことができる。 In this way, the data classified by the newly defined category is extracted again by using the category-specific data extraction means 40 to extract data classified into a specific category, and the extracted data is further classified into categories. It can also be used for classification. In this way, the classification can be accurately refined step by step by repeatedly performing the work of classifying the previously defined category into a more detailed category.
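The stepwise refinement described above can be sketched as two small functions applied repeatedly: classify, extract one category, then classify the extracted data again with a more detailed dictionary. The word-only matching, the record layout, and the names `classify_by_words` and `extract_category` are simplifying assumptions for this example.

```python
def classify_by_words(records, dictionary):
    """Return copies of records with a 'categories' list attached.

    records: list of dicts with a 'words' list.
    dictionary: dict mapping category name -> list of condition words.
    """
    classified = []
    for record in records:
        words = set(record["words"])
        categories = [
            category
            for category, condition_words in dictionary.items()
            if words & set(condition_words)
        ]
        classified.append({**record, "categories": categories})
    return classified


def extract_category(records, category):
    """Keep only the records classified into the designated category."""
    return [record for record in records if category in record["categories"]]
```

A first pass might use a coarse dictionary with a “complaint” category; `extract_category(first_pass, "complaint")` then yields the input for a second pass with a dictionary of complaint subcategories, and the same two calls can be repeated at each further level.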

本発明の実施形態の情報分類装置の構成を模式的に示すブロック図である。 A block diagram schematically showing the configuration of an information classification device according to an embodiment of the present invention.
図１のデータ入力手段によって取り込むデータの一例を示す図である。 A diagram showing an example of data captured by the data input means of FIG. 1.
図１の文書解析手段、係り受け解析手段による解析結果の一例を示す図である。 A diagram showing an example of analysis results produced by the document analysis means and the dependency analysis means of FIG. 1.
図１の分類用辞書記憶手段に記憶された分類用辞書データの内容例と、この分類用辞書データに応じた、データ分類手段による分類処理例を示す図である。 A diagram showing an example of the contents of the classification dictionary data stored in the classification dictionary storage means of FIG. 1, and an example of classification processing by the data classification means according to this classification dictionary data.
図１の分類結果出力手段による出力例を示す図である。 A diagram showing an example of output by the classification result output means of FIG. 1.
図１の分類結果出力手段による他の出力例を示す図である。 A diagram showing another example of output by the classification result output means of FIG. 1.
図１の分類用辞書データ編集手段による登録画面の一例を示す図である。 A diagram showing an example of a registration screen provided by the classification dictionary data editing means of FIG. 1.
図１の順位付け結果出力手段による出力例と、これに応じた分類用辞書入力手段による登録画面の一例を示す図である。 A diagram showing an example of output by the ranking result output means of FIG. 1, and an example of the corresponding registration screen provided by the classification dictionary input means.
図１の変形例の情報分類装置の構成を模式的に示すブロック図である。 A block diagram schematically showing the configuration of an information classification device according to a modification of FIG. 1.

符号の説明 Explanation of symbols

１０，３０１データ入力手段
１１，２４，３０３文書解析手段
１２データ分類手段
１３分類結果出力手段
２１，２５，３０４係り受け解析手段
２２分類用辞書記憶手段
２３分類用辞書データ編集手段
３０分類用辞書データ作成手段
４０カテゴリ別データ抽出手段
３０２分析条件設定手段
３０５順位付け手段
３０６順位付け結果出力手段
３０７分類用辞書入力手段

10, 301 Data input means
11, 24, 303 Document analysis means
12 Data classification means
13 Classification result output means
21, 25, 304 Dependency analysis means
22 Classification dictionary storage means
23 Classification dictionary data editing means
30 Classification dictionary data creation means
40 Category-specific data extraction means
302 Analysis condition setting means
305 Ranking means
306 Ranking result output means
307 Classification dictionary input means

Claims

分類対象とするテキストデータをカテゴリ別に分類する情報分類装置であって、
前記テキストデータを取り込むデータ入力手段と、
前記データ入力手段によって取り込んだ前記テキストデータの文章を形態素解析し、該文章を単語に分け品詞情報を付与した単語情報を作成する文章解析手段と、
前記文章解析手段によって作成された前記単語情報に基づいて、各単語間の主語と述語の関係、または修飾語と被修飾語の関係である係り受け関係を判定し、係り受け情報を作成する係り受け解析手段と、
前記各カテゴリ別に、当該カテゴリに分類する条件となる、前記テキストデータが含むべき前記単語情報と前記係り受け情報が登録された分類用辞書データを記憶する分類用辞書記憶手段と、
前記各テキストデータを、当該テキストデータに含まれる前記単語情報および前記係り受け情報と前記分類用辞書記憶手段に記憶された前記分類用辞書データに基づいて前記カテゴリ別に分類するデータ分類手段とを有し、
前記分類用辞書データを前記分類用辞書記憶手段に登録する手段として、サンプルテキストデータについての、前記文章解析手段によって作成された前記単語情報と、前記係り受け解析手段によって作成された前記係り受け情報のうちから指定されたものを、指定された前記カテゴリに分類するための条件として前記分類用辞書データに登録する分類用辞書データ作成手段をさらに有する情報分類装置。 An information classification device that classifies text data to be classified into categories,
the device comprising data input means for capturing the text data;
sentence analysis means for performing morphological analysis on sentences of the text data captured by the data input means, dividing each sentence into words, and creating word information to which part-of-speech information is attached;
dependency analysis means for determining, based on the word information created by the sentence analysis means, dependency relations between words, each being a subject-predicate relation or a modifier-modified relation, and creating dependency information;
classification dictionary storage means for storing classification dictionary data in which, for each category, the word information and the dependency information that the text data must contain are registered as conditions for classification into that category; and
data classification means for classifying each text data item by category based on the word information and the dependency information contained in that text data item and on the classification dictionary data stored in the classification dictionary storage means,
the information classification device further comprising, as means for registering the classification dictionary data in the classification dictionary storage means, classification dictionary data creation means for registering, in the classification dictionary data, items designated from among the word information created for sample text data by the sentence analysis means and the dependency information created by the dependency analysis means, as conditions for classification into a designated category.

前記分類用辞書データ作成手段は、分類対象とする前記テキストデータの全てに含まれる前記単語情報と前記係り受け情報に対して統計的な手法によって順位付けを行なう順位付け手段を含む、請求項１に記載の情報分類装置。 The information classification device according to claim 1, wherein the classification dictionary data creation means includes ranking means for ranking, by a statistical method, the word information and the dependency information contained in all of the text data to be classified.

前記データ分類手段によって分類された結果に基づいて、指定された前記カテゴリに分類された前記テキストデータのみを抽出するカテゴリ別データ抽出手段をさらに有し、該カテゴリ別データ抽出手段によって抽出された該テキストデータを新たな前記分類用辞書データの作成に用いる、請求項１または２に記載の情報分類装置。 The information classification device according to claim 1 or 2, further comprising category-specific data extraction means for extracting, based on the results of classification by the data classification means, only the text data classified into a designated category, wherein the text data extracted by the category-specific data extraction means is used to create new classification dictionary data.

分類対象とするテキストデータをカテゴリ別に自動的に分類する情報分類方法であって、
前記テキストデータの文章を形態素解析し、該文章を単語に分け品詞情報を付与した単語情報を作成するステップと、
前記単語情報に基づいて、各単語間の主語と述語の関係、または修飾語と被修飾語の関係である係り受け関係を判定し、係り受け情報を作成するステップと、
各カテゴリ別に、当該カテゴリに分類する条件となる、前記テキストデータが含むべき前記単語情報と前記係り受け情報を登録した分類用辞書データを作成するステップと、
前記各テキストデータを、当該テキストデータに含まれる前記単語情報および前記係り受け情報と前記分類用辞書記憶手段に記憶された前記分類用辞書データに基づいて前記カテゴリ別に分類するステップとを有し、
前記分類用辞書データを作成するステップは、サンプルテキストデータについて作成した前記単語情報と前記係り受け情報のうちから指定されたものを、指定された前記カテゴリに分類するための条件として前記分類用辞書データに登録するステップを含む情報分類方法。 An information classification method that automatically classifies text data to be classified into categories,
the method comprising the steps of: performing morphological analysis on sentences of the text data, dividing each sentence into words, and creating word information to which part-of-speech information is attached;
determining, based on the word information, dependency relations between words, each being a subject-predicate relation or a modifier-modified relation, and creating dependency information;
creating classification dictionary data in which, for each category, the word information and the dependency information that the text data must contain are registered as conditions for classification into that category; and
classifying each text data item by category based on the word information and the dependency information contained in that text data item and on the classification dictionary data stored in the classification dictionary storage means,
wherein the step of creating the classification dictionary data includes a step of registering, in the classification dictionary data, items designated from among the word information and the dependency information created for sample text data, as conditions for classification into a designated category.

前記分類用辞書データを作成するステップは、分類対象とする前記テキストデータの全てに含まれる前記単語情報と前記係り受け情報に対して統計的な手法によって順位付けを行なうステップをさらに含む、請求項４に記載の情報分類方法。 The information classification method according to claim 4, wherein the step of creating the classification dictionary data further includes a step of ranking, by a statistical method, the word information and the dependency information contained in all of the text data to be classified.

指定された前記カテゴリに分類された前記テキストデータのみを抽出し、抽出した該テキストデータに対して、新たに前記分類用辞書データを作成するステップをさらに有する、請求項４に記載の情報分類方法。 The information classification method according to claim 4, further comprising a step of extracting only the text data classified into a designated category and newly creating the classification dictionary data for the extracted text data.

請求項４から６のいずれか１項に記載の情報分類方法をコンピュータに実行させるためのプログラム。 A program for causing a computer to execute the information classification method according to any one of claims 4 to 6.