JP2006323517A - Text classification device and program

Info

Publication number: JP2006323517A
Application number: JP2005144358A
Authority: JP
Inventors: Takeyuki Aikawa; 勇之相川; Makoto Imamura; 誠今村; Akito Nagai; 明人永井; Yasuhiro Takayama; 泰博高山
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2005-05-17
Filing date: 2005-05-17
Publication date: 2006-11-30

Abstract

PROBLEM TO BE SOLVED: To obtain a text classification device in which classification conditions can be set easily.

SOLUTION: A classification keyword candidate structuring means 102 structures the classification keyword candidates extracted by a classification keyword candidate extraction means 101, based on a prescribed relation between the candidates. A classification keyword related information extraction means 103 extracts information relating to the structured candidates. A classification condition setting means 104 sets classification conditions from information presented by at least one of the classification keyword candidate structuring means 102 and the classification keyword related information extraction means 103, and a classification execution means 105 classifies the classification target texts according to the classification conditions set by the classification condition setting means 104.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a text classification device that makes it possible, for example, to classify a large volume of text holding important information needed for work such as product planning and quality control from any desired viewpoint, and to use the result for business improvement.

With the spread of the Internet, customers can now easily send their opinions to companies by e-mail or over the Web. Within companies, documents are increasingly digitized, and documents needed for quality control work, such as design specifications and failure investigation reports, continue to accumulate. As a result, companies and similar organizations now hold large volumes of text documents.

For this reason, text classification devices that classify and organize large volumes of text by the type of information they contain are becoming more important as a way to extract information useful for business improvement from these accumulated documents. Among conventional text classification devices, some include classification condition specifying means for classifying documents and document classification means for classifying documents into a plurality of categories specified by the specified classification conditions (see, for example, Patent Document 1 or Patent Document 2).

Patent Document 1: Japanese Patent Laid-Open No. 2003-141129 (page 14, FIGS. 2 to 4)
Patent Document 2: Japanese Patent Laid-Open No. 2003-271616 (page 10, FIG. 1)

However, among the conventional techniques described above, fully automatic clustering methods that rely on statistics, such as decision trees, have the problem that the meaning of each generated cluster is hard for the user to understand. In addition, because they are fully automatic, the user cannot customize or control the result when the generated clusters do not take the desired form.

On the other hand, the conventional automatic classification method in which search conditions are specified manually for each category has the problem that the user does not know how the classification conditions should be set, and ends up repeating trial and error between condition setting and classification execution.

The present invention has been made to solve the problems described above, and its object is to obtain a text classification device in which classification conditions can be set easily.

The text classification device according to the present invention includes: classification keyword candidate structuring means for structuring a plurality of classification keyword candidates extracted by classification keyword candidate extraction means, based on a predetermined relationship between the classification keyword candidates; and classification keyword candidate related information extraction means for extracting information related to the classification keyword candidates structured by the classification keyword candidate structuring means.

The text classification device of the present invention structures the classification keyword candidates based on a predetermined relationship between them and also extracts information related to those candidates, so that the work of setting classification conditions for the text classification device becomes easy.

Embodiment 1.
FIG. 1 is a block diagram showing a text classification device according to Embodiment 1 of the present invention.
In the figure, the text classification device includes classification keyword candidate extraction means 101, classification keyword candidate structuring means 102, classification keyword related information extraction means 103, classification condition setting means 104, and classification execution means 105.

The classification keyword candidate extraction means 101 has the function of extracting classification keyword candidates from the classification target text 110. The classification keyword candidate structuring means 102 is a functional unit that structures the plurality of classification keyword candidates extracted by the classification keyword candidate extraction means 101, based on a predetermined relationship between those candidates, and presents them. The classification keyword related information extraction means 103 is a functional unit that extracts and presents related information, such as related word information, related document information, and category related information, for the classification keyword candidates structured and presented by the classification keyword candidate structuring means 102. The classification condition setting means 104 is a functional unit that, when the user selects a classification condition on the basis of the information presented by the classification keyword candidate structuring means 102 or the classification keyword related information extraction means 103, sets it as a classification condition. The classification execution means 105 is a functional unit that classifies the classification target text according to the classification conditions set by the classification condition setting means 104 and outputs the result as a classification result 111.

The text classification device is realized by a computer. Each of the classification keyword candidate extraction means 101 through the classification execution means 105 consists of software corresponding to its function and of hardware, such as a CPU and memory, for executing that software.

Next, the operation of Embodiment 1 will be described.
FIG. 2 is a flowchart showing an overview of the classification process.
First, an overview of the classification process will be described with reference to FIGS. 1 to 14 as appropriate.
This embodiment is described using the analysis of inquiry records about PC peripheral devices as an example. First, in step ST201 of FIG. 2, the classification keyword candidate extraction means 101 extracts classification keyword candidates from the classification target text.

In the classification keyword candidate extraction step ST201, the classification target text is divided into words, an importance score is calculated from frequency statistics of the words appearing in each text, and words with high importance are extracted. Simple frequency may be used as the importance, so that frequent words are extracted as important words. Alternatively, important words may be extracted using an IDF (Inverse Document Frequency) weight, which treats general words that appear in all texts as unimportant for classification. It is also possible to focus on a specific attribute, obtain the co-occurrence probability between its attribute values and each word, and use a feature degree that treats words whose distribution is strongly skewed relative to the frequency distribution of the attribute values as important for classification.
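The first two scoring options in step ST201 can be illustrated with a minimal Python sketch. It assumes the text has already been split into words (for Japanese text this would require a morphological analyzer, which is not shown), and it uses log(N / df) as the IDF weight; the patent does not state a formula, so that form, the function name `score_candidates`, and the sample records are all assumptions made for illustration.

```python
import math
from collections import Counter

def score_candidates(token_lists):
    """Return {word: (frequency, idf)} for a list of tokenized texts.

    The IDF form log(N / df) is an assumption; the source only says
    that an IDF weight may be used.
    """
    n_docs = len(token_lists)
    # Total number of occurrences of each word across all texts.
    term_freq = Counter(w for tokens in token_lists for w in tokens)
    # Number of texts in which each word appears at least once.
    doc_freq = Counter(w for tokens in token_lists for w in set(tokens))
    return {
        w: (term_freq[w], math.log(n_docs / doc_freq[w]))
        for w in term_freq
    }

# Hypothetical, already-tokenized inquiry records.
records = [
    ["printer", "paper", "jam"],
    ["printer", "driver", "install"],
    ["printer", "paper", "feed"],
]
scores = score_candidates(records)
# "printer" appears in every record, so its IDF weight is 0.
```

Ranking by frequency alone would place "printer" first, whereas ranking by the IDF weight pushes it to the bottom, which is the behavior the passage describes for general words that appear in all texts.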

Next, in step ST202, the classification keyword candidate structuring means 102 structures the classification keyword candidates extracted in step ST201 and presents them to the user. The classification keyword candidate structuring process is described in detail later.

Next, in step ST203, the classification keyword related information extraction means 103 extracts information related to the classification keyword candidates structured in step ST202 and presents it to the user. The classification keyword related information extraction process is described in detail later.

Then, in step ST204, the classification condition setting means 104 sets the classification conditions, and in step ST205 the classification execution means 105 classifies the classification target text and outputs the result.

FIG. 3 shows an example of the operation screen of the text classification device in this embodiment.
In the figure, 301 is a screen that displays the result of structuring the classification keyword candidates in step ST202, and 302 is a screen that displays the keyword related information extracted in step ST203. The user sets the classification conditions described later and executes classification while referring to the classification keyword candidate structuring result display screen 301 and the keyword related information display screen 302. The classification conditions that have been set are displayed in the classification list display section 303. When the user issues a content display instruction for a category selected in the classification list display section 303, the classification result is displayed. The contents displayed on these screens are described in detail later.

The classification keyword candidate extraction process in step ST201 of FIG. 2 is described below.
FIG. 4 shows an example of text to be classified. Here, call center inquiry records are used as the example. In the figure, 401 is a record ID assigned uniquely to each record. A record is one unit to be classified and corresponds, for example, to one row of data when CSV data is input. 402 is attribute information, and 403 is text information.

FIG. 5 is an explanatory diagram of the result of classifying the inquiries in the classification target shown in FIG. 4. 501 indicates the inquiry content, and 502 indicates the inquiry classification corresponding to the inquiry content 501.

The aim in this embodiment is to classify the inquiry contents of FIG. 4 by cause and to assign an inquiry content classification code such as 502 in FIG. 5. FIGS. 4 and 5 show input and output in tabular form, but other formats may be used as long as the target is text. For example, a list of file names may be taken as input, and a classification code obtained by classifying the files by content may be assigned to each file.

FIG. 6 shows an example of classification keyword candidates extracted from the classification target.
In the figure, 601 is a keyword ID assigned uniquely to each keyword. 602 is the heading of the classification keyword, and 603 is its part of speech. 604 is the frequency with which the classification keyword appears in the classification target text. 605 is the idf weight calculated from the appearance frequency of the classification keyword. 606 is the feature degree, which focuses on a specific attribute, obtains the co-occurrence probability between its attribute values and each word, and treats words whose distribution is strongly skewed relative to the frequency distribution of the attribute values as important for classification. 607 is dependency information. Here, the word that a heading depends on (the head) is shown as a heading following "+", and a word that depends on the heading (the dependent) is shown as a heading following "-". For example, FIG. 6 shows that the word "○○ device" depends on "replacement" and, conversely, that "○○ device" is one of the words that "replacement" receives. The example of extracted classification keyword candidates in FIG. 6 is only an example, and the necessary items are to be set as appropriate for the purpose of classification. This concludes the description of the classification keyword candidate extraction process in step ST201 of FIG. 2.

Next, the classification keyword candidate structuring process in step ST202 of FIG. 2 will be described.
FIG. 7 shows an example in which classification keywords are structured on the basis of semantic classification. In the figure, 701 is a structuring type selection menu, from which the user selects the desired form of structured display. The illustrated example shows the state in which semantic classification has been selected as the keyword candidate structuring type. 702 is a semantic class selection menu, in which the user specifies the semantic classes to be displayed together with their positional relationship. The illustrated example shows the state in which "device", "phenomenon", and "treatment" have been selected. 703 shows the classification keyword candidates in structured form. In 703 of FIG. 7, the user checks the keywords judged suitable as classification keywords. When the "Add category" button of FIG. 3 is pressed in this state, the checked group of keywords is set as a classification condition.

When structuring by semantic classification, the classification keyword candidate structuring means 102 first looks up the heading of each classification keyword candidate extracted in step ST201 in the semantic classification dictionary shown in FIG. 8 to obtain its semantic class, and sorts the keyword candidates by semantic class. Then, for each keyword candidate in a semantic class, it counts the co-occurrences with the keyword candidates belonging to the semantic class positioned to its right, and displays those candidates connected by lines to the left-hand word, in descending order of that count. Keyword candidates at or below a predetermined frequency are shown only when a button such as "additional display" in FIG. 7 is pressed, so that large numbers of unnecessary words are not displayed. Any of several counts may be used as the co-occurrence count: the in-record count, for words appearing in the same record; the in-sentence count, for words appearing in the same sentence; or the dependency count, which tallies only words in a dependency relationship such as that shown at 607 in FIG. 6. The device may also be configured to switch among these as needed.
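The structuring by semantic classification described above can be sketched as follows, using the in-record co-occurrence count. The dictionary contents, the function name `cooccurrence_links`, and the `min_count` threshold parameter are hypothetical stand-ins for the semantic classification dictionary of FIG. 8 and the "predetermined frequency"; the sketch only computes the ranked links and leaves the line drawing of FIG. 7 aside.

```python
from collections import Counter
from itertools import product

# Hypothetical entries standing in for the semantic classification dictionary.
SEMANTIC_DICT = {
    "printer": "device", "scanner": "device",
    "jam": "phenomenon", "noise": "phenomenon",
    "replace": "treatment",
}

def cooccurrence_links(records, left_class, right_class, min_count=1):
    """Rank (left word, right word) pairs by in-record co-occurrence count."""
    counts = Counter()
    for tokens in records:
        words = set(tokens)  # count each pair at most once per record
        left = [w for w in words if SEMANTIC_DICT.get(w) == left_class]
        right = [w for w in words if SEMANTIC_DICT.get(w) == right_class]
        counts.update(product(left, right))
    # Pairs below the threshold are dropped (they would sit behind
    # the "additional display" button).
    links = [(pair, c) for pair, c in counts.items() if c >= min_count]
    # Descending count, then alphabetical for a stable order.
    return sorted(links, key=lambda item: (-item[1], item[0]))

records = [
    ["printer", "jam", "replace"],
    ["printer", "jam"],
    ["scanner", "noise"],
    ["printer", "noise"],
]
links = cooccurrence_links(records, "device", "phenomenon")
```

Switching to the in-sentence or dependency count would only change what is passed as `records` (sentences instead of records, or dependency pairs instead of token sets); the ranking step stays the same.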

FIG. 9 shows an example in which classification keywords are structured on the basis of part of speech and dependency information.
In the figure, 901 is a structuring type selection menu, from which the user selects the desired form of structured display. 902 is a part-of-speech selection menu, in which the user specifies the dependency information to be displayed by the positional relationship between parts of speech. 903 shows the classification keyword candidates in structured form. In 903 of FIG. 9, the user checks the keywords judged suitable as classification keywords. When the "Add category" button of FIG. 3 is pressed in this state, the checked group of keywords is set as a classification condition.

When structuring by dependency information, the keyword candidates extracted in step ST201 are first sorted by part of speech, by referring to the part-of-speech information shown at 603 in FIG. 6. Then, for each keyword candidate in each part-of-speech group, its dependency relations with the keyword candidates belonging to the centrally positioned part of speech are obtained from the dependency information shown at 607 in FIG. 6, and the words are displayed connected by lines to the words on the left and right, in descending order of the number of such relations. Keyword candidates at or below a predetermined frequency are shown only when a button such as "additional display" in FIG. 9 is pressed, so that large numbers of unnecessary words are not displayed.

Structuring the classification keyword candidates with the semantic category dictionary makes it easy to set classification conditions for items such as devices, while structuring them with dependency information makes it easy to set classification conditions for items that are hard to define in advance as semantic categories, such as classification by failure phenomenon. In this embodiment, the semantic classification of FIG. 7 and the dependency and part-of-speech structuring of FIG. 9 have been shown as keyword structuring types, but the invention is not limited to these. Various other kinds of structuring can be applied, for example structuring by model or structuring by the hierarchical structure of parts.

Because the user can select the structuring type of the classification keyword candidates as appropriate, it becomes easy to set classification conditions suited to the purpose of classification. This concludes the description of the classification keyword candidate structuring process in step ST202 of FIG. 2.

Next, the classification keyword related information extraction process in step ST203 of FIG. 2 will be described.
FIG. 10 is a detailed block diagram of the classification keyword related information extraction means. In the figure, components that are the same as in FIG. 1 are given the same numbers. The classification keyword related information extraction means 103 includes keyword related information extraction means 1001, related document information extraction means 1002, and category overlap information extraction means 1003.

The keyword related information extraction means 1001 is a functional unit that extracts keyword related information, such as co-occurring words, similar words, and related words, for the keyword candidates structured by the classification keyword candidate structuring means 102. It includes co-occurring word information extraction means 1004, similar word information extraction means 1005, and concept related word information extraction means 1006. The co-occurring word information extraction means 1004 is a functional unit that extracts co-occurring word information for a specified keyword candidate. The similar word information extraction means 1005 is a functional unit that extracts similar word information for a specified keyword candidate. The concept related word information extraction means 1006 is a functional unit that extracts concept related words for a specified keyword candidate.

The related document information extraction means 1002 is a functional unit that extracts document information related to the keyword candidates structured by the classification keyword candidate structuring means 102. The category overlap information extraction means 1003 is a functional unit that extracts category overlap information according to the classification condition settings of each category set by the classification condition setting means 104.

FIG. 11 is a flowchart showing the details of the classification keyword related information extraction process in step ST203 of FIG. 2.
In step ST1101, the keyword related information extraction means 1001 of FIG. 10 extracts keyword related information for the words selected by the user from among the keyword candidates structured by the classification keyword candidate structuring means. Specifically, it extracts information on keywords related to the words whose check boxes are checked on the classification keyword candidate structured display screens shown in FIGS. 7 and 9. Co-occurring word information, similar word information, and concept related word information are extracted as keyword related information. First, the co-occurring word information extraction means 1004 of FIG. 10 extracts the words in a dependency relationship with the selected word by referring to the dependency information shown at 607 in FIG. 6. Next, the similar word information extraction means 1005 of FIG. 10 extracts words similar to the specified keyword from the classification target text, using, for example, the length of the longest common subsequence (Longest Common Subsequence) as the similarity measure. Finally, the concept related word information extraction means 1006 of FIG. 10 extracts concept related word information for the specified keyword from the classification target text. Concept related words are extracted using, for example, the concept vectors described in Japanese Patent Laid-Open No. 2004-70636 as the measure of relatedness.

The group of keywords related to the specified keyword extracted as described above can be added to the classification condition settings as appropriate. Because keywords related to the keyword specified in the classification keyword candidate structuring means 102 can be extracted from a variety of viewpoints in this way, appropriate classification conditions can be set without omissions.

Next, in step ST1102, the related document information extraction means 1002 of FIG. 10 extracts information on documents related to the words selected by the user from among the keyword candidates structured by the classification keyword candidate structuring means 102. For example, it extracts information on documents related to the words whose check boxes are checked on the classification keyword candidate structured display screens shown in FIGS. 7 and 9. Specifically, it extracts the documents containing the specified keyword from the classification target text, together with their count. By referring to the extracted classification target text, the user can find out before executing classification whether the specified keyword is appropriate for the desired text classification.
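Step ST1102 amounts to a keyword lookup over the records. A minimal sketch, assuming records are held as a mapping from record ID to text and that plain substring matching is sufficient (both are assumptions, as is the function name `related_documents`):

```python
def related_documents(records, keyword):
    """Return (count, {record_id: text}) for records containing the keyword."""
    hits = {rid: text for rid, text in records.items() if keyword in text}
    return len(hits), hits

# Hypothetical inquiry records keyed by record ID.
records = {
    1: "paper jam in printer",
    2: "driver install failed",
    3: "printer driver error",
}
count, hits = related_documents(records, "printer")
```

Showing both the count and the matching texts lets the user judge a keyword's coverage and its precision before committing it as a classification condition.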

Next, in step ST1103, the category overlap information extraction means 1003 of FIG. 10 extracts, according to the classification condition settings of each category set by the classification condition setting described later, information on the overlap between those categories and the category specified by the words that the user selected from among the keyword candidates structured by the classification keyword candidate structuring means. Specifically, it extracts the texts that are classified both into a category whose classification conditions have already been set by the classification condition setting means 104 described later, and into the category that would result if the words checked on the classification keyword candidate structured display screens shown in FIGS. 7 and 9 were set as a classification condition. A category here means a set of keywords or documents based on a certain classification condition.
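The overlap check of step ST1103 can be expressed as a set intersection over record IDs. The sketch below assumes that a classification condition is a list of keywords and that a record matches if it contains any of them; that OR semantics, the substring matching, and the function names are assumptions, since the source does not spell out how multiple keywords combine.

```python
def matching_ids(records, keywords):
    """IDs of records containing at least one of the keywords (assumed OR match)."""
    return {rid for rid, text in records.items()
            if any(k in text for k in keywords)}

def category_overlap(records, existing_categories, new_keywords):
    """Map each existing category name to the record IDs it shares
    with the candidate category defined by new_keywords."""
    new_ids = matching_ids(records, new_keywords)
    overlap = {}
    for name, keywords in existing_categories.items():
        shared = new_ids & matching_ids(records, keywords)
        if shared:
            overlap[name] = sorted(shared)
    return overlap

# Hypothetical records and already-defined categories.
records = {
    1: "paper jam in printer",
    2: "driver install failed",
    3: "printer driver error",
}
existing = {"driver trouble": ["driver"], "paper trouble": ["paper"]}
overlap = category_overlap(records, existing, ["printer"])
```

In this example the candidate keyword "printer" overlaps both existing categories, which would suggest to the user that it is too broad to serve as a separate category.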

Because the user can learn which texts overlap with other categories before setting a classification condition, the user can change the classification condition settings as needed and set appropriate classification conditions with little overlap.

FIG. 12 shows an example of the classification keyword related information display screen. 1201 is the keyword related information section, which displays the keyword related information extracted in step ST1101 of FIG. 11 for the words selected by the user from among the keyword candidates structured by the classification keyword candidate structuring means. When more than one keyword candidate has been selected, the keyword candidate for which related information is to be extracted is selected and displayed as shown at 1204 in FIG. 12. 1202 is the document information display section, which displays the information on documents related to the words selected by the user, extracted in step ST1102 of FIG. 11. 1203 is the category overlap information display section, which displays the overlap information extracted in step ST1103 of FIG. 11 between the categories set by the classification condition setting described later and the category specified by the words that the user selected from among the keyword candidates structured by the classification keyword candidate structuring means 102.

As described above, extracting information on keywords related to the word selected by the user from among the keyword candidates structured by the classification keyword candidate structuring means 102 makes other keywords related to the structured keywords available for setting the classification conditions. Extracting related document information lets the user know in advance whether a keyword about to be designated as a classification condition is appropriate for obtaining the desired text classification result. Furthermore, extracting category overlap information lets the user know in advance whether the designated keyword can serve as an appropriate classification condition with little interference with other categories.
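The category overlap check described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a text matches a keyword when the keyword appears as a substring and that each existing category is held as a set of document IDs. The function names `docs_matching` and `category_overlap` and the sample data are hypothetical and do not appear in the specification.

```python
from typing import Dict, List, Set


def docs_matching(keyword: str, texts: Dict[str, str]) -> Set[str]:
    """Return the IDs of texts containing the keyword (substring match is an assumption)."""
    return {doc_id for doc_id, body in texts.items() if keyword in body}


def category_overlap(
    keyword: str,
    texts: Dict[str, str],
    categories: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """For each existing category, list the texts the candidate keyword would also capture."""
    hits = docs_matching(keyword, texts)
    overlap: Dict[str, List[str]] = {}
    for name, members in categories.items():
        shared = sorted(hits & members)
        if shared:
            overlap[name] = shared
    return overlap


# Hypothetical sample data: three inquiry texts and two categories already set.
texts = {
    "d1": "the screen does not turn on after charging",
    "d2": "battery drains quickly while charging",
    "d3": "how do I change the screen brightness",
}
categories = {"battery": {"d2"}, "display": {"d1", "d3"}}

print(category_overlap("charging", texts, categories))
```

In this example the candidate keyword "charging" would capture texts already held by both existing categories, which is the kind of interference the user might want to see before adopting the keyword.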

As described above, providing the classification keyword candidate structuring means 102 and the classification keyword related information extraction means 103 makes the classification conditions easy to set and thereby facilitates the text classification work desired by the user. This concludes the description of the classification keyword related information extraction process in step ST202 of FIG. 2.

Next, the classification condition setting process in step ST204 of FIG. 2 and the classification execution process in step ST205 will be described. When the user gives an instruction, for example by pressing the "Add category" button in FIG. 3, the classification condition setting means 104 of FIG. 1 displays the classification condition setting screen and prompts the user for input.

FIG. 13 shows an example of the classification condition setting screen. Reference numeral 1301 denotes a category name, which the user sets as appropriate. Reference numeral 1302 denotes a classification keyword setting section. Words whose check boxes were checked on the classification keyword candidate structured display screens shown in FIGS. 7 and 9 are entered here in advance, and further keywords can be added as additional conditions based on the various related information shown in FIG. 12. Important-word designations and keyword weights can also be specified at any time. Reference numeral 1303 denotes a document designation section, in which document IDs are designated, as a supplement to the keyword designation, with reference to the related document information shown in FIGS. 7 and 9. When the user sets the classification conditions and instructs execution of the classification, the classification execution means 105 of FIG. 1 executes the classification process.
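One way to picture the information collected on the classification condition setting screen is as a small record per category. The following Python sketch is only an illustration: the class name `ClassificationCondition`, its field names, the helper `condition_from_checked`, and the default weight of 1.0 are all assumptions and are not taken from the specification.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class ClassificationCondition:
    """One category's classification condition (field names are illustrative)."""
    category_name: str                                               # item 1301
    keyword_weights: Dict[str, float] = field(default_factory=dict)  # item 1302
    important_words: Set[str] = field(default_factory=set)           # important-word designations
    document_ids: List[str] = field(default_factory=list)            # item 1303

    def add_keyword(self, word: str, weight: float = 1.0, important: bool = False) -> None:
        """Add or update a keyword, optionally marking it as an important word."""
        self.keyword_weights[word] = weight
        if important:
            self.important_words.add(word)
        else:
            self.important_words.discard(word)


def condition_from_checked(category_name: str, checked_words: Iterable[str]) -> ClassificationCondition:
    """Pre-populate a condition with the words checked on the structured candidate screen."""
    condition = ClassificationCondition(category_name)
    for word in checked_words:
        condition.add_keyword(word)  # default weight 1.0 is an assumption
    return condition


# Hypothetical usage
condition = condition_from_checked("display", ["screen", "brightness"])
condition.add_keyword("flicker", weight=2.0, important=True)  # added from related information
condition.document_ids.append("d3")                           # supplementary document designation
print(condition)
```

Keeping keywords, weights, important-word flags, and designated document IDs in one record mirrors the three parts of the screen (1301 to 1303), so a classification step could consume the record directly.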

The text classification process is not fully automatic; it is performed according to the classification conditions set in step ST204. Specifically, texts containing the keywords set in step ST204 are extracted as belonging to the corresponding category. For a text that belongs to a plurality of categories, a score is calculated from the number of matching keywords, their weights, and the like, and the text is classified into the category with the better score. When documents have been designated, the classification process is executed as though the words contained in the designated documents had been designated as keywords.
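The classification step just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: texts are split on whitespace (Japanese text would need morphological analysis instead), the score is taken to be the sum of the weights of the matching keywords, ties go to the category listed first, and the names `expand_with_documents`, `score`, and `classify` are hypothetical.

```python
from typing import Dict, List, Optional


def expand_with_documents(
    keywords: Dict[str, float],
    document_ids: List[str],
    texts: Dict[str, str],
    default_weight: float = 1.0,
) -> Dict[str, float]:
    """Treat every word of each designated document as if it were a designated keyword."""
    expanded = dict(keywords)
    for doc_id in document_ids:
        for word in texts[doc_id].split():
            expanded.setdefault(word, default_weight)  # keep explicit weights unchanged
    return expanded


def score(text: str, keywords: Dict[str, float]) -> float:
    """Sum the weights of the keywords that occur in the text (assumed scoring rule)."""
    words = set(text.split())
    return sum(weight for keyword, weight in keywords.items() if keyword in words)


def classify(text: str, conditions: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Return the category with the highest positive score, or None if no keyword matches."""
    best_name: Optional[str] = None
    best_score = 0.0
    for name, keywords in conditions.items():
        current = score(text, keywords)
        if current > best_score:  # strict comparison: ties keep the earlier category
            best_name, best_score = name, current
    return best_name


# Hypothetical conditions: category name -> {keyword: weight}
conditions = {
    "battery": {"battery": 2.0, "charging": 1.0},
    "display": {"screen": 2.0, "charging": 0.5},
}

print(classify("screen stays dark after charging", conditions))
print(classify("battery drains while charging", conditions))
```

The first sample text matches keywords of both categories, so the weighted score decides the category; a text that matches no keyword is left unclassified in this sketch.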

FIG. 14 shows an example of a classification result. After the classification is executed, the corresponding category is added to the classification list display 303 in FIG. 3. When a classified category is selected and a list display is instructed, the classification result screen shown in FIG. 14 is displayed.

As described above, the text classification device of Embodiment 1 comprises: classification keyword candidate extraction means for extracting, from the classification target texts, classification keyword candidates to be used in setting classification conditions; classification keyword candidate structuring means for structuring the plurality of classification keyword candidates extracted by the classification keyword candidate extraction means based on a predetermined relationship between the classification keyword candidates; classification keyword candidate related information extraction means for extracting information related to the classification keyword candidates structured by the classification keyword candidate structuring means; and classification condition setting means for setting, from information presented by at least one of the classification keyword candidate structuring means and the classification keyword candidate related information extraction means, classification conditions for classifying the classification target texts. The classification condition setting work of the text classification device can therefore be performed easily.

Further, according to the text classification device of Embodiment 1, the classification keyword candidate related information extraction means includes keyword related information extraction means for extracting a group of keywords related to a classification keyword candidate. Other keywords related to the structured keywords can therefore be used in setting the classification conditions, so that appropriate classification conditions can be set without omission.

Further, according to the text classification device of Embodiment 1, the classification keyword candidate related information extraction means includes related document information extraction means for extracting documents related to a classification keyword candidate. The user can therefore know in advance whether a keyword about to be designated as a classification condition is appropriate for obtaining the desired text classification result.

Further, according to the text classification device of Embodiment 1, the classification keyword candidate related information extraction means includes category overlap information extraction means for extracting information on the overlap between the documents classified by a classification keyword candidate and the documents included in categories that have already been set. The user can therefore know in advance whether the designated keyword can serve as an appropriate classification condition with little interference with other categories.

Further, the text classification program of Embodiment 1 causes a computer that classifies classification target texts based on predetermined classification conditions to function as: classification keyword candidate extraction means for extracting, from the classification target texts, classification keyword candidates to be used in setting classification conditions; classification keyword candidate structuring means for structuring the plurality of classification keyword candidates extracted by the classification keyword candidate extraction means based on a predetermined relationship between the classification keyword candidates; classification keyword candidate related information extraction means for extracting information related to the classification keyword candidates structured by the classification keyword candidate structuring means; and classification condition setting means for setting, based on information presented by at least one of the classification keyword candidate structuring means and the classification keyword candidate related information extraction means, classification conditions for classifying the classification target texts. A text classification device in which the classification condition setting work can be performed easily can therefore be realized on a computer.

FIG. 1 is a block diagram showing the text classification device according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing an outline of the classification process of the text classification device according to Embodiment 1 of the present invention.
FIG. 3 is an explanatory diagram showing an example of an operation screen of the text classification device according to Embodiment 1 of the present invention.
FIG. 4 is an explanatory diagram showing an example of texts to be classified by the text classification device according to Embodiment 1 of the present invention.
FIG. 5 is an explanatory diagram showing an example in which the texts are classified by inquiry content in the text classification device according to Embodiment 1 of the present invention.
FIG. 6 is an explanatory diagram showing an example of classification keyword candidates extracted from the classification targets by the text classification device according to Embodiment 1 of the present invention.
FIG. 7 is an explanatory diagram showing an example in which the classification keyword candidates of the text classification device according to Embodiment 1 of the present invention are structured by semantic classification.
FIG. 8 is an explanatory diagram showing the semantic classification dictionary of the text classification device according to Embodiment 1 of the present invention.
FIG. 9 is an explanatory diagram showing an example in which the classification keyword candidates of the text classification device according to Embodiment 1 of the present invention are structured by part of speech and dependency information.
FIG. 10 is a detailed block diagram of the classification keyword related information extraction means of the text classification device according to Embodiment 1 of the present invention.
FIG. 11 is a flowchart showing details of the classification keyword related information extraction process of the text classification device according to Embodiment 1 of the present invention.
FIG. 12 is an explanatory diagram showing an example of a display screen for the classification keyword related information of the text classification device according to Embodiment 1 of the present invention.
FIG. 13 is an explanatory diagram showing the classification condition setting screen of the text classification device according to Embodiment 1 of the present invention.
FIG. 14 is an explanatory diagram showing an example of a classification result of the text classification device according to Embodiment 1 of the present invention.

Explanation of Symbols

101: classification keyword candidate extraction means; 102: classification keyword candidate structuring means; 103: classification keyword related information extraction means; 104: classification condition setting means; 105: classification execution means; 110: classification target text; 111: classification result.

Claims

1. A text classification device comprising:
classification keyword candidate extraction means for extracting, from classification target texts, classification keyword candidates to be used in setting classification conditions;
classification keyword candidate structuring means for structuring the plurality of classification keyword candidates extracted by the classification keyword candidate extraction means based on a predetermined relationship between the classification keyword candidates;
classification keyword candidate related information extraction means for extracting information related to the classification keyword candidates structured by the classification keyword candidate structuring means; and
classification condition setting means for setting, from information presented by at least one of the classification keyword candidate structuring means and the classification keyword candidate related information extraction means, classification conditions for classifying the classification target texts.

2. The text classification device according to claim 1, wherein the classification keyword candidate related information extraction means includes keyword related information extraction means for extracting a group of keywords related to a classification keyword candidate.

3. The text classification device according to claim 1, wherein the classification keyword candidate related information extraction means includes related document information extraction means for extracting documents related to a classification keyword candidate.

4. The text classification device according to claim 1, wherein the classification keyword candidate related information extraction means includes category overlap information extraction means for extracting information on the overlap between the documents classified by a classification keyword candidate and the documents included in categories that have already been set.

5. A text classification program for causing a computer that classifies classification target texts based on predetermined classification conditions to function as:
classification keyword candidate extraction means for extracting, from the classification target texts, classification keyword candidates to be used in setting classification conditions;
classification keyword candidate structuring means for structuring the plurality of classification keyword candidates extracted by the classification keyword candidate extraction means based on a predetermined relationship between the classification keyword candidates;
classification keyword candidate related information extraction means for extracting information related to the classification keyword candidates structured by the classification keyword candidate structuring means; and
classification condition setting means for setting, based on information presented by at least one of the classification keyword candidate structuring means and the classification keyword candidate related information extraction means, classification conditions for classifying the classification target texts.