JP2005181928A

JP2005181928A - System and method for machine learning, and computer program

Info

Publication number: JP2005181928A
Application number: JP2003426329A
Authority: JP
Inventors: Hiroki Yoshimura; 宏樹吉村; Hiroshi Masuichi; 博増市; Tomoko Okuma; 智子大熊; Daigo Sugihara; 大悟杉原
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2003-12-24
Filing date: 2003-12-24
Publication date: 2005-07-07

Abstract

PROBLEM TO BE SOLVED: To improve precision by removing unnecessary data from learning data.

SOLUTION: A feature acquisition unit 2 extracts feature information used for machine learning from data held in a learning data holding unit 1 and from evaluation data input to the system. A machine learning unit 3 learns the correspondence between features and evaluations, based on the evaluation of each piece of learning data held in the learning data holding unit 1 and the feature information of each piece of data obtained from the feature acquisition unit 2. A data selection unit 4 deletes learning data that is inappropriate for machine learning from the learning data candidates held in the learning data holding unit 1, based on the learning result obtained from the machine learning unit 3.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a machine learning system, a machine learning method, and a computer program that take learning data as input and, by applying statistical processing techniques, output rules that explain the characteristics of the data. In particular, it relates to a machine learning system, a machine learning method, and a computer program that perform supervised machine learning, in which an evaluation has been assigned in advance (by hand) to each piece of data in the learning data.

More specifically, the present invention relates to a machine learning system, a machine learning method, and a computer program that improve accuracy in supervised machine learning by removing unnecessary data from the learning data, and in particular to ones that delete data inappropriate for learning from evaluated learning data without human intervention.

Along with the recent development and spread of information processing technology, the automation of various processes and tasks in industrial activity and daily life is advancing. To automate a machine, various parameters must be determined. So-called “machine learning” has been introduced so that the machine can determine such parameters itself.

In machine learning, learning data is taken as input and, by applying statistical processing techniques, rules that explain the characteristics of the data are output. For example, the result obtained when the machine itself performs some action is input as learning data and evaluated statistically, and that evaluation is reflected in the machine's own action-determination parameters.

When the machine cannot evaluate results by itself, a method called “supervised learning” is used: the machine is given the solution that a human expects, and the learning parameters are adjusted so as to arrive at that solution. In contrast, learning in which the machine evaluates by itself is called “unsupervised learning”. Examples of supervised learning include processing methods that use neural networks. Examples of unsupervised learning include processing methods that use the EM (expectation maximization) algorithm.

In the former, supervised machine learning, an evaluation is assigned in advance (by hand) to each piece of data in the learning data. By learning the correspondence (correspondence rules) between the characteristics of each piece of data in the learning data (hereinafter also called “features”) and its evaluation, it becomes possible to predict the evaluation of non-learning data (test data) when it is given.

At present, various supervised machine learning methods such as Support Vector Machine (SVM) and Maximum Entropy (ME) have been proposed and are in practical use in various fields such as natural language processing and bioinformatics (see, for example, Non-Patent Document 1).

It is known that, in supervised machine learning, accuracy can be improved by removing unnecessary data from the learning data. For example, rules characterizing abnormal data can be generated from data for which it is not known whether each item is abnormal (fraudulent) or not (unsupervised data), and abnormal data can then be detected efficiently using the resulting rules (see, for example, Patent Document 1). Specifically, for outlier detection that generates rules characterizing the abnormal data in a data set, an outlier degree indicating the degree of abnormality is calculated, the data is sampled based on that outlier degree, and a supervised learning unit generates rules characterizing abnormal data by supervised learning on the resulting set of data, each item of which is labeled as abnormal or not. This makes it possible to detect abnormal data efficiently.

Patent Document 1: JP 2003-5970 A
Non-Patent Document 1: Fabrizio Sebastiani, “Machine Learning in Automated Text Categorization” (ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47, 2002)

Conventional supervised machine learning has treated the learning data as absolute. That is, the evaluation attached to each piece of data in the learning data is assumed to be free of errors and completely reliable because it was assigned by hand.

In practice, however, (1) human error can easily occur when evaluations are assigned by hand. In addition, (2) the evaluation is often hard to judge, so different people (evaluators) often assign different evaluations to the learning data. Furthermore, even when the evaluation is easy to judge, (3) if the characteristics (features) of a piece of learning data happen to be at odds with its evaluation, data that is inappropriate for learning the correspondence between data characteristics and evaluations ends up mixed into the learning data.

Problems (1) and (2) above are usually addressed by having multiple evaluators each assign evaluations to the same learning data and comparing them, thereby checking for mistakes and minimizing fluctuation in judgment. However, evaluation by multiple evaluators requires more and more man-hours as the learning data grows, and it does not solve problem (3) above.

Patent Document 1 is a technique for removing outlier data using supervised machine learning, but it does not solve problems (2) and (3) above, which concern discrepancies between features and evaluations.

The present invention has been made in view of the technical problems described above. Its main object is to provide an excellent machine learning system, machine learning method, and computer program that can suitably perform supervised machine learning in which an evaluation is assigned in advance (by hand) to each piece of data in the learning data.

A further object of the present invention is to provide an excellent machine learning system, machine learning method, and computer program that can improve accuracy in supervised machine learning by removing unnecessary data from the learning data.

A further object of the present invention is to provide an excellent machine learning system, machine learning method, and computer program that can delete data inappropriate for learning from evaluated learning data without human intervention.

The present invention has been made in consideration of the above problems. A first aspect thereof is a machine learning system that performs supervised machine learning on data to which evaluations have been assigned in advance, the system comprising: a learning data holding unit that holds candidates for learning data for machine learning together with their evaluations; a feature acquisition unit that extracts, from the data held in the learning data holding unit, feature information used in machine learning; a machine learning unit that learns the correspondence between features and evaluations based on the evaluation of each piece of learning data held in the learning data holding unit and the feature information of each piece of data obtained from the feature acquisition unit; and a learning data selection unit that deletes learning data inappropriate for machine learning from the learning data candidates held in the learning data holding unit, based on the learning result obtained from the machine learning unit.

In the machine learning system according to the present invention, machine learning is first performed using all of the learning data, and data that contradicts the resulting learning result is then deleted from the learning data. This makes it possible to scrutinize the learning data without human intervention and to solve each of the technical problems (1), (2), and (3) above.

FIG. 1 schematically shows the functional configuration of a machine learning system according to the present invention. The illustrated machine learning system includes a learning data holding unit 1, a feature acquisition unit 2, a machine learning unit 3, and a data selection unit 4.

The learning data holding unit 1 holds candidates for learning data for machine learning together with their evaluations.

The feature acquisition unit 2 extracts the feature information used in machine learning from the data held in the learning data holding unit 1 and from the evaluation data input to the system.

Here, the learning data and the evaluation data consist of, for example, text data made up of natural language sentences. In this case, the feature acquisition unit 2 obtains feature information from the learning data by morphological analysis or syntactic analysis.

The machine learning unit 3 learns the correspondence between features and evaluations based on the evaluation of each piece of learning data held in the learning data holding unit 1 and the feature information of each piece of data obtained from the feature acquisition unit 2.

Based on the learning result obtained from the machine learning unit 3, the data selection unit 4 deletes learning data inappropriate for machine learning from the learning data candidates held in the learning data holding unit 1.

Here, the machine learning unit 3 may calculate the correspondence rules between the features of text data and their evaluations based on the vector space method, and the data selection unit 4 may delete inappropriate data based on the values of inner products between vectors.

In the vector space method referred to here, a predetermined number of the most frequent words among all the words contained in all the text data are extracted as “feature expression words”, and word vectors are generated from a co-occurrence matrix that records the number of times each word co-occurs (appears in the same text data) with each feature expression word. Next, a document vector is generated by normalizing the sum of the word vectors of all the words contained in the target text data. An evaluation document vector is generated in the same way for the text data to be evaluated, and that text data can then be classified using the inner product between the document vector of each class and the evaluation document vector.

Alternatively, the machine learning unit 3 may calculate the correspondence rules between the features of text data and their evaluations based on a Support Vector Machine, and the data selection unit 4 may delete inappropriate data based on the vector evaluation value output by the Support Vector Machine. A Support Vector Machine is a nonparametric pattern classifier that performs linear discrimination using a separating hyperplane obtained as the optimal solution of learning. When it is inappropriate to separate the learning material linearly, the learning material is mapped nonlinearly from the original pattern space to a higher-dimensional pattern space, a separating hyperplane is constructed in that high-dimensional space, and linear discrimination is performed there.

After machine learning has been performed using all of the learning data and data that contradicts the learning result has been deleted in this way, when evaluation data is input to the system, the feature acquisition unit 2 applies processing such as morphological analysis or syntactic analysis and extracts from the evaluation data the feature information used in machine learning. The machine learning unit 3 then evaluates the evaluation data based on the features extracted from it and the learned correspondence between features and evaluations.

A second aspect of the present invention is a computer program written in a computer-readable format so as to execute, on a computer system, processing for performing supervised machine learning using learning data candidates held in advance together with their evaluations, the computer program comprising:
a feature acquisition step of extracting, from the learning data, feature information used in machine learning;
a machine learning step of learning the correspondence between features and evaluations based on the evaluation of each piece of learning data and the feature information of each piece of data obtained in the feature acquisition step;
a learning data selection step of deleting learning data inappropriate for machine learning from the learning data candidates, based on the learning result obtained in the machine learning step; and
an evaluation step of evaluating evaluation data using the selected learning data.

The computer program according to the second aspect of the present invention defines a computer program written in a computer-readable format so as to realize predetermined processing on a computer system. In other words, by installing the computer program according to the second aspect of the present invention on a computer system, cooperative action is exerted on the computer system, and the same effects as those of the machine learning system according to the first aspect of the present invention can be obtained.

According to the present invention, it is possible to provide an excellent machine learning system, machine learning method, and computer program that can suitably perform supervised machine learning in which an evaluation is assigned in advance (by hand) to each piece of data in the learning data.

According to the present invention, it is also possible to provide an excellent machine learning system, machine learning method, and computer program that can improve accuracy in supervised machine learning by removing unnecessary data from the learning data.

According to the present invention, it is also possible to provide an excellent machine learning system, machine learning method, and computer program that can delete data inappropriate for learning from evaluated learning data without human intervention.

According to the present invention, bad data caused by human evaluation errors, as well as data that is inappropriate for obtaining correspondence rules between features and evaluations, can be deleted from the learning data. The quality of the learning data used in machine learning can be improved without any assistance such as manual work or dictionary knowledge, so more accurate machine learning can be realized.

Other objects, features, and advantages of the present invention will become apparent from the more detailed description based on the embodiments of the present invention described below and the accompanying drawings.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 2 schematically shows the functional configuration of a machine learning system according to an embodiment of the present invention. The illustrated machine learning system comprises a learning corpus holding unit 11, a morphological analysis unit 12, a word vector generation unit 13, a document vector generation unit 14, and a data selection unit 15, and adopts the vector space method as its machine learning means. In practice, this machine learning system is realized by running a predetermined machine learning application on a general computer system such as a personal computer.

Note that the embodiment described below takes as its example the application of a machine learning method to a document classification system, such as the classification of newspaper articles (whether an article belongs to the “political economy” field or the “sports” field). However, as long as a supervised machine learning method based on statistical processing is used, the same effect can be obtained in any field that requires classification, such as questionnaire classification and question answering. Moreover, the same effect can be obtained not only for text classification but also for classification involving numerical data, image classification, and so on, whatever machine learning method is used.

The learning corpus holding unit 11 holds, inside the computer, a plurality of newspaper articles serving as the learning corpus, together with the evaluation results obtained by manually judging, for each article, whether it belongs to the “political economy” field or the “sports” field.

The morphological analysis unit 12 applies morphological analysis to all of the newspaper article texts held in the learning corpus holding unit 11 and to the single newspaper article text input to the system as evaluation data, splits these texts into words, and obtains the morphological analysis results as feature information.

Here, a morpheme is, in linguistics, an element that forms the smallest grammatical unit, such as a word or an affix. In morphological analysis, text is therefore segmented into words and tagged with parts of speech in order to identify the grammatical attributes of each morpheme (part of speech, conjugation, and so on). In this embodiment, the part-of-speech information of the input text obtained by morphological analysis is used as feature information in machine learning.

The word vector generation unit 13 and the document vector generation unit 14 correspond to the machine learning means and calculate the correspondence rules between the features of text data and their evaluations based on the vector space method. The procedure by which the word vector generation unit 13 generates word vectors is described below.

Step 1:
Among all the words obtained by the morphological analysis unit 12, n words are selected in descending order of frequency of occurrence. The n words obtained here are hereinafter called “feature expression words”. The value of n is preferably on the order of several thousand.

However, particles such as “は” (wa) or “が” (ga), which rarely serve as keywords of a newspaper article yet occur many times in sentences, may be treated as stop words and not counted as feature expression words.
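Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_feature_words`, the toy articles, and the romanized tokens `"wa"` and `"ga"` standing in for the particles are all hypothetical, and the articles are assumed to be already split into words by the morphological analysis.

```python
from collections import Counter


def select_feature_words(tokenized_articles, n, stop_words=frozenset()):
    """Return the n most frequent words across all articles, skipping stop words.

    tokenized_articles: list of word lists (assumed output of morphological analysis).
    """
    counts = Counter(
        word
        for article in tokenized_articles
        for word in article
        if word not in stop_words
    )
    return [word for word, _ in counts.most_common(n)]


# Toy corpus (hypothetical); "wa" and "ga" stand in for the particles.
articles = [
    ["minister", "wa", "budget", "announce"],
    ["pitcher", "ga", "budget", "shutout"],
    ["minister", "ga", "budget", "explain"],
]
feature_words = select_feature_words(articles, 2, stop_words={"wa", "ga"})
print(feature_words)
```

In a real corpus n would be in the thousands, as the text recommends; the value 2 here only keeps the example readable.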

Step 2:
A matrix is created whose rows are all the words obtained from the morphological analysis unit 12 and whose columns are the feature expression words obtained in Step 1. For example, if the total number of distinct words obtained from the morphological analysis unit 12 is 100,000 and the value of n is 20,000, the result is a matrix of 100,000 rows by 20,000 columns.

Each element of this matrix records how many times the word corresponding to the element's row and the feature expression word corresponding to its column co-occur in the newspaper articles (that is, appear together in the same newspaper article). The matrix obtained in this way is called the “co-occurrence matrix”. In this way, a co-occurrence matrix can be created in which every word in the newspaper articles is represented by an n-dimensional (20,000-dimensional) vector. Each such vector can be said to indicate the kinds of context in which the word tends to appear in the newspaper articles selected by the data selection unit 15.
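A minimal sketch of the co-occurrence matrix from Step 2 follows. The text does not say exactly how co-occurrences are counted, so this sketch assumes one count per article in which both the row word and the column feature expression word appear, and it also counts a feature word as co-occurring with itself. The name `build_cooccurrence` and the toy data are hypothetical.

```python
def build_cooccurrence(tokenized_articles, feature_words):
    """Map each word to its row of co-occurrence counts with the feature words.

    Assumption: a (word, feature word) pair is counted once per article in
    which both appear, including the case where the two are the same word.
    """
    column = {word: j for j, word in enumerate(feature_words)}
    rows = {}
    for article in tokenized_articles:
        present = set(article)  # distinct words in this article
        present_features = [w for w in present if w in column]
        for word in present:
            row = rows.setdefault(word, [0] * len(feature_words))
            for feature in present_features:
                row[column[feature]] += 1
    return rows


toy_articles = [
    ["minister", "budget"],
    ["pitcher", "budget"],
    ["minister", "budget", "tax"],
]
cooc = build_cooccurrence(toy_articles, ["budget", "minister"])
print(cooc["pitcher"])
```

Each row of the result is the n-dimensional vector for one word; a dictionary of rows is used instead of a dense matrix only to keep the sketch short.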

Step 3:
Because the n-dimensional vectors obtained in Step 2 have a large number of dimensions, the computation time of the processing needed later would become enormous. To keep the computation within a practical time, the original n-dimensional vectors are therefore compressed into n'-dimensional vectors (several hundred dimensions, n' < n) by a matrix dimension-compression technique. Various dimension-compression techniques exist; a representative example is the technique using Singular Value Decomposition described in detail in Berry, M., Do, T., O'Brien, G., Krishna, V., and Varadhan, S., “SVDPACKC USER'S GUIDE” (Tech. Rep. CS-93-194, University of Tennessee, Knoxville, TN (1993)). The n'-dimensional vector obtained in this way for each word in the newspaper articles is called a “word vector”.

The document vector generation unit 14 uses the word vectors obtained by the word vector generation unit 13 to calculate a document vector for each newspaper article in the learning corpus holding unit 11. The document vector referred to here is the vector obtained by normalizing (scaling to length 1) the sum of the word vectors corresponding to all the words contained in the target newspaper article. The document vectors obtained in this way can be regarded as the machine learning result obtained when the set of newspaper articles in the learning corpus holding unit 11 is used as the learning data and the feature expression words contained in each newspaper article are used as that article's features.
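The document vector computation can be sketched as follows. The function name `document_vector` and the toy word vectors are hypothetical, and two details are assumptions: each occurrence of a word contributes its word vector once, and words without a word vector are skipped.

```python
import math


def document_vector(article_words, word_vectors):
    """Sum the word vectors of an article's words and scale the sum to length 1.

    Assumptions: repeated words are summed once per occurrence, and words
    with no entry in word_vectors are skipped.
    """
    dims = len(next(iter(word_vectors.values())))
    total = [0.0] * dims
    for word in article_words:
        vec = word_vectors.get(word)
        if vec is None:
            continue
        for i, x in enumerate(vec):
            total[i] += x
    norm = math.sqrt(sum(x * x for x in total))
    if norm == 0.0:
        return total  # no known words: return the zero vector unchanged
    return [x / norm for x in total]


toy_word_vectors = {"budget": [3.0, 0.0], "pitcher": [0.0, 4.0]}
doc_vec = document_vector(["budget", "pitcher", "unknown"], toy_word_vectors)
print(doc_vec)
```

Because every document vector has length 1, the inner product of two document vectors equals their cosine similarity, which is what the thresholds in the selection step are compared against.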

Next, for every document vector belonging to the “political economy” field, the similarity, that is, the value of the inner product, with the “political economy document vector” (Sp) and the similarity with the “sports document vector” (Ss) are calculated.

The data selection unit 15 then deletes from the learning corpus holding unit 11 the newspaper articles corresponding to document vectors whose Sp and Ss values satisfy the following expressions. T1 and T2 are preset thresholds (0 < T1 < 1, 0 < T2 < 1).

Sp < T1
Ss > T2

Similarly, for every document vector belonging to the “sports” field, the similarity, that is, the value of the inner product, with the “political economy document vector” (Sp) and the similarity with the “sports document vector” (Ss) are calculated. The data selection unit 15 then deletes from the learning corpus holding unit 11 the newspaper articles corresponding to document vectors whose Sp and Ss values satisfy the following expressions.

Sp > T1
Ss < T2

In this way, an article that was manually classified into the “sports” field but is in fact likely to fit the “political economy” field (its vector has high similarity to the “political economy document vector”), or the reverse, can be deleted from the learning corpus holding unit 11. Such an article may simply be a manual classification mistake. Even when the manual classification is correct, the article is, judged by its characteristics (features), not appropriate learning data for machine learning (it would cause inappropriate correspondence rules to be learned).
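The two deletion rules can be sketched together as a single filter. The names `select_training_data`, `"politics"`, and `"sports"` are hypothetical. The text does not spell out how the two class document vectors are built, so this sketch simply takes them as inputs.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_training_data(docs, politics_vec, sports_vec, t1, t2):
    """Keep only the articles that do not satisfy a deletion rule.

    docs: list of (label, document_vector) pairs with label "politics" or "sports".
    politics_vec, sports_vec: class document vectors (assumed to be given).
    """
    kept = []
    for label, vec in docs:
        sp = dot(vec, politics_vec)
        ss = dot(vec, sports_vec)
        if label == "politics" and sp < t1 and ss > t2:
            continue  # labeled political economy but looks like sports
        if label == "sports" and sp > t1 and ss < t2:
            continue  # labeled sports but looks like political economy
        kept.append((label, vec))
    return kept


docs = [
    ("politics", [1.0, 0.0]),
    ("politics", [0.0, 1.0]),  # contradicts its label
    ("sports", [0.0, 1.0]),
    ("sports", [1.0, 0.0]),    # contradicts its label
]
kept = select_training_data(docs, [1.0, 0.0], [0.0, 1.0], t1=0.5, t2=0.5)
print(len(kept))
```

Both conditions of a rule must hold before an article is removed, so an article that is far from both classes, or close to both, is kept.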

As described above, a new newspaper article is split into words by morphological analysis, and the sum of the word vectors corresponding to those words is calculated and normalized to obtain the document vector for the article. (Words for which no word vector exists are ignored.) Next, the similarity (inner product) between this document vector and the “political economy document vector” is calculated, and likewise the inner product with the “sports document vector”. The article can be judged to be closer in content to the field with the larger inner product, so a given newspaper article can be classified as either “political economy” or “sports”.
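Classification of a new article by the larger inner product can be sketched as below. The function name `classify` and the class labels are hypothetical, and the class document vectors are assumed to be available as unit vectors.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def classify(doc_vec, class_vectors):
    """Return the label whose class vector has the largest inner product, plus all scores."""
    scores = {label: dot(doc_vec, vec) for label, vec in class_vectors.items()}
    best = max(scores, key=scores.get)
    return best, scores


class_vectors = {"politics": [1.0, 0.0], "sports": [0.0, 1.0]}
label, scores = classify([0.6, 0.8], class_vectors)
print(label, scores)
```

Returning the scores alongside the label makes it easy to inspect how close the two similarities were.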

However, when the calculation results for the document vector, which are the output of the machine learning, are equal or approximately equal (a threshold a with 0 < a < 1 is set), the multiple machine learning results may also be judged manually. That is, when the fields cannot be distinguished because the inner products are equal or approximately equal, all of the results are presented and a person judges them. That judgment is taken as the machine learning result and fed back to the learning data selection means, and the document vectors are recalculated so that the article can be reclassified.
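A possible reading of this fallback is sketched below. The specification states only that a threshold a with 0 < a < 1 is set. The sketch assumes that "approximately equal" means the absolute difference of the two inner products is at most a, and the callback `ask_human` is a hypothetical stand-in for the manual judgment.

```python
def needs_manual_review(sp, ss, a=0.05):
    """Return True when the two similarities are equal or nearly equal.

    Assumption: "approximately equal" means |Sp - Ss| <= a, with 0 < a < 1.
    """
    if not 0 < a < 1:
        raise ValueError("threshold a must satisfy 0 < a < 1")
    return abs(sp - ss) <= a

def decide(sp, ss, ask_human, a=0.05):
    """Classify automatically, or defer to a human judgment on a near tie."""
    if needs_manual_review(sp, ss, a):
        # Present both candidate results and let a person choose.
        return ask_human({"political economy": sp, "sports": ss})
    return "political economy" if sp > ss else "sports"

# Stand-in for the manual judgment; a real system would prompt a reviewer.
def always_sports(candidates):
    return "sports"

auto_label = decide(0.90, 0.30, always_sports)
manual_label = decide(0.61, 0.60, always_sports)
```

The label returned by the reviewer would then be stored with the article so that the document vectors can be recalculated, as the paragraph above describes.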

In the embodiment described above, the vector space method is used with the occurrence frequencies of words obtained from morphological analysis as features, but the gist of the present invention is not limited to this. In machine learning, anything capable of expressing the characteristics of a text can serve as a feature.

For example, syntactic analysis may be performed instead of morphological analysis, and the occurrence frequency in a newspaper article of word pairs that stand in a dependency relation may be used as a feature. Syntactic analysis analyzes the structure of a sentence, such as its phrase structure, on the basis of grammar rules and the like. Because grammar rules are tree-structured, the result of syntactic analysis is generally a tree in which individual morphemes are joined according to dependency and similar relations. Syntactic analysis based on the Lexical Functional Grammar (LFG) theory can be used. LFG organizes the linguistic knowledge of a native speaker, that is, the grammar rules, as a component separate from computational processing and from other non-grammatical processing parameters that affect the processing behavior of a computer. It outputs a "c-structure (constituent structure)", which represents the phrase structure of a sentence made up of words, morphemes, and the like as a tree, and an "f-structure (functional structure)", which is the result of analyzing the input sentence semantically and functionally (for example as interrogative, past tense, or polite) on the basis of case structure such as subject and object. Details of LFG are described in, for example, R. M. Kaplan and J. Bresnan, "Lexical-Functional Grammar: A Formal System for Grammatical Representation" (The MIT Press, Cambridge (1982); reprinted in Formal Issues in Lexical-Functional Grammar, pp. 29-130, CSLI Publications, Stanford University (1995)); M. Dalrymple, "Syntax and Semantics - Lexical Functional Grammar" (Academic Press, 2001); and the references cited therein. Semantic analysis using a Japanese LFG is described in, for example, Hiroshi Masuichi and Tomoko Okuma, "Construction of a Practical Japanese Analysis System Based on Lexical Functional Grammar" (Natural Language Processing, Vol. 10, No. 2, pp. 79-109, The Association for Natural Language Processing, 2003).
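Using dependency pairs as features can be sketched as follows. No parser is invoked here: the sketch assumes that some external syntactic analyzer has already produced (head, dependent) word pairs for each sentence, and the pairs shown are hand-written placeholders.

```python
from collections import Counter

def dependency_pair_features(parsed_sentences):
    """Count (head, dependent) word pairs across the sentences of one article.

    `parsed_sentences` is assumed to be a list of sentences, each given as a
    list of (head, dependent) tuples produced by some external parser.
    """
    counts = Counter()
    for pairs in parsed_sentences:
        counts.update(pairs)
    return counts

# Hand-written dependency pairs standing in for real parser output.
article = [
    [("won", "team"), ("won", "match")],
    [("won", "team"), ("scored", "striker")],
]
features = dependency_pair_features(article)
```

Each distinct pair then plays the role that a single word plays in the word-frequency features of the embodiment.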

In the embodiment described above, a machine learning technique based on the vector space method is used, but it can be replaced with another technique such as the Support Vector Machine (SVM). The Support Vector Machine is a non-parametric pattern classifier. It performs linear discrimination with a separating hyperplane obtained as the optimal solution of learning. When linear separation of the learning data is inappropriate, it maps the learning data non-linearly from the original pattern space into a higher-dimensional pattern space, constructs the separating hyperplane in that space, and performs linear discrimination there. Because the SVM is regarded as a machine learning technique with high prediction accuracy for classification tasks such as text classification, it can be used as the machine learning means of this embodiment. Details of classification based on the learning results of a Support Vector Machine are described in, for example, Fabrizio Sebastiani, "Machine Learning in Automated Text Categorization" (ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47, 2002).
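The idea of mapping data into a higher-dimensional space before separating it linearly can be illustrated with a small example. This is not an SVM implementation: the feature map and the hyperplane below are hand-picked assumptions, whereas a real Support Vector Machine would learn the hyperplane and would typically apply the mapping implicitly through a kernel.

```python
def feature_map(x):
    """Map a 2-D point to 3-D by appending the product of its coordinates."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def linear_decision(w, b, z):
    """Signed value of the hyperplane w.z + b; its sign gives the class."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b

# XOR-like data: no straight line in the original 2-D space separates it.
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [1, 1, -1, -1]

# Hand-picked hyperplane in the mapped space: it looks only at x1 * x2.
w, b = (0.0, 0.0, 1.0), 0.0

predictions = [
    1 if linear_decision(w, b, feature_map(p)) > 0 else -1 for p in points
]
```

After the mapping, a single hyperplane classifies all four points correctly, which is the effect the non-linear mapping is intended to achieve.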

In machine learning using the Support Vector Machine, the feature information is a data set such as that shown in FIG. 3. The example in the figure shows that the feature "word W1" has been entered and that the number of occurrences of word W1 in sentence S1 (one) has been counted.

FIG. 4 schematically shows the functional configuration of the machine learning system when the Support Vector Machine is applied to machine learning. The illustrated system comprises a learning corpus holding unit 11, a morphological analysis unit 12, a word feature generation unit 16, a document feature generation unit 17, and a data selection unit 18. In practice, the system is realized by running a predetermined machine learning application on a general-purpose computer system such as a personal computer. As before, the following description takes as an example the application of the machine learning technique to a document classification system, such as one that classifies newspaper articles as belonging to the "political economy" field or the "sports" field.

The learning corpus holding unit 11 holds, inside the computer, a plurality of newspaper articles serving as the learning corpus, together with the evaluation obtained by manually judging the field of each article.

The word feature generation unit 16 generates the corresponding feature information (a set) for all words obtained from the morphological analysis unit 12. The algorithm for generating the feature information is described below.

Step 1:
A table is created for all words obtained from the morphological analysis unit 12. However, particles such as "は" (wa) and "が" (ga), which rarely serve as keywords of a newspaper article yet occur very frequently in sentences, are treated as stop words and are not entered in the table.

Step 2:
The words obtained from the morphological analysis unit 12 are counted, and the number of occurrences of each word is entered in the table obtained in Step 1.
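Steps 1 and 2 can be sketched as follows. The sketch assumes that morphological analysis has already produced a list of words. The stop-word list contains only the two particles named in the text, written in romanized form, and the example sentence is hypothetical.

```python
from collections import Counter

# Assumed stop-word list: the two particles named in the text, romanized.
STOP_WORDS = {"wa", "ga"}

def build_word_table(words, stop_words=STOP_WORDS):
    """Step 1: list the distinct words, skipping stop words."""
    table = {}
    for w in words:
        if w not in stop_words:
            table.setdefault(w, 0)
    return table

def fill_counts(table, words):
    """Step 2: enter the number of occurrences of each tabled word."""
    counts = Counter(w for w in words if w in table)
    return {w: counts[w] for w in table}

# Hypothetical, romanized morphological-analysis output for one sentence S1.
s1_words = ["shushou", "wa", "yosan", "ga", "yosan"]
word_table = fill_counts(build_word_table(s1_words), s1_words)
```

The resulting table corresponds to the kind of data set described for FIG. 3, where each entry pairs a word with its count in the sentence.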

The document feature generation unit 17 uses the feature information obtained by the word feature generation unit 16 to generate feature information corresponding to all newspaper articles held in the learning corpus holding unit 11. The learning corpus holding unit 11 holds a plurality of newspaper articles together with the manually judged evaluation of whether each article belongs to the "political economy" field or the "sports" field, and its data format is produced by a method equivalent to the feature-generation algorithm described above. The document feature information is generated from the table obtained in Step 2 by comparing it with the evaluation results held in the learning corpus holding unit 11.

For example, all words are extracted from the "political economy" field and the "sports" field held in the learning corpus holding unit 11, and a table of the words obtained is created for each of the two fields. For only those words entered in the table from Step 2 that match a word extracted from the "political economy" or "sports" field, the corresponding count is entered in the table for that field. This generates the document feature information for the "political economy" and "sports" fields. Words that were not counted are not deleted; their count is set to 0.
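The construction of the per-field tables can be sketched as below. The field vocabularies and the Step 2 counts are hypothetical values, and the function name is an assumption introduced for illustration.

```python
def field_feature(field_vocabulary, word_table):
    """Build the feature table of one field from the Step 2 word table.

    Every word of the field keeps an entry; words absent from the
    word table get a count of 0 instead of being removed.
    """
    return {w: word_table.get(w, 0) for w in field_vocabulary}

# Hypothetical field vocabularies and Step 2 counts, for illustration only.
vocabularies = {
    "political economy": ["budget", "election", "tax"],
    "sports": ["goal", "match"],
}
word_table = {"budget": 2, "match": 1, "weather": 4}

document_features = {
    field: field_feature(vocab, word_table)
    for field, vocab in vocabularies.items()
}
```

Keeping zero-count entries gives every article a feature table of the same shape for a given field, which is convenient when the tables are later passed to a classifier.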

The document features obtained in this way can be regarded as the feature information obtained when the set of newspaper articles in the learning corpus holding unit 11 is used as learning data and the characteristic expressions contained in the articles (here, the words of the "political economy" and "sports" fields) are used as the features of each article.

Next, the document feature information for the "political economy" and "sports" fields is processed by a machine learning method that uses the Support Vector Machine, so that articles can be classified into either the "political economy" field or the "sports" field. The learning data selection unit 15 deletes inappropriate data from the learning corpus holding unit 11 on the basis of the vector evaluation values output by the Support Vector Machine.
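One way such a deletion could work is sketched below. The specification does not define the "vector evaluation value" further. The sketch assumes it is the signed decision score of an already trained linear classifier, with positive scores meaning "political economy" and negative scores meaning "sports". The weights, the samples, and the margin of 1.0 are hypothetical.

```python
def decision_value(w, b, x):
    """Signed score of a linear classifier: w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def select_training_data(samples, w, b, margin=1.0):
    """Keep samples unless the model confidently contradicts their label.

    Each sample is (features, label) with label +1 or -1. A sample is
    dropped when label * score < -margin, i.e. it lies well inside the
    region of the opposite class.
    """
    kept = []
    for x, y in samples:
        if y * decision_value(w, b, x) < -margin:
            continue  # treated as inappropriate learning data
        kept.append((x, y))
    return kept

# Hypothetical trained weights and samples (+1: political economy, -1: sports).
w, b = (1.0, -1.0), 0.0
samples = [
    ((3.0, 0.0), +1),  # score  3.0, agrees with label
    ((0.0, 2.0), -1),  # score -2.0, agrees with label
    ((4.0, 1.0), -1),  # score  3.0, contradicts label strongly
]
kept_samples = select_training_data(samples, w, b)
```

The margin plays a role comparable to the thresholds T1 and T2 of the vector space embodiment: a larger margin deletes fewer articles.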

Thus, a machine learning technique using the Support Vector Machine can also achieve an effect equivalent to that of the vector space method described above.

The present invention has been described in detail above with reference to specific embodiments. It is self-evident, however, that those skilled in the art can modify or substitute the embodiments without departing from the gist of the present invention.

This specification describes the present invention by taking as an example the application of the machine learning technique according to the invention to a document classification system, such as one that classifies newspaper articles as belonging to the "political economy" field or the "sports" field, but the gist of the present invention is not limited to this. That is, the present invention can be applied in the same way to any field that requires classification, such as questionnaire classification and question answering, provided that a supervised machine learning technique based on statistical processing is used. In addition, the effect of the present invention can be obtained in the same way beyond text classification, for example in the classification of data that includes numerical values or in image classification, whatever machine learning technique is used.

In short, the present invention has been disclosed by way of example, and the content of this specification should not be interpreted restrictively. To determine the gist of the present invention, the claims set forth at the beginning should be taken into consideration.

FIG. 1 is a diagram schematically showing the functional configuration of a machine learning system according to the present invention.
FIG. 2 is a diagram schematically showing the functional configuration of a machine learning system according to an embodiment of the present invention.
FIG. 3 is a diagram showing feature information in machine learning using the Support Vector Machine.
FIG. 4 is a diagram schematically showing the functional configuration of the machine learning system when the Support Vector Machine is applied to machine learning.

Explanation of symbols

1 ... Learning data holding unit
2 ... Feature acquisition unit
3 ... Machine learning unit
4 ... Data selection unit
11 ... Learning corpus holding unit
12 ... Morphological analysis unit
13 ... Word vector generation unit
14 ... Document vector generation unit
15 ... Data selection unit
16 ... Word feature generation unit
17 ... Document feature generation unit

Claims

A machine learning system that performs supervised machine learning in which an evaluation is assigned to data in advance, comprising:
a learning data holding unit that holds candidates for learning data used for machine learning together with their evaluations;
a feature acquisition unit that extracts, from the data held in the learning data holding unit, feature information used in performing machine learning;
a machine learning unit that learns the correspondence between features and their evaluations on the basis of the evaluation of each piece of learning data held in the learning data holding unit and the feature information of each piece of data obtained from the feature acquisition unit; and
a learning data selection unit that deletes, on the basis of the learning result obtained from the machine learning unit, learning data inappropriate for machine learning from the candidates for learning data held in the learning data holding unit.

The machine learning system according to claim 1, wherein
the learning data holding unit holds text data composed of natural language sentences, and
the feature acquisition unit acquires feature information from the learning data by morphological analysis processing or syntactic analysis processing.

The machine learning system according to claim 2, wherein
the machine learning unit calculates a correspondence rule between the features of the text data and the evaluations on the basis of the vector space method, and
the learning data selection unit deletes inappropriate data on the basis of the values of inner products between vectors.

The machine learning system according to claim 2, wherein
the machine learning unit calculates a correspondence rule between the features of the text data and the evaluations on the basis of the Support Vector Machine, and
the learning data selection unit deletes inappropriate data on the basis of the vector evaluation values output by the Support Vector Machine.

The machine learning system according to claim 1, wherein,
when the machine learning unit outputs a plurality of results, the plurality of results are presented and selected from, and the selection is used as the machine learning result.

A machine learning method for performing supervised machine learning using candidates for learning data that are held in advance together with their evaluations, comprising:
a feature acquisition step of extracting, from the learning data, feature information used in performing machine learning;
a machine learning step of learning the correspondence between features and their evaluations on the basis of the evaluation of each piece of learning data and the feature information of each piece of data obtained in the feature acquisition step;
a learning data selection step of deleting, on the basis of the learning result obtained in the machine learning step, learning data inappropriate for machine learning from the candidates for learning data; and
an evaluation step of evaluating evaluation data using the selected learning data.

The machine learning method according to claim 6, wherein
the learning data is text data composed of natural language sentences, and
in the feature acquisition step, feature information is acquired from the learning data by morphological analysis processing or syntactic analysis processing.

The machine learning method according to claim 7, wherein
in the machine learning step, a correspondence rule between the features of the text data and the evaluations is calculated on the basis of the vector space method, and
in the learning data selection step, inappropriate data is deleted on the basis of the values of inner products between vectors.

The machine learning method according to claim 7, wherein
in the machine learning step, a correspondence rule between the features of the text data and the evaluations is calculated on the basis of the Support Vector Machine, and
in the learning data selection step, inappropriate data is deleted on the basis of the vector evaluation values output by the Support Vector Machine.

The machine learning method according to claim 6, wherein,
when a plurality of results are obtained in the machine learning step, the plurality of results are presented and selected from, and the selection is used as the machine learning result.

A computer program written in a computer-readable format so as to execute, on a computer system, processing for performing supervised machine learning using candidates for learning data that are held in advance together with their evaluations, the program comprising:
a feature acquisition step of extracting, from the learning data, feature information used in performing machine learning;
a machine learning step of learning the correspondence between features and their evaluations on the basis of the evaluation of each piece of learning data and the feature information of each piece of data obtained in the feature acquisition step;
a learning data selection step of deleting, on the basis of the learning result obtained in the machine learning step, learning data inappropriate for machine learning from the candidates for learning data; and
an evaluation step of evaluating evaluation data using the selected learning data.