JP2003058860A

JP2003058860A - Method and device for detecting error of text corpus

Info

Publication number: JP2003058860A
Application number: JP2001246643A
Authority: JP
Inventors: Sei Ba; 青馬
Original assignee: Communications Research Laboratory
Current assignee: Communications Research Laboratory
Priority date: 2001-08-15
Filing date: 2001-08-15
Publication date: 2003-02-28
Anticipated expiration: 2021-08-15
Also published as: JP3726125B2

Abstract

PROBLEM TO BE SOLVED: To provide a method and a device for embedding/extracting information, with which the creator or distributor of a language column can easily embed desired information, simultaneously, a reader or user hardly notices the existence of that information and further, the information can be surely extracted on the basis of a prescribed system, and to provide a recording medium. SOLUTION: Significant information is embedded by changing the position relation of a line end character group and a line shift position comprising a character unit close to the terminal of each of rows in the language column. A language column capable of substantial print/display or electromagnetically recorded language column can be defined as an object for the language column as well. The character unit can be a division by a morpheme as well. A method is provided for extracting information from the language column prepared/ outputted by such an information embedding method. Then, a device for embedding/extracting information and a recording medium for recording the language column are provided.

Description

Detailed Description of the Invention

[0001]

Technical Field of the Invention: The present invention relates to a method for detecting errors in a text corpus used in language processing, and more particularly to a technique for making such error detection faster and more efficient.

[0002]

Related Art: In recent years, various text corpora have been created, and research on language processing technology, notably supervised machine learning, has been active. However, because the text corpora used for learning are created by hand, they contain many errors, and these errors often hinder research progress or reduce the accuracy of language processing. Detecting and correcting errors in text corpora has therefore become a very important problem.

[0003] Known attempts to detect errors in a text corpus include a method for detecting over-segmentation errors in a morphological corpus (Masao Uchiyama, "A statistical measure for detecting over-segmentation in morphological analysis results," Journal of the Association for Natural Language Processing, Vol. 6, No. 7 (1999)). However, most conventional methods are specialized to a particular type of error, and it is hard to see how they generalize.

[0004] The present applicants therefore devised a method that uses example-based and decision-list techniques, which are considered applicable to problems in general, to compute from the target corpus alone the probability that an item is wrong, and thereby to detect errors (Masaki Murata, Masao Uchiyama, Kiyotaka Uchimoto, Qing Ma, Hitoshi Isahara, "Corpus error detection and correction using decision lists and example-based methods," IPSJ SIG Natural Language Processing, 2000-NL-136, pp. 49-56 (2000)). Even with these conventional methods, however, error detection must be carried out before learning, so they are in effect offline methods. Moreover, because they examine the entire corpus word by word without first narrowing the search to the parts likely to contain errors, good detection efficiency is hard to achieve. For a large-scale text corpus, detection is therefore difficult and costly.

[0005]

Problem to Be Solved by the Invention: The present invention was made in view of the above problems of the prior art. Its object is to provide a method for detecting errors in a text corpus at high speed and with high efficiency, and thereby to contribute to the advancement of language processing technology.

[0006]

Means for Solving the Problem: To solve the above problem, the present invention provides the following method for detecting errors in a text corpus. It is a method for detecting errors in the word information of a previously created text corpus that contains word information, in which the categories of the word information are treated as classes for a neural network, and the classification problem is divided into small two-class problems to form a plurality of modules.

[0007] For each module, a computation is performed to determine whether the module converges during neural network learning. If it does not converge, the module is judged to contain an error in the word information and is extracted. In this way, the present invention turns non-convergence, which is generally regarded as a problem in neural networks, to its advantage: failure to converge is taken as the sign of an error.

[0008] The method is also applicable when the word information is part-of-speech information that is embedded in the text as tags to form the text corpus, and errors in those tags are to be detected.

[0009] The present invention can also be provided as a device that detects errors in a text corpus by performing the above processing. The device may be provided as a dedicated error detection device or configured as part of another device. For example, it can be incorporated into a device that creates text corpora manually, semi-automatically, or automatically so that errors in a newly created corpus are detected continuously, or it can be added to any other language processing device.

[0010]

Embodiments of the Invention: Embodiments of the present invention are described below with reference to the examples shown in the drawings. The embodiments are not limited to what follows and may be modified as appropriate. A Japanese corpus is used below as an example of a text corpus, but the method can be applied to any language, such as English, Chinese, or Korean, except where this is inherently impossible. The text corpus targeted by the invention may contain any kind of word information, such as parts of speech or morpheme boundaries, and the invention can effectively detect errors in that word information.

[0011] Learning systems acquire the knowledge they need for their respective applications by learning from large-scale databases. Because large-scale databases are usually created by hand, they contain errors even when great care is taken. Automatic error detection is needed to improve the quality of a database and the performance of the learning systems that use it. However, all automatic error detection techniques developed so far run in advance, before the data is used (that is, offline), and must always check every data item one by one. For a large-scale database such as a text corpus, the computational cost is therefore very high.

[0012] The present invention therefore provides an error detection method and device that have the following features when detecting errors in a text corpus.
(1) Errors are detected while the data is being learned.
(2) Detection does not scan all the data; it jumps directly to the very small data areas that contain errors and is confined to those areas.
(3) Relearning after the data has been corrected need not cover all the data; it is performed only on the data in the corrected areas.
(4) The learning machine is a modular neural network that can learn a large, complex learning problem (data) by dividing it into many small, simple problems (data sets).
(5) Processing is based on the premise that each module of the neural network always converges if its learning data contains no errors.
(6) Features (1) to (3) are made possible by using property (5) in reverse: a module that does not converge is assumed to contain erroneous data, and error detection is performed there.

[0013] FIG. 1 is a conceptual diagram of the processing when the above features are applied to error detection in a text corpus. First, a large-scale text corpus (10) is used as the learning data set (11). Because this learning data set is a large, complex learning problem, it is divided into many small, simple data subsets (12a), (12b), (12c), (12d), (12e), and so on. This division yields a modular neural network.

[0014] Each data subset (12a, ...) is then used for learning. If learning converges (13), the subset is judged to contain no errors (14). If, on the other hand, learning does not converge (13') for some data subset, for example (12c), that subset is extracted (15) and corrected (16), for example by hand. The correction method is arbitrary and may also be automatic.

[0015] The corrected data subset (12c) is then used for relearning (17); learning converges (18), and the data is treated as error-free (19). The present invention detects errors in a text corpus according to this flow. The error detection method built on these ideas is described in detail below.
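The flow of FIG. 1 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the dictionary of subsets keyed by reference numeral, and the stand-in `no_conflict` check (used here in place of actual neural network training) are all hypothetical and do not appear in the source.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[Tuple[int, ...], str]   # (input vector, target label)
Subset = Sequence[Example]


def find_suspect_subsets(subsets: Dict[str, Subset],
                         converges: Callable[[Subset], bool]) -> List[str]:
    """Return the keys of the subsets whose learning did not converge."""
    return [key for key, data in subsets.items() if not converges(data)]


def relearn_corrected(subsets, suspects, correct, converges):
    """Correct only the suspect subsets, relearn them, report any still failing."""
    still_failing = []
    for key in suspects:
        subsets[key] = correct(subsets[key])   # manual or automatic correction
        if not converges(subsets[key]):        # relearn this subset only
            still_failing.append(key)
    return still_failing


def no_conflict(data):
    """Stand-in for training (assumption): fails if one input has two labels."""
    seen = {}
    return all(seen.setdefault(x, y) == y for x, y in data)


# Toy data: subset "12c" holds a contradictory pair.
subsets = {
    "12a": [((0, 1), "noun"), ((1, 0), "verb")],
    "12c": [((1, 1), "noun"), ((1, 1), "verb")],
}
suspects = find_suspect_subsets(subsets, no_conflict)
```

Only the subsets listed in `suspects` are passed on to correction and relearning; the others are never revisited, which is the point of features (2) and (3).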

[0016] Three types of errors can occur in a manually created corpus: simple mistakes (for example, writing 「同士」 for 「動詞」 (verb)), knowledge errors caused by inaccurate knowledge (for example, tagging every 「国立」 as a common noun even though 「国立」 (Kunitachi) is also a place name), and contradiction errors (for example, sometimes tagging 「の」 as a conjunctive particle when it should be a case particle).

[0017] Simple mistakes can be detected easily by consulting an electronic dictionary or a list of the part-of-speech system. Knowledge errors, by contrast, are difficult to detect automatically. If part-of-speech tagging is viewed as an input-output mapping problem, contradiction errors can be regarded as sets of data that have the same input but different outputs. Several methods for detecting such errors have been proposed. However, all of them must be run before learning; they are offline methods. Because they examine the entire corpus word by word without narrowing the search to the parts likely to contain errors, the computation is slow, costly, and inefficient. In particular, when the corpus is highly accurate, inspecting the entire corpus wastes a great deal of work.

[0018] The present invention therefore provides the following error detection method. Part-of-speech tagging, the learning problem considered here, is the task of finding, for a given sentence $W = w_1 w_2 \cdots w_s$, the part-of-speech sequence $T = \tau_1 \tau_2 \cdots \tau_s$ by means of the mapping, or classification problem, $W^p \to \tau_p$. Here $p$ is the position in the corpus of the target word whose part of speech is to be determined, and $W^p$ is the word sequence consisting of the target word $w_p$ together with the $l$ words to its left and the $r$ words to its right, that is, $W^p = w_{p-l} \cdots w_p \cdots w_{p+r}$, where $p - l \ge s_s$, $p + r \le s_s + s$, and $s_s$ is the position of the first word of the sentence.
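As a rough illustration of the window $W^p$, the sketch below extracts the $l$ words to the left and the $r$ words to the right of position $p$ within a single sentence (so the sentence start $s_s$ is taken as index 0). The source only defines the window where it fits inside the sentence; padding with a `<pad>` symbol at the boundaries is an assumption made here for illustration, as are the function and constant names.

```python
from typing import List, Sequence

PAD = "<pad>"  # hypothetical boundary symbol, not from the source


def context_window(sentence: Sequence[str], p: int, l: int, r: int) -> List[str]:
    """Return w_{p-l} ... w_p ... w_{p+r}, padding positions outside the sentence."""
    if not 0 <= p < len(sentence):
        raise IndexError("p must index a word of the sentence")
    window = []
    for t in range(p - l, p + r + 1):
        window.append(sentence[t] if 0 <= t < len(sentence) else PAD)
    return window


words = ["a", "b", "c", "d", "e"]
middle = context_window(words, p=2, l=2, r=2)   # window fully inside the sentence
edge = context_window(words, p=0, l=1, r=1)     # window that needs padding
```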

[0019] As an example of error detection, 217 sentences that each contain at least one error were taken from the Kyoto University Text Corpus. These sentences contain 6,816 word tokens in total (2,410 distinct words) and 97 kinds of parts of speech. If each part of speech is regarded as a class, this part-of-speech tagging problem is a 97-class classification problem.

[0020] This 97-class problem is first uniquely divided into 4,656 two-class problems. Two-class problems that still have a large amount of learning data are then further divided at random so that each contains 80 or fewer data items. In this way, the 97-class problem is divided into 23,231 small, simple two-class problems.

[0021] In the M³ network used in the present invention, each of these problems is learned by an independent module. The trained modules are then integrated by simple operations such as MIN and MAX to perform part-of-speech tagging. The input vector $X$ to each module is constructed from the word sequence $W^p$ as follows.

[Formula 1] $X = (x_{p-l}, \ldots, x_p, \ldots, x_{p+r})$

The element $x_p$ is an $\omega$-dimensional binary-coded vector that encodes the target word. Each element $x_t$ ($t \ne p$) corresponding to a context word is a $\tau$-dimensional binary-coded vector that encodes the part of speech assigned to that word. The target output $Y$ is a $\tau$-dimensional binary-coded vector that encodes the part of speech to be assigned to the target word, and is given by Formula 2.

[Formula 2] $Y = (y_1, y_2, \ldots, y_\tau)$
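Paragraph [0021] says only that the trained modules are integrated "by simple operations such as MIN and MAX". One common arrangement in min-max modular networks, assumed here and not confirmed by the source, is to take the minimum over the modules that share the same chunk of the first class and then the maximum over those minima. The function name and the layout of `outputs` are hypothetical.

```python
def combine_min_max(outputs):
    """Combine module outputs for one two-class problem.

    outputs[i][j] is the output of the module trained on chunk i of the
    first class against chunk j of the second class (assumed layout).
    """
    return max(min(row) for row in outputs)


# Two chunks per class, so four modules for this class pair.
score = combine_min_max([[0.9, 0.2],
                         [0.8, 0.7]])
```

Here the first row yields a minimum of 0.2 and the second a minimum of 0.7, so the combined score is 0.7.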

[0022] Because the M³ network consists of many modules that each handle a small, simple problem, convergence problems basically do not arise. In other words, if a module fails to converge during learning, its learning data can be assumed to contain a contradiction error; that is, the learning data set contains a pair of data items $(X_i, Y_i)$ and $(X_j, Y_j)$ that satisfies the condition of Formula 3.

[Formula 3] $X_i = X_j,\ Y_i \ne Y_j \quad (i \ne j)$

Here $X_i$ and $X_j$ are inputs, and $Y_i$ and $Y_j$ are the corresponding target outputs. Errors of this type can therefore be detected during learning simply by picking out the modules that do not converge.
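Once a non-converging module has been picked out, the pair described by Formula 3 can be located inside its small data set with a direct comparison. The sketch below is such a check; it is not the detection step of the invention itself (which relies on non-convergence), and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def find_contradictions(examples):
    """Return {input: set of targets} for inputs that carry more than one target."""
    targets_by_input = defaultdict(set)
    for x, y in examples:
        targets_by_input[tuple(x)].add(y)
    return {x: ys for x, ys in targets_by_input.items() if len(ys) > 1}


module_data = [
    ((0, 1), "noun"),
    ((0, 1), "verb"),   # same input as the line above, different target
    ((1, 1), "noun"),
]
conflicts = find_contradictions(module_data)
```

Because each module holds only a small number of data items, this comparison is cheap compared with scanning the whole corpus.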

[0023] This detection method can be seen as skipping directly to the very small data areas that contain errors and working only there, rather than scanning all the data. It is far more efficient than examining the entire corpus as conventional methods do, and it is especially effective for a highly accurate corpus, where the erroneous data is confined to a very small number of modules.

[0024] Relearning after the errors have been corrected need not cover the entire corpus or all the modules; it is performed only on the modules that did not converge, using only the data in the corrected areas. Needless to say, such low-cost relearning also helps improve the performance of the learning system.

[0025] The present invention can provide a text corpus error detection device that uses the above method. Because the method is fast, the device can be added to a language processing device that uses a text corpus so that errors are detected while learning is in progress, yielding a more accurate language processing device. The device can also be added to a device that creates a text corpus manually, to one that creates it by a combination of manual and automatic means, or to one that creates it automatically by acquiring language-specific information automatically, so that errors are detected as the corpus is created and a highly accurate text corpus results.

[0026] The detection accuracy of the method is shown below. The dimensions $\omega$ and $\tau$ of the vectors representing words and parts of speech were set to 16 and 8, respectively. The context lengths $(l, r)$ were set to $(2, 2)$. The number of units in the input layer of each module was therefore $[(l + r) \times \tau] + [1 \times \omega] = 48$.
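The input size of 48 can be reproduced with a small encoding sketch. The source says only that the vectors are "binary-coded"; representing each word or part of speech by the plain binary digits of an integer ID is an assumption made here, and the function names and the example IDs are hypothetical.

```python
def to_bits(index, width):
    """Binary digits of index, most significant first, as a list of 0/1."""
    if not 0 <= index < 2 ** width:
        raise ValueError("index does not fit in the given width")
    return [(index >> shift) & 1 for shift in range(width - 1, -1, -1)]


def encode_input(target_word_id, left_pos_ids, right_pos_ids, omega=16, tau=8):
    """Concatenate context part-of-speech codes around the target word code."""
    x = []
    for pos_id in left_pos_ids:
        x.extend(to_bits(pos_id, tau))
    x.extend(to_bits(target_word_id, omega))
    for pos_id in right_pos_ids:
        x.extend(to_bits(pos_id, tau))
    return x


# l = r = 2: two context tags on each side of the target word.
x = encode_input(target_word_id=2409, left_pos_ids=[3, 96], right_pos_ids=[0, 17])
```

Under this assumption, 16 bits are enough for the 2,410 distinct words and 8 bits for the 97 parts of speech, and the vector length is (2 + 2) × 8 + 16 = 48.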

[0027] Initially, every module was a three-layer perceptron with a 48-2-1 structure. Each training run continued until the target error of 0.05 was reached or until 5,000 steps had been performed. For modules that could not reach the target error, the number of hidden units was increased by two and training was repeated, up to five times.
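The training schedule described above might be sketched as follows. Several details are assumptions not stated in the source: sigmoid units, online gradient descent with a fixed learning rate, mean squared error as the error measure, one "step" meaning one pass over the module's data, and the five repetitions being counted after the initial run. All names are hypothetical.

```python
import math
import random


def _sigmoid(z):
    z = max(-60.0, min(60.0, z))          # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def _forward(w1, w2, x):
    # The last entry of each weight row is the bias; zip stops at len(x).
    h = [_sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in w1]
    o = _sigmoid(sum(w * v for w, v in zip(w2, h)) + w2[-1])
    return h, o


def train_once(data, n_in, n_hidden, target_error, max_steps, lr, rng):
    """Train one n_in-n_hidden-1 perceptron; True if the target error is reached."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(max_steps):
        for x, y in data:
            h, o = _forward(w1, w2, x)
            d_o = (o - y) * o * (1.0 - o)
            for j in range(n_hidden):
                d_h = d_o * w2[j] * h[j] * (1.0 - h[j])
                w2[j] -= lr * d_o * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * d_h * x[i]
                w1[j][-1] -= lr * d_h
            w2[-1] -= lr * d_o
        # Evaluate with the weights held fixed after each pass.
        mse = sum((_forward(w1, w2, x)[1] - y) ** 2 for x, y in data) / len(data)
        if mse <= target_error:
            return True
    return False


def train_module(data, n_in, hidden=2, growth=2, max_retries=5,
                 target_error=0.05, max_steps=5000, lr=0.5, seed=0):
    """Retry with two more hidden units each time; False means 'did not converge'."""
    rng = random.Random(seed)
    for attempt in range(max_retries + 1):
        n_hidden = hidden + growth * attempt
        if train_once(data, n_in, n_hidden, target_error, max_steps, lr, rng):
            return True
    return False


# A contradictory pair: identical input, different targets.
contradictory = [((0.0, 1.0), 0.0), ((0.0, 1.0), 1.0)]
converged = train_module(contradictory, n_in=2, max_steps=50)
```

For the contradictory pair the network produces the same output for both items, so the mean squared error cannot fall below 0.25 however many hidden units are added, and `train_module` returns `False`. This is the behavior the method exploits.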

[0028] FIG. 2 shows the experimental results. Of the 23,231 modules, only 82 failed to converge. Of these, 81 contained contradictory data, for a total of 97 contradictory pairs. In 94 of these 97 pairs, one of the two data items was erroneous. FIG. 3 shows examples of correct detections. The numbers in the left column give the number of the sentence containing the detected word and the word's position in that sentence. Underlined words are the words under examination, and an × marks the item whose assigned part of speech is wrong.

[0029] The remaining three pairs, which were not correctly detected, all concerned 「で」, which can be a particle or a copula (判定詞). Examination of the actual sentences showed that deciding the part of speech of these instances of 「で」 requires the whole sentence, that is, syntactic information. The experimental results therefore mean that the accuracy of the detection method reached essentially 100% for errors whose detection does not require syntactic information.

[0030] The text corpus error detection method that exploits non-convergence in neural networks is thus highly accurate and reliable, and is dramatically faster than conventional methods. Although the above experiment used a small text corpus, the effect is expected to be even more pronounced for larger corpora, which demonstrates the usefulness of the present invention.

[0031]

Effects of the Invention: With the above configuration, the present invention has the following effects. The text corpus error detection method of claim 1 turns to advantage the non-convergence problem that often troubles users of neural networks: while a manually created corpus is being learned, the errors it contains are detected efficiently by examining the modules that do not converge. This provides a fast, accurate, and low-cost detection method.

[0032] The text corpus error detection method of claim 2 can be applied to tagged text corpora, which are in wide use, so the method can be put to effective use.

[0033] The text corpus error detection device of claim 3 detects errors online and efficiently, which contributes to learning from an accurate text corpus, and also allows error correction and relearning from the corrected corpus to be carried out efficiently. This improves the performance of the learning system and in turn contributes to the advancement of language processing technology.
Furthermore, error correction and re-learning using the corrected corpus can be performed efficiently. This can improve the performance of the learning system, which in turn contributes to the improvement of language processing technology.

【００３４】請求項４に記載のテキストコーパスの誤り
検出装置によると、広く普及したタグ形式のテキストコ
ーパスを用いることができるので、有用性が高い。According to the text corpus error detecting device of the fourth aspect, since the text corpus of the widely used tag format can be used, it is highly useful.

[Brief Description of the Drawings]

FIG. 1 is a schematic diagram of the processing of the present invention.

FIG. 2 shows the experimental results of error detection.

FIG. 3 shows examples of detected errors.

[Explanation of Reference Numerals]

10: large-scale text corpus
11: learning data set
12a to 12e: data subsets
13: learning converged
13': learning did not converge
14: no errors in the data
15: extraction of a subset whose data contains errors
16: correction
17: relearning
18: convergence after relearning
19: no errors in the data

Claims

[Claims]

[Claim 1] A method for detecting errors in the word information of a previously created text corpus that contains word information, characterized in that: the categories of the word information are treated as classes for a neural network, and the classification problem is divided into small two-class problems to form a plurality of modules; for each module, a computation is performed to determine whether the module converges during neural network learning; and if a module does not converge, the module is judged to contain an error in the word information and is extracted.

[Claim 2] The text corpus error detection method according to claim 1, wherein the word information is part-of-speech information that is embedded in the text as tags to form the text corpus, and errors in the tags are detected.

[Claim 3] A detection device for detecting errors in the word information of a previously created text corpus that contains word information, characterized in that it can detect errors by performing a series of processes in which: the categories of the word information are treated as classes for a neural network, and the classification problem is divided into small two-class problems to form a plurality of modules; for each module, a computation is performed to determine whether the module converges during neural network learning; and if a module does not converge, the module is judged to contain an error in the word information and is extracted.

[Claim 4] The text corpus error detection device according to claim 3, wherein the word information is part-of-speech information that is embedded in the text as tags to form the text corpus, and errors in the tags are detected.