JP2004005641A - Method and device for correcting or improving use of word

Info

Publication number: JP2004005641A
Application number: JP2003132395A
Authority: JP
Inventors: Peter John Whitelock; ピーター　ジョン　ワイトロック; Philip Glenny Edmonds; フィリップ　グレニー　エドモンズ
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2002-05-22
Filing date: 2003-05-09
Publication date: 2004-01-08
Anticipated expiration: 2023-05-09
Also published as: GB0211727D0; CN1273915C; GB2388940A; JP4278090B2; CN1460948A

Abstract

PROBLEM TO BE SOLVED: To provide a method capable of detecting errors and unnatural expressions in text written by a user and of suggesting how the use of the language can be improved. SOLUTION: A database is provided that contains connections between words together with likelihood values associated with those connections, giving a measure of how likely each connection is to be correct or idiomatic. The likelihood values are based on the frequency with which the connections occur, obtained for example by analyzing a large body of text produced by native speakers of the language. In an error-correcting embodiment, confusable words are tried in place of words whose plausibility is below a threshold, and the confusable words that improve plausibility are reported to the user. In a context-sensitive thesaurus embodiment, confusable words are tried for all words, and those whose plausibility values exceed a second threshold are reported. COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、自然な言語テキストにおいて、単語の選択および使用を、訂正し、改善させる方法および装置に関する。また、本発明は、このような方法を行うようにコンピュータをプログラムするコンピュータプログラム、このようなプログラムを含む格納媒体、およびこのようなプログラムによってプログラムされるコンピュータに関する。
【０００２】
【従来の技術】
ある言語で書くことまたは話すことの中心には、どの単語を用いるかを選択することがある。この選択に役立てるため、母国語で書いている人は、類語辞典を用い、言語の学習者は、典型的には、２カ国語の辞書を用いる。しかし、母国語で書いている人は、類語辞典には、類義語が適切である文脈についての詳細な情報がないことに気付き、学習者は、２カ国語の辞書から誤った翻訳を選択することがあり、両者は、集中力または知識が欠けている場合には他の単語に綴り間違いをすることがある。
【０００３】
学習者の英語の注釈付きのコーパス（非特許文献１参照）によると、誤った動詞または前置詞の使用が、最も一般的なタイプの誤りであり、綴りおよび句読点の誤りがその後に続く。例えば、書き手は、「ａｓｓｏｃｉａｔｅ　ｗｉｔｈ」ではなく「ａｓｓｏｃｉａｔｅ　ｔｏ」、「ｌｏｓｅ　ｏｎｅ’ｓ　ｔｅｍｐｅｒ」ではなく「ｌｏｏｓｅ　ｏｎｅ’ｓ　ｔｅｍｅｐｅｒ」、「ｂｅａｔｓ　ｍｅ　ａｔ　ｔｅｎｎｉｓ」ではなく「ｗｉｎｓ　ｍｅ　ａｔ　ｔｅｎｎｉｓ」と書くことがある。
【０００４】
従来、このようなタイプの誤りおよび他のタイプの誤りを検出し、これらに対する訂正を示唆することが出来なかった。
【０００５】
特許文献１、２、３は、構文解析および翻訳における、共起の情報の作成および使用を開示する。
【０００６】
特許文献４、５、６、７、８、９、１０の各々が開示する技術は、一般的に混乱しやすい単語のセット、例えば、「ｈｅａｒ」と「ｈｅｒｅ」、または、「ｔｏ」と「ｔｏｏ」などのリストを用いる。テキストにおける、このような単語の存在は、潜在的な誤りを示す。これらの特許は、誤りの訂正に対して異なる方法を記載する。
【０００７】
特許文献１１は、混乱しやすい単語の使用を区別する、異なる文脈を記載する規則のシステムを用いる技術を開示する。
【０００８】
特許文献１２、１３、１４は、確率を品詞の連続に割り当てるシステムを開示する。混乱しやすい単語を含む品詞の連続である確率は、その単語と混乱される単語を含む品詞の連続である確率と比較され得る。後者の方が前者よりも高い場合、起こり得る誤りが報告される。
【０００９】
特許文献１５は、単語の連続に確率を割り当て、ある単語を他の単語と誤って綴ることに確率を割り当て、これらの確率を組み合わせて、単語が他の単語と誤って綴られているか否かを判定するシステムを開示する。
【００１０】
特許文献１６、１７は、単語を、その文脈を表す特徴と関連付け、機械学習アルゴリズムを用いて、混乱しやすい単語のセットの特定の要素に対して、特徴の値から、関数を計算するシステムを開示する。混乱しやすい単語のセットの要素がテキストに現れる場合、この関数が用いられて、正確であるか、または誤っているかが、分類される。
【００１１】
非特許文献２は、連続的な単語のｎグラムモデルを用いて、誤りを検出するシステムを開示する。このシステムは、以前には見られなかった、カテゴリー変更およびカテゴリー保存の誤りを検出し得るが、連続的なモデルに起因して、非常に限定された長さにわたってのみ検出し得る。誤りの訂正は、記載されていない。
【００１２】
特許文献１８に開示されるシステムは、パーサーの失敗による、単語の使用における潜在的な誤りを識別し、これらの誤りを、続く構文解析の成功につながるように、混乱しやすい単語を見つけることによって解決する。
【００１３】
連結に関する強度または尤度の多くの尺度は、例えば、非特許文献３、４に開示され、非特許文献３、４は特定のタスクにおいていくつかの尺度の比較評価を提供する。
【００１４】
任意の適切なパーサーを用いたテキストの解析の一例が、非特許文献５に開示されている。
【００１５】
統計学的尺度による尤度の値の計算に用いられるパラメータの公式は、非特許文献６に開示されている。
【００１６】
【特許文献１】
米国特許第４，９１６，６１４号
【特許文献２】
米国特許第４，９４２，５２６号
【特許文献３】
米国特許第５，４０６，４８０号
【特許文献４】
米国特許第４，６７４，０６５号
【特許文献５】
米国特許第４，８６８，７５０号
【特許文献６】
米国特許第５，２５８，９０９号
【特許文献７】
米国特許第５，５３７，３１７号
【特許文献８】
米国特許第５，６５９，７７１号
【特許文献９】
米国特許第５，７９９，２６９号
【特許文献１０】
米国特許第５，９０７，８３９号
【特許文献１１】
米国特許第４，６７４，０６５号
【特許文献１２】
米国特許第４，８６８，７５０号
【特許文献１３】
米国特許第５，５３７，３１７号
【特許文献１４】
米国特許第５，７９９，２６９号
【特許文献１５】
米国特許第５，２５８，９０９号
【特許文献１６】
米国特許第５，６５９，７７１号
【特許文献１７】
米国特許第５，９０７，８３９号
【特許文献１８】
米国特許第５，９９９，８９６号
【非特許文献１】
Ｎｉｃｈｏｌｌｓ、１９９９「Ｔｈｅ　Ｃａｍｂｒｉｄｇｅ　Ｌｅａｒｎｅｒ　Ｃｏｒｐｕｓ−Ｅｒｒｏｒ　Ｃｏｄｉｎｇ　ａｎｄ　Ａｎａｌｙｓｉｓ　ｆｏｒ　Ｗｒｉｔｉｎｇ　Ｄｉｃｔｉｏｎａｒｉｅｓ　ａｎｄ　ｏｔｈｅｒｂｏｏｋｓ　ｆｏｒ　Ｅｎｇｌｉｓｈ　Ｌｅａｒｎｅｒｓ」、Ｓｕｍｍｅｒ　Ｗｏｒｋｓｈｏｐ　ｏｎ　Ｌｅａｒｎｅｒ　Ｃｏｒｐｏｒａ、Ｃａｍｂｒｉｄｇｅ　Ｕｎｉｖｅｒｓｉｔｙ　Ｐｒｅｓｓ
【非特許文献２】
ＣｈｏｄｏｒｏｗおよびＬｅａｃｏｃｋのＡｎ　ｕｎｓｕｐｅｒｖｉｓｅｄ　ｍｅｔｈｏｄ　ｆｏｒ　ｄｅｔｅｃｔｉｎｇ　ｇｒａｍｍａｔｉｃａｌｅｒｒｏｒｓ」（Ｐｒｏｃｅｅｄｉｎｇｓ　ｏｆ　ｔｈｅ　１^ｓｔＡｎｎｕａｌ　Ｍｅｅｔｉｎｇ　ｏｆ　ｔｈｅ　Ｎｏｒｔｈ　Ａｍｅｒｉｃａｎ　Ｃｈａｐｔｅｒ　ｏｆ　ｔｈｅ　Ａｓｓｏｃｉａｔｉｏｎ　ｆｏｒ　Ｃｏｍｐｕｔａｔｉｏｎａｌ　Ｌｉｎｇｕｉｓｔｉｃｓ、１４０〜１４７ページ、２０００年
【非特許文献３】
Ｋ．Ｋａｇｅｕｒａ、１９９９、「Ｂｉｇｒａｍ　Ｓｔａｔｉｓｔｉｃｓ　Ｒｅｖｉｓｉｔｅｄ：　ａ　Ｃｏｍｐａｒａｔｉｖｅ　Ｅｘａｍｉｎａｔｉｏｎ　ｏｆ　ｓｏｍｅ　Ｓｔａｔｉｓｔｉｃａｌ　Ｍｅａｓｕｒｅｓ　ｉｎ　Ｍｏｒｐｈｏｌｏｇｉｃａｌ　Ａｎａｌｙｓｉｓ　ｏｆ　Ｊａｐａｎｅｓｅ　Ｋａｎｊｉ　Ｓｅｑｕｅｎｃｅｓ」、Ｊｏｕｒｎａｌ　ｏｆ　Ｑｕａｎｔｉｔａｔｉｖｅ　Ｌｉｎｇｕｉｓｔｉｃｓ、１９９９、ｖｏｌ　６、ｎｏ．２、１４４〜１６６ページ
【非特許文献４】
Ｅｖｅｒｔら、「Ｍｅｔｈｏｄｓ　ｆｏｒ　ｔｈｅ　Ｑｕａｌｉｔａｔｉｖｅ　Ｅｖａｌｕａｔｉｏｎ　ｏｆ　Ｌｅｘｉｃａｌ　Ａｓｓｏｃｉａｔｉｏｎ　Ｍｅａｓｕｒｅｓ」、Ｐｒｏｃｅｅｄｉｎｇ　ｏｆ　ｔｈｅ　３０^ｔｈＡｎｎｕａｌ　Ｍｅｅｔｉｎｇ　ｏｆ　ｔｈｅ　Ａｓｓｏｃｉａｔｉｏｎ　ｆｏｒ　Ｃｏｍｐｕｔａｔｉｏｎａｌ　Ｌｉｎｇｕｉｓｔｉｃｓ，Ｔｏｕｌｏｕｓｅ，２００１、１８８〜１９５ページ
【非特許文献５】
Ｍ．Ｃｏｌｌｉｎｓの「Ｔｈｒｅｅ　Ｇｅｎｅｒａｔｉｖｅ　Ｌｅｘｉｃａｌｉｓｅｄ　Ｍｏｄｅｌｓ　ｆｏｒ　Ｓｔａｔｉｓｔｉｃａｌ　Ｐａｒｓｉｎｇ」（Ｐｒｏｃｅｅｄｉｎｇｓ　ｏｆ　ｔｈｅ　３５ｔｈ　ａｎｎｕａｌ　ｍｅｅｔｉｎｇ　ｏｆ　ｔｈｅ　ＡＣＬ／８^ｔｈｃｏｎｆｅｒｅｎｃｅ　ｏｆｔｈｅ　ＥＡＣＬ、Ｍａｄｒｉｄ、１９９７）、ＳｌｅａｔｏｒおよびＴｅｍｐｅｒｌｅｙの「Ｐａｒｓｉｎｇ　Ｅｎｇｌｉｓｈ　ｗｉｔｈ　ａ　Ｌｉｎｋ　Ｇｒａｍｍａｒ」（ＣＭＵ−ＣＳ−９１−１９６、Ｃａｒｎｅｇｉｅ−Ｍｅｌｌｏｎ　Ｕｎｉｖｅｒｓｉｔｙ　Ｄｅｐｔ．　ｏｆ　Ｃｏｍｐｕｔｅｒ　Ｓｃｉｅｎｃｅ、１９９１）
【非特許文献６】
Ｄ．Ｌｉｎの「Ａｕｔｏｍａｔｉｃ　Ｒｅｔｒｉｅｖａｌ　ａｎｄ　Ｃｌｕｓｔｅｒｉｎｇ　ｏｆ　Ｓｉｍｉｌａｒ　Ｗｏｒｄｓ」（ＣＯＬＩＮＧ−ＡＣＬ　９８、Ｍｏｎｔｒｅａｌ、Ｃａｎａｄａ、１９９８年８月）
【００１７】
【発明が解決しようとする課題】
本発明は、ユーザが書いたものにおける誤りおよび不自然な表現を検出し、言語の使用を改善し得る方法を示唆する方法および装置を提供することを目的とする。
【００１８】
【課題を解決するための手段】
本発明は、上記のようなタイプの誤りおよび他のタイプの誤りを検出し、これらに対する訂正を示唆することが可能である。本発明は、事実上の単語の綴りの誤り（例えば、ｌｏｓｅ／ｌｏｏｓｅ）、および様々な他のタイプの誤りを処理することができる。
【００１９】
例えば、「ｍａｋｅ」のような単語を類語辞典で引くと、書き手は多数の類義語を見出す。これらは、中心的な意味を共有するグループに分類され得る。あるグループには、「ｃｒｅａｔｅ」、「ｃｏｎｓｔｒｕｃｔ」、および「ｅｓｔａｂｌｉｓｈ」などの類義語が含まれ得るが、書き手が、「ｃｒｅａｔｅｓ　ａ　ｄｉｖｅｒｓｉｏｎ」、「ｃｏｎｓｔｒｕｃｔｓ　ａ　ｍｏｄｅｌ」、または「ｅｓｔａｂｌｉｓｈｅｓ　ａ　ｒｅｌａｔｉｏｎｓｈｉｐ」を見出すことはない。
【００２０】
本発明は、これらを、「ｍａｋｅ　ａ　ｄｉｖｅｒｓｉｏｎ」、「ｍａｋｅ　ａ　ｍｏｄｅｌ」、または「ｍａｋｅ　ａ　ｒｅｌａｔｉｏｎｓｈｉｐ」などの入力に応答して提供することを可能にする。
【００２１】
本発明は、書き言葉であるか話し言葉であるかに関わらず、以下ではテキストと呼ぶ、一続きの言語において共起し得る（必ずしも、隣接しない）、２つの単語または句の間の関係を含む、依存性または連結性を利用する。連結性は、テキストの大部分において現れる頻度に基づいて、強度または尤度の尺度と関連付けられ得る。テキストにおける単語は、それが現れている連結における尤度の値に基づいて、もっともらしさの値と関連付けられ得る。テキスト内においてもっともらしくない単語は、文脈において、誤っているか、または、不自然であり得る。
【００２２】
本発明の第１の局面によると、第１の言語の複数の単語を含む書かれたテキストまたは話されたテキストのセクションにおける第１の単語または句の選択を訂正または改善させる方法であって、（ａ）該第１の言語の単語または句の間の連結に関する第１のデータベースを提供する工程であって、各連結は、該第１の言語のテキストの本文において該連結が現れる頻度（ｆｒｅｑｕｅｎｃｙ）に基づいて、少なくとも１つの関連付けられた尤度の値を有する、工程と、（ｂ）テキストのセクションの該第１の単語または句と、第２の単語または句との間に第１の連結を確立するように、テキストのセクションを解析する工程であって、該連結の少なくとも１つの第１の尤度の値、および該第１の単語または句の第１のもっともらしさの値は、該少なくとも１つの尤度の値に基づく、工程と、（ｃ）少なくとも１つの単語または句の各々が、混乱されることがある単語または句のセットと関連付けられている、第２のデータベースを提供する工程と、（ｄ）該第２のデータベースから、混乱しやすい単語または句を、該テキストのセクションにおける該第１の単語または句との置換候補として選択または計算する工程と、（ｅ）該第１のデータベースにおける第２の連結の尤度の値に基づいて、該混乱しやすい単語または句の第２のもっともらしさの値を導出する工程であって、該第２の連結は、該混乱しやすい単語または句と、該テキストのセクションにおける他の単語または句とを含む、工程と、（ｆ）該計算されたもっともらしさの値（ｐｌａｕｓｉｂｉｌｉｔｙ　ｖａｌｕｅｓ）に基づいて、該混乱しやすい単語または句の表示を選択的に提供する工程とを包含する、方法が提供される。
【００２３】
前記第１のデータベースにおける前記連結の各々の尤度の値が、同じ依存性関係を有する単語または句のうちの１つを含む他のリンクの各々が現れる頻度にも基づいてもよい。
【００２４】
前記第１のデータベースにおける前記連結の各々の尤度の値が、同じ依存性関係を有する他の連結の全てが現れる頻度にも基づいてもよい。
【００２５】
前記第１のデータベースにおける前記連結の各々の尤度の値が、相互情報（Ｍｕｔｕａｌ　Ｉｎｆｏｒｍａｔｉｏｎ）、Ｔ得点（Ｔ−ｓｃｏｒｅ）、ＹｕｌｅのＱ係数（Ｙｕｌｅ’ｓ　Ｑ　ｃｏｅｆｆｉｃｉｅｎｔ）、および対数尤度（ｌｏｇ−ｌｉｋｅｌｉｈｏｏｄ）のうちの少なくとも１つを含んでもよい。
【００２６】
前記工程（ｅ）において、前記他の単語または句が、前記第２の単語または句であってもよく、前記第２の連結の前記依存性関係は、前記第１の連結の依存性関係と同じであってもよい。
【００２７】
前記工程（ｂ）は、前記テキストのセクションにおいて、複数の第１の単語または句の複数の第１の連結を確立する工程を含んでもよく、前記工程（ｄ）、（ｅ）および（ｆ）は、該第１の単語または句の各々について行われてもよい。
【００２８】
前記工程（ｂ）が、前記テキストのセクションにおいて隣接していない単語または句の間に連結を確立する工程を含んでもよい。
【００２９】
前記工程（ｄ）が、第１の単語または句とこんらんしやすい単語または句のセットの混乱しやすい単語または句の各々を選択する工程を含んでもよく、前記工程（ｅ）および（ｆ）が、該混乱しやすい単語または句の各々について行われてもよい。
【００３０】
前記工程（ｆ）が、値の降順で、第２のもっともらしさの値を示す工程を含んでもよい。
【００３１】
前記第１のもっともらしさの値が第１の閾値よりも低い場合、前記工程（ｄ）、（ｅ）、および（ｆ）が行われてもよい。
【００３２】
前記工程（ｆ）が、第２のもっともらしさの値の各々または前記第２のもっともらしさの値が、第２の閾値を越える場合に、表示を提供する工程を含んでもよい。
【００３３】
前記工程（ｆ）が、前記第２のもっともらしさの値が前記第１のもっともらしさの値よりも大きい場合、表示を提供する工程を含んでもよい。
【００３４】
前記工程（ｂ）が前記第１のもっともらしさの値を、注釈付きの学習者の誤りのコーパスおよび関連付けられた尤度の値から機械学習技術によって学習した関数によって計算する工程を含んでもよい。
【００３５】
この方法は、前記テキストのセクションにおける第１の単語を、前記混乱しやすい単語と置換する工程をさらに含んでもよい。
【００３６】
この方法は、第２の言語から、翻訳によってテキストのセクションを生成する工程をさらに含んでもよい。
【００３７】
この方法は、印刷された文献から、光学文字認識によって、テキストのセクションを生成する工程をさらに含んでもよい。
【００３８】
本発明の第２の局面によると、本発明の第１の局面による方法をコンピュータに実行させるための、コンピュータプログラムが提供される。
【００３９】
本発明の第３の局面によると、本発明の第２の局面によるプログラムを含む、格納媒体が提供される。
【００４０】
この媒体は、コンピュータ読取り可能媒体を含んでもよい。
【００４１】
本発明の第４の局面によると、本発明の第３の局面によるプログラムを含む、コンピュータが提供される。
【００４２】
本発明の第５の局面によると、第１の言語の複数の単語を含む書かれたテキストまたは話されたテキストのセクションにおける単語または句の選択を訂正または改善させる装置であって、該第１の言語の単語または句の間の連結に関する第１のデータベースであって、各連結は、該第１の言語のテキストの本文において該連結が現れる頻度に基づいて、少なくとも１つの関連付けられた尤度の値を有する、第１のデータベースと、テキストのセクションの該第１の単語または句と、第２の単語または句との間に第１の連結を確立するように、テキストのセクションを解析する制御部であって、該連結の少なくとも１つの第１の尤度の値、および該第１の単語または句の第１のもっともらしさの値は、該少なくとも１つの尤度の値に基づく、制御部と、少なくとも１つの単語または句の各々が、混乱されることがある単語または句のセットと関連付けられている、第２のデータベースとを備え、該制御部は、該第２のデータベースから、混乱しやすい単語または句を、該テキストのセクションにおける該第１の単語または句との置換候補として選択または計算し、該制御部は、該第１のデータベースにおける第２の連結の尤度の値に基づいて、該混乱しやすい単語または句の第２のもっともらしさの値を導出し、該第２の連結は、該混乱しやすい単語または句と、該テキストのセクションにおける他の単語または句とを含んでおり、該制御部は、該計算されたもっともらしさの値に基づいて、該混乱しやすい単語または句の表示を選択的に提供する、装置が提供される。
【００４３】
単語間の連結の尤度を用いることによって、品詞の連続の確率を殆ど用いない、公知のシステムよりも改善している技術を提供することが可能である。なぜなら、このような公知のシステムは、非常に一般的であるカテゴリーを維持する誤りを検出して訂正することができないからである。
【００４４】
改善は、依存性文法は、隣接していないが、それでも、互いの選択に直接影響を与える、単語間の依存性を捕らえることができるので、連続的なｎグラム（ワードまたは品詞のいずれか）を用いることによって達成される。ｎグラムは、原則として、このような依存性をも含むように、拡大され得るが、実際には、これは、データが疎であることにおいて深刻な問題につながり得る。連結を用いることによって、統計学的な尤度の値の計算について利用可能なデータが、言語学的に大きな単位に集められる。殆どの場合において、常に、３つの要素の依存性の断片が、有用な統計を得るために充分であるが、４つの要素の連続的なｎグラムでさえ、ありそうな単語の組合せおよびありそうもない単語の組合せの多くの場合について誤りをおかす。
【００４５】
言語学的に意味のあるエンティティに対する、この統計の制限の重要な結果として、確率の値が、誤りを見つけるために必要な様態で解釈することが、より容易になることである。これを理解するため、連続的な単語の二重字モデルにおいて、隣接する単語間の遷移の確率の重要性を考慮する。構成要素内で、例えば、「ａ　ｂｉｇ　ｄｏｇ」における「ｂｉｇ」と「ｄｏｇ」との間で、遷移の確率は、類似の形容詞および名詞の連続と、直接比較され得る。しかし、「ｇｉｖｅｔｈｅ　ｄｏｇ　ａ　ｂｏｎｅ」における「ｄｏｇ」と「ａ」との間の遷移の確率は、「ｄｏｇ」で終わる構成要素に、「ａ」で始まる構成要素が続くので、どちらかというと、対象とならない（ありそうもない）確率である。「ｇｉｖｅ」が先頭である構成要素が、「ｂｏｎｅ」が先頭である第２の目的語を有するという対象になる確率は表されず、可能な代替例、例えば、「ｇｉｖｅ　ｔｈｅ　ｄｏｇ　ａ　ｃｌｏｎｅ」と比較されることはできない。
【００４６】
すなわち、連続的なｎグラムモデルにおいて、低い遷移確率は、言語学的に興味深い尤度の低さと、そうではない尤度の低さとの両方を表し得る。これは、潜在的な誤りの直接的な指示として用いられることはできない。連続的なｎグラムに基づくシステムが、誤りを処理するトリガとして、全ての低い確率を処理する場合、多数の潜在的な「誤り」を検出し、そのうちの多くが実際の「誤り」ではない。これらの処理はコストが高く、また、このような偽の誤りが、本当の誤りとして分類されるという危険を引き起こす。
【００４７】
これが、低い遷移確率を用いる公知の技術のいずれも誤り処理のトリガとして用いられず、むしろ、混乱しやすいことが公知である特定の単語のテキストにおける存在を用いて、元の連続の相対的な尤度および単語を置き換えることによって得られる尤度を考慮する理由である。
【００４８】
対照的に、本発明の技術においては、「低い尤度」が、よりロバストな誤りのインジケータである。任意のありそうもない連結は、誤り処理の開始に寄与し得、ありそうもない連結のみが寄与する。当然、ありそうにもないことが、常に誤りであるという結果にはならないが、本発明の技術においては、これらの偽のトリガは、ずっと少ない。
【００４９】
さらに、いくつかの混乱しやすい単語のセットにおける要素のテキストにおける存在が、多くの公知の技術と同様に、誤り処理のトリガに過ぎない場合、混乱しやすい単語のセットに要素を追加することは、誤り処理がトリガされる回数と、各要素を考慮する計算コストとの両方を増加させる。
【００５０】
連結の尤度、および得られる単語のもっともらしさが、本発明と同様に、誤り処理のトリガである場合、ずっと広い範囲の誤りが、特徴付けられ得る。混乱しやすさの概念は、綴りおよび発音の高い頻度での混乱に限定されない。
【００５１】
学習アルゴリズムを用い、また、誤り処理のトリガとして、混乱しやすいことが公知である単語の存在を用いる公知の技術において、学習アルゴリズムを単語の分類に適用すること以外に、単語を潜在的な誤りとして検出する方法はない。さらに、公知のｎグラムに基づく技術と同様に、学習システムは、データを言語学的に大きな単位に集めることによる利益を完全には得ない。
【００５２】
本発明の技術は、構文解析の失敗に基づく公知の技術の改善を表す。なぜなら、構文解析の失敗は、語彙の誤り、特に、同じ品詞の単語との置換に関わる語彙の誤りの、非常に粗い検出機構であるからである。対照的に、本発明の技術は、非常に短い文の断片の尤度でさえ、非常にきめ細かい定量的な判定を提供し、アタッチメントがないことによって示されるように、特定の、極端に尤度が低い場合として、構文解析の失敗を含む。さらに、構文解析の成功（誤りが訂正されたという粗い状態）は、得られた改善のきめ細かい定量的な判定と置換され得る。
【００５３】
【発明の実施の形態】
本発明は、添付の図面を参照しながら、例示のために、さらに説明される。
【００５４】
本発明においては、ユーザが書いたものにおける誤りおよび不自然な表現を検出し、言語のこのような使用を改善し得る方法を示唆する方法および装置が提供される。これらの技術は、その文脈において、所与の入力表現と意味が類似する表現を示唆する、文脈に対して高感度な類語辞典として用いられてもよい。単語の組合せの統計的に依存性のモデルは、誤り検出および置換のチェックの基礎として用いられる。これによって、連続的なｎグラムモデルまたは解析されていない特徴のセットのいずれかを用いる、公知の方式で、いくつかの問題が解決される。また、これらの技術は、置換の候補の範囲をずっと広くすることが可能である。誤りの検出は、用いることによって誤りが起きやすい特定の単語の検出に依存しないので、以前に出てきたことがない誤りも検出および訂正され得る。
【００５５】
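The present invention uses two types of relation between words. One type holds between two words at different positions in a sentence. These are dependency relations such as "subject of", "object of", and "modifier of", examples of which are shown in FIG. 2. FIG. 2 shows the result of analyzing the sentence "Love is the most important condition for marriage". Words are represented by their uninflected form and part of speech, that is, as lemmas. Thus "is" is represented as "be_V". The subject of this verb is identified as "love_N" and its object as "condition_N". The latter is specified by "the_DET" and modified by "important_ADJ". "most_ADV" is identified as an adverb modifying "important_ADJ". "for_PREP" is identified as a preposition modifying "condition_N", and "marriage_N" is identified as the object of the preposition "for_PREP". A triple consisting of two lemmas and the dependency relation linking them is called a connection.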
【００５６】
他方のタイプの関係は、「〜の可能な置換」として定義される関係、すなわち、文の所与の位置での代替的な単語の選択肢の間の関係を含む。置換の関係のいくつかの例は、以下の通りである。
【００５７】
・類義語、反意語、下位語、および上位語のような類語関係
・「ｌｏｓｅ」が「ｌｏｏｓｅ」になるように、その言語の他の単語になってしまうような綴りの誤り（特殊な場合として、「ｐａｎｅ」および「ｐａｉｎ」のように、発音が同じであるが綴りが異なる単語に関連する、同音がある）
・１つの語源から異なる様式で形成された単語に関連する、派生語（例えば、「ｉｎｔｅｒｅｓｔｅｄ」および「ｉｎｔｅｒｅｓｔｉｎｇ」、あるいは、「ｓａｆｅ」および「ｓａｆｅｔｙ」）
・他の言語における、１つの単語に対する代替的な翻訳である単語に関連する、複数の言語間での混乱しやすさ（例えば、フランス語には、両方とも、「ｍａｒｑｕｅｒ」と翻訳され得る「ｍａｒｋ」および「ｂｒａｎｄ」）
・ある単語が、同語源の他の言語の単語の翻訳として不適切である、偽のフレンド（例えば、フランス語の「ａｃｔｕａｌ」の、それぞれ、正しい翻訳および誤った翻訳である、「ｐｏｓｓｉｂｌｅ」および「ａｃｔｕａｌ」）
・無意味な単語を置換すること、または、無意味な単語と置換することとしても考えられ得る、挿入および消去の誤り（例えば、「ｈｅ　ｒａｎｇ　（ａｔ）ｔｈｅ　ｄｏｏｒｂｅｌｌ」「ｗｅ　ｐａｉｄ　（ｆｏｒ）　ｏｕｒ　ｍｅａｌｓ」）
文中で単語ｗを用いることが、誤っているか、または、そうでなくても、慣用語法にかなっておらず、ぎこちないと思われる場合、ｗの混乱しやすい単語のセットＣ（ｗ）と呼ばれる、単語のセットの各要素が、可能な置換として考えられる。ｗの混乱しやすい単語のセットは、ｗに関連する単語から得られる。ただし、実際の全要素は、ユーザの母国語、書いている言語における言語能力のレベル、および他の要因によって異なり得る。
【００５８】
依存性の関係は、文の構造を表す、幅広く用いられる手段である。多くの変形例が見出されるが、本発明の技術のコンテキストからは、主として、些細なものである。依存性の関係は、従属部分およびヘッドと呼ばれる、２つの単語を結合する。典型的な公式において、従属部分は、１つより多いヘッドに関連し得ないが、ヘッドは、例えば、任意の数の従属部分と、循環の禁止などの他の制約とを含み得、１つの文における関係が樹形図を形成することを確実にする。本明細書においては、文中の２つの単語の間の連結（連結とも呼ばれる）は、３つの形態によって表される。
＜ｆｉｒｓｔ　ｌｅｍｍａ，ｒｅｌａｔｉｏｎ，ｓｅｃｏｎｄ　ｌｅｍｍａ＞
ただし、ｌｅｍｍａ（見出し）は、動詞「ｔｏ　ｃｈａｓｅ」の全ての形態、すなわち、ｃｈａｓｅ、ｃｈａｓｅｓ、ｃｈａｓｅｄ、ｃｈａｓｉｎｇを表す、「ｃｈａｓｅ＿Ｖ」のような用語である。
【００５９】
連結は、強度または尤度の多くの尺度と関連付けられ得る。連結の頻度、すなわち、構文解析されたコーパスにおいて何回見受けられたかは、強度を評価する粗い方法に過ぎない。より正確な尺度は、連結の頻度が、その成分の部分の頻度から予期され得るものから外れる範囲まで計算する。このような尺度のいくつかは、上記非特許文献３、４から公知である。このような尺度のいくつかは、単語の分割、構文解析、翻訳、情報の取り出し、および辞書編集法における用途を有する。これらの例において、典型的には、予期されるよりも、ずっとありそうな連結のみが、対象となる。しかし、本発明の技術は、予期されるようもずっとありそうもない連結についても関係する。テキストにおいて、このような連結が検出されることは、文法的に正しくないか、または言語の慣用的な用法とは異なっていることを示す。
【００６０】
１つ以上のありそうもない連結において現れる単語は、順に、混乱しやすい単語のセットの各要素によって置換され得、このような置換のそれぞれを行うことによる結果は、もっともらしさについて評価され得る。混乱しやすい単語のセットのうちの１つ以上の要素によって、充分にもっともらしくなる場合、これらの要素は、置換用のものとして示唆され得る。
【００６１】
予備的な工程として、単語の組合せについての尤度の値のデータベースが、依存性文法に従って、ネイティブスピーカーのテキストを大量に解析することによって、構築される。任意の適切なパーサーが用いられ得、適切な例が、上記非特許文献５に開示されている。アナライザーは、一般的に考えられるようにパーサーでなくてもよいが、有限状態、または、依存性を記録する機構で補強された、類似の技術を用い得る。
【００６２】
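The frequency of each type of connection is counted, a likelihood value is calculated for each according to one or more statistical measures such as mutual information, T-score, or log-likelihood, and the results are stored in a table. FIG. 3 shows some entries in such a database.
【００６３】
In FIG. 3, the first column shows the connection itself. The column headed "frequency" shows the number of times the connection appears in the parsed corpus (here, about 80 million words of the British National Corpus). The remaining columns are, respectively, mutual information, T-score, Yule's Q coefficient, and log-likelihood. Each of these is a different metric calculated from the frequencies of the following four items:
<first lemma, relation, second lemma>
<first lemma, relation, *>
<*, relation, second lemma>
<*, relation, *>
where "*" stands for any lemma. Formulae in terms of these parameters are disclosed in Non-Patent Document 6. The metrics have different ranges and are sensitive in different ways to the exact values of the four parameters. In each case, however, the value correlates with the likelihood of the relation: a positive value indicates a combination that is more likely than chance, and a negative value indicates an unlikely combination.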
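The four frequencies listed above can be collected in a single pass over a parsed corpus. The following is a minimal sketch only: the names `Connection`, `Parameters`, `build_tables`, and `parameters` are hypothetical, and the toy corpus counts are invented for illustration rather than taken from FIG. 3.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class Connection(NamedTuple):
    first: str     # first lemma, e.g. "associate_V"
    relation: str  # dependency relation, e.g. "padv"
    second: str    # second lemma, e.g. "to_PREP"


class Parameters(NamedTuple):
    pair: int        # f(<first lemma, relation, second lemma>)
    first_any: int   # f(<first lemma, relation, *>)
    any_second: int  # f(<*, relation, second lemma>)
    any_any: int     # f(<*, relation, *>)


Tables = Tuple[Counter, Counter, Counter, Counter]


def build_tables(parsed: Iterable[Connection]) -> Tables:
    """Count each connection and its three wildcard marginals in one pass."""
    pair, first_any, any_second, any_any = Counter(), Counter(), Counter(), Counter()
    for c in parsed:
        pair[c] += 1
        first_any[(c.first, c.relation)] += 1
        any_second[(c.relation, c.second)] += 1
        any_any[c.relation] += 1
    return pair, first_any, any_second, any_any


def parameters(c: Connection, tables: Tables) -> Parameters:
    """Look up the four frequencies for one connection (0 if never seen)."""
    pair, first_any, any_second, any_any = tables
    return Parameters(
        pair[c],
        first_any[(c.first, c.relation)],
        any_second[(c.relation, c.second)],
        any_any[c.relation],
    )


# Toy "parsed corpus"; the counts are invented for illustration only.
corpus = (
    [Connection("associate_V", "padv", "with_PREP")] * 3
    + [Connection("associate_V", "padv", "to_PREP")]
    + [Connection("go_V", "padv", "to_PREP")] * 2
    + [Connection("dog_N", "subj", "bark_V")]
)
tables = build_tables(corpus)
print(parameters(Connection("associate_V", "padv", "to_PREP"), tables))
```

Keeping the marginals in separate counters means any of the metrics can be recomputed later without re-reading the corpus.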
【００６４】
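For example, the t-score of <associate_V padv to_PREP> is calculated as follows.
【００６５】
【数１】
$$ t = \frac{F - \dfrac{f(\text{associate\_V padv } *)\; f(* \text{ padv to\_PREP})}{f(* \text{ padv } *)}}{\sqrt{F}} $$
where f(associate_V padv to_PREP) = F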
ネイティブスピーカーのコーパスの構文解析は、高品質な、単語の組合せの尤度の評価を得るためには、正確、かつ、可能な限り広い範囲である必要がある。しかし、正確な構文解析は、高品質な、単語の組合せの尤度の評価へのアクセスを必要とし、これによって、矛盾が生じる。この矛盾は、反復的またはブートストラッピングアプローチによって解決され得る。これは、構文解析アルゴリズムのある特定の性質に基づく。
【００６６】
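Each individual connection in a sentence is associated with a preference value. The preference value is a measure of confidence that such a connection exists between the two words in the sentence. It is a function both of sentence-specific factors, such as part-of-speech probabilities and the separation of the words, and of language-wide factors, such as the strength of the connection between those words.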
【００６７】
構文解析アルゴリズムは、集合的に依存性構造の公理を満たす（すなわち、連結は交差しない、各単語は１つより多いノードに依存しないなど）、１セットの連結を返す。しかし、このセットは、１つの接続された樹形図を形成するためには必要とされない。
【００６８】
文特有の要因および言語全体にわたる要因の優先度の値に対する相対的な寄与は、適切なパラメータ設定によって変動し得る。
【００６９】
閾値は、優先度の値がその閾値を越える連結のみが返されるように設定され得る。
【００７０】
構文解析アルゴリズムの反復的な性質は、非常に簡略的な句「ｗｏｒｌｄ　ｔｉｔｌｅ　ｆｉｇｈｔ」の構文解析を考慮することによって、説明される。
【００７１】
統語論的には、「ｔｉｔｌｅ」が「ｆｉｇｈｔ」を修飾するはずであるが、「ｗｏｒｌｄ」が、「ｔｉｔｌｅ」を修飾するのか、「ｆｉｇｈｔ」を修飾するのかが不明である。英語の統語論において、名詞の連続では、最後の名詞以外の各名詞が、その右側にあるいずれの名詞を修飾してもよい。この場合、特定の単語の結合の強度の知識から、「ｗｏｒｌｄ」が「ｔｉｔｌｅ」を修飾しているという結論が得られる。他の場合、例えば、「ｐｌａｓｔｉｃ　ｂａｂｙ　ｐａｎｔｓ」の場合、第１の名詞は、直後に続く名詞ではなく、最後の名詞を修飾する。
【００７２】
完全な構文解析から以下の連結が得られる。
１．＜ｔｉｔｌｅ＿Ｎ，ｍｏｄ＿ｏｆ，ｆｉｇｈｔ＿Ｎ＞
２．＜ｗｏｒｌｄ＿Ｎ，ｍｏｄ＿ｏｆ，ｔｉｔｌｅ＿Ｎ＞
ネイティブスピーカーのコーパスの構文解析の第１の反復において、特有の単語の間の連結についての尤度の値は利用可能でないので、言語全体にわたる要因は、優先度の値に何も寄与しない。優先度の閾値は高く設定されるので、例えば、品詞が曖昧な単語、または、広く分類される単語は、連結されず、連結の正確性についての信頼度は、高い。この例においては、連結１のみが返される。連続する名詞中終わりから２番目の名詞は、言語全体の要因に関わらず、最後の名詞を修飾しているはずである。しかし、言語全体にわたる情報がないので、連結２、および不正確な＜ｗｏｒｌｄ＿Ｎ，ｍｏｄ＿ｏｆ，ｆｉｇｈｔ＿Ｎ＞のいずれも、この場合において、返されるような充分に高い優先度を有していない。しかし、コーパスにおける、他の名詞が後に続かない「ｗｏｒｌｄ　ｔｉｔｌｅ」（および「ｗｏｒｌｄ　ｆｉｇｈｔ」）の他の例の連結が返される。
【００７３】
その後、尤度の値は、これらの高い確実な連結を用いて、計算される。後続の反復は、優先度の決定において、これらの言語全体にわたる要因を使用し始め得るので、優先度の閾値は下げられ得る。これによって、返される連結の数（構文解析の範囲）が増大し、尤度のより正確な統計が計算されることが可能になる。この例において、＜ｗｏｒｌｄ，ｍｏｄ＿ｏｆ，ｔｉｔｌｅ＞および＜ｗｏｒｌｄ，ｍｏｄ＿ｏｆ，ｆｉｇｈｔ＞の相対的な頻度および／または尤度は、前者が後者よりも、選ばれることにつながる。その後、さらなる反復は、言語にわたる要因の優先度に対する寄与を増大させ続け、優先度の閾値を低減させる。このようにして、尤度データの範囲および信頼度が徐々に改善され得る。
【００７４】
ネイティブスピーカーのコーパスの構文解析の各反復の後、各タイプの尤度の値が、データベースにおいて決定され、入力される。
【００７５】
充分に正確なデータベースが準備されるか、何らかの手段で入手される場合、そのデータベースは、本発明において用いられ得る。問題についてチェックされるテキストは、このような構文解析手順の１回の反復にさらされる。言語全体にわたる要因の構文解析に対する寄与は、これらの要因、すなわち、連結の尤度の値が、次の段階で考慮されるので、低減され得る。
【００７６】
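The likelihood value of each connection in the text is then determined by looking it up in the native-speaker database. A connection not found in the original native-speaker corpus can be assigned a likelihood value by assuming that it has some fairly low frequency. In a typical embodiment, all connections found with a frequency of 1 in the native-speaker corpus are discarded, which greatly reduces the size of the data. Connections not found in the database are then assumed to have a frequency in the range 0 to 2, the optimal value being determined by experiment, and their likelihood values are calculated accordingly.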
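The pruning and default-frequency lookup described above might look like the following sketch. It assumes the t-score as the likelihood measure and a default of 0.5 for unseen connections; both are illustrative choices, since the text only says the value lies between 0 and 2 and is tuned by experiment. A strictly positive default is assumed because the t-score divides by the square root of the frequency, and the marginal counts are assumed to be present for every lemma looked up. All function names and counts are hypothetical.

```python
import math
from typing import Dict, Tuple

Connection = Tuple[str, str, str]  # (first lemma, relation, second lemma)


def prune_singletons(pair_counts: Dict[Connection, int]) -> Dict[Connection, int]:
    """Discard connections seen only once, as in the typical embodiment."""
    return {c: f for c, f in pair_counts.items() if f > 1}


def t_score(f_pair: float, f_first: float, f_second: float, f_rel: float) -> float:
    """(observed - expected) / sqrt(observed); requires f_pair > 0."""
    expected = f_first * f_second / f_rel
    return (f_pair - expected) / math.sqrt(f_pair)


def likelihood(conn: Connection,
               pair_counts: Dict[Connection, int],
               first_counts: Dict[Tuple[str, str], int],
               second_counts: Dict[Tuple[str, str], int],
               rel_counts: Dict[str, int],
               unseen_frequency: float = 0.5) -> float:
    """Likelihood of a connection; unseen ones get an assumed low frequency."""
    first, rel, second = conn
    f_pair = pair_counts.get(conn, unseen_frequency)
    return t_score(f_pair, first_counts[(first, rel)],
                   second_counts[(rel, second)], rel_counts[rel])


# Invented counts for illustration only.
pairs = prune_singletons({
    ("associate_V", "padv", "with_PREP"): 30,
    ("associate_V", "padv", "to_PREP"): 1,   # dropped by pruning
})
firsts = {("associate_V", "padv"): 40}
seconds = {("padv", "with_PREP"): 3000, ("padv", "to_PREP"): 5000}
rels = {"padv": 100000}

with_score = likelihood(("associate_V", "padv", "with_PREP"), pairs, firsts, seconds, rels)
to_score = likelihood(("associate_V", "padv", "to_PREP"), pairs, firsts, seconds, rels)
print(with_score, to_score)
```

With these toy numbers the pruned "associate to" connection comes out negative and "associate with" positive, which is the contrast the method relies on.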
【００７７】
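A connection with a low (that is, negative) likelihood value is an indicator of a possible error. The likelihood values of the connections in which a word appears are combined into a plausibility value for the word. An implausible word is replaced by members of its confusion set to see whether plausibility improves.
【００７８】
FIGS. 4A and 4B are flowcharts showing the operation of an embodiment of the invention as an error detector and corrector. Input text is supplied at step 10 and analyzed, for example by parsing, at step 11. At step 12, the likelihoods of the connections in the input text are determined. At step 13 the first word in the text is selected, and at step 14 the plausibility of that word is calculated. At step 15 the input text is checked to establish whether all words have been processed; if not, the next word is selected at step 16 and step 14 is repeated.
【００７９】
Once the plausibility of every word in the text has been calculated, the words are sorted in order of increasing plausibility at step 17. The least plausible word is selected at step 18, and if at step 19 its plausibility is not below a first threshold, the method ends at step 20. Otherwise, the confusion set of the word is obtained at step 21 and the first confusable word is selected at step 22. At step 23 the word in question is replaced in the text by the confusable word, and the plausibility of the confusable word in context is calculated at step 24. If an improvement in plausibility is detected at step 25 (the change in plausibility is greater than a second threshold), the confusable word is reported to the user at step 26.
【００８０】
Step 27 checks whether all the confusable words have been tried; if not, the next confusable word is selected at step 28 and operation returns to step 23. Otherwise, step 29 determines whether all the words in the text have been processed; if not, the next word is obtained at step 30 and operation returns to step 19. Otherwise, the method ends at step 31.
【００８１】
In this embodiment, for each word w_i (1 ≤ i ≤ n, the length of the sentence), the set D(w_i) of connections in which w_i appears is determined. Each D(w_i) is then assigned a value λ(w_i), called the "plausibility" of the word, by a function that maps the likelihood values of that set of connections to a single value. The words are ordered by plausibility. If the plausibility of the least plausible word w_λmin falls below a threshold, an attempt is made to find a correction. w_λmin is replaced in turn by each word c_j(w_λmin) (1 ≤ j ≤ m, the number of confusable words in C(w_λmin)), and λ(c_j(w_λmin)) is calculated. Confusable words whose substitution improves the plausibility of the word are suggested to the user. They may be presented in descending order of the improvement produced by the substitution.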
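The loop of FIGS. 4A and 4B can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the plausibility function λ is taken to be the minimum likelihood over a word's connections (the text leaves the combining function open), parsing is assumed to have been done already, and the likelihood table, confusion sets, and all function names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Connection = Tuple[str, str, str]  # (first lemma, relation, second lemma)


def plausibility(word: str, connections: Sequence[Connection],
                 likelihood: Callable[[Connection], float]) -> float:
    """lambda(w): minimum likelihood over D(w). The minimum is an assumed choice."""
    values = [likelihood(c) for c in connections if word in (c[0], c[2])]
    return min(values) if values else float("-inf")  # unattached word: very low


def substitute(connections: Sequence[Connection], old: str, new: str) -> List[Connection]:
    """Replace one lemma in every connection it takes part in (step 23)."""
    return [(new if a == old else a, r, new if b == old else b)
            for a, r, b in connections]


def suggest(connections: Sequence[Connection],
            likelihood: Callable[[Connection], float],
            confusion_sets: Dict[str, Sequence[str]],
            first_threshold: float = 0.0,
            second_threshold: float = 0.0) -> List[Tuple[str, str, float]]:
    """Return (word, confusable word, improvement) triples."""
    words = sorted({w for a, _, b in connections for w in (a, b)})
    scores = {w: plausibility(w, connections, likelihood) for w in words}
    suggestions: List[Tuple[str, str, float]] = []
    for word in sorted(words, key=scores.get):      # step 17: increasing plausibility
        if scores[word] >= first_threshold:         # step 19: nothing implausible left
            break
        found = []
        for cand in confusion_sets.get(word, ()):   # steps 21 to 28
            new_score = plausibility(cand, substitute(connections, word, cand), likelihood)
            gain = new_score - scores[word]
            if gain > second_threshold:             # step 25
                found.append((word, cand, gain))
        suggestions.extend(sorted(found, key=lambda s: -s[2]))
    return suggestions


# "He wins me at tennis", already parsed; likelihood values are invented.
sentence = [("he_PRON", "subj", "win_V"), ("win_V", "obj", "me_PRON"),
            ("win_V", "padv", "at_PREP"), ("at_PREP", "pobj", "tennis_N")]
table = {
    ("he_PRON", "subj", "win_V"): 2.0, ("win_V", "obj", "me_PRON"): -3.0,
    ("win_V", "padv", "at_PREP"): 1.5, ("at_PREP", "pobj", "tennis_N"): 3.0,
    ("he_PRON", "subj", "beat_V"): 2.0, ("beat_V", "obj", "me_PRON"): 4.0,
    ("beat_V", "padv", "at_PREP"): 1.0,
}
result = suggest(sentence, lambda c: table.get(c, -5.0),
                 {"win_V": ["beat_V", "lose_V"]})
print(result)
```

In this toy run both "win_V" and "me_PRON" fall below the threshold, but only "win_V" has a confusion set, and only "beat_V" improves on the original.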
【００８２】
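The members of a confusion set may be associated with confusability values representing the likelihood of confusion. For example, counts of how often each word is mistakenly used for another can be obtained from an annotated learner corpus, and real-word misspellings can be associated with a value based on edit distance in sound and/or spelling. Confusable words based on semantic relatedness can be associated with a value based on path length in a hierarchical network.
【００８３】
Where such information is available, suggestions can be presented in a more helpful order by combining confusability and improvement in plausibility into a single score, the substitutability score σ(w_i→c_j(w_i)).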
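As an illustration of a spelling-based confusability value and a combined score, the sketch below uses Levenshtein edit distance and multiplies confusability by the plausibility improvement. Both the mapping 1 / (1 + distance) and the product are assumptions made for the example; the text does not specify how σ is computed. The function names and improvement figures are hypothetical.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def confusability(word: str, candidate: str) -> float:
    """Assumed mapping: closer spellings are more confusable."""
    return 1.0 / (1.0 + edit_distance(word, candidate))


def rank_by_substitutability(word: str,
                             improvements: Dict[str, float]) -> List[Tuple[str, float]]:
    """sigma = confusability x improvement (an assumed combination), best first."""
    scored = [(cand, confusability(word, cand) * gain)
              for cand, gain in improvements.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)


# Improvements in plausibility are invented for illustration.
ranking = rank_by_substitutability("loose", {"lost": 3.6, "lose": 3.0})
print(ranking)
```

Here "lose" outranks "lost" even though its improvement is smaller, because it is only one edit away from "loose".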
【００８４】
ユーザとのインタラクションのセッション中、示唆は、初期的に、ｗ_λｍｉｎを改善させるために、混乱しやすい単語のセットの要素と置換することによって、提供される。ユーザがこれらのうちの１つを受け入れる場合、置換の効果は、その単語に連結されている他の単語にまで伝播し得、ｗ_λｍｉｎの新たな値の計算から手順が繰り返される。伝播の手順は、置換された単語を元の単語とは異なる単語に再び取り付けることを含み得る。
【００８５】
孤立している状態で、ありそうもない連結は、より大きな構造の一部である可能性があり、逆もあり得る。例えば、「ｂｙ　ａｃｃｉｄｅｎｔ」は、非常に強い連語であり、「ｂｙ　ｔｈｅ　ａｃｃｉｄｅｎｔ」は、ありそうになく、潜在的な誤りであると考えられ得る。後者を含む、より多くの、恐らくは正しい構造、例えば、「ｈｏｒｒｉｆｉｅｄ　ｂｙ　ｔｈｅ　ａｃｃｉｄｅｎｔ」がある。
【００８６】
反対に、孤立した「ａ　ｋｎｏｗｌｅｄｇｅ」は、典型的な学習者の誤りであり、「ａ　ｋｎｏｗｌｅｄｇｅ　ｏｆ」は、合理的な表現である。しかし、「ｌｅａｒｎ　ａ　ｋｎｏｗｌｅｄｇｅ　ｏｆ」は、誤りであり得る。
【００８７】
これらの場合は、２つ以上の連結によって結合される、３以上の要素を含む依存性部分グラフの尤度の値を計算することによって処理され得る。実験的な観察は、多くの場合において、３つの要素を越えていくことが不必要であることを示す。上記の場合において、４つの要素の句の尤度は、より小さい単位の尤度まで追跡され得る。例えば、「ｈｏｒｒｉｆｉｅｄ　ｂｙ」は強い連語なので、「ｈｏｒｒｉｆｉｅｄ　ｂｙ　ｔｈｅ　ａｃｃｉｄｅｎｔ」は、ありそうであるが、「ｋｎｏｗｌｅｄｇｅ」は、「ｌｅａｒｎ」の目的語である可能性は低いので、他の要素に関わらず、「ｌｅａｒｎ　ａ　ｋｎｏｗｌｅｄｇｅ　ｏｆ」はありそうにもない。
【００８８】
３つの要素のサブグラフの尤度の値は、各種の方法で計算され得る。１つの方法は、要素のうちの２つと、その間の連結を句の単位として処理し、この句の単位と第３の要素との間の尤度の測定基準を、２つの要素の場合において計算された方法と全く同じ方法で計算することである。
【００８９】
２つまたは３つの要素の連結の尤度の値を、もっともらしさの値へと組合せることは、各種の方式に従って実行され得る。３つの要素の句の寄与を、２つの要素の句の寄与よりも高く重み付けしてもよいし（平滑化方式）、または、２つの要素の句を含む３つの要素の句が頻度におけるある程度の制約および／または尤度を満たさない場合、２つの要素の句のみを考慮してもよい（バックオフ方式）。このような方式に対するパラメータは、経験的に、または、学習手順によって、判定され得るが、学習する特徴は、特定単語が文脈にあるかないかではなく、組合せの強度と頻度である。
【００９０】
基本的な方法が、検出されて訂正され得る誤りの範囲を増大させるため、いくつかの改善させる処理にかけられる。
【００９１】
単語のもっともらしさの計算は、その単語が任意の他の単語に付かないことを示す用語を含み得る。依存性の樹形図の根元になり得る定動詞（または、リストおよびタイトルにおける何らかの他の品詞）の場合を除き、付けられない単語は、常に、誤り（または誤った文法）を示す。従って、非常に低い尤度の値を、無意味な取り付けに割り当てることは、適切であり、これによって、誤り処理がトリガされる。
【００９２】
その後、この方法は、訂正を決定するため、以下に示すように、適用される必要がある。
【００９３】
上述したように、訂正されるテキストの構文解析は、言語全体にわたる優先度要素によって強く影響されない場合、単語は、品詞が適切であれば、概して、結び付けられる。反対に、単語が結び付けられない場合、誤りは、典型的には、同じ品詞の単語の置換によって、訂正可能でない。
【００９４】
誤りは、置換のうちの１つではなく、削除であり得る。例えば、名詞は、自動的な動詞の目的語として結び付けられない。多くの場合において、誤りは、前置詞の挿入によって訂正され得る。名詞が、弱い連結で動詞に結び付けられる場合でも、挿入が適切であり得る。いずれの場合においても、挿入は、誤りが訂正されたか否かをその尤度が判定する、新たな連結の作成を伴う必要がある。
【００９５】
結び付けられることがないことは、カテゴリー変更置換の誤りによっても引き起こされ得る。あるカテゴリーの単語の混乱しやすい単語のセットが、他のカテゴリーの単語を含む場合、置換は、入力の局所的な再構文解析を伴うことを必要とし得る。例えば、学習者が、「ｇｅｔ　ｏｕｔ　ｏｆ　ｔｈｅ　ｂｕｉｌｄｉｎｇ　ｓａｆｅｔｙ」と書く場合、「ｂｕｉｌｄｉｎｇ　ｓａｆｅｔｙ」というつながりが、（ありそうにもない）名詞句として構文解析され得る。名詞「ｓａｆｅｔｙ」についての混乱しやすい単語のセットが、副詞「ｓａｆｅｌｙ」を含む場合、再構文解析は、後者が、動詞「ｇｅｔ　ｏｕｔ」の修飾語句であり、その目的語が、「ｓａｆｅｔｙ」ではなく、「ｂｕｉｌｄｉｎｇ」であることを確立する必要がある。
【００９６】
本発明の方法は、例えば、各単語のもっともらしさの値について、閾値を設定しないことによって、文脈に対して高感度な類義語辞典としても用いられ得る。この場合においては、全ての単語が、もっともらしさに関わらず、置換の候補である。また、置換が、もっともらしさを改善する必要はない。例えば、もっともらしさの値が閾値を越える場合、潜在的な置換が示唆され得る。
【００９７】
本発明の方法は、任意の適切な装置によって行われ得るが、実際には、この方法を行うようにコンピュータを制御するプログラムによってプログラムされたコンピュータによって行われる可能性が高い。図１に、制御部として中央演算処理装置（ＣＰＵ）１を用いる、適切なコンピュータシステム１００を示す。ＣＰＵ１には、例えば、ディスクドライブの形のプログラムメモリ２が接続され、プログラムメモリ２は、磁気ディスクまたは光ディスクの形の格納媒体を含み、また、格納媒体は、ＣＰＵ１を制御するプログラムを含む。プログラムメモリ２が、第１のデータベース３および第２のデータベース４を含んでもよい。
【００９８】
例えば磁気ディスクに格納される第１のデータベース３は、連結および関連付けられる尤度の値を含む。例えば、他の磁気ディスク、または、同じ磁気ディスクに、同様に格納される、第２のデータベース４は、混乱しやすい単語のセットを含む。ランダムアクセスメモリ（ＲＡＭ）５の読み出し／書き込みは、パラメータの一時的な値を保持する、通常の方法で提供される。
【００９９】
ＣＰＵ１には、誤り、不自然な表現などについて調べられるテキストの入力を可能にする入力インターフェース６が接続される。例えば、テキストは、キーボードを介して手動で入力されてもよいし、（例えば、磁気ディスクまたは光ディスクで）既に機械読取り可能な形であってもよい。ＣＰＵ１には、出力インターフェース７も接続され、ユーザがこの方法の出力をモニタすることが可能になる。また、この方法を用いてインタラクトすることを可能にするため、インターフェース６および７が、ユーザに、データ、コマンドなどを入力し、この方法の動作をモニタする設備を提供する。例えば、もっともらしさが改善した混乱しやすい単語の選択が提供される場合、これらは、出力インターフェース７の一部または全てを形成するディスプレイ上に表示され、ユーザは、入力インターフェース６の全てまたは一部を形成する、キーボードおよび／またはマウスを適切に操作することによって、混乱しやすい単語のうちの１つを選択し得る。
【０１００】
本発明は、連結と関連付けられた尤度値とともに、単語間の連結を含むデータベースを提供し、このような連結が正確であるか、または、慣用語法にかなっているかについての尤度の尺度を提供する。尤度の値は、例えば、その言語のネイティブスピーカーによって生成されたテキストの大部分を解析することによって得られる、連結が現れる頻度に基づく。テキストのセクションを、セクション内の１つ以上の単語の起こり得る誤りまたは不自然な使用について調べるため、テキストが、まず解析されて単語間の連結が確立される。解析されたテキストにおける連結の尤度は、データベースから判定される。もっともらしさの値は、その単語が現れる連結の尤度の値を組み合わせることによって、解析されたテキスト内の各単語について計算される。単語は、見出しの単語と混乱しやすい単語のセットを含む他のデータベースに見出しを付けるために用いられる。混乱しやすい単語の各々は、順に選択され、見出しの単語の連結において置換される。これらの新たな連結についての尤度の値が判定され、混乱しやすい単語についてのもっともらしさの値が計算される。誤りを訂正する実施形態において、もっともらしさが閾値より低くなる単語について、混乱しやすい単語が試され、もっともらしさを改善する混乱しやすい単語がユーザに報告される。コンテキストに対して高感度な類語辞典の実施形態において、混乱しやすい単語が、全ての単語について試され、もっともらしさの値が第２の閾値を超える混乱しやすい単語が報告され得る。
【０１０１】
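Although an embodiment applying the invention to English text has been described, the invention is not limited to English and is also applicable to other languages.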
【０１０２】
なお、英語以外の言語（例えば日本語）から、翻訳によって英語テキストのセクションを生成してもよい。
【０１０３】
また、印刷された文献に記載されるテキストを光学文字認識システムを用いて読取って、テキストのセクションを生成してもよい。
【０１０４】
【発明の効果】
本発明によれば、ユーザが書いたものにおける誤りおよび不自然な表現を検出し、言語の使用を改善し得る方法を示唆する方法および装置が提供される。
【０１０５】
本発明によれば、ユーザが書いたものにおける誤りおよび不自然な表現を検出し、これらに対する訂正を示唆することが可能である。本発明は、事実上の単語の綴りの誤りおよび様々な他のタイプの誤りを処理することができる。
【図面の簡単な説明】
【図１】図１は、本発明の実施形態における装置の模式図である。
【図２】図２は、「Ｌｏｖｅ　ｉｓ　ｔｈｅ　ｍｏｓｔ　ｉｍｐｏｒｔａｎｔ　ｃｏｎｄｉｔｉｏｎ　ｆｏｒ　ｍａｒｒｉａｇｅ」という文の依存性構造を示す図である。
【図３】図３は、尤度の値を連結と関連付ける、第１のデータベースの一部分を示す図である。
【図４Ａ】図４Ａは、誤り検出器および訂正器としての本発明の実施形態の動作を示すフローチャートである。
【図４Ｂ】図４Ｂは、誤り検出器および訂正器としての本発明の実施形態の動作を示すフローチャートである。
【符号の説明】
１　ＣＰＵ
２　プログラムメモリ
３　第１のデータベース
４　第２のデータベース
５　ＲＡＭ
６　入力インターフェース
７　出力インターフェース
[0001]
[Technical field of the invention]
The present invention relates to a method and apparatus for correcting and improving word selection and use in natural language texts. The invention also relates to a computer program for programming a computer to perform such a method, a storage medium containing such a program, and a computer programmed by such a program.
[0002]
[Prior art]
At the heart of writing or speaking in a language is choosing which words to use. To help with this choice, people writing in their native language use a thesaurus, while language learners typically use a bilingual dictionary. However, native writers find that the thesaurus gives no detailed information about the contexts in which a synonym is appropriate, learners may select the wrong translation from a bilingual dictionary, and both may misspell a word as another word when concentration or knowledge is lacking.
[0003]
According to an annotated corpus of learner English (see Non-Patent Document 1), the use of an incorrect verb or preposition is the most common type of error, followed by spelling and punctuation errors. For example, a writer may write "associate to" instead of "associate with", "loose one's temper" instead of "lose one's temper", or "wins me at tennis" instead of "beats me at tennis".
[0004]
Heretofore, it was not possible to detect such and other types of errors and suggest corrections for them.
[0005]
Patent Documents 1, 2, and 3 disclose the creation and use of co-occurrence information in parsing and translation.
[0006]
The techniques disclosed in each of Patent Documents 4, 5, 6, 7, 8, 9, and 10 use lists of sets of commonly confused words, for example "hear" and "here", or "to" and "too". The presence of such a word in a text indicates a potential error. These patents describe different approaches to correcting the error.
[0007]
Patent Document 11 discloses a technique that uses a system of rules describing the different contexts that distinguish the uses of confusable words.
[0008]
Patent Documents 12, 13, and 14 disclose systems that assign probabilities to sequences of parts of speech. The probability of the part-of-speech sequence containing a confusable word can be compared with the probability of the sequence containing the word with which it is confused. If the latter is higher than the former, a possible error is reported.
[0009]
Patent Document 15 discloses a system that assigns probabilities to sequences of words and to the misspelling of one word as another, and combines these probabilities to determine whether a word has been misspelled as another word.
[0010]
Patent Documents 16 and 17 disclose systems that associate a word with features representing its context and use a machine learning algorithm to compute, for a particular member of a set of confusable words, a function from the values of the features. When a member of the confusion set appears in a text, this function is used to classify it as correct or incorrect.
[0011]
Non-Patent Document 2 discloses a system that detects errors using an n-gram model of contiguous words. The system can detect previously unseen category-changing and category-preserving errors, but, because of the sequential model, only over a very limited span. Error correction is not described.
[0012]
The system disclosed in Patent Document 18 identifies potential errors in word use from parser failures and resolves them by finding confusable words that lead to a subsequent successful parse.
[0013]
Many measures of the strength or likelihood of a connection are disclosed, for example, in Non-Patent Documents 3 and 4, which also provide comparative evaluations of several measures on particular tasks.
[0014]
An example of parsing text using any suitable parser is disclosed in Non-Patent Document 5.
[0015]
Non-Patent Document 6 discloses a parameter formula used for calculating a likelihood value using a statistical measure.
[0016]
[Patent Document 1]
U.S. Pat. No. 4,916,614
[Patent Document 2]
U.S. Pat. No. 4,942,526
[Patent Document 3]
U.S. Pat. No. 5,406,480
[Patent Document 4]
U.S. Pat. No. 4,674,065
[Patent Document 5]
U.S. Pat. No. 4,868,750
[Patent Document 6]
U.S. Pat. No. 5,258,909
[Patent Document 7]
U.S. Pat. No. 5,537,317
[Patent Document 8]
U.S. Pat. No. 5,659,771
[Patent Document 9]
U.S. Pat. No. 5,799,269
[Patent Document 10]
U.S. Pat. No. 5,907,839
[Patent Document 11]
U.S. Pat. No. 4,674,065
[Patent Document 12]
U.S. Pat. No. 4,868,750
[Patent Document 13]
U.S. Pat. No. 5,537,317
[Patent Document 14]
U.S. Pat. No. 5,799,269
[Patent Document 15]
U.S. Pat. No. 5,258,909
[Patent Document 16]
U.S. Pat. No. 5,659,771
[Patent Document 17]
U.S. Pat. No. 5,907,839
[Patent Document 18]
U.S. Pat. No. 5,999,896
[Non-Patent Document 1]
Nicholls, 1999, "The Cambridge Learner Corpus - Error Coding and Analysis for Writing Dictionaries and other books for English Learners", Summer Workshop on Learner Corpora, Cambridge University Press
[Non-Patent Document 2]
Chodorow and Leacock, "An unsupervised method for detecting grammatical errors", Proceedings of the 1st Annual Meeting of the North American Chapter of the Association for Computational Linguistics, pp. 140-147, 2000
[Non-Patent Document 3]
K. Kageura, 1999, "Bigram Statistics Revisited: a Comparative Examination of some Statistical Measures in Morphological Analysis of Japanese Kanji Sequences", Journal of Quantitative Linguistics, 1999, vol. 6, no. 2, pp. 144-166
[Non-Patent Document 4]
Evert et al., "Methods for the Qualitative Evaluation of Lexical Association Measures", Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, Toulouse, 2001, pp. 188-195
[Non-Patent Document 5]
M. Collins, "Three Generative Lexicalised Models for Statistical Parsing" (Proceedings of the 35th Annual Meeting of the ACL / 8th Conference of the EACL, Madrid, 1997); Sleator and Temperley, "Parsing English with a Link Grammar" (CMU-CS-91-196, Carnegie-Mellon University Dept. of Computer Science, 1991)
[Non-Patent Document 6]
D. Lin's "Automatic Retrieval and Clustering of Similar Words" (COLING-ACL 98, Montreal, Canada, August 1998).
[0017]
[Problems to be solved by the invention]
It is an object of the present invention to provide a method and apparatus for detecting errors and unnatural expressions in user writing and suggesting ways in which language usage can be improved.
[0018]
[Means for Solving the Problems]
The present invention can detect these and other types of error and suggest corrections for them. It can handle real-word spelling errors (e.g., lose / loose) as well as various other types of error.
[0019]
For example, a writer who looks up a word such as "make" in a thesaurus will find many synonyms. These may be divided into groups that share a central meaning. One group may contain synonyms such as "create", "construct", and "establish", but the writer will not find "creates a diversion", "constructs a model", or "establishes a relationship".
[0020]
The present invention allows these to be provided in response to inputs such as "make a diversion", "make a model", or "make a relationship".
[0021]
The present invention makes use of dependencies, or connections, each comprising a relationship between two words or phrases that may co-occur (not necessarily adjacently) in a stretch of language, whether written or spoken, hereinafter referred to as text. A connection can be associated with a measure of strength or likelihood based on how frequently it occurs in a large body of text. A word in a text can be associated with a plausibility value based on the likelihood values of the connections in which it appears. A word that is implausible in the text may be incorrect or unnatural in its context.
[0022]
According to a first aspect of the present invention, there is provided a method of correcting or improving the choice of a first word or phrase in a section of written or spoken text comprising a plurality of words of a first language, the method comprising the steps of: (a) providing a first database of connections between words or phrases of the first language, each connection having at least one associated likelihood value based on the frequency with which the connection occurs in a body of text of the first language; (b) analyzing the section of text so as to establish a first connection between the first word or phrase and a second word or phrase of the section of text, at least one first likelihood value of that connection, and a first plausibility value of the first word or phrase based on the at least one likelihood value; (c) providing a second database in which each of at least one word or phrase is associated with a set of words or phrases with which it may be confused; (d) selecting or calculating, from the second database, a confusable word or phrase as a candidate replacement for the first word or phrase in the section of text; (e) deriving a second plausibility value of the confusable word or phrase based on the likelihood value in the first database of a second connection, the second connection comprising the confusable word or phrase and another word or phrase in the section of text; and (f) selectively providing an indication of the confusable word or phrase based on the calculated plausibility values.
[0023]
The likelihood value of each of the connections in the first database may also be based on the frequency of occurrence of each of the other connections that have the same dependency relation and contain one of its words or phrases.
[0024]
The likelihood value of each of the connections in the first database may also be based on the frequency with which all other connections having the same dependency appear.
[0025]
The likelihood value of each of the connections in the first database may be at least one of mutual information, T-score, Yule's Q coefficient, and log-likelihood.
[0026]
In the step (e), the other word or phrase may be the second word or phrase, and the dependency relation of the second connection may be the same as the dependency relation of the first connection.
[0027]
The step (b) may include establishing a plurality of first connections for a plurality of first words or phrases in the section of text, and the steps (d), (e), and (f) may be performed for each of the first words or phrases.
[0028]
The step (b) may include establishing a connection between non-adjacent words or phrases in the section of the text.
[0029]
The step (d) may include selecting each of the confusing words or phrases in the set of words or phrases that may be confused with the first word or phrase, and the steps (e) and (f) may be performed for each of the confusing words or phrases.
[0030]
The step (f) may include indicating the confusing words or phrases in descending order of their second plausibility values.
[0031]
The steps (d), (e), and (f) may be performed if the first plausibility value is lower than a first threshold.
[0032]
Step (f) may include providing an indication if the second plausibility value, or each of the second plausibility values, exceeds a second threshold.
[0033]
Step (f) may include providing an indication if the second plausibility value is greater than the first plausibility value.
[0034]
The step (b) may include calculating the first plausibility value from the associated likelihood values by a function learned, by a machine learning technique, from an annotated corpus of learner errors.
[0035]
The method may further include replacing a first word in the section of the text with the confusing word.
[0036]
The method may further include generating a section of the text by translation from the second language.
[0037]
The method may further include generating a section of the text from the printed document by optical character recognition.
[0038]
According to a second aspect of the present invention, there is provided a computer program for causing a computer to execute the method according to the first aspect of the present invention.
[0039]
According to a third aspect of the present invention, there is provided a storage medium including a program according to the second aspect of the present invention.
[0040]
This medium may include computer readable media.
[0041]
According to a fourth aspect of the present invention, there is provided a computer including a program according to the third aspect of the present invention.
[0042]
According to a fifth aspect of the present invention, there is provided an apparatus for correcting or improving the selection of a first word or phrase in a section of written or spoken text comprising a plurality of words in a first language, the apparatus including:
a first database of connections between words or phrases of the first language, wherein each connection has at least one associated likelihood value based on the frequency of occurrence of the connection in a body of text in the first language;
a control unit that analyzes the section of text to establish at least one first connection between the first word or phrase and a second word or phrase in the section of text, and a first plausibility value of the first word or phrase based on the at least one likelihood value of the at least one first connection; and
a second database in which each of at least some words or phrases is associated with a set of words or phrases with which it may be confused,
wherein the control unit selects from the second database, or calculates, a confusing word or phrase as a candidate for replacing the first word or phrase in the section of text; derives a second plausibility value for the confusing word or phrase based on the likelihood value, in the first database, of a second connection that includes the confusing word or phrase and another word or phrase in the section of text; and selectively provides an indication of the confusing word or phrase based on the derived plausibility value.
[0043]
By using the likelihood of connections between words, it is possible to provide a technique that is improved over known systems that use only the likelihood of part-of-speech sequences. Such known systems cannot detect or correct errors that preserve the part-of-speech category, and such errors are very common.
[0044]
The improvement is achieved by using a dependency grammar rather than contiguous n-grams (of either words or parts of speech), because a dependency grammar can capture dependencies between words that are not adjacent but still directly affect each other's choice. N-grams can in principle be extended to include such dependencies, but in practice this leads to serious sparse-data problems. By using connections, the data available for calculating statistical likelihood values is gathered into linguistically significant units. In most cases a dependency fragment of at most three elements is sufficient to obtain useful statistics, whereas even a contiguous n-gram of four elements fails to separate likely word combinations from unlikely ones in many cases.
[0045]
An important consequence of restricting the statistics to linguistically significant units is that the probability values are easier to interpret in the way required for finding errors. To see this, consider the significance of the transition probability between adjacent words in a contiguous word bigram model. Within a constituent, for example between "big" and "dog" in "a big dog", the transition probability can be compared directly with that of similar adjective-noun sequences. However, the transition probability between "dog" and "a" in "give the dog a bone" only expresses how likely (or unlikely) it is that a constituent ending with "dog" is followed by a constituent starting with "a". It does not represent the probability that a constituent headed by "give" takes a second object headed by "bone", and so it cannot be compared with possible alternatives such as "give the dog a clone".
[0046]
That is, in a contiguous n-gram model, a low transition probability may reflect either a linguistically uninteresting fact or a linguistically significant improbability, so it cannot be used as a direct indication of a potential error. If a contiguous n-gram based system treated every low probability as a trigger for error handling, it would detect a large number of potential "errors", many of which are not actual errors. Processing these is expensive and carries the risk that such spurious errors are classified as real errors.
[0047]
This is why none of the known techniques uses a low transition probability as the trigger for error handling. Instead, they use the presence in the text of particular words that are known to be confusing as the trigger, and then compare the likelihood of the original sequence with the likelihood obtained by replacing the word.
[0048]
In contrast, in the present technique, "low likelihood" is a more robust error indicator. Any improbable connection can contribute to triggering error handling, and only improbable connections do so. Of course, an improbable connection is not always an error, but in the present technique such false triggers are far fewer.
[0049]
Furthermore, if the only trigger for error handling is the presence in the text of an element of some set of confusing words, as in many known techniques, then adding an element to a set of confusing words increases both the number of times error handling is triggered and the computational cost of considering each element.
[0050]
If, as in the present invention, the trigger for error handling is the likelihood of the connections and the resulting plausibility of the words, a much wider range of errors can be handled. The concept of confusion is then not limited to frequent confusions of spelling and pronunciation.
[0051]
Known techniques that use a learning algorithm, and that use the presence of a word known to be confusing as the trigger for error handling, have no way of detecting potentially erroneous words other than applying the learning algorithm to classify the word. Further, like known n-gram based techniques, such learning systems do not fully benefit from gathering the data into linguistically significant units.
[0052]
The technique of the present invention also represents an improvement over known techniques based on parsing failure. Parsing failure is a very coarse detection mechanism for lexical errors, particularly lexical errors that involve replacement by a word of the same part of speech. In contrast, the technique of the present invention provides a very fine-grained quantitative judgment, even for very short sentence fragments, and it includes parsing failure, as indicated by the absence of an attachment, as a particular case of extremely low likelihood. Furthermore, a successful parse (a coarse indication that the error has been corrected) can be replaced by a fine-grained quantitative measure of the improvement obtained.
[0053]
BEST MODE FOR CARRYING OUT THE INVENTION
The present invention will be further described, by way of example, with reference to the accompanying drawings.
[0054]
The present invention provides methods and apparatus for detecting errors and unnatural expressions in a user's writing and for suggesting ways in which the use of the language may be improved. These techniques may also be used as a context-sensitive thesaurus that suggests expressions that are similar in meaning to a given input expression and that fit its context. A statistical dependency model of word combinations is used as the basis for detecting errors and for checking replacements. This solves some problems of known methods, which use either a contiguous n-gram model or a set of unanalyzed features. These techniques can also offer a much wider range of replacement candidates. Because the detection of errors does not rely on detecting particular words that are known to be error-prone, errors that have not been encountered before can also be detected and corrected.
[0055]
The present invention uses two types of relationship between words. One type holds between two words at different positions in a sentence. These are dependency relations such as "subject of", "object of", and "modifier of", examples of which are shown in FIG. 2. FIG. 2 shows the result of analyzing the sentence "Love is the most important condition for marriage". Words are represented by their uninflected form and part of speech, that is, as lemmas. Thus "is" is represented as "be_V". The subject of this verb is identified as "love_N" and its object as "condition_N". The latter is specified by "the_DET" and modified by "important_ADJ". "most_ADV" is identified as an adverb modifying "important_ADJ". "for_PREP" is identified as a preposition modifying "condition_N", and "marriage_N" is identified as the object of the preposition "for_PREP". A triple consisting of two lemmas and the dependency relation that links them is called a connection.
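As an illustration only, the analysis of the example sentence can be written down as a list of connection triples. The sketch below is in Python; the class name `Connection`, the relation labels (`subj_of`, `obj_of`, `spec_of`, `pobj_of`) and the ordering <dependent, relation, head> are assumptions made for the example and are not taken from FIG. 2 (only `mod_of` appears elsewhere in this description).

```python
from typing import NamedTuple


class Connection(NamedTuple):
    """A connection triple: <first lemma, relation, second lemma>."""
    first: str      # dependent lemma (assumed ordering)
    relation: str   # dependency relation label (hypothetical names)
    second: str     # head lemma


# "Love is the most important condition for marriage"
parse = [
    Connection("love_N", "subj_of", "be_V"),
    Connection("condition_N", "obj_of", "be_V"),
    Connection("the_DET", "spec_of", "condition_N"),
    Connection("important_ADJ", "mod_of", "condition_N"),
    Connection("most_ADV", "mod_of", "important_ADJ"),
    Connection("for_PREP", "mod_of", "condition_N"),
    Connection("marriage_N", "pobj_of", "for_PREP"),
]


def connections_of(word, connections):
    """Return every connection in which the lemma `word` takes part."""
    return [c for c in connections if word in (c.first, c.second)]


for c in connections_of("condition_N", parse):
    print(c)
```

Storing each connection as a flat triple keeps the later steps (counting frequencies, looking up likelihood values, substituting a confusing word) simple dictionary operations.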
[0056]
The other type of relationship includes relationships defined as "possible replacements for", ie, relationships between alternative word choices at a given position in a sentence. Some examples of substitution relationships are as follows.
[0057]
・ Semantically related words, such as synonyms, antonyms, narrower terms, and broader terms
・ Spelling errors that produce another word of the language, such as "lose" written as "loose" (a special case is homophones, such as "pane" and "pain", which are pronounced the same but spelled differently)
・ Derivationally related words, that is, words formed in different ways from one root (e.g., "interested" and "interesting", or "safe" and "safety")
・ Confusions across languages, which relate words that are alternative translations of a single word in another language (e.g., "mark" and "brand", both of which can translate one French word)
・ False friends, which relate a word to the inappropriate translation of a cognate word in another language (e.g., "current" and "actual", which are respectively the correct and incorrect translations of the French "actuel")
・ Insertion and deletion errors, which can be regarded as replacement of, or by, an empty word (e.g., "he rang (at) the doorbell", "we paid (for) our meals")
If the use of a word w in a sentence is incorrect, or is otherwise considered unconventional or awkward, each element of a set of words called the set of confusing words of w, C(w), is considered as a possible replacement. The set of confusing words of w is derived from the words related to w by the relationships listed above. However, its actual members may vary depending on the user's native language, the user's level of competence in the language being written, and other factors.
[0058]
Dependency relations are a widely used means of expressing sentence structure. Many variants exist, but the differences are largely insignificant in the context of the present technique. A dependency relation links two words, called the dependent and the head. In a typical formulation, a dependent may not be linked to more than one head, whereas a head may have any number of dependents. This and other constraints, such as the prohibition of cycles, ensure that the relations in a sentence form a tree. As used herein, a connection (also called a link) between two words in a sentence is represented by a triple:
<first lemma, relation, second lemma>
Here, a lemma is a form such as "chase_V" that stands for all forms of the verb "to chase", namely "chase", "chases", "chased", and "chasing".
[0059]
Connections can be associated with many measures of strength or likelihood. The frequency of a connection, that is, the number of times it is found in a parsed corpus, is only a coarse way of assessing its strength. More accurate measures quantify how far the frequency of a connection deviates from what would be expected from the frequencies of its component parts. Some such measures are known from Non-Patent Documents 3 and 4 described above. Some of these measures have applications in word segmentation, parsing, translation, information retrieval, and lexicography. In those applications, typically only connections that are much more likely than expected are of interest. The techniques of the present invention, however, are also concerned with connections that are much less likely than expected. The detection of such a connection in a text indicates that the text is grammatically incorrect or departs from the conventional usage of the language.
[0060]
Words that appear in one or more improbable concatenations can in turn be replaced by each element of the set of confusing words, and the result of making each such substitution can be evaluated for plausibility. If one or more elements of the set of confusing words are sufficiently plausible, these elements can be suggested as replacements.
[0061]
As a preliminary step, a database of likelihood values for word combinations is built by analyzing a large body of native-speaker text according to a dependency grammar. Any suitable parser can be used, and a suitable example is disclosed in [5] above. The analyzer need not be a parser in the usual sense, and may instead use finite-state or similar techniques augmented with a mechanism for recording dependencies.
[0062]
The frequency of each type of connection is counted, a likelihood value for each is calculated according to one or more statistical measures such as mutual information, T-score, and log-likelihood, and the results are stored in a table. FIG. 3 shows some entries in such a database.
[0063]
In FIG. 3, the first column shows the connection itself. The column headed "frequency" gives the number of times the connection appears in the parsed corpus (here, about 80 million words of the British National Corpus). The remaining columns give the mutual information, T-score, Yule's Q coefficient, and log-likelihood, respectively. Each of these is a different measure, calculated from the frequencies of the following four items:
<first lemma, relation, second lemma>
<first lemma, relation, *>
<*, relation, second lemma>
<*, relation, *>
Here, "*" stands for any lemma. The formulas for these measures are disclosed in Non-Patent Document 6 above. The measures have different ranges and are sensitive in different ways to the exact values of the four parameters. In each case, however, the value correlates with the likelihood of the connection. Positive values indicate that the combination is more likely than chance, and negative values indicate that it is less likely.
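The way the four frequencies feed the measures can be sketched in Python as follows. The formulas in Non-Patent Document 6 are not reproduced in this text, so the sketch assumes commonly used formulations: pointwise mutual information, the t-score (O − E)/√O, Yule's Q from the 2×2 contingency table, and a log-likelihood ratio that is given a negative sign when the connection is rarer than expected. The function name and argument names are hypothetical.

```python
import math


def association_measures(f_xy, f_x, f_y, n):
    """Association measures for one connection (assumed standard formulas).

    f_xy: frequency of <first lemma, relation, second lemma>  (must be > 0)
    f_x:  frequency of <first lemma, relation, *>
    f_y:  frequency of <*, relation, second lemma>
    n:    frequency of <*, relation, *>
    """
    expected = f_x * f_y / n  # frequency expected if the two lemmas were independent

    mutual_information = math.log2(f_xy / expected)
    t_score = (f_xy - expected) / math.sqrt(f_xy)

    # 2x2 contingency table for Yule's Q
    a = f_xy
    b = f_x - f_xy
    c = f_y - f_xy
    d = n - f_x - f_y + f_xy
    denom = a * d + b * c
    yules_q = (a * d - b * c) / denom if denom else 0.0

    def term(observed, exp):
        return observed * math.log(observed / exp) if observed > 0 else 0.0

    log_likelihood = 2.0 * (
        term(a, expected)
        + term(b, f_x * (n - f_y) / n)
        + term(c, (n - f_x) * f_y / n)
        + term(d, (n - f_x) * (n - f_y) / n)
    )
    if f_xy < expected:  # assumption: sign the statistic so that "rarer than chance" is negative
        log_likelihood = -log_likelihood

    return {
        "mutual_information": mutual_information,
        "t_score": t_score,
        "yules_q": yules_q,
        "log_likelihood": log_likelihood,
    }


# Illustrative counts only (not taken from FIG. 3)
print(association_measures(f_xy=10, f_x=100, f_y=100, n=10_000))
```

All four values are positive when the connection occurs more often than expected, zero at independence, and negative when it occurs less often, which matches the interpretation given above.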
[0064]
For example, the t score of <associate_V padv to_PREP> is calculated as follows.
[0065]
(Equation 1)

Here, f (associate_V padv to_PREP) = F
The parsing of the native-speaker corpus must be accurate and have as wide a coverage as possible in order to obtain high-quality likelihood estimates for word combinations. However, accurate parsing itself requires access to high-quality likelihood estimates for word combinations, which creates a circularity. This circularity can be resolved by an iterative, or bootstrapping, approach, which relies on certain properties of the parsing algorithm.
[0066]
Each individual connection in a sentence is associated with a priority value. The priority value is a measure of the confidence that such a connection exists between two words in the sentence. The priority value is a function both of sentence-specific factors, such as part-of-speech probabilities and the separation of the words, and of language-wide factors, such as the strength of association between the words.
[0067]
The parsing algorithm returns a set of connections that together satisfy the axioms of dependency structure (that is, connections do not cross, no word depends on more than one head, and so on). However, this set is not required to form a single connected tree.
[0068]
The relative contribution of sentence-specific factors and language-wide factors to priority values can vary with appropriate parameter settings.
[0069]
The threshold may be set so that only those connections whose priority value exceeds the threshold are returned.
[0070]
The iterative nature of the parsing algorithm is explained by considering the parsing of the very simple phrase "world title fight".
[0071]
Syntactically, "title" must modify "fight", but it is unclear whether "world" modifies "title" or "fight". According to English syntax, in a series of nouns each noun other than the last may modify any noun to its right. In this case, knowledge of the strength of the particular word combinations allows one to conclude that "world" modifies "title". In other cases, for example "plastic baby pants", the first noun modifies the last noun rather than the noun that immediately follows it.
[0072]
A complete parse yields the following connections:
1. <title_N, mod_of, fight_N>
2. <world_N, mod_of, title_N>
In the first iteration of parsing the native-speaker corpus, language-wide factors make no contribution to the priority values, because no likelihood values for connections between particular words are yet available. The priority threshold is set high so that, for example, words with ambiguous parts of speech or words that are far apart are left unconnected, and the connections that are returned are highly reliable. In this example, only connection 1 is returned, because the penultimate noun in a series of nouns must modify the last noun regardless of language-wide factors. Without language-wide information, neither connection 2 nor the incorrect <world_N, mod_of, fight_N> has a sufficiently high priority to be returned in this case. However, connections are returned for other instances of "world title" (and "world fight") in the corpus that are not followed by another noun.
[0073]
Thereafter, likelihood values are calculated from these highly reliable connections. Subsequent iterations can begin to use these language-wide factors in determining priority, so the priority threshold can be lowered. This increases the number of connections returned (the coverage of the parse) and allows more accurate likelihood statistics to be calculated. In this example, the relative frequencies and/or likelihoods of <world, mod_of, title> and <world, mod_of, fight> lead to the former being chosen over the latter. Further iterations continue to increase the contribution of the language-wide factors to the priority and to lower the priority threshold. In this way, the coverage and reliability of the likelihood data are gradually improved.
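The bootstrapping idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the parsing algorithm itself: the priority is assumed to be a weighted mix of a sentence-specific score and a language-wide score, both scaled to the range 0 to 1, and all numeric values, function names, and parameter names are hypothetical. The sketch does not enforce the dependency axioms.

```python
def priority(local_score, global_score, global_weight):
    """Mix sentence-specific and language-wide evidence (assumed linear mix)."""
    return (1.0 - global_weight) * local_score + global_weight * global_score


def select_connections(candidates, global_scores, global_weight, threshold):
    """Return the candidate connections whose priority exceeds the threshold."""
    kept = []
    for connection, local_score in candidates:
        global_score = global_scores.get(connection, 0.0)
        if priority(local_score, global_score, global_weight) > threshold:
            kept.append(connection)
    return kept


# Candidate connections for "world title fight" with hypothetical local scores
candidates = [
    (("title_N", "mod_of", "fight_N"), 0.9),  # penultimate noun modifies the last noun
    (("world_N", "mod_of", "title_N"), 0.5),  # ambiguous attachment
    (("world_N", "mod_of", "fight_N"), 0.5),  # ambiguous attachment
]

# Iteration 1: no language-wide statistics yet, high threshold
first_pass = select_connections(candidates, {}, global_weight=0.0, threshold=0.8)

# Later iteration: statistics gathered from reliable connections elsewhere in the corpus
global_scores = {
    ("title_N", "mod_of", "fight_N"): 0.8,
    ("world_N", "mod_of", "title_N"): 0.9,
    ("world_N", "mod_of", "fight_N"): 0.1,
}
later_pass = select_connections(candidates, global_scores, global_weight=0.5, threshold=0.6)

print(first_pass)
print(later_pass)
```

With these illustrative values the first pass keeps only connection 1, and the later pass adds <world_N, mod_of, title_N> while still rejecting <world_N, mod_of, fight_N>.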
[0074]
After each iteration of parsing the native speaker corpus, each type of likelihood value is determined and entered in a database.
[0075]
Once a sufficiently accurate database has been prepared, or obtained by some other means, it can be used in the present invention. The text to be checked for problems is subjected to one iteration of the parsing procedure described above. The contribution of the language-wide factors to this parse can be reduced, because these factors, namely the likelihood values of the connections, are taken into account in the next stage.
[0076]
Thereafter, the likelihood value of each connection in the text is determined by consulting the database derived from the native-speaker corpus. Connections not found in that corpus can be assigned likelihood values by assuming that they are quite infrequent. In an exemplary embodiment, all connections seen only once in the native-speaker corpus are discarded, which significantly reduces the size of the data. Connections not found in the database are then assumed to have a frequency in the range 0 to 2, the optimal value being determined empirically, and their likelihood values are calculated accordingly.
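A minimal Python sketch of this lookup is given below. It assumes the t-score in the common form (O − E)/√O as the likelihood measure, assumes that the marginal totals are counted before singletons are discarded, and uses an assumed frequency of 0.5 for unseen connections. All counts, names, and parameters are hypothetical and are not taken from the British National Corpus.

```python
import math


def prune_singletons(counts):
    """Discard connections seen only once, to reduce the size of the database."""
    return {conn: f for conn, f in counts.items() if f > 1}


def t_score(f_xy, f_x, f_y, n):
    """t-score in the common (observed - expected) / sqrt(observed) form."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)


def lookup_t_score(conn, counts, first_totals, second_totals, rel_totals,
                   unseen_freq=0.5):
    """Likelihood of a connection; unseen connections get an assumed small frequency."""
    first, rel, second = conn
    f_xy = counts.get(conn, unseen_freq)
    return t_score(f_xy,
                   first_totals[(first, rel)],
                   second_totals[(rel, second)],
                   rel_totals[rel])


# Hypothetical counts, for illustration only
raw_counts = {
    ("associate_V", "padv", "with_PREP"): 300,
    ("associate_V", "padv", "upon_PREP"): 1,
}
counts = prune_singletons(raw_counts)
first_totals = {("associate_V", "padv"): 500}
second_totals = {("padv", "with_PREP"): 30_000, ("padv", "to_PREP"): 20_000}
rel_totals = {"padv": 1_000_000}

seen = lookup_t_score(("associate_V", "padv", "with_PREP"),
                      counts, first_totals, second_totals, rel_totals)
unseen = lookup_t_score(("associate_V", "padv", "to_PREP"),
                        counts, first_totals, second_totals, rel_totals)
print(seen, unseen)
```

With these illustrative counts the attested connection receives a positive value and the unseen one a negative value, which is what later triggers error handling.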
[0077]
Connections with low (that is, negative) likelihood values are indicators of possible errors. The likelihood values of the connections in which a word appears are combined into a plausibility value for that word. Implausible words are replaced in turn by the elements of their sets of confusing words to see whether plausibility improves.
[0078]
FIGS. 4A and 4B are flowcharts illustrating the operation of an embodiment of the present invention as an error detector and corrector. The input text is provided at step 10 and analyzed at step 11, for example by parsing. In step 12, the likelihood values of the connections in the input text are determined. In step 13, the first word in the text is selected, and in step 14, the plausibility of this word is calculated. In step 15, the input text is examined to establish whether all words have been processed; if not, the next word is selected in step 16 and step 14 is repeated.
[0079]
Once the plausibility of every word in the text has been calculated, the words are sorted in step 17 in order of increasing plausibility. The least plausible word is selected in step 18. If, in step 19, its plausibility is not below the first threshold, the method ends in step 20. Otherwise, the set of confusing words is obtained at step 21 and the first confusing word is selected at step 22. In step 23, the word of interest is replaced in the text by the confusing word, and the plausibility of the confusing word in context is calculated in step 24. If an improvement in plausibility is detected in step 25 (the change in plausibility is greater than a second threshold), the confusing word is reported to the user in step 26.
[0080]
Step 27 checks whether all of the confusing words have been tried. If not, the next confusing word is selected in step 28 and operation returns to step 23. If so, step 29 determines whether all words in the text have been processed; if not, the next word is obtained in step 30 and operation returns to step 19. If so, the method ends at step 31.
[0081]
In this embodiment, for each word w_i (1 ≦ i ≦ n, where n is the sentence length), the set D(w_i) of connections in which w_i appears is determined. Each D(w_i) is then assigned a value λ(w_i), called the "plausibility" of the word, by a function that maps the likelihood values of a set of connections to a single value. The words are ordered according to plausibility. If the plausibility of the worst word, w_λmin, falls below a threshold, a correction is sought: w_λmin is replaced in turn by each word c_j(w_λmin) (1 ≦ j ≦ m, where m = |C(w_λmin)|) of its set of confusing words, and λ(c_j(w_λmin)) is calculated. Confusing words whose substitution improves the plausibility are suggested to the user. They can be presented in descending order of the improvement that results from the replacement.
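The procedure of FIGS. 4A and 4B can be sketched in Python as follows. This is an illustration under stated assumptions: λ is taken to be the minimum likelihood over a word's connections (the text only requires some function from a set of likelihood values to one value), each lemma is assumed to occur once in the sentence, and the likelihood values, the default for unseen connections, and the set of confusing words are invented for the example.

```python
def plausibility(word, connections, likelihood):
    """lambda(w): here, the minimum likelihood of the connections containing w."""
    values = [likelihood(c) for c in connections if word in (c[0], c[2])]
    return min(values) if values else None


def substitute(connections, old, new):
    """Replace the lemma `old` by `new` in every connection."""
    return [(new if c[0] == old else c[0], c[1], new if c[2] == old else c[2])
            for c in connections]


def suggest(connections, likelihood, confusing, first_threshold=0.0,
            second_threshold=0.0):
    """Return (word, replacement, improvement) triples for implausible words."""
    words = {w for c in connections for w in (c[0], c[2])}
    # Steps 13-17: score every word and sort by increasing plausibility
    scored = sorted((plausibility(w, connections, likelihood), w) for w in words)
    suggestions = []
    for lam, word in scored:
        if lam >= first_threshold:      # step 19: remaining words are plausible enough
            break
        improved = []
        for cand in confusing.get(word, ()):             # steps 21-28
            trial = substitute(connections, word, cand)  # step 23
            gain = plausibility(cand, trial, likelihood) - lam  # step 24
            if gain > second_threshold:                  # step 25
                improved.append((gain, cand))
        for gain, cand in sorted(improved, reverse=True):  # best improvement first
            suggestions.append((word, cand, gain))       # step 26
    return suggestions


# Hypothetical likelihood table; unseen connections default to a low value
table = {
    ("associate_V", "padv", "to_PREP"): -5.0,
    ("associate_V", "padv", "with_PREP"): 12.0,
}
likelihood = lambda conn: table.get(conn, -6.0)
confusing = {"to_PREP": ["with_PREP", "for_PREP"]}
sentence = [("associate_V", "padv", "to_PREP")]

print(suggest(sentence, likelihood, confusing))
```

Taking the minimum makes a single very unlikely connection enough to flag a word; an average or a learned function could be substituted without changing the rest of the loop.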
[0082]
Elements of the set of confusing words may be associated with a confusability value that represents the likelihood of confusion. For example, the number of times each word is wrongly used in place of another can be obtained from an annotated learner corpus, and real-word spelling errors can be associated with a value based on the edit distance in pronunciation and/or spelling. Confusing words based on semantic relatedness may be associated with a value based on the length of the path between them in a hierarchical network.
[0083]
If such information is available, the suggestions can be presented in a still more helpful order, according to a single score that combines confusability with the improvement in plausibility, namely the replaceability score σ(w_i → c_j(w_i)).
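One possible way to combine the two quantities is sketched below in Python. The text does not specify the form of σ, so the sketch makes two assumptions: confusability is derived from spelling edit distance as 1 / (1 + distance), and the replaceability score is the product of confusability and the improvement in plausibility. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two spellings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def confusability(word, candidate):
    """Assumed mapping: closer spellings are more easily confused."""
    return 1.0 / (1.0 + edit_distance(word, candidate))


def replaceability(word, candidate, improvement):
    """Assumed form of sigma(w -> c): confusability times plausibility gain."""
    return confusability(word, candidate) * improvement


# Hypothetical improvements in plausibility for two candidates replacing "loose"
candidates = {"lose": 4.0, "release": 5.0}
ranked = sorted(candidates,
                key=lambda c: replaceability("loose", c, candidates[c]),
                reverse=True)
print(ranked)
```

With these illustrative values "lose" is ranked first: its gain is smaller, but it is much closer in spelling to "loose".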
[0084]
During an interactive session with the user, suggestions are first offered for w_λmin, namely the elements of its set of confusing words that improve its plausibility. If the user accepts one of these, the effect of the replacement may be propagated to the other words connected to that word, and the procedure is repeated from the calculation of the new w_λmin. The propagation procedure may include re-attaching the replacement word to a word other than the one to which the original word was attached.
[0085]
A connection that is unlikely in isolation may be part of a larger structure that is likely, and vice versa. For example, "by accident" is a very strong collocation, so "by the accident" is unlikely and could be considered a potential error. However, there are larger structures containing the latter that may well be correct, for example "horrified by the accident".
[0086]
Conversely, "a knowledge" in isolation is a typical learner's error, whereas "a knowledge of" is a reasonable expression. However, "learn a knowledge of" may again be incorrect.
[0087]
These cases can be handled by calculating likelihood values for dependency subgraphs that contain three or more elements joined by two or more connections. Experimental observations show that in many cases it is unnecessary to go beyond three elements. In the cases above, the likelihood of the four-element phrase can be traced to the likelihood of a smaller unit. For example, "horrified by the accident" is likely because "horrified by" is a strong collocation, whereas "learn a knowledge of" is unlikely because "knowledge" is unlikely to be the object of "learn".
[0088]
The likelihood value of a three-element subgraph can be calculated in various ways. One method treats two of the elements, together with the connection between them, as a phrasal unit, and calculates the likelihood measure between this unit and the third element in exactly the same way as for two elements.
[0089]
The likelihood values of two-element and three-element connections may be combined into a plausibility value according to various schemes. The contribution of three-element phrases may be weighted more heavily than that of two-element phrases (a smoothing scheme), or two-element phrases may be considered only when the three-element phrases that contain them fail to meet certain constraints on frequency and/or likelihood (a back-off scheme). The parameters of such a scheme can be determined empirically or by a learning procedure, but the features to be learned are the strength and frequency of the combinations, not the presence or absence of particular words in the context.
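The two schemes can be sketched in Python as follows. The weighting, the minimum-frequency constraint, and all numeric values are assumptions chosen for illustration; in practice the parameters would be set empirically or learned, as stated above.

```python
def smoothed(two_element_values, three_element_values, three_weight=2.0):
    """Smoothing scheme: weighted mean, with three-element phrases weighted higher."""
    total = sum(two_element_values) + three_weight * sum(three_element_values)
    weight = len(two_element_values) + three_weight * len(three_element_values)
    return total / weight if weight else None


def backed_off(two_element_value, three_element_value, three_element_freq,
               min_freq=5):
    """Back-off scheme: trust the three-element phrase only if it is frequent enough."""
    if three_element_value is not None and three_element_freq >= min_freq:
        return three_element_value
    return two_element_value


# Hypothetical values: "by the accident" is unlikely as a two-element connection,
# while the three-element phrase "horrified by the accident" is likely.
by_accident = -3.0
horrified_by_accident = 4.0

print(smoothed([by_accident], [horrified_by_accident]))
print(backed_off(by_accident, horrified_by_accident, three_element_freq=12))
print(backed_off(by_accident, horrified_by_accident, three_element_freq=2))
```

In the back-off variant the larger phrase overrides the smaller one only when it has been seen often enough for its statistics to be trusted.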
[0090]
Several refinements may be made to the basic method to increase the range of errors that can be detected and corrected.
[0091]
The calculation of the plausibility of a word may include a term indicating that the word is not attached to any other word. Except for finite verbs (or certain other parts of speech in lists and titles), which can be the root of a dependency tree, an unattached word always indicates an error (or a failure of the grammar). It is therefore appropriate to assign a very low likelihood value to a missing attachment, so that it triggers error handling.
[0092]
The method then needs to be adapted to determine the correction, as described below.
[0093]
As mentioned above, if the parsing of the text to be corrected is not strongly influenced by the language-wide priority factors, words are generally connected whenever their parts of speech are appropriate. Conversely, if words are not connected, the error typically cannot be corrected by substituting a word of the same part of speech.
[0094]
The error may be a deletion rather than a substitution. For example, a noun is not attached as the object of an intransitive verb. In many such cases the error can be corrected by inserting a preposition. Insertion may be appropriate even if the noun is attached to the verb by a weak connection. In either case, the insertion must involve the creation of a new connection, whose likelihood determines whether the error has been corrected.
[0095]
A missing attachment can also be caused by an incorrect substitution that changes the category. If the set of confusing words of a word in one category includes words in another category, replacement may need to involve local re-parsing of the input. For example, if a learner writes "get out of the building safety", "building safety" may be parsed as an (unlikely) noun phrase. If the set of confusing words of the noun "safety" includes the adverb "safely", the re-parse needs to establish that the latter modifies the verb "get out" and that the object is "building" rather than "safety".
[0096]
The method of the present invention can also be used as a context-sensitive thesaurus, for example by setting no threshold on the plausibility value of each word. In this case, every word is a candidate for replacement, regardless of its plausibility. The replacement also need not improve plausibility. For example, a potential replacement may be indicated if its plausibility value exceeds a threshold.
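The thesaurus mode can be sketched in Python as follows. As in the error-correcting case, the plausibility is assumed to be the minimum likelihood over the word's connections; the likelihood values, the default for unseen connections, and the candidate list are invented for the example, and the function name is hypothetical.

```python
def thesaurus_suggestions(word, connections, likelihood, confusing, threshold):
    """Return replacements for `word` whose plausibility in context exceeds `threshold`.

    No test is made on the plausibility of `word` itself, and the replacement
    is not required to improve on it.
    """
    accepted = []
    for cand in confusing.get(word, ()):
        values = [
            likelihood((cand if c[0] == word else c[0],
                        c[1],
                        cand if c[2] == word else c[2]))
            for c in connections
            if word in (c[0], c[2])
        ]
        if values and min(values) > threshold:
            accepted.append((min(values), cand))
    # Most plausible replacement first
    return [cand for _, cand in sorted(accepted, reverse=True)]


# Hypothetical likelihood values for modifiers of "condition_N"
table = {
    ("important_ADJ", "mod_of", "condition_N"): 5.0,
    ("essential_ADJ", "mod_of", "condition_N"): 6.0,
    ("necessary_ADJ", "mod_of", "condition_N"): 9.0,
    ("big_ADJ", "mod_of", "condition_N"): -2.0,
}
likelihood = lambda conn: table.get(conn, -6.0)
confusing = {"important_ADJ": ["essential_ADJ", "big_ADJ", "necessary_ADJ"]}
context = [("important_ADJ", "mod_of", "condition_N")]

print(thesaurus_suggestions("important_ADJ", context, likelihood, confusing, 0.0))
```

With these illustrative values the plausible alternatives in this context are listed and "big_ADJ" is filtered out, even though the original word was itself acceptable.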
[0097]
The method of the present invention may be performed by any suitable device, but in practice is likely to be performed by a computer programmed with a program that controls the computer to perform the method. FIG. 1 shows a suitable computer system 100 using a central processing unit (CPU) 1 as a control unit. For example, a program memory 2 in the form of a disk drive is connected to the CPU 1, and the program memory 2 includes a storage medium in the form of a magnetic disk or an optical disk, and the storage medium includes a program that controls the CPU 1. The program memory 2 may include a first database 3 and a second database 4.
[0098]
The first database 3, stored for example on a magnetic disk, contains the links and their associated likelihood values. The second database 4, stored for example on another magnetic disk or on the same magnetic disk, contains the sets of confusable words. A read/write random access memory (RAM) 5 is provided in the usual way to hold temporary values of parameters.
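As a rough illustration of what the two databases might hold, the following Python sketch builds a first database from link frequencies and a second database of confusable sets. The counts, the relation label, and the variable names are invented. Pointwise mutual information is used as the likelihood value because mutual information is one of the measures named in the claims; the exact formula here is an assumption, not necessarily the one used in the embodiment.

```python
import math
from collections import Counter

# Hypothetical corpus counts of (head, relation, dependent) links.
link_counts = Counter({
    ("associate", "prep", "with"): 90,
    ("associate", "prep", "to"): 2,
    ("go", "prep", "to"): 400,
    ("go", "prep", "with"): 60,
})


def pmi(link, counts):
    """Pointwise mutual information of a link, relative to all links
    that share the same dependency relation."""
    head, rel, dep = link
    same_rel = {k: v for k, v in counts.items() if k[1] == rel}
    total = sum(same_rel.values())
    joint = same_rel.get(link, 0)
    if joint == 0:
        return float("-inf")             # link never observed
    head_total = sum(v for k, v in same_rel.items() if k[0] == head)
    dep_total = sum(v for k, v in same_rel.items() if k[2] == dep)
    return math.log2(joint * total / (head_total * dep_total))


# First database: link -> likelihood value.
first_database = {link: pmi(link, link_counts) for link in link_counts}

# Second database: word -> set of words it may be confused with.
second_database = {"to": {"with", "at", "for"}, "safety": {"safely"}}

for link, value in first_database.items():
    print(link, round(value, 2))
```

With these counts, "associate with" receives a positive value and "associate to" a strongly negative one, which is the kind of contrast the method relies on.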
[0099]
The CPU 1 is connected to an input interface 6 that allows text to be entered for checking for errors, unnatural expressions, and the like. For example, the text may be entered manually via a keyboard, or may already be in machine-readable form (e.g., on a magnetic or optical disk). An output interface 7 is also connected to the CPU 1 so that a user can monitor the output of the method. To enable interaction with the method, the interfaces 6 and 7 also provide facilities for the user to enter data, commands, and the like, and to monitor the operation of the method. For example, when a choice of confusable words with improved plausibility is offered, these are displayed on a display forming part or all of the output interface 7, and the user can select one of the confusable words by appropriately operating a keyboard and/or mouse forming all or part of the input interface 6.
[0100]
The present invention provides a database containing links between words, together with likelihood values associated with the links, which give a measure of the likelihood that such a link is correct or idiomatic. The likelihood values are based, for example, on the frequency of occurrence of the links, obtained by analyzing a large body of text produced by native speakers of the language. To check a section of text for possible errors or unnatural use of one or more words, the text is first parsed to establish the links between its words. The likelihood values of the links in the parsed text are looked up in the database. A plausibility value is calculated for each word in the parsed text by combining the likelihood values of the links in which the word appears. Each word is then used to look up a headword in another database, which contains sets of words that are confusable with the headwords. Each confusable word is selected in turn and substituted for the headword in the links of the headword. The likelihood values of these new links are determined, and the plausibility value of the confusable word is calculated. In an embodiment that corrects errors, confusable words are tried for words whose plausibility is below a threshold, and confusable words that improve the plausibility are reported to the user. In a context-sensitive thesaurus embodiment, confusable words may be tried for all words, and those whose plausibility value exceeds a second threshold may be reported.
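The error-correcting flow summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is taken to be already parsed into (head, relation, dependent) links, the likelihood values and confusable sets are invented, links missing from the table get an arbitrary low value, and likelihood values are combined by averaging. The sketch also assumes each word occurs only once in the section of text.

```python
from statistics import mean

# Hypothetical first database: link -> likelihood value.
LINKS = {
    ("associate", "prep", "with"): 1.8,
    ("associate", "prep", "to"): -5.1,
    ("with", "pobj", "idea"): 0.9,
    ("to", "pobj", "idea"): 0.4,
}
# Hypothetical second database: headword -> confusable words.
CONFUSABLE = {"to": ["with", "at"]}
UNSEEN = -10.0  # assumed likelihood for links absent from the database


def plausibility(word, links):
    """Average the likelihood values of the links in which `word` appears."""
    values = [LINKS.get(l, UNSEEN) for l in links if word in (l[0], l[2])]
    return mean(values) if values else 0.0


def substitute(links, old, new):
    """Replace `old` by `new` in every link."""
    return [(new if h == old else h, r, new if d == old else d)
            for h, r, d in links]


def suggest(links, threshold=0.0):
    """Return {word: [(candidate, plausibility), ...]} for implausible words."""
    suggestions = {}
    words = {w for h, _, d in links for w in (h, d)}
    for word in sorted(words):
        base = plausibility(word, links)
        if base >= threshold:
            continue                     # word is plausible enough as written
        scored = []
        for cand in CONFUSABLE.get(word, []):
            p = plausibility(cand, substitute(links, word, cand))
            if p > base:                 # report only improvements
                scored.append((cand, p))
        if scored:
            suggestions[word] = sorted(scored, key=lambda x: -x[1])
    return suggestions


# Links for a fragment such as "associate to (an) idea".
parsed = [("associate", "prep", "to"), ("to", "pobj", "idea")]
result = suggest(parsed)
print(result)
```

Here only "to" is both implausible and has confusable alternatives, and "with" is the one alternative that improves its plausibility, so it is the only suggestion reported.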
[0101]
Although an embodiment in which the present invention is applied to English text has been described, the present invention is not limited to English and can be applied to other languages.
[0102]
Note that the section of English text may be generated by translation from a language other than English (for example, Japanese).
[0103]
Also, the section of text may be generated by reading the text of a printed document with an optical character recognition system.
[0104]
[Effect of the invention]
In accordance with the present invention, there is provided a method and apparatus for detecting errors and unnatural expressions in user writing and suggesting ways in which language usage may be improved.
[0105]
According to the present invention, it is possible to detect errors and unnatural expressions in what the user has written and to suggest corrections for them. The present invention can handle spelling errors as well as various other types of errors.
[Brief description of the drawings]
FIG. 1 is a schematic view of an apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a dependency structure of the sentence “Love is the most important condition for marriage”.
FIG. 3 is a diagram illustrating a portion of a first database associating likelihood values with concatenations.
FIG. 4A is a flowchart illustrating the operation of an embodiment of the present invention as an error detector and a corrector.
FIG. 4B is a flowchart showing the operation of the embodiment of the present invention as an error detector and a corrector.
[Explanation of symbols]
1 CPU
2 Program memory
3 First database
4 Second database
5 RAM
6 Input interface
7 Output interface

Claims

A method of correcting or improving the selection of a first word or phrase in a section of written or spoken text comprising a plurality of words of a first language, the method comprising the steps of:
(a) providing a first database of links between words or phrases of the first language, each link having at least one associated likelihood value based on the frequency of occurrence of the link in a body of text of the first language;
(b) analyzing the section of text so as to establish a first link between the first word or phrase of the section of text and a second word or phrase, at least one first likelihood value of the first link, and a first plausibility value of the first word or phrase based on the at least one first likelihood value;
(c) providing a second database in which each of at least one word or phrase is associated with a set of words or phrases with which it may be confused;
(d) selecting or computing, from the second database, a confusable word or phrase as a candidate replacement for the first word or phrase in the section of text;
(e) deriving a second plausibility value of the confusable word or phrase based on the likelihood value, in the first database, of a second link, the second link comprising the confusable word or phrase and another word or phrase in the section of text; and
(f) selectively providing an indication of the confusable word or phrase based on the calculated plausibility value.

The method of claim 1, wherein the likelihood value of each of the links in the first database is also based on the frequency of occurrence of each other link that contains one of the words or phrases with the same dependency relation.

The method of claim 1, wherein the likelihood value of each of the links in the first database is also based on the frequency of occurrence of all other links having the same dependency relation.

The method of claim 1, wherein the likelihood value of each of the links in the first database comprises at least one of mutual information, T-score, Yule's Q coefficient, and log-likelihood.

The method of claim 1, wherein, in step (e), the other word or phrase is the second word or phrase, and the dependency relation of the second link is the same as the dependency relation of the first link.

The method of claim 1, wherein step (b) comprises establishing a plurality of first links of a plurality of first words or phrases in the section of text, and steps (d), (e), and (f) are performed for each of the first links.

The method of claim 1, wherein step (b) comprises establishing links between words or phrases that are not adjacent in the section of text.

The method of claim 1, wherein step (d) comprises selecting each confusable word or phrase of the set of words or phrases, and steps (e) and (f) are performed for each of the confusable words or phrases.

The method of claim 8, wherein step (f) comprises indicating the second plausibility values in descending order of value.

The method of claim 1, wherein steps (d), (e), and (f) are performed if the first plausibility value is below a first threshold.

The method of claim 1, wherein step (f) comprises providing the indication if the or each second plausibility value exceeds a second threshold.

The method of claim 1, wherein step (f) comprises providing the indication if the second plausibility value is greater than the first plausibility value.

The method of claim 1, wherein step (b) comprises calculating the first plausibility value by a function learned, by a machine learning technique, from an annotated corpus of learner errors and the associated likelihood values.

The method of claim 1, further comprising replacing the first word in the section of text with the confusable word.

The method of claim 1, further comprising generating the section of text by translation from a second language.

The method of claim 1, further comprising generating the section of text from a printed document by optical character recognition.

A computer program for causing a computer to execute the method of claim 1.

A storage medium containing the program of claim 17.

The medium of claim 18, comprising a computer-readable medium.

A computer comprising the program of claim 17.

An apparatus for correcting or improving the selection of a first word or phrase in a section of written or spoken text comprising a plurality of words of a first language, the apparatus comprising:
a first database of links between words or phrases of the first language, each link having at least one associated likelihood value based on the frequency of occurrence of the link in a body of text of the first language;
a control unit that analyzes the section of text so as to establish a first link between the first word or phrase of the section of text and a second word or phrase, at least one first likelihood value of the first link, and a first plausibility value of the first word or phrase based on the at least one first likelihood value; and
a second database in which each of at least one word or phrase is associated with a set of words or phrases with which it may be confused,
wherein the control unit selects or computes, from the second database, a confusable word or phrase as a candidate replacement for the first word or phrase in the section of text,
the control unit derives a second plausibility value of the confusable word or phrase based on the likelihood value, in the first database, of a second link, the second link comprising the confusable word or phrase and another word or phrase in the section of text, and
the control unit selectively provides an indication of the confusable word or phrase based on the calculated plausibility value.