JP3895797B2

JP3895797B2 - Conversion candidate generation method

Info

Publication number: JP3895797B2
Application number: JP03486696A
Authority: JP
Inventors: 龍也上原; 和広木村; 佳美齋藤; 達也出羽; 由美水谷
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1996-02-22
Filing date: 1996-02-22
Publication date: 2007-03-22
Anticipated expiration: 2016-02-22
Also published as: JPH09231212A

Description

【０００１】
【発明の属する技術分野】
本発明は、例えば、日本語ワードプロセッサにおいて、未登録語を含む文字列を仮名漢字変換する際の自立語の判別方法に関する。
【０００２】
【従来の技術】
近年、日本語における仮名漢字混じり表記を入力する手段として日本語ワードプロセッサが広く普及している。このような日本語ワードプロセッサにおいては、
予め単語の表記と読み等を記憶した辞書を用い、利用者により入力された仮名文字列を漢字変換するようになっているが、利用者が所望する文書についての漢字仮名混じり表記を効率よく作成するために、入力された仮名文字列を、利用者が意図する漢字仮名漢字混じり表記に正確に変換できることが必要とされる。もし変換できない場合には、変換を誤った部分についての修正を利用者自ら行わなければならず、その修正には多くの労力が必要になる。特に、変換文字列に辞書の未登録語がある場合、未登録語の部分を単漢字変換等で別個に指定しなければならないだけでなく、未登録語の前後にも悪影響を及ぼすことが多い。
【０００３】
この問題を解決するために、従来、２つの段階によって未登録語の影響を減少させる方法が用いられてきた。
１）既登録語から、どの部分が未登録語を含む区間であるか推定する。
【０００４】
２）上記手段によって推定された未登録語がどのような表記を持つか推定する。
１）の方法としては、一般に変換不能になる地点から、解析可能になる地点まで入力をとばし、その部分を未登録語区間として認定する方法が取られている。
【０００５】
基本的には、仮名漢字変換では文節を単位として変換をおこなっているため、１）の方法によって推定された未登録区間には自立語以外にも付属語も含まれる。
【０００６】
一方、２）の段階は、非常に困難な問題であり、現在のところ、カタカナ語の推定にとどまっている。上述したように推定した区間には、付属語が含まれているので付属語の部分はカタカナに変換する必要はない。この問題に対処するために、推定区間の末尾が付属語の可能性がある場合、末尾を付属語として扱ったものも候補とするが行われている。
【０００７】
【発明が解決しようとする課題】
このように、従来の仮名漢字変換方法では、入力された仮名文字列中の未登録文字列について、変換不要な付属語と変換対象の自立語を判別する方法, すなわち、どこまでを自立語とするかを判断することは考慮されておらず、変換候補の生成が困難であった。従って、第１番目に正解が出ないことが多く、利用者が次候補操作をおこなう回数が多くなり、未登録語による利用者の修正の負担が多いという問題点があった。
【０００８】
また、未登録文字列中の自立語を判別する方法は、仮名漢字変換のみならず、機械翻訳を行う上でも有用である。
そこで、本発明は、このような問題点に鑑みてなされたものであり、入力文字列から予め表記等が登録された辞書に登録されていない未登録文字列中の自立語を判別することが可能な自立語判別方法を提供することを目的とする。
【０００９】
【課題を解決するための手段】
本発明の自立語判別方法は、自立語についての表記と文法情報を記憶した自立語辞書を用いて、入力された文字列から前記自立語辞書に登録されていない未登録文字列を含む文節を検出し、付属語についての表記と文法情報を記憶した付属語辞書を用いて、前記検出された文節から付属語を分離して１または複数の自立語候補を抽出し、この抽出された自立語候補のそれぞれについて、前記文節の後方あるいは前方にある語の文法情報に基づき自立語の尢度を算出し、その算出された値をもとに前記文節に含まれる自立語を判別することにより、入力文字列から予め表記等が登録された辞書に登録されていない未登録文字列中の自立語を容易に判別することが可能となる。
【００１０】
【発明の実施の形態】
以下、本発明の実施形態について図面を参照して説明する。
図１は、本実施形態に係る自立語判別方法を適用するワードプロセッサ等の情報処理装置の要部の構成を示したもので、未登録文字列検出部１、付属語検出部２、優先度計算部３、候補生成部４、付属語辞書５、自立語辞書６から構成されている。
【００１１】
なお、以下の説明では、仮名漢字変換の場合を例にとり説明する。
未登録文字列検出部１は、例えば、入力された文字列から自立語についての表記と文法情報等を記憶した自立語辞書６を用いて未登録語を含む文字列（以下、未登録文字列と呼ぶ）を検出するようになっている。
【００１２】
付属語検出部２は、付属語についての表記と文法情報を記憶した付属語辞書５を用いて、未登録文字列検出部１で検出された未登録文字列から付属語を分離して１または複数の自立語候補を検出するようになっている。
【００１３】
優先度計算部３は、自立語辞書６を用いて、未登録文字列の前方あるいは後方にある語の文法情報をもとに、付属語検出部２で検出された複数の自立語候補のそれぞれについて、自立語らしさの指標値（自立語の尢度）である優先度を計算するようになっている。
【００１４】
候補生成部４は、優先度計算部３で計算された優先度をもとに、付属語検出部２で検出された自立語候補から優先度の最も大きいものから順に自立語と判別して切り出していくようになっている。すなわち、仮名漢字変換の場合、計算された優先度の値が大きいものから順に並び替えることにより、仮名漢字変換候補の順位を決定して変換候補テーブルを生成する。
【００１５】
付属語辞書５は、付属語の表記とその文法情報として、それに前接可能な付属語を記憶するようになっている。
自立語辞書６は、自立語とその文法情報を記憶するようになっている。
【００１６】
図２は、付属語辞書５の記憶例を示したものである。図２において、この付属語辞書５に登録されている各付属語には、付属語番号が付されていて、さらに、その付属語に前接可能な付属語の付属語番号が記憶されている。
【００１７】
例えば、付属語番号が「１」の付属語は、「に」で、それに前接可能な付属語は、付属語番号「４」、「５」の付属語、すなわち、「と」と「まで」である。
図３は、自立語辞書６の記憶例を示したものである。図３において、この自立語辞書６に登録されている各自立語には、「読み」と「表記」と「文法情報」が記憶されている。
【００１８】
例えば、表記が「製品」の読みは「せいひん」で、その文法情報は、「名詞」となる。
次に、以上のような構成における自立語判別方法の全体の処理動作を図４に示すフローチャートを参照して説明する。
【００１９】
まず、ステップＡ１において、入力された仮名文字列から自立語辞書６を用いて、この自立語辞書６に登録されていない語を含む未登録文字列を検出する。この処理に関しては、例えば特開昭６０−１４７８６７号公報等によって開示されている手法を流用すればので、ここでは詳しく述べない。
【００２０】
次に、ステップＡ２では、付属語検出部２において、付属語辞書５を用いて、検出された未登録文字列から付属語となり得る語を分離することにより、１または複数の自立語候補を検出する。
【００２１】
この検出された自立語候補のそれぞれについて優先度計算部３で優先度を計算する（ステップＡ３〜ステップＡ４）。
ステップＡ４の優先度計算処理は、優先度計算部３において、未登録文字列の前後の語の文法情報を参照し優先度を計算する。
【００２２】
ステップＡ３で、全ての自立語候補について優先度が計算されたことが判断されると、ステップＡ５に進む。
ステップＡ５では、ステップＡ４によって求められた各自立語候補の優先度のうち、最も大きいものから順に自立語と判別して切り出していくようになっている。すなわち、仮名漢字変換の場合、計算された優先度の値が大きいものから順に並び替えることにより、仮名漢字変換候補の順位を決定して変換候補テーブルを生成する。
【００２３】
次に、図５に示すフローチャートを参照して、図４のステップＡ２における付属語検出処理をより詳細に説明する。
まず、ステップＢ１では、未登録文字列に付属語がなく全て自立語である可能性は常に存在するので、候補の１つとして候補のリストに詰む。
【００２４】
ステップＢ２において、付属語の開始位置を未登録文字列の終端からの文字数で表す変数Ｓに「１」を代入する。
ステップＢ３において、変数Ｓが未登録文字列の長さに等しくなれば、すべての可能性を検査したことになるので終了し、そうでなければ変数Ｓがさし示す位置から未登録文字列の終端までの仮名文字列が付属語列として許されるかを付属語辞書５を用いて判断する（ステップＢ４）。判断方法は一般に形態素解析に用いられている手法を用いればよいので、ここでは詳しく述べない。
【００２５】
ステップＢ４の判定が真ならば、ステップＢ５において変数Ｓが指し示す位置で自立語と付属語を分離した候補を候補のリストに詰み、ステップＢ６へ進む。判定が偽ならば、ステップＢ６へ進み、変数Ｓに「１」を加える。
【００２６】
次に、図６に示すフローチャートを参照して、図４のステップＡ４における優先度計算処理をより詳細に説明する。
まず、ステップＣ１では、ステップＡ２で検出された自立語候補の優先度を表す変数Ｐにデフォルトの点数（ここでは「１００」）を代入する。
【００２７】
ステップＣ２で自立語辞書６を検索して、未登録文字列の直後の自立語の文法情報を得る。もし、文法情報が「非複合語名詞」でかつ、未登録文字列から分離された付属語がなければ（ステップＣ３）、未登録文字列は複合名詞になる可能性が高いので、ステップＣ４にて優先度を下げる（ここでは、「５０」引く）。もし、文法情報が「“と”をとる動詞」で、未登録文字列から分離された付属語が「と」であれば（ステップＣ５）、ステップＣ６にて、優先度をあげる（ここでは、「３０」加える）。最後に、各自立語候補に変数Ｐの表す優先度を付与する（ステップＣ７）。
【００２８】
次に、以上説明したような自立語判別方法について、具体的に説明する。
例えば、「あみーがあすはつばいされる」という仮名文字列が処理の対象であるとする。このとき、未登録文字列検出部１によって「あみーが」の部分が未登録文字列として検出される（ステップＡ１）。
【００２９】
次に、付属語検出部２において、まず「あみーが」という全体を自立語とする候補の１つがつまれる（ステップＢ１）。ついで、変数Ｓの値が１のとき、指し示される付属語列は「が」となり、これは付属語列として認められるので、「あみー（自立語）＋が（付属語）」という候補がつまれる（ステップＢ５）。
【００３０】
そして、優先度計算部３に、この２つの候補が送られ、未登録文字列直後が「明日」だとすると、その文法情報は「非複合語名詞」であるので、「あみーが（自立語）」は、図６のステップＣ３の条件に適合し、優先度は「５０」となる。一方、「あみー（自立語）＋が（付属語）」は、ステップＣ３、Ｃ５のいずれの条件にも適用しないので、優先度は「１００」のままである。したがって、候補生成部４において、「あみー（自立語）＋が（付属語）」が優先され、自立語として「あみー」が判別される。
【００３１】
以上、説明したように、上記実施形態によれば、未登録文字列検出部１は、入力された文字列から自立語についての表記と文法情報等を記憶した自立語辞書６を用いて未登録文字列を検出し、付属語検出部２は、付属語についての表記と文法情報を記憶した付属語辞書５を用いて、未登録文字列検出部１で検出された未登録文字列から付属語を分離して１または複数の自立語候補を検出し、優先度計算部３は、自立語辞書６を用いて、未登録文字列の後方にある語の文法情報をもとに、付属語検出部２で検出された複数の自立語候補のそれぞれについて優先度を計算し、候補生成部４は、優先度計算部３で計算された優先度をもとに、付属語検出部２で検出された自立語候補から優先度の最も大きいものから順に自立語と判別して切り出すことにより、入力文字列から予め表記等が登録された辞書に登録されていない未登録文字列中の自立語を容易に判別することが可能となり、未登録文字列中のもっとも可能性の高い自立語を持つ候補を優先することにより、例えば、仮名漢字変換に用いた場合、入力文字列に未登録語が存在するときでも使用者の候補選択および修正の手間を軽減できる。
【００３２】
なお、本発明は、上記実施形態にのみ限定されず、要旨を変更しない範囲で適宜変形して実施可能である。本実施形態では、優先度計算部３において用いられている文法情報として、名詞の非複合語属性と動詞の付属語「と」に関する接続性を用いているが、これは一例である。さらに、他の品詞の付属語の接続性などを用いることが可能である。また、文法情報のみでなく、共起情報などの意味情報もしくは語用論的情報も用いることが可能である。
【００３３】
また、上記実施形態では、未登録文字列の後方にある語の文法情報のみを用いているが、前方にある語の文法情報も使用可能である。例えば、前方に付属語「が」でおわる文節がある場合、そのあとでは、付属語「が」を取りにくいという性質を利用して、未登録文字列内の自立語の候補を優先付けできる。
【００３４】
また、本実施形態では、未登録文字列の直後の語の情報を用いているが、直前および直後ではなく、離れている文節の情報を用いることも可能である。
また、本実施形態では、未登録文字列を「自立語部＋付属語部」に分離することにより自立語候補を検出しているが、「自立語部＋付属語部＋自立語部＋付属語部」のように、必要に応じて３つ以上に分離してもよい。
【００３５】
また、本実施形態では、入力が仮名文字列と仮定しているが、かな漢字混じり列でもよい。この場合は自立語部に仮名が含まれるときに本発明は有効である。例えば、「老松や不動産にて」というかな漢字混じり列が対象で未登録文字列が「老松や」の場合、自立語部は、「老松」もしくは「老松や」の２つの可能があるが、「不動産」は、直前に固有名詞を伴って使われる場合が多いという文法情報があれば、「老松や」が自立語部である可能性が高いと判断することができる。
【００３６】
また、付属語辞書５および自立語辞書６に関しては、辞書のフォーマットの変更や辞書項目の追加を行っても実施可能である。
また、本実施形態は、ワードプロセッサ等のかな漢字変換を行う情報処理装置に組み込んで、自立語部をカタカナ化もしくは漢字等に置き換えることにより、利用者の文書作成の効率を上げることが可能である。例えば、上述した「あみーがあすはつばいされる」という例に対して、検出された自立語候補のうち優先度の最も高い自立語、すなわち、「あみー」をカタカナ化することにより、候補生成部４で変換候補テーブルを図７に示すように生成して、「アミーが明日販売される」のように変換させることが容易に可能である。自立語部の置きかえは、自立語部に長音が含まれるなどの条件下で行うようにしてもよい。
【００３７】
また、判別された自立語部を記号で囲む、もしくは、フォントを変えるなどして、区別できるよう表示し、辞書登録などに利用することが可能である。例えば、図８のように、判別された未登録の自立語部分を記号で表示し、辞書登録を利用者に促すことも可能である。この場合は、かな漢字変換だけでなく、機械翻訳やＯＣＲ（ＯｐｔｉｃａｌＣｈａｒａｃｔｅｒＲｅａｄｅｒ）などにも応用が可能である。
【００３８】
【発明の効果】
以上説明したように本発明によれば、入力文字列から予め表記等が登録された辞書に登録されていない未登録文字列中の自立語を容易に判別することが可能となり、未登録文字列中のもっとも可能性の高い自立語を持つ候補を優先することにより、例えば、仮名漢字変換に用いた場合、未登録語による利用者の修正の手間を軽減させるという効果を生じる。
【図面の簡単な説明】
【図１】本発明の一実施形態に係る実施例に係る自立語判別方法を適用するワードプロセッサ等の情報処理装置の要部の構成を示したブロック図。
【図２】付属語辞書の記憶例を示した図。
【図３】自立語辞書の記憶例を示した図。
【図４】自立語判別方法の概略を説明するためのフローチャート。
【図５】付属語検出部の処理動作を説明するためのフローチャート。
【図６】優先度計算部の処理動作を説明するためのフローチャート。
【図７】候補生成部で生成される変換候補テーブルの具体例を示した図。
【図８】未登録文字列から判別された自立語の表示例を示した図。
【符号の説明】
１…未登録文字列検出部、２…付属語検出部、３…優先度計算部、４…候補生成部、５…付属語辞書、６…自立語辞書。
[0001]
[Technical field of the invention]
The present invention relates to a method for discriminating independent words when kana-kanji conversion is performed on a character string that includes an unregistered word, for example in a Japanese word processor.
[0002]
[Prior art]
In recent years, Japanese word processors have come into wide use as a means of entering Japanese text written in mixed kana and kanji. Such a word processor converts a kana character string entered by the user into kanji by using a dictionary that stores the notation, reading, and other attributes of words in advance. To produce the mixed kanji-kana text the user wants efficiently, the input kana string must be converted accurately into the mixed kana-kanji notation the user intends. When it is not, the user has to correct the misconverted portions by hand, which takes considerable effort. In particular, when the string to be converted contains a word that is not registered in the dictionary, the unregistered portion must be specified separately, for example by single-kanji conversion, and the unregistered word often degrades the conversion of the text before and after it as well.
[0003]
In order to solve this problem, conventionally, a method of reducing the influence of unregistered words by two steps has been used.
1) From the registered words, estimate which part of the input is a section containing an unregistered word.
[0004]
2) Estimate what notation the unregistered word estimated in the preceding step has.
For step 1), the usual approach is to skip the input from the point where conversion becomes impossible to the point where analysis becomes possible again, and to treat the skipped portion as the unregistered-word section.
[0005]
Because kana-kanji conversion basically operates in units of phrases, the unregistered section estimated by method 1) contains attached words as well as independent words.
[0006]
Step 2), on the other hand, is a very difficult problem, and at present it goes no further than estimating katakana words. As noted above, the estimated section contains attached words, and the attached-word portion does not need to be converted into katakana. To deal with this, when the end of the estimated section may be an attached word, a candidate that treats the end as an attached word is also generated.
[0007]
[Problems to be solved by the invention]
As described above, the conventional kana-kanji conversion method gives no consideration to distinguishing, within an unregistered character string in the input kana string, the attached words that need no conversion from the independent word to be converted, that is, to deciding how far the independent word extends. Generating conversion candidates was therefore difficult. As a result, the correct answer often fails to appear as the first candidate, the user has to perform the next-candidate operation more often, and unregistered words impose a heavy correction burden on the user.
[0008]
Further, the method for discriminating independent words in unregistered character strings is useful not only for kana-kanji conversion but also for machine translation.
The present invention has been made in view of these problems. Its object is to provide an independent word discrimination method capable of determining the independent word in an unregistered character string, that is, a string in the input that is not registered in a dictionary in which notation and the like are registered in advance.
[0009]
[Means for Solving the Problems]
The independent word discrimination method of the present invention uses an independent word dictionary, which stores the notation and grammatical information of independent words, to detect in the input character string a phrase containing an unregistered character string that is not registered in that dictionary. It then uses an attached word dictionary, which stores the notation and grammatical information of attached words, to separate attached words from the detected phrase and extract one or more independent word candidates. For each extracted candidate, it calculates the likelihood of being an independent word from the grammatical information of the words behind or in front of the phrase, and it determines the independent word contained in the phrase from the calculated values. This makes it possible to determine easily the independent word in an unregistered character string of the input that is not registered in a dictionary in which notation and the like are registered in advance.
[0010]
[Embodiments of the invention]
Embodiments of the present invention will be described below with reference to the drawings.
FIG. 1 shows the configuration of the main part of an information processing apparatus, such as a word processor, to which the independent word discrimination method of this embodiment is applied. It consists of an unregistered character string detection unit 1, an attached word detection unit 2, a priority calculation unit 3, a candidate generation unit 4, an attached word dictionary 5, and an independent word dictionary 6.
[0011]
In the following description, the case of kana-kanji conversion will be described as an example.
The unregistered character string detection unit 1 detects, in the input character string, a character string containing an unregistered word (hereinafter called an unregistered character string), using, for example, the independent word dictionary 6, which stores the notation, grammatical information, and the like of independent words.
[0012]
The attached word detection unit 2 uses the attached word dictionary 5, which stores the notation and grammatical information of attached words, to separate attached words from the unregistered character string detected by the unregistered character string detection unit 1 and thereby detect one or more independent word candidates.
[0013]
The priority calculation unit 3 uses the independent word dictionary 6 to calculate, for each of the independent word candidates detected by the attached word detection unit 2, a priority based on the grammatical information of the words in front of or behind the unregistered character string. The priority is an index of how plausible the candidate is as an independent word (its likelihood of being an independent word).
[0014]
Based on the priorities calculated by the priority calculation unit 3, the candidate generation unit 4 takes the independent word candidates detected by the attached word detection unit 2 and, in descending order of priority, determines each to be an independent word and cuts it out. That is, in the case of kana-kanji conversion, it sorts the candidates in descending order of calculated priority, which fixes the ranking of the kana-kanji conversion candidates, and generates a conversion candidate table.
[0015]
The attached word dictionary 5 stores the notation of each attached word and, as its grammatical information, the attached words that can immediately precede it.
The independent word dictionary 6 stores independent words and their grammatical information.
[0016]
FIG. 2 shows an example of what the attached word dictionary 5 stores. In FIG. 2, each attached word registered in the attached word dictionary 5 is assigned an attached word number, and the numbers of the attached words that can immediately precede it are stored with it.
[0017]
For example, the attached word with attached word number “1” is “に” (ni), and the attached words that can immediately precede it are those numbered “4” and “5”, that is, “と” (to) and “まで” (made).
FIG. 3 shows an example of what the independent word dictionary 6 stores. In FIG. 3, a “reading”, a “notation”, and “grammatical information” are stored for each independent word registered in the independent word dictionary 6.
[0018]
For example, the entry whose notation is “製品” (product) has the reading “せいひん” (seihin), and its grammatical information is “noun”.
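As a rough illustration of the two dictionaries described for FIG. 2 and FIG. 3, the following Python sketch models one record type for each. The class names, field names, and the helper `may_precede` are hypothetical, and only the entries mentioned in the text are filled in. The predecessor lists for “と” and “まで” are not given in the text and are left empty here purely as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttachedWord:
    number: int              # attached word number, as in FIG. 2
    notation: str            # surface form of the attached word
    preceded_by: frozenset   # numbers of attached words that may come right before it


@dataclass(frozen=True)
class IndependentWord:
    reading: str             # kana reading
    notation: str            # written form
    grammar: str             # grammatical information


# Only the entries named in the text; the empty sets are an assumption.
ATTACHED_WORDS = {
    1: AttachedWord(1, "に", frozenset({4, 5})),
    4: AttachedWord(4, "と", frozenset()),
    5: AttachedWord(5, "まで", frozenset()),
}

INDEPENDENT_WORDS = {
    "せいひん": IndependentWord("せいひん", "製品", "noun"),
}


def may_precede(previous_number, next_number):
    """True if the attached word `previous_number` may come right before `next_number`."""
    return previous_number in ATTACHED_WORDS[next_number].preceded_by
```

Keying the attached word table by number mirrors the cross-references in FIG. 2, so a connection check is a single set lookup.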
Next, the overall processing of the independent word discrimination method in the above configuration will be described with reference to the flowchart shown in FIG. 4.
[0019]
First, in step A1, the independent word dictionary 6 is used to detect, in the input kana character string, an unregistered character string containing a word not registered in the independent word dictionary 6. A technique such as that disclosed in Japanese Patent Application Laid-Open No. 60-147867 can be reused for this processing, so it is not described in detail here.
[0020]
Next, in step A2, the attached word detection unit 2 uses the attached word dictionary 5 to separate words that can be attached words from the detected unregistered character string, thereby detecting one or more independent word candidates.
[0021]
The priority calculation unit 3 calculates a priority for each detected independent word candidate (step A3 to step A4).
In the priority calculation process of step A4, the priority calculation unit 3 calculates the priority by referring to the grammar information of the words before and after the unregistered character string.
[0022]
If it is determined in step A3 that the priorities have been calculated for all independent word candidates, the process proceeds to step A5.
In step A5, the independent word candidates are determined to be independent words and cut out in descending order of the priorities obtained in step A4. That is, in the case of kana-kanji conversion, the candidates are sorted in descending order of calculated priority, which fixes the ranking of the kana-kanji conversion candidates, and a conversion candidate table is generated.
[0023]
Next, with reference to the flowchart shown in FIG. 5, the attached word detection process in step A2 of FIG. 4 will be described in more detail.
First, in step B1, since it is always possible that the unregistered character string contains no attached word and is an independent word in its entirety, the whole string is added to the candidate list as one candidate.
[0024]
In step B2, “1” is assigned to a variable S, which expresses the start position of the attached word as a number of characters counted from the end of the unregistered character string.
In step B3, if the variable S is equal to the length of the unregistered character string, every possibility has been examined and the process ends. Otherwise, the attached word dictionary 5 is used to judge whether the kana character string running from the position indicated by S to the end of the unregistered character string is permissible as an attached word sequence (step B4). A method generally used in morphological analysis can be used for this judgment, so it is not described in detail here.
[0025]
If the judgment in step B4 is true, then in step B5 a candidate in which the independent word and the attached words are separated at the position indicated by S is added to the candidate list, and the process proceeds to step B6. If the judgment is false, the process proceeds directly to step B6. In step B6, “1” is added to the variable S.
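Steps B1 to B6 can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the text leaves the step B4 judgment to ordinary morphological analysis, so the helper `is_attached_word_sequence` below is a hypothetical stand-in that accepts a suffix if it can be split into attached words whose adjacent pairs are allowed by a connection table. The function names, the word set, and the table layout are all assumptions, and the loop is assumed to return to step B3 after step B6.

```python
def is_attached_word_sequence(text, words, preceded_by):
    """Hypothetical step B4 check: can `text` be split into allowed attached words?"""
    def walk(pos, prev):
        if pos == len(text):
            return True
        for word in words:
            allowed = prev is None or prev in preceded_by.get(word, ())
            if allowed and text.startswith(word, pos) and walk(pos + len(word), word):
                return True
        return False

    return bool(text) and walk(0, None)


def detect_candidates(unregistered, words, preceded_by):
    """Return (independent part, attached part) pairs for an unregistered string."""
    candidates = [(unregistered, "")]          # B1: the whole string as an independent word
    s = 1                                      # B2: suffix length counted from the end
    while s < len(unregistered):               # B3: stop once S reaches the string length
        tail = unregistered[-s:]
        if is_attached_word_sequence(tail, words, preceded_by):   # B4
            candidates.append((unregistered[:-s], tail))          # B5
        s += 1                                 # B6
    return candidates


# Illustrative data only; "が" is assumed to be registered as an attached word.
WORDS = {"が", "に", "と", "まで"}
PRECEDED_BY = {"に": {"と", "まで"}}
```

With this data, `detect_candidates("あみーが", WORDS, PRECEDED_BY)` yields the whole-string candidate plus the split “あみー” + “が”. The loop condition uses `<` rather than a strict equality test so that an empty input cannot loop forever.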
[0026]
Next, the priority calculation process in step A4 of FIG. 4 will be described in more detail with reference to the flowchart shown in FIG. 6.
First, in step C1, a default score (here, “100”) is assigned to a variable P representing the priority of the independent word candidate detected in step A2.
[0027]
In step C2, the independent word dictionary 6 is searched to obtain the grammatical information of the independent word immediately after the unregistered character string. If that grammatical information is “non-compound noun” and no attached word has been separated from the unregistered character string (step C3), the unregistered character string is highly likely to become a compound noun, so the priority is lowered in step C4 (here, “50” is subtracted). If the grammatical information is “verb that takes と (to)” and the attached word separated from the unregistered character string is “と” (step C5), the priority is raised in step C6 (here, “30” is added). Finally, the priority held in the variable P is assigned to the independent word candidate (step C7).
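The scoring in steps C1 to C7 amounts to a default value plus two rule-based adjustments. A minimal Python sketch follows; the function name, the tag strings, and the idea of passing the following word's grammatical information as a set of tags are assumptions made for illustration, while the numbers 100, 50, and 30 are the example values given in the text.

```python
DEFAULT_PRIORITY = 100               # C1: default score from the text
NON_COMPOUND_NOUN = "non-compound noun"   # hypothetical tag string
VERB_TAKING_TO = "verb taking と"          # hypothetical tag string


def calculate_priority(attached_part, next_word_grammar):
    """Score one candidate from the grammar tags of the following independent word.

    `attached_part` is the attached-word suffix split off the unregistered
    string ("" when nothing was split off). `next_word_grammar` is assumed
    to be the result of the step C2 dictionary lookup.
    """
    priority = DEFAULT_PRIORITY
    # C3/C4: whole-string candidate followed by a non-compound noun is penalised.
    if NON_COMPOUND_NOUN in next_word_grammar and attached_part == "":
        priority -= 50
    # C5/C6: candidate ending in "と" followed by a verb that takes "と" is rewarded.
    if VERB_TAKING_TO in next_word_grammar and attached_part == "と":
        priority += 30
    return priority                  # C7: value assigned to the candidate
```

Keeping the two rules as independent checks makes it straightforward to add further rules, such as the connectivity of other attached words mentioned later in the description.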
[0028]
Next, the independent word discrimination method as described above will be specifically described.
For example, suppose the kana character string “あみーがあすはつばいされる” (amii ga asu hatsubai sareru) is to be processed. The unregistered character string detection unit 1 then detects the portion “あみーが” as an unregistered character string (step A1).
[0029]
Next, in the attached word detection unit 2, a candidate that treats the whole of “あみーが” as an independent word is first added to the list (step B1). Then, when the value of the variable S is 1, the attached word sequence indicated is “が” (ga). This is accepted as an attached word sequence, so the candidate “あみー (independent word) + が (attached word)” is added (step B5).
[0030]
These two candidates are then sent to the priority calculation unit 3. If the word immediately after the unregistered character string is “明日” (asu, tomorrow), its grammatical information is “non-compound noun”, so “あみーが (independent word)” satisfies the condition of step C3 in FIG. 6 and its priority becomes “50”. “あみー (independent word) + が (attached word)”, on the other hand, satisfies neither the condition of step C3 nor that of step C5, so its priority remains “100”. The candidate generation unit 4 therefore gives precedence to “あみー (independent word) + が (attached word)”, and “あみー” is determined to be the independent word.
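The ranking performed by the candidate generation unit 4 in this example can be sketched as a descending sort on priority. The `Candidate` tuple and the function name below are hypothetical; the two candidates and their priorities (50 and 100) are the ones worked out in the text.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    independent: str   # part treated as the independent word
    attached: str      # part treated as attached words ("" if none)
    priority: int      # score from the priority calculation


def build_candidate_table(candidates):
    """Order candidates from highest to lowest priority.

    sorted() is stable, so candidates with equal priority keep the order
    in which they were detected.
    """
    return sorted(candidates, key=lambda c: c.priority, reverse=True)


candidates = [
    Candidate("あみーが", "", 50),     # whole string as an independent word
    Candidate("あみー", "が", 100),    # split before the attached word "が"
]
table = build_candidate_table(candidates)
best = table[0]
```

Here `best.independent` is “あみー”, matching the result described in the text.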
[0031]
As described above, according to this embodiment, the unregistered character string detection unit 1 detects an unregistered character string in the input character string by using the independent word dictionary 6, which stores the notation, grammatical information, and the like of independent words. The attached word detection unit 2 uses the attached word dictionary 5, which stores the notation and grammatical information of attached words, to separate attached words from the detected unregistered character string and detect one or more independent word candidates. The priority calculation unit 3 uses the independent word dictionary 6 to calculate a priority for each detected candidate based on the grammatical information of the word behind the unregistered character string. The candidate generation unit 4 then determines the candidates to be independent words and cuts them out in descending order of the calculated priority. This makes it possible to determine easily the independent word in an unregistered character string that is not registered in a dictionary in which notation and the like are registered in advance. Because the candidate containing the most probable independent word is given precedence, the user's effort in selecting and correcting candidates is reduced, for example in kana-kanji conversion, even when the input character string contains an unregistered word.
[0032]
The present invention is not limited to the above embodiment and can be modified as appropriate without departing from its gist. In this embodiment, the grammatical information used by the priority calculation unit 3 is the non-compound attribute of nouns and the connectivity of verbs with the attached word “と” (to), but this is only an example. The connectivity of attached words with other parts of speech can also be used. In addition to grammatical information, semantic information such as co-occurrence information, or pragmatic information, can be used.
[0033]
In the above embodiment, only the grammatical information of the word behind the unregistered character string is used, but the grammatical information of the word in front of it can also be used. For example, when a phrase ending in the attached word “が” (ga) lies in front, the independent word candidates within the unregistered character string can be prioritized by exploiting the property that the attached word “が” is unlikely to be taken after it.
[0034]
In this embodiment, information on the word immediately after the unregistered character string is used, but information on phrases farther away, rather than those immediately before or after it, can also be used.
In this embodiment, independent word candidates are detected by separating the unregistered character string into “independent word part + attached word part”, but the string may be separated into three or more parts as necessary, as in “independent word part + attached word part + independent word part + attached word part”.
[0035]
In this embodiment, the input is assumed to be a kana character string, but a mixed kana-kanji string may also be used. In that case the present invention is effective when the independent word part contains kana. For example, suppose the mixed kana-kanji string “老松や不動産にて” is the target and the unregistered character string is “老松や”. The independent word part has two possibilities, “老松” or “老松や”, but given grammatical information that “不動産” (real estate) is often used with a proper noun immediately before it, “老松や” can be judged the more likely independent word part.
[0036]
The invention can still be practiced if the format of the attached word dictionary 5 and the independent word dictionary 6 is changed or dictionary entries are added.
This embodiment can also be incorporated into an information processing apparatus that performs kana-kanji conversion, such as a word processor, where replacing the independent word part with katakana, kanji, or the like improves the efficiency of the user's document creation. For example, for the string “あみーがあすはつばいされる” discussed above, converting the highest-priority independent word among the detected candidates, namely “あみー”, into katakana allows the candidate generation unit 4 to generate a conversion candidate table as shown in FIG. 7, and the string can easily be converted to a form such as “アミーが明日販売される”. The replacement of the independent word part may be made conditional, for example on the independent word part containing a long vowel mark.
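The katakana replacement of the independent word part can be illustrated with a small Python sketch. It relies on the fact that the Unicode hiragana block (ぁ to ゖ) and the corresponding katakana letters sit a fixed 0x60 code points apart; the function names and the optional long-vowel condition flag are hypothetical, and the attached part is left unconverted as the description requires.

```python
HIRAGANA_FIRST, HIRAGANA_LAST = "ぁ", "ゖ"
KANA_OFFSET = ord("ァ") - ord("ぁ")   # 0x60 between the two Unicode blocks


def to_katakana(text):
    """Convert hiragana letters to katakana; other characters (e.g. "ー") pass through."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if HIRAGANA_FIRST <= ch <= HIRAGANA_LAST else ch
        for ch in text
    )


def render_candidate(independent, attached, require_long_vowel=False):
    """Katakana-ize only the independent part and re-attach the attached words.

    When `require_long_vowel` is set, the replacement happens only if the
    independent part contains the long vowel mark "ー".
    """
    if require_long_vowel and "ー" not in independent:
        return independent + attached
    return to_katakana(independent) + attached
```

For the example in the text, `render_candidate("あみー", "が")` gives “アミーが”, with the attached word “が” kept in hiragana.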
[0037]
The determined independent word part can also be displayed distinguishably, for example by enclosing it in symbols or changing its font, and used for purposes such as dictionary registration. For example, as shown in FIG. 8, the determined unregistered independent word part can be marked with symbols to prompt the user to register it in the dictionary. In this case the method can be applied not only to kana-kanji conversion but also to machine translation, OCR (optical character reader), and the like.
[0038]
[Effect of the invention]
As described above, the present invention makes it possible to determine easily the independent word in an unregistered character string of the input that is not registered in a dictionary in which notation and the like are registered in advance. By giving precedence to the candidate containing the most probable independent word in the unregistered character string, it reduces, for example in kana-kanji conversion, the effort the user must spend correcting for unregistered words.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a main part of an information processing apparatus such as a word processor to which an independent word discrimination method according to an embodiment of the present invention is applied.
FIG. 2 is a diagram showing a storage example of an attached word dictionary.
FIG. 3 is a diagram showing a storage example of an independent word dictionary.
FIG. 4 is a flowchart for explaining an outline of an independent word discrimination method.
FIG. 5 is a flowchart for explaining a processing operation of an attached word detection unit.
FIG. 6 is a flowchart for explaining a processing operation of a priority calculation unit.
FIG. 7 is a diagram showing a specific example of a conversion candidate table generated by a candidate generation unit.
FIG. 8 is a diagram showing a display example of independent words determined from unregistered character strings.
[Explanation of symbols]
1 ... Unregistered character string detection unit, 2 ... Attached word detection unit, 3 ... Priority calculation unit, 4 ... Candidate generation unit, 5 ... Attached word dictionary, 6 ... Independent word dictionary.

Claims

各自立語の表記、仮名読み、当該自立語が非複合名詞である場合にはその旨、及び当該自立語がその直前に付属語をとる場合には当該付属語が登録されている自立語辞書を記憶する第１の記憶手段と、
各付属語の表記と、当該付属語の前に接続可能な付属語とが登録されている付属語辞書を記憶する第２の記憶手段と、
入力された仮名文字列中の文節のうち、前記自立語辞書に登録されている自立語が含まれていない処理対象の文節の変換候補テーブルを生成する生成手段と、
を含む日本語ワードプロセッサにおける変換候補生成方法であって、
前記生成手段が、前記処理対象の文節から、当該処理対象の文節全体を変換対象の自立語とする候補を生成するとともに、前記付属語辞書を用いて、変換対象の自立語とする文字あるいは文字列と変換不要の付属語とする文字あるいは文字列とに分離した候補を生成するステップと、
前記生成手段が、前記処理対象の文節の直後にある自立語について前記自立語辞書を検索した結果、（ａ）当該処理対象の文節の直後にある自立語が非複合名詞である場合には、生成された複数の候補のうち、当該処理対象の文節全体を自立語とする候補に対する優先度を予め定められた値だけ下げ、（ｂ）当該処理対象の文節の直後にある自立語に対し、当該自立語がその直前にとる付属語が登録されている場合には、前記複数の候補のうち、当該付属語で終わる候補に対する優先度を予め定められた値だけ上げて、前記複数の候補のそれぞれに対する優先度を算出するステップと、
前記生成手段が、前記複数の候補を前記優先度が最も高い候補から順に並び替えて、各候補に含まれる自立語をカタカナあるいは漢字に置き換えた変換候補テーブルを生成するステップと、
を含む変換候補生成方法。
A conversion candidate generation method in a Japanese word processor that includes:
first storage means for storing an independent word dictionary in which are registered, for each independent word, its notation, its kana reading, an indication that it is a non-compound noun when that is the case, and, when the independent word takes an attached word immediately before it, that attached word;
second storage means for storing an attached word dictionary in which are registered the notation of each attached word and the attached words that can be connected in front of it; and
generating means for generating a conversion candidate table for a phrase to be processed, namely a phrase in the input kana character string that contains no independent word registered in the independent word dictionary,
the method comprising:
a step in which the generating means generates, from the phrase to be processed, a candidate that treats the entire phrase as the independent word to be converted and, using the attached word dictionary, candidates in which the phrase is separated into a character or character string treated as the independent word to be converted and a character or character string treated as an attached word needing no conversion;
a step in which the generating means searches the independent word dictionary for the independent word immediately after the phrase to be processed and calculates a priority for each of the plurality of candidates by (a) when the independent word immediately after the phrase is a non-compound noun, lowering by a predetermined value the priority of the candidate, among the generated candidates, that treats the entire phrase as an independent word, and (b) when an attached word that the independent word immediately after the phrase takes immediately before it is registered, raising by a predetermined value the priority of the candidate, among the plurality of candidates, that ends with that attached word; and
a step in which the generating means sorts the plurality of candidates in descending order of priority and generates a conversion candidate table in which the independent word contained in each candidate is replaced with katakana or kanji.