JP6648421B2

JP6648421B2 - Information processing apparatus for processing documents, information processing method, and program

Info

Publication number: JP6648421B2
Application number: JP2015116798A
Authority: JP
Inventors: 功宮下; 片岡　正弘; 正弘片岡; 洋之川村; 大樹向井
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2015-06-09
Filing date: 2015-06-09
Publication date: 2020-02-14
Anticipated expiration: 2035-06-09
Also published as: TWI667579B; JP2017004218A; CN106250354B; TW201643749A; CN106250354A

Description

The present invention relates to an information processing apparatus that processes documents, an information processing method, and a program.

In recent years, companies have increasingly sought to speed up management decisions and improve their accuracy by analyzing company data with a document analysis technique called text mining and extracting useful information. Text mining may involve morphological analysis, which divides the sentences in a document into meaningful word units. In morphological analysis, words taken from the sentence to be processed are matched against a word dictionary prepared in advance. However, if the sentence contains an external character (gaiji), such as a user-defined character, the computer cannot divide the sentence into words correctly and cannot extract useful information. In the following, a document is exemplified as information containing one or more sentences.

FIG. 1 shows an example in which a computer morphologically analyzes a sentence containing external characters. In the example of FIG. 1, the second character C2 and the seventh character C7 from the left are assumed to be external characters. The first character C1 and the second character C2 from the left together represent a proper noun. However, when the computer morphologically analyzes the sentence of FIG. 1, it cannot recognize the morpheme of the second character C2. As a result, the first character C1 and the second character C2 are judged separately, as "kanji" and "?? (unknown, undefined)" respectively. The same applies to the portion containing the seventh character from the left, that is, the portion from the sixth character C6 to the eighth character C8.

A character that is not included in a specific character set, such as a user-defined character, is called an external character. More specifically, for a given computer, a character that is not included in the specific character standard handled by that computer is called an external character. Conversely, a character included in that character standard is called an internal character.

As described above, when a computer morphologically analyzes a sentence containing an external character, it cannot recognize the external character portion of the document, so the result of the analysis is inappropriate. Conventionally, therefore, external characters are replaced with variant forms among the internal characters before morphological analysis is performed. Here, a variant form among the internal characters is, for example, an internal character whose shape resembles the external character and that is used as a substitute for it.

FIG. 2 illustrates a process in which a computer replaces the external characters in a sentence with variant forms among the internal characters and then performs morphological analysis. In the example of FIG. 2, the second character C2 and the seventh character C7 from the left in the sentence of FIG. 1 are each replaced with an internal character before morphological analysis is performed.

JP 2000-293522 A; JP 2006-235800 A; JP 2010-165302 A

However, replacing external characters with internal characters does not guarantee an appropriate morphological analysis result. Morphological analysis can be described as a process that divides a sentence into the smallest units that carry meaning in the language and determines the part of speech and the like of each unit. Suppose that the sentence to be processed contains a word Z1 that includes an external character. When the external character in this sentence is replaced with an internal character, the word Z1 becomes, for example, a word Z2.

However, the program that performs morphological analysis cannot always recognize the word Z2 as a morpheme. More specifically, when the word Z1 containing the external character is a noun, a verb, an adjective, a sign in the sentence, or the like, the program may fail to recognize the replaced word Z2 as having the same part of speech, because Z2 may not be registered in the word dictionary used for morphological analysis. For example, when Z1 is a proper noun such as a person's name, the program cannot always recognize the replaced word Z2 as a person's name. This problem is not limited to proper nouns such as personal names. Whenever external characters in a sentence are replaced with internal characters before morphological analysis, it can also arise for other elements of morphological analysis, such as nouns, verbs, adjectives, adverbs, particles, auxiliary verbs, conjunctions, affixes, signs, and symbols.

Accordingly, an object of one aspect of the present embodiment is to improve the accuracy of morphological analysis of documents containing external characters.

One aspect of the present embodiment is exemplified by a program for causing a computer to execute information processing. The program causes the computer to execute: a determination process of identifying, in a document to be processed, external characters that are not included in the character standard handled by the information processing apparatus; a replacement process of replacing the external characters with internal characters included in the character standard, based on a replacement dictionary generated from a first dictionary used for morphological analysis; and an analysis process of analyzing, using the first dictionary, the document in which the external characters have been replaced with internal characters.

According to the present information processing apparatus, a document containing external characters can be morphologically analyzed more appropriately than before.

FIG. 1 is a diagram showing an example of morphological analysis of a sentence containing external characters.
FIG. 2 is a diagram illustrating a process in which a computer replaces external characters in a sentence with variant forms among the internal characters and performs morphological analysis.
FIG. 3 is a diagram illustrating the data flow of the processing executed by the information processing apparatus and the dictionaries used in each process.
FIG. 4 is a diagram illustrating processing using the OCR dictionary.
FIG. 5 is a diagram illustrating processing using the large-scale character collection.
FIG. 6 is a diagram illustrating calculation results of attribute information similarity.
FIG. 7 is an example of a part similarity dictionary.
FIG. 8 is an example of calculating the similarity of part positions.
FIG. 9 is another example of calculating the similarity of part positions.
FIG. 10 is a diagram illustrating the effect of the processing of the embodiment.
FIG. 11 is a diagram illustrating the hardware configuration of the information processing apparatus.
FIG. 12 is a diagram illustrating the overall processing flow.
FIG. 13 is a flowchart illustrating details of search processing using the large-scale character collection.
FIG. 14 is another example of the details of search processing using the large-scale character collection.
FIG. 15 is a diagram illustrating a process of decomposing an external character into parts.

Hereinafter, an information processing apparatus according to an embodiment will be described with reference to the drawings.
<Processing example>
FIG. 3 illustrates the data flow of the processing executed by the information processing apparatus and the dictionaries used in each process. As shown in FIG. 3, the information processing apparatus has a word dictionary 1, an OCR dictionary 2, a large-scale character collection 3, and an attribute dictionary 4. By matching the document to be processed against the dictionaries illustrated in FIG. 3, the information processing apparatus replaces external characters in the document with variant forms among the internal characters that can be morphologically analyzed.

(A) Word dictionary 1
The word dictionary 1 is a dictionary used for morphological analysis and is an example of the first dictionary. It registers, for example, words formed by combining characters, together with the part of speech of each word. In the word dictionary 1, characters are described by character codes. A character code defines a character as a byte sequence of, for example, one or two bytes. The word dictionary 1 therefore defines words as combinations of character codes. For example, the information processing apparatus obtains a combination of character codes from the sentence to be processed and searches the word dictionary 1 with that combination. If the combination is defined in the word dictionary 1, the information processing apparatus recognizes it as a word.
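The dictionary lookup described above can be sketched as follows. This is a minimal illustration only: the dictionary entries, the greedy longest-match strategy, the "??" tag for unknown characters, and the use of the private-use code point U+E000 as a stand-in external character are all assumptions made for the example, not details taken from the embodiment.

```python
# Hypothetical word dictionary: surface form -> part of speech.
WORD_DICT = {"渡邉": "proper noun", "さん": "suffix", "は": "particle"}


def segment(sentence, word_dict):
    """Greedy longest-match segmentation (a simplification of real analyzers).

    Characters that start no dictionary word are emitted alone with the tag "??".
    """
    max_len = max(len(word) for word in word_dict)
    tokens, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + n]
            if piece in word_dict:
                tokens.append((piece, word_dict[piece]))
                i += n
                break
        else:
            # No dictionary word starts here: mark the character as unknown.
            tokens.append((sentence[i], "??"))
            i += 1
    return tokens


# A sentence made only of internal characters is segmented into known words.
print(segment("渡邉さんは", WORD_DICT))
# With an external character in place of "邉", the name is no longer found.
print(segment("渡\ue000さんは", WORD_DICT))
```

In the second call the name breaks into two unknown single characters, which mirrors the failure shown in FIG. 1.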

The information processing apparatus also has a library of character fonts, which are the character shapes corresponding to each character code. Given a character code and a font type, the information processing apparatus outputs the character code in a document to a display, a printer, or the like in the character shape specified by the font. In the present embodiment, the information processing apparatus performs processing using a predetermined font, which may, for example, be configurable by user operation.

(B) OCR dictionary 2
The OCR dictionary 2 is a dictionary in which the character shapes corresponding to character codes have been converted into a format suited to Optical Character Recognition (OCR) processing. For example, the OCR dictionary 2 holds character shape information in which the aspect ratio of each character is normalized to a predetermined value. The OCR dictionary 2 may hold the pattern of each character shape as-is, or it may hold patterns obtained by decomposing the character shape into feature portions. For example, for each direction vector radiating from the center of the character, it may hold a feature pattern of the character shape for the portion classified as line segments falling within a predetermined allowable angular range of that vector. In any case, the OCR dictionary 2 holds character shape information in a format compatible with the OCR processing executed by the information processing apparatus.

The information processing apparatus decomposes the combinations of character codes registered in the word dictionary 1 for morphological analysis into individual character codes. For example, suppose that the three words "渡辺", "渡邉", and "渡邊" (all read "Watanabe") are registered in the word dictionary 1. The information processing apparatus decomposes these words into the characters "渡", "辺", "邉", and "邊". It then obtains the character shape of each character, generates character shape information in the format of the OCR dictionary 2, and registers that information in the OCR dictionary 2. The OCR dictionary 2, as an example of the replacement dictionary, therefore contains character shape information that defines the shapes of the internal characters contained in the first dictionary (word dictionary 1).
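The decomposition step in the paragraph above can be sketched in a few lines. The function name is hypothetical, and the sketch stops at collecting the distinct characters; generating the shape information for each character is not shown.

```python
def characters_in_dictionary(words):
    """Return the distinct characters used by the dictionary words, in first-seen order."""
    seen = {}
    for word in words:
        for ch in word:
            # dict keys keep insertion order, so duplicates are dropped but order is stable.
            seen.setdefault(ch, None)
    return list(seen)


chars = characters_in_dictionary(["渡辺", "渡邉", "渡邊"])
print(chars)  # ['渡', '辺', '邉', '邊']
```

Each character in the resulting list would then be rendered with the predetermined font and registered in the OCR dictionary 2.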

When the information processing apparatus identifies an external character in the document being morphologically analyzed, it matches the character shape information of the external character against the character shape information in the OCR dictionary 2 and searches the OCR dictionary 2 for a character shape that matches the shape of the external character. If such a character shape is found, the information processing apparatus replaces the character code of the external character with the character code of the internal character corresponding to that shape. That is, the information processing apparatus replaces the character code of the external character in the document to be morphologically analyzed with the character code of an internal character obtained from the OCR dictionary 2. It then performs morphological analysis on the document in which the external characters have been replaced with internal characters.
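The matching and replacement step can be illustrated with a toy model. Everything concrete here is an assumption for the sake of the example: shapes are 4x4 bitmaps, the score is the fraction of agreeing cells, the threshold is 0.8, and U+E000 stands in for the external character. The embodiment does not specify the OCR scoring method, so this is a sketch of the control flow rather than of the actual recognition.

```python
def shape_score(a, b):
    """Fraction of cells that agree between two equally sized bitmaps."""
    cells = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(x == y for x, y in cells) / len(cells)


def best_internal(gaiji_shape, ocr_dict, threshold=0.8):
    """Return the best-scoring internal character, or None if below the threshold."""
    best_char, best = None, 0.0
    for ch, shape in ocr_dict.items():
        score = shape_score(gaiji_shape, shape)
        if score > best:
            best_char, best = ch, score
    return best_char if best >= threshold else None


def replace_gaiji(text, gaiji_shapes, ocr_dict, threshold=0.8):
    """Replace each external character that has a sufficiently similar internal shape."""
    out = []
    for ch in text:
        if ch in gaiji_shapes:
            sub = best_internal(gaiji_shapes[ch], ocr_dict, threshold)
            out.append(sub if sub is not None else ch)  # keep the gaiji if nothing matches
        else:
            out.append(ch)
    return "".join(out)


# Toy bitmaps (hypothetical, not real glyph data).
OCR_DICT = {
    "邉": ("#..#", "####", "#..#", "####"),
    "辺": ("....", ".##.", ".##.", "####"),
}
GAIJI_SHAPES = {"\ue000": ("#..#", "####", "#.##", "####")}  # one cell differs from "邉"

print(replace_gaiji("渡\ue000", GAIJI_SHAPES, OCR_DICT))  # 渡邉
```

Because the OCR dictionary contains only characters that occur in the word dictionary, any replacement found this way is a character the morphological analyzer already knows.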

(C) Large-scale character collection 3
The large-scale character collection 3, also called a large-scale character set, is a dictionary of characters in which character shape information is registered together with character codes, and it contains more characters than the standards in general use. In other words, the large-scale character collection 3 registers more character codes and character shape information than the range defined by any single standard. Large-scale character collections are provided by companies such as computer manufacturers and publishers, by universities and research institutions, and by groups of researchers. The large-scale character collection 3 may be installed in the information processing apparatus, or it may be stored in a database on a server accessible through a LAN (Local Area Network), the Internet, or the like.

(D) Attribute dictionary 4
The attribute dictionary 4 is a dictionary in which the characters contained in the word dictionary 1 are decomposed into parts and which defines, for each character code, attribute information such as readings, parts, and part positions. In the present embodiment, each record of the attribute dictionary 4 contains the elements character code, reading, part, and part position. The element "reading" in each record defines the readings of the character (kanji) specified by the character code.

The element "part" defines the parts contained in the character specified by the character code. Parts are, for example, those identified by kanji radicals, such as the seven radical types hen (偏, left side), tsukuri (旁, right side), kanmuri (冠, top), ashi (脚, bottom), kamae (構, enclosure), tare (垂, upper left), and nyō (繞, lower left). Within each of the seven types, specific radicals such as ninben and tehen are defined by part codes. When a character has several parts, several part codes are specified in association with its character code. The attribute dictionary 4 may hold the shape information of the part corresponding to each part code, that is, the pattern of the part. Alternatively, the attribute dictionary 4 may omit the part shapes, with the shape information defined elsewhere, for example in a font file.

The element "part position" is information that defines the position of a part within the character region, which is defined by the extent of the character. It can be specified by the number of the partial region of the character region in which a reference point of the part's shape lies. For example, suppose that the character region is a rectangle normalized to a predetermined size. As illustrated in FIG. 3, the rectangle is divided into 4 rows by 4 columns, for 16 partial regions in total, numbered 1 through 16. Suppose also that the reference point is the upper-left point of the extent of the part's shape information, that is, of the region the part occupies. The position of a part is then defined as the number of the partial region that contains the upper-left point of the region the part occupies. In each record of the attribute dictionary 4, the element "part position" defines the position of each part. Although FIG. 3 illustrates the attribute dictionary 4 using the four characters "渡", "辺", "邉", and "邊", the attribute dictionary 4 is not limited to these four characters.
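The mapping from a part's upper-left point to a partial-region number can be sketched as below. The text states only that the 16 regions are numbered 1 through 16, so the row-major order starting at the top-left region is an assumption, as are the function name and the normalized coordinate system.

```python
def part_position(left, top, width=1.0, height=1.0, grid=4):
    """Return the 1-based number of the partial region containing the point (left, top).

    Regions are assumed to be numbered row by row from the top-left corner.
    Points on the right or bottom edge are clamped into the last column or row.
    """
    col = min(int(left / width * grid), grid - 1)
    row = min(int(top / height * grid), grid - 1)
    return row * grid + col + 1


print(part_position(0.0, 0.0))   # 1  (top-left region)
print(part_position(0.5, 0.0))   # 3  (third column, first row)
print(part_position(0.0, 0.25))  # 5  (first column, second row)
```

With this numbering, a part whose bounding box starts at the top-left corner of the character, such as a left-side radical, gets position 1.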

(E) Purpose of the large-scale character collection 3 and the attribute dictionary 4
In the present embodiment, the information processing apparatus uses the large-scale character collection 3 and the attribute dictionary 4 when it cannot determine, from the OCR dictionary 2, a variant internal character corresponding to an external character. Specifically, it matches the character shape information of the external character against the character shape information for the character codes in the large-scale character collection 3 and selects a character in the collection whose shape matches with a score at or above a given reference value. It then obtains attribute information of the selected character, such as its parts and part positions, from the large-scale character collection 3 and selects from the attribute dictionary 4 an internal character with similar attribute information. Alternatively, before searching the large-scale character collection 3 by the shape of the external character, the information processing apparatus may decompose the external character into parts and search the collection based on those parts.
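The two-stage lookup described above amounts to a simple fallback chain. In the sketch below the three lookup functions are hypothetical callables supplied by the caller, and returning None when no stage succeeds is an assumption; the passage does not say what happens in that case.

```python
def choose_internal(gaiji, ocr_lookup, collection_lookup, attribute_lookup):
    """Pick an internal character for a gaiji, trying the OCR dictionary first.

    ocr_lookup(gaiji)        -> internal character or None
    collection_lookup(gaiji) -> attribute information or None
    attribute_lookup(attrs)  -> internal character or None
    """
    internal = ocr_lookup(gaiji)
    if internal is not None:
        return internal
    # Fallback: identify the gaiji in the large-scale collection, then
    # look for an internal character with similar attributes.
    attrs = collection_lookup(gaiji)
    if attrs is None:
        return None
    return attribute_lookup(attrs)


# Usage with stub lookups: the OCR stage fails, the fallback succeeds.
result = choose_internal(
    "\ue000",
    ocr_lookup=lambda g: None,
    collection_lookup=lambda g: {"parts": ["白", "八", "口"]},
    attribute_lookup=lambda attrs: "邉",
)
print(result)  # 邉
```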

FIG. 4 illustrates processing using the OCR dictionary 2. In the example of FIG. 4, the sentence being morphologically analyzed contains a word Z1 composed of characters C1 and C2. Character C1 is "渡", and character C2 is an example of an external character in which the "自" portion of "邉" is replaced by "白". When the information processing apparatus recognizes that the sentence to be morphologically analyzed contains an external character, it obtains the shape of that external character. The shape is stored, for example, in a user-defined dictionary or an external character file. The information processing apparatus normalizes the shape of the external character, converts it into the format of the OCR dictionary 2, and performs OCR processing by matching it against the character shapes defined in the OCR dictionary 2.

If matching against the OCR dictionary 2 recognizes a character whose shape matches the external character (character C2) with a score at or above the reference value, the information processing apparatus obtains the character code of that character from the OCR dictionary 2. In the example of FIG. 4, the OCR processing recognizes the character "邉". The information processing apparatus then replaces the external character C2 with the internal character "邉" in the sentence to be morphologically analyzed. As a result, instead of the word Z1, which combines the internal character "渡" (character C1) with the external character C2, the sentence contains the word Z2, which combines the internal characters "渡" and "邉". The information processing apparatus performs morphological analysis, using the word dictionary 1, on the sentence in which the external character has been replaced in this way.

The OCR dictionary 2 is created, in a format suited to OCR processing, from the shapes of the characters contained in the words defined in the word dictionary 1. Every internal character in the OCR dictionary 2 is therefore contained in some word defined in the word dictionary 1. Consequently, the word Z2 containing the internal character "邉", that is, the combination of the internal characters "渡" and "邉", is likely to be defined in the word dictionary 1. The processing of FIG. 4 thus increases the likelihood that the information processing apparatus can perform morphological analysis appropriately.

FIG. 5 illustrates processing using the large-scale character collection 3. In the example of FIG. 5, as in FIG. 4, the document being morphologically analyzed contains a sentence with the word Z1 composed of characters C1 and C2. In the processing of FIG. 5 as well, the information processing apparatus performs OCR processing, this time matching the shape of the external character against the character shapes defined in the large-scale character collection 3. In the example of FIG. 5, the OCR processing against the large-scale character collection 3 succeeds in recognizing a character that matches the external character (character C2). The information processing apparatus then obtains the attribute information of the external character (character C2) from the large-scale character collection 3. The attribute information consists of, for example, readings, parts, and part positions. Parts are indicated, for example, by part codes and include shiro (白), ukanmuri, hachi (八), kuchi (口), and shinnyō. A part position is given for each part code and is, for example, the number of a partial region when the character region is divided into 16 regions as described for FIG. 3.

Next, the information processing apparatus matches the attribute information of the external character (character C2) against the attribute dictionary 4. As described for FIG. 3, the attribute dictionary 4 defines attribute information for each character obtained from the word dictionary 1. The information processing apparatus therefore extracts from the attribute dictionary 4 a character whose parts and part positions are similar to the attribute information of the external character (character C2). In the example of FIG. 5, the internal character "邉" is extracted. The two characters differ in that the external character (character C2) has the radicals shinnyō and shiro (白), whereas the internal character "邉" has the radicals ennyō and mizukara (自). The other radicals and the radical positions, however, coincide. The information processing apparatus therefore determines that the attribute information of the external character (character C2) and that of the internal character "邉" match with a score at or above a predetermined reference value, and it obtains the internal character "邉".

FIG. 6 illustrates calculation results of attribute information similarity, namely the results obtained when the attribute information of the external character (character C2) illustrated in FIG. 5 is matched against the attribute dictionary 4.

First, regarding reading similarity, as illustrated in FIG. 5, the external character (character C2) and "渡" have no reading in common, so their reading similarity is 0 points. In contrast, the external character (character C2) shares four readings with each of "辺", "邉", and "邊". In the present embodiment, 100 points are awarded for each matching reading. Consequently, in FIG. 6, the reading similarity between the external character (character C2) and each of "辺", "邉", and "邊" is 400 points.

Regarding component similarity, as shown in FIG. 6, the external character (character C2) has the components shiro (白), ukanmuri, hachi (八), kuchi (口), and shinnyo. The character "渡", on the other hand, has components such as mata (又), sanzuihen, and madare, none of which matches a component of the external character (character C2). The apparatus therefore calculates the similarity between the external character (character C2) and "渡" as 0.

The character "辺" has the components katana (刀) and shinnyo. Its katana (刀) does not match any component of the external character (character C2), but its shinnyo does. The apparatus adds, for example, 100 points for each matching component. As a result, the similarity between the external character (character C2) and "辺" is calculated as 100 points.

The character "邉", on the other hand, has the components mizukara (自), wakanmuri, hachi (八), kuchi (口), and ennyo. Its mizukara (自) and wakanmuri are similar to the shiro (白) and ukanmuri of the external character (character C2), respectively, so the apparatus awards 70 points for each. Its hachi (八) and kuchi (口) match components of the external character (character C2), so the apparatus awards 100 points for each. Its ennyo is similar to the shinnyo of the external character (character C2), so the apparatus awards 80 points. From these calculations (70 + 70 + 100 + 100 + 80), the similarity between the external character (character C2) and "邉" is 420 points. The same calculation gives a similarity of 320 points between the external character (character C2) and "邊".

In FIG. 6, the sum of the reading similarity and the component similarity is calculated. However, the apparatus may replace an external character with an internal character without using the reading similarity. Characters with similar shapes often share readings, so judging shape similarity alone is often sufficient. Alternatively, to reduce misjudgments caused by shape matching alone, the apparatus may judge the similarity of the attribute information including the reading similarity before replacing the external character with an internal character.

FIG. 7 is an example of the component similarity dictionary 5, which defines the similarity between components. The component similarity dictionary 5 assigns a similarity value to each component-to-component pair. The value quantifies how similar two components are, with 100 points for identical components. For example, the pairs shiro (白) and mizukara (自), wakanmuri and ukanmuri, and hitoashi and hachi (八) are each defined with a similarity of 70 points, and the pair ennyo and shinnyo is defined with a similarity of 80 points. The results in FIG. 6 are calculated according to the component similarity dictionary 5 of FIG. 7.
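The component scoring described for FIG. 6 and FIG. 7 can be sketched as follows. This is a minimal illustration, not the patented implementation: the romanized component names are used as hypothetical dictionary keys, only the pairs named in the text are included, and the greedy one-to-one pairing of components is an assumption, since the text does not say how components are paired. The component list for "邊" is taken from the description of FIG. 8.

```python
# Hypothetical excerpt of component similarity dictionary 5 (FIG. 7).
# Only the pairs named in the text are included; keys are illustrative romanizations.
SIMILAR_PAIRS = {
    frozenset({"shiro", "mizukara"}): 70,
    frozenset({"wakanmuri", "ukanmuri"}): 70,
    frozenset({"hitoashi", "hachi"}): 70,
    frozenset({"ennyo", "shinnyo"}): 80,
}


def pair_score(a, b):
    """Return 100 for identical components, the dictionary value for similar ones, else 0."""
    if a == b:
        return 100
    return SIMILAR_PAIRS.get(frozenset({a, b}), 0)


def component_similarity(external, candidate):
    """Sum the best score of each candidate component against unused external components.

    The greedy one-to-one pairing is an assumption made for this sketch.
    """
    unused = list(external)
    total = 0
    for comp in candidate:
        best = max(unused, key=lambda e: pair_score(comp, e), default=None)
        if best is not None and pair_score(comp, best) > 0:
            total += pair_score(comp, best)
            unused.remove(best)
    return total


EXTERNAL_C2 = ["shiro", "ukanmuri", "hachi", "kuchi", "shinnyo"]
CANDIDATES = {
    "渡": ["mata", "sanzuihen", "madare"],
    "辺": ["katana", "shinnyo"],
    "邉": ["mizukara", "wakanmuri", "hachi", "kuchi", "ennyo"],
    "邊": ["mizukara", "ukanmuri", "hitoashi", "ho", "ennyo"],
}

scores = {ch: component_similarity(EXTERNAL_C2, parts) for ch, parts in CANDIDATES.items()}
print(scores)
```

Under these assumptions the sketch yields 0, 100, 420, and 320 points for "渡", "辺", "邉", and "邊", which agrees with the component similarities stated for FIG. 6.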

FIG. 8 is an example of calculating the similarity of component positions. Through the processing of FIG. 6, the apparatus selects the internal characters whose component similarity to the external character is at or above a predetermined reference value, and then calculates, for each of the selected internal characters, the similarity of its component positions to those of the external character.

In this embodiment, as described with reference to FIG. 3, the position of a component is specified by the number of one of the 16 sub-regions into which the normalized character area is divided. For example, the component positions of the external character (character C2) are shiro (白): 2, ukanmuri: 6, hachi (八): 8, kuchi (口): 10, and shinnyo: 1.

The component positions of the character "邉", on the other hand, are mizukara (自): 2, wakanmuri: 6, hachi (八): 8, kuchi (口): 10, and ennyo: 1. As shown in FIG. 8, the component positions of the external character (character C2) and of "邉" all coincide, so the apparatus awards 100 points for each component position. The total is therefore 500 points.

The component positions of the character "邊" are mizukara (自): 2, ukanmuri: 7, hitoashi: 6, ho (方): 10, and ennyo: 1. Of these, positions 2, 6, 10, and 1 coincide with component positions of the external character (character C2), so the apparatus awards 100 points for each. Position 7, however, has no counterpart among the component positions of the external character (character C2). Of the external character's component positions that remain unmatched, the nearest is 8. The apparatus therefore awards 90 points for the relationship between positions 7 and 8. The total is therefore 490 points.
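The position comparison of FIG. 8 can be sketched as below. The matching of exact positions first and the pairing of each leftover position with the nearest unmatched position follow the description above. The formula used to score a near miss (100 points minus 10 per step of difference between sub-region numbers) is an assumption chosen because it gives 90 points for positions 7 and 8; the function and variable names are hypothetical.

```python
from collections import Counter


def near_miss_score(a, b):
    """Assumed rule: 100 points minus 10 per step of sub-region number difference."""
    return max(0, 100 - 10 * abs(a - b))


def position_similarity(external_positions, candidate_positions):
    """Score candidate component positions against the external character's positions."""
    remaining = Counter(external_positions)
    unmatched = []
    total = 0
    # First pass: exact position matches earn 100 points each.
    for pos in candidate_positions:
        if remaining[pos] > 0:
            remaining[pos] -= 1
            total += 100
        else:
            unmatched.append(pos)
    # Second pass: pair each leftover position with the nearest unmatched external position.
    leftovers = sorted(remaining.elements())
    for pos in unmatched:
        if not leftovers:
            break
        nearest = min(leftovers, key=lambda p: abs(p - pos))
        leftovers.remove(nearest)
        total += near_miss_score(pos, nearest)
    return total


# Positions of the external character C2: shiro 2, ukanmuri 6, hachi 8, kuchi 10, shinnyo 1.
EXTERNAL_C2_POSITIONS = [2, 6, 8, 10, 1]
# Positions of "邊": mizukara 2, ukanmuri 7, hitoashi 6, ho 10, ennyo 1.
HEN_POSITIONS = [2, 7, 6, 10, 1]

print(position_similarity(EXTERNAL_C2_POSITIONS, HEN_POSITIONS))
```

With these inputs the sketch gives 490 points for "邊", and a candidate whose five positions all coincide with those of the external character receives 500 points.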

FIG. 9 is another example of calculating the similarity of component positions. In this example, similar components of two characters C10 and C11 are associated with each other, and the similarity of the positions of the associated components is calculated. For example, characters C10 and C11 both have the component yama (山), so the apparatus associates the yama (山) of character C10 with the yama (山) of character C11 and evaluates their positions. Both components are at position 1, so the apparatus awards 100 points.

Characters C10 and C11 also both have the component magarigawa. The magarigawa of character C10 is at position 5, whereas the magarigawa of character C11 is at position 2. The apparatus therefore awards 70 points for the pair formed by the magarigawa of character C10 and the magarigawa of character C11.

Characters C10 and C11 also both have the component ta (田). The ta (田) of character C10 is at position 9, whereas the ta (田) of character C11 is at position 10. The apparatus therefore awards 90 points for the pair formed by the ta (田) of character C10 and the ta (田) of character C11. The points awarded for each positional relationship, as illustrated in FIG. 9, may be held as a position similarity dictionary, for example in the main memory of the information processing apparatus. Alternatively, the apparatus may calculate the score according to a computer program, for example by awarding 100 points when the sub-region positions of the two components coincide, 90 points when they are adjacent, and 10 points fewer for each further step of separation.
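The program-based scoring rule can be illustrated with the three component pairs of FIG. 9. The text does not define how the separation between two sub-regions is measured; the sketch below assumes it is the absolute difference of the sub-region numbers, which happens to reproduce the 100, 70, and 90 points given above. The function name and the romanized keys are hypothetical.

```python
def position_score(pos_a, pos_b):
    """Assumed rule: 100 if equal, 90 if the numbers differ by 1, then 10 less per step."""
    return max(0, 100 - 10 * abs(pos_a - pos_b))


# Component positions of characters C10 and C11 as given for FIG. 9.
C10 = {"yama": 1, "magarigawa": 5, "ta": 9}
C11 = {"yama": 1, "magarigawa": 2, "ta": 10}

# Associate components that appear in both characters and score their positions.
pair_scores = {name: position_score(C10[name], C11[name]) for name in C10 if name in C11}
print(pair_scores)
```

A lookup table held in memory, as the text also suggests, would replace `position_score` with a dictionary keyed by position pairs.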

FIG. 10 illustrates the effect of the processing of this embodiment. Through the processing illustrated in FIGS. 3 to 9, the apparatus can replace an external character in a sentence subject to morphological analysis with an internal character defined in the word dictionary 1 used for that analysis. As a result, for example, the external character C2 in FIG. 10 is replaced with the internal character "邉", and the external character C7 is replaced with the internal character "宵". Both "邉" and "宵" are obtained from the word dictionary 1 used in morphological analysis. The words "渡邉" and "阿宵月", in which the external characters have been replaced with internal characters, are therefore both likely to be defined in the word dictionary 1. Compared with simply replacing an external character with some internal character, the apparatus is thus more likely to convert a word containing an external character into an appropriate word that can be morphologically analyzed.

<Hardware configuration>
FIG. 11 illustrates the hardware configuration of the information processing apparatus. The apparatus has a Central Processing Unit (CPU) 11, a main storage device 12, and external devices connected through an interface 18, and executes information processing according to a program. Examples of the external devices are an external storage device 13 and a communication interface 14. The CPU 11 executes a computer program loaded in executable form into the main storage device 12 and thereby provides the functions of the apparatus. The CPU 11 is also called a processor. The main storage device 12 stores the computer program executed by the CPU 11, data processed by the CPU 11, and the like. The main storage device 12 is, for example, Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), or Read Only Memory (ROM). The external storage device 13 is used, for example, as a storage area that supplements the main storage device 12, and stores the computer program executed by the CPU 11, data processed by the CPU 11, and the like. The external storage device 13 is a hard disk drive, a Solid State Disk (SSD), or the like.

The apparatus may also have a user interface formed by an input device 15, a display device 16, and the like. The input device 15 is, for example, a keyboard or a pointing device. The display device 16 is, for example, a liquid crystal display or an electroluminescence panel. The apparatus may further include a removable storage medium drive 17. The removable storage medium is, for example, a Blu-ray disc, a Digital Versatile Disk (DVD), a Compact Disc (CD), or a flash memory card. Although FIG. 11 shows a single interface 18, a plurality of interfaces 18 of different types may be provided.

The information processing apparatus is, for example, a personal computer, a server that provides services over a network to personal computers, terminals, and the like, a personal digital assistant (PDA), or a mobile phone.

<Processing flow>
FIGS. 12 to 15 illustrate the processing flow of the information processing apparatus. FIG. 12 illustrates the overall processing flow. In the processing of FIG. 12, a sentence containing character code 0x6E21 and character code 0xE001 is shown as the input data, that is, as the target of morphological analysis. Of these, the character with code 0xE001 is the external character C2. In this embodiment, internal characters are defined in the range of character codes 0x0000 to 0xDFFF, and external characters are defined in the range from 0xE000 upward. The apparatus can therefore distinguish internal characters from external characters by character code range.

First, the apparatus determines whether the input data contains an external character (S1). S1 is an example of discrimination processing that identifies, in the document to be processed, an external character not included in the character standard handled by the information processing apparatus. The apparatus may identify external characters by the character code range from 0xE000 upward, as described above. S1 is therefore also an example of identifying an external character based on the range of the character code that specifies the character.
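The range check of S1 can be sketched in a few lines. The boundary 0xE000 comes from the embodiment; the function names are hypothetical, and the sketch assumes each character of the input string corresponds to one character code.

```python
EXTERNAL_START = 0xE000  # in this embodiment, codes from 0xE000 upward are external characters


def is_external(ch):
    """Return True if the character code falls in the external character range."""
    return ord(ch) >= EXTERNAL_START


def find_external(text):
    """Return (index, character) pairs for every external character in the text (S1)."""
    return [(i, ch) for i, ch in enumerate(text) if is_external(ch)]


# Input of FIG. 12: character code 0x6E21 followed by character code 0xE001.
sample = "\u6e21\ue001"
print(find_external(sample))
```

An empty result corresponds to the branch in which the input goes directly to morphological analysis.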

If the input data contains no external character, the apparatus executes morphological analysis (S15) on the input data as it is. If the input data contains an external character, the apparatus determines whether the external character identified in the input data has been recognized before or is being recognized for the first time (S2). External characters recognized in the past are registered, for example, in an area called a replacement table in the main storage device 12 or the external storage device 13. The apparatus can therefore determine whether the external character identified in the input data is a first-time external character by referring to the replacement table.

If the external character identified in the input data has been recognized before and is registered in the replacement table, the apparatus converts the code of that external character into the character code of the internal character associated with it in the replacement table (S14), and executes morphological analysis (S15).
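The replacement table lookup of S2 and the conversion of S14 might look like the following sketch. The table is modeled as a plain dictionary from external character to internal character, which is an assumption; the text only says that the table associates the two codes. The table entry mapping 0xE001 to "邉" is illustrative.

```python
EXTERNAL_START = 0xE000  # external characters start here in this embodiment


def apply_replacement_table(text, table):
    """Replace registered external characters (S14) and list unregistered ones (S2)."""
    replaced = []
    unregistered = []
    for ch in text:
        if ord(ch) >= EXTERNAL_START and ch not in table:
            # First-time external character: it still needs the processing from S3 onward.
            unregistered.append(ch)
            replaced.append(ch)
        else:
            replaced.append(table.get(ch, ch))
    return "".join(replaced), unregistered


# Hypothetical table content: external character 0xE001 (C2) mapped to "邉".
replacement_table = {"\ue001": "邉"}
text, pending = apply_replacement_table("\u6e21\ue001", replacement_table)
print(text, pending)
```

When `pending` is empty, the converted text can be passed to morphological analysis (S15) without further processing.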

If the external character identified in the input data is being recognized for the first time, the apparatus acquires the character shape of the external character (S3). The character shape of an external character is registered, for example, in an external character file in association with the character code of the external character. In this embodiment, the external character file holds a bitmap of the shape of each external character. The apparatus creates, from the bitmap of the external character, character shape information with the same structure as the OCR dictionary 2 and compares it against the character shapes in the OCR dictionary 2. Character shape information with the same structure as the OCR dictionary 2 is, for example, data in which the aspect ratio of the character is normalized and feature data is extracted for each predetermined direction (each radial direction) from the center of the character. This comparison is referred to as OCR processing. The apparatus determines whether the OCR processing recognized an internal character whose shape matches that of the external character with a score at or above a predetermined reference value (S4).

If an internal character whose shape matches that of the external character is recognized, the apparatus acquires it as a similar character resulting from the OCR processing. The apparatus then repeats the same processing for the other character shapes in the OCR dictionary 2 to obtain a list of similar characters (S5). The list may contain only a single similar character.

Next, the apparatus searches the word dictionary 1 for morphological analysis for the characters obtained in the list of similar characters (S6). The apparatus then determines whether more than one character was found in the word dictionary for morphological analysis as a result of the search in S6 (S7). If only one character was found, the apparatus selects that character (S12) and registers in the replacement table the combination of the character code of the external character and the character code of the character found in the word dictionary for morphological analysis (S13).

If it is determined in S7 that more than one character was found in the word dictionary for morphological analysis, the apparatus acquires the attribute information of those characters from the attribute dictionary 4. As illustrated in FIG. 3, the attribute dictionary 4 holds attribute information such as readings, components, and component positions for the characters of the word dictionary 1. The apparatus acquires the attribute information of the external character from the external character file and compares it with the attribute information of the characters that were found. If the external character file does not contain the attribute information of the external character, the apparatus may acquire it from the large-scale character set 3.

The apparatus then selects the character most similar to the attribute information of the external character. At this point, the apparatus determines whether more than one character is tied as most similar to the attribute information of the external character (S9). If only one character is most similar, the apparatus executes the processing from S12 onward. If it is determined in S9 that more than one character is most similar, the apparatus selects a character in the JIS area (S11) and registers in the replacement table the combination of the external character and the selected character (internal character) (S13). The processing of S5 to S9, S11, and S12 is an example of selecting the internal character that corresponds to matched character shape information, by comparing the character shape information defining the shape of the identified external character against the character shape information contained in the replacement dictionary.
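The selection logic from S6 through S12 can be outlined as follows. This is a sketch under several assumptions: the word dictionary and the JIS area are modeled as sets of characters, the attribute comparison is passed in as a scoring function, and returning `None` when no candidate is in the word dictionary is a placeholder, since the text does not describe that case. All names are hypothetical.

```python
def select_internal(similar_chars, dictionary_chars, attribute_score, jis_chars):
    """Pick a replacement internal character from the OCR similar-character list."""
    # S6: keep only the similar characters that appear in the word dictionary.
    found = [c for c in similar_chars if c in dictionary_chars]
    if not found:
        return None  # case not covered by the text; placeholder for this sketch
    # S7 -> S12: a single hit is selected directly.
    if len(found) == 1:
        return found[0]
    # Compare attribute information and keep the most similar characters.
    scores = {c: attribute_score(c) for c in found}
    best = max(scores.values())
    top = [c for c in found if scores[c] == best]
    # S9 -> S12: a unique best character is selected.
    if len(top) == 1:
        return top[0]
    # S11: on a tie, prefer a character in the JIS area.
    in_jis = [c for c in top if c in jis_chars]
    return in_jis[0] if in_jis else top[0]


# Illustrative attribute scores: the component similarities stated for FIG. 6.
fig6_scores = {"辺": 100, "邉": 420, "邊": 320}
choice = select_internal(["辺", "邉", "邊"], set(fig6_scores), fig6_scores.get, {"辺"})
print(choice)
```

The chosen character would then be registered in the replacement table together with the external character (S13).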

If it is determined in S4 that the OCR processing could not recognize an internal character whose shape matches that of the external character, the apparatus executes a search using the large-scale character set 3 (S10). The apparatus then registers in the replacement table the combination of the external character and the character (internal character) found by means of the large-scale character set (S13). After S13, the apparatus replaces the external character code in the input data with the character code of the character obtained in the processing from S3 to S10 (S14), and then executes morphological analysis (S15). S14 is an example of replacement processing that replaces an external character with an internal character included in the character standard, based on a replacement dictionary generated from a first dictionary used for morphological analysis. S14 is also an example of replacing the external character with the selected internal character. S15 is an example of analysis processing that analyzes, using the first dictionary, a document in which external characters have been replaced with internal characters.

FIG. 13 is a flowchart illustrating the details of the search using the large-scale character set 3 (S10 in FIG. 12). In this processing, the apparatus searches the large-scale character set 3 based on the character shape of the external character (S101). More specifically, the apparatus compares the shape of the external character against the character shapes registered in the large-scale character set. The processing of S101 is the same as the OCR processing described with reference to FIG. 12.

The apparatus then determines whether a character of the large-scale character set 3 was recognized that matches the shape of the external character with a score at or above a predetermined reference value (S102). The processing of S101 and S102 is an example of determining the character in a second dictionary (the large-scale character set 3) that corresponds to the external character, when the external character could not be replaced with an internal character by comparison against the replacement dictionary (OCR dictionary 2). The determination is made by comparing the character shape information of the external character against the second dictionary, which contains character shape information for characters outside the range of the character standard.

If a character of the large-scale character set 3 is recognized that matches at or above the predetermined reference value, the apparatus acquires the attribute information of the recognized character (S103). Here, the large-scale character set is assumed to hold attribute information for its characters. The apparatus then acquires from the attribute dictionary 4 the characters whose attribute information is similar to that of the character obtained from the large-scale character set. As illustrated in FIG. 5, the attribute dictionary 4 holds attribute information such as readings, components, and component positions for each character registered in the word dictionary 1 used in morphological analysis. Following the same procedure as illustrated in FIGS. 6 to 10, the apparatus acquires from the attribute dictionary 4, for example, characters whose components and component positions are similar (S104). S104 is an example of acquiring, from the first dictionary (word dictionary 1), an internal character to replace the external character based on the similarity to the determined character. S104 and the processing of FIGS. 6 to 10 are examples of calculating similarity from the shapes of the components contained in each character shape and the positions at which those components are placed within the character shape.

FIG. 14 is another example of the details of the search using the large-scale character set 3 (S10 in FIG. 12). In the processing of FIG. 13, the apparatus searched the large-scale character set 3 by OCR processing based on the character shape of the external character. The processing of FIG. 14 differs in that the apparatus first decomposes the external character into components (S101A) and then searches the large-scale character set 3 based on those components (S101B). The steps from S102 onward in FIG. 14 are the same as in FIG. 13, so their description is omitted.

If the component shapes and component codes of the external character are registered in the external character file that defines the external character (see FIG. 12), the apparatus may decompose the external character into components in S101A by referring to the external character file. If the external character file holds the character shape information of the external character but not its component shapes and component codes, the apparatus decomposes the external character into components according to FIG. 15.

FIG. 15 is a diagram illustrating the process of decomposing an external character into components. The following process assumes that a component font file defining the shapes of components is stored in the main storage device 12 or the external storage device 13. In this process, the information processing apparatus acquires the character shape information of the external character from the external character file (A1). Next, the information processing apparatus acquires the shape of the next component from the component font file (A2). It then collates the shape of the component against the character shape of the external character (A3) and determines whether the component shape matches a portion of the external character with a score equal to or higher than a predetermined reference value (A4).

If the determination in A4 finds that the component shape does not match any portion of the external character, the information processing apparatus returns control to A2 and repeats the same processing for the next component. The repetition of A2 to A4 ends when no next component remains in the component font file. If, on the other hand, the determination in A4 finds that the component shape matches a portion of the character shape of the external character with a score equal to or higher than the predetermined reference value, the information processing apparatus records the component code indicating the matched component shape together with the position of the component, and masks the corresponding portion of the external character (A5). The information processing apparatus then determines whether any portion remains other than the masked portions (A6). If a portion remains, the information processing apparatus returns control to A2 and continues the process. If no portion remains, the information processing apparatus ends the process.
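
The loop A1 to A6 can be sketched as follows. This is only an illustration under simplifying assumptions: a character shape is modeled as a set of inked (row, column) cells on a small grid, a component is matched by sliding it over the grid, and the score is the fraction of the component's cells that land on unmasked ink. The grid model, the scoring rule, the names `best_match` and `decompose`, and the sample component codes are hypothetical and are not taken from the component font file described above.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]  # (row, column) of an inked cell


def best_match(part: Set[Pixel], glyph: Set[Pixel], size: int) -> Tuple[float, Pixel]:
    """Slide a non-empty part over a size x size grid; return (score, offset)."""
    best_score, best_offset = 0.0, (0, 0)
    for dr in range(size):
        for dc in range(size):
            shifted = {(r + dr, c + dc) for r, c in part}
            score = len(shifted & glyph) / len(part)
            if score > best_score:
                best_score, best_offset = score, (dr, dc)
    return best_score, best_offset


def decompose(
    glyph: Set[Pixel],
    part_font: Dict[str, Set[Pixel]],
    size: int,
    threshold: float = 1.0,
) -> List[Tuple[str, Pixel]]:
    """Return the (component code, position) pairs found in the glyph."""
    remaining = set(glyph)  # A1: shape of the external character
    found: List[Tuple[str, Pixel]] = []
    for code, part in part_font.items():  # A2: next component
        score, (dr, dc) = best_match(part, remaining, size)  # A3: collate
        if score < threshold:  # A4: no match, try the next component
            continue
        found.append((code, (dr, dc)))  # A5: record code and position
        remaining -= {(r + dr, c + dc) for r, c in part}  # A5: mask
        if not remaining:  # A6: nothing left to explain
            break
    return found


# Hypothetical component font: a two-cell horizontal bar and a four-cell
# vertical bar, each given relative to its own top-left corner.
PART_FONT: Dict[str, Set[Pixel]] = {
    "HBAR": {(0, 0), (0, 1)},
    "VBAR": {(0, 0), (1, 0), (2, 0), (3, 0)},
}

# 4 x 4 external character: vertical bar on the left, short bar top right.
GLYPH: Set[Pixel] = {(0, 0), (1, 0), (2, 0), (3, 0), (0, 2), (0, 3)}

parts = decompose(GLYPH, PART_FONT, size=4)
```

Masking after each match keeps a later component from being matched against ink that an earlier component has already accounted for.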

As described above, the information processing apparatus stores the OCR dictionary 2 created from the word dictionary 1 used for morphological analysis. When an external character is recognized in the text subject to morphological analysis, that is, in the input data, the information processing apparatus performs OCR processing on the external character using the OCR dictionary 2 and acquires a list of similar characters. If the acquired list contains exactly one similar character, the information processing apparatus replaces the character code of the external character in the input data with the character code of the similar character (internal character) obtained by the OCR processing, and then executes morphological analysis. The character code that replaces the external character's code is one registered in the word dictionary 1 used in morphological analysis, so a word in the input data obtained by the replacement is likely to be registered in the word dictionary 1. The information processing apparatus can therefore raise the likelihood that morphological analysis is performed appropriately on input data containing external characters, compared with the conventional approach.
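
The replacement step summarized above can be sketched as follows. The sketch assumes that character shapes are small bitmaps and that the OCR score is simply the fraction of cells on which two bitmaps agree; a real OCR engine would use a different measure. The bitmaps, the threshold of 0.8, and the names `shape_score`, `similar_characters`, and `replace_external` are hypothetical. Only the single-candidate case described in the text is handled; other cases leave the character unchanged.

```python
from typing import Dict, List, Tuple

Bitmap = Tuple[str, ...]  # rows of '#' (ink) and '.' (blank)


def shape_score(a: Bitmap, b: Bitmap) -> float:
    """Fraction of cells that agree between two equally sized bitmaps."""
    cells = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(x == y for x, y in cells) / len(cells)


def similar_characters(
    shape: Bitmap, ocr_dictionary: Dict[str, Bitmap], threshold: float
) -> List[str]:
    """Internal characters whose shape scores at or above the threshold."""
    return [
        char
        for char, reference in ocr_dictionary.items()
        if shape_score(shape, reference) >= threshold
    ]


def replace_external(
    text: str,
    external_file: Dict[str, Bitmap],
    ocr_dictionary: Dict[str, Bitmap],
    threshold: float = 0.8,
) -> str:
    """Replace each external character that has exactly one similar character."""
    result = []
    for char in text:
        if char in external_file:
            candidates = similar_characters(
                external_file[char], ocr_dictionary, threshold
            )
            if len(candidates) == 1:
                char = candidates[0]
        result.append(char)
    return "".join(result)


# Stand-in for OCR dictionary 2: internal character -> shape.
OCR_DICTIONARY: Dict[str, Bitmap] = {
    "T": ("###", ".#.", ".#."),
    "L": ("#..", "#..", "###"),
}

# Stand-in for the external character file: code point -> shape.
EXTERNAL_FILE: Dict[str, Bitmap] = {
    "\ue000": ("###", ".#.", "##."),
}

replaced = replace_external("A\ue000B", EXTERNAL_FILE, OCR_DICTIONARY)
```

After the replacement, the text contains only characters that the word dictionary can hold, so it can be passed to the morphological analyzer unchanged.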

The OCR dictionary 2 holds the character shapes of internal characters in a format suited to OCR processing. The information processing apparatus acquires the character shape of the external character from the external character file based on the character code of the external character, and selects an internal character defined in the OCR dictionary 2 by collating the character shape of the external character against the character shapes registered in the OCR dictionary 2. Therefore, when the information processing apparatus recognizes the character code of an external character in the input data, it can appropriately select an internal character with a similar shape.

When the information processing apparatus cannot select, using the OCR dictionary 2, an internal character that matches the character shape of the external character with a score equal to or higher than the predetermined reference value, it searches the character shapes of the large-scale character collection 3 based on the character shape of the external character. The large-scale character collection 3 holds character codes, shapes, attributes, and the like collected by various organizations, companies, and institutions. The information processing apparatus is therefore likely to be able to identify the external character in the large-scale character collection 3 from its character shape. Once it has identified the external character there, the information processing apparatus can acquire attribute information such as the reading, components, and component positions of the external character from the large-scale character collection 3, and use the attribute dictionary 4 to search for internal characters whose attributes are similar to those of the external character. Processing with the large-scale character collection 3 thus further raises the likelihood that the external character can be identified. Consequently, even when the information processing apparatus cannot identify an internal character corresponding to the external character using the OCR dictionary 2, the external character can still be replaced with an internal character by means of the large-scale character collection 3 and the attribute dictionary 4.
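
The order of the two lookups can be sketched as a small control-flow example. The three lookups are passed in as plain functions so that the sketch stays independent of how OCR matching and attribute comparison are implemented; the function names, the string keys standing in for character shapes, and the sample tables are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def replacement_candidates(
    shape: str,
    ocr_lookup: Callable[[str], List[str]],
    collection_lookup: Callable[[str], Optional[str]],
    attribute_lookup: Callable[[str], List[str]],
) -> List[str]:
    """Try OCR dictionary 2 first, then fall back to collection 3 and dictionary 4."""
    candidates = ocr_lookup(shape)
    if candidates:
        return candidates
    identified = collection_lookup(shape)  # identify the character in collection 3
    if identified is None:
        return []
    return attribute_lookup(identified)  # similar internal characters from dictionary 4


# Hypothetical stand-ins for the three data sources, keyed by a shape label.
OCR_TABLE: Dict[str, List[str]] = {"shape-1": ["A"]}
COLLECTION_TABLE: Dict[str, str] = {"shape-2": "X1"}
ATTRIBUTE_TABLE: Dict[str, List[str]] = {"X1": ["B"]}


def lookup(shape: str) -> List[str]:
    return replacement_candidates(
        shape,
        lambda s: OCR_TABLE.get(s, []),
        COLLECTION_TABLE.get,
        lambda ch: ATTRIBUTE_TABLE.get(ch, []),
    )


direct = lookup("shape-1")    # found by the OCR dictionary
fallback = lookup("shape-2")  # found through the large-scale collection
unknown = lookup("shape-3")   # found by neither
```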

The information processing apparatus selects an internal character similar to the external character from the attribute dictionary 4 using, for example, the components and the component positions as attribute information. It can therefore search for an internal character similar to the external character based on a fine-grained, component-by-component comparison between the two. In addition, the radicals that make up kanji each carry a meaning. The information processing apparatus can therefore select an internal character to replace the external character taking into account not only shape but also the meaning that the components contribute to the kanji.

In the present embodiment, the information processing apparatus determines whether a character is an external character based on the range of its character code. The information processing apparatus can therefore identify external characters in the input data simply and reliably.
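
A range check of this kind can be sketched in a few lines. The text does not state which encoding or which code range is used, so the sketch assumes Unicode input and treats the Basic Multilingual Plane Private Use Area (U+E000 to U+F8FF) as the external-character range; the range table and the function names are hypothetical.

```python
from typing import List, Tuple

# Assumed external-character range: the Unicode BMP Private Use Area.
# A different environment would list its own user-defined code ranges here.
EXTERNAL_RANGES: List[Tuple[int, int]] = [(0xE000, 0xF8FF)]


def is_external(char: str) -> bool:
    """True if the character's code point falls in an external range."""
    code = ord(char)
    return any(low <= code <= high for low, high in EXTERNAL_RANGES)


def external_positions(text: str) -> List[int]:
    """Indexes of the external characters in the input text."""
    return [index for index, char in enumerate(text) if is_external(char)]


positions = external_positions("a\ue000b\uf8ffc")
```

Because the test only compares code points, it needs no dictionary lookup and runs in a single pass over the input.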

<Recording medium>
A program that causes a computer or other machine or apparatus (hereinafter, a computer or the like) to realize any of the functions described above can be recorded on a recording medium readable by the computer or the like. The function can then be provided by causing the computer or the like to read and execute the program on the recording medium.

Here, a recording medium readable by a computer or the like is a recording medium that stores information such as data and programs by electrical, magnetic, optical, mechanical, or chemical action and that can be read by the computer or the like. Examples of such recording media that are removable from the computer or the like include a flexible disk, a magneto-optical disk, a Compact Disc Read Only Memory (CD-ROM), a CD-Recordable (CD-R), a Digital Versatile Disk (DVD), a Blu-ray Disc, a Digital Audio Tape (DAT), an 8 mm tape, and memory cards such as flash memory. Recording media fixed to the computer or the like include a hard disk and a ROM (Read Only Memory). A Solid State Drive (SSD) can be used either as a recording medium removable from the computer or the like or as a recording medium fixed to it.

<Others>
This embodiment includes aspects referred to as the following supplementary notes. Components of each aspect may be combined with components of other aspects.
(Supplementary note 1)
An information processing apparatus for processing a document, comprising
a processor that executes a process of: determining, in a document to be processed, an external character not included in a character standard handled by the information processing apparatus; replacing the external character with an internal character included in the character standard, based on a replacement dictionary generated from a first dictionary used for morphological analysis; and analyzing, using the first dictionary, the document in which the external character has been replaced with the internal character.
(Supplementary note 2)
The information processing apparatus according to supplementary note 1, wherein
the replacement dictionary includes character shape information defining the character shapes of the internal characters included in the first dictionary, and
the processor selects an internal character corresponding to the collated character shape information by collating character shape information defining the character shape of the determined external character against the character shape information included in the replacement dictionary, and replaces the external character with the selected internal character.
(Supplementary note 3)
The information processing apparatus according to supplementary note 1 or 2, wherein,
when the external character could not be replaced with an internal character by collation with the replacement dictionary, the processor further determines a character in a second dictionary corresponding to the external character by collating the character shape information of the external character against the second dictionary, which includes character shape information of characters not included in the character standard, and acquires, from the first dictionary, an internal character for replacing the external character based on the degree of similarity with the determined character.
(Supplementary note 4)
The information processing apparatus according to supplementary note 3, wherein
the processor calculates the similarity from the shapes of the components included in each character shape and the positions at which the components are arranged within the character shape.
(Supplementary note 5)
The information processing apparatus according to any one of supplementary notes 1 to 4, wherein
the processor determines the external character based on a range of character codes that identify characters.
(Supplementary note 6)
A program that causes a computer to execute:
a determination process of determining, in a document to be processed, an external character not included in a character standard handled by the information processing apparatus; a replacement process of replacing the external character with an internal character included in the character standard, based on a replacement dictionary generated from a first dictionary used for morphological analysis; and an analysis process of analyzing, using the first dictionary, the document in which the external character has been replaced with the internal character.
(Supplementary note 7)
The program according to supplementary note 6, wherein
the replacement dictionary includes character shape information defining the character shapes of the internal characters included in the first dictionary, and
the replacement process selects an internal character corresponding to the collated character shape information by collating character shape information defining the character shape of the determined external character against the character shape information included in the replacement dictionary, and replaces the external character with the selected internal character.
(Supplementary note 8)
The program according to supplementary note 6 or 7, further causing the computer to execute,
when the external character could not be replaced with an internal character by collation with the replacement dictionary: determining a character in a second dictionary corresponding to the external character by collating the character shape information of the external character against the second dictionary, which includes character shape information of characters not included in the character standard; and acquiring, from the first dictionary, an internal character for replacing the external character based on the degree of similarity with the determined character.
(Supplementary note 9)
The program according to supplementary note 8,
causing the computer to calculate the similarity from the shapes of the components included in each character shape and the positions at which the components are arranged within the character shape.
(Supplementary note 10)
The program according to any one of supplementary notes 6 to 9,
causing the computer to determine the external character based on a range of character codes that identify characters.
(Supplementary note 11)
An information processing method in which a computer executes:
determining, in a document to be processed, an external character not included in a character standard handled by the information processing apparatus; replacing the external character with an internal character included in the character standard, based on a replacement dictionary generated from a first dictionary used for morphological analysis; and analyzing, using the first dictionary, the document in which the external character has been replaced with the internal character.
(Supplementary note 12)
The information processing method according to supplementary note 11, wherein
the replacement dictionary includes character shape information defining the character shapes of the internal characters included in the first dictionary, and
the computer
selects an internal character corresponding to the collated character shape information by collating character shape information defining the character shape of the determined external character against the character shape information included in the replacement dictionary, and replaces the external character with the selected internal character.
(Supplementary note 13)
The information processing method according to supplementary note 11 or 12, wherein,
when the external character could not be replaced with an internal character by collation with the replacement dictionary, the computer further determines a character in a second dictionary corresponding to the external character by collating the character shape information of the external character against the second dictionary, which includes character shape information of characters not included in the character standard, and acquires, from the first dictionary, an internal character for replacing the external character based on the degree of similarity with the determined character.
(Supplementary note 14)
The information processing method according to supplementary note 13, wherein
the computer calculates the similarity from the shapes of the components included in each character shape and the positions at which the components are arranged within the character shape.
(Supplementary note 15)
The information processing method according to any one of supplementary notes 11 to 14, wherein
the computer determines the external character based on a range of character codes that identify characters.

DESCRIPTION OF SYMBOLS
1 Word dictionary
2 OCR dictionary
3 Large-scale character collection
4 Attribute dictionary
5 Component similarity dictionary
11 CPU
12 Main storage device
13 External storage device

Claims

A program that causes an information processing apparatus to execute:
a determination process of determining, in a document to be processed, an external character that is not included in a character standard handled by the information processing apparatus;
a replacement process of replacing the external character with an internal character included in the character standard, based on a replacement dictionary generated from individual character codes, obtained by decomposing combinations of character codes corresponding to words registered in a first dictionary used for morphological analysis, and from character shapes corresponding to the individual character codes; and
an analysis process of analyzing, using the first dictionary, the document in which the external character has been replaced with the internal character.

The program according to claim 1, wherein
the replacement dictionary includes character shape information defining the character shapes of the internal characters included in the first dictionary, and
the replacement process selects an internal character corresponding to the collated character shape information by collating character shape information defining the character shape of the determined external character against the character shape information included in the replacement dictionary, and replaces the external character with the selected internal character.

The program according to claim 1 or 2, further causing the information processing apparatus to execute,
when the external character could not be replaced with an internal character by collation with the replacement dictionary: determining a character in a second dictionary corresponding to the external character by collating the character shape information of the external character against the second dictionary, which includes character shape information of characters not included in the character standard; and acquiring, from the first dictionary, an internal character for replacing the external character based on the degree of similarity with the determined character.

The program according to claim 3,
causing the information processing apparatus to calculate the similarity from the shapes of the components included in each character shape and the positions at which the components are arranged within the character shape.

The program according to any one of claims 1 to 4,
causing the information processing apparatus to determine the external character based on a range of character codes that identify characters.

An information processing method comprising:
determining, in a document to be processed, an external character that is not included in a character standard handled by an information processing apparatus;
replacing the external character with an internal character included in the character standard, based on a replacement dictionary generated from individual character codes, obtained by decomposing combinations of character codes corresponding to words registered in a first dictionary used for morphological analysis, and from character shapes corresponding to the individual character codes; and
analyzing, using the first dictionary, the document in which the external character has been replaced with the internal character.

An information processing apparatus comprising a processor that executes a process of: determining, in a document to be processed, an external character that is not included in a character standard handled by the information processing apparatus; replacing the external character with an internal character included in the character standard, based on a replacement dictionary generated from individual character codes, obtained by decomposing combinations of character codes corresponding to words registered in a first dictionary used for morphological analysis, and from character shapes corresponding to the individual character codes; and analyzing, using the first dictionary, the document in which the external character has been replaced with the internal character.