JPH08293005A

JPH08293005A - Japanese sentence reader

Info

Publication number: JPH08293005A
Application number: JP7096343A
Authority: JP
Inventors: Yukiko Chiba; 由紀子千葉
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1995-04-21
Filing date: 1995-04-21
Publication date: 1996-11-05

Abstract

PURPOSE: To improve the recognition rate for character strings consisting only of katakana. For a string written in katakana, the device outputs a combination of candidate words that fills the katakana area from its first to its last character with no overlap; if no such combination exists, it outputs the first candidate characters as they are. CONSTITUTION: A character recognition unit 5 produces, for each input character, a set of candidate characters and their distance values, ordered by increasing distance between the input character shape and the standard character shapes, and designates any area where two or more katakana characters occur in succession as a katakana area. A word matching unit 6 extracts, from the character strings that can be formed by combining the sequence of candidate sets, those contained in a word dictionary 7. For a string determined to be a katakana area, an output forming unit 8 selects only a combination of candidate words that fills the area from beginning to end with katakana and without overlap. If no such combination is available, the unit 8 outputs the first-ranked recognition characters for the entire katakana area.

Description

[Detailed Description of the Invention]

[0001]

[Field of Industrial Application] The present invention relates to a Japanese sentence reader that recognizes and reads the characters constituting Japanese text.

[0002]

[Prior Art] In a conventional Japanese sentence reader, the document image to be read is generally divided, automatically or manually, into character, table, ruled-line, figure, and photograph areas, and character recognition is performed on the character areas and on the character portions of the table areas. To improve the reading accuracy of the character recognition result, post-processing is widely applied: word matching or linguistic processing is performed using the sequence of candidate character sets corresponding to the input character string together with a word dictionary, and a matching word is taken as the answer for the input character string.

[0003] However, new words, chiefly loanwords (including foreign proper nouns), and compound words formed by combining words already registered in the dictionary are now appearing one after another. When word matching is applied to a word that is not registered in the dictionary, the word may be erroneously converted into a word that is registered. One way to improve the recognition rate is to increase the number of words in the dictionary. However, if every new word and compound word is registered, the larger dictionary lengthens the word matching time, which may slow overall processing and increase the storage capacity required. Moreover, new words are constantly being coined, so trying to raise the recognition rate by continually adding them to the word dictionary risks becoming an endless cycle.

[0004]

[Problem to Be Solved by the Invention] In the character recognition of conventional Japanese sentence readers, the number of words registered in the dictionary has been increased in order to improve the accuracy of the recognition result, but this increases the word matching time and the storage capacity required for processing. An object of the present invention is to realize a Japanese sentence reader that improves the recognition rate for those portions, among the locations recognized as characters, that consist only of katakana, without excessively increasing the number of words registered in the word dictionary.

[0005]

[Means for Solving the Problem] The present invention is characterized by comprising the following units. A scanning unit scans the character portions of a document image of Japanese text that has been divided into areas such as characters, tables, ruled lines, figures, and photographs, photoelectrically converts them, and outputs an image signal. A character recognition unit takes the degree of similarity between the image signal and each pre-registered character as that character's distance value, selects the characters whose distance value lies within a predetermined range as candidate characters in order of increasing distance value, and designates the candidate character with the smallest distance value as the first candidate character. If all candidate characters lying within a distance range of a fixed multiple of the first candidate character's distance value are katakana, the unit judges that the corresponding image signal is a katakana character, and when two or more katakana characters occur in succession, it designates the area of those consecutive characters as a katakana area. A word matching unit selects a word as a candidate word when a word that matches or approximates a character string made up of candidate characters is found in a word dictionary in which words for matching are registered. An output forming unit selects, from the group of candidate words obtained by the word matching unit, the words to be used as the reading result and outputs them. For a katakana area, if there is a combination of candidate words that can fill the area from beginning to end with katakana and without overlap, the output forming unit outputs only that combination; if there is no such combination, it judges the character string to be an unregistered katakana word and outputs the first candidate characters for the entire katakana area.

[0006]

[Operation] With the device configured as above, a Japanese document is captured as a document image, and the document image is divided into areas such as characters, tables, ruled lines, figures, and photographs. The character portions, such as the character areas and the table areas, are scanned by the scanning unit and photoelectrically converted, and an image signal is output.

[0007] The character recognition unit selects characters similar to this image signal as candidate characters and designates the katakana areas. The word matching unit matches character strings made up of candidate characters against the word dictionary and, when a matching or approximating word is found in the dictionary, selects that word as a candidate word. The output forming unit selects, from the group of candidate words obtained by the word matching unit, the words to be used as the reading result and outputs them. For a character string judged to be a katakana area, however, it outputs a combination of candidate words only when that combination can fill the katakana area from beginning to end with katakana and without overlap. If there is no such combination, it judges the character string to be an unregistered katakana word that is not in the word dictionary and outputs the corresponding first candidate characters for the entire katakana area.

[0008]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing an embodiment of the present invention. In the figure, reference numeral 1 denotes a form to be read. Reference numeral 2 denotes an image input device, which captures the form 1 as a document image. Reference numeral 3 denotes a layout analysis unit, which divides the document image, automatically or manually, into areas such as characters, tables, ruled lines, figures, and photographs.

[0009] Reference numeral 4 denotes a scanning unit, which scans the areas judged by the layout analysis unit 3 to be character areas and the character portions within table areas, and transfers the image signal obtained by photoelectric conversion to a character recognition unit 5. The character recognition unit 5 calculates the distance between the shape of each input character and the standard shape of each pre-registered character, forms a set of candidate characters and their distance values ordered by increasing distance (that is, in order of similarity of shape), identifies katakana characters, and designates any area in which two or more katakana characters occur in succession as a katakana area.

[0010] Reference numeral 6 denotes a word matching unit, which receives the character recognition result of the character recognition unit 5 and segments it at appropriate positions. From among the character strings that can be formed by combining the sequence of sets of candidate characters and distance values described above, it searches a word dictionary 7 and extracts those that exist in the dictionary. Reference numeral 8 denotes an output forming unit, which selects the words and characters to be output from the group of candidate words extracted by the word matching unit 6.

[0011] Next, the processing procedure of the Japanese sentence reader configured as above is described using a concrete recognition target. FIG. 2 is an explanatory diagram showing an example of character recognition according to the present invention. As shown in the figure, the example recognition target here is the sentence 「インドのクーリッシュ氏はサブサンプルを用いて説明した。」 ("Mr. Kūrisshu of India gave an explanation using subsamples.").

[0012] The form 1 bearing this recognition target is captured by the image input device 2 as a document image. The layout analysis unit 3 divides the document image, automatically or manually, into areas such as characters, tables, ruled lines, figures, and photographs. Of these areas, the character portions, such as the character areas and the table areas, are scanned by the scanning unit 4, and the resulting image signal is transferred to the character recognition unit 5.

[0013] The character recognition unit 5 performs recognition by calculating the distance between the shape of each input character and the standard shape of each pre-registered character. As shown in FIG. 2, candidate characters are obtained here by selecting the first-, second-, and third-ranked recognition results in order of increasing distance, that is, starting from the character judged most similar in shape. The candidates need not be limited to the top three ranks; as many as required may be extracted, starting from the smallest distance.
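The ranking step in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `rank_candidates`, the `max_distance` cutoff (standing in for the "predetermined range" of distance values), and the sample distances are all hypothetical.

```python
from typing import Dict, List, Tuple


def rank_candidates(
    distances: Dict[str, float], max_distance: float, top_n: int = 3
) -> List[Tuple[str, float]]:
    """Return up to top_n (character, distance) pairs, nearest first."""
    # Keep only characters whose distance falls inside the allowed range.
    in_range = [(ch, d) for ch, d in distances.items() if d <= max_distance]
    # Smaller distance means a more similar shape, so sort ascending.
    in_range.sort(key=lambda pair: pair[1])
    return in_range[:top_n]


# Hypothetical distances between one input glyph and four standard shapes.
distances = {"ン": 12.0, "ソ": 10.0, "リ": 18.0, "シ": 30.0}
candidates = rank_candidates(distances, max_distance=25.0)
first_candidate = candidates[0][0]  # the character with the smallest distance
```

Raising `top_n` corresponds to the remark that more than three ranks may be used when needed.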

[0014] Areas judged to be katakana by character recognition are shown in the figure as katakana areas. The judgment works as follows: for each character in the recognition result, if every candidate from the first candidate character up to the candidates whose distance is within a fixed multiple of the first candidate character's distance is katakana, the character at that position is judged to be a katakana character. When two or more such katakana characters occur in succession, the consecutive area is designated as a katakana area. In the example of FIG. 2, the katakana areas are 「イソト」, 「クーリッシュ」, and 「サフサシグレ」 (each read from the first-ranked recognition results).
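The katakana judgment described in this paragraph might look like the sketch below. The multiplier value (`factor=1.5`), the use of the Unicode Katakana block as the definition of katakana, the function names, and the sample candidates are assumptions for illustration; the patent only says "a fixed multiple".

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (character, distance), nearest first


def is_katakana(ch: str) -> bool:
    # Assumption: "katakana" means the Unicode Katakana block U+30A0..U+30FF,
    # which includes the long-vowel mark "ー" (U+30FC).
    return "\u30a0" <= ch <= "\u30ff"


def is_katakana_position(cands: List[Candidate], factor: float) -> bool:
    # Every candidate whose distance is within factor * (first candidate's
    # distance) must be katakana for the position to count as katakana.
    limit = cands[0][1] * factor
    return all(is_katakana(ch) for ch, dist in cands if dist <= limit)


def katakana_areas(
    lattice: List[List[Candidate]], factor: float = 1.5
) -> List[Tuple[int, int]]:
    # Return half-open (start, end) ranges of runs of 2+ katakana positions.
    flags = [is_katakana_position(cands, factor) for cands in lattice]
    areas, start = [], None
    for i, flag in enumerate(flags + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= 2:
                areas.append((start, i))
            start = None
    return areas


# Hypothetical candidates for the four characters of 「インドの」.
lattice = [
    [("イ", 10.0), ("ィ", 12.0)],
    [("ソ", 10.0), ("ン", 11.0)],
    [("ト", 10.0), ("ド", 12.0)],
    [("の", 8.0), ("ノ", 20.0)],
]
areas = katakana_areas(lattice)
```

In this sample the fourth position is not katakana because its first candidate is hiragana, so the only katakana area is the three-character run at the start.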

[0015] FIG. 3 is an explanatory diagram showing an example of word matching according to the present invention. The character recognition result is segmented, for example, from the beginning in units delimited by punctuation marks. Each candidate character of each character is then assumed to be the first character of a word and matched against the word dictionary 7 to extract candidate words. The character string to be output is selected from the candidate words and candidate characters obtained in this way. FIG. 4 is an explanatory diagram of an example of selection by word length and average candidate rank, in which words are selected so that candidate words do not overlap one another, giving priority to the word with the greatest length and the highest average candidate rank. Any character that does not become part of any candidate word is output as its first-ranked recognition character. The 「クー」 of 「クー（ラッシュ）」 and the 「サ」 of 「サ（ジャングル）」 in FIG. 4 are examples of this.
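A simple way to picture the candidate-word extraction is a brute-force lookup over the candidate characters at each position. The sketch below handles exact matches only (the "approximate" matching mentioned elsewhere in the text is left out), and the candidate characters and dictionary contents are hypothetical, since the contents of FIG. 3 are not reproduced here.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, word), end exclusive


def candidate_words(lattice: List[List[str]], dictionary: Set[str]) -> List[Span]:
    """Find dictionary words that can be spelled from the candidate characters."""
    found = []
    for start in range(len(lattice)):
        for word in dictionary:
            end = start + len(word)
            if end > len(lattice):
                continue  # the word would run past the end of the segment
            # Each character of the word must be a candidate at its position.
            if all(word[k] in lattice[start + k] for k in range(len(word))):
                found.append((start, end, word))
    return sorted(found)


# Hypothetical candidate characters for the six positions of 「クーリッシュ」.
lattice = [["ク", "タ"], ["ー", "一"], ["リ", "ラ"],
           ["ッ", "ツ"], ["シ", "ン"], ["ュ", "ユ"]]
dictionary = {"ラッシュ", "インド", "サブ"}
words = candidate_words(lattice, dictionary)
```

A production system would more likely index the dictionary (for example with a trie) rather than scan every word at every position; the scan is used here only to keep the example short.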

[0016] With the selection shown in FIG. 4, an unregistered katakana word such as 「クーリッシュ」, which is not in the dictionary, cannot be output as a candidate word. If no candidate word at all could be extracted from the candidate characters of 「クーリッシュ」, then 「クーリッシュ」 would be selected as the output string as is. However, as shown in FIG. 4, the word 「ラッシュ」 ("rush") is extracted from the candidate characters, so the output string becomes 「クーラッシュ」, and the knowledge processing ends up degrading the recognition result.

[0017] Likewise, consider a word such as 「サブサンプル」 ("subsample"), which is a compound of the two dictionary words 「サブ」 ("sub") and 「サンプル」 ("sample"). Even if both words can be extracted from the candidate character strings, once a word longer than either of them, such as 「ジャングル」 ("jungle"), is also extracted from the same string, the rule that favors longer words takes effect and the output string becomes 「サジャングル」.
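The two failure cases above can be reproduced with a small sketch of the FIG. 4 style selection. This is a simplified reading: it applies only the longest-word-first and no-overlap rules and omits the average-candidate-rank criterion; the function name and the word positions are assumptions chosen to match the strings quoted in the text.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, word), end exclusive


def baseline_output(first_rank: str, words: List[Span]) -> str:
    """Longest-word-first selection; uncovered positions keep first-rank characters."""
    out = list(first_rank)
    used = [False] * len(first_rank)
    # Consider longer candidate words first.
    for start, end, word in sorted(words, key=lambda w: -(w[1] - w[0])):
        if not any(used[start:end]):  # skip words overlapping an earlier choice
            out[start:end] = word
            used[start:end] = [True] * (end - start)
    return "".join(out)


# 「ラッシュ」 overwrites the tail of the correctly recognized 「クーリッシュ」.
kurisshu = baseline_output("クーリッシュ", [(2, 6, "ラッシュ")])

# 「ジャングル」 is longer than 「サブ」 and 「サンプル」, so it wins.
sabu = baseline_output(
    "サフサシグレ",
    [(0, 2, "サブ"), (2, 6, "サンプル"), (1, 6, "ジャングル")],
)
```

Here `kurisshu` comes out as 「クーラッシュ」 and `sabu` as 「サジャングル」, the two degraded outputs described in the text.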

[0018] The present invention therefore does not rely on the output formation shown in FIG. 4. For a character string judged from the character recognition result to be a katakana area, when the output forming unit 8 selects what to output from the candidate words and candidate characters, it considers only combinations of candidate words that can fill the katakana area from beginning to end with katakana and without overlap. If there is no such combination of candidate words, it outputs the first-ranked recognition characters for the entire katakana area. In this way, new words and compound words that are unregistered katakana words are rescued.

[0019] FIG. 5 is an explanatory diagram showing an example of output formation according to the present invention, and the handling of the unregistered katakana words described above is explained concretely with reference to it. When the output string is determined from the extracted candidate words, only combinations of candidate words that can fill a katakana area from beginning to end with katakana and without overlap are selected for that area. For example, in the katakana area 「イソト」, the candidate word for this portion is 「インド」 ("India"). This candidate word alone fills the katakana area from beginning to end, so 「インド」 is output as is.

[0020] In the katakana area 「クーリッシュ」, selecting candidate words so that no words overlap and so that longer words take priority yields only the candidate word 「ラッシュ」. This, however, is not a combination of candidate words that fills the katakana area from beginning to end, so 「ラッシュ」 is not selected. The device therefore judges 「クーリッシュ」 to be an unregistered katakana word, such as a proper noun or new word, and outputs 「クーリッシュ」, the sequence of first-ranked candidate characters, without using the extracted candidate word.

[0021] In the katakana area 「サブサンプル」, the longest candidate word is 「ジャングル」, but, as with 「クーリッシュ」 above, it does not form a combination of candidate words that fills the katakana area from beginning to end. Using the candidate words 「サブ」 and 「サンプル」 from the same katakana area, however, the candidate words do not overlap and the area is filled from beginning to end with katakana. The device therefore judges 「サブサンプル」 to be an unregistered katakana word that is a compound of 「サブ」 and 「サンプル」, and outputs it.
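The rule applied to the three katakana areas above amounts to searching for an exact, non-overlapping cover of the area by candidate words, with a fallback to the first-ranked characters. The sketch below is one possible way to do that, not the patent's stated procedure: the function names, the word positions, and the choice to try longer words first when several covers exist are assumptions (the text does not say how to choose among multiple valid covers).

```python
from functools import lru_cache
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int, str]  # (start, end, word), end exclusive


def tile_area(length: int, words: List[Span]) -> Optional[Tuple[str, ...]]:
    """Return words covering positions 0..length exactly, or None if impossible."""
    by_start: Dict[int, List[Tuple[int, str]]] = {}
    for start, end, word in words:
        if start < end <= length:  # ignore empty or out-of-range spans
            by_start.setdefault(start, []).append((end, word))

    @lru_cache(maxsize=None)
    def search(pos: int) -> Optional[Tuple[str, ...]]:
        if pos == length:
            return ()  # reached the end of the area with no gaps
        # Assumption: prefer longer words when several covers are possible.
        for end, word in sorted(by_start.get(pos, []), reverse=True):
            rest = search(end)
            if rest is not None:
                return (word,) + rest
        return None  # no candidate word continues the cover from pos

    return search(0)


def form_output(first_rank: str, words: List[Span]) -> str:
    """Output the covering combination, or the first-rank characters if none."""
    tiling = tile_area(len(first_rank), words)
    return "".join(tiling) if tiling is not None else first_rank


india = form_output("イソト", [(0, 3, "インド")])
kurisshu = form_output("クーリッシュ", [(2, 6, "ラッシュ")])
subsample = form_output(
    "サフサシグレ",
    [(0, 2, "サブ"), (2, 6, "サンプル"), (1, 6, "ジャングル")],
)
```

With these inputs the three areas yield 「インド」, 「クーリッシュ」, and 「サブサンプル」 respectively, matching the outcomes described for FIG. 5.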

[0022] As described above, for a character string written in katakana, if there is a combination of candidate words that can fill the katakana area from beginning to end with katakana and without overlap, the string is judged to be a compound word and that combination is output. If there is no such combination, the string is judged to be an unregistered new word, proper noun, or the like, and the first candidate characters are output as they are. This improves the character recognition rate without excessively increasing the number of words registered in the word dictionary 7.

[0023]

[Effect of the Invention] As described above, according to the present invention, for a character string written in katakana, a combination of candidate words that can fill the katakana area from beginning to end with katakana and without overlap is output if one exists, and the first candidate characters are output as they are if none exists. This has the effect of making it possible to improve the character recognition rate.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention.

FIG. 2 is an explanatory diagram showing an example of character recognition according to the present invention.

FIG. 3 is an explanatory diagram showing an example of word matching according to the present invention.

FIG. 4 is an explanatory diagram of output formation based on word length and average candidate rank.

FIG. 5 is an explanatory diagram showing an example of output formation according to the present invention.

[Explanation of Symbols]

4: scanning unit; 5: character recognition unit; 6: word matching unit; 7: word dictionary; 8: output forming unit

Claims

[Claims]

[Claim 1] A Japanese sentence reader comprising: a scanning unit that scans the character portions of a document image of Japanese text that has been divided into areas such as characters, tables, ruled lines, figures, and photographs, photoelectrically converts them, and outputs an image signal; a character recognition unit that takes the degree of similarity between the image signal and each pre-registered character as that character's distance value, selects the characters whose distance value lies within a predetermined range as candidate characters in order of increasing distance value, designates the candidate character having the smallest distance value as a first candidate character, judges that the corresponding image signal is a katakana character if all candidate characters within a distance range of a fixed multiple of the first candidate character's distance value are katakana, and, when two or more katakana characters occur in succession, designates the area of those consecutive katakana characters as a katakana area; a word matching unit that selects a word as a candidate word when a word that matches or approximates a character string made up of the candidate characters is found in a word dictionary in which words for matching are registered; and an output forming unit that selects, from the group of candidate words obtained by the word matching unit, the words to be used as the reading result and outputs them, and that, for the katakana area, outputs only a combination of candidate words that can fill the katakana area from beginning to end with katakana and without overlap when such a combination exists, and, when no such combination exists, judges the character string to be an unregistered katakana word and outputs the first candidate characters for the entire katakana area.