JPS63223987A

JPS63223987A - Character string retrieval device

Info

Publication number: JPS63223987A
Application number: JP62058314A
Authority: JP
Inventors: Yuzuru Tanaka; 譲田中; Kinya Takahashi; 欣也高橋
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1987-03-13
Filing date: 1987-03-13
Publication date: 1988-09-19
Anticipated expiration: 2010-12-20
Also published as: JPH07120387B2

Abstract

PURPOSE: To retrieve an arbitrary character string in a document image by storing, for each segmented character area, both its position information and a degenerate code, and by deciding whether or not degenerate code sequences coincide. CONSTITUTION: A character string to be retrieved is input from a keyboard 105 and stored, together with a string-end character, in the key buffer of a character string retrieval part 106. The character number (i) in the character information table 103 and the character number (j) in a degenerate code buffer are then both initialized to 1. It is checked whether the information at character number (i) marks the end of the data; if not, it is checked whether the degenerate code Ci at number (i) matches the degenerate code at number (j) in the buffer, and when j = 1 the value ST of a character string start number register is set to (i). It is then judged whether (j) equals the character number EC at the end of the degenerate code buffer; when j = EC, the character string is judged to match the target string. The character rectangular areas for the character numbers from the start number ST to the current (i) are then looked up in the table 103 and displayed in black-and-white reverse mode on a display device 108, showing the retrieved character string to the user.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a character string search device that searches for an arbitrary character string in text input as an image.

[Prior Art]

Conventionally, there have been image editing devices that cut out character areas from text input as a raster image and edit the text image by such means as deleting some of these character areas, inserting or adding new character patterns into the text, or rearranging the character areas as appropriate.

[Problem to Be Solved by the Invention]

However, because the above device does not perform character recognition and so does not obtain standard character codes (JIS, ASCII, etc.), it has no means of searching the input character image for an arbitrary character string specified by the user. The only option was to show the text image on a display and have a person look for the string character by character.

It is also conceivable to convert each character of the input text image into a standard character code using a conventional character recognition device. When a large volume of text images is to be searched, however, this is inefficient, because time is spent running recognition on many unnecessary characters that are not search targets.

[Means for Solving the Problem (and Operation)]

According to the present invention, each character pattern cut out from a text image is not subjected to the complex processing performed by a conventional character recognition device. Instead, simple image processing assigns it, at the preprocessing stage of image editing, a code of a few bits that differs from the standard character code (hereinafter called a degenerate code). This code is stored together with the information on the cut-out character area, and means is provided for determining whether degenerate code strings match, so that an arbitrary character string in a text image can be searched for.

[Embodiment 1]

FIG. 1 is a block diagram of the present invention. 100 is an image input device that inputs the image pattern of a page containing text; 101 is an image memory that stores the image data read by the image input device 100; 102 is a character segmentation unit that extracts the area of each individual character pattern contained in the image memory 101; 104 is a degenerate code generation unit that generates the degenerate code for each segmented character; 103 is a character information table that stores the coordinate values and degenerate code of each segmented character area; 105 is a keyboard; 106 is a character string search unit that searches the text input as an image for a specified character string; 107 is a degenerate code table that stores the characters entered from the keyboard and their corresponding degenerate codes; and 108 is a display device that displays the contents of the image memory and the character string search results. The image memory 101 and the character information table 103 are rewritable RAM, and the degenerate code table 107 uses ROM.

First, the image input device 100 is started, and the text image is read, binarized, and stored in the image memory 101. The image memory is given enough capacity to store at least one page of scanned image data. Next, the image display device 108 displays the contents of the image memory on the display screen 109. The character segmentation unit 102 then extracts the area of each character by taking projections of the image data stored in the image memory 101. Many techniques for character segmentation are known, so a detailed description is omitted.

FIG. 2 shows an example of character segmentation. Each character is cut out as one of the rectangular areas 201 to 211, each described by four values: the x-coordinate Xi of the leftmost black pixel of the character pattern; the y-coordinate Yi of the topmost black pixel among all the character patterns on the same text line; the width Wi of the pattern in pixels; and the height Hi in pixels. The height Hi is the distance from Yi to the y-coordinate of the bottommost black pixel among the character patterns on the same text line. The segmentation result for each character is stored in the character information table 103 in order of appearance. The character information table 103 consists of a character number column 109 that identifies each character, columns 110 to 113 that store Xi, Yi, Wi, and Hi, and a degenerate code column 114 described later. The results are written in order of appearance starting from character number 1, with Xi, Yi, Wi, and Hi going into columns 110, 111, 112, and 113 respectively. In the entry after the last character, columns 110 to 113 are filled with a value that cannot occur as a coordinate, width, or height, for example -1, to mark the end of the data.
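
The layout of the character information table described above can be sketched as follows. This is a minimal illustration, not the device's actual memory layout: the names `CharEntry`, `build_table`, and `is_end`, and the sample rectangles, are assumptions introduced here. Only the four rectangle fields, the degenerate code field, and the -1 end marker come from the text.

```python
from dataclasses import dataclass
from typing import Optional

END = -1  # value that cannot be a coordinate, width, or height (end-of-data marker)


@dataclass
class CharEntry:
    """One row of the character information table (columns 110 to 114)."""
    x: int                      # Xi: leftmost black pixel of the character
    y: int                      # Yi: topmost black pixel on the text line
    w: int                      # Wi: width in pixels
    h: int                      # Hi: height in pixels (line top to line bottom)
    code: Optional[int] = None  # degenerate code, filled in later


def build_table(boxes):
    """Store rectangles in order of appearance and append the end marker row."""
    table = [CharEntry(*box) for box in boxes]
    table.append(CharEntry(END, END, END, END))
    return table


def is_end(entry):
    """True when the row is the end-of-data marker."""
    return entry.x == END


# Hypothetical rectangles for two characters on the same text line.
table = build_table([(10, 5, 12, 30), (24, 5, 11, 30)])
```

The character number in the text is 1-based; in this sketch it corresponds to the list index plus one.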

Next, the degenerate code generation unit 104 generates a 2-bit degenerate code. FIG. 3 illustrates how the degenerate code is generated. First, each cut-out rectangular area 30 to 32 is divided into an upper part UP-30 to UP-32, a central part MD-30 to MD-32, and a lower part UD-30 to UD-32. The character pattern 'c' in rectangular area 30 is an example of a character with no pattern in either the upper or the lower part; the character pattern 'b' in rectangular area 31 is an example of a character with a pattern in the upper part; and the character pattern 'p' in rectangular area 32 is an example of a character with a pattern in the lower part. In this embodiment, one bit of the degenerate code is determined by whether a pattern exists in the upper part UP, and the remaining bit by whether a pattern exists in the lower part UD.

FIG. 4(a) is a flowchart of degenerate code generation, and FIG. 4(b) shows the registers and memory used in this process. First, in step S401, the character number i (450) is initialized to 1. In step S402, it is determined whether the data for character number i in the character information table 103 is valid. If it is the end of the data, processing ends. Otherwise, in step S403, the character rectangle is divided into the upper part UP, the central part MD, and the lower part UD, and the position information 451 of the upper rectangle UP and the position information 452 of the lower rectangle UD are set. In S403, rUP and rUD are the ratios of the upper-part height UPH and the lower-part height UDH, respectively, to the height Hi of the character rectangle taken as 1. Because rUP and rUD are nearly constant for lowercase English letters regardless of font, optimal values are set in advance.

Next, in step S404, the black pixels inside the upper rectangular area UP, given by UPX, UPY, UPW, and UPH in the image memory 101, are counted, and the count nUP is stored in register 453. In steps S405 to S407, if the black pixel count nUP is greater than the threshold th, a character pattern is taken to exist in the upper part and UPflag 455 is set to '1'; otherwise UPflag 455 is set to '0'. The threshold th is a preset value that prevents a wrong decision when the character pattern is shifted slightly up or down, for example when part of a pattern that belongs in the central part MD intrudes into the upper part UP. Another way to prevent such a wrong decision is, when setting the upper area UPX, UPY, UPW, UPH, to make UPH smaller by the number of pixels of expected displacement.

In steps S408 to S411, UDflag 456 is set according to the number of black pixels in the lower part UD, in the same way as in steps S404 to S407.

In step S412, a 2-bit degenerate code is generated from the values of UDflag and UPflag and stored in the degenerate code column 114 for character number i in the character information table 103.

FIG. 5 is a table relating the values of UDflag and UPflag to the degenerate code generated. For example, when UDflag is '0' and UPflag is '1', the character has no pattern in the lower part and a pattern in the upper part, like 'b', and the degenerate code is '01'.
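
The generation of the 2-bit code in steps S403 to S412 can be sketched as below. Several details are assumptions made for illustration: the glyph is given as a list of rows of 0/1 pixels covering the character rectangle, the ratios `r_up` and `r_ud` and the threshold `th` are placeholder values (the text only says they are preset), and the bit order (UDflag as the left bit, UPflag as the right bit) is inferred from the single example above, where UDflag = 0 and UPflag = 1 give '01'. The full table of FIG. 5 is not reproduced in the text.

```python
def degenerate_code(glyph, r_up=0.3, r_ud=0.3, th=1):
    """Return a 2-bit degenerate code for one character rectangle.

    glyph: list of rows, each a list of 0/1 pixels (1 = black).
    r_up, r_ud: assumed height ratios of the upper and lower parts.
    th: assumed black-pixel threshold guarding against small vertical shifts.
    """
    h = len(glyph)
    up_h = int(h * r_up)   # UPH
    ud_h = int(h * r_ud)   # UDH

    # S404 / S408: count black pixels in the upper and lower parts.
    n_up = sum(sum(row) for row in glyph[:up_h])
    n_ud = sum(sum(row) for row in glyph[h - ud_h:])

    # S405-S407 / S409-S411: set the flags by comparing with the threshold.
    up_flag = 1 if n_up > th else 0
    ud_flag = 1 if n_ud > th else 0

    # S412: combine the flags (bit order assumed: UDflag left, UPflag right).
    return (ud_flag << 1) | up_flag


def make_glyph(stem_rows, body_rows=range(3, 7), height=10, width=4):
    """Build a toy glyph: a full-width body plus a one-pixel stem in column 0."""
    glyph = [[0] * width for _ in range(height)]
    for r in body_rows:
        glyph[r] = [1] * width
    for r in stem_rows:
        glyph[r][0] = 1
    return glyph


b_like = make_glyph(stem_rows=range(0, 7))    # stem rises into the upper part
p_like = make_glyph(stem_rows=range(3, 10))   # stem drops into the lower part
c_like = make_glyph(stem_rows=[])             # body only

codes = [degenerate_code(g) for g in (b_like, p_like, c_like)]
```

With these toy glyphs the three shapes land in three different classes, which is the behaviour the text describes for 'b', 'p', and 'c'.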

The above processing completes the character information table; character string search processing starts from this point. FIG. 6(a) is a flowchart of the search process, and FIG. 6(b) shows the registers and memory it requires. First, in step S601, the character string to be searched for is typed on the keyboard, followed by a key that instructs the search to start. The string is stored in the key buffer 650 together with a string-end character. FIG. 6(b) shows the state after the string "pattern" has been entered. The figure draws the characters themselves, but in practice standard character codes such as ASCII or JIS codes are stored in keystroke order starting from slot 1. Slot 8 holds the string-end character. Next, in step S602, each character code in the standard character code string held in the key buffer 650 is looked up in the degenerate code table 107 to obtain the corresponding degenerate code, which is stored in the degenerate code buffer 651. The degenerate code table 107 is prepared in advance by assigning each standard character code a degenerate code according to the same classification used by the degenerate code generation method.

FIG. 7 is a classification table showing the degenerate codes for lowercase English letters in this embodiment. Uppercase English letters all take the code '01'. Adding digits and other characters to the degenerate code table does not affect string searches in text consisting mainly of lowercase letters. In step S602, the slot number EC of the last character in the degenerate code buffer (EC = 7 in the example of FIG. 6(b)) is also stored in the string end register 652.
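
Step S602 amounts to a table lookup per keystroke. The sketch below shows one way to build such a table and fill the degenerate code buffer. The letter groupings are an assumption based on ordinary ascender and descender shapes, since the contents of FIG. 7 are not given in the text; only the rule that uppercase letters map to '01' and the value EC = 7 for "pattern" come from the source. The function names are hypothetical.

```python
import string

# Assumed groupings (not taken from FIG. 7): which lowercase letters put ink
# in the upper part, the lower part, or both.
UP_ONLY = set("bdfhiklt")
DOWN_ONLY = set("gpqy")
BOTH = set("j")


def build_code_table():
    """Build a stand-in for degenerate code table 107 (character -> 2-bit code)."""
    table = {}
    for ch in string.ascii_lowercase:
        up_flag = int(ch in UP_ONLY or ch in BOTH)
        ud_flag = int(ch in DOWN_ONLY or ch in BOTH)
        table[ch] = (ud_flag << 1) | up_flag
    for ch in string.ascii_uppercase:
        table[ch] = 0b01  # per the text, all uppercase letters take '01'
    return table


def to_code_buffer(key, table):
    """Convert the key buffer contents to degenerate codes; return (codes, EC)."""
    codes = [table[ch] for ch in key]
    ec = len(codes)  # 1-based slot number of the last character
    return codes, ec


codes, ec = to_code_buffer("pattern", build_code_table())
```

Under these assumed groupings, "pattern" becomes the code string 10 00 01 01 00 00 00, and EC is 7 as in the example of FIG. 6(b).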

Next, in step S603, the character number i in the character information table 103 (the value of the table character number register 653) and the character number j in the degenerate code buffer 651 (the value of the search character number register 654) are both initialized to 1.

In step S604, it is determined whether the entry at character number i in the character information table 103 marks the end of the data. If so, processing ends. If not, step S605 determines whether the degenerate code Ci of character number i in the character information table 103 is the same as the degenerate code of character number j in the degenerate code buffer 651. If they differ, step S611 increments i, sets j to 1, and returns to step S604. If they are the same, step S606 determines whether j is 1, and if j = 1 the value ST of the string start number register 655 is set to i.

Next, step S608 determines whether j is the last character number EC of the degenerate code buffer. If not, step S610 increments both i and j and returns to step S604 to check the next character. If j = EC in step S608, the target string has been matched, so in step S609 the character rectangles for the character numbers from ST (the value of the string start number register 655) to the current i are looked up in the character information table 103 and displayed by the display device 108, for example in black-and-white reverse, to show the user the string that was found. Processing then passes through step S611 and returns to step S604 to look for the next occurrence. Repeating this process selects every character string whose degenerate code string matches.
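
The loop of steps S603 to S611 can be written out as follows. This is a sketch that follows the flowchart as described in the text, using 0-based indices instead of the 1-based character numbers; the function name `search`, the list-based table, and the sample data are assumptions. Instead of highlighting rectangles on a display, it returns the (ST, i) index pairs of each match.

```python
END = -1  # end-of-data marker in the table of degenerate codes


def search(table_codes, key_codes):
    """Return (start, end) index pairs where key_codes matches table_codes.

    table_codes: degenerate codes in order of appearance, terminated by END.
    key_codes: non-empty list of degenerate codes of the search string.
    """
    matches = []
    i, j = 0, 0                             # S603 (0-based here)
    st = 0
    while table_codes[i] != END:            # S604
        if table_codes[i] != key_codes[j]:  # S605: mismatch
            i, j = i + 1, 0                 # S611
            continue
        if j == 0:                          # S606
            st = i                          # record string start ST
        if j == len(key_codes) - 1:         # S608: j == EC
            matches.append((st, i))         # S609: report the match
            i, j = i + 1, 0                 # S611
        else:
            i, j = i + 1, j + 1             # S610
    return matches


# Hypothetical page: one unrelated character, then the codes of "pattern".
page = [0b00, 0b10, 0b00, 0b01, 0b01, 0b00, 0b00, 0b00, 0b01, END]
key = [0b10, 0b00, 0b01, 0b01, 0b00, 0b00, 0b00]
found = search(page, key)
```

One consequence of following the flowchart literally: a mismatch partway through a candidate advances i without re-testing the current character against the first key code, so a match that begins on that character is skipped. Whether the original device handles this case differently is not stated in the text.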

[Embodiment 2]

FIG. 8(a) is a flowchart of degenerate code generation in this embodiment. It adds new steps S801 to S804 to the processing of Embodiment 1 and changes the content of step S412. FIG. 8(b) shows the newly used registers.

First, in step S801, the crossing count of each character pattern is obtained by referring to the character rectangles stored in the character information table 103, and the crossing count nCR is stored in the crossing count register 851.

FIG. 10 shows how the crossing count is obtained. MD-1000 to MD-1003 are the central parts MD of characters 1000 to 1003. The crossing count is determined by scanning, in the image memory 101, the pixels lying on the center lines CY-1000 to CY-1003 of the y-coordinate range of MD-1000 to MD-1003 and counting how many separate runs of black pixels appear. In the example of FIG. 10, the crossing count nCR is 1 for characters 1000 and 1001 and 2 for characters 1002 and 1003. In steps S802 to S804, CRflag 850 is set to 1 if the crossing count nCR is 1 and to 0 otherwise. In step S412, a 2-bit degenerate code is generated according to the table shown in FIG. 9 and saved in the character information table 103. The right bit of the degenerate code is determined by UDflag and UPflag.

¬(UDflag ∨ UPflag) is the case where UDflag and UPflag are both 0, and the right bit is set to 0.

UDflag ∨ UPflag is the case where either UDflag or UPflag is 1, and the right bit is set to 1.

The left bit of the degenerate code is set to 0 when CRflag is 0 and to 1 when CRflag is 1.
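
The crossing count and the Embodiment 2 code can be sketched as below. The function names and the sample pixel rows are assumptions; the rules themselves (count runs of black pixels on the center line of MD, CRflag = 1 only when the count is 1, right bit = UDflag OR UPflag, left bit = CRflag) are taken from the description above.

```python
def crossing_count(row):
    """Count separate runs of black pixels (1s) along one scan line."""
    count = 0
    prev = 0
    for px in row:
        if px == 1 and prev == 0:  # a new run of black pixels starts here
            count += 1
        prev = px
    return count


def degenerate_code_v2(n_cr, up_flag, ud_flag):
    """Combine the crossing count with the upper/lower flags into 2 bits."""
    cr_flag = 1 if n_cr == 1 else 0              # S802-S804
    right = 1 if (ud_flag or up_flag) else 0     # UDflag OR UPflag
    return (cr_flag << 1) | right                # left bit = CRflag


# Hypothetical center-line scans: one stroke versus two strokes.
one_stroke = [0, 1, 1, 1, 0]
two_strokes = [0, 1, 1, 0, 0, 1, 0]
n1 = crossing_count(one_stroke)
n2 = crossing_count(two_strokes)
```

Because the right bit no longer distinguishes upper from lower strokes, the code stays at 2 bits while the crossing count supplies a second, independent split of the character set.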

FIG. 11 shows the degenerate code for each character in the degenerate code table for lowercase English letters in this embodiment. In this embodiment too, the table is not limited to lowercase English letters; other characters may be added to it according to this generation method.

As described above, the degenerate code is generated by adding a further classification, based on the crossing count, to the classification of character patterns by their upper and lower parts. As FIG. 11 shows, compared with Embodiment 1 this distributes characters more evenly across the degenerate codes, so the accuracy of string search improves even though the degenerate code has the same bit length.

[Embodiment 3]

This embodiment represents the degenerate code with 3 bits.

The degenerate code is generated using CRflag, UDflag, and UPflag from Embodiment 2. FIG. 12 is the correspondence table. FIG. 13 shows the degenerate code for each character in the degenerate code table for lowercase English letters in this embodiment. Lowercase English letters actually fall into six classes, and search accuracy is higher than with four classes.

[Embodiment 4]

FIG. 14 is a block diagram of this embodiment, which adds a character recognition device 1400 to the first to third embodiments. The character recognition device 1400 may be any device that converts character patterns into standard character codes (ASCII, JIS, etc.), and a known character recognition device can be used. FIG. 15(a) shows the processing flow of string search; it adds a character recognition part S1501 to FIG. 6(a). FIG. 15(b) shows the buffer used in this processing. In the character recognition part S1501, each character pattern in the image memory 101 is passed to the character recognition device according to the information for the characters from character number ST to i in the character information table 103, and the resulting standard character code string is stored in the recognition code buffer 1550. In step S1502, the contents of the recognition code buffer 1550 are compared with the contents of the key buffer 650. If they match, the string is displayed in step S609; otherwise processing passes through step S611 to the next string search.

As described above, according to this embodiment, adding a character recognition device that can obtain standard character codes from character patterns makes error-free string search possible. This is useful, for example, for the string search needed to automatically replace every occurrence of one string with another.
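
The verification stage of Embodiment 4 can be sketched as a filter over the candidates produced by the degenerate code search. Everything named here is hypothetical: `verify_matches`, the `recognize` callback standing in for the character recognition device 1400, and the sample text. The word "goffers" is used only because, under ordinary ascender and descender shapes, it would plausibly share the degenerate code string of "pattern".

```python
def verify_matches(candidates, recognize, key):
    """Keep only the candidates whose recognized text equals the key string.

    candidates: (start, end) index pairs from the degenerate code search.
    recognize: callable mapping a character index to its recognized character;
               a stand-in for the character recognition device.
    key: the search string as typed (contents of the key buffer).
    """
    confirmed = []
    for st, i in candidates:
        # S1501: recognize only the characters from ST to i.
        recognized = "".join(recognize(n) for n in range(st, i + 1))
        # S1502: compare with the key buffer.
        if recognized == key:
            confirmed.append((st, i))
    return confirmed


# Stand-in recognizer that reads from a known ground truth string.
truth = "xpatternxgoffersx"
calls = []


def fake_recognize(n):
    calls.append(n)
    return truth[n]


candidates = [(1, 7), (9, 15)]  # both spans assumed to share one code string
confirmed = verify_matches(candidates, fake_recognize, "pattern")
```

In this toy run the recognizer is called for 14 of the 17 characters; on a real page where candidates are rare, the proportion of characters sent to recognition would be much smaller, which is the efficiency argument made in the text.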

In Embodiments 1 to 4 the search results are shown on the display, but the character numbers of the matched strings may instead be stored in memory.

The degenerate code is not limited to 2 or 3 bits; its length is chosen according to the number of character pattern classes.

[Effect of the Invention]

As described above, one or more simple image processing operations are applied to each character pattern cut out from a text image to generate a degenerate code of a few bits, which is retained, and means is provided for checking whether degenerate code strings match. This has the effect that character strings can be searched for in a text image without providing the character recognition means used conventionally. One degenerate code usually corresponds to several characters, but the more characters a string contains, the more nearly unique its degenerate code string becomes, so adequate search results are obtained.

When exact string search results are required, a two-stage configuration can be used: search by degenerate code string serves as the first search means, and the strings it finds are passed to a second search means that uses a character recognition device to obtain the final result. Because the degenerate code search greatly narrows down the characters submitted to the character recognition device, search efficiency improves, and the benefit is especially large when the volume of text images to be searched is enormous.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of the character string search device.
FIG. 2 shows the segmentation of character images.
FIG. 3 is an explanatory diagram of the degenerate code generation method.
FIG. 4 shows the processing flow of degenerate code generation and the main registers and memory.
FIG. 5 shows the degenerate codes.
FIG. 6 shows the processing flow of string search and the main registers and memory.
FIG. 7 shows the correspondence between characters and degenerate codes.
FIG. 8 shows the processing flow of degenerate code generation and the main registers.
FIG. 9 shows the degenerate codes.
FIG. 10 is an explanatory diagram of the degenerate code generation method.
FIG. 11 shows the correspondence between characters and degenerate codes.
FIG. 12 shows the degenerate codes.
FIG. 13 shows the correspondence between characters and degenerate codes.
FIG. 14 is a block diagram of the character string search device.
FIG. 15 shows the processing flow of string search and the main buffers.

Claims

[Scope of Claims]

A character string search device, in an apparatus that cuts out character areas from a text image input as an image and performs editing processing, comprising: degenerate code generation means for generating a degenerate code of a few bits from each cut-out character pattern; means for holding the generated degenerate codes; degenerate code comparison means for searching for a character string specified for search; and means for displaying the search results.