JP2022108130A

JP2022108130A - Information processor and computer program

Info

Publication number: JP2022108130A
Application number: JP2021002995A
Authority: JP
Inventors: Yoshinori Nakayama (中山佳紀); Ren Matsuyama (松山錬)
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2021-01-12
Filing date: 2021-01-12
Publication date: 2022-07-25

Abstract

To provide an information processing device and a computer program that make document sorting work efficient and improve sorting accuracy. SOLUTION: An information processing device comprises: an acquisition unit that acquires a document image; an image generation unit that performs color reduction processing on the acquired document image based on a plurality of threshold values to generate a plurality of processed images; a text data generation unit that optically reads each of the generated processed images to generate text data; and a classifier that classifies the document image based on the plurality of generated text data. SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing device and a computer program.

Business process outsourcing (BPO), in which a company entrusts operations for which it has no in-house operational know-how to an external provider, is practiced in various fields. For example, demand is growing for BPO services that take on the reception, sorting, and examination of documents related to various applications as a single package.

Document sorting work requires reading the documents with a scanner. Patent Document 1 discloses a process in which an operator inspects manuscripts read by a scanner and the manuscripts are managed electronically. In document sorting work as well, the documents read by the scanner are in many cases classified by hand.

Patent Document 1: JP-A-2006-94036

However, when worker fatigue causes a human error, the sorting work has to be redone, which lowers not only work efficiency but also sorting accuracy. In addition, if a sudden increase in projects causes a labor shortage, work cannot be outsourced in a timely manner, and there is a risk of missing business opportunities.

The present invention has been made in view of these circumstances, and its object is to provide an information processing device and a computer program that make document sorting work efficient and improve sorting accuracy.

The information processing device includes: an acquisition unit that acquires a document image; an image generation unit that performs color reduction processing based on a plurality of threshold values on the document image acquired by the acquisition unit to generate a plurality of processed images; a text data generation unit that optically reads each of the processed images generated by the image generation unit to generate text data; and a classifier that classifies the document image based on the plurality of text data generated by the text data generation unit.

The computer program causes a computer to execute processing of: acquiring a document image; performing color reduction processing based on a plurality of threshold values on the acquired document image to generate a plurality of processed images; optically reading each of the generated processed images to generate text data; and classifying the document image based on the plurality of generated text data.

According to the present invention, document sorting work can be made efficient and sorting accuracy can be improved.

FIG. 1 is a schematic diagram showing an example of the configuration of an information processing system.
FIG. 2 is a schematic diagram showing an example of a document image.
FIG. 3 is a schematic diagram showing an example of a document image after color reduction processing.
FIG. 4 is a schematic diagram showing an example of text data obtained by OCR.
FIG. 5 is a schematic diagram showing an example of text data obtained by concatenating the OCR text data.
FIG. 6 is a schematic diagram showing an example of character division by character 2-grams.
FIG. 7 is a schematic diagram showing an example of the frequencies of occurrence of the divided character strings.
FIG. 8 is a schematic diagram showing an example of a method of calculating weights.
FIG. 9 is a schematic diagram showing an example of weighted character feature amounts.
FIG. 10 is a schematic diagram showing a first example of a document classification method using a trained model.
FIG. 11 is a schematic diagram showing a second example of a document classification method using a trained model.
FIG. 12 is a schematic diagram showing an example of a rule-based document classification method.
FIG. 13 is a schematic diagram showing an example of a method of identifying a contributing portion.
FIG. 14 is a schematic diagram showing an example of a highlighted document image.
FIG. 15 is a flowchart showing an example of the procedure of document image classification processing by the information processing device.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a schematic diagram showing an example of the configuration of an information processing system. The information processing system includes an information processing device 50 and a terminal device 10. The information processing device 50 and the terminal device 10 are connected via a communication network 1. The terminal device 10 is a personal computer, a tablet terminal, or the like, and is used by the person in charge of the work.

The information processing device 50 includes a control unit 51 that controls the entire device, a communication unit 52, a storage unit 53, an image processing unit 54, an OCR processing unit 55, a feature amount extraction unit 56, a classifier 57, a weighting unit 58, an identification unit 59, and an output unit 60. The control unit 51 can be configured with a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), and the like.

The communication unit 52 has a function of communicating with the terminal device 10 via the communication network 1, and can transmit and receive required information. More specifically, the communication unit 52 functions as an acquisition unit and acquires a document image from the terminal device 10.

FIG. 2 is a schematic diagram showing an example of a document image. FIG. 2 shows an image of a driver's license as an example of a document image. The documents include identity verification documents: in addition to a driver's license, for example, a passport, a My Number card, insurance cards (health insurance card, long-term care insurance card, and so on), a national pension book, a physical disability certificate, a resident card, a certified copy or extract of the family register, a seal registration certificate, and various other documents.

The storage unit 53 is composed of a semiconductor memory, a hard disk, or the like, and can store the document images acquired via the communication unit 52. The storage unit 53 can also store other required data, such as processing results produced within the information processing device 50.

The image processing unit 54 has a function of performing image preprocessing on the document image acquired via the communication unit 52. The image preprocessing includes sharpening processing, color reduction processing, resizing processing, and the like. In this specification, the image processing unit 54 performs the sharpening, color reduction, and resizing processing, but each of these may instead be configured as a separate processing unit. Each type of image preprocessing is described below.

The image processing unit 54 performs sharpening processing on the document image acquired via the communication unit 52 (or on an image obtained by applying predetermined image processing to that document image). The sharpening processing emphasizes contours by using a filter that enhances edge contrast. The sharpening processing can be applied to the image either before or after the color reduction processing.

FIG. 3 is a schematic diagram showing an example of a document image after color reduction processing. The image processing unit 54 functions as an image generation unit and performs color reduction processing based on a plurality of threshold values on the document image acquired via the communication unit 52 to generate a plurality of processed images. In the color reduction processing, pixels whose values are equal to or greater than the threshold are erased. In the example of FIG. 3, two color-reduced document images are generated based on two different thresholds, Th1 and Th2. The R (red), G (green), and B (blue) pixel values each range from 0 (black) to 255 (white). For example, the threshold Th1 can be (R: 100, G: 100, B: 100) and the threshold Th2 can be (R: 200, G: 200, B: 200), although the thresholds are not limited to these values. The threshold also need not be the same for R, G, and B; a set of thresholds that differs for each of R, G, and B may be used.

Where the background and characters overlap, the visibility of the characters decreases for some combinations of background color and character color. The background color is removed by color reduction processing in order to improve the visibility of the characters, but depending on the threshold, the characters themselves may also be removed. Performing color reduction with two different thresholds addresses this: even if the color reduction based on one threshold removes some characters, the color reduction based on the other threshold can pick up the characters that were lost. In FIG. 3A, the threshold is Th1 = (R: 100, G: 100, B: 100), so pixels whose R, G, and B values are each 100 or more are erased. The example of FIG. 3A shows a case in which the characters "運転免許証" ("driver's license") have been removed. In FIG. 3B, on the other hand, the threshold is Th2 = (R: 200, G: 200, B: 200), so pixels whose R, G, and B values are each 200 or more are erased. In the example of FIG. 3B, the characters "運転免許証" that were removed in FIG. 3A remain.

As described above, performing color reduction processing on a document image based on a plurality of thresholds can improve the character recognition accuracy of the OCR (Optical Character Recognition) processing described later.
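The multi-threshold color reduction described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation disclosed in this publication: the image is assumed to be a list of rows of (R, G, B) tuples, "erasing" a pixel is assumed to mean painting it white (the text does not say what an erased pixel becomes), and the names `reduce_colors` and `generate_processed_images` are hypothetical.

```python
WHITE = (255, 255, 255)  # assumed fill value for erased pixels

def reduce_colors(image, threshold):
    """Erase every pixel whose R, G and B values are all >= the per-channel threshold."""
    return [
        [
            WHITE if all(v >= t for v, t in zip(pixel, threshold)) else pixel
            for pixel in row
        ]
        for row in image
    ]

def generate_processed_images(image, thresholds):
    """Return one color-reduced copy of the image per threshold."""
    return [reduce_colors(image, t) for t in thresholds]

TH1 = (100, 100, 100)
TH2 = (200, 200, 200)

# One row: dark characters, mid-tone characters, light background.
sample = [[(30, 30, 30), (150, 150, 150), (230, 230, 230)]]
img_th1, img_th2 = generate_processed_images(sample, [TH1, TH2])
```

With these sample values, the mid-tone pixel is erased under Th1 but survives under Th2, which mirrors how the characters lost in FIG. 3A are still present in FIG. 3B.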

The image processing unit 54 performs resizing processing on the document image acquired via the communication unit 52 (or on an image obtained by applying predetermined image processing to that document image). The size of the document image may differ depending on the type of document. By resizing, the image processing unit 54 can convert the image to the size best suited to OCR processing. The resizing can be applied to the document image after sharpening, or to each of the two color-reduced images generated by the color reduction processing. In other words, the resizing can be applied to the image either before or after the color reduction processing.

The OCR processing unit 55 functions as a text data generation unit and optically reads the image preprocessed by the image processing unit 54 to generate text data. More specifically, the OCR processing unit 55 optically reads each of the plurality of processed images generated by the image processing unit 54 and concatenates the resulting pieces of OCR text to generate the text data.

FIG. 4 is a schematic diagram showing an example of the OCR text. FIG. 4A is the text obtained from the color-reduced image shown in FIG. 3A, in which the strings "Ｏ月Ｏ日まで有効" ("valid until month O, day O") and "１２３４５６７８９０００" have been converted to text. FIG. 4B is the text obtained from the color-reduced image shown in FIG. 3B, in which the strings "Ｏ月Ｏ日まで有効", "運転免許証" ("driver's license"), and "１２３４５６７８９０００" have been converted to text.

FIG. 5 is a schematic diagram showing an example of text data obtained by concatenating the OCR text. In the example of FIG. 5, the OCR text of FIG. 4A and that of FIG. 4B are concatenated. FIG. 5 shows the text of FIG. 4B placed below that of FIG. 4A, but the arrangement is not limited to this: the text of FIG. 4B may instead be placed to the right of, to the left of, or above that of FIG. 4A. Concatenation combines the two text files into one.
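A minimal sketch of the concatenation step, assuming the FIG. 5 layout in which each OCR result is placed below the previous one. The publication does not name an OCR engine, so the engine is passed in as a callable; `build_text_data` and the stand-in results keyed by `"fig3a"` and `"fig3b"` are hypothetical and exist only to make the example runnable.

```python
def build_text_data(processed_images, ocr):
    """Run OCR on each processed image and join the results, one below the other."""
    texts = [ocr(image).strip() for image in processed_images]
    return "\n".join(texts)

# Stand-in OCR results keyed by a hypothetical image identifier.
FAKE_OCR_RESULTS = {
    "fig3a": "Ｏ月Ｏ日まで有効\n123456789000",
    "fig3b": "Ｏ月Ｏ日まで有効\n運転免許証\n123456789000",
}

text_data = build_text_data(["fig3a", "fig3b"], FAKE_OCR_RESULTS.__getitem__)
```

Because both OCR results are kept in full, strings that survive both thresholds appear twice in the combined text, while a string recovered by only one threshold appears once.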

The feature amount extraction unit 56 extracts a character feature amount from the text data generated by the OCR processing unit 55. Specifically, the feature amount extraction unit 56 divides the text data generated by the OCR processing unit 55 into character strings and extracts the character feature amount based on how often each character string appears in the text data. Character n-grams can be used to divide the text data. The character n-gram method divides a document into runs of n consecutive characters. Character 2-grams are described below as an example.

FIG. 6 is a schematic diagram showing an example of character division by character 2-grams. The text data illustrated in FIG. 5 is used as the document before division. The character 2-gram method divides the text data of FIG. 5 into runs of two consecutive characters. In the example of FIG. 6, the text is divided two characters at a time, as "Ｏ月", "Ｏ日", "まで", and so on.

FIG. 7 is a schematic diagram showing an example of the frequencies of occurrence of the divided character strings. The character string "Ｏ月" appears twice in the text data, so its frequency is 2. The character string "Ｏ日" also appears twice, so its frequency is 2. The same applies to the strings that follow. The character string "運転" appears once in the text data, so its frequency is 1. Likewise, the character strings "免許" and "証" each appear once, so the frequency of each is 1.

The feature amount extraction unit 56 extracts a feature vector (also called a character feature amount) whose elements correspond to the divided character strings and whose element values are the frequencies of those character strings. For example, if there are 100 divided character strings, the feature vector is a 100-dimensional vector.
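The division and counting steps can be sketched as below. One assumption is worth stating: the frequencies in FIG. 7 and the 13 elements in FIG. 9 are consistent with cutting each line into non-overlapping two-character pieces (with a shorter final piece such as "証"), so that is what this sketch does. A conventional overlapping character n-gram would produce different counts. The function names are hypothetical, and ASCII digits are used in place of the full-width digits of the figures.

```python
from collections import Counter

def split_into_chunks(text, n=2):
    """Cut each line into consecutive, non-overlapping pieces of n characters."""
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        chunks.extend(line[i:i + n] for i in range(0, len(line), n))
    return chunks

def frequency_features(text, n=2):
    """Map each distinct chunk to the number of times it occurs in the text."""
    return Counter(split_into_chunks(text, n))

# Concatenated text corresponding to FIG. 5 (FIG. 4A above FIG. 4B).
text_data = (
    "Ｏ月Ｏ日まで有効\n123456789000\n"
    "Ｏ月Ｏ日まで有効\n運転免許証\n123456789000"
)
freq = frequency_features(text_data)
```

Here `freq` has one entry per distinct chunk, and the list of its values in a fixed chunk order is the unweighted feature vector.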

The divided character strings can be weighted in advance according to their importance. The weighting method is described below.

The weighting unit 58 assigns weights to the frequencies of the character strings. As the weighting method, for example, the TF-IDF method can be used. TF-IDF stands for term frequency (TF) and inverse document frequency (IDF). TF represents how often a given character string appears in the documents; here, "the documents" means all of the text data collected for determining the weights. The more often a character string appears in the document data, the more important it is considered to be. IDF represents the reciprocal of the document frequency, that is, of the number of documents in which the character string appears. A character string that appears in many documents is considered to be of low importance within any one document. The weighting unit 58 can calculate the weight as the product of TF and IDF.

FIG. 8 is a schematic diagram showing an example of a method of calculating the weights. The weighting unit 58 calculates a weight for each character string by using document data as learning data. The document data is text data generated from a plurality of documents such as driver's licenses, passports, health insurance cards, and resident cards. In the example of FIG. 8, the weighting unit 58 uses the TF-IDF method to generate weighting information 581. The weighting information 581 associates each character string with its weight. In the example of FIG. 8, the character strings "保険" ("insurance") and "運転" ("driving") have a weight of 5.0, and the character string "住民" ("resident") has a weight of 4.5. The same applies to the other character strings.

As described above, a character string that appears frequently in the document data is likely to be important as a feature and is given a large weight, whereas a character string that appears in many documents is unlikely to be important as a feature and is given a small weight. This makes it easier for the classifier 57 to classify document images.
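A hedged sketch of one way to compute such weights. The text defines TF over the whole collected corpus and IDF as the reciprocal of the document frequency; the sketch assumes the commonly used logarithmic form, IDF = log(N / df), and a TF normalized by the total number of chunks. Both choices are assumptions, as are the function name `tfidf_weights` and the toy corpus, and the resulting numbers are not meant to reproduce the values in FIG. 8.

```python
import math
from collections import Counter

def tfidf_weights(corpus):
    """corpus: list of chunk lists, one list per training document.

    Returns a dict mapping each chunk to TF * IDF, where TF is the chunk's
    share of all chunks in the corpus and IDF is log(N / document frequency).
    """
    n_docs = len(corpus)
    term_counts = Counter()
    doc_counts = Counter()
    for chunks in corpus:
        term_counts.update(chunks)
        doc_counts.update(set(chunks))
    total = sum(term_counts.values())
    return {
        chunk: (term_counts[chunk] / total) * math.log(n_docs / doc_counts[chunk])
        for chunk in term_counts
    }

# Toy corpus: "有効" occurs in every document, the other chunks in one each.
corpus = [
    ["運転", "免許", "有効"],
    ["保険", "有効"],
    ["住民", "有効"],
]
weights = tfidf_weights(corpus)
```

In this toy corpus the chunk that occurs in every document receives a weight of zero, while chunks specific to one document type receive positive weights, which is the behavior the paragraph above describes.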

The feature amount extraction unit 56 can extract the character feature amount by applying the weights to the frequencies of the divided character strings.

FIG. 9 is a schematic diagram showing an example of weighted character feature amounts. In FIG. 9, the relationship between the character strings and their frequencies is the same as in the example of FIG. 7. As described above, the weighting unit 58 generates the weighting information in advance. As shown in FIG. 9, the character strings "Ｏ月" through "有効" each have a weight of 0.5, the character strings "１２" through "００" each have a weight of 0.2, the character strings "運転" and "免許" each have a weight of 5.0, and the character string "証" has a weight of 1.0. The feature vector serving as the character feature amount is obtained by multiplying the frequency of each character string by its weight, which gives the values of the elements x1 to x13 of the feature vector. For example, the element x1 corresponds to the character string "Ｏ月", and its value is 2 × 0.5 = 1.0. The element x11 corresponds to the character string "運転", and its value is 1 × 5.0 = 5.0. The same applies to the other elements.
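The arithmetic of FIG. 9 can be reproduced with a few lines. The weights are the ones quoted in the text. The frequencies of the digit chunks are not stated explicitly and are assumed to be 2, inferred from the concatenated text of FIG. 5, in which the number string appears twice. The function name `weighted_feature_vector` is hypothetical, and ASCII digits stand in for the full-width digits of the figure.

```python
# Chunk order x1 .. x13 as implied by FIG. 9.
CHUNKS = ["Ｏ月", "Ｏ日", "まで", "有効",
          "12", "34", "56", "78", "90", "00",
          "運転", "免許", "証"]

# Frequencies: chunks seen in both OCR results occur twice (assumed for the digits).
frequencies = {chunk: 2 for chunk in CHUNKS[:10]}
frequencies.update({"運転": 1, "免許": 1, "証": 1})

# Weights as quoted in the description of FIG. 9.
weights = {chunk: 0.5 for chunk in CHUNKS[:4]}
weights.update({chunk: 0.2 for chunk in CHUNKS[4:10]})
weights.update({"運転": 5.0, "免許": 5.0, "証": 1.0})

def weighted_feature_vector(chunks, frequencies, weights):
    """Element i is frequency(chunk i) * weight(chunk i)."""
    return [frequencies[c] * weights[c] for c in chunks]

x = weighted_feature_vector(CHUNKS, frequencies, weights)
```

The first element of `x` corresponds to x1 = 2 × 0.5 and the eleventh to x11 = 1 × 5.0, matching the two values worked out in the text.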

The classifier 57 classifies the document image based on the text data generated by the OCR processing unit 55. More specifically, the classifier 57 can classify the document image based on the character feature amount extracted by the feature amount extraction unit 56. To classify documents, the classifier 57 may use a trained model generated by machine learning, or it may use rules that associate character feature amounts with document types. The case of using a trained model is described first.

FIG. 10 is a schematic diagram showing a first example of a document classification method using a trained model. The classifier 57 has a neural network 571 as the trained model. The classifier 57 can be configured by combining hardware such as a CPU (for example, a multiprocessor with a plurality of processor cores), a GPU (Graphics Processing Unit), a DSP (Digital Signal Processor), and an FPGA (Field-Programmable Gate Array).

When the character feature amount (feature vector) extracted by the feature amount extraction unit 56 is input, the neural network 571 outputs a probability (a value from 0 to 1) for each document type. The document types include, for example, driver's license, health insurance card, passport, resident card, and other. The document types are not limited to those in the example of FIG. 10. The classifier 57 classifies the document image as the type with the largest value among the per-type probabilities output by the neural network 571. For example, if the probabilities for driver's license, health insurance card, passport, resident card, and other are 0.8, 0.4, 0.3, 0.1, and 0.1, respectively, the document image is classified as a driver's license. Another machine learning model may be used in place of the neural network 571.
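The final selection step can be sketched as below, using the example values from the text. The trained model itself is not shown; the dictionary simply stands in for its per-type outputs, and the function name `classify` is hypothetical.

```python
def classify(probabilities):
    """Return the document type with the largest output value."""
    return max(probabilities, key=probabilities.get)

# Per-type outputs quoted in the example above.
model_output = {
    "driver's license": 0.8,
    "health insurance card": 0.4,
    "passport": 0.3,
    "resident card": 0.1,
    "other": 0.1,
}
document_type = classify(model_output)
```

Only the ordering of the values matters for this step, so it works the same way whether or not the model's outputs are normalized to sum to 1.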

FIG. 11 is a schematic diagram showing a second example of a document classification method using a trained model. The classifier 57 has a first classifier 572, a second classifier 573, and an averaging unit 574. When the character feature amount (feature vector) extracted by the feature amount extraction unit 56 is input, each of the classifiers 572 and 573 outputs a probability (a value from 0 to 1) for each document type. The classifier 572 outputs the probabilities for driver's license, health insurance card, passport, resident card, and other from output terminals A1 to A5. The classifier 573 outputs the probabilities for the same types from output terminals B1 to B5. The averaging unit 574 takes the arithmetic mean of the probabilities output from terminals A1 and B1 and outputs it as the probability for driver's license. The same is done for health insurance card, passport, resident card, and other. The classifier 57 classifies the document image as the type with the largest value among the per-type probabilities output by the averaging unit 574.

For example, a linear SVM (support vector machine) can be used as the first classifier 572, and a gradient boosting tree can be used as the second classifier 573. Combining a plurality of machine learning models through ensemble learning can improve the accuracy of document image classification. The machine learning models to be combined are not limited to linear SVMs and gradient boosting trees.
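A minimal sketch of the averaging unit and the final selection, assuming each classifier returns its per-type values in the same fixed order (terminals A1 to A5 and B1 to B5). The two output lists are made-up numbers for illustration, not results from any trained model, and the function names are hypothetical.

```python
LABELS = ["driver's license", "health insurance card", "passport",
          "resident card", "other"]

def average_probabilities(*outputs):
    """Arithmetic mean of the per-type values across classifiers."""
    return [sum(values) / len(values) for values in zip(*outputs)]

def classify(probabilities, labels=LABELS):
    """Return the label whose averaged value is largest."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best]

first_output = [0.5, 0.3, 0.1, 0.05, 0.05]   # hypothetical values at A1..A5
second_output = [0.2, 0.6, 0.1, 0.05, 0.05]  # hypothetical values at B1..B5

averaged = average_probabilities(first_output, second_output)
document_type = classify(averaged)
```

In this made-up case the two classifiers disagree on the top type, and the averaged values decide between them.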

Next, the rule-based case is described.

FIG. 12 is a schematic diagram showing an example of a rule-based document classification method. The classifier 57 has a rule-based classifier 575 and a rule DB 577. The classifier 575 has a similarity calculation unit 576. Rules that associate feature vectors with document types are recorded in the rule DB 577 in advance. For example, the feature vector V1 is associated with a driver's license, and the feature vector V2 is associated with an insurance card. The same applies to the other document types.

When the feature vector Vx extracted by the feature amount extraction unit 56 is input, the classifier 575 uses the similarity calculation unit 576 to calculate the similarity between the input feature vector Vx and each feature vector recorded in the rule DB 577, and outputs the type associated with the most similar feature vector as the document type. Cosine similarity, for example, can be used to calculate the similarity.
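The rule-based lookup can be sketched as follows, with the rule DB modeled as a dictionary from document type to reference feature vector. The four-element vectors are invented for illustration (real vectors would have one element per chunk), and the names `cosine_similarity`, `classify_by_rules`, and `RULE_DB` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def classify_by_rules(vx, rule_db):
    """Return the type whose reference vector is most similar to vx."""
    return max(rule_db, key=lambda kind: cosine_similarity(vx, rule_db[kind]))

# Illustrative rule DB: V1 for the driver's license, V2 for the insurance card.
RULE_DB = {
    "driver's license": [5.0, 5.0, 0.0, 0.0],
    "health insurance card": [0.0, 0.0, 5.0, 1.0],
}

vx = [4.0, 5.0, 0.0, 1.0]
document_type = classify_by_rules(vx, RULE_DB)
```

Cosine similarity compares direction rather than magnitude, so a document whose text was read twice (doubling every frequency) would still match the same rule.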

Document classification by general image processing, such as pattern matching, determines the document type from the design and layout of the document, which makes it difficult to classify documents without a fixed format (for example, insurance cards and resident cards). As described above, the information processing device 50 performs optical character recognition with the OCR processing unit 55, so it can accurately classify even documents without a fixed format, independent of their design or layout.

The accuracy of character recognition by OCR processing is not 100%, and misrecognition occurs. As described above, the image processing unit 54 of the information processing device 50 performs a plurality of preprocessing passes (color reduction processing) with different thresholds, which reduces the number of missed characters and, as a result, improves the character recognition accuracy of the subsequent OCR processing.

Conventionally, document sorting work has in many cases relied on manually classifying the documents read by a scanner, which is prone to human error. As described above, the information processing device 50 uses a classifier generated by machine learning or a rule-based classifier, so the document sorting work can be automated, making it more efficient and reducing sorting mistakes.

Because the information processing device 50 classifies the document images, the document sorting work can be automated. However, the examination of application documents that follows the sorting work requires accurate judgment of examination items, such as whether all the documents required for the application are present and whether the documents are valid, so examination by a person remains essential. A method of providing information that is useful for examining documents is described below.

The identifying unit 59 identifies the contributing portion of the text data that contributed to the classification of the document image by the classifier 57. The output unit 60 outputs a document image in which the region corresponding to the contributing portion is highlighted. For example, the output unit 60 can output the highlighted document image to the terminal device 10 via the communication unit 52. Note that the output unit 60 may instead output only the document type to the terminal device 10.

FIG. 13 is a schematic diagram showing an example of a method of identifying the contributing portion. Each element of the feature vector corresponds to a character string in the text data, and the value of the element is the appearance frequency (weighted frequency) of that character string. Because each element corresponds to a character string in the text data, the position or coordinates of that character string in the text data can be recorded. The elements that contribute to the classification of a document image are considered to be those with large weighted frequency values, so if the positions or coordinates of the corresponding character strings in the text data are known, the contributing portion that contributed to the classification can be identified from them. Once the contributing portion in the text data is known, the region of the document image corresponding to it is also known.
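The identification step can be illustrated with a minimal Python sketch. It is not the implementation of the identifying unit 59: the function names, the `weights` dictionary, and the choice to score an n-gram as occurrence count times weight (with weight 0.0 for n-grams that have no entry) are assumptions made for illustration.

```python
from collections import defaultdict


def ngram_positions(text, n=2):
    """Map each character n-gram to the start offsets where it occurs in text."""
    positions = defaultdict(list)
    for i in range(len(text) - n + 1):
        positions[text[i:i + n]].append(i)
    return positions


def contributing_spans(text, weights, n=2, top_k=2):
    """Return (ngram, score, spans) for the top_k highest-scoring n-grams.

    score = occurrence count * weight. N-grams missing from `weights`
    are given weight 0.0, which is an assumption of this sketch.
    Each span is a (start, end) offset pair into `text`.
    """
    positions = ngram_positions(text, n)
    scored = [
        (gram, len(offsets) * weights.get(gram, 0.0))
        for gram, offsets in positions.items()
    ]
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [
        (gram, score, [(s, s + n) for s in positions[gram]])
        for gram, score in scored[:top_k]
    ]


if __name__ == "__main__":
    # Hypothetical weights for two bigrams of a toy text.
    for gram, score, spans in contributing_spans("abcab", {"ab": 2.0, "bc": 0.5}):
        print(gram, score, spans)
```

Recording offsets while the n-grams are generated keeps the link between a feature vector element and its location in the text, which is what allows the corresponding image region to be found afterwards.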

FIG. 14 is a schematic diagram showing an example of a highlighted document image. In the example of FIG. 14, the document is a health insurance card, and the region corresponding to the contributing portion is enclosed by a thick line. That region contains the text "健康保険" (health insurance) and "被保険者証" (insured person's certificate). The highlighting may be a thick-line frame as in FIG. 14, or highlighting by color or pattern; any display mode is acceptable as long as the person in charge of examination can easily determine the document type.
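One way to obtain the region to enclose is sketched below in Python. It assumes the OCR step returns one bounding box `(x, y, width, height)` per recognized character, in the same order as the text; the source does not state the OCR output format, so this layout and the function name `enclosing_box` are hypothetical.

```python
def enclosing_box(char_boxes, start, end):
    """Smallest rectangle (left, top, right, bottom) that covers the
    characters at offsets start..end-1.

    char_boxes: list of (x, y, width, height) tuples, one per character
    (an assumed OCR output format).
    """
    boxes = char_boxes[start:end]
    if not boxes:
        raise ValueError("span selects no characters")
    left = min(x for x, y, w, h in boxes)
    top = min(y for x, y, w, h in boxes)
    right = max(x + w for x, y, w, h in boxes)
    bottom = max(y + h for x, y, w, h in boxes)
    return (left, top, right, bottom)


if __name__ == "__main__":
    # Three characters on roughly the same text line (made-up coordinates).
    boxes = [(10, 20, 8, 12), (18, 21, 8, 12), (26, 19, 8, 12)]
    print(enclosing_box(boxes, 0, 2))
```

The returned rectangle could then be drawn as a thick frame or filled with a color, depending on the display mode chosen.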

As described above, the explanatory basis for the classification can be shown by highlighting characters on the document image, so the person examining the documents can judge the examination items accurately and quickly.

FIG. 15 is a flowchart showing an example of the procedure by which the information processing device 50 classifies document images. For convenience, the following description treats the control unit 51 as the entity performing the processing. The control unit 51 acquires a document image (S11) and performs sharpening processing on the acquired document image (S12). The control unit 51 performs size change processing as necessary (S13).

The control unit 51 performs color reduction processing based on a plurality of thresholds (S14), performs OCR processing on each of the processed images generated by the color reduction processing to convert them into text (S15), and concatenates the resulting pieces of converted text to generate text data (S16).
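Steps S14 to S16 can be sketched in Python as follows. The sketch assumes the color reduction is a simple two-level binarization of a grayscale image held as nested lists, and it takes the OCR engine as a caller-supplied function `ocr`; both the representation and the function names are assumptions, and the threshold values are made up.

```python
def binarize(gray, threshold):
    """Reduce a grayscale image (rows of 0-255 ints) to two levels:
    0 for pixels darker than the threshold, 255 otherwise."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]


def text_from_thresholds(gray, thresholds, ocr):
    """Binarize at each threshold (S14), run OCR on every result (S15),
    and concatenate the recognized text (S16).

    ocr: any callable that maps an image to a string (hypothetical here).
    """
    processed = [binarize(gray, t) for t in thresholds]
    return "\n".join(ocr(image) for image in processed)


if __name__ == "__main__":
    gray = [[10, 100, 200], [50, 150, 250]]

    # Stand-in for a real OCR engine: reports the number of dark pixels.
    def fake_ocr(image):
        return str(sum(px == 0 for row in image for px in row))

    print(text_from_thresholds(gray, [64, 128, 192], fake_ocr))
```

Faint characters that vanish at a low threshold may survive at a higher one, so concatenating the outputs gives later steps more chances to see each character string.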

The control unit 51 extracts a character feature amount (feature vector) from the text data (S17), and inputs the extracted character feature amount to the classifier 57 to classify the document image (S18). The control unit 51 outputs to the terminal device 10 a document image in which the region corresponding to the contributing portion that contributed to the classification is highlighted (S19).
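As a rough illustration of S17 and S18, the Python sketch below builds a character n-gram count vector and picks the document type whose reference text is most similar by cosine similarity. This is only one hypothetical form a rule-based classifier could take; the reference texts, the labels, and the use of cosine similarity are assumptions, not details taken from the source.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Count every character n-gram in text (the feature vector of S17)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, references, n=2):
    """Return (best_label, scores), comparing text with one reference
    string per document type (S18, in this hypothetical rule-based form)."""
    features = char_ngrams(text, n)
    scores = {
        label: cosine(features, char_ngrams(reference, n))
        for label, reference in references.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    references = {
        "insurance_card": "health insurance card",
        "resident_record": "certificate of residence",
    }
    label, scores = classify("health insurance", references)
    print(label, scores)
```

A trained model could replace `classify` without changing the feature extraction, since both consume the same feature vector.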

The control unit 51 determines whether there is another document image (S20). If there is (YES in S20), it repeats the processing from step S11; if there is not (NO in S20), it ends the processing.

The information processing device 50 can also be realized using a computer equipped with, for example, a CPU (for example, a multiprocessor with a plurality of processor cores), a GPU (Graphics Processing Unit), and RAM. The information processing device 50 can be realized on the computer by loading a computer program (recordable on a recording medium) that defines a processing procedure such as that shown in FIG. 15 into the RAM of the computer and executing the program on the CPU (processor).

The information processing device includes a sharpening processing unit that performs sharpening processing on the image before or after the color reduction processing.

The information processing device includes a size change processing unit that performs size change processing on the image before or after the color reduction processing.

In the information processing device, the text data generation unit generates text data by concatenating a plurality of pieces of converted text obtained by optically reading each of the plurality of processed images generated by the image generation unit.

In the information processing device, the classifier classifies the document image based on a character feature amount extracted from the text data generated by the text data generation unit, using a rule that associates character feature amounts with document types, or using a trained model that outputs a document type when a character feature amount is input.

The information processing device includes a feature amount extraction unit that extracts the character feature amount based on the frequency with which each character string, obtained by dividing the text data generated by the text data generation unit, appears in the text data.

In the information processing device, the feature amount extraction unit includes a character n-gram method.

The information processing device includes a weighting unit that assigns weights in advance to character string frequencies, and the feature amount extraction unit extracts the character feature amount by applying the weights to the frequency of each of the divided character strings.

In the information processing device, the weighting unit includes a TF-IDF method.
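The combination of character n-grams and TF-IDF weighting can be sketched in Python as follows. The sketch uses the textbook form idf(g) = log(N / df(g)) with term frequency normalized by the total n-gram count; the source does not say which TF-IDF variant is used, so the formula, the function names, and the toy corpus are assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Count every character n-gram in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def idf_weights(corpus, n=2):
    """Precompute idf(g) = log(N / df(g)) over a corpus of texts, where
    df(g) is the number of texts that contain n-gram g."""
    document_frequency = Counter()
    for text in corpus:
        document_frequency.update(set(char_ngrams(text, n)))
    total = len(corpus)
    return {g: math.log(total / df) for g, df in document_frequency.items()}


def tfidf_vector(text, idf, n=2):
    """Weight each n-gram's relative frequency in text by its idf.
    N-grams not seen when the weights were built are skipped."""
    counts = char_ngrams(text, n)
    total = sum(counts.values())
    return {g: (c / total) * idf[g] for g, c in counts.items() if g in idf}


if __name__ == "__main__":
    idf = idf_weights(["abab", "abcd"])
    print(tfidf_vector("abab", idf))
```

Because "ab" occurs in both toy texts, its weight is zero, while "ba" occurs in only one and keeps a positive weight. This is the sense in which strings common to every document type contribute little to classification.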

The information processing device includes an identifying unit that identifies a contributing portion of the text data that contributed to the classification of the document image, and an output unit that outputs the document image in which a region corresponding to the contributing portion is highlighted.

1 communication network
10 terminal device
50 information processing device
51 control unit
52 communication unit
53 storage unit
54 image processing unit
55 OCR processing unit
56 feature amount extraction unit
57, 572, 573, 575 classifier
574 averaging unit
576 similarity calculation unit
577 rule DB
571 neural network
58 weighting unit
581 weighting information
59 identifying unit
60 output unit

Claims

An information processing device comprising:
an acquisition unit that acquires a document image;
an image generation unit that performs color reduction processing based on a plurality of thresholds on the document image acquired by the acquisition unit to generate a plurality of processed images;
a text data generation unit that optically reads each of the plurality of processed images generated by the image generation unit to generate text data; and
a classifier that classifies the document image based on the plurality of pieces of text data generated by the text data generation unit.

The information processing device according to claim 1, further comprising a sharpening processing unit that performs sharpening processing on the image before or after the color reduction processing.

The information processing device according to claim 1 or 2, further comprising a size change processing unit that performs size change processing on the image before or after the color reduction processing.

The information processing device according to any one of claims 1 to 3, wherein the text data generation unit generates text data by concatenating a plurality of pieces of converted text obtained by optically reading each of the plurality of processed images generated by the image generation unit.

The information processing device according to any one of claims 1 to 4, wherein the classifier classifies the document image based on a character feature amount extracted from the text data generated by the text data generation unit, using a rule that associates character feature amounts with document types, or using a trained model that outputs a document type when a character feature amount is input.

The information processing device according to claim 5, further comprising a feature amount extraction unit that extracts the character feature amount based on the frequency with which each character string, obtained by dividing the text data generated by the text data generation unit, appears in the text data.

The information processing device according to claim 6, wherein the feature amount extraction unit includes a character n-gram method.

The information processing device according to claim 6 or 7, further comprising a weighting unit that assigns weights in advance to character string frequencies, wherein the feature amount extraction unit extracts the character feature amount by applying the weights to the frequency of each of the divided character strings.

The information processing device according to claim 8, wherein the weighting unit includes a TF-IDF method.

The information processing device according to any one of claims 1 to 9, further comprising:
an identifying unit that identifies a contributing portion of the text data that contributed to the classification of the document image; and
an output unit that outputs the document image in which a region corresponding to the contributing portion is highlighted.

A computer program that causes a computer to execute processing comprising:
acquiring a document image;
performing color reduction processing based on a plurality of thresholds on the acquired document image to generate a plurality of processed images;
optically reading each of the plurality of generated processed images to generate text data; and
classifying the document image based on the plurality of pieces of generated text data.