CN115546801A - Method for extracting paper image data features of test document - Google Patents

Method for extracting paper image data features of test document

Info

Publication number
CN115546801A
Authority
CN
China
Prior art keywords
image
input
area
sequence
image data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210725519.0A
Other languages
Chinese (zh)
Inventor
严浩
王芳潇
范强
江春
周晓磊
张骁雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210725519.0A
Publication of CN115546801A
Legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/16 Image preprocessing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1914 Determining representative reference patterns, e.g. averaging or distorting patterns; Generating dictionaries, e.g. user dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Geometry (AREA)
  • Medical Informatics (AREA)
  • Computer Graphics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for extracting features from the paper image data of test documents. The method comprises: image preprocessing, namely performing paging, skew correction and binarization on the paper image data of the test document; layout analysis, detecting the paragraph areas, table areas, figure/table caption areas, page-number areas and image areas contained in the image based on RefineNet; index establishment, indexing the detected paragraph areas, table areas, figure/table caption areas, page-number areas and image areas through a data dictionary; and character recognition, namely recognizing paragraph-area characters, table-area characters and page numbers through CRNN-based character recognition. Compared with the prior art, the method features a lightweight model and short detection and recognition times; it can effectively shorten the time needed for manual entry of paper image data and save labor costs.

Description

Method for extracting paper image data features of test document
Technical Field
The invention belongs to the field of image feature extraction, and particularly relates to a method for extracting features from the paper image data of test documents.
Background
Digitizing paper documents is a current trend in informatization. Enterprises and public institutions currently lack an intelligent digital extraction method for acquiring the paper image data of test documents, making uniform, standardized extraction of paper image data difficult. As a result, no standard, normalized dataset foundation can be provided for data-mining analysis and the training of artificial intelligence models, and the traditional data collected cannot support the intelligent service applications these institutions require.
Disclosure of Invention
Aiming at the technical problems of poor acquisition quality and long processing time for paper image data in the prior art, the invention provides a method for extracting features from the paper image data of test documents. The method provides an index function for the paragraph areas, table areas, figure/table caption areas, page-number areas and image areas of the paper image data, and supports fast recognition of paragraph-area characters, table-area characters and page numbers.
The invention specifically adopts the following technical scheme: a method for extracting paper image data features of a test document comprises the following steps:
step SS1: uploading the image to be recognized, comprising: uploading the PDF image to be recognized to the processing program;
step SS2: image preprocessing, comprising: strengthening the effective information of the PDF image to be recognized and weakening redundant or invalid information; the preprocessing comprises paging of the input multi-page PDF data, skew correction, and image binarization;
step SS3: layout analysis, comprising: detecting the paragraph areas, table areas, figure/table caption areas, page-number areas and image areas contained in the preprocessed image data through RefineNet-based intelligent recognition;
step SS4: index establishment, comprising: building indexes for the different areas recognized in the paper image data through a data dictionary and mapping the positional relations of the different area types (a minimal sketch of such an index follows this step list);
step SS5: character recognition, comprising: building a CRNN character recognition model consisting of a CNN layer, an RNN layer and a CTC layer; first, a convolutional neural network extracts image features to obtain an input feature sequence; an LSTM recurrent neural network then predicts over the feature sequence to capture more context information; finally, CTC is used as the loss function to solve the alignment problem of variable-length inputs;
step SS6: post-recognition verification, comprising: after intelligent character feature extraction, checking and correcting recognition errors on a front-end web interface.
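For illustration, the following is a minimal Python sketch, not taken from the patent, of the kind of data dictionary the index in step SS4 could use; the field names and the key format are assumptions introduced here:

```python
# Hypothetical data-dictionary index for detected layout regions (step SS4).
# All field names and the key format are illustrative assumptions.

from typing import TypedDict


class Region(TypedDict):
    region_type: str                    # "paragraph", "table", "caption", "page_number", "image"
    page: int                           # page index from the SS21 paging step
    bbox: tuple[int, int, int, int]     # (x, y, width, height) in pixels


def build_index(regions: list[Region]) -> dict[str, Region]:
    """Key each region as '<page>:<type>:<ordinal>' so paragraph text,
    table text, and page numbers can be fetched later for recognition."""
    index: dict[str, Region] = {}
    counters: dict[tuple[int, str], int] = {}
    for region in regions:
        group = (region["page"], region["region_type"])
        counters[group] = counters.get(group, 0) + 1
        index[f'{region["page"]}:{region["region_type"]}:{counters[group]}'] = region
    return index
```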
As a preferred embodiment, the step SS2 includes the following steps:
step SS21: paging the multi-page PDF image;
step SS22: image skew correction, namely applying a Hough transform to the photographed, skewed paper image to obtain a corrected image;
step SS23: image binarization, using one-dimensional maximum entropy threshold segmentation so that the quality of the input PDF image is improved as far as possible and the requirements of the subsequent automatic entry system on the input image are met.
As a preferred embodiment, the layout analysis in step SS3 adopts a RefineNet-based intelligent recognition method, whose framework consists of two modules, an ARM module and an ODM module, connected by TCBs. The loss function, shown below, has an ARM part and an ODM part: the ARM part comprises a binary classification loss $L_b$ and a regression loss $L_r$, and likewise the ODM part comprises a multi-class classification loss $L_m$ and a regression loss $L_r$:
$$L(\{p_i\},\{x_i\},\{c_i\},\{t_i\}) = \frac{1}{N_{arm}}\Big(\sum_i L_b\big(p_i,[l_i^{*}\ge 1]\big) + \sum_i [l_i^{*}\ge 1]\,L_r\big(x_i,g_i^{*}\big)\Big) + \frac{1}{N_{odm}}\Big(\sum_i L_m\big(c_i,l_i^{*}\big) + \sum_i [l_i^{*}\ge 1]\,L_r\big(t_i,g_i^{*}\big)\Big)$$

where $p_i$ and $x_i$ are the objectness confidence and regression coordinates of anchor $i$ in the ARM module; $c_i$ and $t_i$ are the class confidence and coordinate regression of the refined anchor in the ODM module; $N_{arm}$ and $N_{odm}$ are the numbers of positive samples in the batch; $l_i^{*}$ is the ground-truth class label of anchor $i$; and $g_i^{*}$ is the ground-truth position and size of the $i$-th anchor.
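The following PyTorch-style sketch shows, under stated assumptions, how the four terms could be combined; anchor matching is assumed to have happened elsewhere, and all tensor names (arm_cls, arm_loc, odm_cls, odm_loc) are invented here for illustration:

```python
# Illustrative combination of the two-branch loss above (not the patent's code).
# arm_cls: (N,) objectness logits, odm_cls: (N, C) class logits,
# arm_loc/odm_loc: (N, 4) box regressions, labels: (N,) long, 0 = background.

import torch
import torch.nn.functional as F


def refine_loss(arm_cls, arm_loc, odm_cls, odm_loc, labels, gt_boxes):
    pos = labels >= 1                                  # the indicator [l_i* >= 1]
    n_arm = pos.sum().clamp(min=1).float()             # N_arm
    n_odm = n_arm                                      # N_odm, positives in the batch

    # ARM part: binary classification loss L_b plus regression loss L_r on positives.
    l_b = F.binary_cross_entropy_with_logits(arm_cls, pos.float(), reduction="sum")
    l_r_arm = F.smooth_l1_loss(arm_loc[pos], gt_boxes[pos], reduction="sum")

    # ODM part: multi-class classification loss L_m plus refined regression L_r.
    l_m = F.cross_entropy(odm_cls, labels, reduction="sum")
    l_r_odm = F.smooth_l1_loss(odm_loc[pos], gt_boxes[pos], reduction="sum")

    return (l_b + l_r_arm) / n_arm + (l_m + l_r_odm) / n_odm
```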
As a preferred embodiment, the CRNN-based character recognition of step SS5 predicts the sequence through the following steps:
step SS51: CNN model design, using the convolutional and max-pooling layers of the VGG structure to extract features from the image sequence;
step SS52: RNN layer design, using a deep bidirectional recurrent neural network (Bi-LSTM) as the RNN layer, which produces one output for each input of the feature sequence supplied by the CNN layer;
step SS53: CTC layer design. Define the alphabet/syllable set of the sequence labelling task as $A$, and let $A'$ be the extended set with the blank character added. Let $y_k^t$ denote the probability that the CTC network outputs element $k$ at time $t$. Given an input sequence $x$ of length $T$, let $A'^T$ be the set of sequences of length $T$ over $A'$. Assuming the outputs at different times are conditionally independent given $x$, the probability of any path $\pi \in A'^T$ is

$$p(\pi \mid x) = \prod_{t=1}^{T} y_{\pi_t}^{t}, \qquad \forall \pi \in A'^{T}.$$

Let $l$ denote the output label sequence. Since multiple paths in $A'^T$ map to the same result, a function $B: A'^{T} \to A^{\le T}$ is defined to map the path set to the final prediction sequence. The probability of predicting the true label sequence is

$$p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x),$$

where $p(l \mid x)$ is the probability that the true label sequence is predicted.
As a preferred embodiment, the step SS51 further includes fine-tuning the VGG network: the kernel size of the third and fourth max-pooling layers is changed from 2 x 2 to 1 x 2, and a Batch Normalization layer is added after the fifth and sixth convolutional layers to speed up training.
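A hedged sketch of such a backbone follows; the channel counts are an assumption taken from the common CRNN configuration, and the rectangular pooling is written as a (2, 1) kernel in PyTorch's (height, width) convention, which halves the height only, matching the 1 x 2 window of the text's width-by-height notation:

```python
# Illustrative VGG-style CRNN backbone with the fine-tuning described above:
# pools 3 and 4 are rectangular, and BatchNorm follows convolutions 5 and 6.

import torch.nn as nn

crnn_cnn = nn.Sequential(
    nn.Conv2d(1, 64, 3, 1, 1), nn.ReLU(True), nn.MaxPool2d(2, 2),       # conv 1 + square pool
    nn.Conv2d(64, 128, 3, 1, 1), nn.ReLU(True), nn.MaxPool2d(2, 2),     # conv 2 + square pool
    nn.Conv2d(128, 256, 3, 1, 1), nn.ReLU(True),                        # conv 3
    nn.Conv2d(256, 256, 3, 1, 1), nn.ReLU(True),                        # conv 4
    nn.MaxPool2d((2, 1), (2, 1)),                                       # pool 3: rectangular
    nn.Conv2d(256, 512, 3, 1, 1), nn.BatchNorm2d(512), nn.ReLU(True),   # conv 5 + BatchNorm
    nn.Conv2d(512, 512, 3, 1, 1), nn.BatchNorm2d(512), nn.ReLU(True),   # conv 6 + BatchNorm
    nn.MaxPool2d((2, 1), (2, 1)),                                       # pool 4: rectangular
    nn.Conv2d(512, 512, 2, 1, 0), nn.ReLU(True),                        # conv 7: height -> 1
)
```

For a 32-pixel-high text-line image this leaves a feature map of height 1, whose width positions form the feature sequence handed to the RNN layer.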
As a preferred embodiment, the step SS52 specifically includes: to prevent the gradient from vanishing during training and to use both the forward and backward information of the sequence for prediction, the deep bidirectional recurrent neural network Bi-LSTM controls the long-term state $c$ through three "gates", each of the form

$$g(x) = \sigma(Wx + b),$$

where $g(x)$ is the gate function, $\sigma$ is the sigmoid function, $W$ is the gate's weight vector, $b$ is a bias term, and $x$ is the input. Since $\sigma$ has range $(0, 1)$, each gate is partially open, between fully closed and fully open.

The first gate controls what is kept in the long-term state $c$ and is called the forget gate $f_t$:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f),$$

where $W_f$ is the forget-gate weight matrix, $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input, and $b_f$ is a bias term.

The second gate controls how much of the instantaneous state enters the long-term state $c$ and is called the input gate $i_t$:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i),$$

where $W_i$ is a weight matrix and $b_i$ is a bias term.

The candidate cell state $\tilde{c}_t$, which describes the contribution of the current input, is

$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c),$$

where $W_c$ is a weight matrix and $b_c$ is a bias term.

The current state $c_t$ is the previous cell state $c_{t-1}$ multiplied element-wise by the forget gate $f_t$, plus the candidate cell state $\tilde{c}_t$ multiplied element-wise by the input gate $i_t$:

$$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t.$$

The forget gate thus controls the cell state so that early state information can be retained, while the input gate controls how much of the current input enters the memory.

The third gate, the output gate $o_t$, controls the effect of the long-term state on the current output:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o),$$

where $W_o$ is a weight matrix and $b_o$ is a bias term.

The output gate and the cell state together determine the final output $h_t$ of the LSTM:

$$h_t = o_t \circ \tanh(c_t).$$
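A minimal sketch of such an RNN layer, with illustrative (assumed) sizes, could look as follows:

```python
# Illustrative Bi-LSTM RNN layer: one per-step class-score output for each
# input step of the CNN feature sequence. Sizes are assumptions.

import torch
import torch.nn as nn


class BiLSTMLayer(nn.Module):
    def __init__(self, in_features: int = 512, hidden: int = 256, num_classes: int = 100):
        super().__init__()
        self.rnn = nn.LSTM(in_features, hidden, num_layers=2, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)   # fuse forward and backward states

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (T, N, in_features) sequence from the CNN layer
        out, _ = self.rnn(features)                    # (T, N, 2 * hidden)
        return self.fc(out)                            # (T, N, num_classes), fed to the CTC layer


scores = BiLSTMLayer()(torch.randn(32, 4, 512))        # one output per input step: (32, 4, 100)
```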
As a preferred embodiment, the step SS6 specifically includes: framing the detected text area and adding an editable window so that recognition errors can be corrected.
Compared with the prior art, the invention has the following beneficial effects. Aiming at the poor acquisition quality and long processing time of existing paper image data entry, the invention provides a method for extracting features from the paper image data of test documents. The PDF image to be recognized is uploaded to the processing program. The effective information of the image is strengthened and redundant or invalid information is weakened through paging of the multi-page PDF data, skew correction, and image binarization. Paragraph areas, table areas, figure/table caption areas, page-number areas and image areas are detected through RefineNet-based intelligent recognition of the preprocessed image data. Indexes are built for the recognized areas through a data dictionary, mapping the positional relations of the different area types. A CRNN character recognition model is built, consisting of a CNN layer, an RNN layer and a CTC layer: the convolutional neural network first extracts image features to obtain an input feature sequence, the LSTM recurrent neural network then predicts over the feature sequence to capture more context information, and finally CTC serves as the loss function to solve the alignment problem of variable-length inputs. After intelligent character extraction, recognition errors are checked and corrected on a front-end web interface. On the premise of not sacrificing much precision, the method performs fast character extraction on paper image data under limited computing resources, solving as a whole the poor acquisition quality and long processing time of existing paper image data entry.
Drawings
FIG. 1 is a flow chart of a method for extracting paper image data features of a test document according to the present invention;
FIG. 2 is a CRNN network structure;
FIG. 3 is the verification view after recognition.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1: as shown in FIGS. 1, 2 and 3, the invention provides a method for extracting features from the paper image data of a test document, with the following specific steps:
and S1, uploading PDF images to be identified.
And S2, image preprocessing, namely performing paging processing, inclination correction and binarization processing on the uploaded image.
Specifically, in step S2, the following steps are further included:
step S21, paging processing is carried out on a plurality of pages of PDF images;
s22, correcting the image inclination, namely performing Hough transformation on the shot inclined papery image to obtain a corrected image;
and S23, carrying out image binarization processing, wherein one-dimensional maximum entropy threshold segmentation is adopted for image binarization, so that the quality of the input PDF image is improved to the greatest extent, and the requirement of a subsequent automatic input system on the input image is met.
Step S3: layout analysis, detecting the paragraph areas, table areas, figure/table caption areas and page-number areas of the preprocessed image according to the different area types.
Step S4: the detected areas of different types are mapped to corresponding identifiers to implement indexing of the different area types.
Step S5: CRNN-based character recognition. FIG. 2 shows the structure of the CRNN network, which comprises a CNN layer, an RNN layer and a CTC layer.
Specifically, the CNN layer performs feature extraction on the image sequence using the convolutional and max-pooling layers of the VGG structure, and the VGG network is fine-tuned as follows:
the kernel size of the third and fourth max-pooling layers is changed from 2 x 2 to 1 x 2;
a Batch Normalization layer is added after the fifth and sixth convolutional layers to speed up training.
The RNN layer adopts a deep bidirectional recurrent neural network (Bi-LSTM), producing one output for each input of the feature sequence supplied by the CNN layer. To prevent the gradient from vanishing during training and to use both forward and backward information of the sequence for prediction, the Bi-LSTM controls the long-term state $c$ through three "gates", each of the form

$$g(x) = \sigma(Wx + b),$$

where $g(x)$ is the gate function, $\sigma$ is the sigmoid function, $W$ is the gate's weight vector, $b$ is a bias term, and $x$ is the input. Since $\sigma$ has range $(0, 1)$, each gate is partially open, between fully closed and fully open.
The first "gate" controlling the saving of the long-term state c is called the forgetting gate f t
f t =σ(W f ·[h t-1 ,x t ]+b f )
Wherein, W f Is the forgetting gate weight matrix, [ h ] t-1 ,x t ]Is a combined matrix of the combined hidden layer and the current input, b f Is a weight matrix.
The second "gate" controlling the entry of the instantaneous state into the long-term state c, called entry gate i t
i t =σ(W i ·[h t-1 ,x t ]+b i )
Wherein, W i Is a weight matrix, [ h ] t-1 ,x t ]Is a combined matrix of the combined hidden layer and the current input, b i Is a bias term.
The third "gate" for describing the currently input cell state
Figure BDA0003713130520000081
The output of the long-term state c at the current LSTM is controlled.
Figure BDA0003713130520000082
Wherein, W c Is a weight matrix, [ h ] t-1 ,x t ]Is a combined matrix of the combined hidden layer and the current input, b c Is a bias term.
Current state c (t) State c represented as the last cell (t-1) Forgetting gate f by element t Plus the current input cell state
Figure BDA0003713130520000083
Multiply input Gate by element i t Obtaining:
Figure BDA0003713130520000091
the forgetting door controls the state of the unit, so that the early state information can be stored, and the input door controls the current input and controls the input quantity of the current state to enter the memory.
Output gate controls the effect of long-term conditions on the current output:
o (t) =σ(W o ·[h t-1 ,x t ]+b o )
wherein, W o Is a weight matrix, [ h ] t-1 ,x t ]Is a combined matrix of the combined hidden layer and the current input, b o Is the bias term.
The output gate and cell states together determine the final output h of the LSTM t
h t =o (t) ·tanh(c t )
The CTC layer is designed as follows. Define the alphabet/syllable set of the sequence labelling task as $A$, and let $A'$ be the extended set with the blank character added. Let $y_k^t$ denote the probability that the CTC network outputs element $k$ at time $t$. Given an input sequence $x$ of length $T$, let $A'^T$ be the set of sequences of length $T$ over $A'$. Assuming the outputs at different times are conditionally independent given $x$, the probability of any path $\pi \in A'^T$ is

$$p(\pi \mid x) = \prod_{t=1}^{T} y_{\pi_t}^{t}, \qquad \forall \pi \in A'^{T}.$$

Let $l$ denote the output label sequence. Since multiple paths in $A'^T$ map to the same result, a function $B: A'^{T} \to A^{\le T}$ is defined to map the path set to the final prediction sequence. The probability of predicting the true label sequence can be expressed as

$$p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x).$$
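For illustration, the standard greedy decoding that inverts the mapping B can be sketched as follows; the label ids and blank index are assumptions:

```python
# Greedy CTC decoding sketch: take the arg-max label per time step (upstream),
# then collapse consecutive repeats and drop blanks, i.e. apply the mapping B.

def ctc_greedy_decode(step_labels: list[int], blank: int = 0) -> list[int]:
    decoded: list[int] = []
    prev = None
    for k in step_labels:
        if k != prev and k != blank:
            decoded.append(k)
        prev = k
    return decoded


# With 'a' = 1 and 'b' = 2, the path [1, 1, 0, 1, 2, 2] collapses to "aab".
assert ctc_greedy_decode([1, 1, 0, 1, 2, 2]) == [1, 1, 2]
```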
Step S6: post-recognition verification, which mainly includes framing the detected text area and adding an editable window so that recognition errors can be corrected.
The invention provides a deep-learning-based intelligent detection and recognition technique for the image data of paper documents, capable of intelligent feature extraction from large numbers of scanned paper documents. The paper image data are input and converted into a picture sequence, and a RefineNet-based layout recognition algorithm detects the paragraph areas, table areas, figure/table caption areas and page-number areas in the document. For the detected paragraphs, tables, figure/table captions and page numbers, the CRNN-based intelligent recognition technique recognizes the characters. Manual verification is added after recognition, so that checking and error correction can be performed on a front-end web interface.
It should be noted that ARM is short for Anchor Refinement Module, ODM for Object Detection Module, and TCB for Transfer Connection Block.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (7)

1. A method for extracting paper image data features of a test document is characterized by comprising the following steps:
step SS1: uploading the image to be recognized, comprising: uploading the PDF image to be recognized to the processing program;
step SS2: image preprocessing, comprising: strengthening the effective information of the PDF image to be recognized and weakening redundant or invalid information; the preprocessing comprises paging of the input multi-page PDF data, skew correction, and image binarization;
step SS3: layout analysis, comprising: detecting the paragraph areas, table areas, figure/table caption areas, page-number areas and image areas from the preprocessed image data through RefineNet-based intelligent recognition;
step SS4: index establishment, comprising: building indexes for the different areas recognized in the paper image data through a data dictionary and mapping the positional relations of the different area types;
step SS5: character recognition, comprising: building a CRNN character recognition model consisting of a CNN layer, an RNN layer and a CTC layer; first, a convolutional neural network extracts image features to obtain an input feature sequence; an LSTM recurrent neural network then predicts over the feature sequence to capture more context information; finally, CTC is used as the loss function to solve the alignment problem of variable-length inputs;
step SS6: post-recognition verification, comprising: after intelligent character extraction, checking and correcting recognition errors on a front-end web interface.
2. The method for extracting paper image data features of a test document as claimed in claim 1, wherein the step SS2 comprises the following steps:
step SS21: paging the multi-page PDF image;
step SS22: image skew correction, namely applying a Hough transform to the photographed, skewed paper image to obtain a corrected image;
step SS23: image binarization, using one-dimensional maximum entropy threshold segmentation so that the quality of the input PDF image is improved as far as possible and the requirements of the subsequent automatic entry system on the input image are met.
3. The method for extracting paper image data features of a test document as claimed in claim 1, wherein the layout analysis in step SS3 comprises: a RefineNet-based intelligent recognition method, the RefineNet framework consisting of two modules, an ARM module and an ODM module, connected by TCBs; the loss function, shown below, has an ARM part and an ODM part: the ARM part comprises a binary classification loss $L_b$ and a regression loss $L_r$, and likewise the ODM part comprises a multi-class classification loss $L_m$ and a regression loss $L_r$:

$$L(\{p_i\},\{x_i\},\{c_i\},\{t_i\}) = \frac{1}{N_{arm}}\Big(\sum_i L_b\big(p_i,[l_i^{*}\ge 1]\big) + \sum_i [l_i^{*}\ge 1]\,L_r\big(x_i,g_i^{*}\big)\Big) + \frac{1}{N_{odm}}\Big(\sum_i L_m\big(c_i,l_i^{*}\big) + \sum_i [l_i^{*}\ge 1]\,L_r\big(t_i,g_i^{*}\big)\Big)$$

where $p_i$ and $x_i$ are the objectness confidence and regression coordinates of anchor $i$ in the ARM module; $c_i$ and $t_i$ are the class confidence and coordinate regression of the refined anchor in the ODM module; $N_{arm}$ and $N_{odm}$ are the numbers of positive samples in the batch; $l_i^{*}$ is the ground-truth class label of anchor $i$; and $g_i^{*}$ is the ground-truth position and size of the $i$-th anchor.
4. The method for extracting paper image data features of a test document as claimed in claim 1, wherein the CRNN-based character recognition of step SS5 predicts the sequence through the following steps:
step SS51: CNN model design, using the convolutional and max-pooling layers of the VGG structure to extract features from the image sequence;
step SS52: RNN layer design, using a deep bidirectional recurrent neural network (Bi-LSTM) as the RNN layer, which produces one output for each input of the feature sequence supplied by the CNN layer;
step SS53: CTC layer design: define the alphabet/syllable set of the sequence labelling task as $A$, and let $A'$ be the extended set with the blank character added; let $y_k^t$ denote the probability that the CTC network outputs element $k$ at time $t$; given an input sequence $x$ of length $T$, let $A'^T$ be the set of sequences of length $T$ over $A'$; assuming the outputs at different times are conditionally independent given $x$, the probability of any path $\pi \in A'^T$ is

$$p(\pi \mid x) = \prod_{t=1}^{T} y_{\pi_t}^{t}, \qquad \forall \pi \in A'^{T};$$

let $l$ denote the output label sequence; since multiple paths in $A'^T$ map to the same result, a function $B: A'^{T} \to A^{\le T}$ is defined to map the path set to the final prediction sequence; the probability of predicting the true label sequence is expressed as

$$p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x),$$

where $p(l \mid x)$ is the probability that the true label sequence is predicted.
5. The method for extracting paper image data features of a test document as claimed in claim 4, wherein the step SS51 further comprises fine-tuning the VGG network: the kernel size of the third and fourth max-pooling layers is changed from 2 x 2 to 1 x 2, and a Batch Normalization layer is added after the fifth and sixth convolutional layers to speed up training.
6. The method for extracting paper image data features of a test document as claimed in claim 4, wherein the step SS52 specifically comprises: to prevent the gradient from vanishing during training and to use both the forward and backward information of the sequence for prediction, the deep bidirectional recurrent neural network Bi-LSTM controls the long-term state $c$ through three "gates", each of the form

$$g(x) = \sigma(Wx + b),$$

where $g(x)$ is the gate function, $\sigma$ is the sigmoid function, $W$ is the gate's weight vector, $b$ is a bias term, and $x$ is the input; since $\sigma$ has range $(0, 1)$, each gate is partially open, between fully closed and fully open;

the first gate controls what is kept in the long-term state $c$ and is called the forget gate $f_t$:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f),$$

where $W_f$ is the forget-gate weight matrix, $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input, and $b_f$ is a bias term;

the second gate controls how much of the instantaneous state enters the long-term state $c$ and is called the input gate $i_t$:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i),$$

where $W_i$ is a weight matrix and $b_i$ is a bias term;

the candidate cell state $\tilde{c}_t$, which describes the contribution of the current input, is

$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c),$$

where $W_c$ is a weight matrix and $b_c$ is a bias term;

the current state $c_t$ is the previous cell state $c_{t-1}$ multiplied element-wise by the forget gate $f_t$, plus the candidate cell state $\tilde{c}_t$ multiplied element-wise by the input gate $i_t$:

$$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t;$$

the forget gate thus controls the cell state so that early state information can be retained, while the input gate controls how much of the current input enters the memory;

the third gate, the output gate $o_t$, controls the effect of the long-term state on the current output:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o),$$

where $W_o$ is a weight matrix and $b_o$ is a bias term;

the output gate and the cell state together determine the final output $h_t$ of the LSTM:

$$h_t = o_t \circ \tanh(c_t).$$
7. The method for extracting paper image data features of a test document as claimed in claim 1, wherein the step SS6 specifically comprises: framing the detected text area and adding an editable window so that recognition errors can be corrected.
CN202210725519.0A 2022-06-24 2022-06-24 Method for extracting paper image data features of test document Pending CN115546801A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210725519.0A CN115546801A (en) 2022-06-24 2022-06-24 Method for extracting paper image data features of test document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210725519.0A CN115546801A (en) 2022-06-24 2022-06-24 Method for extracting paper image data features of test document

Publications (1)

Publication Number Publication Date
CN115546801A true CN115546801A (en) 2022-12-30

Family

ID=84724021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210725519.0A Pending CN115546801A (en) 2022-06-24 2022-06-24 Method for extracting paper image data features of test document

Country Status (1)

Country Link
CN (1) CN115546801A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116434266A (en) * 2023-06-14 2023-07-14 邹城市人民医院 Automatic extraction and analysis method for data information of medical examination list
CN116434266B (en) * 2023-06-14 2023-08-18 邹城市人民医院 Automatic extraction and analysis method for data information of medical examination list


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination