CN113435240A - End-to-end table detection and structure identification method and system

Info

Publication number
CN113435240A
Authority
CN
China
Prior art keywords
image
node
table area
features
wired
Prior art date
Legal status
Granted
Application number
CN202110396302.5A
Other languages
Chinese (zh)
Other versions
CN113435240B (en)
Inventor
周勃宇
王勇
朱军民
Current Assignee
Beijing Yidao Boshi Technology Co ltd
Original Assignee
Beijing Yidao Boshi Technology Co ltd
Priority date
Application filed by Beijing Yidao Boshi Technology Co ltd filed Critical Beijing Yidao Boshi Technology Co ltd
Priority to CN202110396302.5A
Publication of CN113435240A
Application granted
Publication of CN113435240B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
        • G06F18/00 Pattern recognition
        • G06F18/20 Analysing
        • G06F18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
        • G06N3/00 Computing arrangements based on biological models
        • G06N3/02 Neural networks
        • G06N3/04 Architecture, e.g. interconnection topology
        • G06N3/045 Combinations of networks
        • G06N3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
        • G06T3/00 Geometric image transformations in the plane of the image
        • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
        • G06T3/4007 Scaling based on interpolation, e.g. bilinear interpolation
        • G06T3/4053 Scaling based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
        • G06T5/00 Image enhancement or restoration
        • G06T5/20 Image enhancement or restoration using local operators
        • G06T5/30 Erosion or dilatation, e.g. thinning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an end-to-end table detection and structure identification method and system, and relates to the field of computer vision. The method comprises the following steps: stretching the original image in the vertical direction, then performing aspect-ratio-preserving size normalization and zero-padding of the borders to form a preprocessed image; determining the table area in the preprocessed image using an encoder-decoder model as the main structure, and classifying the table area as a wired table image or a wireless table image; separating a corrected table area image containing only the table area from the preprocessed image based on the determined table area; and identifying the table structure with different methods according to whether the table area image is classified as a wired table image or a wireless table image. The invention adopts different structure recognition methods for different types of tables, and fully combines the advantages of convolutional neural network image segmentation, graph convolutional neural networks, and traditional rule-based analysis to improve the robustness and generality of the algorithm.

Description

End-to-end table detection and structure identification method and system
Technical Field
The invention relates to the field of computer vision, and in particular to an end-to-end table detection and structure identification method and system.
Background
In practice, tables are widely used to carry key information in objects such as PDFs, scanned documents, and photographs. Table structure recognition is an important prerequisite for many downstream tasks, such as document analysis, information extraction, and visualization. Automatic table recognition generally includes two major steps: table detection, whose purpose is to locate the table areas in a picture, and table structure recognition, which identifies the internal structure of the table in each area to obtain the final structured data. Manually extracting table contents consumes a great deal of labor and time; by contrast, an automated approach greatly improves efficiency.
In practice, tables come in a large number of different styles, formats, and internal structures, so a uniform recognition method is often difficult to achieve. Conventional table recognition methods typically rely on hand-designed features (e.g., row and column separation lines, blank areas, cell data types) and heuristic rules. Table detection usually adopts a bottom-up strategy, such as locating the row and column positions in the table from the explicit text alignment relationships, and then fusing all row and column information to compute the table area. The variability of table styles and the complexity of internal structures make row and column detection very difficult, which in turn degrades the overall detection result. Table structure recognition typically relies on explicit separation lines in the table and the relative positions of those separation lines and the text instances. Such methods perform well on wired tables, but cannot handle wireless tables whose separation lines are partially or completely missing.
In recent years, deep learning has driven the rapid development of computer vision and has also been applied to the field of table recognition. In summary, deep learning table recognition methods generally have two advantages over conventional methods. First, a deep learning method takes an image as input and can in principle be applied to any recognition object convertible into an image, such as a PDF or a scanned document, giving it the advantage of a unified approach. Second, thanks to powerful automatic feature encoding and unified end-to-end trainability, deep learning clearly outperforms conventional methods dominated by hand-designed features and heuristic rules.
Therefore, an integrated pipeline from table detection to table structure identification that builds on the advantages of deep learning has good application prospects.
Disclosure of Invention
To achieve the above object, the present invention provides a structure identification method that integrates table detection and can efficiently extract the internal structure information of a table from an image. The image segmentation technology used in this scheme accurately computes the table edges through pixel-level prediction and classifies each table as either a wired table or a wireless table. In subsequent steps the scheme adopts different structure recognition methods for the different table types, fully combining the advantages of a convolutional neural network image segmentation algorithm, a graph convolutional neural network algorithm, and a traditional rule-based analysis method to improve the robustness and generality of the algorithm.
Specifically, the method first uses a convolutional neural network to detect the table area. For a detected wired table, a convolutional neural network detects the table lines, and post-processing rules complete the identification of the table structure; for a wireless table, a graph convolutional neural network predicts the relationships between cells, rows, and columns to complete the structure identification.
According to a first aspect of the present invention, there is provided an end-to-end table detection and structure identification method, wherein an input original image contains a table, the method comprising the steps of:
Step 1: an image preprocessing step of stretching the original image in the vertical direction, then performing aspect-ratio-preserving size normalization and zero-padding of the borders to form a preprocessed image;
Step 2: a table region prediction step of determining a table region in the preprocessed image using an encoder-decoder model as the main structure, and classifying the table region as a wired table image or a wireless table image;
Step 3: a table image correction step of separating a corrected table area image containing only the table area from the preprocessed image based on the determined table area;
Step 4: a table structure identification step of identifying the table structure of the table area image in different ways according to its classification as a wired table image or a wireless table image.
Further, in step 2, the encoding part of the encoder-decoder model downsamples a low-resolution representation from a first high-resolution representation by means of convolution; the decoding part upsamples a second high-resolution representation from the low-resolution representation by means of transposed convolution or interpolation.
Further, the encoding part operates as follows:
the method comprises the steps of generating Multi-Resolution representations by adopting a mechanism of parallel connection of Multi-Resolution sub-networks in a High-Resolution network (HRNet), introducing a Multi-Resolution Fusion Module to realize feature information exchange and Fusion among the Multi-Resolution representations, and finally outputting first feature maps with multiple scales.
Further, the decoding section operates as follows:
firstly, an Atrous Spatial Pyramid Pooling (ASPP) module performs parallel atrous (dilated) convolution sampling at different sampling rates on the smallest feature map among the first feature maps, and then the spatial dimensions of the other first feature maps are each doubled by transposed convolution, forming a plurality of second feature maps equal in number to the first feature maps;
splicing the second feature map with a first feature map with the same size from an encoding part, and finally performing convolution to generate two Mask (Mask) predicted images with the same size as the size of the preprocessed image;
thereby determining a table area and distinguishing between the wired table image and the wireless table image.
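As a hedged illustration, the decoding operations above (an atrous convolution on the smallest feature map, transposed-convolution upsampling that doubles the spatial size, concatenation with the same-sized encoder feature map, and a final convolution emitting two mask channels) might be sketched in PyTorch as follows. The channel counts and the reduction of ASPP to a single atrous branch are illustrative assumptions, not the patented implementation:

```python
import torch
import torch.nn as nn

class DecoderHead(nn.Module):
    """Sketch of one scale step of the decoding path: atrous convolution,
    x2 transposed-convolution upsampling, concatenation with the encoder
    skip feature, and a 1x1 convolution producing two mask channels
    (wired / wireless confidence). Channel widths are assumptions."""
    def __init__(self, c_small=64, c_skip=32):
        super().__init__()
        # Single atrous branch standing in for the full ASPP module.
        self.aspp = nn.Conv2d(c_small, c_small, 3, padding=6, dilation=6)
        self.up = nn.ConvTranspose2d(c_small, c_skip, 2, stride=2)  # doubles H and W
        self.head = nn.Conv2d(c_skip * 2, 2, 1)                     # two mask channels

    def forward(self, x_small, x_skip):
        x = self.up(self.aspp(x_small))        # expand spatial dimensions by two
        x = torch.cat([x, x_skip], dim=1)      # splice with same-sized encoder map
        return torch.sigmoid(self.head(x))     # per-pixel confidences in [0, 1]
```

A full decoder would repeat this step once per encoder scale before the final prediction.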
Further, the step 3 specifically includes:
step 31: calculating the contour of the periphery of the table by using a Canny edge detection operator (Canny operator) according to the mask predicted image;
step 32: detecting all straight lines in the contour by utilizing a Hough transform operator and merging part of the straight lines meeting merging conditions;
step 33: the exact table positions are calculated from the positions of all the straight lines, thereby separating the corrected table area image containing only the table area.
Further, in step 32, the merging conditions are:
first, determine whether the two line segments are parallel. If they are, compute their perpendicular distance; if this distance exceeds a threshold, the segments cannot be merged. If they are not parallel, compute the difference of their slopes; if this difference exceeds a threshold, the segments cannot be merged.
If neither condition rules out merging, next determine whether the two segments overlap along a projection direction. If they overlap, compute the perpendicular distance from each endpoint of one segment to the other segment; the segments can be merged when the minimum of these four distances is below a threshold. If they do not overlap, compute the distances between the endpoints of the two segments; the segments can be merged when the minimum of these four distances is below a threshold.
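One plausible reading of the merging conditions above, sketched in Python. The direction test, threshold values, and overlap test are illustrative assumptions; the patent's exact numeric thresholds are not specified:

```python
import math

def can_merge(s1, s2, dist_thresh=5.0, angle_thresh=0.1):
    """Decide whether two segments ((x1, y1), (x2, y2)) may be merged,
    following the two-stage test described in the text. Thresholds are
    illustrative assumptions."""
    (x1, y1), (x2, y2) = s1
    (x3, y3), (x4, y4) = s2
    # Direction difference, folded into [0, pi/2]: a proxy for the slope test.
    d = abs(math.atan2(y2 - y1, x2 - x1) - math.atan2(y4 - y3, x4 - x3)) % math.pi
    if min(d, math.pi - d) > angle_thresh:
        return False  # directions differ too much: cannot merge

    def point_to_line(p, a, b):
        # Perpendicular distance from point p to the infinite line through a, b.
        (px, py), (ax, ay), (bx, by) = p, a, b
        num = abs((by - ay) * px - (bx - ax) * py + bx * ay - by * ax)
        return num / math.hypot(bx - ax, by - ay)

    # Overlap test along the dominant projection axis of s1.
    if abs(x2 - x1) >= abs(y2 - y1):
        lo1, hi1 = sorted((x1, x2)); lo2, hi2 = sorted((x3, x4))
    else:
        lo1, hi1 = sorted((y1, y2)); lo2, hi2 = sorted((y3, y4))

    if hi1 >= lo2 and hi2 >= lo1:   # projections overlap: use perpendicular distances
        dists = [point_to_line(p, *s2) for p in s1] + \
                [point_to_line(p, *s1) for p in s2]
    else:                           # no overlap: use endpoint-to-endpoint distances
        dists = [math.dist(p, q) for p in s1 for q in s2]
    return min(dists) < dist_thresh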
Further, in step 4, when the table area image is a wired table image, the identification specifically includes:
computing the contour map of the explicit separation lines from the mask prediction image of the separation lines, using the Canny edge detection operator;
extracting the contour skeleton map of the separation lines using a boundary erosion algorithm;
calculating all straight lines from the outline skeleton graph by using a Hough transform algorithm and fusing part of straight lines meeting the merging condition;
the positions of the table cells are obtained by calculating the positions of the intersection points of all the transverse lines and the vertical lines;
extracting the content and the position of a text example in the table;
and calculating and outputting table structure information according to the relative positions of the table cells and the text examples.
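The step of deriving cell positions from line intersections can be illustrated for the simplest case, a fully ruled grid, where each merged horizontal line spans the table width and each vertical line its height. This is a hedged sketch only; real tables with merged cells require the partial-line handling described above:

```python
import itertools

def cells_from_lines(h_lines, v_lines):
    """Recover cell boxes from a fully ruled grid: horizontal lines given
    as y-coordinates, vertical lines as x-coordinates. Adjacent line pairs
    bound one cell; their crossings are the intersection points."""
    ys, xs = sorted(h_lines), sorted(v_lines)
    cells = []
    for (y0, y1), (x0, x1) in itertools.product(zip(ys, ys[1:]), zip(xs, xs[1:])):
        cells.append((x0, y0, x1, y1))  # (left, top, right, bottom)
    return cells
```

Text instances would then be assigned to cells by comparing their positions against these boxes.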
Further, in step 4, when the table area image is a wireless table image, the identification specifically includes:
extracting node features by taking each text instance as a node, wherein the node features are formed by jointly splicing the position features, the boundary box background features, the row background features and the column background features of each text instance;
aiming at a certain node a, selecting all nodes in the current feature space of the node a, calculating the similarity, and selecting a plurality of peripheral nearest neighbor nodes;
concatenating the node feature A of node a with the similarity differences between A and the node features of each nearest neighbor node, inputting the result into a trained graph convolutional neural network, and outputting the updated node feature A';
repeating the above operations to obtain updated node characteristics of all nodes in the table area image;
and respectively determining the structural relationship between the node and the row, column and cell of a plurality of nearest neighbor nodes by using the updated node characteristics through the three multilayer perceptron networks, thereby determining and outputting table structure information.
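The per-node update described above (nearest neighbors in feature space, concatenation of the node feature with its differences to neighbor features, then a learned mapping) resembles an edge-convolution update and might be sketched as follows. The feature width, the value of k, and the use of a single linear layer in place of the trained graph convolutional network are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class NodeUpdate(nn.Module):
    """Sketch of one node-feature update: find the k nearest neighbours of
    each text-instance node in feature space, concatenate the node feature
    with its differences to those neighbours, map through a learned layer,
    and aggregate over the neighbours."""
    def __init__(self, dim=64, k=4):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(dim * 2, dim)

    def forward(self, feats):                    # feats: (N, dim), one row per node
        dists = torch.cdist(feats, feats)        # pairwise feature-space distances
        idx = dists.topk(self.k + 1, largest=False).indices[:, 1:]  # drop self
        nbrs = feats[idx]                        # (N, k, dim) neighbour features
        diff = nbrs - feats.unsqueeze(1)         # differences to each neighbour
        h = torch.cat([feats.unsqueeze(1).expand_as(diff), diff], dim=-1)
        return self.fc(h).max(dim=1).values      # aggregate over the k neighbours
```

The updated features would then feed the three multilayer perceptrons that classify row, column, and cell relationships between node pairs.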
Further, the number of the nearest neighbor nodes is preferably 10 to 15.
Further, the location features of the text instance consist of the coordinates of the upper left and lower right corners of the bounding box.
Further, the bounding box background feature, the row background feature and the column background feature of the text example are extracted from the feature map by a Region of Interest (ROI) Pooling image feature extraction method.
Here, a text instance refers to a word, sentence, or paragraph composed of several connected characters.
According to a second aspect of the present invention there is provided an end-to-end table detection and structure identification apparatus, characterised in that the apparatus operates in accordance with the method of any of the preceding aspects, the apparatus comprising:
the image preprocessing unit is used for stretching the original image in the vertical direction, carrying out size normalization with unchanged length-width ratio and performing boundary 0 supplementation to form a preprocessed image;
a table region prediction unit for determining a table region in the preprocessed image with an Encoder-Decoder (Encoder-Decoder) model as a main structure and classifying the table region into a wired table image and a wireless table image;
a table image correction unit for separating a corrected table area image containing only the table area from the preprocessed image based on the determined table area; and
a table structure identification unit for identifying the table structure of the table area image in different ways according to its classification as a wired table image or a wireless table image.
According to a third aspect of the present invention there is provided an end-to-end table detection and structure identification system, the system comprising: a processor and a memory for storing executable instructions; wherein the processor is configured to execute the executable instructions to perform an end-to-end table detection and structure identification method as claimed in any of the above aspects.
According to a fourth aspect of the present invention, there is provided a computer-readable storage medium, characterized in that a computer program is stored thereon, which, when executed by a processor, implements the end-to-end table detection and structure identification method according to any of the above aspects.
The invention has the beneficial effects that:
1. The invention automatically selects different structure recognition methods according to the table detection result. This selection mechanism combines the advantages of traditional rule-based algorithms and deep learning algorithms, improving the robustness and generality of the algorithm.
2. The table detection method based on image segmentation computes table edges more accurately, especially when tilted tables are present in the image. The pixel-level prediction also eliminates, as far as possible, the interference of non-table content areas with subsequent table structure identification.
3. The projection transformation before the structure identification step helps to obtain a table with neatly arranged cells, reducing the difficulty of the subsequent structure identification work.
4. The graph convolutional neural network fuses, by way of graph convolution, a node feature composed of the background feature of a text instance, the image features of the row and column in which it lies, and its position feature, extracting global and local features in the graph structure more efficiently. The updated node features help predict the structural relationships between text instances more accurately, especially when the table contains merged cells.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is an exemplary diagram of a wired table and a wireless table according to an embodiment of the present invention;
FIG. 2 is an algorithmic flow diagram of an end-to-end table detection and structure identification method according to an embodiment of the invention;
FIG. 3 is an algorithm structure diagram of an end-to-end table detection and structure identification method according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an example of image pre-processing transformation according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a variation in feature map size in a Decoder according to an embodiment of the present invention;
FIG. 6 is an exemplary diagram of a table image segmentation result according to an embodiment of the invention;
FIG. 7 is a block diagram of a table position optimization algorithm according to an embodiment of the present invention;
FIG. 8 is an exemplary diagram of a table image correction result according to an embodiment of the present invention;
FIG. 9 is a block diagram of a table line extraction post-processing algorithm according to an embodiment of the present invention;
FIG. 10 is a diagram illustrating a division result of a division line of a table with lines according to an embodiment of the present invention;
FIG. 11 is a schematic plane diagram of node feature extraction according to an embodiment of the present invention;
FIG. 12 is a schematic diagram illustrating a node visual feature encoding according to an embodiment of the present invention;
FIG. 13 is a diagram illustrating node feature updates in graph convolution according to an embodiment of the present invention;
FIG. 14 is a block diagram of a table structure recognition algorithm (graph convolution neural network) according to an embodiment of the present invention;
fig. 15 is a diagram illustrating an example of a table structure recognition result according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terms "first," "second," and the like in the description and in the claims of the present disclosure are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
"A plurality" means two or more.
The term "and/or" as used in this disclosure merely describes an association between objects, indicating that three relationships may exist. For example, "A and/or B" may represent: A alone, A and B together, or B alone.
The invention provides an efficient integrated table structure identification method. To address the problems of tilted tables and the diversity of table structures, the team creatively adopts an image segmentation technique for table detection and a mechanism combining rules with deep learning for structure identification, greatly improving the robustness and generality of the algorithm.
To address the difficult problem of table recognition, the invention considers the characteristics of table objects, such as their many styles, many layouts, and complex internal structures. Based on a unified end-to-end solution, it fully combines the advantages of convolutional neural network image segmentation, graph convolutional neural networks, and traditional rule-based analysis to realize an integrated pipeline from table detection to table structure identification. The approach is data-driven, unified, and independent of any specific table style, and achieves good results on a wide variety of tables. The processing flows for wired and wireless tables are shown in figs. 7, 9, and 14.
Examples
The first step is as follows: image pre-processing
This step performs a series of preprocessing operations on the input image, which contains one or more tables. Since the rows in most tables are densely packed with small line spacing, the design first applies a vertical stretching transformation at this stage to increase the pixel distance between rows and improve their separability. The subsequent preprocessing operations include aspect-ratio-preserving size normalization and zero-padding of the borders, so that the image size meets the requirements of the neural network while global and local feature information is preserved as much as possible. During training, necessary data augmentation, such as affine transformations (rotation, shear, scale, etc.) and color distortion, is applied in the preprocessing stage so that the distribution of training samples comes closer to the underlying real sample distribution, alleviating possible data scarcity and improving the robustness and invariance of the learned model. The invention also introduces a dilation transformation as a form of data augmentation: the input image is first converted to a binary image, and then all pixels are dilated with a 2 x 2 kernel operator, which expands the black pixel regions of the binary image. The binary images generated by dilation not only enlarge the sample set but also imitate blurred black-and-white table images, improving the robustness of the model. In the prediction stage, the algorithm only normalizes the image size.
The second step is that: table area prediction
Compared with image segmentation methods built on top of object detection results, such as Mask R-CNN, this method does not need to segment the image on the basis of a detection result; it avoids the influence of the detected object's minimum bounding rectangle and achieves better edge precision.
In this step the algorithm adopts an image segmentation model with an encoder-decoder main structure, in which the encoder downsamples a low-resolution representation from the high-resolution representation by convolution, and the decoder upsamples a high-resolution representation from the low-resolution representation by transposed convolution or interpolation. The design innovatively adopts the HRNet model as the encoder: the mechanism of parallel multi-resolution sub-networks in HRNet generates multi-resolution representations, so that a semantically rich high-resolution representation is maintained throughout and the information loss caused by downsampling is avoided. HRNet starts with a high-resolution sub-network as the first stage, gradually adds sub-networks from high to low resolution in later stages, and connects the multi-resolution sub-networks in parallel. A Multi-Resolution Fusion Module is also introduced to exchange and fuse feature information between the multi-resolution representations, yielding high-resolution representations that are semantically rich and spatially precise. In the fusion module, the algorithm uses a 3 x 3 convolution with stride 2 to derive low-resolution representations from high-resolution ones, and bilinear interpolation to recover high-resolution representations from low-resolution ones. The encoder ultimately generates feature maps at four scales, with spatial dimensions 1/2, 1/4, 1/8, and 1/16 of the original size.
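The multi-resolution fusion in the encoder might be sketched for two branches as follows: the high-resolution branch is reduced with a 3 x 3 stride-2 convolution and the low-resolution branch is restored with bilinear interpolation, after which each branch receives the sum of both. The channel counts and the 1 x 1 channel-matching convolution are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionFusion(nn.Module):
    """Two-branch sketch of an HRNet-style fusion module: strided 3x3
    convolution to go high -> low resolution, bilinear interpolation to go
    low -> high, then element-wise sums so each branch sees both scales."""
    def __init__(self, c_hi=32, c_lo=64):
        super().__init__()
        self.down = nn.Conv2d(c_hi, c_lo, 3, stride=2, padding=1)  # hi -> lo
        self.match = nn.Conv2d(c_lo, c_hi, 1)  # match channels before upsampling

    def forward(self, x_hi, x_lo):
        hi2lo = self.down(x_hi)
        lo2hi = F.interpolate(self.match(x_lo), size=x_hi.shape[2:],
                              mode="bilinear", align_corners=False)
        return x_hi + lo2hi, x_lo + hi2lo
```

A full HRNet stage would apply this exchange across all parallel resolution branches, not just two.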
The decoder first applies an Atrous Spatial Pyramid Pooling (ASPP) module to the smallest feature map of the previous stage, sampling it with parallel atrous convolutions at different rates to help the model capture feature information at more scales; the kernel sizes of the convolution operators are 1, 3, and 3, the dilation rates are 1, 6, and 12, and padding keeps the output at the original input size. Next, the decoder doubles the spatial dimension of the small feature map step by step via transposed convolution and concatenates it with the same-sized feature map from the encoder. The process is illustrated in fig. 5, where s2, s4, s8, and s16 denote the feature maps of 1/2, 1/4, 1/8, and 1/16 of the original size generated by the encoder.
Finally, the decoder uses a 1 x 1 convolution to generate a mask image with the same spatial size as the original image and a depth of 2, realizing pixel-level prediction. Because a pixel may belong to a wired table area, a wireless table area, or a non-table area, the segmentation model must output two mask prediction images, classifying wired and wireless tables while accurately computing the table area. The value at each pixel position in the mask images lies in the range 0 to 1, and the pixel values of the two mask images represent the confidence that the current pixel belongs to a wired table or a wireless table, respectively.
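Turning the two confidence maps into a per-pixel three-way decision might look roughly like the following sketch, where 0 denotes non-table, 1 wired table, and 2 wireless table. The 0.5 confidence threshold is an illustrative assumption:

```python
import numpy as np

def classify_pixels(mask_wired, mask_wireless, thresh=0.5):
    """Label each pixel from the two mask confidence maps: a pixel is a
    table pixel if either confidence clears the threshold, and its type
    follows the larger of the two confidences."""
    labels = np.zeros(mask_wired.shape, dtype=np.uint8)
    is_table = np.maximum(mask_wired, mask_wireless) >= thresh
    labels[is_table & (mask_wired >= mask_wireless)] = 1  # wired table
    labels[is_table & (mask_wired < mask_wireless)] = 2   # wireless table
    return labels
```

The resulting label map feeds the automatic selection of the structure recognition method in step 4.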
The classification result output by this step will then help the model to automatically select the corresponding structure recognition method.
The third step: table image correction
The first aim of this step is to compute the complete table area by edge-fitting the Mask prediction image obtained in the previous step, and then to separate the table area from the original image by projective transformation to form a new picture.
This step first computes the outline around the table from the Mask image, then detects all straight lines in the outline with a Hough transform operator and merges the lines that satisfy the merging condition, and finally computes the precise position of the table from the positions of all lines. The algorithm structure of this step is shown in fig. 7.
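The merging condition itself is not spelled out in the text; a plausible minimal sketch merges Hough segments that are nearly collinear and close together, where the angle and gap thresholds are assumptions:

```python
# Greedy merging of nearly collinear, nearby line segments (x1, y1, x2, y2),
# a simplified stand-in for the patent's unspecified merging condition.
import math

def _angle(s):
    """Orientation of a segment in [0, pi)."""
    return math.atan2(s[3] - s[1], s[2] - s[0]) % math.pi

def merge_lines(segments, angle_tol=math.radians(3), gap_tol=10.0):
    merged = []
    for seg in segments:
        for i, m in enumerate(merged):
            da = abs(_angle(seg) - _angle(m))
            da = min(da, math.pi - da)                    # wrap-around angle
            gap = math.hypot(seg[0] - m[2], seg[1] - m[3])
            if da < angle_tol and gap < gap_tol:
                # extend the kept line to span both segments
                merged[i] = (m[0], m[1], seg[2], seg[3])
                break
        else:
            merged.append(seg)
    return merged

# two collinear fragments of one table border collapse into a single line
print(merge_lines([(0, 0, 10, 0), (12, 0, 20, 0)]))
```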
The projective transformation in this step ensures, in most cases, that the new picture contains only table content and that most cells within the table are well aligned. It thus eliminates interference from non-table content in the original image, reduces the difficulty of the recognition task, and further improves the accuracy of table recognition.
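The projection can be sketched with a direct linear transform: given the four fitted table corners, solve for the 3 × 3 homography that maps them to an axis-aligned rectangle. A minimal NumPy sketch, where the corner coordinates are illustrative:

```python
# Solve the homography mapping 4 source points to 4 destination points,
# then warp individual points with it (a sketch of projective rectification).
import numpy as np

def perspective_matrix(src, dst):
    """8-equation DLT solve for the 3x3 perspective transform."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(H, p):
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]                      # divide out the projective scale

# illustrative skewed table corners -> upright 100x50 rectangle
src = [(3, 5), (105, 2), (110, 57), (1, 52)]
dst = [(0, 0), (100, 0), (100, 50), (0, 50)]
H = perspective_matrix(src, dst)
```

In practice the same matrix would be handed to an image-warping routine to resample the whole table region into the new picture.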
The fourth step: table structure identification
The step uses the table classification result obtained from the second step, and adopts a targeted structure identification mode for the table area image obtained from the third step.
For wired tables, the algorithm adopts an explicit separation-line detection method. It first predicts the positions of the explicit table separation lines with an image segmentation model and extracts the content and position of each text instance in the table with an OCR engine, then computes the position of each text instance within the table structure with a post-processing algorithm. The detailed structure of the post-processing algorithm is shown in fig. 9. The post-processing algorithm first computes the outline of the explicit separation lines from the separation-line Mask image, then extracts a skeleton map of the separation-line outline with a boundary erosion algorithm, next detects all straight lines in the skeleton map with a Hough transform and merges those that satisfy the merging condition, derives the cell positions from the intersections of all horizontal and vertical lines, and finally computes the table structure information from the relative positions of the table cells and the text instances.
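The cell-derivation step can be sketched as follows, assuming the merged lines have already been reduced to sorted horizontal y-positions and vertical x-positions (a simplification of the general intersection computation, and the center-point assignment rule is an assumption):

```python
# Derive cell boxes from separator-line positions and assign text instances
# to cells by their center point (a simplified sketch of the post-processing).
def cells_from_lines(h_ys, v_xs):
    """Cells as (x1, y1, x2, y2), row-major, from sorted line positions."""
    h_ys, v_xs = sorted(h_ys), sorted(v_xs)
    cells = []
    for r in range(len(h_ys) - 1):
        for c in range(len(v_xs) - 1):
            cells.append((v_xs[c], h_ys[r], v_xs[c + 1], h_ys[r + 1]))
    return cells

def cell_of_text(cells, box):
    """Index of the cell containing the text box's center, else None."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    for i, (x1, y1, x2, y2) in enumerate(cells):
        if x1 <= cx <= x2 and y1 <= cy <= y2:
            return i
    return None

grid = cells_from_lines([0, 10, 20], [0, 50])   # 2 rows x 1 column
```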
For wireless tables, on the other hand, the algorithm employs a graph convolutional neural network. In the wireless-table branch, ResNet50 serves as the feature extractor for the image features of the input table, and an OCR engine extracts the content and position of each text instance in the table. Each node in the graph convolutional neural network corresponds to one text instance, and the node features are formed by concatenating the text instance's position features, bounding-box background features, row background features and column background features. The position features of a text instance consist of the coordinates of the upper-left and lower-right corners of its bounding box. The algorithm uses RoIPooling-style image feature extraction to obtain the corresponding bounding-box, row and column background features from the positions of the text instance on the feature map, as shown in FIG. 11.
RoIPooling extracts a fixed-size feature map at the position corresponding to a text instance; the algorithm then averages this feature map globally along the width and height dimensions, so that only the features along the depth dimension are retained.
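A minimal PyTorch sketch of this reduction, using adaptive pooling in place of a full RoIPooling implementation; the channel count, pooled size, and integer box coordinates are illustrative assumptions:

```python
# Crop a text instance's RoI from a CxHxW feature map, pool it to a fixed
# size, then average over height and width so only depth features remain.
import torch
import torch.nn.functional as F

def roi_depth_feature(fmap, box, pooled=7):
    x1, y1, x2, y2 = box                       # integer feature-map coords
    crop = fmap[:, y1:y2, x1:x2]               # (C, h, w) region of interest
    pooled_map = F.adaptive_max_pool2d(crop.unsqueeze(0), pooled)
    # global average over width and height -> one value per channel
    return pooled_map.mean(dim=(2, 3)).squeeze(0)

fmap = torch.randn(64, 32, 32)                 # illustrative backbone output
vec = roi_depth_feature(fmap, (2, 2, 10, 10))  # depth-only feature vector
```

The same routine would be called three times per text instance, with the bounding-box region, its full row strip, and its full column strip as the box.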
Unlike ordinary convolution, the graph convolutional neural network is not restricted to grid-structured data such as the fixed adjacency relations in an image; instead, it selects the nodes in the graph structure most similar to the current node, according to a feature-similarity metric, for feature extraction and updating. In a multi-layer graph convolutional network, the neighboring nodes of each node change continuously as the node features are updated, which enlarges the effective receptive field of the network to some extent.
In the invention, the algorithm uses the Euclidean distance as the node-feature similarity metric in the graph convolution operation. For each node it computes several nearest-neighbor nodes, then concatenates the node's features with the differences between its features and each neighbor's features, and feeds the result into a fully connected network for feature extraction, as shown in FIG. 13. Each graph convolution layer averages the extraction results of the current node and all its neighboring nodes as the updated node features. In the graph convolution operation adopted by this design, the node features represent global features of the graph structure, while the differences between node and neighbor features represent local features; stacking multiple graph convolution layers therefore helps the algorithm fully extract both global and local features of the graph structure. Finally, three multilayer perceptron networks (classifiers) use the updated node features to judge the row, column and cell structural relations between nodes, respectively.
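The described graph convolution layer can be sketched as follows in PyTorch; the feature sizes, the value of k, the ReLU nonlinearity, and the single shared linear layer are assumptions rather than details from the patent:

```python
# One kNN graph-convolution layer: Euclidean nearest neighbours, concatenate
# [node features, node - neighbour differences], shared MLP, average update.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KNNGraphConv(nn.Module):
    def __init__(self, c_in, c_out, k=4):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(2 * c_in, c_out)   # shared fully connected layer

    def forward(self, x):                      # x: (N, C) node features
        d = torch.cdist(x, x)                  # pairwise Euclidean distances
        # k+1 smallest distances per node; drop column 0 (the node itself)
        idx = d.topk(self.k + 1, largest=False).indices[:, 1:]
        neigh = x[idx]                         # (N, k, C) neighbour features
        center = x.unsqueeze(1).expand_as(neigh)
        # global part (node features) + local part (feature differences)
        h = self.fc(torch.cat([center, center - neigh], dim=-1))
        return F.relu(h).mean(dim=1)           # average over neighbours

layer = KNNGraphConv(8, 16, k=3)
out = layer(torch.randn(10, 8))                # 10 text-instance nodes
```

Because neighbors are recomputed from the current features at every layer, stacking such layers lets each node's effective neighborhood drift, as the text describes.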
Fig. 14 is a structural diagram of a table structure recognition algorithm based on a graph convolution neural network, and fig. 15 is an exemplary diagram of a table structure recognition result.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the above implementation method can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation method. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. An end-to-end table detection and structure identification method, wherein an input original image contains a table, the method is characterized by comprising the following steps:
step 1: an image preprocessing step of stretching the original image in the vertical direction, performing size normalization that keeps the aspect ratio unchanged, and zero-padding the boundary, to form a preprocessed image;
step 2: a table region prediction step of determining a table region in the preprocessed image by using an encoder-decoder model as a main structure, and classifying the table region into a wired table image and a wireless table image;
step 3: a table image correction step of separating a corrected table area image containing only the table area from the preprocessed image based on the determined table area;
step 4: a table structure recognition step of performing table structure recognition on the table area image in different modes according to its classification as a wired table image or a wireless table image.
2. The method of claim 1, wherein in step 2, the encoding portion of the encoder-decoder model convolves and downsamples the low-resolution representation from the first high-resolution representation; the decoding portion upsamples a second high resolution representation from the low resolution representation by means of a transposed convolution or interpolation.
3. The method of claim 2, wherein the encoding portion operates as follows:
the method comprises the steps of generating multi-resolution representations by adopting a mechanism of parallel connection of multi-resolution sub-networks in a high-resolution network, introducing a multi-resolution fusion module to realize feature information exchange and fusion among the multi-resolution representations, and finally outputting first feature maps with various scales.
4. The method of claim 3, wherein the decoding portion operates as follows:
firstly, an atrous spatial pyramid pooling module is adopted to perform parallel atrous-convolution sampling, at different sampling rates, on the smallest-size feature map among the first feature maps, and then the spatial sizes of the other first feature maps are each doubled by transposed convolution, forming a plurality of second feature maps equal in number to the first feature maps;
concatenating each second feature map with the same-sized first feature map from the coding part, and finally performing convolution to generate two mask prediction images with the same size as the preprocessed image;
thereby determining a table area and distinguishing between the wired table image and the wireless table image.
5. The method according to claim 1, wherein step 3 specifically comprises:
step 31: calculating the outline around the table by using a canny edge detection operator according to the mask predicted image;
step 32: detecting all straight lines in the contour by utilizing a Hough transform operator and merging part of the straight lines meeting merging conditions;
step 33: the exact table positions are calculated from the positions of all the straight lines, thereby separating the corrected table area image containing only the table area.
6. The method according to claim 1, wherein in step 4, for the table area image belonging to the wired table image, the method specifically includes:
calculating a contour map of the explicit separation line from the mask prediction image of the separation line by using a canny edge detection operator;
extracting a contour skeleton map of the separation line by using a boundary erosion method;
calculating all straight lines from the contour skeleton graph by using a Hough transform method and fusing part of straight lines meeting the merging condition;
the positions of the table cells are obtained by calculating the positions of the intersection points of all the transverse lines and the vertical lines;
extracting the content and the position of a text example in the table;
and calculating and outputting table structure information according to the relative positions of the table cells and the text examples.
7. The method according to claim 1, wherein in step 4, for the table area image belonging to the wireless table image, the method specifically includes:
extracting node features by taking each text instance as a node, wherein the node features are formed by concatenating the position features, bounding-box background features, row background features and column background features of each text instance;
for a certain node a, computing the similarity between node a and all other nodes in its current feature space, and selecting a plurality of nearest-neighbor nodes;
concatenating the node features A of node a with the differences between A and the node features of each nearest-neighbor node, inputting the result into a trained graph convolutional neural network, and outputting updated node features A';
repeating the above operations to obtain updated node characteristics of all nodes in the table area image;
determining, with three multilayer perceptron networks using the updated node features, the row, column and cell structural relations between the node and its nearest-neighbor nodes respectively, thereby determining and outputting table structure information.
8. The method of claim 7, wherein the location features of the text instance consist of coordinates of an upper left corner and a lower right corner of a bounding box.
9. The method of claim 7, wherein the bounding box background features, row background features, and column background features of the text instance are extracted from the feature map by extraction of region-of-interest pooled image features.
10. An end-to-end table detection and structure identification device, operating based on the method according to any one of claims 1 to 9, characterized by comprising the following components:
the image preprocessing unit is used for stretching the original image in the vertical direction, performing size normalization that keeps the aspect ratio unchanged, and zero-padding the boundary, to form a preprocessed image;
a table region prediction unit for determining a table region in the preprocessed image with an encoder-decoder model as a main structure, and classifying the table region into a wired table image and a wireless table image;
a table image correction unit for separating a corrected table area image containing only the table area from the preprocessed image based on the determined table area;
and the table structure identification unit is used for identifying the table structure of the table area image in different modes according to the classification into the wired table image and the wireless table image.
CN202110396302.5A 2021-04-13 2021-04-13 End-to-end form detection and structure identification method and system Active CN113435240B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110396302.5A CN113435240B (en) 2021-04-13 2021-04-13 End-to-end form detection and structure identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110396302.5A CN113435240B (en) 2021-04-13 2021-04-13 End-to-end form detection and structure identification method and system

Publications (2)

Publication Number Publication Date
CN113435240A true CN113435240A (en) 2021-09-24
CN113435240B CN113435240B (en) 2024-06-14

Family

ID=77753027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110396302.5A Active CN113435240B (en) 2021-04-13 2021-04-13 End-to-end form detection and structure identification method and system

Country Status (1)

Country Link
CN (1) CN113435240B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113936287A (en) * 2021-10-20 2022-01-14 平安国际智慧城市科技股份有限公司 Table detection method and device based on artificial intelligence, electronic equipment and medium
CN114565927A (en) * 2022-03-03 2022-05-31 上海恒生聚源数据服务有限公司 Table identification method and device, electronic equipment and storage medium
CN116092105A (en) * 2023-04-07 2023-05-09 北京中关村科金技术有限公司 Method and device for analyzing table structure
CN116257459A (en) * 2023-05-16 2023-06-13 北京城建智控科技股份有限公司 Form UI walk normalization detection method and device
CN116311301A (en) * 2023-02-17 2023-06-23 北京感易智能科技有限公司 Wireless form identification method and system
CN116311310A (en) * 2023-05-19 2023-06-23 之江实验室 Universal form identification method and device combining semantic segmentation and sequence prediction

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6018776A (en) * 1992-06-30 2000-01-25 Discovision Associates System for microprogrammable state machine in video parser clearing and resetting processing stages responsive to flush token generating by token generator responsive to received data
WO2008154611A2 (en) * 2007-06-11 2008-12-18 Honeywell International Inc. Optical reader system for extracting information in a digital image
CN105184265A (en) * 2015-09-14 2015-12-23 哈尔滨工业大学 Self-learning-based handwritten form numeric character string rapid recognition method
CN105426834A (en) * 2015-11-17 2016-03-23 中国传媒大学 Projection feature and structure feature based form image detection method
US20190050640A1 (en) * 2017-08-10 2019-02-14 Adobe Systems Incorporated Form structure extraction network
US20190050381A1 (en) * 2017-08-14 2019-02-14 Adobe Systems Incorporated Detecting the bounds of borderless tables in fixed-format structured documents using machine learning
US20190266394A1 (en) * 2018-02-26 2019-08-29 Abc Fintech Co., Ltd. Method and device for parsing table in document image
US20200151444A1 (en) * 2018-11-14 2020-05-14 Adobe Inc. Table Layout Determination Using A Machine Learning System
CN111753727A (en) * 2020-06-24 2020-10-09 北京百度网讯科技有限公司 Method, device, equipment and readable storage medium for extracting structured information
CN111783735A (en) * 2020-07-22 2020-10-16 欧冶云商股份有限公司 Steel document analytic system based on artificial intelligence
CN111814722A (en) * 2020-07-20 2020-10-23 电子科技大学 Method and device for identifying table in image, electronic equipment and storage medium
CN111860257A (en) * 2020-07-10 2020-10-30 上海交通大学 Table identification method and system fusing multiple text features and geometric information
CN111950453A (en) * 2020-08-12 2020-11-17 北京易道博识科技有限公司 Optional-shape text recognition method based on selective attention mechanism
WO2020254924A1 (en) * 2019-06-16 2020-12-24 Way2Vat Ltd. Systems and methods for document image analysis with cardinal graph convolutional networks
US20210056429A1 (en) * 2019-08-21 2021-02-25 Eygs Llp Apparatus and methods for converting lineless tables into lined tables using generative adversarial networks
US20210073326A1 (en) * 2019-09-06 2021-03-11 Wipro Limited System and method for extracting tabular data from a document
WO2021053687A1 (en) * 2019-09-18 2021-03-25 Tata Consultancy Services Limited Deep learning based table detection and associated data extraction from scanned image documents

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6018776A (en) * 1992-06-30 2000-01-25 Discovision Associates System for microprogrammable state machine in video parser clearing and resetting processing stages responsive to flush token generating by token generator responsive to received data
WO2008154611A2 (en) * 2007-06-11 2008-12-18 Honeywell International Inc. Optical reader system for extracting information in a digital image
CN105184265A (en) * 2015-09-14 2015-12-23 哈尔滨工业大学 Self-learning-based handwritten form numeric character string rapid recognition method
CN105426834A (en) * 2015-11-17 2016-03-23 中国传媒大学 Projection feature and structure feature based form image detection method
US20190050640A1 (en) * 2017-08-10 2019-02-14 Adobe Systems Incorporated Form structure extraction network
CN109389027A (en) * 2017-08-10 2019-02-26 奥多比公司 Form structure extracts network
US20190050381A1 (en) * 2017-08-14 2019-02-14 Adobe Systems Incorporated Detecting the bounds of borderless tables in fixed-format structured documents using machine learning
US20190266394A1 (en) * 2018-02-26 2019-08-29 Abc Fintech Co., Ltd. Method and device for parsing table in document image
US20200151444A1 (en) * 2018-11-14 2020-05-14 Adobe Inc. Table Layout Determination Using A Machine Learning System
WO2020254924A1 (en) * 2019-06-16 2020-12-24 Way2Vat Ltd. Systems and methods for document image analysis with cardinal graph convolutional networks
US20210056429A1 (en) * 2019-08-21 2021-02-25 Eygs Llp Apparatus and methods for converting lineless tables into lined tables using generative adversarial networks
US20210073326A1 (en) * 2019-09-06 2021-03-11 Wipro Limited System and method for extracting tabular data from a document
WO2021053687A1 (en) * 2019-09-18 2021-03-25 Tata Consultancy Services Limited Deep learning based table detection and associated data extraction from scanned image documents
CN111753727A (en) * 2020-06-24 2020-10-09 北京百度网讯科技有限公司 Method, device, equipment and readable storage medium for extracting structured information
CN111860257A (en) * 2020-07-10 2020-10-30 上海交通大学 Table identification method and system fusing multiple text features and geometric information
CN111814722A (en) * 2020-07-20 2020-10-23 电子科技大学 Method and device for identifying table in image, electronic equipment and storage medium
CN111783735A (en) * 2020-07-22 2020-10-16 欧冶云商股份有限公司 Steel document analytic system based on artificial intelligence
CN111950453A (en) * 2020-08-12 2020-11-17 北京易道博识科技有限公司 Optional-shape text recognition method based on selective attention mechanism

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DEVASHISH PRASAD, AYAN GADPAL, KSHITIJ KAPADNI, MANISH VISAVE, KAVITA SULTANPURE: "CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents", https://doi.org/10.48550/arXiv.2004.12629
SHOAIB AHMED SIDDIQUI; PERVAIZ IQBAL KHAN; ANDREAS DENGEL: "Rethinking Semantic Segmentation for Table Structure Recognition in Documents", 2019 International Conference on Document Analysis and Recognition (ICDAR)
BU FEIYU, LIU CHANGSONG, DING XIAOQING: "Discrimination of tables and figures in layout analysis", Computer Engineering and Applications, no. 12
YING ZILU, ZHAO YIHONG, XUAN CHEN, DENG WENBO: "Document image layout analysis with multi-feature fusion", Journal of Image and Graphics, pages 1

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113936287A (en) * 2021-10-20 2022-01-14 平安国际智慧城市科技股份有限公司 Table detection method and device based on artificial intelligence, electronic equipment and medium
CN114565927A (en) * 2022-03-03 2022-05-31 上海恒生聚源数据服务有限公司 Table identification method and device, electronic equipment and storage medium
CN116311301A (en) * 2023-02-17 2023-06-23 北京感易智能科技有限公司 Wireless form identification method and system
CN116311301B (en) * 2023-02-17 2024-06-07 北京感易智能科技有限公司 Wireless form identification method and system
CN116092105A (en) * 2023-04-07 2023-05-09 北京中关村科金技术有限公司 Method and device for analyzing table structure
CN116257459A (en) * 2023-05-16 2023-06-13 北京城建智控科技股份有限公司 Form UI walk normalization detection method and device
CN116257459B (en) * 2023-05-16 2023-07-28 北京城建智控科技股份有限公司 Form UI walk normalization detection method and device
CN116311310A (en) * 2023-05-19 2023-06-23 之江实验室 Universal form identification method and device combining semantic segmentation and sequence prediction

Also Published As

Publication number Publication date
CN113435240B (en) 2024-06-14

Similar Documents

Publication Publication Date Title
CN113435240B (en) End-to-end form detection and structure identification method and system
CN109948510B (en) Document image instance segmentation method and device
CN109740548B (en) Reimbursement bill image segmentation method and system
Cheung et al. An Arabic optical character recognition system using recognition-based segmentation
JP7246104B2 (en) License plate identification method based on text line identification
CN105574524B (en) Based on dialogue and divide the mirror cartoon image template recognition method and system that joint identifies
Dal Poz et al. Automated extraction of road network from medium-and high-resolution images
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
CN111553349B (en) Scene text positioning and identifying method based on full convolution network
CN115331245B (en) Table structure identification method based on image instance segmentation
CN115457565A (en) OCR character recognition method, electronic equipment and storage medium
CN113673541B (en) Image sample generation method for target detection and application
CN114529925A (en) Method for identifying table structure of whole line table
Kölsch et al. Recognizing challenging handwritten annotations with fully convolutional networks
CN110956088A (en) Method and system for positioning and segmenting overlapped text lines based on deep learning
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113158895A (en) Bill identification method and device, electronic equipment and storage medium
CN113033558A (en) Text detection method and device for natural scene and storage medium
CN115830359A (en) Workpiece identification and counting method based on target detection and template matching in complex scene
CN110634142B (en) Complex vehicle road image boundary optimization method
CN116740758A (en) Bird image recognition method and system for preventing misjudgment
CN113657225B (en) Target detection method
CN111832497B (en) Text detection post-processing method based on geometric features
CN113033559A (en) Text detection method and device based on target detection and storage medium
Lou et al. Generative shape models: Joint text recognition and segmentation with very little training data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant