CN113435240A - End-to-end table detection and structure identification method and system

Info

Publication number
CN113435240A
Authority
CN
China
Prior art keywords
image
node
table area
features
wired
Prior art date
Legal status
Granted
Application number
CN202110396302.5A
Other languages
Chinese (zh)
Other versions
CN113435240B (en)
Inventor
周勃宇
王勇
朱军民
Current Assignee
Beijing Yidao Boshi Technology Co ltd
Original Assignee
Beijing Yidao Boshi Technology Co ltd
Priority date
Application filed by Beijing Yidao Boshi Technology Co ltd filed Critical Beijing Yidao Boshi Technology Co ltd
Priority to CN202110396302.5A
Publication of CN113435240A
Application granted
Publication of CN113435240B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
        • G06F18/00 Pattern recognition
        • G06F18/20 Analysing
        • G06F18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
        • G06N3/00 Computing arrangements based on biological models
        • G06N3/02 Neural networks
        • G06N3/04 Architecture, e.g. interconnection topology
        • G06N3/045 Combinations of networks
        • G06N3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
        • G06T3/00 Geometric image transformations in the plane of the image
        • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
        • G06T3/4007 Scaling based on interpolation, e.g. bilinear interpolation
        • G06T3/4053 Scaling based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
        • G06T5/00 Image enhancement or restoration
        • G06T5/20 Image enhancement or restoration using local operators
        • G06T5/30 Erosion or dilatation, e.g. thinning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an end-to-end table detection and structure identification method and system, and relates to the field of computer vision. The method comprises the following steps: stretching the original image in the vertical direction, then performing aspect-ratio-preserving size normalization and zero-padding of the borders to form a preprocessed image; determining the table area in the preprocessed image using an encoder-decoder model as the main structure, and classifying the table area as a wired table image or a wireless table image; separating a corrected table area image containing only the table area from the preprocessed image based on the determined table area; and identifying the table structure with different methods according to whether the table area image is classified as a wired table image or a wireless table image. The invention adopts different structure recognition methods for different types of tables, and fully combines the advantages of convolutional neural network image segmentation, graph convolutional neural networks, and traditional rule-based analysis to improve the robustness and generality of the algorithm.

Description

End-to-end table detection and structure identification method and system
Technical Field
The invention relates to the field of computer vision, and in particular to an end-to-end table detection and structure identification method and system.
Background
In practice, tables are widely used to carry key information in objects such as PDFs, scanned documents, and photographs. Table structure recognition is an important prerequisite for many downstream tasks, such as document analysis, information extraction, and visualization. Automatic table recognition generally includes two major steps: table detection, whose purpose is to locate the table areas in a picture, and table structure recognition, which identifies the internal structure of the table in each area to obtain the final structured data. Manually extracting table contents consumes a great deal of labor and time; by contrast, an automated approach greatly improves efficiency.
In practice, tables come in a large number of different styles, formats, and internal structures, so a uniform recognition method is often difficult to achieve. Conventional table recognition methods typically rely on hand-designed features (e.g., row and column separation lines, blank areas, cell data types) and heuristic rules. Table detection usually adopts a bottom-up strategy, such as locating the row and column positions in the table from the explicit text alignment relationships, and then fusing all row and column information to compute the table area. The variability of table styles and the complexity of internal structures make row and column detection very difficult, which in turn degrades the overall detection result. Table structure recognition typically relies on explicit separation lines in the table and the relative positions of those separation lines and the text instances. Such methods perform well on wired tables, but cannot handle wireless tables whose separation lines are partially or completely missing.
In recent years, deep learning has driven the rapid development of computer vision and has also been applied to the field of table recognition. In summary, deep learning table recognition methods generally have two advantages over conventional methods. First, a deep learning method takes an image as input and can in principle be applied to any recognition object convertible into an image, such as a PDF or a scanned document, giving it the advantage of a unified approach. Second, thanks to powerful automatic feature encoding and unified end-to-end trainability, deep learning clearly outperforms conventional methods dominated by hand-designed features and heuristic rules.
Therefore, an integrated pipeline from table detection to table structure identification that builds on the advantages of deep learning has good application prospects.
Disclosure of Invention
To achieve the above object, the present invention provides a structure identification method that integrates table detection and can efficiently extract the internal structure information of a table from an image. The image segmentation technology used in this scheme accurately computes the table edges through pixel-level prediction and classifies each table as either a wired table or a wireless table. In subsequent steps the scheme adopts different structure recognition methods for the different table types, fully combining the advantages of a convolutional neural network image segmentation algorithm, a graph convolutional neural network algorithm, and a traditional rule-based analysis method to improve the robustness and generality of the algorithm.
Specifically, the method first uses a convolutional neural network to detect the table area. For a detected wired table, a convolutional neural network detects the table lines, and post-processing rules complete the identification of the table structure; for a wireless table, a graph convolutional neural network predicts the relationships between cells, rows, and columns to complete the structure identification.
According to a first aspect of the present invention, there is provided an end-to-end table detection and structure identification method, wherein an input original image contains a table, the method comprising the steps of:
Step 1: an image preprocessing step of stretching the original image in the vertical direction, then performing aspect-ratio-preserving size normalization and zero-padding of the borders to form a preprocessed image;
Step 2: a table region prediction step of determining a table region in the preprocessed image using an encoder-decoder model as the main structure, and classifying the table region as a wired table image or a wireless table image;
Step 3: a table image correction step of separating a corrected table area image containing only the table area from the preprocessed image based on the determined table area;
Step 4: a table structure identification step of identifying the table structure of the table area image in different ways according to its classification as a wired table image or a wireless table image.
Further, in step 2, the encoding part of the encoder-decoder model downsamples a low-resolution representation from a first high-resolution representation by means of convolution; the decoding part upsamples a second high-resolution representation from the low-resolution representation by means of transposed convolution or interpolation.
Further, the encoding part operates as follows:
the method comprises the steps of generating Multi-Resolution representations by adopting a mechanism of parallel connection of Multi-Resolution sub-networks in a High-Resolution network (HRNet), introducing a Multi-Resolution Fusion Module to realize feature information exchange and Fusion among the Multi-Resolution representations, and finally outputting first feature maps with multiple scales.
Further, the decoding section operates as follows:
firstly, an Atrous Spatial Pyramid Pooling (ASPP) module performs parallel atrous (dilated) convolution sampling at different sampling rates on the smallest feature map among the first feature maps, and then the spatial dimensions of the other first feature maps are each doubled by transposed convolution, forming a plurality of second feature maps equal in number to the first feature maps;
splicing the second feature map with a first feature map with the same size from an encoding part, and finally performing convolution to generate two Mask (Mask) predicted images with the same size as the size of the preprocessed image;
thereby determining a table area and distinguishing between the wired table image and the wireless table image.
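As a hedged illustration, the decoding operations above (an atrous convolution on the smallest feature map, transposed-convolution upsampling that doubles the spatial size, concatenation with the same-sized encoder feature map, and a final convolution emitting two mask channels) might be sketched in PyTorch as follows. The channel counts and the reduction of ASPP to a single atrous branch are illustrative assumptions, not the patented implementation:

```python
import torch
import torch.nn as nn

class DecoderHead(nn.Module):
    """Sketch of one scale step of the decoding path: atrous convolution,
    x2 transposed-convolution upsampling, concatenation with the encoder
    skip feature, and a 1x1 convolution producing two mask channels
    (wired / wireless confidence). Channel widths are assumptions."""
    def __init__(self, c_small=64, c_skip=32):
        super().__init__()
        # Single atrous branch standing in for the full ASPP module.
        self.aspp = nn.Conv2d(c_small, c_small, 3, padding=6, dilation=6)
        self.up = nn.ConvTranspose2d(c_small, c_skip, 2, stride=2)  # doubles H and W
        self.head = nn.Conv2d(c_skip * 2, 2, 1)                     # two mask channels

    def forward(self, x_small, x_skip):
        x = self.up(self.aspp(x_small))        # expand spatial dimensions by two
        x = torch.cat([x, x_skip], dim=1)      # splice with same-sized encoder map
        return torch.sigmoid(self.head(x))     # per-pixel confidences in [0, 1]
```

A full decoder would repeat this step once per encoder scale before the final prediction.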
Further, the step 3 specifically includes:
step 31: calculating the contour of the periphery of the table by using a Canny edge detection operator (Canny operator) according to the mask predicted image;
step 32: detecting all straight lines in the contour by utilizing a Hough transform operator and merging part of the straight lines meeting merging conditions;
step 33: the exact table positions are calculated from the positions of all the straight lines, thereby separating the corrected table area image containing only the table area.
Further, in step 32, the merging conditions are:
first, determine whether the two line segments are parallel. If they are, compute their perpendicular distance; if this distance exceeds a threshold, the segments cannot be merged. If they are not parallel, compute the difference of their slopes; if this difference exceeds a threshold, the segments cannot be merged.
If neither condition rules out merging, next determine whether the two segments overlap along a projection direction. If they overlap, compute the perpendicular distance from each endpoint of one segment to the other segment; the segments can be merged when the minimum of these four distances is below a threshold. If they do not overlap, compute the distances between the endpoints of the two segments; the segments can be merged when the minimum of these four distances is below a threshold.
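One plausible reading of the merging conditions above, sketched in Python. The direction test, threshold values, and overlap test are illustrative assumptions; the patent's exact numeric thresholds are not specified:

```python
import math

def can_merge(s1, s2, dist_thresh=5.0, angle_thresh=0.1):
    """Decide whether two segments ((x1, y1), (x2, y2)) may be merged,
    following the two-stage test described in the text. Thresholds are
    illustrative assumptions."""
    (x1, y1), (x2, y2) = s1
    (x3, y3), (x4, y4) = s2
    # Direction difference, folded into [0, pi/2]: a proxy for the slope test.
    d = abs(math.atan2(y2 - y1, x2 - x1) - math.atan2(y4 - y3, x4 - x3)) % math.pi
    if min(d, math.pi - d) > angle_thresh:
        return False  # directions differ too much: cannot merge

    def point_to_line(p, a, b):
        # Perpendicular distance from point p to the infinite line through a, b.
        (px, py), (ax, ay), (bx, by) = p, a, b
        num = abs((by - ay) * px - (bx - ax) * py + bx * ay - by * ax)
        return num / math.hypot(bx - ax, by - ay)

    # Overlap test along the dominant projection axis of s1.
    if abs(x2 - x1) >= abs(y2 - y1):
        lo1, hi1 = sorted((x1, x2)); lo2, hi2 = sorted((x3, x4))
    else:
        lo1, hi1 = sorted((y1, y2)); lo2, hi2 = sorted((y3, y4))

    if hi1 >= lo2 and hi2 >= lo1:   # projections overlap: use perpendicular distances
        dists = [point_to_line(p, *s2) for p in s1] + \
                [point_to_line(p, *s1) for p in s2]
    else:                           # no overlap: use endpoint-to-endpoint distances
        dists = [math.dist(p, q) for p in s1 for q in s2]
    return min(dists) < dist_thresh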
Further, in step 4, when the table area image is a wired table image, the identification specifically includes:
computing the contour map of the explicit separation lines from the mask prediction image of the separation lines, using the Canny edge detection operator;
extracting the contour skeleton map of the separation lines using a boundary erosion algorithm;
calculating all straight lines from the outline skeleton graph by using a Hough transform algorithm and fusing part of straight lines meeting the merging condition;
the positions of the table cells are obtained by calculating the positions of the intersection points of all the transverse lines and the vertical lines;
extracting the content and the position of a text example in the table;
and calculating and outputting table structure information according to the relative positions of the table cells and the text examples.
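The step of deriving cell positions from line intersections can be illustrated for the simplest case, a fully ruled grid, where each merged horizontal line spans the table width and each vertical line its height. This is a hedged sketch only; real tables with merged cells require the partial-line handling described above:

```python
import itertools

def cells_from_lines(h_lines, v_lines):
    """Recover cell boxes from a fully ruled grid: horizontal lines given
    as y-coordinates, vertical lines as x-coordinates. Adjacent line pairs
    bound one cell; their crossings are the intersection points."""
    ys, xs = sorted(h_lines), sorted(v_lines)
    cells = []
    for (y0, y1), (x0, x1) in itertools.product(zip(ys, ys[1:]), zip(xs, xs[1:])):
        cells.append((x0, y0, x1, y1))  # (left, top, right, bottom)
    return cells
```

Text instances would then be assigned to cells by comparing their positions against these boxes.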
Further, in step 4, when the table area image is a wireless table image, the identification specifically includes:
extracting node features by taking each text instance as a node, wherein the node features are formed by jointly splicing the position features, the boundary box background features, the row background features and the column background features of each text instance;
aiming at a certain node a, selecting all nodes in the current feature space of the node a, calculating the similarity, and selecting a plurality of peripheral nearest neighbor nodes;
concatenating the node feature A of node a with the similarity differences between A and the node features of each nearest neighbor node, inputting the result into a trained graph convolutional neural network, and outputting the updated node feature A';
repeating the above operations to obtain updated node characteristics of all nodes in the table area image;
and respectively determining the structural relationship between the node and the row, column and cell of a plurality of nearest neighbor nodes by using the updated node characteristics through the three multilayer perceptron networks, thereby determining and outputting table structure information.
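The per-node update described above (nearest neighbors in feature space, concatenation of the node feature with its differences to neighbor features, then a learned mapping) resembles an edge-convolution update and might be sketched as follows. The feature width, the value of k, and the use of a single linear layer in place of the trained graph convolutional network are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class NodeUpdate(nn.Module):
    """Sketch of one node-feature update: find the k nearest neighbours of
    each text-instance node in feature space, concatenate the node feature
    with its differences to those neighbours, map through a learned layer,
    and aggregate over the neighbours."""
    def __init__(self, dim=64, k=4):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(dim * 2, dim)

    def forward(self, feats):                    # feats: (N, dim), one row per node
        dists = torch.cdist(feats, feats)        # pairwise feature-space distances
        idx = dists.topk(self.k + 1, largest=False).indices[:, 1:]  # drop self
        nbrs = feats[idx]                        # (N, k, dim) neighbour features
        diff = nbrs - feats.unsqueeze(1)         # differences to each neighbour
        h = torch.cat([feats.unsqueeze(1).expand_as(diff), diff], dim=-1)
        return self.fc(h).max(dim=1).values      # aggregate over the k neighbours
```

The updated features would then feed the three multilayer perceptrons that classify row, column, and cell relationships between node pairs.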
Further, the number of the nearest neighbor nodes is preferably 10 to 15.
Further, the location features of the text instance consist of the coordinates of the upper left and lower right corners of the bounding box.
Further, the bounding box background feature, the row background feature and the column background feature of the text example are extracted from the feature map by a Region of Interest (ROI) Pooling image feature extraction method.
Here, a text instance refers to a word, sentence, or paragraph composed of several connected characters.
According to a second aspect of the present invention there is provided an end-to-end table detection and structure identification apparatus, characterised in that the apparatus operates in accordance with the method of any of the preceding aspects, the apparatus comprising:
the image preprocessing unit is used for stretching the original image in the vertical direction, carrying out size normalization with unchanged length-width ratio and performing boundary 0 supplementation to form a preprocessed image;
a table region prediction unit for determining a table region in the preprocessed image with an Encoder-Decoder (Encoder-Decoder) model as a main structure and classifying the table region into a wired table image and a wireless table image;
a table image correction unit for separating a corrected table area image containing only the table area from the preprocessed image based on the determined table area; and
a table structure identification unit for identifying the table structure of the table area image in different ways according to its classification as a wired table image or a wireless table image.
According to a third aspect of the present invention there is provided an end-to-end table detection and structure identification system, the system comprising: a processor and a memory for storing executable instructions; wherein the processor is configured to execute the executable instructions to perform an end-to-end table detection and structure identification method as claimed in any of the above aspects.
According to a fourth aspect of the present invention, there is provided a computer-readable storage medium, characterized in that a computer program is stored thereon, which, when executed by a processor, implements the end-to-end table detection and structure identification method according to any of the above aspects.
The invention has the beneficial effects that:
1. The invention automatically selects different structure recognition methods according to the table detection result. This selection mechanism combines the advantages of traditional rule-based algorithms and deep learning algorithms, improving the robustness and generality of the algorithm.
2. The table detection method based on image segmentation computes table edges more accurately, especially when tilted tables are present in the image. The pixel-level prediction also eliminates, as far as possible, the interference of non-table content areas with subsequent table structure identification.
3. The projection transformation before the structure identification step helps to obtain a table with neatly arranged cells, reducing the difficulty of the subsequent structure identification work.
4. The graph convolutional neural network fuses, by way of graph convolution, a node feature composed of the background feature of a text instance, the image features of the row and column in which it lies, and its position feature, extracting global and local features in the graph structure more efficiently. The updated node features help predict the structural relationships between text instances more accurately, especially when the table contains merged cells.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is an exemplary diagram of a wired table and a wireless table according to an embodiment of the present invention;
FIG. 2 is an algorithmic flow diagram of an end-to-end table detection and structure identification method according to an embodiment of the invention;
FIG. 3 is an algorithm structure diagram of an end-to-end table detection and structure identification method according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an example of image pre-processing transformation according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a variation in feature map size in a Decoder according to an embodiment of the present invention;
FIG. 6 is an exemplary diagram of a table image segmentation result according to an embodiment of the invention;
FIG. 7 is a block diagram of a table position optimization algorithm according to an embodiment of the present invention;
FIG. 8 is an exemplary diagram of a table image correction result according to an embodiment of the present invention;
FIG. 9 is a block diagram of a table line extraction post-processing algorithm according to an embodiment of the present invention;
FIG. 10 is a diagram illustrating a division result of a division line of a table with lines according to an embodiment of the present invention;
FIG. 11 is a schematic plane diagram of node feature extraction according to an embodiment of the present invention;
FIG. 12 is a schematic diagram illustrating a node visual feature encoding according to an embodiment of the present invention;
FIG. 13 is a diagram illustrating node feature updates in graph convolution according to an embodiment of the present invention;
FIG. 14 is a block diagram of a table structure recognition algorithm (graph convolution neural network) according to an embodiment of the present invention;
fig. 15 is a diagram illustrating an example of a table structure recognition result according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terms "first," "second," and the like in the description and in the claims of the present disclosure are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
"A plurality" means two or more.
The term "and/or" as used in this disclosure merely describes an association between objects, indicating that three relationships may exist. For example, "A and/or B" may represent: A alone, A and B together, or B alone.
The invention provides an efficient integrated table structure identification method. To address the problems of tilted tables and the diversity of table structures, the team creatively adopts an image segmentation technique for table detection and a mechanism combining rules with deep learning for structure identification, greatly improving the robustness and generality of the algorithm.
To address the difficult problem of table recognition, the invention considers the characteristics of table objects, such as their many styles, many layouts, and complex internal structures. Based on a unified end-to-end solution, it fully combines the advantages of convolutional neural network image segmentation, graph convolutional neural networks, and traditional rule-based analysis to realize an integrated pipeline from table detection to table structure identification. The approach is data-driven, unified, and independent of any specific table style, and achieves good results on a wide variety of tables. The processing flows for wired and wireless tables are shown in figs. 7, 9, and 14.
Examples
The first step is as follows: image pre-processing
This step performs a series of preprocessing operations on the input image, which contains one or more tables. Since the rows in most tables are densely packed with small line spacing, the design first applies a vertical stretching transformation at this stage to increase the pixel distance between rows and improve their separability. The subsequent preprocessing operations include aspect-ratio-preserving size normalization and zero-padding of the borders, so that the image size meets the requirements of the neural network while global and local feature information is preserved as much as possible. During training, necessary data augmentation, such as affine transformations (rotation, shear, scale, etc.) and color distortion, is applied in the preprocessing stage so that the distribution of training samples comes closer to the underlying real sample distribution, alleviating possible data scarcity and improving the robustness and invariance of the learned model. The invention also introduces a dilation transformation as a form of data augmentation: the input image is first converted to a binary image, and then all pixels are dilated with a 2 x 2 kernel operator, which expands the black pixel regions of the binary image. The binary images generated by dilation not only enlarge the sample set but also imitate blurred black-and-white table images, improving the robustness of the model. In the prediction stage, the algorithm only normalizes the image size.
The second step is that: table area prediction
Compared with image segmentation methods built on top of object detection results, such as Mask R-CNN, this method does not need to segment the image on the basis of a detection result; it avoids the influence of the detected object's minimum bounding rectangle and achieves better edge precision.
In this step the algorithm adopts an image segmentation model with an encoder-decoder main structure, in which the encoder downsamples a low-resolution representation from the high-resolution representation by convolution, and the decoder upsamples a high-resolution representation from the low-resolution representation by transposed convolution or interpolation. The design innovatively adopts the HRNet model as the encoder: the mechanism of parallel multi-resolution sub-networks in HRNet generates multi-resolution representations, so that a semantically rich high-resolution representation is maintained throughout and the information loss caused by downsampling is avoided. HRNet starts with a high-resolution sub-network as the first stage, gradually adds sub-networks from high to low resolution in later stages, and connects the multi-resolution sub-networks in parallel. A Multi-Resolution Fusion Module is also introduced to exchange and fuse feature information between the multi-resolution representations, yielding high-resolution representations that are semantically rich and spatially precise. In the fusion module, the algorithm uses a 3 x 3 convolution with stride 2 to derive low-resolution representations from high-resolution ones, and bilinear interpolation to recover high-resolution representations from low-resolution ones. The encoder ultimately generates feature maps at four scales, with spatial dimensions 1/2, 1/4, 1/8, and 1/16 of the original size.
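The multi-resolution fusion in the encoder might be sketched for two branches as follows: the high-resolution branch is reduced with a 3 x 3 stride-2 convolution and the low-resolution branch is restored with bilinear interpolation, after which each branch receives the sum of both. The channel counts and the 1 x 1 channel-matching convolution are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionFusion(nn.Module):
    """Two-branch sketch of an HRNet-style fusion module: strided 3x3
    convolution to go high -> low resolution, bilinear interpolation to go
    low -> high, then element-wise sums so each branch sees both scales."""
    def __init__(self, c_hi=32, c_lo=64):
        super().__init__()
        self.down = nn.Conv2d(c_hi, c_lo, 3, stride=2, padding=1)  # hi -> lo
        self.match = nn.Conv2d(c_lo, c_hi, 1)  # match channels before upsampling

    def forward(self, x_hi, x_lo):
        hi2lo = self.down(x_hi)
        lo2hi = F.interpolate(self.match(x_lo), size=x_hi.shape[2:],
                              mode="bilinear", align_corners=False)
        return x_hi + lo2hi, x_lo + hi2lo
```

A full HRNet stage would apply this exchange across all parallel resolution branches, not just two.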
The decoder first applies an Atrous Spatial Pyramid Pooling (ASPP) module to the smallest feature map of the previous stage, sampling it with parallel atrous convolutions at different rates to help the model capture feature information at more scales; the kernel sizes of the convolution operators are 1, 3, and 3, the dilation rates are 1, 6, and 12, and padding keeps the output at the original input size. Next, the decoder doubles the spatial dimension of the small feature map step by step via transposed convolution and concatenates it with the same-sized feature map from the encoder. The process is illustrated in fig. 5, where s2, s4, s8, and s16 denote the feature maps of 1/2, 1/4, 1/8, and 1/16 of the original size generated by the encoder.
Finally, the decoder uses a 1 x 1 convolution to generate a mask image with the same spatial size as the original image and a depth of 2, realizing pixel-level prediction. Because a pixel may belong to a wired table area, a wireless table area, or a non-table area, the segmentation model must output two mask prediction images, classifying wired and wireless tables while accurately computing the table area. The value at each pixel position in the mask images lies in the range 0 to 1, and the pixel values of the two mask images represent the confidence that the current pixel belongs to a wired table or a wireless table, respectively.
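Turning the two confidence maps into a per-pixel three-way decision might look roughly like the following sketch, where 0 denotes non-table, 1 wired table, and 2 wireless table. The 0.5 confidence threshold is an illustrative assumption:

```python
import numpy as np

def classify_pixels(mask_wired, mask_wireless, thresh=0.5):
    """Label each pixel from the two mask confidence maps: a pixel is a
    table pixel if either confidence clears the threshold, and its type
    follows the larger of the two confidences."""
    labels = np.zeros(mask_wired.shape, dtype=np.uint8)
    is_table = np.maximum(mask_wired, mask_wireless) >= thresh
    labels[is_table & (mask_wired >= mask_wireless)] = 1  # wired table
    labels[is_table & (mask_wired < mask_wireless)] = 2   # wireless table
    return labels
```

The resulting label map feeds the automatic selection of the structure recognition method in step 4.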
The classification result output by this step will then help the model to automatically select the corresponding structure recognition method.
The third step: table image correction
The first aim of this step is to compute the complete table area by edge-fitting the Mask prediction image obtained in the previous step, and then to separate the table area from the original image by projective transformation to form a new picture.
This step first computes the outline around the table from the Mask image, then detects all straight lines in the outline with a Hough transform operator and merges the lines that satisfy the merging condition, and finally computes the precise position of the table from the positions of all lines. The algorithm structure of this step is shown in fig. 7.
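The merging condition itself is not spelled out in the text; a plausible minimal sketch merges Hough segments that are nearly collinear and close together, where the angle and gap thresholds are assumptions:

```python
# Greedy merging of nearly collinear, nearby line segments (x1, y1, x2, y2),
# a simplified stand-in for the patent's unspecified merging condition.
import math

def _angle(s):
    """Orientation of a segment in [0, pi)."""
    return math.atan2(s[3] - s[1], s[2] - s[0]) % math.pi

def merge_lines(segments, angle_tol=math.radians(3), gap_tol=10.0):
    merged = []
    for seg in segments:
        for i, m in enumerate(merged):
            da = abs(_angle(seg) - _angle(m))
            da = min(da, math.pi - da)                    # wrap-around angle
            gap = math.hypot(seg[0] - m[2], seg[1] - m[3])
            if da < angle_tol and gap < gap_tol:
                # extend the kept line to span both segments
                merged[i] = (m[0], m[1], seg[2], seg[3])
                break
        else:
            merged.append(seg)
    return merged

# two collinear fragments of one table border collapse into a single line
print(merge_lines([(0, 0, 10, 0), (12, 0, 20, 0)]))
```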
The projective transformation in this step ensures, in most cases, that the new picture contains only table content and that most cells within the table are well aligned. It thus eliminates interference from non-table content in the original image, reduces the difficulty of the recognition task, and further improves the accuracy of table recognition.
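The projection can be sketched with a direct linear transform: given the four fitted table corners, solve for the 3 × 3 homography that maps them to an axis-aligned rectangle. A minimal NumPy sketch, where the corner coordinates are illustrative:

```python
# Solve the homography mapping 4 source points to 4 destination points,
# then warp individual points with it (a sketch of projective rectification).
import numpy as np

def perspective_matrix(src, dst):
    """8-equation DLT solve for the 3x3 perspective transform."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(H, p):
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]                      # divide out the projective scale

# illustrative skewed table corners -> upright 100x50 rectangle
src = [(3, 5), (105, 2), (110, 57), (1, 52)]
dst = [(0, 0), (100, 0), (100, 50), (0, 50)]
H = perspective_matrix(src, dst)
```

In practice the same matrix would be handed to an image-warping routine to resample the whole table region into the new picture.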
The fourth step: table structure identification
The step uses the table classification result obtained from the second step, and adopts a targeted structure identification mode for the table area image obtained from the third step.
For wired tables, the algorithm adopts an explicit separation-line detection method. It first predicts the positions of the explicit table separation lines with an image segmentation model and extracts the content and position of each text instance in the table with an OCR engine, then computes the position of each text instance within the table structure with a post-processing algorithm. The detailed structure of the post-processing algorithm is shown in fig. 9. The post-processing algorithm first computes the outline of the explicit separation lines from the separation-line Mask image, then extracts a skeleton map of the separation-line outline with a boundary erosion algorithm, next detects all straight lines in the skeleton map with a Hough transform and merges those that satisfy the merging condition, derives the cell positions from the intersections of all horizontal and vertical lines, and finally computes the table structure information from the relative positions of the table cells and the text instances.
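The cell-derivation step can be sketched as follows, assuming the merged lines have already been reduced to sorted horizontal y-positions and vertical x-positions (a simplification of the general intersection computation, and the center-point assignment rule is an assumption):

```python
# Derive cell boxes from separator-line positions and assign text instances
# to cells by their center point (a simplified sketch of the post-processing).
def cells_from_lines(h_ys, v_xs):
    """Cells as (x1, y1, x2, y2), row-major, from sorted line positions."""
    h_ys, v_xs = sorted(h_ys), sorted(v_xs)
    cells = []
    for r in range(len(h_ys) - 1):
        for c in range(len(v_xs) - 1):
            cells.append((v_xs[c], h_ys[r], v_xs[c + 1], h_ys[r + 1]))
    return cells

def cell_of_text(cells, box):
    """Index of the cell containing the text box's center, else None."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    for i, (x1, y1, x2, y2) in enumerate(cells):
        if x1 <= cx <= x2 and y1 <= cy <= y2:
            return i
    return None

grid = cells_from_lines([0, 10, 20], [0, 50])   # 2 rows x 1 column
```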
For wireless tables, on the other hand, the algorithm employs a graph convolutional neural network. In the wireless-table branch, ResNet50 serves as the feature extractor for the image features of the input table, and an OCR engine extracts the content and position of each text instance in the table. Each node in the graph convolutional neural network corresponds to one text instance, and the node features are formed by concatenating the text instance's position features, bounding-box background features, row background features and column background features. The position features of a text instance consist of the coordinates of the upper-left and lower-right corners of its bounding box. The algorithm uses RoIPooling-style image feature extraction to obtain the corresponding bounding-box, row and column background features from the positions of the text instance on the feature map, as shown in FIG. 11.
RoIPooling extracts a fixed-size feature map at the position corresponding to a text instance; the algorithm then averages this feature map globally along the width and height dimensions, so that only the features along the depth dimension are retained.
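A minimal PyTorch sketch of this reduction, using adaptive pooling in place of a full RoIPooling implementation; the channel count, pooled size, and integer box coordinates are illustrative assumptions:

```python
# Crop a text instance's RoI from a CxHxW feature map, pool it to a fixed
# size, then average over height and width so only depth features remain.
import torch
import torch.nn.functional as F

def roi_depth_feature(fmap, box, pooled=7):
    x1, y1, x2, y2 = box                       # integer feature-map coords
    crop = fmap[:, y1:y2, x1:x2]               # (C, h, w) region of interest
    pooled_map = F.adaptive_max_pool2d(crop.unsqueeze(0), pooled)
    # global average over width and height -> one value per channel
    return pooled_map.mean(dim=(2, 3)).squeeze(0)

fmap = torch.randn(64, 32, 32)                 # illustrative backbone output
vec = roi_depth_feature(fmap, (2, 2, 10, 10))  # depth-only feature vector
```

The same routine would be called three times per text instance, with the bounding-box region, its full row strip, and its full column strip as the box.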
Unlike ordinary convolution, the graph convolutional neural network is not restricted to grid-structured data such as the fixed adjacency relations in an image; instead, it selects the nodes in the graph structure most similar to the current node, according to a feature-similarity metric, for feature extraction and updating. In a multi-layer graph convolutional network, the neighboring nodes of each node change continuously as the node features are updated, which enlarges the effective receptive field of the network to some extent.
In the invention, the algorithm uses the Euclidean distance as the node-feature similarity metric in the graph convolution operation. For each node it computes several nearest-neighbor nodes, then concatenates the node's features with the differences between its features and each neighbor's features, and feeds the result into a fully connected network for feature extraction, as shown in FIG. 13. Each graph convolution layer averages the extraction results of the current node and all its neighboring nodes as the updated node features. In the graph convolution operation adopted by this design, the node features represent global features of the graph structure, while the differences between node and neighbor features represent local features; stacking multiple graph convolution layers therefore helps the algorithm fully extract both global and local features of the graph structure. Finally, three multilayer perceptron networks (classifiers) use the updated node features to judge the row, column and cell structural relations between nodes, respectively.
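The described graph convolution layer can be sketched as follows in PyTorch; the feature sizes, the value of k, the ReLU nonlinearity, and the single shared linear layer are assumptions rather than details from the patent:

```python
# One kNN graph-convolution layer: Euclidean nearest neighbours, concatenate
# [node features, node - neighbour differences], shared MLP, average update.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KNNGraphConv(nn.Module):
    def __init__(self, c_in, c_out, k=4):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(2 * c_in, c_out)   # shared fully connected layer

    def forward(self, x):                      # x: (N, C) node features
        d = torch.cdist(x, x)                  # pairwise Euclidean distances
        # k+1 smallest distances per node; drop column 0 (the node itself)
        idx = d.topk(self.k + 1, largest=False).indices[:, 1:]
        neigh = x[idx]                         # (N, k, C) neighbour features
        center = x.unsqueeze(1).expand_as(neigh)
        # global part (node features) + local part (feature differences)
        h = self.fc(torch.cat([center, center - neigh], dim=-1))
        return F.relu(h).mean(dim=1)           # average over neighbours

layer = KNNGraphConv(8, 16, k=3)
out = layer(torch.randn(10, 8))                # 10 text-instance nodes
```

Because neighbors are recomputed from the current features at every layer, stacking such layers lets each node's effective neighborhood drift, as the text describes.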
Fig. 14 is a structural diagram of a table structure recognition algorithm based on a graph convolution neural network, and fig. 15 is an exemplary diagram of a table structure recognition result.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the above implementation method can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation method. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. An end-to-end table detection and structure identification method, wherein an input original image contains a table, the method is characterized by comprising the following steps:
step 1: an image preprocessing step of stretching the original image in the vertical direction, performing size normalization that keeps the aspect ratio unchanged, and zero-padding the boundary, to form a preprocessed image;
step 2: a table region prediction step of determining a table region in the preprocessed image by using an encoder-decoder model as a main structure, and classifying the table region into a wired table image and a wireless table image;
step 3: a table image correction step of separating a corrected table area image containing only the table area from the preprocessed image based on the determined table area;
step 4: a table structure recognition step of performing table structure recognition on the table area image in different modes according to its classification as a wired table image or a wireless table image.
2. The method of claim 1, wherein in step 2, the encoding portion of the encoder-decoder model convolves and downsamples the low-resolution representation from the first high-resolution representation; the decoding portion upsamples a second high resolution representation from the low resolution representation by means of a transposed convolution or interpolation.
3. The method of claim 2, wherein the encoding portion operates as follows:
the method comprises the steps of generating multi-resolution representations by adopting a mechanism of parallel connection of multi-resolution sub-networks in a high-resolution network, introducing a multi-resolution fusion module to realize feature information exchange and fusion among the multi-resolution representations, and finally outputting first feature maps with various scales.
4. The method of claim 3, wherein the decoding portion operates as follows:
firstly, an atrous spatial pyramid pooling module is adopted to perform parallel atrous-convolution sampling, at different sampling rates, on the smallest-size feature map among the first feature maps, and then the spatial sizes of the other first feature maps are each doubled by transposed convolution, forming a plurality of second feature maps equal in number to the first feature maps;
concatenating each second feature map with the same-sized first feature map from the coding part, and finally performing convolution to generate two mask prediction images with the same size as the preprocessed image;
thereby determining a table area and distinguishing between the wired table image and the wireless table image.
5. The method according to claim 1, wherein step 3 specifically comprises:
step 31: calculating the outline around the table by using a canny edge detection operator according to the mask predicted image;
step 32: detecting all straight lines in the contour by utilizing a Hough transform operator and merging part of the straight lines meeting merging conditions;
step 33: the exact table positions are calculated from the positions of all the straight lines, thereby separating the corrected table area image containing only the table area.
6. The method according to claim 1, wherein in step 4, for the table area image belonging to the wired table image, the method specifically includes:
calculating a contour map of the explicit separation line from the mask prediction image of the separation line by using a canny edge detection operator;
extracting a contour skeleton map of the separation line by using a boundary erosion method;
calculating all straight lines from the contour skeleton graph by using a Hough transform method and fusing part of straight lines meeting the merging condition;
the positions of the table cells are obtained by calculating the positions of the intersection points of all the transverse lines and the vertical lines;
extracting the content and the position of a text example in the table;
and calculating and outputting table structure information according to the relative positions of the table cells and the text examples.
7. The method according to claim 1, wherein in step 4, for the table area image belonging to the wireless table image, the method specifically includes:
extracting node features by taking each text instance as a node, wherein the node features are formed by concatenating the position features, bounding-box background features, row background features and column background features of each text instance;
for a certain node a, computing the similarity between node a and all other nodes in its current feature space, and selecting a plurality of nearest-neighbor nodes;
concatenating the node features A of node a with the differences between A and the node features of each nearest-neighbor node, inputting the result into a trained graph convolutional neural network, and outputting updated node features A';
repeating the above operations to obtain updated node characteristics of all nodes in the table area image;
determining, with three multilayer perceptron networks using the updated node features, the row, column and cell structural relations between the node and its nearest-neighbor nodes respectively, thereby determining and outputting table structure information.
8. The method of claim 7, wherein the location features of the text instance consist of coordinates of an upper left corner and a lower right corner of a bounding box.
9. The method of claim 7, wherein the bounding box background features, row background features, and column background features of the text instance are extracted from the feature map by extraction of region-of-interest pooled image features.
10. An end-to-end table detection and structure identification device, operating based on the method according to any one of claims 1 to 9, characterized by comprising the following components:
the image preprocessing unit is used for stretching the original image in the vertical direction, performing size normalization that keeps the aspect ratio unchanged, and zero-padding the boundary, to form a preprocessed image;
a table region prediction unit for determining a table region in the preprocessed image with an encoder-decoder model as a main structure, and classifying the table region into a wired table image and a wireless table image;
a table image correction unit for separating a corrected table area image containing only the table area from the preprocessed image based on the determined table area;
and the table structure identification unit is used for identifying the table structure of the table area image in different modes according to the classification into the wired table image and the wireless table image.
CN202110396302.5A 2021-04-13 2021-04-13 End-to-end form detection and structure identification method and system Active CN113435240B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110396302.5A CN113435240B (en) 2021-04-13 2021-04-13 End-to-end form detection and structure identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110396302.5A CN113435240B (en) 2021-04-13 2021-04-13 End-to-end form detection and structure identification method and system

Publications (2)

Publication Number Publication Date
CN113435240A true CN113435240A (en) 2021-09-24
CN113435240B CN113435240B (en) 2024-06-14

Family

ID=77753027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110396302.5A Active CN113435240B (en) 2021-04-13 2021-04-13 End-to-end form detection and structure identification method and system

Country Status (1)

Country Link
CN (1) CN113435240B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113936287A (en) * 2021-10-20 2022-01-14 平安国际智慧城市科技股份有限公司 Table detection method and device based on artificial intelligence, electronic equipment and medium
CN114565927A (en) * 2022-03-03 2022-05-31 上海恒生聚源数据服务有限公司 Table identification method and device, electronic equipment and storage medium
CN116092105A (en) * 2023-04-07 2023-05-09 北京中关村科金技术有限公司 Method and device for analyzing table structure
CN116257459A (en) * 2023-05-16 2023-06-13 北京城建智控科技股份有限公司 Form UI walk normalization detection method and device
CN116311301A (en) * 2023-02-17 2023-06-23 北京感易智能科技有限公司 Wireless form identification method and system
CN116311310A (en) * 2023-05-19 2023-06-23 之江实验室 Universal form identification method and device combining semantic segmentation and sequence prediction

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6018776A (en) * 1992-06-30 2000-01-25 Discovision Associates System for microprogrammable state machine in video parser clearing and resetting processing stages responsive to flush token generating by token generator responsive to received data
WO2008154611A2 (en) * 2007-06-11 2008-12-18 Honeywell International Inc. Optical reader system for extracting information in a digital image
CN105184265A (en) * 2015-09-14 2015-12-23 哈尔滨工业大学 Self-learning-based handwritten form numeric character string rapid recognition method
CN105426834A (en) * 2015-11-17 2016-03-23 中国传媒大学 Projection feature and structure feature based form image detection method
US20190050640A1 (en) * 2017-08-10 2019-02-14 Adobe Systems Incorporated Form structure extraction network
US20190050381A1 (en) * 2017-08-14 2019-02-14 Adobe Systems Incorporated Detecting the bounds of borderless tables in fixed-format structured documents using machine learning
US20190266394A1 (en) * 2018-02-26 2019-08-29 Abc Fintech Co., Ltd. Method and device for parsing table in document image
US20200151444A1 (en) * 2018-11-14 2020-05-14 Adobe Inc. Table Layout Determination Using A Machine Learning System
CN111753727A (en) * 2020-06-24 2020-10-09 北京百度网讯科技有限公司 Method, device, equipment and readable storage medium for extracting structured information
CN111783735A (en) * 2020-07-22 2020-10-16 欧冶云商股份有限公司 Steel document analytic system based on artificial intelligence
CN111814722A (en) * 2020-07-20 2020-10-23 电子科技大学 Method and device for identifying table in image, electronic equipment and storage medium
CN111860257A (en) * 2020-07-10 2020-10-30 上海交通大学 Table identification method and system fusing multiple text features and geometric information
CN111950453A (en) * 2020-08-12 2020-11-17 北京易道博识科技有限公司 Optional-shape text recognition method based on selective attention mechanism
WO2020254924A1 (en) * 2019-06-16 2020-12-24 Way2Vat Ltd. Systems and methods for document image analysis with cardinal graph convolutional networks
US20210056429A1 (en) * 2019-08-21 2021-02-25 Eygs Llp Apparatus and methods for converting lineless tables into lined tables using generative adversarial networks
US20210073326A1 (en) * 2019-09-06 2021-03-11 Wipro Limited System and method for extracting tabular data from a document
WO2021053687A1 (en) * 2019-09-18 2021-03-25 Tata Consultancy Services Limited Deep learning based table detection and associated data extraction from scanned image documents

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6018776A (en) * 1992-06-30 2000-01-25 Discovision Associates System for microprogrammable state machine in video parser clearing and resetting processing stages responsive to flush token generating by token generator responsive to received data
WO2008154611A2 (en) * 2007-06-11 2008-12-18 Honeywell International Inc. Optical reader system for extracting information in a digital image
CN105184265A (en) * 2015-09-14 2015-12-23 哈尔滨工业大学 Self-learning-based handwritten form numeric character string rapid recognition method
CN105426834A (en) * 2015-11-17 2016-03-23 中国传媒大学 Projection feature and structure feature based form image detection method
US20190050640A1 (en) * 2017-08-10 2019-02-14 Adobe Systems Incorporated Form structure extraction network
CN109389027A (en) * 2017-08-10 2019-02-26 奥多比公司 Form structure extracts network
US20190050381A1 (en) * 2017-08-14 2019-02-14 Adobe Systems Incorporated Detecting the bounds of borderless tables in fixed-format structured documents using machine learning
US20190266394A1 (en) * 2018-02-26 2019-08-29 Abc Fintech Co., Ltd. Method and device for parsing table in document image
US20200151444A1 (en) * 2018-11-14 2020-05-14 Adobe Inc. Table Layout Determination Using A Machine Learning System
WO2020254924A1 (en) * 2019-06-16 2020-12-24 Way2Vat Ltd. Systems and methods for document image analysis with cardinal graph convolutional networks
US20210056429A1 (en) * 2019-08-21 2021-02-25 Eygs Llp Apparatus and methods for converting lineless tables into lined tables using generative adversarial networks
US20210073326A1 (en) * 2019-09-06 2021-03-11 Wipro Limited System and method for extracting tabular data from a document
WO2021053687A1 (en) * 2019-09-18 2021-03-25 Tata Consultancy Services Limited Deep learning based table detection and associated data extraction from scanned image documents
CN111753727A (en) * 2020-06-24 2020-10-09 北京百度网讯科技有限公司 Method, device, equipment and readable storage medium for extracting structured information
CN111860257A (en) * 2020-07-10 2020-10-30 上海交通大学 Table identification method and system fusing multiple text features and geometric information
CN111814722A (en) * 2020-07-20 2020-10-23 电子科技大学 Method and device for identifying table in image, electronic equipment and storage medium
CN111783735A (en) * 2020-07-22 2020-10-16 欧冶云商股份有限公司 Steel document analytic system based on artificial intelligence
CN111950453A (en) * 2020-08-12 2020-11-17 北京易道博识科技有限公司 Optional-shape text recognition method based on selective attention mechanism

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DEVASHISH PRASAD, AYAN GADPAL, KSHITIJ KAPADNI, MANISH VISAVE, KAVITA SULTANPURE: "CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents", https://doi.org/10.48550/arXiv.2004.12629
SHOAIB AHMED SIDDIQUI; PERVAIZ IQBAL KHAN; ANDREAS DENGEL: "Rethinking Semantic Segmentation for Table Structure Recognition in Documents", 2019 International Conference on Document Analysis and Recognition (ICDAR)
BU FEIYU, LIU CHANGSONG, DING XIAOQING: "Discrimination of tables and figures in layout analysis", Computer Engineering and Applications, no. 12
YING ZILU, ZHAO YIHONG, XUAN CHEN, DENG WENBO: "Document image layout analysis with multi-feature fusion", Journal of Image and Graphics, pages 1

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113936287A (en) * 2021-10-20 2022-01-14 平安国际智慧城市科技股份有限公司 Table detection method and device based on artificial intelligence, electronic equipment and medium
CN114565927A (en) * 2022-03-03 2022-05-31 上海恒生聚源数据服务有限公司 Table identification method and device, electronic equipment and storage medium
CN116311301A (en) * 2023-02-17 2023-06-23 北京感易智能科技有限公司 Wireless form identification method and system
CN116311301B (en) * 2023-02-17 2024-06-07 北京感易智能科技有限公司 Wireless form identification method and system
CN116092105A (en) * 2023-04-07 2023-05-09 北京中关村科金技术有限公司 Method and device for analyzing table structure
CN116257459A (en) * 2023-05-16 2023-06-13 北京城建智控科技股份有限公司 Form UI walk normalization detection method and device
CN116257459B (en) * 2023-05-16 2023-07-28 北京城建智控科技股份有限公司 Form UI walk normalization detection method and device
CN116311310A (en) * 2023-05-19 2023-06-23 之江实验室 Universal form identification method and device combining semantic segmentation and sequence prediction

Also Published As

Publication number Publication date
CN113435240B (en) 2024-06-14

Similar Documents

Publication Publication Date Title
CN113435240B (en) End-to-end form detection and structure identification method and system
CN109948510B (en) Document image instance segmentation method and device
CN109740548B (en) Reimbursement bill image segmentation method and system
Cheung et al. An Arabic optical character recognition system using recognition-based segmentation
JP7246104B2 (en) License plate identification method based on text line identification
CN105574524B (en) Based on dialogue and divide the mirror cartoon image template recognition method and system that joint identifies
Dal Poz et al. Automated extraction of road network from medium-and high-resolution images
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
CN111553349B (en) Scene text positioning and identifying method based on full convolution network
CN115331245B (en) Table structure identification method based on image instance segmentation
CN115457565A (en) OCR character recognition method, electronic equipment and storage medium
CN113673541B (en) Image sample generation method for target detection and application
CN114529925A (en) Method for identifying table structure of whole line table
Kölsch et al. Recognizing challenging handwritten annotations with fully convolutional networks
CN110956088A (en) Method and system for positioning and segmenting overlapped text lines based on deep learning
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113158895A (en) Bill identification method and device, electronic equipment and storage medium
CN113033558A (en) Text detection method and device for natural scene and storage medium
CN115830359A (en) Workpiece identification and counting method based on target detection and template matching in complex scene
CN110634142B (en) Complex vehicle road image boundary optimization method
CN116740758A (en) Bird image recognition method and system for preventing misjudgment
CN113657225B (en) Target detection method
CN111832497B (en) Text detection post-processing method based on geometric features
CN113033559A (en) Text detection method and device based on target detection and storage medium
Lou et al. Generative shape models: Joint text recognition and segmentation with very little training data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant