CN113435240B - End-to-end form detection and structure identification method and system - Google Patents


Info

Publication number: CN113435240B (application CN202110396302.5A)
Authority: CN (China)
Prior art keywords: image, node, area, region, wired
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN113435240A
Inventors: 周勃宇, 王勇, 朱军民
Current and original assignee: Beijing Yidao Boshi Technology Co., Ltd.
Application filed by Beijing Yidao Boshi Technology Co., Ltd.; priority to CN202110396302.5A; publication of application CN113435240A; application granted; publication of grant CN113435240B

Classifications

    • G06F18/24: Pattern recognition; analysing; classification techniques
    • G06N3/045: Computing arrangements based on biological models; neural networks; combinations of networks
    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06T3/4007: Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation
    • G06T3/4053: Scaling of whole images or parts thereof based on super-resolution
    • G06T5/30: Image enhancement or restoration using local operators; erosion or dilatation, e.g. thinning


Abstract

The invention discloses an end-to-end table detection and structure identification method and system, relating to the field of computer vision. The method comprises the following steps: stretching the original image in the vertical direction, normalizing its size with the aspect ratio unchanged, and zero-padding the boundary to form a preprocessed image; determining the table region in the preprocessed image with an encoder-decoder model as the main structure, and classifying the table region as a wired (ruled) table image or a wireless (borderless) table image; separating a corrected table region image containing only the table region from the preprocessed image based on the determined table region; and performing table structure identification on the table region image with different methods according to the wired or wireless classification. The method applies different structure identification methods to different types of tables, fully combining the advantages of convolutional-neural-network image segmentation, graph convolutional neural networks and traditional rule-based analysis to improve the robustness and generality of the algorithm.

Description

End-to-end form detection and structure identification method and system
Technical Field
The invention relates to the field of computer vision, in particular to an end-to-end form detection and structure identification method and system.
Background
In reality, tables are widely used to carry key information in PDFs, scanned documents, photographed pictures and other media. Table structure identification is an important prerequisite for many downstream tasks, such as document analysis, information extraction and visualization. Automatic table recognition generally comprises two major steps, table detection and table structure recognition: table detection locates the table regions in a picture, and structure recognition identifies the internal structure of each region to obtain the final structured data. Extracting table contents manually takes a great deal of manpower and time; an automated approach greatly improves efficiency.
In reality, tables come in a large number of different styles, formats and internal structures, so a single unified recognition method is often difficult. Conventional table recognition methods typically rely on manually designed features (e.g., row and column separation lines, blank areas, cell data types) and heuristic rules. Table detection typically employs a bottom-up strategy, for example locating rows using explicit text alignment and then fusing all row and column information to compute the table area. The variability of table styles and the complexity of internal structures make row and column detection difficult, which degrades overall detection performance. Table structure identification then typically depends on explicit separation lines in the table and the relative positions of separation lines and text instances. This approach performs well on wired tables but cannot cope with wireless tables, where separation lines are partially or completely missing.
In recent years, deep learning has driven the rapid development of computer vision and has also been applied to table recognition. In summary, deep learning table recognition has two major advantages over conventional methods. First, a deep learning method takes an image as input and can in principle be applied to any recognition object convertible into an image, such as PDFs and scanned documents, giving it the advantage of methodological unification. Second, thanks to its strong automatic feature-encoding capability and unified end-to-end trainability, deep learning clearly outperforms traditional approaches built mainly on hand-designed features and heuristic rules.
Therefore, an integrated pipeline from table detection to table structure identification that exploits these deep learning advantages has good application prospects.
Disclosure of Invention
In order to achieve the above object, the present invention provides a structure recognition method integrating table detection, which can efficiently extract table-internal structure information from an image. The image segmentation technique used in this scheme not only computes table edges accurately through pixel-level prediction, but also classifies each table as a wired (ruled) table or a wireless (borderless) table. In subsequent steps the scheme applies different structure identification methods to the different table types, fully combining the advantages of convolutional-neural-network image segmentation, graph convolutional neural networks and traditional rule-based analysis to improve the robustness and generality of the algorithm.
Specifically, the method first uses a convolutional neural network to detect the table region. For a detected wired table, a convolutional neural network detects the table grid lines, and the table structure identification is completed with post-processing rules; for a wireless table, a graph convolutional neural network predicts the cell, row and column relations to complete the structure identification.
According to a first aspect of the present invention, there is provided an end-to-end form detection and structure identification method, wherein an input original image contains a form, the method comprising the steps of:
Step 1: an image preprocessing step, stretching the original image in the vertical direction, normalizing its size with the aspect ratio unchanged, and zero-padding the boundary to form a preprocessed image;
Step 2: a table region prediction step of determining a table region in the preprocessed image by using an encoder-Decoder (Encoder-Decoder) model as a main structure, and classifying the table region into a wired table image and a wireless table image;
Step 3: a form image correction step of separating a corrected form area image containing only the form area from the preprocessed image based on the determined form area;
Step 4: a table structure identification step, performing table structure identification on the table region image in different ways according to the classification into wired and wireless table images.
Further, in the step 2, the coding part of the encoder-decoder model downsamples the low resolution representation from the first high resolution representation by means of convolution; the decoding section upsamples the second high resolution representation from the low resolution representation by means of transposed convolution or interpolation.
Further, the encoding section operates as follows:
The multi-resolution representations are generated using the parallel multi-resolution sub-network mechanism of a High-Resolution Network (HRNet), and a Multi-Resolution Fusion Module is introduced to exchange and fuse feature information among the multi-resolution representations, finally outputting first feature maps at multiple scales.
Further, the decoding section operates as follows:
Firstly, an Atrous Spatial Pyramid Pooling (ASPP) module samples the smallest of the first feature maps in parallel with atrous (dilated) convolutions of different sampling rates, and the spatial dimensions of the other first feature maps are doubled by transposed convolution, forming a plurality of second feature maps equal in number to the first feature maps;
the second feature maps are spliced with the same-sized first feature maps from the encoding part, and a final convolution generates two mask prediction images of the same size as the preprocessed image;
the form area is thus determined and distinguished into a wired form image and a wireless form image.
Further, the step 3 specifically includes:
Step 31: calculating the contour around the table with the Canny edge detection operator according to the mask prediction image;
step 32: detecting all straight lines in the contour with a Hough transform operator and merging those line segments that meet the merging conditions;
Step 33: an accurate form position is calculated from the positions of all the straight lines, whereby a corrected form area image containing only form areas is separated.
Further, in step 32, the merging conditions are:
first, judge whether the two line segments are parallel. If they are parallel, calculate the perpendicular distance between them; when this distance is larger than a threshold, the two segments cannot be merged. If they are not parallel, calculate the difference between their slopes; when this difference is larger than a threshold, the two segments cannot be merged.
When the above conditions are met, further judge whether the two segments overlap in a given projection direction. If they overlap, calculate the perpendicular distances from the two endpoints of each segment to the other segment; when the minimum of the four distances is smaller than a threshold, the two segments can be merged. If they do not overlap, calculate the distances between the endpoints of the two segments; when the minimum of the four distances is smaller than a threshold, the two segments can be merged.
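For concreteness, a minimal Python/NumPy sketch of this merging test follows; the thresholds are illustrative assumptions, and the parallel and near-parallel cases are collapsed into a single direction-angle test rather than the two separate tests described above.

```python
import numpy as np

def can_merge(seg1, seg2, dist_thresh=5.0, angle_thresh=0.05):
    """Sketch of the merge test. Each segment is ((x1, y1), (x2, y2)).
    Thresholds are assumptions, not values from the patent."""
    (p1, p2), (q1, q2) = np.asarray(seg1, float), np.asarray(seg2, float)
    d1, d2 = p2 - p1, q2 - q1
    # Direction test (stands in for the parallel / slope-difference checks).
    da = abs(np.arctan2(d1[1], d1[0]) - np.arctan2(d2[1], d2[0])) % np.pi
    if min(da, np.pi - da) > angle_thresh:
        return False
    # Perpendicular distance between the (near-)parallel carrier lines.
    n = np.array([-d1[1], d1[0]]) / np.linalg.norm(d1)
    if abs(np.dot(q1 - p1, n)) > dist_thresh:
        return False

    def pt_seg(p, a, b):  # distance from point p to segment a-b
        t = np.clip(np.dot(p - a, b - a) / np.dot(b - a, b - a), 0.0, 1.0)
        return np.linalg.norm(p - (a + t * (b - a)))

    # Overlap along seg1's direction decides which distances to compare.
    u = d1 / np.linalg.norm(d1)
    i1 = sorted([p1 @ u, p2 @ u])
    i2 = sorted([q1 @ u, q2 @ u])
    if i1[0] <= i2[1] and i2[0] <= i1[1]:   # projections overlap
        dists = [pt_seg(q, p1, p2) for q in (q1, q2)] + \
                [pt_seg(p, q1, q2) for p in (p1, p2)]
    else:                                   # disjoint: endpoint to endpoint
        dists = [np.linalg.norm(p - q) for p in (p1, p2) for q in (q1, q2)]
    return min(dists) < dist_thresh
```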
Further, in the step 4, for the table area image belonging to the wired table image, the method specifically includes:
Calculating a contour map of the explicit separation lines with the Canny edge detection operator according to the separation-line mask prediction image;
extracting a contour skeleton map of the separation lines with a boundary erosion algorithm;
calculating all straight lines from the contour skeleton map with the Hough transform algorithm, and merging those line segments that meet the merging conditions;
The positions of the table cells are obtained by calculating the positions of all intersecting points of the horizontal lines and the vertical lines;
Extracting the content and the position of a text instance in the form;
And calculating the table structure information according to the relative positions of the table cells and the text examples and outputting the table structure information.
Further, in the step 4, for the table area image belonging to the wireless table image, the method specifically includes:
Taking each text instance as a node and extracting node features, where a node feature is formed by concatenating the position features, bounding box background features, row background features and column background features of the text instance;
computing, in feature space, the similarity between a node a and all other nodes, and selecting several nearest-neighbor nodes around node a;
concatenating the node feature A of node a with the similarity differences between A and the node features of each nearest-neighbor node, feeding the result into a trained graph convolutional neural network, and outputting the updated node feature A';
repeating the above operations to obtain updated node features for all nodes in the table region image;
and using the updated node features in three multi-layer perceptron networks to determine the row, column and cell structural relationships between each node and its several nearest-neighbor nodes, thereby determining and outputting the table structure information.
Further, the number of the nearest neighbor nodes is preferably 10-15.
Further, the location features of the text instance are composed of coordinates of the upper left and lower right corners of the bounding box.
Further, the bounding box background features, row background features and column background features of a text instance are extracted from the feature map by Region-of-Interest Pooling (ROI Pooling).
Here, the text example refers to a word, sentence or segment composed of several connected words.
According to a second aspect of the present invention there is provided an end-to-end form detection and structure identification apparatus, characterised in that the apparatus operates based on the method of any of the above aspects, the apparatus comprising the following components:
The image preprocessing unit is used for stretching the original image in the vertical direction, normalizing its size with the aspect ratio unchanged, and zero-padding the boundary to form a preprocessed image;
A table region prediction unit for determining a table region in the preprocessed image using an encoder-Decoder (Encoder-Decoder) model as a main structure, and classifying the table region into a wired table image and a wireless table image;
a form image correction unit for separating a corrected table region image containing only the table region from the preprocessed image based on the determined table region; and
a table structure identification unit for performing table structure identification on the table region image in different ways according to the classification into wired and wireless table images.
According to a third aspect of the present invention there is provided an end-to-end form detection and structure identification system, the system comprising: a processor and a memory for storing executable instructions; wherein the processor is configured to execute the executable instructions to perform the end-to-end table detection and structure identification method as described in any of the above aspects.
According to a fourth aspect of the present invention there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the end-to-end form detection and structure identification method of any of the above aspects.
The invention has the beneficial effects that:
1. The invention automatically selects different structure identification methods according to the table detection result. This selection mechanism combines the advantages of traditional rule-based algorithms and deep learning algorithms, improving the robustness and generality of the overall algorithm.
2. The image-segmentation-based table detection method computes table edges more accurately, especially when tilted tables are present in the image. Pixel-level prediction also excludes, as far as possible, interference from non-table content regions in the subsequent table structure identification.
3. The projective transformation before the structure identification step helps obtain tables whose cells are orderly arranged, reducing the difficulty of subsequent structure identification.
4. The graph convolutional neural network fuses node features consisting of the background features of each text instance, the image features of its row and column, and its position features, extracting the global and local features of the graph structure more efficiently. The updated node features help predict the structural relations between text instances more accurately, especially when the table contains merged cells.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the structures shown in these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an exemplary diagram of wired and wireless tables according to an embodiment of the present invention;
FIG. 2 is an algorithm flow chart of an end-to-end form detection and structure identification method according to an embodiment of the invention;
FIG. 3 is an algorithm block diagram of an end-to-end table detection and structure identification method according to an embodiment of the present invention;
FIG. 4 is an exemplary diagram of an image preprocessing transformation in accordance with an embodiment of the present invention;
FIG. 5 is a diagram illustrating a change in dimension of a feature in a Decoder according to an embodiment of the present invention;
FIG. 6 is an exemplary graph of a tabular image segmentation result in accordance with an embodiment of the present invention;
FIG. 7 is a block diagram of a table location optimization algorithm according to an embodiment of the invention;
FIG. 8 is a table image correction result example diagram according to an embodiment of the present invention;
FIG. 9 is a block diagram of a tabular wire extraction post-processing algorithm according to an embodiment of the invention;
FIG. 10 is an exemplary diagram of wired-table separation-line segmentation results according to an embodiment of the present invention;
FIG. 11 is a schematic plan view of node feature extraction according to an embodiment of the invention;
FIG. 12 is a schematic representation of a node visual feature encoding scheme according to an embodiment of the present invention;
FIG. 13 is a diagram of node feature update in graph convolution according to an embodiment of the present invention;
FIG. 14 is a block diagram of a table structure identification algorithm (graph convolutional neural network) according to an embodiment of the present invention;
fig. 15 is a diagram showing an example of a table structure recognition result according to an embodiment of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein, for example.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
A plurality, including two or more.
And/or: it should be understood that the term "and/or" in this disclosure merely describes an association relationship between associated objects, indicating that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone.
The invention provides an efficient, integrated recognition method for table structures. To handle tilted tables and the diversity of table structures, image segmentation is innovatively used for table detection, and a mechanism combining rules with deep learning is used for structure identification, greatly improving the robustness and generality of the algorithm.
Addressing the difficulty of table identification, the invention accounts for the multi-style, multi-layout and structurally complex nature of tables. Based on an end-to-end, unified solution idea, it fully combines the advantages of convolutional-neural-network image segmentation, graph convolutional networks and traditional rule-based analysis, realizing an integrated pipeline from table detection to table structure identification that is data-driven, unified and independent of any specific table style, and achieving good results on a variety of tables. The processing of wired and wireless tables is illustrated in Figs. 7, 9 and 14.
Examples
The first step: image preprocessing
This step performs a series of preprocessing operations on the input image, which contains one or more tables. Since rows in most tables are tightly packed with small row spacing, the design first applies a vertical stretching transform at this stage to increase the pixel distance between rows and improve their separability. Preprocessing also includes aspect-ratio-preserving size normalization and boundary zero-padding, so that the image size meets the neural network's requirements while global and local feature information is preserved as much as possible. During training, this stage also performs the necessary data augmentation, such as image affine transforms (rotation, shear, scale, etc.) and color distortion, so that the distribution of training samples better matches the underlying real distribution, alleviating possible data scarcity and improving the robustness and invariance of the learned model. The invention also introduces dilation as a data augmentation: the input image is first binarized, and all pixels are then dilated with a 2×2 kernel, which enlarges the black-pixel regions of the binary image. The dilated binary images not only enlarge the sample set but also simulate blurred black-and-white table images, improving model robustness. At prediction time, the algorithm only performs size normalization.
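A minimal OpenCV sketch of these preprocessing operations follows; the target size, the vertical stretch factor and the Otsu binarization are illustrative assumptions rather than values given in the patent.

```python
import cv2
import numpy as np

def preprocess(img, target=1024, v_stretch=1.2):
    """Sketch of the preprocessing: vertical stretch, aspect-ratio-preserving
    resize, and zero-padding to a fixed square canvas."""
    h, w = img.shape[:2]
    # 1. Stretch vertically to widen the pixel gap between table rows.
    img = cv2.resize(img, (w, int(h * v_stretch)))
    # 2. Aspect-ratio-preserving size normalization.
    h, w = img.shape[:2]
    scale = target / max(h, w)
    img = cv2.resize(img, (int(w * scale), int(h * scale)))
    # 3. Zero-pad to a target x target canvas (color input assumed).
    h, w = img.shape[:2]
    canvas = np.zeros((target, target, 3), dtype=img.dtype)
    canvas[:h, :w] = img
    return canvas

def dilate_augment(img):
    """Training-time dilation augmentation: binarize, then dilate with a
    2x2 kernel so the black-pixel regions grow, simulating blurred scans."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # cv2.dilate grows white pixels, so invert first to grow black strokes.
    grown = cv2.dilate(cv2.bitwise_not(binary), np.ones((2, 2), np.uint8))
    return cv2.bitwise_not(grown)
```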
The second step: Table region prediction
This step classifies the image at the pixel level with image segmentation to locate the table's actual position. Compared with image segmentation methods based on object-detection results, such as Mask R-CNN, this approach does not need to segment on top of a detection result, is not constrained by the detected object's minimum enclosing rectangle, and therefore achieves better edge accuracy.
In this step the algorithm adopts an image segmentation model with an Encoder-Decoder backbone: the Encoder downsamples low-resolution representations from high-resolution ones by convolution, and the Decoder upsamples high-resolution representations from low-resolution ones by transposed convolution or interpolation. The design innovatively adopts the HRNet model as the Encoder; the parallel multi-resolution sub-networks of HRNet generate multi-resolution representations, so a semantically rich high-resolution representation is maintained throughout and the information loss caused by downsampling is avoided. HRNet starts from a high-resolution sub-network as the first stage, gradually adds high-to-low resolution sub-networks as further stages, and connects the multi-resolution sub-networks in parallel. The model also introduces a Multi-Resolution Fusion Module to exchange and fuse feature information across the multi-resolution representations, yielding high-resolution representations that are semantically richer and spatially more accurate. In the fusion module, the algorithm uses 3×3 convolutions with stride 2 to extract lower-resolution representations from higher-resolution ones, and bilinear interpolation to recover higher-resolution representations from lower-resolution ones. The Encoder finally produces feature maps at four scales, with spatial sizes 1/2, 1/4, 1/8 and 1/16 of the original image.
The Decoder part of the model first uses an Atrous Spatial Pyramid Pooling (ASPP) module to sample the smallest feature map from the previous stage in parallel with atrous convolutions of different sampling rates, helping the model capture features at more scales; the convolution kernel sizes are 1, 3 and 3 and the dilation rates are 1, 6 and 12, with padding keeping the output at the input size. Next, the Decoder progressively doubles the spatial dimension of the small feature maps by transposed convolution and concatenates them with the same-sized feature maps from the Encoder. The process is illustrated in Fig. 5, where s2, s4, s8 and s16 are the Encoder feature maps at 1/2, 1/4, 1/8 and 1/16 of the original image size.
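As a concrete illustration, a minimal PyTorch sketch of this ASPP-plus-upsampling step follows; only the kernel sizes (1, 3, 3) and dilation rates (1, 6, 12) come from the text, while the channel widths, the 2×2 transposed-convolution kernel and the dummy tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel atrous convolutions with kernel sizes (1, 3, 3) and
    dilation rates (1, 6, 12); padding keeps the spatial size unchanged."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 1, padding=0, dilation=1),
            nn.Conv2d(in_ch, out_ch, 3, padding=6, dilation=6),
            nn.Conv2d(in_ch, out_ch, 3, padding=12, dilation=12),
        ])
        self.project = nn.Conv2d(3 * out_ch, out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

# Decoder step: a transposed convolution doubles the spatial size of the
# smallest map, which is then concatenated with the same-sized encoder map.
aspp = ASPP(256, 256)
up = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
s16 = torch.randn(1, 256, 64, 64)    # dummy 1/16-scale feature map
s8 = torch.randn(1, 256, 128, 128)   # dummy 1/8-scale encoder feature map
fused = torch.cat([up(aspp(s16)), s8], dim=1)   # shape (1, 512, 128, 128)
```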
The Decoder finally uses a 1×1 convolution to generate a Mask image of the same size as the original image with depth 2, realizing pixel-level prediction. Because a pixel may belong to a wired table region, a wireless table region or a non-table region, the segmentation model outputs two Mask prediction images, classifying wired versus wireless tables while accurately computing the table region. Each pixel value in a Mask image lies in the range 0 to 1, and the values in the two Mask images represent the confidence that the pixel belongs to a wired or a wireless table, respectively.
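A sketch of this prediction head, under the assumptions of a 64-channel decoder output, a sigmoid to map scores into [0, 1], and a 0.5 cut-off:

```python
import torch
import torch.nn as nn

decoder_out = torch.randn(1, 64, 512, 512)   # dummy fused decoder output
head = nn.Conv2d(64, 2, kernel_size=1)       # 1x1 conv -> depth-2 mask
masks = torch.sigmoid(head(decoder_out))     # per-pixel confidences in [0, 1]
wired_mask, wireless_mask = masks[0, 0], masks[0, 1]
is_wired_px = wired_mask > 0.5               # 0.5 threshold is an assumption
```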
The classification result output by this step helps the model automatically select the corresponding structure identification method.
The third step: Table image correction
The first goal of this step is to fit the table edges from the Mask prediction image obtained in the previous step to compute the complete table region, and then to separate the table region from the original image by projective transformation, forming a new picture.
This step first computes the contour around the table from the Mask image, then detects all straight lines in the contour with a Hough transform operator, merges those line segments that meet the merging conditions, and finally computes the precise table position from the positions of all lines. The algorithm structure of this step is shown in Fig. 7.
The projective transformation in this step ensures that, in most cases, the new picture contains only table content and that most cells in the table are axis-aligned. This eliminates interference from non-table content in the original image, reduces the difficulty of the recognition task, and further improves table recognition accuracy.
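For illustration, a simplified OpenCV sketch of this correction step follows; it replaces the Canny + Hough line fitting and merging with a minimum-area-rectangle shortcut on the largest mask contour, so it is a stand-in under stated assumptions rather than the patent's exact procedure.

```python
import cv2
import numpy as np

def rectify_table(image, table_mask, thresh=0.5):
    """Recover the table quadrilateral from the predicted mask and warp it
    upright with a projective transform. Threshold is an assumption."""
    binary = (table_mask > thresh).astype(np.uint8) * 255
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    quad = cv2.boxPoints(cv2.minAreaRect(max(contours, key=cv2.contourArea)))
    # Order the corners (tl, tr, br, bl), then warp to an upright rectangle.
    quad = quad[np.argsort(quad[:, 1])]               # sort by y
    tl, tr = sorted(quad[:2], key=lambda p: p[0])     # top two, by x
    bl, br = sorted(quad[2:], key=lambda p: p[0])     # bottom two, by x
    w = int(max(np.linalg.norm(tr - tl), np.linalg.norm(br - bl)))
    h = int(max(np.linalg.norm(bl - tl), np.linalg.norm(br - tr)))
    src = np.float32([tl, tr, br, bl])
    dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    M = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, M, (w, h))
```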
The fourth step: Table structure identification
This step uses the table classification result from the second step to apply a targeted structure identification method to the table region image obtained in the third step.
For wired tables, the algorithm uses explicit separation-line detection. It first predicts the positions of the explicit table separation lines, and the content and positions of the text instances, with the image segmentation model, and then computes each text instance's position in the table structure with a post-processing algorithm; a detailed block diagram is shown in Fig. 9. The post-processing algorithm first computes a contour map of the explicit separation lines from the separation-line Mask image, then extracts a contour skeleton of the separation lines with a boundary erosion algorithm, computes all straight lines from the skeleton with the Hough transform and merges those meeting the merging conditions, derives the cell positions by computing the intersections of all horizontal and vertical lines, and finally computes the table structure information from the relative positions of the table cells and the text instances.
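A rough OpenCV sketch of the grid-extraction idea follows; morphological opening with long thin kernels stands in for the Hough fitting and line merging, and the kernel lengths and threshold are assumptions.

```python
import cv2
import numpy as np

def grid_corners(line_mask, thresh=0.5):
    """Thin the predicted separation lines into horizontal/vertical sets and
    read candidate cell corners off their intersections."""
    binary = (line_mask > thresh).astype(np.uint8) * 255
    # Opening with long thin kernels separates the two line orientations.
    horiz = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                             cv2.getStructuringElement(cv2.MORPH_RECT, (25, 1)))
    vert = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                            cv2.getStructuringElement(cv2.MORPH_RECT, (1, 25)))
    # Intersections of horizontal and vertical lines give cell corners.
    joints = cv2.bitwise_and(horiz, vert)
    ys, xs = np.nonzero(joints)
    return np.column_stack([xs, ys])   # corner coordinates, to be clustered
```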
For wireless tables, on the other hand, the algorithm employs a graph convolutional neural network. In this branch, ResNet serves as the feature extractor for the input table image, and an OCR engine extracts the content and position of each text instance in the table. Each node of the graph convolutional network corresponds to one text instance; its node feature is formed by concatenating the instance's position features, bounding box background features, row background features and column background features. The position features consist of the coordinates of the bounding box's upper-left and lower-right corners. The algorithm uses RoIPooling to extract the corresponding bounding box, row and column background features from the feature map at the text instance's position, as shown in Fig. 11.
RoIPooling obtains a fixed-size feature map at the text instance's position; the algorithm then averages this map globally along the width and height dimensions, retaining only the features of the depth dimension.
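The following sketch illustrates this node-feature construction; torchvision's roi_align is used as a stand-in for the RoIPooling named here, and the 7×7 pool size, uniform spatial scale and coordinate normalization are assumptions.

```python
import torch
from torchvision.ops import roi_align

def node_features(fmap, boxes, img_w, img_h):
    """fmap: (1, C, Hf, Wf) backbone feature map; boxes: (N, 4) float
    text-instance boxes as (x1, y1, x2, y2) in image coordinates."""
    n = boxes.shape[0]
    idx = torch.zeros(n, 1)                     # all boxes from image 0

    def pooled(b):
        r = roi_align(fmap, torch.cat([idx, b], dim=1), output_size=7,
                      spatial_scale=fmap.shape[-1] / img_w)
        return r.mean(dim=(2, 3))               # global average over H and W

    zeros = torch.zeros(n)
    w_col = torch.full((n,), float(img_w))
    h_col = torch.full((n,), float(img_h))
    box_f = pooled(boxes)                                             # box
    row_f = pooled(torch.stack([zeros, boxes[:, 1], w_col, boxes[:, 3]], 1))  # row band
    col_f = pooled(torch.stack([boxes[:, 0], zeros, boxes[:, 2], h_col], 1))  # column band
    pos = boxes / torch.tensor([img_w, img_h, img_w, img_h], dtype=boxes.dtype)
    return torch.cat([pos, box_f, row_f, col_f], dim=1)               # (N, 4 + 3C)
```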
Unlike ordinary convolution, the graph convolutional network is not restricted to the fixed adjacency of grid-structured data such as images; instead it selects, by a feature-similarity metric, the nodes in the graph structure similar to the current node for feature extraction and update. In a multi-layer graph convolutional network, the neighbor nodes of each node keep changing as node features are updated, which effectively enlarges the network's receptive field.
In the invention, the graph convolution operation uses Euclidean distance as the node-feature similarity metric to compute the several nearest neighbors of each node; the node feature and its differences with each neighbor feature are then concatenated and fed into a fully connected network for feature extraction, as shown in Fig. 13. Each graph convolution layer averages the extraction results over the current node and all its neighbors to produce the updated node feature. In this design, the node feature represents the global characteristics of the graph structure, while the difference between node and neighbor features represents its local characteristics; stacking multiple graph convolution layers helps the algorithm fully extract both. Finally, three multi-layer perceptron networks (classifiers) use the updated node features to judge the row, column and cell structural relations between nodes.
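This update rule closely resembles EdgeConv; a minimal PyTorch sketch follows, with the hidden dimensions and k = 12 (within the 10-15 range stated earlier) as assumptions.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """For each node: find k nearest neighbours by Euclidean feature
    distance, feed [node_feat, node_feat - neighbour_feat] through a shared
    fully connected net, and average the k results as the updated feature."""
    def __init__(self, dim, k=12):
        super().__init__()
        self.k = k
        self.fc = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, x):                # x: (num_nodes, dim)
        d = torch.cdist(x, x)            # pairwise Euclidean distances
        knn = d.topk(self.k + 1, largest=False).indices[:, 1:]  # drop self
        neigh = x[knn]                   # (n, k, dim)
        centre = x.unsqueeze(1).expand_as(neigh)
        edge = torch.cat([centre, centre - neigh], dim=-1)  # global + local
        return self.fc(edge).mean(dim=1)  # average over the k neighbours
```

Three separate multi-layer perceptron heads would then take pairs of updated node features and classify the same-row, same-column and same-cell relations, as described above.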
Fig. 14 is a structural diagram of a table structure recognition algorithm based on a graph convolutional neural network, and fig. 15 is an exemplary graph of a table structure recognition result.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be apparent to those skilled in the art that the above implementation may be implemented by means of software plus necessary general purpose hardware platform, or of course by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present invention and the scope of the claims, which are to be protected by the present invention.

Claims (9)

1. An end-to-end form detection and structure identification method, wherein an input original image contains a form, is characterized by comprising the following steps:
Step 1: an image preprocessing step, stretching the original image in the vertical direction, normalizing its size with the aspect ratio unchanged, and zero-padding the boundary to form a preprocessed image;
step 2: a table region prediction step of determining a table region in the preprocessed image by using an encoder-decoder model as a main structure, and classifying the table region into a wired table image and a wireless table image;
Step 3: a form image correction step of separating a corrected form area image containing only the form area from the preprocessed image based on the determined form area;
step 4: a table structure recognition step of performing table structure recognition on the table region image in different manners according to the classification into the wired table image and the wireless table image,
wherein in step 4, for a table region image belonging to the wireless table image, the method specifically comprises:
taking each text instance as a node and extracting node features, where a node feature is formed by concatenating the position features, bounding box background features, row background features and column background features of the text instance;
computing, in feature space, the similarity between a node a and all other nodes, and selecting several nearest-neighbor nodes around node a;
concatenating the node feature A of node a with the similarity differences between A and the node features of each nearest-neighbor node, feeding the result into a trained graph convolutional neural network, and outputting the updated node feature A';
repeating the above operations to obtain updated node features for all nodes in the table region image;
and using the updated node features in three multi-layer perceptron networks to determine the row, column and cell structural relationships between each node and its several nearest-neighbor nodes, thereby determining and outputting the table structure information.
2. The method according to claim 1, wherein in step 2, the encoding part of the encoder-decoder model downsamples a low-resolution representation from a first high-resolution representation by convolution; the decoding part upsamples a second high-resolution representation from the low-resolution representation by transposed convolution or interpolation.
3. The method of claim 2, wherein the encoding portion operates as follows:
And generating multi-resolution characterization by adopting a mechanism of parallel connection of multi-resolution sub-networks in the high-resolution network, introducing a multi-resolution fusion module to realize feature information exchange and fusion among the multi-resolution characterization, and finally outputting a first feature map with multiple scales.
4. A method according to claim 3, wherein the decoding part operates as follows:
Firstly, an atrous spatial pyramid pooling module samples the smallest of the first feature maps in parallel with atrous convolutions of different sampling rates, and the spatial dimensions of the other first feature maps are then doubled by transposed convolution, forming a plurality of second feature maps equal in number to the first feature maps;
Splicing the second characteristic diagram with the first characteristic diagram with the same size from the coding part, and finally, convoluting to generate two mask predicted images with the same size as the preprocessed image;
the form area is thus determined and distinguished into a wired form image and a wireless form image.
5. The method according to claim 1, wherein the step 3 specifically comprises:
Step 31: calculating the contour around the table with the Canny edge detection operator according to the mask prediction image;
step 32: detecting all straight lines in the contour by using a Hough transform operator and merging the straight lines with part meeting the merging condition;
Step 33: an accurate form position is calculated from the positions of all the straight lines, whereby a corrected form area image containing only form areas is separated.
6. The method according to claim 1, wherein in the step 4, for the form area image belonging to the wired form image, specifically comprises:
Calculating a contour map of the explicit separation lines with the Canny edge detection operator according to the separation-line mask prediction image;
extracting a contour skeleton map of the separation lines with a boundary erosion method;
calculating all straight lines from the contour skeleton map with the Hough transform method, and merging those line segments that meet the merging conditions;
The positions of the table cells are obtained by calculating the positions of all intersecting points of the horizontal lines and the vertical lines;
Extracting the content and the position of a text instance in the form;
And calculating the table structure information according to the relative positions of the table cells and the text examples and outputting the table structure information.
7. The method of claim 1, wherein the location features of the text instance consist of coordinates of an upper left corner and a lower right corner of the bounding box.
8. The method of claim 1, wherein bounding box background features, row background features and column background features of the text instance are extracted from the feature map by way of extraction of region-of-interest pooled image features.
9. An end-to-end form detection and structure identification device, characterized in that it operates based on the method according to any one of claims 1 to 8, characterized in that it comprises the following components:
The image preprocessing unit is used for stretching the original image in the vertical direction, normalizing its size with the aspect ratio unchanged, and zero-padding the boundary to form a preprocessed image;
a table region prediction unit for determining a table region in the preprocessed image with the encoder-decoder model as a main structure, and classifying the table region into a wired table image and a wireless table image;
A form image correction unit for separating a corrected form region image containing only the form region from the preprocessed image based on the determined form region;
and the table structure identification unit is used for carrying out table structure identification on the table region image in different modes according to the classified wired table image and wireless table image.
CN202110396302.5A, filed 2021-04-13 (priority 2021-04-13): End-to-end form detection and structure identification method and system. Status: Active. Granted as CN113435240B.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110396302.5A | 2021-04-13 | 2021-04-13 | End-to-end form detection and structure identification method and system (CN113435240B)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110396302.5A | 2021-04-13 | 2021-04-13 | End-to-end form detection and structure identification method and system (CN113435240B)

Publications (2)

Publication Number | Publication Date
CN113435240A | 2021-09-24
CN113435240B | 2024-06-14

Family

Family ID: 77753027

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110396302.5A | End-to-end form detection and structure identification method and system (Active, CN113435240B) | 2021-04-13 | 2021-04-13

Country Status (1)

Country | Link
CN | CN113435240B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113936287A (en) * 2021-10-20 2022-01-14 平安国际智慧城市科技股份有限公司 Table detection method and device based on artificial intelligence, electronic equipment and medium
CN114565927A (en) * 2022-03-03 2022-05-31 上海恒生聚源数据服务有限公司 Table identification method and device, electronic equipment and storage medium
CN116311301B (en) * 2023-02-17 2024-06-07 北京感易智能科技有限公司 Wireless form identification method and system
CN116092105B (en) * 2023-04-07 2023-06-16 北京中关村科金技术有限公司 Method and device for analyzing table structure
CN116257459B (en) * 2023-05-16 2023-07-28 北京城建智控科技股份有限公司 Form UI walk normalization detection method and device
CN116311310A (en) * 2023-05-19 2023-06-23 之江实验室 Universal form identification method and device combining semantic segmentation and sequence prediction

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783735A (en) * 2020-07-22 2020-10-16 欧冶云商股份有限公司 Steel document analytic system based on artificial intelligence
CN111814722A (en) * 2020-07-20 2020-10-23 电子科技大学 Method and device for identifying table in image, electronic equipment and storage medium
WO2021053687A1 (en) * 2019-09-18 2021-03-25 Tata Consultancy Services Limited Deep learning based table detection and associated data extraction from scanned image documents

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6047112A (en) * 1992-06-30 2000-04-04 Discovision Associates Technique for initiating processing of a data stream of encoded video information
EP2165289A4 (en) * 2007-06-11 2012-07-04 Hand Held Prod Inc Optical reader system for extracting information in a digital image
CN105184265A (en) * 2015-09-14 2015-12-23 哈尔滨工业大学 Self-learning-based handwritten form numeric character string rapid recognition method
CN105426834B (en) * 2015-11-17 2019-02-22 中国传媒大学 A method of form image detection is carried out based on projection properties and structure feature
US10268883B2 (en) * 2017-08-10 2019-04-23 Adobe Inc. Form structure extraction network
US10339212B2 (en) * 2017-08-14 2019-07-02 Adobe Inc. Detecting the bounds of borderless tables in fixed-format structured documents using machine learning
CN108416279B (en) * 2018-02-26 2022-04-19 北京阿博茨科技有限公司 Table analysis method and device in document image
US10846524B2 (en) * 2018-11-14 2020-11-24 Adobe Inc. Table layout determination using a machine learning system
US11238277B2 (en) * 2019-06-16 2022-02-01 Way2Vat Ltd. Systems and methods for document image analysis with cardinal graph convolutional networks
US11915465B2 (en) * 2019-08-21 2024-02-27 Eygs Llp Apparatus and methods for converting lineless tables into lined tables using generative adversarial networks
US11048867B2 (en) * 2019-09-06 2021-06-29 Wipro Limited System and method for extracting tabular data from a document
CN111753727B (en) * 2020-06-24 2023-06-23 北京百度网讯科技有限公司 Method, apparatus, device and readable storage medium for extracting structured information
CN111860257B (en) * 2020-07-10 2022-11-11 上海交通大学 Table identification method and system fusing multiple text features and geometric information
CN111950453B (en) * 2020-08-12 2024-02-13 北京易道博识科技有限公司 Random shape text recognition method based on selective attention mechanism

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021053687A1 (en) * 2019-09-18 2021-03-25 Tata Consultancy Services Limited Deep learning based table detection and associated data extraction from scanned image documents
CN111814722A (en) * 2020-07-20 2020-10-23 电子科技大学 Method and device for identifying table in image, electronic equipment and storage medium
CN111783735A (en) * 2020-07-22 2020-10-16 欧冶云商股份有限公司 Steel document analytic system based on artificial intelligence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"多特征融合的文档图像版面分析";应自炉,赵毅鸿,宣晨,邓文博;《中国图象图形学报》;第1.3节 *

Also Published As

Publication number | Publication date
CN113435240A | 2021-09-24

Similar Documents

Publication Publication Date Title
CN113435240B (en) End-to-end form detection and structure identification method and system
CN109948510B (en) Document image instance segmentation method and device
WO2019192397A1 (en) End-to-end recognition method for scene text in any shape
Cheung et al. An Arabic optical character recognition system using recognition-based segmentation
CN111612008B (en) Image segmentation method based on convolution network
JP7246104B2 (en) License plate identification method based on text line identification
Dal Poz et al. Automated extraction of road network from medium-and high-resolution images
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
CN109740515B (en) Evaluation method and device
CN110751154B (en) Complex environment multi-shape text detection method based on pixel-level segmentation
CN112949455B (en) Value-added tax invoice recognition system and method
CN115457565A (en) OCR character recognition method, electronic equipment and storage medium
Kölsch et al. Recognizing challenging handwritten annotations with fully convolutional networks
CN111553349A (en) Scene text positioning and identifying method based on full convolution network
CN115331245A (en) Table structure identification method based on image instance segmentation
CN116311310A (en) Universal form identification method and device combining semantic segmentation and sequence prediction
CN113033558A (en) Text detection method and device for natural scene and storage medium
CN110956088A (en) Method and system for positioning and segmenting overlapped text lines based on deep learning
CN113688821A (en) OCR character recognition method based on deep learning
CN114299383A (en) Remote sensing image target detection method based on integration of density map and attention mechanism
CN114266794A (en) Pathological section image cancer region segmentation system based on full convolution neural network
CN113657225B (en) Target detection method
CN114758341A (en) Intelligent contract image identification and contract element extraction method and device
CN114067339A (en) Image recognition method and device, electronic equipment and computer readable storage medium
CN111832497B (en) Text detection post-processing method based on geometric features

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant