CN108132916B

CN108132916B - Method for analyzing PDF table data and storage medium

Info

Publication number: CN108132916B
Application number: CN201711235867.5A
Authority: CN
Inventors: 蓝树和; 段涵瑞; 薛艳英; 江汉祥
Original assignee: Xiamen Meiya Pico Information Co Ltd
Current assignee: Xiamen Meiya Pico Information Co Ltd
Priority date: 2017-11-30
Filing date: 2017-11-30
Publication date: 2022-02-11
Anticipated expiration: 2037-11-30
Also published as: CN108132916A

Abstract

The invention provides a method and a storage medium for analyzing PDF table data, wherein the method comprises the following steps: acquiring coordinates of each line segment and each character of each page of PDF; dividing cells according to the intersection points of the line segments, and acquiring rectangular coordinates corresponding to the cells; and acquiring the field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates. The invention accurately marks out the cells and the characters in the cells according to the relation between each line segment and each character coordinate, accurately extracts the PDF table and the data in the table, and realizes the accurate, convenient and automatic analysis of the PDF table.

Description

Method for analyzing PDF table data and storage medium

Technical Field

The invention relates to the field of data analysis, in particular to a method and a storage medium for analyzing PDF table data.

Background

Prior-art PDF parsing generally targets characters. A table inside a PDF is purely visual: there is no real table object, each cell is delimited only by line segments, and the PDF format records only the position information of characters, line segments, pictures, and the like.

Existing parsing approaches extract only the characters in a table, yet table data must correspond strictly to the columns of its header. Characteristics specific to PDF, such as tables that continue across consecutive pages, unpredictable line wrapping within a single cell, and watermarks, make simple character-based splitting impractical. The alternative is to analyze the distinguishing features of each table format and then write a corresponding script to import the data into a database; the workload of doing so is very large, so automatic extraction of PDF table data and its storage in a database are difficult to achieve.

Moreover, PDF parsing products on the market are mostly closed source and process such table data simply as characters, so it is difficult to match data to headers and to determine the correlation between data rows.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method and a storage medium for analyzing PDF table data that achieve fully automatic and accurate analysis of table data and are highly practical.

In order to solve the technical problems, the invention adopts the technical scheme that:

a method of parsing PDF tabular data, comprising:

acquiring coordinates of each line segment and each character of each page of PDF;

dividing cells according to the intersection points of the line segments, and acquiring rectangular coordinates corresponding to the cells;

and acquiring the field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates.

The invention provides another technical scheme as follows:

a computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:

and acquiring the field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates.

The beneficial effects of the invention are as follows: a method for visually analyzing PDF table data is provided that does not require field-parsing rules tailored to a specific PDF file and does not require the table header to be determined; field block data can be parsed and organized automatically and accurately, and the method is widely applicable. Specifically, the cells and the characters within them are accurately delimited according to the relation between the coordinates of each line segment and each character, the tables of a PDF and the data in those tables are accurately extracted, the process is highly automated, and importing PDF tables is greatly simplified. The invention thus improves the accuracy and convenience of PDF table data analysis.

Drawings

FIG. 1 is a diagram of a PDF table in a single table format;

FIG. 2 is a schematic diagram of a random blank cell;

FIG. 3 is a schematic diagram of a page spread cell;

FIG. 4 is a table diagram of a multi-layer watermark;

FIG. 5 is a flow chart illustrating a method for analyzing PDF table data according to the present invention;

FIG. 6 is a schematic diagram of line segment intersections;

FIG. 7 is a schematic diagram of line segment compositions forming an active cell;

FIG. 8 is a flowchart illustrating the first embodiment.

Detailed Description

In order to explain technical contents, achieved objects, and effects of the present invention in detail, the following description is made with reference to the accompanying drawings in combination with the embodiments.

The key concept of the invention is as follows: the cells and the characters within them are accurately delimited according to the relation between the coordinates of each line segment and each character, the tables of a PDF and the data in those tables are accurately extracted, and PDF tables are thereby parsed accurately, conveniently and automatically.

Referring to fig. 5, the present invention provides a method for analyzing PDF table data, comprising:

Further, the method also comprises the following steps:

determining the cells corresponding to each row according to the median lines of the cells.

As can be seen from the above description, whether cells belong to the same row is determined by whether their median lines lie within a given error range of one another, and the cells are normalized accordingly so as to obtain an ordered list.

Further, the method also comprises the following steps:

converting each page of PDF into an image data form;

if, when two consecutive pages are progressively brought together along the Y-axis direction, a corresponding vertical line segment can be obtained on each page and a horizontal line segment can be obtained on each of those vertical line segments, merging the cells at the junction of the two pages.

As can be seen from the above description, image-based visual feature analysis can determine whether the cells that meet at the junction of two consecutive pages belong to the same cell, that is, whether a cell has been split by pagination; if so, the cells are merged. Split cells are thus merged automatically and accurately.

Further, the step of merging the cells at the junction of two consecutive pages, if a corresponding vertical line segment can be obtained on each page and a horizontal line segment can be obtained on each of those vertical line segments when the pages are progressively brought together along the Y-axis direction, specifically comprises:

presetting the upper left corner of each page of PDF as a coordinate origin;

starting from the maximum value of a Y axis of a current page, advancing towards the direction of an origin to obtain a vertical line segment, and then judging whether a horizontal line segment intersected with the vertical line segment exists on the vertical line segment or not; and at the same time

Starting from a Y-axis zero coordinate of a next page, advancing towards the direction of the maximum value to obtain a vertical line segment, and then judging whether a horizontal line segment intersected with the vertical line segment exists on the vertical line segment or not;

if yes, combining the cells corresponding to the adjacent vertical line segments in the current page and the corresponding cells in the next page into the same cell.

As can be seen from the above description, a visual algorithm can be used to judge the correlation between cells across PDF pages, split cells are merged automatically, and the presentation of the finally obtained table is improved.

Further, the obtaining of the field block corresponding to each cell according to the inclusion relationship between the coordinates of the character and the rectangular coordinates specifically includes:

acquiring characters corresponding to each non-blank rectangular coordinate according to whether the coordinates of the characters are located in the rectangular coordinates;

according to a matrix coefficient of the character mapped to a user visual space from a coordinate space of PDF, eliminating watermark characters in each non-blank rectangular coordinate;

characters corresponding to the non-blank rectangular coordinates form a field block, and blank fields corresponding to blank rectangular coordinates are supplemented;

and acquiring the field block corresponding to each cell.

As can be seen from the above description, watermark characters can be effectively removed, which guarantees the accuracy of the parsed table. Meanwhile, blank fields are supplied for blank cells, so that blank cells stay aligned with their corresponding headers. This ensures the integrity and accuracy of the finally obtained table.

Further, the obtaining of the coordinates of each line segment and the coordinates of each character of each page of PDF specifically includes:

rendering the line segments and characters of each page of PDF to a CImage handle, and capturing the coordinates of each line segment and each character during rendering.

According to the above description, the line segments and the characters are rendered to the CImage handle, so that the structured PDF data is converted into the image data convenient for analysis and processing, the subsequent detection and analysis are conveniently and directly carried out according to the image data, the characteristic data of the line segments and the characters is obtained, and finally the required data is obtained according to the characteristic data.

Further, the cell is divided according to the intersection point of the line segments, and the rectangular coordinate corresponding to each cell is obtained, specifically:

if the distance between one end point coordinate of one line segment and one end point coordinate of another line segment is within a preset first threshold value range, judging that the line segment is intersected with the another line segment;

and if the four adjacent line segments are intersected end to end in sequence and the formed area exceeds a preset second threshold range, acquiring the coordinates of the four line segments and marking the coordinates as rectangular coordinates corresponding to the cell formed by the four line segments.

As can be seen from the above description, since the coordinates of the PDF user space are of a floating point type, it is determined whether the corresponding line segments intersect by correspondingly determining whether the distance between two points is within a certain threshold range. The cells can be conveniently and accurately divided according to the number of intersection points in the follow-up process.

The invention provides another technical scheme as follows:

Further, the program can also realize the steps of:

converting each page of PDF into an image data form;

determining a cell corresponding to each row according to the median line of the cells;

Further, the step of obtaining coordinates of each line segment and each character of each page of PDF specifically comprises:

rendering the line segments and characters of each page of PDF to a CImage handle, and capturing the coordinates of each line segment and each character during rendering;

dividing cells according to the intersection points of the line segments, and acquiring rectangular coordinates corresponding to the cells, wherein the steps are as follows:

if four adjacent line segments are intersected end to end in sequence and the formed area exceeds a preset second threshold range, acquiring coordinates of the four line segments and marking the coordinates as rectangular coordinates corresponding to a cell formed by the four line segments;

acquiring a field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates, specifically:

and acquiring the field block corresponding to each cell.

Example one

This embodiment provides a method for analyzing PDF table data, which is suitable for parsing a table in PDF-format data to obtain the corresponding table data and to facilitate subsequent editing operations. For example, in front-end data cleaning, most of the telephone bills and other bills provided by customers are tabular PDFs; with this embodiment such PDFs can be extracted into CSV format and imported automatically into a database for analysis.

As shown in FIGS. 1-4, PDF tables commonly take several forms in the prior art. Specifically, FIG. 1 corresponds to a single table; FIG. 2 corresponds to random blank cells; FIG. 3 corresponds to a page-spread cell; and FIG. 4 corresponds to a multi-layer watermark. Existing PDF table parsing is mostly closed, and table data is processed simply as characters, so it is difficult to match data to headers and to judge the correlation between rows.

In view of the above problems, this embodiment addresses the parsing of these different table forms through several specific implementations.

Referring to fig. 8, the method for parsing PDF table data of the present embodiment includes:

S1: converting each page of PDF into an image data form; setting the upper left corner of each PDF page as a coordinate origin; acquiring coordinates of each line segment and each character of each page of PDF;

Step S1 specifically includes:

S101: loading a PDF file, and acquiring the object of each page in a loop; the object is a pointer to a page of the PDF and is used to acquire the PDF data of each page in sequence;

S102: rendering the line segments and the characters of each page of PDF to a CImage handle, and setting the upper left corner of each page of the resulting image data as a coordinate origin; the coordinates of the line segments and characters are captured while rendering.

Here, rendering to a CImage handle serves the following purposes: 1. the structured PDF data is copied and converted into image data; 2. processing is carried out independently on the copy, so the original PDF data is preserved and the source file is protected against loss; 3. image data containing only line segments can be obtained, which eliminates interference for subsequent image binarization and line detection; 4. once converted into image data, the page is easier to process, because the required data can be obtained directly from features found by image detection.

The coordinates of the line segments and the characters can be captured at the same time as rendering; the rendered image itself is used only to obtain a picture of the pure line segments. Acquiring the coordinates of a line segment means acquiring the coordinates of its pair of end points.

S2: dividing cells according to the intersection points of the line segments, and acquiring rectangular coordinates corresponding to the cells;

Because the coordinates of the PDF user space are floating-point values, in the xy coordinate space of the image data two line segments are considered to intersect when the distance between an end point of one segment and an end point of the other is within a certain threshold range. As shown in FIG. 6, segment A with end points (x1, y1) and (x2, y2) intersects segment B with end points (x3, y3) and (x4, y4). A cell exists where four line segments in this coordinate space have four intersection points; the enclosed area is regarded as a valid cell when it exceeds a certain threshold. As shown in FIG. 7, four adjacent line segments A, B, C and D form a cell.

Therefore, step S2 specifically includes:

S201: if the distance between the coordinate of one end point of one line segment and the coordinate of one end point of another line segment is within a preset first threshold range, judging that the two line segments intersect;

S202: if four adjacent line segments intersect end to end in sequence and the enclosed area exceeds a preset second threshold range, judging that the four line segments form a valid cell, acquiring the coordinates of the four line segments, and marking them as the rectangular coordinates corresponding to the cell formed by the four line segments.

S203: and acquiring the corresponding rectangular coordinates of each cell.

S3: and acquiring the field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates.

A field block refers to the ordered set of all valid characters within a single cell of the PDF (excluding watermark characters that fall within the cell).

Step S3 specifically includes:

S301: judging whether each rectangular coordinate contains characters according to the inclusion relation between the coordinates of the characters and the rectangular coordinates, that is, whether the coordinates of a character fall within the rectangular coordinates of a cell; if yes, executing S302; if not, executing S303. This determination follows directly from the positional relationship of the spatial coordinates in the image format.

S302: sequentially acquiring all characters in the rectangular coordinates to form a field block corresponding to the rectangular coordinates;

S303: if a rectangular coordinate is determined to contain no characters, setting the field block corresponding to that rectangular coordinate to null, that is, supplementing a blank field block for it, so that the blank cell corresponding to the blank rectangular coordinate stays aligned with its header.
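Steps S301 to S303 amount to a point-in-rectangle test followed by an ordered join. The sketch below is illustrative only: `build_field_blocks` and its tuple layouts are hypothetical, each character is reduced to a single reference point, and characters are ordered simply by (y, x) with the origin at the top left, which assumes that the characters of one text line share the same y value.

```python
Rect = tuple[float, float, float, float]   # (x0, y0, x1, y1)
Char = tuple[float, float, str]            # (x, y, text)


def build_field_blocks(cells: list[Rect], chars: list[Char]) -> list[str]:
    """Return one field block per cell; blank cells yield an empty string."""
    blocks = []
    for x0, y0, x1, y1 in cells:
        # S301: keep characters whose reference point lies inside the cell.
        # Half-open bounds stop a character on a shared border from being
        # counted in two neighbouring cells.
        inside = [(y, x, text) for x, y, text in chars
                  if x0 <= x < x1 and y0 <= y < y1]
        # S302: read top to bottom, then left to right.
        inside.sort(key=lambda item: (item[0], item[1]))
        # S303: an empty list naturally becomes a blank field block.
        blocks.append("".join(text for _, _, text in inside))
    return blocks
```

Because a blank cell still contributes an entry, the list of field blocks keeps the same length as the list of cells, which is what keeps each value under its header.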

In one specific implementation, after the rectangular coordinates are determined to contain characters, that is, before step S302, the following step is also performed in order to handle PDF tables that contain watermarks.

The step specifically includes: excluding the watermark characters in each rectangular coordinate according to the matrix coefficients with which the characters are mapped from the coordinate space of the PDF to the user's visual space (that is, the xy coordinate space of the image data in this embodiment). Specifically, the matrix coefficients are the matrix with which a character is mapped from the coordinate space of the PDF to the user's visual space. A watermark is generally drawn at an angle, so its transformation matrix differs from that of a normal character, and whether a given character is a watermark is judged on this basis.
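One way to apply this test is sketched below, under stated assumptions. The source does not say which coefficients are compared; the sketch assumes the usual six-value PDF matrix [a b c d e f], for which a rotation by an angle θ gives a = cos θ and b = sin θ (times any scale), so upright text has b = 0. The function names and the 1-degree tolerance are hypothetical.

```python
import math

Matrix = tuple[float, float, float, float, float, float]  # [a b c d e f]


def rotation_degrees(matrix: Matrix) -> float:
    """Rotation angle encoded in the a and b coefficients of the matrix."""
    a, b = matrix[0], matrix[1]
    return math.degrees(math.atan2(b, a))


def is_watermark(matrix: Matrix, max_angle: float = 1.0) -> bool:
    """Assumption: any character drawn at a noticeable angle is a watermark."""
    return abs(rotation_degrees(matrix)) > max_angle


def drop_watermarks(chars: list[tuple[str, Matrix]]) -> list[str]:
    """Keep only the text of characters that are not rotated."""
    return [text for text, matrix in chars if not is_watermark(matrix)]
```

A table whose legitimate content is itself rotated (for example vertical header text) would need a different rule, such as comparing each character against the dominant angle on the page.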

Next, S3 of the present embodiment further includes:

S304: acquiring the field block corresponding to each cell.

In another specific implementation, S305 is further included so that multiple rows of cells can be parsed.

S305: determining the cells corresponding to each row according to the median lines of the cells. Specifically, whether cells hold data of the same row is determined from the error range between their median lines: if the y-axis coordinates of the median lines of the cells are within a certain threshold range of one another, the cells are judged to be in the same row; otherwise they are judged not to be in the same row and are assigned to different rows.
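The row grouping of S305 can be sketched as follows. This is an illustration rather than the patented code: `group_rows` and the default tolerance are hypothetical, the median line of a cell is taken to be the horizontal line through the middle of its rectangle, and each cell is compared with the first cell of the row currently being built.

```python
Rect = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def median_y(cell: Rect) -> float:
    """y coordinate of the horizontal median line of a cell."""
    return (cell[1] + cell[3]) / 2


def group_rows(cells: list[Rect], tol: float = 2.0) -> list[list[Rect]]:
    """Group cells into rows by median line, each row sorted left to right."""
    rows: list[tuple[float, list[Rect]]] = []
    for cell in sorted(cells, key=median_y):
        mid = median_y(cell)
        if rows and abs(mid - rows[-1][0]) <= tol:
            rows[-1][1].append(cell)      # within the error range: same row
        else:
            rows.append((mid, [cell]))    # otherwise start a new row
    return [sorted(row, key=lambda c: c[0]) for _, row in rows]
```

Sorting by median line first means that only the most recent row has to be checked, and sorting each row by x0 yields the ordered list mentioned in the description.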

In another embodiment, S4-S5 will be included to further enable parsing of the spread cells.

S4: converting each page of PDF data in image data form into an OpenCV Mat object.

S5: starting from the maximum value of a Y axis of a current page, advancing towards the direction of an origin to obtain a vertical line segment, and then judging whether a horizontal line segment intersected with the vertical line segment can be detected on the vertical line segment; and at the same time

Starting from a Y-axis zero coordinate of a next page, advancing towards the direction of the maximum value to obtain a vertical line segment, and then judging whether a horizontal line segment intersected with the vertical line segment can be detected on the vertical line segment;

if the two conditions are met simultaneously, combining the cells corresponding to the adjacent vertical line segments in the current page and the corresponding cells in the next page into the same cell. As shown in FIG. 3, the incomplete cells split between the two pages are merged into complete cells.
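The page-junction check of S5 can be sketched on segment coordinates instead of on an OpenCV Mat, which is a simplification of what the embodiment describes. The sketch follows one possible reading of the text, and that reading is an assumption: a cell is treated as split when a vertical line runs past the outermost horizontal line toward the page edge (the cell is not closed on that side) and is nevertheless crossed by a horizontal line further in. All names and tuple layouts are hypothetical.

```python
HLine = tuple[float, float, float]  # (y, x0, x1)
VLine = tuple[float, float, float]  # (x, y0, y1) with y0 < y1, origin top-left
Page = tuple[list[HLine], list[VLine]]


def protruding_verticals(h_lines: list[HLine], v_lines: list[VLine],
                         side: str, tol: float = 1.0) -> list[float]:
    """x positions of vertical lines left open toward 'bottom' or 'top'."""
    if not h_lines:
        return []
    ys = [y for y, _, _ in h_lines]
    limit = max(ys) if side == "bottom" else min(ys)
    found = []
    for x, y0, y1 in v_lines:
        # Does the line extend beyond the outermost horizontal line?
        sticks_out = y1 > limit + tol if side == "bottom" else y0 < limit - tol
        # Is there a horizontal line intersecting this vertical line?
        crossed = any(hx0 - tol <= x <= hx1 + tol and y0 - tol <= hy <= y1 + tol
                      for hy, hx0, hx1 in h_lines)
        if sticks_out and crossed:
            found.append(x)
    return sorted(found)


def spans_to_merge(current: Page, following: Page,
                   tol: float = 1.0) -> list[tuple[float, float]]:
    """Column spans (x_left, x_right) whose cells continue on the next page."""
    below = protruding_verticals(*current, side="bottom", tol=tol)
    above = protruding_verticals(*following, side="top", tol=tol)
    matched = [x for x in below if any(abs(x - x2) <= tol for x2 in above)]
    # Adjacent matched vertical lines bound one split cell each.
    return list(zip(matched, matched[1:]))
```

Each returned span identifies a pair of adjacent vertical lines; the cell between them at the bottom of the current page and the cell between them at the top of the next page would then be combined into one.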

In this embodiment, the method further includes the following steps:

S6: aggregating the table data and arranging it into CSV format.
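Once the field blocks have been grouped into rows, step S6 is a plain serialization. A minimal sketch using Python's standard `csv` module is given below; the function name `rows_to_csv` is hypothetical, and each row is assumed to be a list of field block strings in column order, with blank cells represented by empty strings.

```python
import csv
import io


def rows_to_csv(rows: list[list[str]]) -> str:
    """Serialize rows of field blocks to CSV text."""
    buffer = io.StringIO()
    # csv.writer quotes fields that contain commas, quotes or line breaks,
    # so a cell whose text wrapped onto several lines stays one field.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()
```

Leaving quoting to the `csv` module matters here because field blocks taken from bills can contain commas or embedded line breaks.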

The method for parsing PDF tables provided by this embodiment does not require field-parsing rules tailored to a specific PDF file and does not require the table header to be determined; it parses and organizes field block data fully automatically and accurately, and is highly practical and widely applicable. Furthermore, the embodiment uses coordinate relationships to assign characters to cells accurately, and uses a visual algorithm to judge the correlation of cells across PDF pages. In conclusion, the embodiment parses PDF table data automatically, accurately and comprehensively, and improves the accuracy and convenience of data cleaning.

Example two

This embodiment corresponds to the first embodiment, and provides a corresponding computer readable storage medium, on which a computer program is stored, which when executed by a processor can implement all the steps included in the first embodiment.

In summary, the method and the storage medium for analyzing PDF table data provided by the present invention achieve accurate, convenient, and automatic parsing of PDF tables. The method accurately parses not only single tables and multiple tables but also random blank cells, page-spread cells and cells with multi-layer watermarks, so it is highly practical and widely applicable. Furthermore, because the method parses on the basis of character coordinates and line segment coordinates rather than through the simple character-based processing of existing approaches, it is more accurate and convenient and also guarantees the correspondence between data and headers; it can also analyze the correlation between rows, which supports the parsing of many types of table.

The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to the related technical fields are included in the scope of the present invention.

Claims

1. A method for analyzing PDF table data, comprising:

acquiring a field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates;

further comprising:

converting each page of PDF into an image data form;

if, when two consecutive pages are progressively brought together along the Y-axis direction, a corresponding vertical line segment can be obtained on each page and a horizontal line segment can be obtained on each of those vertical line segments, merging the cells at the junction of the two pages;

wherein the merging of the cells at the junction of the two pages, if a corresponding vertical line segment can be obtained on each page and a horizontal line segment can be obtained on each of those vertical line segments when the pages are progressively brought together along the Y-axis direction, specifically comprises:

presetting the upper left corner of each page of PDF as a coordinate origin;

2. The method of parsing PDF table data according to claim 1, further comprising:

3. The method according to claim 1, wherein the obtaining of the field block corresponding to each cell according to the inclusion relationship between the coordinates of the character and the rectangular coordinates includes:

characters corresponding to the non-blank rectangular coordinates form a field block, and a blank field is supplemented for each blank rectangular coordinate;

and acquiring the field block corresponding to each cell.

4. The method for parsing PDF table data according to claim 1, wherein said obtaining coordinates of line segments and coordinates of characters of each page of PDF specifically comprises:

5. The method according to claim 1, wherein the dividing cells according to the line segment intersections and obtaining the rectangular coordinates corresponding to each cell are specifically:

and if the four adjacent line segments are intersected end to end in sequence and the formed area exceeds a preset second threshold range, acquiring the coordinates of the four line segments and marking the coordinates as rectangular coordinates corresponding to a cell formed by the four line segments.

6. A computer-readable storage medium, on which a computer program is stored, which program, when executed by a processor, carries out the steps of:

the program can also realize the following steps:

converting each page of PDF into an image data form;

presetting the upper left corner of each page of PDF as a coordinate origin;

7. The computer-readable storage medium of claim 6, wherein the step of obtaining coordinates of each line segment and each character of each page of PDF comprises:

and acquiring the field block corresponding to each cell.