CN108132916B - Method for analyzing PDF table data and storage medium - Google Patents

Method for analyzing PDF table data and storage medium

Info

Publication number
CN108132916B
Authority
CN
China
Prior art keywords
line segment
coordinates
page
pdf
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711235867.5A
Other languages
Chinese (zh)
Other versions
CN108132916A (en)
Inventor
蓝树和
段涵瑞
薛艳英
江汉祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meiya Pico Information Co Ltd
Original Assignee
Xiamen Meiya Pico Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meiya Pico Information Co Ltd
Priority to CN201711235867.5A
Publication of CN108132916A
Application granted
Publication of CN108132916B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18 Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Editing Of Facsimile Originals (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a method and a storage medium for analyzing PDF table data, wherein the method comprises the following steps: acquiring coordinates of each line segment and each character of each page of PDF; dividing cells according to the intersection points of the line segments, and acquiring rectangular coordinates corresponding to the cells; and acquiring the field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates. The invention accurately marks out the cells and the characters in the cells according to the relation between each line segment and each character coordinate, accurately extracts the PDF table and the data in the table, and realizes the accurate, convenient and automatic analysis of the PDF table.

Description

Method for analyzing PDF table data and storage medium
Technical Field
The invention relates to the field of data analysis, in particular to a method and a storage medium for analyzing PDF table data.
Background
In the prior art, PDF parsing generally targets characters. A table inside a PDF is purely visual: there is no actual table object, each cell is merely delimited by line segments, and the PDF format records only the position information of characters, line segments, pictures, and the like.
Existing parsing approaches extract only the characters in a table, whereas table data should correspond strictly to the columns of the header row. This is complicated by PDF-specific issues such as tables that continue across consecutive pages, unpredictable line breaks within a single cell, and watermarks. Simply splitting the characters is therefore impractical. The alternative is to first analyze the distinguishing features of each table format and then write a corresponding script to import the data into a database, which entails an enormous workload and makes automatic extraction of PDF table data into a database difficult to achieve.
Moreover, the PDF parsing tools on the market are mostly closed source and treat table data as plain characters, so it is difficult to match data to its header and to determine the correlation between data rows.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method and a storage medium for analyzing PDF table data that achieve fully automatic and accurate analysis of table data and are highly practical.
In order to solve the technical problems, the invention adopts the technical scheme that:
a method of parsing PDF tabular data, comprising:
acquiring coordinates of each line segment and each character of each page of PDF;
dividing cells according to the intersection points of the line segments, and acquiring rectangular coordinates corresponding to the cells;
and acquiring the field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates.
The invention provides another technical scheme as follows:
a computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring coordinates of each line segment and each character of each page of PDF;
dividing cells according to the intersection points of the line segments, and acquiring rectangular coordinates corresponding to the cells;
and acquiring the field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates.
The invention has the following beneficial effects: it provides a method for analyzing PDF table data visually. There is no need to define how fields are parsed for each specific PDF file or to determine the header of the table; field block data can be analyzed and organized automatically and accurately, so the method is widely applicable. Specifically, the cells and the characters within them are accurately delimited according to the relation between the coordinates of each line segment and each character, and the tables of the PDF and the data in them are accurately extracted. The process is highly automated and greatly simplifies the importing of PDF tables. The invention thereby improves the accuracy and convenience of analyzing PDF table data.
Drawings
FIG. 1 is a diagram of a PDF table in a single table format;
FIG. 2 is a schematic diagram of a random blank cell;
FIG. 3 is a schematic diagram of a cell split across pages;
FIG. 4 is a diagram of a table with a multi-layer watermark;
FIG. 5 is a flowchart illustrating a method for analyzing PDF table data according to the present invention;
FIG. 6 is a schematic diagram of line segment intersections;
FIG. 7 is a schematic diagram of line segments forming a valid cell;
FIG. 8 is a flowchart illustrating the first embodiment.
Detailed Description
In order to explain technical contents, achieved objects, and effects of the present invention in detail, the following description is made with reference to the accompanying drawings in combination with the embodiments.
The key concept of the invention is as follows: the cells and the characters within them are accurately delimited according to the relation between the coordinates of each line segment and each character, so that the tables of a PDF and the data in them are accurately extracted and PDF tables are analyzed accurately, conveniently, and automatically.
Referring to fig. 5, the present invention provides a method for analyzing PDF table data, comprising:
acquiring coordinates of each line segment and each character of each page of PDF;
dividing cells according to the intersection points of the line segments, and acquiring rectangular coordinates corresponding to the cells;
and acquiring the field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates.
Further, the method also comprises the following steps:
determining the cells belonging to each row according to the median lines of the cells.
From the above description, whether cells are in the same row is determined by whether the difference between their median lines falls within a set error range, and the cells are normalized accordingly to obtain an ordered list.
Further, the method also comprises the following steps:
converting each page of PDF into an image data form;
if, as the adjoining upper and lower pages are gradually brought together and overlapped along the Y-axis direction, a corresponding vertical line segment can be obtained on each page and a horizontal line segment can be obtained on each of those vertical line segments, merging the cells at the junction of the upper and lower pages.
According to the above description, image-based visual feature analysis can determine whether cells adjoining across the upper and lower pages belong to the same cell, that is, whether a cell has been split by pagination; if so, the parts are merged. Split cells are thus merged automatically and accurately.
Further, the step of merging the cells at the junction of the upper and lower pages if, as the adjoining upper and lower pages are gradually brought together and overlapped along the Y-axis direction, a corresponding vertical line segment can be obtained on each page and a horizontal line segment can be obtained on each of those vertical line segments, specifically comprises:
presetting the upper left corner of each page of PDF as a coordinate origin;
starting from the maximum Y-axis value of the current page and advancing toward the origin to obtain a vertical line segment, then judging whether a horizontal line segment intersecting that vertical line segment exists on it; and, at the same time,
starting from the zero Y-axis coordinate of the next page and advancing toward the maximum value to obtain a vertical line segment, then judging whether a horizontal line segment intersecting that vertical line segment exists on it;
if both exist, merging the cells corresponding to the adjacent vertical line segments in the current page and the corresponding cells in the next page into the same cell.
From the above description, a visual algorithm can be used to judge the correlation between cells on adjacent PDF pages and to merge split cells automatically, which further improves the form of the resulting table.
Further, the obtaining of the field block corresponding to each cell according to the inclusion relationship between the coordinates of the character and the rectangular coordinates specifically includes:
acquiring the characters corresponding to each non-blank rectangle according to whether the coordinates of the characters lie within the rectangular coordinates;
excluding the watermark characters within each non-blank rectangle according to the coefficients of the matrix that maps each character from the PDF coordinate space to the user's visual space;
forming a field block from the characters corresponding to each non-blank rectangle, and supplementing a blank field for each blank rectangle;
and acquiring the field block corresponding to each cell.
According to the above description, watermark characters can be removed effectively, which safeguards the accuracy of the parsed table. Meanwhile, blank fields are assigned to blank cells so that blank cells remain aligned with their corresponding headers, ensuring the integrity and accuracy of the final table.
Further, the obtaining of the coordinates of each line segment and the coordinates of each character of each page of PDF specifically includes:
rendering the line segments and characters of each page of PDF to a CImage handle, and capturing the coordinates of each line segment and each character during rendering.
According to the above description, rendering the line segments and characters to the CImage handle converts the structured PDF data into image data that is convenient to analyze and process. Subsequent detection and analysis can then be performed directly on the image data to obtain the feature data of the line segments and characters, from which the required data is finally obtained.
Further, the cell is divided according to the intersection point of the line segments, and the rectangular coordinate corresponding to each cell is obtained, specifically:
if the distance between one end point coordinate of one line segment and one end point coordinate of another line segment is within a preset first threshold value range, judging that the line segment is intersected with the another line segment;
and if the four adjacent line segments are intersected end to end in sequence and the formed area exceeds a preset second threshold range, acquiring the coordinates of the four line segments and marking the coordinates as rectangular coordinates corresponding to the cell formed by the four line segments.
As can be seen from the above description, since the coordinates of the PDF user space are of a floating point type, it is determined whether the corresponding line segments intersect by correspondingly determining whether the distance between two points is within a certain threshold range. The cells can be conveniently and accurately divided according to the number of intersection points in the follow-up process.
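The end-point proximity test described above can be sketched as follows. This is a minimal illustration only: the names `Segment`, `segments_meet`, and the default tolerance `tol` are assumptions made for the example and are not taken from the patent, which does not state a value for the first threshold.

```python
import math

# A point is (x, y); a segment is a pair of end points.
Point = tuple[float, float]
Segment = tuple[Point, Point]


def segments_meet(a: Segment, b: Segment, tol: float = 1.0) -> bool:
    """Return True if any end point of `a` lies within `tol` of any end point of `b`.

    `tol` plays the role of the "preset first threshold"; its value here is
    an arbitrary assumption for the sketch.
    """
    return any(math.dist(p, q) <= tol for p in a for q in b)
```

Comparing distances against a tolerance, instead of testing coordinates for exact equality, is what makes the check robust to the floating-point coordinates mentioned in the description.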
The invention provides another technical scheme as follows:
a computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring coordinates of each line segment and each character of each page of PDF;
dividing cells according to the intersection points of the line segments, and acquiring rectangular coordinates corresponding to the cells;
and acquiring the field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates.
Further, the program can also realize the steps of:
converting each page of PDF into an image data form;
determining a cell corresponding to each row according to the median line of the cells;
if, as the adjoining upper and lower pages are gradually brought together and overlapped along the Y-axis direction, a corresponding vertical line segment can be obtained on each page and a horizontal line segment can be obtained on each of those vertical line segments, merging the cells at the junction of the upper and lower pages.
Further, the step of obtaining coordinates of each line segment and each character of each page of PDF specifically comprises:
rendering the line segments and characters of each page of PDF to a CImage handle, and capturing the coordinates of each line segment and each character during rendering;
dividing cells according to the intersection points of the line segments, and acquiring rectangular coordinates corresponding to the cells, wherein the steps are as follows:
if the distance between one end point coordinate of one line segment and one end point coordinate of another line segment is within a preset first threshold value range, judging that the line segment is intersected with the another line segment;
if four adjacent line segments are intersected end to end in sequence and the formed area exceeds a preset second threshold range, acquiring coordinates of the four line segments and marking the coordinates as rectangular coordinates corresponding to a cell formed by the four line segments;
acquiring a field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates, specifically:
acquiring characters corresponding to each non-blank rectangular coordinate according to whether the coordinates of the characters are located in the rectangular coordinates;
excluding the watermark characters within each non-blank rectangle according to the coefficients of the matrix that maps each character from the PDF coordinate space to the user's visual space;
forming a field block from the characters corresponding to each non-blank rectangle, and supplementing a blank field for each blank rectangle;
and acquiring the field block corresponding to each cell.
Example one
This embodiment provides a method for analyzing PDF table data, which is suitable for parsing tables in PDF-format data to obtain the corresponding table data and to facilitate subsequent editing. For example, in front-end data cleaning, most of the call records and bills provided by customers are PDFs in table format; with this embodiment, such a PDF can be extracted into a corresponding CSV file and imported automatically into a database for analysis.
FIGS. 1-4 show several forms of PDF table that are common in the prior art. Specifically, FIG. 1 shows a single table; FIG. 2 shows randomly placed blank cells; FIG. 3 shows a cell split across pages; and FIG. 4 shows a multi-layer watermark. Existing PDF table parsing is mostly closed source and processes table data as plain characters, so it is difficult to match data to headers and to judge the correlation between rows.
In view of the above problems, this embodiment addresses the parsing of these different table forms through several specific implementations.
Referring to fig. 8, the method for parsing PDF table data of the present embodiment includes:
S1: converting each page of the PDF into image data; setting the upper left corner of each PDF page as the coordinate origin; and acquiring the coordinates of each line segment and each character of each page of the PDF;
This step specifically comprises:
S101: loading the PDF file and acquiring the object of each page in a loop; the object is a pointer to a page of the PDF and is used to acquire the PDF data of each page in turn;
S102: rendering the line segments and characters of each page of the PDF to a CImage handle, and setting the upper left corner of each page of the resulting image data as the coordinate origin; the coordinates of the line segments and characters are captured during rendering.
Rendering to a CImage handle serves the following purposes: 1. the structured PDF data is copied and converted into image data; 2. processing is carried out independently on the copy, so the original PDF data is preserved and the source file cannot be lost; 3. image data containing only line segments can be obtained, which removes interference for subsequent image binarization and line detection; 4. once converted into image data, the page is easier to process further, since the required data can be obtained directly from features found by image detection.
The coordinates of the line segments and characters can be captured at the same time as rendering; the rendering itself serves only to obtain the image of pure line segments. Acquiring the coordinates of a line segment means acquiring the coordinates of its pair of end points.
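Setting the upper left corner as the origin can be illustrated with a small coordinate flip. The sketch assumes the captured coordinates are in default PDF user space, whose origin is at the lower left with the Y axis pointing up; if the rendering library already reports top-left coordinates, no conversion is needed. The function name `to_top_left` and the page height used in the check are hypothetical.

```python
def to_top_left(x: float, y: float, page_height: float) -> tuple[float, float]:
    """Convert a point from a bottom-left origin (Y up) to a top-left origin (Y down).

    Assumes `page_height` is expressed in the same units as `x` and `y`.
    """
    return (x, page_height - y)
```

For a 792-unit-high page, a point 700 units above the bottom edge lands 92 units below the top edge, so later steps can treat "advancing toward the origin" as moving up the page.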
S2: dividing cells according to the intersection points of the line segments, and acquiring rectangular coordinates corresponding to the cells;
Because coordinates in PDF user space are floating-point values, in the xy coordinate space of the image data an intersection of line segments means that the distance between an end point of one line segment and an end point of another line segment is within a certain threshold; as shown in FIG. 6, segment A with end points (x1, y1), (x2, y2) intersects segment B with end points (x3, y3), (x4, y4). A cell is formed when four line segments in the coordinate space have four intersection points, and the enclosed region is regarded as a valid cell when its area exceeds a certain threshold; as shown in FIG. 7, four adjacent line segments A, B, C, and D form a cell.
Therefore, step S2 specifically includes:
S201: if the distance between the coordinates of an end point of one line segment and the coordinates of an end point of another line segment is within a preset first threshold range, judging that the two line segments intersect;
S202: if four adjacent line segments intersect end to end in sequence and the area they enclose exceeds a preset second threshold range, judging that the four line segments form a valid cell, acquiring the coordinates of the four line segments, and marking them as the rectangular coordinates of the cell formed by the four line segments;
S203: acquiring the rectangular coordinates corresponding to each cell.
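Steps S201-S203 can be sketched as a brute-force search over segment pairs. Several things here are assumptions made for illustration: the segments are taken to be already split at junctions so that each one is a single cell edge, horizontal segments are given left to right and vertical segments top to bottom in a top-left-origin space, and the names `find_cells`, `tol`, and `min_area` (standing in for the first and second thresholds) are hypothetical.

```python
import math
from itertools import combinations


def near(p, q, tol):
    """True if points p and q are within `tol` of each other."""
    return math.dist(p, q) <= tol


def find_cells(horizontals, verticals, tol=1.0, min_area=4.0):
    """Return cell rectangles (x0, y0, x1, y1) formed by four edges meeting end to end.

    horizontals: list of ((x_left, y), (x_right, y))
    verticals:   list of ((x, y_top), (x, y_bottom))
    """
    cells = []
    by_y = sorted(horizontals, key=lambda s: s[0][1])
    by_x = sorted(verticals, key=lambda s: s[0][0])
    for top, bottom in combinations(by_y, 2):
        for left, right in combinations(by_x, 2):
            # All four corners must close within the tolerance (S201).
            corners_ok = (
                near(top[0], left[0], tol)
                and near(top[1], right[0], tol)
                and near(bottom[0], left[1], tol)
                and near(bottom[1], right[1], tol)
            )
            if not corners_ok:
                continue
            x0, y0 = left[0][0], top[0][1]
            x1, y1 = right[0][0], bottom[0][1]
            # Discard degenerate regions below the area threshold (S202).
            if (x1 - x0) * (y1 - y0) > min_area:
                cells.append((x0, y0, x1, y1))
    return cells
```

The quartic search is acceptable for a sketch on a single page; an indexed lookup of end points would be a natural refinement for pages with many ruled lines.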
S3: and acquiring the field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates.
A field block is the ordered set of all valid characters within a single cell of the PDF (excluding watermark characters that fall within the cell).
Step S3 specifically includes:
S301: judging whether each rectangle contains characters according to the inclusion relation between the coordinates of the characters and the rectangular coordinates, that is, whether the rectangular coordinates of a character fall within the rectangular coordinates of the cell; if so, executing S302; if not, executing S303. This judgment follows directly from the positional relationship of the coordinates in the image space.
S302: acquiring all characters within the rectangular coordinates in order, to form the field block corresponding to the rectangular coordinates;
S303: if a rectangle is determined to contain no characters, setting the field block corresponding to that rectangle to null, that is, supplementing a blank field block for it, so that the blank cell corresponding to the blank rectangle stays aligned with its header.
In one implementation, after the rectangular coordinates are determined to contain characters, that is, before step S302, the following step is also performed in order to handle PDF tables that contain watermarks.
Specifically, the watermark characters within each rectangle are excluded according to the matrix coefficients of the characters mapped from the PDF coordinate space to the user's visual space (that is, the xy coordinate space of the image data in this embodiment). The matrix coefficients are the set of matrix entries that map a character from the PDF coordinate space to the user's visual space. A watermark is generally drawn at an angle, so its transformation matrix differs from that of a normal character, and this difference is used to judge whether a given character belongs to a watermark.
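Steps S301-S303 amount to a containment test followed by a blank fill, which might look like the sketch below. As a simplification, each character is represented by a single point (for example the centre of its box) rather than by its full rectangle, and the function name `build_field_blocks` and the tuple layouts are hypothetical.

```python
def build_field_blocks(cells, chars):
    """Return one field block (string) per cell, using '' for blank cells.

    cells: list of (x0, y0, x1, y1) rectangles
    chars: list of (text, x, y) tuples, assumed to be in reading order
    """
    blocks = []
    for x0, y0, x1, y1 in cells:
        # S301/S302: keep the characters whose point falls inside the cell.
        inside = [t for t, x, y in chars if x0 <= x <= x1 and y0 <= y <= y1]
        # S303: an empty cell still yields a (blank) field block.
        blocks.append("".join(inside))
    return blocks
```

Emitting an empty string for a blank cell, rather than skipping the cell, is what keeps every value under the right header when the row is later written out.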
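One way to act on the matrix difference is to inspect the off-diagonal terms of a character's transformation matrix. The patent does not give the exact criterion, so the test below is an assumption: it treats a matrix written in the usual six-number PDF form [a b c d e f] as belonging to upright text when b and c are close to zero, and flags anything else as a candidate watermark. The name `is_watermark` and the tolerance are hypothetical.

```python
def is_watermark(matrix, tol=1e-3):
    """Return True if the six-coefficient matrix (a, b, c, d, e, f) is rotated or skewed.

    Assumption: ordinary table text is drawn upright, so b and c are ~0;
    angled watermark text has non-zero off-diagonal coefficients.
    """
    a, b, c, d, e, f = matrix
    return abs(b) > tol or abs(c) > tol
```

A table whose legitimate text is itself rotated (for example vertical column headers) would need a different rule, so this check is best seen as a starting point.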
Next, S3 of this embodiment further includes:
S304: acquiring the field block corresponding to each cell.
In another implementation, S305 is further included to handle tables with multiple rows of cells.
S305: determining the cells belonging to each row according to the median lines of the cells. Specifically, whether cells belong to the same row is determined by the difference between their median lines. If the y-axis coordinates of the median lines of the cells are within a certain threshold range of one another, the cells are judged to be in the same row; otherwise, they are judged not to be in the same row and are assigned to different rows.
In another implementation, S4-S5 are further included to handle cells split across pages.
S4: converting the image data of each PDF page into an OpenCV Mat object.
S5: starting from the maximum Y-axis value of the current page and advancing toward the origin to obtain a vertical line segment, then judging whether a horizontal line segment intersecting that vertical line segment can be detected on it; and, at the same time,
starting from the zero Y-axis coordinate of the next page and advancing toward the maximum value to obtain a vertical line segment, then judging whether a horizontal line segment intersecting that vertical line segment can be detected on it;
if both conditions are met, merging the cells corresponding to the adjacent vertical line segments in the current page and the corresponding cells in the next page into the same cell. As shown in FIG. 3, the incomplete cells split between the upper and lower pages are merged into a complete cell.
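Row assignment by median line can be sketched as a sort followed by a single pass. The function name `group_rows` and the tolerance `tol` are assumptions for the example; each new row is anchored on the median of its first cell, which is one of several reasonable choices.

```python
def group_rows(cells, tol=2.0):
    """Group cell rectangles (x0, y0, x1, y1) into rows by the y of their median line.

    Cells whose median y is within `tol` of the row's first cell share a row.
    Rows are returned top to bottom, cells within a row left to right.
    """
    rows = []  # list of (anchor_median_y, [cells])
    for cell in sorted(cells, key=lambda c: ((c[1] + c[3]) / 2, c[0])):
        mid = (cell[1] + cell[3]) / 2
        if rows and abs(mid - rows[-1][0]) <= tol:
            rows[-1][1].append(cell)
        else:
            rows.append((mid, [cell]))
    return [sorted(members, key=lambda c: c[0]) for _, members in rows]
```

Using the median line instead of the top edge tolerates cells in the same row whose borders are drawn at slightly different heights.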
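The two conditions of S5 can be read literally as a check on each page's segment list. The sketch below is only one such reading and works on captured segment coordinates instead of scanning an OpenCV image, which is a substitution made for brevity; how the original implementation scans the image is not specified. All function names and the tolerance are hypothetical.

```python
def edge_vertical(verticals, from_bottom):
    """Pick the vertical segment nearest the page edge being approached.

    verticals: list of ((x, y_top), (x, y_bottom)) in a top-left-origin space.
    """
    if not verticals:
        return None
    if from_bottom:
        return max(verticals, key=lambda s: s[1][1])  # lowest end on the page
    return min(verticals, key=lambda s: s[0][1])  # highest start on the page


def crossed_by_horizontal(vertical, horizontals, tol=1.0):
    """True if some horizontal segment touches or crosses the vertical one."""
    (x, y_top), (_, y_bottom) = vertical
    for (hx0, hy), (hx1, _) in horizontals:
        if hx0 - tol <= x <= hx1 + tol and y_top - tol <= hy <= y_bottom + tol:
            return True
    return False


def should_merge(cur_v, cur_h, next_v, next_h, tol=1.0):
    """Both pages must yield a vertical segment that a horizontal segment meets."""
    lower = edge_vertical(cur_v, from_bottom=True)
    upper = edge_vertical(next_v, from_bottom=False)
    return (
        lower is not None
        and upper is not None
        and crossed_by_horizontal(lower, cur_h, tol)
        and crossed_by_horizontal(upper, next_h, tol)
    )
```

When `should_merge` returns True, the field blocks of the bottom cells on the current page would be concatenated with those of the top cells on the next page; that concatenation is left out here.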
In this embodiment, the method further includes the following steps:
S6: aggregating the table data and arranging it into CSV format.
The method for analyzing PDF tables provided by this embodiment does not require defining the parsing fields for each specific PDF file or determining the header of the table; it analyzes and organizes the field block data fully automatically and accurately, and is highly practical and widely applicable. Furthermore, this embodiment delimits cells accurately using the coordinate relationships between characters and line segments, and uses a visual algorithm to judge the correlation of cells across PDF pages. In summary, this embodiment analyzes PDF table data automatically, accurately, and comprehensively, and improves the accuracy and convenience of data cleaning.
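Arranging the rows of field blocks into CSV can be done with Python's standard `csv` module, as in the sketch below. The function name `rows_to_csv` and the sample values are hypothetical; the rows are assumed to be lists of field-block strings, with blank cells already represented as empty strings.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows of field blocks (lists of strings) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()
```

Because blank cells are kept as empty strings, each data row has the same number of columns as the header row, which is what allows a direct import into a database table.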
Example two
This embodiment corresponds to the first embodiment and provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements all the steps of the first embodiment.
In summary, the method and the storage medium for analyzing PDF table data provided by the present invention achieve accurate, convenient, and automatic analysis of PDF tables. The method accurately analyzes the data of single and multiple tables as well as randomly placed blank cells, cells split across pages, and cells covered by multi-layer watermarks, so it is highly practical and widely applicable. Furthermore, because the analysis is based on character coordinates and line segment coordinates rather than on characters alone, as in existing approaches, it is more accurate and convenient and preserves the correspondence between data and headers; it can also analyze the correlation between rows, which supports the parsing of many types of table.
The above description is only an embodiment of the present invention and is not intended to limit its scope; all equivalent changes made using the contents of this specification and the drawings, or applied directly or indirectly in related technical fields, are likewise included in the scope of protection of the present invention.

Claims (7)

1. A method for analyzing PDF table data, comprising:
acquiring coordinates of each line segment and each character of each page of PDF;
dividing cells according to the intersection points of the line segments, and acquiring rectangular coordinates corresponding to the cells;
acquiring a field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates;
further comprising:
converting each page of PDF into an image data form;
if, as the adjoining upper and lower pages are gradually brought together and overlapped along the Y-axis direction, a corresponding vertical line segment can be obtained on each page and a horizontal line segment can be obtained on each of those vertical line segments, merging the cells at the junction of the upper and lower pages;
wherein the step of merging the cells at the junction of the upper and lower pages if, as the adjoining upper and lower pages are gradually brought together and overlapped along the Y-axis direction, a corresponding vertical line segment can be obtained on each page and a horizontal line segment can be obtained on each of those vertical line segments, specifically comprises:
presetting the upper left corner of each page of PDF as a coordinate origin;
starting from the maximum Y-axis value of the current page and advancing toward the origin to obtain a vertical line segment, then judging whether a horizontal line segment intersecting that vertical line segment exists on it; and, at the same time,
starting from the zero Y-axis coordinate of the next page and advancing toward the maximum value to obtain a vertical line segment, then judging whether a horizontal line segment intersecting that vertical line segment exists on it;
if both exist, merging the cells corresponding to the adjacent vertical line segments in the current page and the corresponding cells in the next page into the same cell.
2. The method of parsing PDF table data according to claim 1, further comprising:
and determining the corresponding unit cell of each row according to the median line of the unit cell.
3. The method according to claim 1, wherein the obtaining of the field block corresponding to each cell according to the inclusion relationship between the coordinates of the character and the rectangular coordinates includes:
acquiring characters corresponding to each non-blank rectangular coordinate according to whether the coordinates of the characters are located in the rectangular coordinates;
according to a matrix coefficient of the character mapped to a user visual space from a coordinate space of PDF, eliminating watermark characters in each non-blank rectangular coordinate;
forming a field block from the characters corresponding to each non-blank rectangle, and supplementing a blank field for each blank rectangle;
and acquiring the field block corresponding to each cell.
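The steps of claim 3 can be sketched in Python as below. The claim does not state which matrix coefficients indicate a watermark, so this sketch assumes that a character whose six-element matrix (a, b, c, d, e, f) has non-zero b or c, that is, a rotated or skewed character, is a watermark character. That rule, together with the names Char, is_watermark and field_blocks, is a hypothetical choice for illustration.

```python
from dataclasses import dataclass

@dataclass
class Char:
    text: str
    x: float
    y: float
    matrix: tuple  # (a, b, c, d, e, f) mapping PDF space to user space

def is_watermark(ch, eps=1e-6):
    """Assumed rule: rotated or skewed characters are watermark characters."""
    _, b, c, _, _, _ = ch.matrix
    return abs(b) > eps or abs(c) > eps

def field_blocks(rects, chars):
    """Return one field block (string) per rectangle; blank cells give ''."""
    blocks = []
    for x0, y0, x1, y1 in rects:
        inside = [
            ch for ch in chars
            if x0 <= ch.x <= x1 and y0 <= ch.y <= y1 and not is_watermark(ch)
        ]
        inside.sort(key=lambda ch: (ch.y, ch.x))  # reading order: top to bottom, left to right
        blocks.append("".join(ch.text for ch in inside))
    return blocks

IDENTITY = (1, 0, 0, 1, 0, 0)
ROTATED = (0.7, 0.7, -0.7, 0.7, 0, 0)  # roughly a 45 degree rotation

rects = [(0, 0, 10, 10), (10, 0, 20, 10)]
chars = [
    Char("B", 4, 5, IDENTITY),
    Char("W", 3, 5, ROTATED),   # dropped as a watermark under the assumed rule
    Char("A", 2, 5, IDENTITY),
]
blocks = field_blocks(rects, chars)
```

The second rectangle contains no characters, so it yields an empty string, which corresponds to a blank field.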
4. The method for parsing PDF table data according to claim 1, wherein said obtaining coordinates of line segments and coordinates of characters of each page of PDF specifically comprises:
rendering the line segments and characters of each page of PDF to a CImage handle, and capturing the coordinates of each line segment and each character during rendering.
5. The method according to claim 1, wherein the dividing cells according to the line segment intersections and obtaining the rectangular coordinates corresponding to each cell are specifically:
if the distance between an end point coordinate of one line segment and an end point coordinate of another line segment is within a preset first threshold range, judging that the one line segment intersects the other line segment;
and if the four adjacent line segments are intersected end to end in sequence and the formed area exceeds a preset second threshold range, acquiring the coordinates of the four line segments and marking the coordinates as rectangular coordinates corresponding to a cell formed by the four line segments.
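A minimal Python sketch of the two threshold tests in claim 5 follows. It assumes that each line segment is a pair of (x, y) end points, that the first threshold is a Euclidean distance, and that the second threshold is a minimum area. The names endpoints_meet and cell_rect and the default threshold values are illustrative assumptions rather than values given in the claims.

```python
import math

def endpoints_meet(s, t, first_threshold=2.0):
    """True if some end point of s lies within the threshold of some end point of t."""
    return any(math.dist(p, q) <= first_threshold for p in s for q in t)

def cell_rect(segments, first_threshold=2.0, min_area=4.0):
    """Return (x0, y0, x1, y1) if four segments close end to end into a cell, else None."""
    if len(segments) != 4:
        return None
    # Each segment must meet the next one, and the last must meet the first.
    closed = all(
        endpoints_meet(segments[i], segments[(i + 1) % 4], first_threshold)
        for i in range(4)
    )
    if not closed:
        return None
    xs = [p[0] for s in segments for p in s]
    ys = [p[1] for s in segments for p in s]
    x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
    # Discard degenerate boxes such as the outline of a thick rule.
    if (x1 - x0) * (y1 - y0) <= min_area:
        return None
    return (x0, y0, x1, y1)

# Hypothetical cell whose right edge starts 0.5 units short of the top edge.
top = ((0, 0), (10, 0))
right = ((10, 0.5), (10, 5))
bottom = ((10, 5), (0, 5))
left = ((0, 5), (0, 0))
rect = cell_rect([top, right, bottom, left])
```

The distance tolerance lets segments whose end points do not coincide exactly still count as intersecting, and the area test filters out very small enclosed regions.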
6. A computer-readable storage medium, on which a computer program is stored, which program, when executed by a processor, carries out the steps of:
acquiring coordinates of each line segment and each character of each page of PDF;
dividing cells according to the intersection points of the line segments, and acquiring rectangular coordinates corresponding to the cells;
acquiring a field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates;
the program can also realize the following steps:
converting each page of PDF into an image data form;
determining a cell corresponding to each row according to the median line of the cells;
if the upper page and the lower page which are connected with each other are gradually overlapped and closed along the Y-axis direction, the corresponding vertical line segment can be obtained, and the horizontal line segment can be respectively obtained on the vertical line segment, then the cells at the connection position of the upper page and the lower page are combined;
the step of merging the cells at the junction of the upper and lower pages, performed if the corresponding vertical line segments can be obtained after the two consecutive pages are gradually overlapped and brought together along the Y-axis direction and a horizontal line segment can be obtained on each of those vertical line segments, is specifically:
presetting the upper left corner of each page of PDF as a coordinate origin;
starting from the maximum Y-axis value of the current page and advancing toward the origin to obtain a vertical line segment, then judging whether a horizontal line segment intersecting that vertical line segment exists on it; and, at the same time,
starting from the Y-axis zero coordinate of the next page and advancing toward the maximum value to obtain a vertical line segment, then judging whether a horizontal line segment intersecting that vertical line segment exists on it;
if yes, combining the cells corresponding to the adjacent vertical line segments in the current page and the corresponding cells in the next page into the same cell.
7. The computer-readable storage medium of claim 6, wherein the step of obtaining coordinates of each line segment and each character of each page of PDF comprises:
rendering the line segments and characters of each page of PDF to a CImage handle, and capturing the coordinates of each line segment and each character during rendering;
dividing cells according to the intersection points of the line segments, and acquiring rectangular coordinates corresponding to the cells, wherein the steps are as follows:
if the distance between an end point coordinate of one line segment and an end point coordinate of another line segment is within a preset first threshold range, judging that the one line segment intersects the other line segment;
if four adjacent line segments are intersected end to end in sequence and the formed area exceeds a preset second threshold range, acquiring coordinates of the four line segments and marking the coordinates as rectangular coordinates corresponding to a cell formed by the four line segments;
acquiring a field block corresponding to each cell according to the inclusion relation between the coordinates of the characters and the rectangular coordinates, specifically:
acquiring characters corresponding to each non-blank rectangular coordinate according to whether the coordinates of the characters are located in the rectangular coordinates;
eliminating watermark characters in each non-blank rectangular coordinate according to the matrix coefficients that map each character from the PDF coordinate space to the user visual space;
characters corresponding to the non-blank rectangular coordinates form a field block, and each rectangular coordinate for supplementing blanks corresponds to a blank field;
and acquiring the field block corresponding to each cell.
CN201711235867.5A 2017-11-30 2017-11-30 Method for analyzing PDF table data and storage medium Active CN108132916B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711235867.5A CN108132916B (en) 2017-11-30 2017-11-30 Method for analyzing PDF table data and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711235867.5A CN108132916B (en) 2017-11-30 2017-11-30 Method for analyzing PDF table data and storage medium

Publications (2)

Publication Number Publication Date
CN108132916A CN108132916A (en) 2018-06-08
CN108132916B true CN108132916B (en) 2022-02-11

Family

ID=62390012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711235867.5A Active CN108132916B (en) 2017-11-30 2017-11-30 Method for analyzing PDF table data and storage medium

Country Status (1)

Country Link
CN (1) CN108132916B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109446487A (en) * 2018-11-01 2019-03-08 北京神州泰岳软件股份有限公司 A kind of method and device parsing portable document format document table
CN109670461A (en) * 2018-12-24 2019-04-23 广东亿迅科技有限公司 PDF text extraction method, device, computer equipment and storage medium
CN110008809B (en) * 2019-01-04 2020-08-25 阿里巴巴集团控股有限公司 Method and device for acquiring form data and server
CN109815958B (en) * 2019-02-01 2022-02-15 杭州睿琪软件有限公司 Laboratory test report identification method and device, electronic equipment and storage medium
CN109871524B (en) * 2019-02-21 2023-06-09 腾讯科技(深圳)有限公司 Chart generation method and device
CN110134957B (en) * 2019-05-14 2023-06-13 云南电网有限责任公司电力科学研究院 Scientific and technological achievement warehousing method and system based on semantic analysis
CN112069991A (en) * 2020-09-04 2020-12-11 税友软件集团股份有限公司 PDF table information extraction method and related device
CN112541332B (en) * 2020-12-08 2023-06-23 北京百度网讯科技有限公司 Form information extraction method and device, electronic equipment and storage medium
CN112712014B (en) * 2020-12-29 2024-04-30 平安健康保险股份有限公司 Method, system, device and readable storage medium for parsing table picture structure
CN113435166B (en) * 2021-06-09 2024-03-19 深圳市世强元件网络有限公司 Underline method and system, computer device and readable storage medium
CN113361257B (en) * 2021-06-29 2022-10-11 深圳壹账通智能科技有限公司 PDF document analysis method, system, electronic device and storage medium
CN113642408A (en) * 2021-07-15 2021-11-12 杭州玖欣物联科技有限公司 Method for processing and analyzing picture data in real time through industrial internet

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866335B (en) * 2010-06-14 2012-12-12 深圳市万兴软件有限公司 Form processing method and device in document conversion
CN102467378A (en) * 2010-11-11 2012-05-23 深圳市金蝶友商电子商务服务有限公司 HTML (Hypertext Markup Language) form processing method based on two-dimensional matrix and computer
CN104063364A (en) * 2013-03-19 2014-09-24 福建福昕软件开发股份有限公司北京分公司 PDF document recognition method
CN104268127B (en) * 2014-09-22 2018-02-09 同方知网(北京)技术有限公司 Method for analyzing the reading order of electronic layout files
CN105989013A (en) * 2015-01-28 2016-10-05 腾讯科技(深圳)有限公司 Method and device for removing character watermarks
CN105988979B (en) * 2015-02-16 2018-11-16 北京邮电大学 Table extracting method and device based on pdf document
CN106484340B (en) * 2016-09-08 2019-04-05 中标软件有限公司 Method for adding a watermark to a document during printing and for recognizing the watermark
CN106951400A (en) * 2017-02-06 2017-07-14 北京因果树网络科技有限公司 The information extraction method and device of a kind of pdf document
CN106897690B (en) * 2017-02-22 2018-04-13 南京述酷信息技术有限公司 PDF table extracting methods

Also Published As

Publication number Publication date
CN108132916A (en) 2018-06-08

Similar Documents

Publication Publication Date Title
CN108132916B (en) Method for analyzing PDF table data and storage medium
CN103389937A (en) Interface testing method and device
JP2012003753A5 (en)
CN109656652B (en) Webpage chart drawing method, device, computer equipment and storage medium
CN109448088B (en) Method and device for rendering three-dimensional graphic wire frame, computer equipment and storage medium
CN112668289A (en) Extraction method and device of nested table and storage medium
CN114239508A (en) Form restoration method and device, storage medium and electronic equipment
CN102063496A (en) Spatial data simplifying method and device
CN115906360A (en) Drainage system CAD-GIS data conversion and standard marking method and device
CN113362420A (en) Road marking generation method, device, equipment and storage medium
CN111428700A (en) Table identification method and device, electronic equipment and storage medium
CN100492403C (en) Character image line selecting method and device and character image identifying method and device
CN104268545A (en) Method for table area recognition and content rasterization in electronic document layout files
CN107871128B (en) High-robustness image recognition method based on SVG dynamic graph
CN112084103B (en) Interface test method, device, equipment and medium
CN105701761A (en) image processing apparatus and method for processing images
CN115457581A (en) Table extraction method and device and computer equipment
CN113850265A (en) PDF document analysis method and device, electronic equipment and storage medium
CN113592981A (en) Picture labeling method and device, electronic equipment and storage medium
KR101814728B1 (en) The method for extracting 3D model skeletons
US20230215033A1 (en) Convex geometry image capture
JP4967934B2 (en) Image processing apparatus and program
CN118135116B (en) Automatic generation method and system based on CAD two-dimensional conversion three-dimensional entity
CN107204003B (en) Method and device for identifying connected area of two-dimensional digital core
CN103345437B (en) The method of testing of the graphic output interface of mobile terminal client terminal browser and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant