CN113850265A - PDF document analysis method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN113850265A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111082611.1A
Other languages
Chinese (zh)
Inventor
赵亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai

Abstract

The invention discloses a PDF document parsing method and device, an electronic device, and a storage medium. The method comprises: acquiring a page object from the PDF document; determining an abscissa set and an ordinate set according to the endpoint coordinates of the straight-line elements in the page object; determining a character string according to the coordinates of the character elements in the page object; determining the column identification of the character string according to the coordinates of the character string and the abscissa set; determining the row identification of the character string according to the coordinates of the character string and the ordinate set; and drawing a spreadsheet according to the row identification and the column identification. The row identification of the character string in the spreadsheet can thus be determined accurately, and the character strings of a table in the PDF can be extracted accurately into the spreadsheet according to the row identification and the column identification, which improves the efficiency of parsing tables in PDF documents.

Description

PDF document analysis method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to a data processing technology, in particular to an artificial intelligence analysis technology of a PDF document, and particularly relates to a PDF document analysis method, a PDF document analysis device, electronic equipment and a storage medium.
Background
Portable Document Format (PDF) is a file format developed by Adobe Systems for exchanging files in a manner independent of application programs, operating systems, and hardware. The core of the PDF format is a series of instruction streams that describe how to draw on a page. Text data is not stored as paragraphs or words, but as individual characters, each recorded with its specific position on the page.
In actual use, the drawing instructions in a PDF often need to be converted into an electronic document such as a spreadsheet. Currently, PDF documents are parsed using Python-based open source PDF parsing tools (e.g., pdfplumber, PyPDF2, or pdfminer). However, these tools can only extract a complete table from the PDF; if some table lines are not drawn in the table, the PDF document cannot be parsed accurately.
Disclosure of Invention
The invention provides a method and a device for analyzing a PDF document, electronic equipment and a storage medium, which are used for improving the analysis efficiency of the PDF document.
In a first aspect, an embodiment of the present invention provides a method for parsing a PDF document, including:
acquiring a page object according to the PDF document;
determining an abscissa set and an ordinate set according to the endpoint coordinates of the linear elements in the page object;
determining a character string according to coordinates of character elements in the page object;
determining the column identification of the character string according to the coordinates of the character string and the abscissa set;
determining the row identification of the character string according to the coordinates of the character string and the ordinate set;
and drawing the spreadsheet according to the row identification and the column identification.
In a second aspect, an embodiment of the present invention further provides an apparatus for parsing a PDF document, including:
the page object acquisition module is used for acquiring a page object according to the PDF document;
the coordinate set determining module is used for determining an abscissa set and an ordinate set according to the endpoint coordinates of the linear elements in the page object;
the character string determining module is used for determining character strings according to the coordinates of the character elements in the page object;
the column identification determining module is used for determining the column identification of the character string according to the coordinates of the character string and the abscissa set;
the row identification determining module is used for determining the row identification of the character string according to the coordinates of the character string and the ordinate set;
and the drawing module is used for drawing the spreadsheet according to the row identification and the column identification.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method for parsing a PDF document according to the embodiment of the present application.
In a fourth aspect, the present invention further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are used to execute the parsing method of the PDF document according to the embodiment of the present application.
According to the PDF document parsing method provided by the embodiment of the application, a page object is acquired from the PDF document; an abscissa set and an ordinate set are determined according to the endpoint coordinates of the straight-line elements in the page object; a character string is determined according to the coordinates of the character elements in the page object; the column identification of the character string is determined according to the coordinates of the character string and the abscissa set; the row identification of the character string is determined according to the coordinates of the character string and the ordinate set; and the spreadsheet is drawn according to the row identification and the column identification. In contrast to the low efficiency of current PDF document parsing, the method processes the straight-line elements and the character elements separately after the PDF document is parsed, so that the content in the PDF can be extracted quickly and accurately. Determining the abscissa set and the ordinate set from the endpoint coordinates of the straight-line elements determines the table area: vertical table lines are drawn according to the abscissas in the abscissa set and horizontal table lines according to the ordinates in the ordinate set, which fixes the table lines to be drawn in the spreadsheet. The character strings in the same row can be determined from the coordinates of the character elements, and the column identification of each character string can be determined from its coordinates and the abscissa set. Combined with the ordinate of the character string, the row identification of the character string in the spreadsheet can be determined accurately. The character strings of a table in the PDF are then extracted accurately into the spreadsheet according to the row identification and the column identification, which improves the efficiency of parsing tables in PDF documents.
Drawings
FIG. 1 is a flowchart of a method for parsing a PDF document according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a PDF document parsing method according to a second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a PDF document parsing apparatus according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer device in the fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a PDF document parsing method according to the first embodiment of the present invention. This embodiment is applicable to parsing a PDF document. The method may be executed by a computer device for parsing PDF documents; the computer device may be a personal computer or a notebook computer, or a terminal such as a smartphone or a tablet computer. The method specifically includes the following steps:
Step 110, acquiring a page object according to the PDF document.
Optionally, the PDF document is read with the pdfplumber tool to obtain a page array. The page array is composed of at least one page object, and each page object represents one page in the PDF document. Each page object contains various elements, such as straight-line elements and character elements.
A straight-line element includes endpoint coordinates, which are the coordinates of the two endpoints of the line. A character element includes the character content and coordinates representing the coordinate interval occupied by the character. These coordinates may be the upper left (LU) coordinate and the lower right (RD) coordinate of the character. Each character element may be a single character. Either the upper left LU coordinate or the lower right RD coordinate may be used as the coordinate indicating the character position.
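As an illustration of step 110, the following is a minimal sketch that assumes pdfplumber's dictionary layout for page elements (keys `x0`, `x1`, `top`, `bottom`, `text`). The helper names `char_corners`, `line_endpoints`, and `load_page_elements` are hypothetical and are not part of the patent or of pdfplumber. The endpoint helper is only meaningful for horizontal and vertical lines, which is the case the method addresses.

```python
def char_corners(char):
    """Return the upper left (LU) and lower right (RD) corners of a character.

    `char` is assumed to be a pdfplumber-style dict with x0, x1, top, bottom.
    """
    upper_left = (char["x0"], char["top"])
    lower_right = (char["x1"], char["bottom"])
    return upper_left, lower_right


def line_endpoints(line):
    """Return the two endpoints of a horizontal or vertical line as (x, y) pairs."""
    return (line["x0"], line["top"]), (line["x1"], line["bottom"])


def load_page_elements(pdf_path):
    """Yield (lines, chars) for every page object of the PDF document."""
    import pdfplumber  # third-party package, assumed to be installed

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            yield page.lines, page.chars
```

A caller could iterate over `load_page_elements("report.pdf")` (the file name is a placeholder) and pass each element through the two helpers to obtain the coordinates used in the later steps.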
Step 120, determining an abscissa set and an ordinate set according to the endpoint coordinates of the straight-line elements in the page object.
The endpoint coordinates of all the straight-line elements in the page object are used. An endpoint coordinate may be represented by a two-dimensional coordinate (x, y), where x is the abscissa of the endpoint and y is the ordinate of the endpoint. When the content of the PDF document is a table, the straight lines forming the table are identified to obtain the endpoint coordinates of the straight-line elements, and each endpoint coordinate is split into an abscissa x and an ordinate y.
The straight line in the table document includes a horizontal line and a vertical line, and the ordinate of both end points of the horizontal line is the same and the abscissa of both end points of the vertical line is the same. Therefore, in order to more briefly represent a horizontal line, an ordinate common to the end points of the horizontal line may be used to represent a certain horizontal line. Similarly, for more concise representation of vertical lines, an abscissa common to the endpoints of the vertical line may be used to represent a vertical line. From the coordinates of the end points of the straight line elements, a plurality of abscissas can be obtained for representing a plurality of vertical lines, and a plurality of ordinates can be obtained for representing a plurality of horizontal lines.
The plurality of abscissas form an abscissa set including the abscissas of the endpoints of all the straight-line elements in the page object, and the abscissa set is used to represent the abscissas of the vertical form lines (i.e., vertical lines) of the spreadsheet. If the abscissa is the same, deduplication is performed.
The plurality of ordinates form an ordinate set that includes the ordinates of the endpoints of all the straight-line elements in the page object; the ordinate set is used to represent the ordinates of the horizontal table lines (i.e., horizontal lines) of the spreadsheet. Identical ordinates are deduplicated.
After the abscissa set and the ordinate set are determined, two closely adjacent table lines may appear where only one is intended, for example because a table line has a shadow or because the document was scanned; such lines are fuzzy lines carried by the PDF document. To identify the table lines more accurately and avoid mistakenly identifying unnecessary lines, after determining the abscissa set and the ordinate set according to the endpoint coordinates of the straight-line elements in the page object in step 120, the method further includes:
if the difference value between the first abscissa and the second abscissa in the abscissa set is smaller than a preset distance threshold, deleting the first abscissa or deleting the second abscissa; and if the difference value of the first vertical coordinate and the second vertical coordinate in the vertical coordinate set is smaller than a preset distance threshold, deleting the first vertical coordinate or deleting the second vertical coordinate.
The preset distance threshold may be 0.5 pixels. In the abscissa set and the ordinate set, respectively, it is compared whether a difference between two adjacent coordinates (e.g., a first abscissa and a second abscissa, or a first ordinate and a second ordinate) is less than a preset distance threshold. If the distance is smaller than the preset distance threshold, the distance between the two adjacent coordinates is too short, the two adjacent coordinates belong to a fuzzy line, and one of the coordinates is deleted.
In the embodiment, the fuzzy lines can be accurately identified through the preset distance threshold, and the fuzzy lines are deleted, so that the accuracy of PDF document identification is improved.
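Step 120 and the fuzzy-line removal can be sketched as follows. This is an illustrative sketch, not the patented implementation: the function names are hypothetical, each line is assumed to be given as a pair of (x, y) endpoints, and of two coordinates closer than the threshold the smaller one is kept (the text allows deleting either).

```python
def merge_close(values, threshold):
    """Sort and deduplicate coordinates, dropping any value that lies
    closer than `threshold` to the previously kept value."""
    kept = []
    for value in sorted(set(values)):
        if kept and value - kept[-1] < threshold:
            continue  # treated as a fuzzy (shadow or scan) line
        kept.append(value)
    return kept


def coordinate_sets(lines, threshold=0.5):
    """Build the abscissa set and the ordinate set from line endpoints.

    `lines` is an iterable of ((x0, y0), (x1, y1)) endpoint pairs.
    """
    xs, ys = [], []
    for (x0, y0), (x1, y1) in lines:
        xs.extend((x0, x1))
        ys.extend((y0, y1))
    return merge_close(xs, threshold), merge_close(ys, threshold)
```

For example, a horizontal line at y = 0.3 directly below another at y = 0.0 is treated as a fuzzy line, and only one ordinate survives in the ordinate set.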
Step 130, determining character strings according to the coordinates of the character elements in the page object.
All the character elements contained in the page object may be acquired. The character elements have upper left and lower right coordinates. Since the character string is composed of a plurality of characters which are transversely distributed in the same line, a plurality of lines can be divided according to all character elements, and in each line, the character string is determined according to the space between the character elements. The rows may be uniformly divided using the upper left coordinates of the character elements. The rows may also be uniformly divided by the lower right coordinates of the character elements.
In one implementation, the step 130 of determining the character string according to the coordinates of the character elements in the page object can be implemented by:
Step 131, dividing the character elements in the target page object according to the ordinate to obtain a plurality of row groups.
The target page object is any page object in the PDF document. All the character elements in the target page object are grouped according to the ordinate (y-coordinate) of the upper left coordinate of the character elements, and the upper left coordinates of the character elements in each row group have the same ordinate.
Step 132, for each of the row groups, determining at least one character string according to the spacing between adjacent characters.
In each row group, the character sequence and the spacing between characters are determined according to the numerical value of the abscissa of the characters, and further, the character string is determined.
In the above embodiment, determining at least one character string according to the space between adjacent characters may be implemented as:
sorting the plurality of character elements in the row grouping according to the size of the abscissa in the upper left coordinate of the character element; in the sorting result, the pitches of two adjacent character elements are sequentially obtained. And if the distance is smaller than a preset distance threshold value, determining that two adjacent character elements belong to the same character string.
Within a row group, the character elements are sorted by the abscissa of their upper left coordinates, for example in ascending order, to obtain a sorting result. In the sorting result, the spacing between each pair of adjacent character elements is obtained in sequence. If the spacing is smaller than a preset distance threshold, the two adjacent character elements are determined to belong to the same character string. For character spacing, the preset distance threshold may be 1 pixel. The comparison then moves to the next pair of adjacent character elements, in which the character element with the larger abscissa in the previous pair becomes the character element with the smaller abscissa in the next pair.
Illustratively, the two adjacent character elements are a first character element and a second character element. And acquiring the distance between the first character element and the second character element, and if the distance is smaller than a preset distance threshold, determining that the first character element and the second character element belong to the same character string. And continuously comparing whether the distance between the second character element and the third character element is smaller than a preset distance threshold value. And if the distance is smaller than a preset distance threshold value, determining that the second character element and the third character element belong to the same character string. And if the distance is larger than the preset distance threshold value, determining that the second character element and the third character element belong to different character strings.
And by analogy, judging whether the distance between the Nth character element and the (N + 1) th character element is smaller than a preset distance threshold value, and if so, determining that the Nth character element and the (N + 1) th character element belong to the same character string. Otherwise, if the distance is larger than the preset distance threshold, the Nth character element and the (N + 1) th character element are not considered to belong to the same character string, and the identified character string is stored.
By taking the coordinates of the character elements as the basis, this embodiment can quickly and accurately determine the character strings in a row group, which improves the speed and accuracy of character string recognition.
Further, when the pitch of two adjacent character elements is obtained, it can be calculated in the following manner. The first character element and the second character element are assumed to be two adjacent character elements, and the horizontal coordinate value of the lower right coordinate of the first character element is smaller than the horizontal coordinate value of the upper left coordinate of the second character element. Correspondingly, obtaining the space between two adjacent character elements comprises the following steps:
calculating the difference value of the horizontal coordinate numerical value of the upper left coordinate of the second character element and the horizontal coordinate numerical value of the lower right coordinate of the first character element, and taking the difference value as the distance between the first character element and the second character element; the first character element and the second character element are two adjacent character elements, and the horizontal coordinate value of the lower right coordinate of the first character element is smaller than the horizontal coordinate value of the upper left coordinate of the second character element.
By means of the method, the distance between two adjacent characters can be calculated more accurately, and therefore the recognition accuracy of the character strings is improved.
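Steps 131 and 132 can be sketched as below, under stated assumptions: the `Char` tuple and both function names are hypothetical, characters in the same row are assumed to share exactly the same upper left ordinate (as the text describes), and the spacing is computed as the upper left abscissa of the second character minus the lower right abscissa of the first.

```python
from collections import defaultdict
from typing import NamedTuple


class Char(NamedTuple):
    text: str
    x0: float      # abscissa of the upper left coordinate
    top: float     # ordinate of the upper left coordinate
    x1: float      # abscissa of the lower right coordinate
    bottom: float  # ordinate of the lower right coordinate


def group_rows(chars):
    """Step 131: group characters by upper left ordinate, sorted by abscissa."""
    rows = defaultdict(list)
    for ch in chars:
        rows[ch.top].append(ch)
    return [sorted(rows[top], key=lambda c: c.x0) for top in sorted(rows)]


def split_strings(row, gap_threshold=1.0):
    """Step 132: split one non-empty sorted row group into character strings.

    Each result is (text, x0, top, x1, bottom), i.e. the upper left corner of
    the first character and the lower right corner of the last character.
    """
    groups = [[row[0]]]
    for prev, nxt in zip(row, row[1:]):
        gap = nxt.x0 - prev.x1
        if gap < gap_threshold:
            groups[-1].append(nxt)   # same character string
        else:
            groups.append([nxt])     # start a new character string
    return [
        ("".join(c.text for c in g), g[0].x0, g[0].top, g[-1].x1, g[-1].bottom)
        for g in groups
    ]
```

Grouping on exact equality of floating-point ordinates follows the description literally; real documents may need a small tolerance, which is outside what the text specifies.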
Step 140, determining the column identification of the character string according to the coordinates of the character string and the abscissa set.
The character string determined in step 130 may also be represented by an upper left coordinate and a lower right coordinate, where the upper left coordinate of the character string is the upper left coordinate of its first character element, and the lower right coordinate of the character string is the lower right coordinate of its last character element.
The set of abscissas is used to record the abscissas of the vertical form lines of the form. The abscissas in the set of abscissas can be sorted to obtain a plurality of columns in the table. Each column corresponds to two adjacent abscissas. From the abscissa in the upper left coordinate of the character string and the abscissa in the lower right coordinate of the character string, and the two abscissas representing the columns, the column in which the character string is located, i.e., the column identification, can be determined.
In one implementation, step 140 of determining the column identification of the character string according to the coordinates of the character string and the abscissa set can be implemented as follows:
acquiring a first abscissa interval occupied by the character string according to the first character and the last character contained in the character string; determining second abscissa intervals between adjacent vertical table lines according to the abscissa set; if the first abscissa interval lies within one second abscissa interval, taking the column identification corresponding to that second abscissa interval as the column identification of the character string represented by the first abscissa interval; if the first abscissa interval extends beyond one second abscissa interval but lies within the range of a plurality of consecutive second abscissa intervals, determining the column identification of a merged cell according to the column identifications of the smallest run of consecutive second abscissa intervals that contains the first abscissa interval; and taking the column identification of the merged cell as the column identification of the character string.
After the first abscissa interval and the plurality of second abscissa intervals are obtained, the start coordinate of the first abscissa interval is compared with the second abscissa intervals to find the second abscissa interval that contains it. If that second abscissa interval contains both the start coordinate and the end coordinate of the first abscissa interval, its column identification is used as the column identification of the character string represented by the first abscissa interval.
The table may include single cells as well as merged cells formed from a plurality of cells. If the character string is located in a merged cell, it may extend beyond one second abscissa interval. In that case, the adjacent second abscissa interval is obtained, and it is judged whether the character string lies within the abscissa interval formed by the two second abscissa intervals; if so, the column identification of the merged cell formed by the two second abscissa intervals is used as the column identification of the character string.
In this embodiment, the first abscissa interval can be determined accurately from the abscissas of the character elements in the character string, and the plurality of second abscissa intervals are determined from the abscissa set. From the inclusion relation between the second abscissa intervals and the first abscissa interval, the second abscissa interval in which the character string is located can be determined quickly and accurately, and the column identification of the character string is determined in turn, which improves the accuracy of character string positioning.
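The interval comparison of step 140 can be sketched as follows. The text does not say how the column identification of a merged cell is encoded, so this sketch assumes a hypothetical encoding: a pair (first column, last column) of 0-based indices, where an ordinary cell has both indices equal. The function name `column_id` is likewise an assumption.

```python
from bisect import bisect_right


def column_id(x_start, x_end, xs):
    """Return (first_col, last_col) for the interval [x_start, x_end].

    `xs` is the sorted abscissa set of the vertical table lines; column i is
    the second abscissa interval [xs[i], xs[i + 1]]. Returns None when the
    character string does not lie within the table horizontally.
    """
    if len(xs) < 2 or x_start < xs[0] or x_end > xs[-1]:
        return None
    # Second abscissa interval that contains the start coordinate.
    first = min(bisect_right(xs, x_start) - 1, len(xs) - 2)
    # Extend over consecutive intervals until the end coordinate is covered.
    last = first
    while xs[last + 1] < x_end:
        last += 1
    return first, last
```

With vertical lines at 0, 50 and 100, a character string spanning 10 to 40 falls in column 0, while one spanning 30 to 70 crosses the line at 50 and is reported as the merged span (0, 1).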
Step 150, determining the row identification of the character string according to the coordinates of the character string and the ordinate set.
The vertical coordinate set comprises vertical coordinates of horizontal table lines in the PDF document. The character string grouped by each line may be regarded as a character string located on the same line. Furthermore, the association relationship between the character strings and the ordinate can be established, and further the character strings in the same row associated with the ordinate can be determined according to the ordinate.
Further, before determining the row identification of the character string according to the coordinates of the character string and the ordinate set in step 150, the method further includes:
determining an interference character string outside a table range according to the ordinate set; and deleting the interference character string.
If the ordinate of the character string is smaller than the minimum ordinate in the ordinate set, the character string is determined to be located outside the table range. Likewise, if the ordinate of the character string is larger than the maximum ordinate in the ordinate set, the character string is determined to be located outside the table range.
The above embodiment can screen out contents other than the form, and further improve the accuracy of form recognition.
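The filtering of interference character strings and the row lookup of step 150 can be sketched together. The text does not spell out how the row identification is derived from the ordinate set, so the lookup below is an assumption: row i is taken to be the band between adjacent horizontal table lines `ys[i]` and `ys[i + 1]`, found by binary search. Both function names are hypothetical.

```python
from bisect import bisect_right


def inside_table(y, ys):
    """True if ordinate y lies between the smallest and largest ordinate."""
    return bool(ys) and ys[0] <= y <= ys[-1]


def row_id(y, ys):
    """Return the 0-based row index for ordinate y, or None for an
    interference character string outside the table range.

    `ys` is the sorted ordinate set of the horizontal table lines.
    """
    if len(ys) < 2 or not inside_table(y, ys):
        return None
    return min(bisect_right(ys, y) - 1, len(ys) - 2)
```

A page title above the first horizontal line or a footnote below the last one yields `None` and can be dropped before the spreadsheet is drawn.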
Step 160, drawing the spreadsheet according to the row identification and the column identification.
After the row identification and the column identification of the character string are obtained, the corresponding cell in the spreadsheet is determined according to the row identification and the column identification, and the character string is written into that cell. If the same cell contains a plurality of character strings, the character strings are grouped by line, and within each group the character strings are written in order of their abscissa values.
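Step 160 can be sketched with a plain grid that is then serialized. This is only an illustration under assumptions not stated in the text: the spreadsheet is modelled as a list of lists and written out as CSV with the standard library, character strings on the same line of a cell are joined with a space, and different lines within a cell are joined with a newline. The names `build_grid` and `to_csv` are hypothetical, and merged cells are not handled here.

```python
import csv
import io
from collections import defaultdict


def build_grid(placed, n_rows, n_cols):
    """Write character strings into cells.

    `placed` is an iterable of (row, col, top, x0, text) tuples, where row and
    col are the row and column identifications as 0-based indices.
    """
    cells = defaultdict(list)
    for row, col, top, x0, text in placed:
        cells[(row, col)].append((top, x0, text))

    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for (row, col), items in cells.items():
        lines = defaultdict(list)
        for top, x0, text in items:
            lines[top].append((x0, text))  # group the cell content by line
        grid[row][col] = "\n".join(
            " ".join(text for _, text in sorted(lines[top]))  # order by abscissa
            for top in sorted(lines)
        )
    return grid


def to_csv(grid):
    """Serialize the grid as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(grid)
    return buffer.getvalue()
```

A real spreadsheet writer could replace `to_csv` without changing how the cells are filled.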
According to the PDF document parsing method provided by the embodiment of the application, a page object is acquired from the PDF document; an abscissa set and an ordinate set are determined according to the endpoint coordinates of the straight-line elements in the page object; a character string is determined according to the coordinates of the character elements in the page object; the column identification of the character string is determined according to the coordinates of the character string and the abscissa set; the row identification of the character string is determined according to the coordinates of the character string and the ordinate set; and the spreadsheet is drawn according to the row identification and the column identification. In contrast to the low efficiency of current PDF document parsing, the method processes the straight-line elements and the character elements separately after the PDF document is parsed, so that the content in the PDF can be extracted quickly and accurately. Determining the abscissa set and the ordinate set from the endpoint coordinates of the straight-line elements determines the table area: vertical table lines are drawn according to the abscissas in the abscissa set and horizontal table lines according to the ordinates in the ordinate set, which fixes the table lines to be drawn in the spreadsheet. The character strings in the same row can be determined from the coordinates of the character elements, and the column identification of each character string can be determined from its coordinates and the abscissa set. Combined with the ordinate of the character string, the row identification of the character string in the spreadsheet can be determined accurately. The character strings of a table in the PDF are then extracted accurately into the spreadsheet according to the row identification and the column identification, which improves the efficiency of parsing tables in PDF documents.
Example two
Fig. 2 is a schematic flow chart of a PDF document parsing method according to a second embodiment of the present invention, which is used to further describe the above embodiment, and includes:
Step 201, acquiring a page object according to the PDF document.
Step 202, determining an abscissa set and an ordinate set according to the end point coordinates of the straight line elements in the page object.
Further, if the difference value between the first abscissa and the second abscissa in the abscissa set is smaller than a preset distance threshold, deleting the first abscissa or deleting the second abscissa; and if the difference value of the first vertical coordinate and the second vertical coordinate in the vertical coordinate set is smaller than a preset distance threshold, deleting the first vertical coordinate or deleting the second vertical coordinate.
Step 203, dividing the character elements in the target page object according to the ordinate to obtain a plurality of row groups.
Step 204, for each row group, sorting the plurality of character elements in the row group according to the abscissa of the upper left coordinate of each character element.
Step 205, sequentially acquiring the spacing between two adjacent character elements in the sorting result.
Specifically, the difference between the abscissa of the upper-left coordinate of the second character element and the abscissa of the lower-right coordinate of the first character element is calculated and taken as the spacing between the first character element and the second character element. The first character element and the second character element are two adjacent character elements, and the abscissa of the lower-right coordinate of the first character element is smaller than the abscissa of the upper-left coordinate of the second character element.
Step 206, if the spacing is smaller than a preset distance threshold, determining that the two adjacent character elements belong to the same character string; if the spacing is larger than the preset distance threshold, determining that the two adjacent character elements belong to different character strings.
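Steps 204 to 206 can be illustrated with a short sketch. The Char record and the split_strings function are hypothetical names introduced here. The text does not say what happens when the spacing equals the threshold; the sketch treats that case as a new character string, which is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    text: str
    x0: float  # abscissa of the upper-left coordinate
    x1: float  # abscissa of the lower-right coordinate


def split_strings(row: List[Char], threshold: float) -> List[List[Char]]:
    """Split one row group into character strings using the spacing between neighbours."""
    strings: List[List[Char]] = []
    for ch in sorted(row, key=lambda c: c.x0):   # step 204: sort by upper-left abscissa
        if strings:
            gap = ch.x0 - strings[-1][-1].x1     # step 205: spacing to the previous character
            if gap < threshold:                  # step 206: same character string
                strings[-1].append(ch)
                continue
        strings.append([ch])                     # otherwise start a new character string
    return strings


row = [Char("b", 6, 10), Char("a", 0, 5), Char("c", 30, 35), Char("d", 36, 40)]
result = ["".join(c.text for c in s) for s in split_strings(row, 3.0)]
```

In this example the spacing between "b" and "c" is 20, well above the threshold of 3, so the row is split into the two character strings "ab" and "cd".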
Step 207, obtaining a first abscissa interval occupied by the character string according to the first character and the last character contained in the character string.
Step 208, determining the second abscissa intervals between adjacent vertical table lines according to the abscissa set.
Step 209, if the first abscissa interval is located within a second abscissa interval, taking the column identifier corresponding to that second abscissa interval as the column identifier of the character string represented by the first abscissa interval.
Step 210, if the first abscissa interval extends beyond a single second abscissa interval but stays within the range of a plurality of consecutive second abscissa intervals, determining the column identifier of a merged cell according to the column identifiers corresponding to the smallest run of consecutive second abscissa intervals that contains the first abscissa interval; and taking the column identifier of the merged cell as the column identifier of the character string.
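Steps 207 to 210 can be sketched as an interval lookup over the sorted abscissa set. The function name column_span and the representation of a merged cell as a pair (first column, last column) of 0-based indices are assumptions made for illustration; the text only says that the merged cell's identifier is derived from the covered column identifiers.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Tuple


def column_span(x_start: float, x_end: float, xs: List[float]) -> Optional[Tuple[int, int]]:
    """Return the (first, last) 0-based columns covering [x_start, x_end].

    `xs` is the sorted abscissa set; column i lies between xs[i] and xs[i + 1].
    Returns None when the string is not horizontally inside the table.
    """
    if len(xs) < 2 or x_start < xs[0] or x_end > xs[-1]:
        return None
    # Column whose left border is at or before the start of the string.
    first = min(bisect_right(xs, x_start) - 1, len(xs) - 2)
    # Column whose right border is at or after the end of the string.
    last = max(bisect_left(xs, x_end) - 1, first)
    return first, last


xs = [0, 50, 100, 150]                   # three columns
single = column_span(5, 40, xs)          # fits inside the first column
merged = column_span(60, 140, xs)        # spans the second and third columns
outside = column_span(-5, 20, xs)        # starts left of the table
```

When the two indices are equal the string sits in an ordinary cell (step 209); when they differ, the pair describes the smallest run of consecutive columns that contains the string (step 210).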
Further, interference character strings located outside the table range are determined according to the ordinate set, and the interference character strings are deleted.
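A minimal sketch of this filtering is given below, assuming that the table range is bounded by the smallest and largest ordinates in the ordinate set and that each character string is given as a (text, top ordinate, bottom ordinate) tuple. Both the representation and the function name drop_interference are hypothetical.

```python
from typing import List, Tuple

# Assumed representation of a character string: (text, y_top, y_bottom).
Text = Tuple[str, float, float]


def drop_interference(strings: List[Text], ys: List[float]) -> List[Text]:
    """Keep only the strings whose vertical extent lies inside the table range."""
    if not ys:
        return []  # no table lines, so nothing is inside a table
    top, bottom = min(ys), max(ys)
    return [s for s in strings if top <= s[1] and s[2] <= bottom]


ys = [100, 120, 140]
strings = [("Title", 60, 75), ("cell", 104, 116), ("Footnote", 150, 160)]
kept = drop_interference(strings, ys)
```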
Step 211, determining the row identifier of the character string according to the coordinates of the character string and the ordinate set.
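Step 211 can be sketched in the same way as the column lookup. The text does not specify which ordinate of the character string is compared with the ordinate set, so the use of the vertical midpoint here is an assumption, as is the function name row_index.

```python
from bisect import bisect_right
from typing import List, Optional


def row_index(y_top: float, y_bottom: float, ys: List[float]) -> Optional[int]:
    """Return the 0-based row whose ordinate interval contains the string's vertical midpoint.

    `ys` is the sorted ordinate set; row i lies between ys[i] and ys[i + 1].
    """
    if len(ys) < 2:
        return None
    mid = (y_top + y_bottom) / 2
    if mid < ys[0] or mid > ys[-1]:
        return None
    return min(bisect_right(ys, mid) - 1, len(ys) - 2)


ys = [100, 120, 140]                 # two rows
first_row = row_index(104, 116, ys)
second_row = row_index(122, 138, ys)
no_row = row_index(60, 75, ys)
```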
Step 212, drawing the electronic table according to the row identification and the column identification.
According to the PDF document parsing method provided by this embodiment, the straight line elements and the character elements obtained by parsing the PDF document are processed separately, so that the content of the PDF can be extracted quickly and accurately. Determining the abscissa set and the ordinate set from the endpoint coordinates of the straight line elements in the page object determines the table area: vertical table lines are drawn at the abscissas in the abscissa set and horizontal table lines at the ordinates in the ordinate set, which determines the table lines to be drawn in the electronic table. The character strings in the same row are determined from the coordinates of the character elements, and the column identifier of each character string is determined from the coordinates of the character string and the abscissa set. The row identifier of each character string in the electronic table is determined accurately by combining the ordinate of the character string with the ordinate set. The character strings of the table in the PDF are then extracted accurately into the electronic table according to the row identifier and the column identifier, which improves the efficiency of parsing tables in PDF documents.
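As a rough illustration of step 212, the sketch below places character strings into a grid by row and column identifier and serializes the grid as CSV text. The embodiment does not name an output format, so CSV, the function names build_grid and to_csv, and the dictionary keyed by (row, column) are all assumptions chosen to keep the example self-contained.

```python
import csv
import io
from typing import Dict, List, Tuple


def build_grid(cells: Dict[Tuple[int, int], str], n_rows: int, n_cols: int) -> List[List[str]]:
    """Place each string at its (row, column) position; empty cells stay blank."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for (r, c), text in cells.items():
        grid[r][c] = text
    return grid


def to_csv(grid: List[List[str]]) -> str:
    """Serialize the grid as CSV text, one table row per line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(grid)
    return buf.getvalue()


cells = {(0, 0): "Name", (0, 1): "Amount", (1, 0): "Fee", (1, 1): "12.5"}
grid = build_grid(cells, 2, 2)
text = to_csv(grid)
```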
Example three
Fig. 3 is a schematic structural diagram of a PDF document parsing apparatus according to a third embodiment of the present invention. This embodiment is applicable to parsing a PDF document. The apparatus may be implemented by a computer device used for parsing PDF documents; the computer device may be a personal computer or a notebook computer, or a terminal such as a smartphone or a tablet computer. The apparatus specifically includes a page object obtaining module 310, a coordinate set determining module 320, a character string determining module 330, a column identifier determining module 340, a row identifier determining module 350, and a drawing module 360.
A page object obtaining module 310, configured to obtain a page object according to a PDF document;
the coordinate set determining module 320 is used for determining an abscissa set and an ordinate set according to the endpoint coordinates of the linear elements in the page object;
the character string determining module 330 is configured to determine a character string according to coordinates of character elements in the page object;
a column identifier determining module 340, configured to determine a column identifier of the character string according to the coordinates of the character string and the abscissa set;
a row identifier determining module 350, configured to determine a row identifier of the character string according to the coordinates of the character string and the ordinate set;
and a drawing module 360 for drawing the electronic table according to the row identifier and the column identifier.
On the basis of the above embodiment, the character string determination module 330 is configured to:
dividing the character elements in the target page object according to their ordinates to obtain a plurality of row groups;
for each row grouping, at least one character string is determined according to the spacing between adjacent characters.
On the basis of the above embodiment, the character string determination module 330 is configured to:
sorting the plurality of character elements in the row grouping according to the size of the abscissa in the upper left coordinate of the character element;
sequentially acquiring the space between two adjacent character elements in the sequencing result;
and if the distance is smaller than a preset distance threshold value, determining that two adjacent character elements belong to the same character string.
On the basis of the above embodiment, the character string determination module 330 is configured to:
calculating the difference value of the horizontal coordinate numerical value of the upper left coordinate of the second character element and the horizontal coordinate numerical value of the lower right coordinate of the first character element, and taking the difference value as the distance between the first character element and the second character element; the first character element and the second character element are two adjacent character elements, and the horizontal coordinate value of the lower right coordinate of the first character element is smaller than the horizontal coordinate value of the upper left coordinate of the second character element.
On the basis of the above embodiment, the column identification determining module 340 is configured to:
acquiring a first abscissa interval occupied by the character string according to a first character and a last character contained in the character string;
determining a second abscissa interval between adjacent longitudinal form lines according to the abscissa set;
if the first abscissa interval is located in the second abscissa interval, taking the column identifier corresponding to the second abscissa interval as the column identifier of the character string represented by the first abscissa interval;
if the first abscissa interval extends beyond a single second abscissa interval but stays within the range of a plurality of consecutive second abscissa intervals, determining the column identifier of a merged cell according to the column identifiers corresponding to the smallest run of consecutive second abscissa intervals that contains the first abscissa interval; and taking the column identifier of the merged cell as the column identifier of the character string.
On the basis of the above embodiment, the apparatus further includes an interference character string processing module, which is configured to:
determining an interference character string outside a table range according to the ordinate set;
and deleting the interference character string.
On the basis of the above embodiment, the apparatus further includes a coordinate set deduplication module, which is configured to, after the abscissa set and the ordinate set are determined according to the endpoint coordinates of the straight line elements in the page object:
if the difference value between the first abscissa and the second abscissa in the abscissa set is smaller than a preset distance threshold, deleting the first abscissa or deleting the second abscissa;
and if the difference value of the first vertical coordinate and the second vertical coordinate in the vertical coordinate set is smaller than a preset distance threshold, deleting the first vertical coordinate or deleting the second vertical coordinate.
In the PDF document parsing apparatus provided in this embodiment, the page object obtaining module 310 is configured to obtain a page object according to a PDF document; the coordinate set determining module 320 is configured to determine an abscissa set and an ordinate set according to the endpoint coordinates of the straight line elements in the page object; the character string determining module 330 is configured to determine a character string according to the coordinates of the character elements in the page object; the column identifier determining module 340 is configured to determine a column identifier of the character string according to the coordinates of the character string and the abscissa set; the row identifier determining module 350 is configured to determine a row identifier of the character string according to the coordinates of the character string and the ordinate set; and the drawing module 360 is configured to draw the electronic table according to the row identifier and the column identifier. Compared with current approaches, which parse PDF documents inefficiently, the apparatus provided by this embodiment processes the straight line elements and the character elements separately after the PDF document is parsed, so that the content of the PDF can be extracted quickly and accurately. Determining the abscissa set and the ordinate set from the endpoint coordinates of the straight line elements in the page object determines the table area: vertical table lines are drawn at the abscissas in the abscissa set and horizontal table lines at the ordinates in the ordinate set, which determines the table lines to be drawn in the electronic table. The character strings in the same row are determined from the coordinates of the character elements, and the column identifier of each character string is determined from the coordinates of the character string and the abscissa set. The row identifier of each character string in the electronic table is determined accurately by combining the ordinate of the character string with the ordinate set. The character strings of the table in the PDF are then extracted accurately into the electronic table according to the row identifier and the column identifier, which improves the efficiency of parsing tables in PDF documents.
The PDF document analysis device provided by the embodiment of the invention can execute the PDF document analysis method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example four
Fig. 4 is a schematic structural diagram of a computer apparatus according to a fourth embodiment of the present invention, as shown in fig. 4, the computer apparatus includes a processor 40, a memory 41, an input device 42, and an output device 43; the number of processors 40 in the computer device may be one or more, and one processor 40 is taken as an example in fig. 4; the processor 40, the memory 41, the input device 42 and the output device 43 in the computer apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 4.
The memory 41 serves as a computer-readable storage medium, and may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the parsing method of the PDF document in the embodiment of the present invention (for example, the page object obtaining module 310, the coordinate set determining module 320, the character string determining module 330, the column identifier determining module 340, the row identifier determining module 350, and the drawing module 360 in the parsing apparatus of the PDF document). The processor 40 executes various functional applications and data processing of the computer device by running software programs, instructions and modules stored in the memory 41, that is, implements the above-described parsing method of the PDF document.
The memory 41 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 41 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 41 may further include memory located remotely from processor 40, which may be connected to a computer device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 42 is operable to receive input numeric or character information and to generate key signal inputs relating to user settings and function controls of the computer apparatus. The output device 43 may include a display device such as a display screen.
Example five
An embodiment of the present invention further provides a storage medium containing computer-executable instructions, where the computer-executable instructions are executed by a computer processor to perform a method for parsing a PDF document, and the method includes:
acquiring a page object according to the PDF document;
determining an abscissa set and an ordinate set according to the endpoint coordinates of the linear elements in the page object;
determining a character string according to coordinates of character elements in the page object;
determining the column identifier of the character string according to the coordinates of the character string and the abscissa set;
determining the row identifier of the character string according to the coordinates of the character string and the ordinate set;
and drawing the electronic table according to the row identification and the column identification.
On the basis of the above embodiment, determining a character string according to coordinates of character elements in a page object includes:
dividing the character elements in the target page object according to their ordinates to obtain a plurality of row groups;
for each row grouping, at least one character string is determined according to the spacing between adjacent characters.
On the basis of the above embodiment, determining at least one character string according to the space between adjacent characters includes:
sorting the plurality of character elements in the row grouping according to the size of the abscissa in the upper left coordinate of the character element;
sequentially acquiring the space between two adjacent character elements in the sequencing result;
and if the distance is smaller than a preset distance threshold value, determining that two adjacent character elements belong to the same character string.
On the basis of the above embodiment, acquiring the space between two adjacent character elements includes:
calculating the difference value of the horizontal coordinate numerical value of the upper left coordinate of the second character element and the horizontal coordinate numerical value of the lower right coordinate of the first character element, and taking the difference value as the distance between the first character element and the second character element; the first character element and the second character element are two adjacent character elements, and the horizontal coordinate value of the lower right coordinate of the first character element is smaller than the horizontal coordinate value of the upper left coordinate of the second character element.
On the basis of the above embodiment, determining the column identifier of the character string according to the coordinates of the character string and the abscissa set includes:
acquiring a first abscissa interval occupied by the character string according to a first character and a last character contained in the character string;
determining a second abscissa interval between adjacent longitudinal form lines according to the abscissa set;
if the first abscissa interval is located in the second abscissa interval, taking the column identifier corresponding to the second abscissa interval as the column identifier of the character string represented by the first abscissa interval;
if the first abscissa interval extends beyond a single second abscissa interval but stays within the range of a plurality of consecutive second abscissa intervals, determining the column identifier of a merged cell according to the column identifiers corresponding to the smallest run of consecutive second abscissa intervals that contains the first abscissa interval; and taking the column identifier of the merged cell as the column identifier of the character string.
On the basis of the above embodiment, before determining the row identifier of the character string according to the coordinates of the character string and the ordinate set, the method further includes:
determining an interference character string outside a table range according to the ordinate set;
and deleting the interference character string.
On the basis of the above embodiment, after determining the abscissa set and the ordinate set according to the endpoint coordinates of the line elements in the page object, the method further includes:
if the difference value between the first abscissa and the second abscissa in the abscissa set is smaller than a preset distance threshold, deleting the first abscissa or deleting the second abscissa;
and if the difference value of the first vertical coordinate and the second vertical coordinate in the vertical coordinate set is smaller than a preset distance threshold, deleting the first vertical coordinate or deleting the second vertical coordinate.
Of course, in the storage medium containing computer-executable instructions provided by the embodiment of the present invention, the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the PDF document parsing method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods of the embodiments of the present invention.
It should be noted that, in the embodiment of the PDF document parsing apparatus, each included unit and module are only divided according to functional logic, but are not limited to the above division as long as the corresponding function can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A PDF document parsing method is characterized by comprising the following steps:
acquiring a page object according to the PDF document;
determining an abscissa set and an ordinate set according to the endpoint coordinates of the linear elements in the page object;
determining a character string according to the coordinates of the character elements in the page object;
determining column identification of the character string according to the coordinates of the character string and the abscissa set;
determining the row identification of the character string according to the coordinate of the character string and the vertical coordinate set;
and drawing the electronic table according to the row identification and the column identification.
2. The method of claim 1, wherein determining a character string according to coordinates of character elements in the page object comprises:
dividing character elements in the target page object according to the vertical coordinates to obtain a plurality of line groups;
for each of the row groupings, at least one string is determined from the spacing between adjacent characters.
3. The method of claim 2, wherein determining at least one string based on a spacing between adjacent characters comprises:
sorting the plurality of character elements in the row grouping according to the size of an abscissa in an upper left coordinate of the character elements;
sequentially acquiring the space between two adjacent character elements in the sequencing result;
and if the distance is smaller than a preset distance threshold value, determining that the two adjacent character elements belong to the same character string.
4. The method of claim 3, wherein obtaining the spacing between two adjacent character elements comprises:
calculating the difference value of the horizontal coordinate numerical value of the upper left coordinate of the second character element and the horizontal coordinate numerical value of the lower right coordinate of the first character element, and taking the difference value as the distance between the first character element and the second character element; the first character element and the second character element are two adjacent character elements, and the horizontal coordinate value of the lower right coordinate of the first character element is smaller than the horizontal coordinate value of the upper left coordinate of the second character element.
5. The method of claim 1, wherein determining the column identifier of the character string from the coordinates of the character string and the set of abscissa comprises:
acquiring a first abscissa interval occupied by the character string according to a first character and a last character contained in the character string;
determining a second abscissa interval between adjacent longitudinal form lines according to the abscissa set;
if the first abscissa interval is located in the second abscissa interval, taking the column identifier corresponding to the second abscissa interval as the column identifier of the character string represented by the first abscissa interval;
if the first coordinate interval exceeds a second coordinate interval and does not exceed the range of a plurality of continuous coordinate intervals, determining the column identifier of the merging cell according to a plurality of column identifiers corresponding to the plurality of continuous second coordinate intervals which minimally contain the first coordinate interval; and taking the column identification of the merging cell as the column identification of the character string.
6. The method of claim 1, further comprising, prior to determining the row identification of the character string according to the coordinates of the character string and the ordinate set:
determining an interference character string outside a table range according to the ordinate set;
and deleting the interference character string.
7. The method of claim 1, after determining the set of abscissas and the set of ordinates from the coordinates of the endpoints of the line elements in the page object, further comprising:
if the difference value between the first abscissa and the second abscissa in the abscissa set is smaller than a preset distance threshold, deleting the first abscissa or deleting the second abscissa;
and if the difference value of the first vertical coordinate and the second vertical coordinate in the vertical coordinate set is smaller than a preset distance threshold, deleting the first vertical coordinate or deleting the second vertical coordinate.
8. A PDF document parsing device, comprising:
the page object acquisition module is used for acquiring a page object according to the PDF document;
the coordinate set determining module is used for determining an abscissa set and an ordinate set according to the endpoint coordinates of the linear elements in the page object;
the character string determining module is used for determining character strings according to the coordinates of the character elements in the page object;
the column identification determining module is used for determining the column identification of the character string according to the coordinates of the character string and the abscissa set;
the row identification determining module is used for determining the row identification of the character string according to the coordinate of the character string and the vertical coordinate set;
and the drawing module is used for drawing the electronic table according to the row identification and the column identification.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of parsing a PDF document according to any one of claims 1 to 7 when executing the program.
10. A storage medium containing computer executable instructions for performing a method of parsing a PDF document according to any one of claims 1-7 when executed by a computer processor.
CN202111082611.1A 2021-09-15 2021-09-15 PDF document analysis method and device, electronic equipment and storage medium Pending CN113850265A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111082611.1A CN113850265A (en) 2021-09-15 2021-09-15 PDF document analysis method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111082611.1A CN113850265A (en) 2021-09-15 2021-09-15 PDF document analysis method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113850265A true CN113850265A (en) 2021-12-28

Family

ID=78974152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111082611.1A Pending CN113850265A (en) 2021-09-15 2021-09-15 PDF document analysis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113850265A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115618847A (en) * 2022-12-20 2023-01-17 浙江保融科技股份有限公司 Method and device for analyzing PDF document and readable storage medium

Similar Documents

Publication Publication Date Title
JP6710483B2 (en) Character recognition method for damages claim document, device, server and storage medium
CN110795919B (en) Form extraction method, device, equipment and medium in PDF document
CN110321470B (en) Document processing method, device, computer equipment and storage medium
CN108932294B (en) Resume data processing method, device, equipment and storage medium based on index
CN110334346B (en) Information extraction method and device of PDF (Portable document Format) file
CN108563739B (en) Weather data acquisition method and device, computer device and readable storage medium
CN106709032A (en) Method and device for extracting structured information from spreadsheet document
CN107689070B (en) Chart data structured extraction method, electronic device and computer-readable storage medium
CN112380825B (en) PDF document cross-page table merging method and device, electronic equipment and storage medium
CN110909123A (en) Data extraction method and device, terminal equipment and storage medium
CN115828874A (en) Industry table digital processing method based on image recognition technology
CN110990010A (en) Software interface code generation method and device
CN115687655A (en) PDF document-based knowledge graph construction method, system, equipment and storage medium
CN112651331A (en) Text table extraction method, system, computer device and storage medium
JP2022185143A (en) Text detection method, and text recognition method and device
CN109871743B (en) Text data positioning method and device, storage medium and terminal
CN113283231B (en) Method for acquiring signature bit, setting system, signature system and storage medium
CN113850265A (en) PDF document analysis method and device, electronic equipment and storage medium
CN109063155B (en) Language model parameter determination method and device and computer equipment
CN112749639B (en) Model training method and device, computer equipment and storage medium
CN112528832A (en) Method and system for processing PDF-format relay protection fixed value list
CN111651971A (en) Form information transcription method, system, electronic equipment and storage medium
CN113806472A (en) Method and equipment for realizing full-text retrieval of character, picture and image type scanning piece
CN114155547B (en) Chart identification method, device, equipment and storage medium
CN116226681A (en) Text similarity judging method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination