CN106709032B - Method and device for extracting structured information in electronic form document - Google Patents

Method and device for extracting structured information in electronic form document

Info

Publication number
CN106709032B
Authority
CN
China
Prior art keywords
cells
business
spreadsheet document
column
cell
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611245472.9A
Other languages
Chinese (zh)
Other versions
CN106709032A (en)
Inventor
张军
贾西贝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huaao Data Technology Co Ltd
Original Assignee
Shenzhen Huaao Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huaao Data Technology Co Ltd filed Critical Shenzhen Huaao Data Technology Co Ltd
Priority to CN201611245472.9A priority Critical patent/CN106709032B/en
Publication of CN106709032A publication Critical patent/CN106709032A/en
Application granted granted Critical
Publication of CN106709032B publication Critical patent/CN106709032B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of data processing, and particularly relates to a method and a device for extracting structured information in an electronic form document. The invention provides a method for extracting structured information in an electronic form document, which comprises the following steps: acquiring all business tables in the electronic form document through an isolated table identification algorithm; performing layout analysis on the business table; and extracting contents from the business table according to the layout analysis result, and performing corresponding conversion processing to obtain structured information. The method and the device for extracting the structured information in the electronic form document realize the function of automatically acquiring all the business forms in the electronic form document in batches, and improve the efficiency of large-scale data extraction.

Description

Method and device for extracting structured information in electronic form document
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for extracting structured information in an electronic form document.
Background
Spreadsheet documents, such as Excel files, are produced by spreadsheet software, yet the data they hold is still unstructured or semi-structured. Furthermore, a spreadsheet document may contain multiple worksheets (tabs), each worksheet may contain multiple isolated business tables, and the layout of each business table may be arbitrary. Therefore, the data in these tables cannot be used directly; it must be extracted, processed, and converted into structured data. Existing data extraction algorithms have difficulty handling such complex and variable situations.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a method and a device for extracting structured information from a spreadsheet document, which automatically acquire all business tables in the spreadsheet document in batches and improve the efficiency of large-scale data extraction.
In a first aspect, the present invention provides a method for extracting structured information from a spreadsheet document, comprising: acquiring all business tables in the electronic form document through an isolated table identification algorithm; performing layout analysis on the business table; and extracting contents from the business table according to the layout analysis result, and performing corresponding conversion processing to obtain structured information.
According to the method for extracting the structured information in the electronic form document, all independent business forms in the electronic form document can be automatically obtained in batch through an isolated form identification algorithm, so that the efficiency of large-scale data extraction is improved; by extracting the business data after the layout analysis is carried out on the business table, the reliability of the extracted data is improved, and the method is particularly more effective for the identification and extraction of large-scale semi-structured data.
Preferably, acquiring all business tables in the spreadsheet document through the isolated table identification algorithm includes: establishing two two-dimensional bit arrays with the same size as the spreadsheet document, denoted A and B; traversing all cells in the spreadsheet document and, if a cell has content, marking the corresponding position in A as 1, otherwise as 0; traversing all cells in the spreadsheet document and marking B according to the border lines of the cells; if a value in B is 1, setting the value at the same position in A to 1; and acquiring the business table coordinates in the spreadsheet document according to the updated A.
Preferably, traversing all the cells in the spreadsheet document, and marking B according to the border line of the cell comprises: and traversing all the cells in the spreadsheet document, and if at least one of the four corners of the cell has two frame lines, marking the corresponding position in the B as 1.
Preferably, after traversing all the cells in the spreadsheet document and marking the corresponding position in B as 1 if at least one of the four corners of a cell has two border lines, the method further includes: step S132, traversing all the cells in the spreadsheet document again, and if a cell has a border line, its corresponding value in B is 0, and at least one of its four adjacent cells (upper, lower, left, and right) is marked as 1 in B, marking the position of the cell in B as 1; step S133, traversing all the cells in the spreadsheet document again, and if the corresponding value of a cell in B is 0 while the other three cells of a 2 × 2 area containing the cell all have the value 1 in B, marking the cell in B as 1 and adding 1 to a counter; step S134, if the counter is not 0, clearing the counter and executing step S133 again.
Preferably, acquiring the business table coordinates in the spreadsheet document according to the updated A includes: performing a reduction operation on the updated A to obtain LA; and acquiring the business table coordinates in the spreadsheet document according to LA.
Preferably, performing the reduction operation on the updated A to obtain LA includes: traversing the columns of A from the leftmost side, and when a column containing a value of 1 is found, recording its column coordinate X1 and terminating the traversal; traversing the columns of A from the rightmost side, and when a column containing a value of 1 is found, recording its column coordinate X2 and terminating the traversal; traversing the rows of A from the top, and when a row containing a value of 1 is found, recording its row coordinate Y1 and terminating the traversal; traversing the rows of A from the bottom, and when a row containing a value of 1 is found, recording its row coordinate Y2 and terminating the traversal; extracting the data at positions [X1, X2, Y1, Y2] in A to form a two-dimensional bit array LA, and determining the coordinate mapping between LA and A according to X1, X2, Y1, and Y2.
Preferably, acquiring the business table coordinates in the spreadsheet document according to LA includes: if all values in LA are 1, the spreadsheet document contains only one table, and the business table coordinates are [X1, X2, Y1, Y2]; otherwise, detecting whether the cell in column X1 and row Y1 of the spreadsheet document is empty and, if it is not empty, detecting the remaining cells to the right until an empty cell is found, and recording the column coordinate of that empty cell as X3; detecting the cells in column X1 from top to bottom until an empty cell is found, recording the row coordinate of that empty cell as the maximum row coordinate of column X1, and continuing with the next column until column X3 has been processed; if the maximum of all the maximum row coordinates is Y3, the business table coordinates are [X1, X3, Y1, Y3], and the contents of the positions corresponding to [X1, X3, Y1, Y3] in LA are set to 0 to obtain a new LA; and acquiring business table coordinates according to the updated LA until all business tables in the spreadsheet document have been extracted.
Preferably, the performing layout analysis on the business table includes: detecting a header part in the service table; extracting multi-dimensional information of the header-removed part in the business table; and judging the table layout according to the extracted multi-dimensional information.
In a second aspect, the present invention provides an apparatus for extracting structured information from a spreadsheet document, comprising: the business table acquisition module is used for acquiring all business tables in the electronic table document through an isolated table identification algorithm; the table layout analysis module is used for carrying out layout analysis on the business table; and the table information extraction module is used for extracting contents from the business table according to the layout analysis result and performing corresponding conversion processing to obtain the structured information.
According to the device for extracting the structured information in the electronic form document, all independent business forms in the electronic form document can be automatically obtained in batches through an isolated form identification algorithm, so that the efficiency of large-scale data extraction is improved; by extracting the business data after the layout analysis is carried out on the business table, the reliability of the extracted data is improved, and the method is particularly more effective for the identification and extraction of large-scale semi-structured data.
Preferably, the business table acquisition module is specifically configured to: establish two two-dimensional bit arrays with the same size as the spreadsheet document, denoted A and B; traverse all cells in the spreadsheet document and, if a cell has content, mark the corresponding position in A as 1, otherwise as 0; traverse all cells in the spreadsheet document and mark B according to the border lines of the cells; if a value in B is 1, set the value at the same position in A to 1; and acquire the business table coordinates in the spreadsheet document according to the updated A.
Drawings
FIG. 1 is a flow chart of a method for extracting structured information from a spreadsheet document according to an embodiment of the present invention;
FIG. 2 is a layout of a header section, a remark section, and a business data section in an exemplary table;
FIG. 3 is an example of a vertical multiple TL layout;
FIG. 4 is an example of a lateral multiple TL layout;
FIG. 5 is an example of cut merging of tables of a multiple TL layout;
FIG. 6 is an example of cut merging of tables of a multiple TL layout;
FIG. 7 is an example of processing a table for a single TL (multi-level) layout;
FIG. 8 is an example of an electronic document containing a plurality of mutually independent business forms;
FIG. 9 is a table with only outline border lines;
FIG. 10 is a block diagram of an apparatus for extracting structured information from a spreadsheet document according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.
As shown in FIG. 1, the present embodiment provides a method for extracting structured information from a spreadsheet document, comprising:
step S1, obtain all business forms in the spreadsheet document through the isolated form identification algorithm.
Common spreadsheet documents include Excel files, .ods files from OpenOffice, and the like, but are not limited to these. As shown in fig. 8, a spreadsheet document may include a plurality of business tables that are independent of each other, and all business tables in the document are extracted through the isolated table identification algorithm. A business table is a table containing business data.
And step S2, performing layout analysis on the business table.
And step S3, extracting contents from the business table according to the layout analysis result, and performing corresponding conversion processing to obtain structured information.
The conversion processing comprises splitting and merging data blocks, deleting blank lines, replacing special characters and the like.
According to the method for extracting the structured information in the electronic form document, all independent business forms in the electronic form document can be automatically obtained in batch through an isolated form identification algorithm, and the efficiency of large-scale data extraction is improved; by extracting the business data after the layout analysis is carried out on the business table, the reliability of the extracted data is improved, and the method is particularly more effective for the identification and extraction of large-scale semi-structured data.
In order to improve the accuracy of extracting the business tables, the isolated table identification algorithm in step S1 specifically includes the following steps:
step S11, two-dimensional bit arrays with the same size as the spreadsheet document are established and marked as A and B.
The size of the spreadsheet document is its number of cells: the number of rows of each two-dimensional bit array equals the number of cell rows, and the number of columns equals the number of cell columns. Because each element needs only one bit, a two-dimensional bit array is compact and simple to process, which helps processing speed.
Step S12, traverse all cells in the spreadsheet document; if a cell has content, mark the corresponding position in A as 1, otherwise mark it as 0.
The cells in the spreadsheet document correspond one-to-one to the elements of the two-dimensional bit array A; for example, if the cell in the first row and first column has content, the element in the first row and first column of A is marked as 1. The purpose of step S12 is to mark the cells that belong to a business table according to their content: a cell whose position is marked 1 in A belongs to a business table.
Step S13, traverse all cells in the spreadsheet document and mark B according to the border lines of the cells.
The purpose of step S13 is to mark the cells that belong to a business table according to whether they have border lines: a cell whose position is marked 1 in B belongs to a business table. When a business table contains empty rows, judging by cell content alone reduces accuracy, so adding the border-line check improves the accuracy of the judgment.
In step S15, if a value in B is 1, the value at the same position in A is set to 1. The purpose of this step is to add the cells that were missed when A was marked.
And step S16, acquiring business table coordinates in the electronic form document according to the updated A.
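Steps S12 and S15 can be illustrated with a minimal Python sketch. It assumes the worksheet has already been read into a list of rows of cell values and that the border bitmap B has been produced separately; the function names and the rule that whitespace-only cells count as empty are illustrative assumptions, not part of the described method.

```python
from typing import List, Optional

Bitmap = List[List[int]]


def content_bitmap(cells: List[List[Optional[str]]]) -> Bitmap:
    """Step S12: 1 where a cell has content, 0 otherwise.

    Assumption: None and whitespace-only strings count as "no content".
    """
    return [
        [1 if value is not None and str(value).strip() != "" else 0 for value in row]
        for row in cells
    ]


def merge_border_bitmap(a: Bitmap, b: Bitmap) -> Bitmap:
    """Step S15: wherever B is 1, set the same position in A to 1."""
    return [[x | y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


cells = [
    ["Name", "Age", None],
    [None, None, None],  # an empty row inside a bordered table
    ["Li", "40", None],
]
# Hypothetical border bitmap B, as step S13 might produce it for this sheet.
b = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 0],
]
a = merge_border_bitmap(content_bitmap(cells), b)
```

In this example the empty middle row is 0 in the content bitmap but becomes 1 after merging with B, which is the gap that step S15 is meant to close.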
Further, a preferred embodiment of step S13 specifically includes the following steps:
step S131, traversing all cells in the spreadsheet document; if at least one of the four corners of a cell has two border lines, that is, both edges meeting at that corner carry a border line, then the corresponding position in B is marked as 1.
A border line here means a line that has been set on the cell and is actually displayed, not the auxiliary gridlines that a program such as Excel shows to help the user distinguish cells.
Some tables have only the outer border lines and no border lines inside, as shown in fig. 9, and in this case, step S131 can recognize only the cells at the four corners of the table, i.e., the four cells labeled "1" in fig. 9.
Step S132, traverse all the cells in the spreadsheet document again, and mark the position of a cell in B as 1 if the cell has a border line, its corresponding value in B is 0, and at least one of its four adjacent cells (upper, lower, left, and right) is marked as 1 in B.
The cell that is one turn close to the outside frame line, i.e., the cell labeled "2" in fig. 9, can be identified by step S132.
Step S133, traverse all cells in the spreadsheet document again, if the cell corresponds to a value of 0 on B and the other three cells all correspond to a value of 1 on B in the 2 × 2 area containing the cell, mark the cell as 1 on B and add 1 to the counter.
The counter records how many cells were marked during the traversal.
By repeating step S133, the mark inside the table, such as the cell marked with "3" in fig. 9, can be completed.
In step S134, after step S133 is executed, if the counter is not 0, the counter is cleared, and step S133 is executed again.
If the counter is not 0, some cells may still be unmarked, so it is necessary to return to step S133 and mark again. If the counter is 0, no further cells can be marked and the spreadsheet document does not need to be traversed again; at this point, every area enclosed by border lines is marked as 1.
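Steps S131 to S134 can be sketched as follows. The sketch assumes each cell's displayed border lines are available as a set of side names ("top", "bottom", "left", "right"), which is a hypothetical input format, and it assumes that B is updated in place during each traversal. The helper `outline` only builds test data for a table with outer borders, as in fig. 9.

```python
from typing import List, Set

CORNERS = [("top", "left"), ("top", "right"), ("bottom", "left"), ("bottom", "right")]


def _completes_2x2(b: List[List[int]], r: int, c: int) -> bool:
    """True if some 2 x 2 block containing (r, c) has its other three cells set."""
    rows, cols = len(b), len(b[0])
    for r0 in (r - 1, r):
        for c0 in (c - 1, c):
            if r0 < 0 or c0 < 0 or r0 + 1 >= rows or c0 + 1 >= cols:
                continue
            block = [(r0, c0), (r0, c0 + 1), (r0 + 1, c0), (r0 + 1, c0 + 1)]
            if all(b[i][j] == 1 for i, j in block if (i, j) != (r, c)):
                return True
    return False


def border_bitmap(borders: List[List[Set[str]]]) -> List[List[int]]:
    rows, cols = len(borders), len(borders[0])
    b = [[0] * cols for _ in range(rows)]
    # S131: mark cells where both edges meeting at some corner are drawn.
    for r in range(rows):
        for c in range(cols):
            if any(s1 in borders[r][c] and s2 in borders[r][c] for s1, s2 in CORNERS):
                b[r][c] = 1
    # S132: mark bordered, unmarked cells that touch an already marked neighbour.
    for r in range(rows):
        for c in range(cols):
            if borders[r][c] and b[r][c] == 0:
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols and b[nr][nc] == 1:
                        b[r][c] = 1
                        break
    # S133/S134: fill the interior, repeating until a pass marks nothing.
    while True:
        counter = 0
        for r in range(rows):
            for c in range(cols):
                if b[r][c] == 0 and _completes_2x2(b, r, c):
                    b[r][c] = 1
                    counter += 1
        if counter == 0:
            return b


def outline(rows: int, cols: int, width: int) -> List[List[Set[str]]]:
    """Test data: a rows x cols table with outer borders only, in a wider sheet."""
    grid: List[List[Set[str]]] = [[set() for _ in range(width)] for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if r == 0:
                grid[r][c].add("top")
            if r == rows - 1:
                grid[r][c].add("bottom")
            if c == 0:
                grid[r][c].add("left")
            if c == cols - 1:
                grid[r][c].add("right")
    return grid


result = border_bitmap(outline(4, 4, 5))
```

For the 4 × 4 outlined table placed in a sheet five columns wide, all sixteen table cells end up marked while the fifth column stays 0.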
Specifically, step S16 includes:
in step S161, a reduction operation is performed on the updated A to obtain LA.
The reduction operation is to remove a large amount of contents which do not belong to the business table part in the A, reduce useless data in the A, reduce data processing amount and contribute to improving the efficiency of extracting the business table coordinates.
Step S162, obtaining business table coordinate in the electronic table document according to LA.
Wherein, step S161 specifically includes:
in step S1611, the columns of A are traversed from the leftmost side; when a column containing a value of 1 is found, its column coordinate X1 is recorded and the traversal is terminated.
In step S1612, the columns of A are traversed from the rightmost side; when a column containing a value of 1 is found, its column coordinate X2 is recorded and the traversal is terminated.
In step S1613, the rows of A are traversed from the top; when a row containing a value of 1 is found, its row coordinate Y1 is recorded and the traversal is terminated.
In step S1614, the rows of A are traversed from the bottom; when a row containing a value of 1 is found, its row coordinate Y2 is recorded and the traversal is terminated.
Step S1615, the data at positions [X1, X2, Y1, Y2] in A are extracted to form a two-dimensional bit array LA, and the coordinate mapping between LA and A is determined according to X1, X2, Y1, and Y2, where LA(m, n) = A(m + X1 - 1, n + Y1 - 1).
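The reduction in steps S1611 to S1615 amounts to cropping A to the bounding box of its 1 values. A minimal sketch, assuming A is a list of rows and using 0-based indices, so that the mapping becomes `la[n][m] == a[n + y1][m + x1]` rather than the 1-based form above; the function name is illustrative.

```python
from typing import List, Optional, Tuple

Bitmap = List[List[int]]


def crop_to_content(a: Bitmap) -> Optional[Tuple[Bitmap, Tuple[int, int, int, int]]]:
    """Return (LA, (x1, x2, y1, y2)), or None if A contains no 1 at all."""
    if not any(any(row) for row in a):
        return None
    n_rows, n_cols = len(a), len(a[0])

    def column_has_one(c: int) -> bool:
        return any(a[r][c] for r in range(n_rows))

    # Each scan stops at the first hit, as in steps S1611 to S1614.
    x1 = next(c for c in range(n_cols) if column_has_one(c))
    x2 = next(c for c in reversed(range(n_cols)) if column_has_one(c))
    y1 = next(r for r in range(n_rows) if any(a[r]))
    y2 = next(r for r in reversed(range(n_rows)) if any(a[r]))

    # S1615: copy the bounding box into LA.
    la = [row[x1 : x2 + 1] for row in a[y1 : y2 + 1]]
    return la, (x1, x2, y1, y2)


a = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
la, box = crop_to_content(a)
```

Here `la` is `[[1, 1], [0, 1]]` and `box` is `(1, 2, 1, 2)`, in the same [first column, last column, first row, last row] order used for the business table coordinates.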
Wherein, step S162 specifically includes:
in step S1621, if all values in LA are 1, there is only one table in the spreadsheet document, and the business table coordinates are [ X1, X2, Y1, Y2 ].
In step S1622, if LA contains a 0, it is checked whether the cell in column X1 and row Y1 of the spreadsheet document is empty; if the cell is not empty, the remaining cells to its right are checked until an empty cell is detected, and the column coordinate of that empty cell is recorded as X3.
A 0 in LA indicates that the spreadsheet document contains multiple independent business tables, which are extracted one at a time starting from step S1622.
Step S1623, the cells in column X1 are checked from top to bottom until an empty cell is detected, the row coordinate of that empty cell is recorded as the maximum row coordinate of column X1, and the next column is checked in the same way until column X3 has been checked.
In step S1624, if the maximum value among all the maximum row coordinates is Y3, the business table coordinates are [X1, X3, Y1, Y3], and the contents of the positions corresponding to [X1, X3, Y1, Y3] in LA are set to 0 to obtain a new LA.
After a business table is extracted, its data must be cleared from LA so that the next table can be found; without clearing, the next extraction would find the first business table again.
Step S163, returning to step S161, the business table coordinates in the spreadsheet document are obtained according to the LA updated in step S1624, until all the business tables in the spreadsheet document have been extracted.
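The extract-and-clear loop of steps S1621 to S163 can be sketched as below. This is one possible reading, with several assumptions that the text does not settle: it works directly on the bitmap in sheet coordinates instead of re-cropping LA each round, it treats a 0 bit as an "empty cell", it starts each round at the leftmost marked cell of the topmost marked row, and it records the last marked column and row rather than the coordinate of the empty cell that ends each scan.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (first column, last column, first row, last row)


def extract_tables(bitmap: List[List[int]]) -> List[Box]:
    a = [row[:] for row in bitmap]  # work on a copy so the caller's data is kept
    tables: List[Box] = []
    while True:
        marked_rows = [r for r, row in enumerate(a) if any(row)]
        if not marked_rows:
            return tables
        y1 = marked_rows[0]
        x1 = a[y1].index(1)  # assumption: leftmost marked cell of the top row

        # S1622: scan right along the top row until an empty cell is met.
        x3 = x1
        while x3 + 1 < len(a[y1]) and a[y1][x3 + 1] == 1:
            x3 += 1

        # S1623: scan each column downwards and keep the deepest row reached.
        y3 = y1
        for c in range(x1, x3 + 1):
            r = y1
            while r + 1 < len(a) and a[r + 1][c] == 1:
                r += 1
            y3 = max(y3, r)

        # S1624: record the table and clear it so the next one can be found.
        tables.append((x1, x3, y1, y3))
        for r in range(y1, y3 + 1):
            for c in range(x1, x3 + 1):
                a[r][c] = 0


sheet = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
]
found = extract_tables(sheet)
```

Each round clears at least the starting cell, so the loop terminates. Tables that touch each other or that have ragged top rows are outside what this sketch handles.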
In order to facilitate management of the business table coordinates obtained in step S1, a List object (PList) is established in advance, and each entry stored in the PList is a one-dimensional array of length 4 holding the coordinates of one business table. The four elements represent, in order, the positions of the first column, the last column, the first row, and the last row of the business table in the spreadsheet document, so that the business table can be extracted from the spreadsheet document according to its coordinates.
A preferred embodiment of step S2 specifically includes the following steps:
in step S21, the header portion in the business table is detected.
As shown in fig. 2, the header portion of a table is usually a large merged cell occupying one or more rows. The table may also include a remark portion whose structure is similar to that of the header portion. Apart from the header and remark portions, the rest of the table is the business data to be extracted. When a remark portion is present, it is detected in step S21 in the same way as the header portion.
And step S22, extracting the multi-dimensional information of the header-removed part of the business table.
The multi-dimensional information includes cell content, background color attributes, and so on. The cell content is the text in a cell, such as "name", "age", "Zhang San", "40". The background color attribute specifies the background color of the cell.
Step S23, determining the table layout according to the extracted multi-dimensional information.
Common table layouts are divided into horizontal single TL, horizontal multiple TL, vertical single TL, vertical multiple TL, and multiple-table combinations. A TL (titleline) is a column header (or data header portion); it may physically span several lines but is logically one area, and it holds the title of each item of business data. For example, the first line of the business data portion in fig. 2 is a TL. A TL may be horizontal or vertical: fig. 3 shows a vertical multi-TL layout and fig. 4 shows a horizontal multi-TL layout.
The header part and the remark part are generally in the first or second row of the table, and each is a merged cell, so a specific implementation of step S21 includes: checking whether each row in the business table is a merged cell; if so, the row belongs to the header part and the next row is checked; if not, the row marks the beginning of the business data, and header detection stops.
In the prior art, when filtering useless data (such as a header part and a remark part), the positions of the useless data need to be known in advance, and then the positions are specified in a program so as to skip the previous rows of the useless data. However, the method in step S21 in this embodiment is more general, and no matter how many lines of title parts and remark parts exist in the table, the lines of title parts and remark parts can be detected accurately and efficiently, so as to ensure that the service data can be extracted accurately.
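The header detection of step S21 can be sketched in a few lines. The sketch assumes each row is given as a list of `(text, column_span)` pairs, a hypothetical input format, and it assumes that a row "is a merged cell" when a single cell spans the full table width.

```python
from typing import List, Tuple

Row = List[Tuple[str, int]]  # (cell text, number of columns the cell spans)


def count_header_rows(rows: List[Row], width: int) -> int:
    """Count leading rows that consist of one merged cell spanning the table."""
    count = 0
    for row in rows:
        if len(row) == 1 and row[0][1] == width:
            count += 1  # header or remark row: keep scanning
        else:
            break  # first ordinary row: business data starts here
    return count


table = [
    [("List of enforcement cases", 3)],
    [("Remark: data as reported", 3)],
    [("Serial number", 1), ("Execution court", 1), ("Execution case number", 1)],
    [("1", 1), ("Court A", 1), ("No. 17", 1)],
]
header_rows = count_header_rows(table, width=3)
business_rows = table[header_rows:]
```

No row positions are hard-coded: the scan stops at the first row that is not a full-width merged cell, however many header or remark rows precede it.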
During data extraction, the information of ordinary cells can be read directly, but merged cells need special handling so that the extracted data fits the storage format and is easy to process later. A preferred method of step S22 therefore includes: extracting the multi-dimensional information of the business table with the header part removed (and the remark part removed, if one exists), splitting the merged cells in the extracted information, storing the information of each dimension as a two-dimensional array, and specially marking the cells produced by splitting.
Merged cells are divided into horizontal merges, vertical merges, and mixed merges. Splitting a merged cell that horizontally merges 5 cells gives the following result:
ABC {←} {←} {←} {←}
The special mark "{←}" applies only to extracted cell content and indicates that the content of the cell is the same as that of the cell to its left. This gives flexibility to TL processing and to the final output; other extracted data needs no special mark.
The extracted background color attributes are:
#F7FBFE #F7FBFE #F7FBFE #F7FBFE #F7FBFE
When a single row contains several horizontal merges, the coordinate offsets must also be tracked, as shown in the following row:
ABC {←} DEF {←} {←}
Vertical merges and mixed merges are extracted in a similar way.
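The splitting of horizontally merged cells can be sketched as follows, assuming each row is supplied as `(text, column_span)` pairs (a hypothetical input format). Appending one marker per extra column keeps later cells at their correct column index, which is the coordinate issue noted above.

```python
from typing import List, Tuple

LEFT = "{←}"  # marker from the text: same content as the cell to the left


def split_horizontal(row: List[Tuple[str, int]]) -> List[str]:
    """Expand each (text, span) pair into `span` cells."""
    cells: List[str] = []
    for text, span in row:
        cells.append(text)
        cells.extend([LEFT] * (span - 1))
    return cells


single = split_horizontal([("ABC", 5)])
double = split_horizontal([("ABC", 2), ("DEF", 3)])
```

The two calls reproduce the two example rows shown above: one five-cell merge, and a two-cell merge followed by a three-cell merge.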
Only if the table layout is known, the business data can be accurately extracted, and the table is converted into the structured data according to the table layout. The judgment of the table layout in step S23 includes the following operations:
(1) Rows and columns that are not TL are excluded according to the extracted cell content.
An exclusion judgment is performed according to the data type, length, and keywords in the TL cells. The criteria include: the length of the field name in each TL cell cannot exceed a threshold (for example, 50); the number of field names in a TL cannot exceed a threshold (for example, 1000); a field name cannot be a pure numeric string; and common field names contain keywords such as "Name", "Address", "type", and "remark", so a keyword library is built from statistics over common tables, and each row or column is checked for keywords from that library.
Therefore, the specific implementation steps for judging the table layout based on the cell content are as follows: detecting the extracted cell contents line by line and column by column; if the data type of the cell content is a numeric character string, the row or column of the cell is not TL; if the field length of the cell content exceeds a first threshold, the line or the column of the cell is not TL; if a given keyword is contained in the contents of a plurality of cells in a row or a column, the row or the column is TL.
When the keyword-based judgment method is used, in order to ensure the judgment reliability, at least two keywords are required to appear to identify the line or the column as TL.
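The content-based rules can be combined into a small classifier. In this sketch the keyword set is a tiny stand-in for the statistically built keyword library, the thresholds are the example values from the text, and case-insensitive substring matching is an assumption.

```python
from typing import List, Optional

KEYWORDS = {"name", "address", "type", "remark"}  # stand-in for the keyword library
MAX_FIELD_LENGTH = 50   # example threshold from the text
MAX_FIELD_COUNT = 1000  # example threshold from the text


def classify_line(cells: List[Optional[str]]) -> Optional[bool]:
    """Return False if the line cannot be a TL, True if keywords confirm it,
    and None if the content-based rules are inconclusive."""
    values = [c.strip() for c in cells if c is not None and c.strip()]
    if len(values) > MAX_FIELD_COUNT:
        return False
    for value in values:
        # A pure numeric string or an over-long value rules the line out.
        if value.isdigit() or len(value) > MAX_FIELD_LENGTH:
            return False
    hits = sum(1 for v in values if any(k in v.lower() for k in KEYWORDS))
    # At least two keyword hits are required before accepting the line as a TL.
    return True if hits >= 2 else None


header = classify_line(["Name", "Address", "Age"])
data = classify_line(["1", "Zhang San", "40"])
unknown = classify_line(["Serial number", "Execution court"])
```

A `None` result leaves the decision to the other criteria, such as background color or data-type consistency.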
(2) And judging the table layout according to the extracted background color attribute.
When the table is displayed, in order to provide convenience for a user to read, the background color of the table TL and the background color of the data may be different, or the odd and even rows of the data may adopt staggered background colors, so that the background color attribute can be used for judging which rows or columns may be TL, and further judging whether the table layout is horizontal or vertical.
(3) And judging the table layout according to whether the data types of the cell contents in the same row or the same column are the same.
The TL part is removed from the business data of the table, and the types of the cells under the field names of the TL should be the same as long as they are not null values (the method can distinguish only 'pure numeric type string', 'date and time type string', 'no obvious characteristic string'). For example, the table in fig. 2 is a horizontal layout, in which the data types of the cells in each column of the business data part except the first line TL are the same, for example, the column of the field name "serial number" is a pure numeric character string, the column of the field name "execution court" is a 'no obvious character string', the column of the field name "execution case number" is a 'no obvious character string', and in short, the data types of the columns except the TL line are the same.
According to the characteristics, whether the data types of the same row are the same or not is detected, and if the data types of all the rows of the table are the same (namely, the data types of all the cells in the same row are either 'pure numeric character strings' or 'date-time character strings' or 'no obvious characteristic character strings'), the table is in a longitudinal layout; and detecting whether the data types of the same column are the same or not, and the data types of all columns of the table are the same (namely, the data types of all cells in the same column are 'pure numeric character strings', or 'date-time character strings' or 'no obvious characteristic character strings'), so that the table is in a horizontal layout.
To prevent empty cells from affecting the detection result, cells with empty content are excluded from the detection range when rows and columns are checked.
The business data part of a table is generally large, and checking every row and column would reduce the judgment efficiency, so short-circuit judgment can be adopted: once the result for a newly checked row or column rules out a certain layout, the remaining checks for that layout can be skipped.
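The type-based judgment described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names (`cell_type`, `lines_are_uniform`, `detect_layout`), the date formats tried, and the "undetermined" result are all assumptions, and the business data is assumed to be a rectangular list of rows with the TL already removed.

```python
from datetime import datetime

# Assumed date-time formats; the source does not say which formats are recognized.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y-%m-%d %H:%M:%S")

def cell_type(value):
    """Return 'numeric', 'datetime' or 'plain'; return None for an empty cell."""
    text = "" if value is None else str(value).strip()
    if not text:
        return None
    if text.isdigit():
        return "numeric"
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "datetime"
        except ValueError:
            pass
    return "plain"

def lines_are_uniform(lines):
    """True if each line has at most one type among its non-empty cells.

    Returns as soon as one mixed line is found (short-circuit judgment).
    """
    for line in lines:
        types = {t for t in map(cell_type, line) if t is not None}
        if len(types) > 1:
            return False
    return True

def detect_layout(body):
    """Judge the layout of the business data (TL already removed)."""
    columns = list(zip(*body))
    cols_uniform = lines_are_uniform(columns)
    rows_uniform = lines_are_uniform(body)
    if cols_uniform and not rows_uniform:
        return "horizontal"
    if rows_uniform and not cols_uniform:
        return "vertical"
    return "undetermined"  # other cues, such as background color, would be needed

body = [
    ["1", "Court A", "2016-01-02"],
    ["2", "Court B", ""],
    ["3", "Court C", "2016-03-04"],
]
print(detect_layout(body))
```

When every row and every column is uniform (for example, a table of plain strings only), this sketch cannot decide on its own, which is one reason the judgment methods may need to be combined.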
The above judgment methods can be combined in various ways according to actual requirements to improve the accuracy of the layout judgment. In addition, the method of this embodiment can identify tables that contain multiple TLs, which improves the reliability of the extracted data.
Where the table layout is vertical, the table formed by the cell contents also needs to be transposed into a horizontal layout.
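Converting a vertical layout into a horizontal one amounts to a transposition. A minimal sketch, assuming the table is held as a rectangular list of rows; the function name `to_horizontal` and the sample values are hypothetical:

```python
def to_horizontal(table):
    """Transpose a vertical-layout table so that each field becomes a column."""
    return [list(row) for row in zip(*table)]

vertical = [
    ["serial number", "1", "2"],
    ["execution court", "Court A", "Court B"],
]
horizontal = to_horizontal(vertical)
print(horizontal)
```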
TLs are classified into single-level TLs and multi-level TLs, but both are referred to simply as TL unless otherwise specified. As shown in fig. 2, the table has only one TL, and it is a single-level TL. As shown in fig. 7, the table has only one TL, and it is a multi-level TL (composed of multiple rows with a parent-child relationship between the levels); the field names in the multiple rows need to be combined into a single row of field names for output. In fig. 7, the TL of the original table is divided into two parts: the left part spans multiple rows (multi-level) and the right part is a single row. The first level of the multi-level part is a merged cell with the field name 'basic information', and the second level contains the fields 'name', 'age', and 'gender'. The final output is a single-level TL with the structure 'basic information_name', 'basic information_age', 'basic information_gender', 'other fields A', 'other fields B'.
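Combining a multi-level TL into a single row of field names can be sketched as below. This is an illustration under stated assumptions: horizontally merged cells are assumed to have been expanded already (so 'basic information' is repeated over its three columns), cells covered by a vertical merge are assumed to be empty, and the function name `flatten_header` and the `_` separator are hypothetical choices.

```python
def flatten_header(header_rows, sep="_"):
    """Join the header levels of each column, top to bottom, into one name."""
    names = []
    for column in zip(*header_rows):
        parts = []
        for cell in column:
            text = "" if cell is None else str(cell).strip()
            # Skip empty levels and a level that repeats the one above it.
            if text and (not parts or parts[-1] != text):
                parts.append(text)
        names.append(sep.join(parts))
    return names

header = [
    ["basic information", "basic information", "basic information",
     "other fields A", "other fields B"],
    ["name", "age", "gender", None, None],
]
print(flatten_header(header))
```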
Where the table contains multiple TLs, the table formed by the cell contents also needs to be cut and merged into a single-TL layout to meet the format requirements of structured data. The cutting and merging operation comprises: comparing the cell contents of the TLs; where the TLs have the same content, retaining only one TL row, as shown in fig. 5; and where the TLs have different content, splicing them into a single TL row, as shown in fig. 6.
Finally, for merged cells, the special marks can be replaced according to business requirements. For example, the row
ABC {←} {←} {←} {←}
can be adjusted to the following format:
ABC ABC ABC ABC ABC
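The adjustment in the example above can be sketched as a left-to-right fill. The marker string `{←}` is taken from the example; the function name `fill_merged` and the choice to leave a marker with no left neighbor unchanged are assumptions.

```python
MERGE_LEFT = "{←}"

def fill_merged(row, marker=MERGE_LEFT):
    """Replace each merge marker with the value of the cell to its left."""
    filled = []
    for cell in row:
        if cell == marker and filled:
            filled.append(filled[-1])
        else:
            filled.append(cell)
    return filled

print(fill_merged(["ABC", "{←}", "{←}", "{←}", "{←}"]))
```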
The method for extracting structured information from a spreadsheet document is also applicable where the spreadsheet document contains multiple sheet tabs. Specifically, the sheet tabs in the spreadsheet document are acquired one by one, and the business tables in each sheet tab are extracted using steps S1 to S3.
Based on the same inventive concept as the above method, this embodiment further provides an apparatus for extracting structured information from a spreadsheet document. As shown in fig. 10, the apparatus includes: a business table acquisition module for acquiring all business tables in the spreadsheet document; a table layout analysis module for performing layout analysis on the business tables; and a table information extraction module for extracting contents from the business tables according to the layout analysis result and performing the corresponding conversion processing to obtain the structured information.
The apparatus provided by this embodiment can automatically acquire all independent business tables in a spreadsheet document in batches through an isolated table identification algorithm, which improves the efficiency of large-scale data extraction. Because business data is extracted only after layout analysis of the business table, the reliability of the extracted data is improved; this is particularly effective for identifying and extracting large-scale semi-structured data.
The business table acquisition module is specifically configured to: establish two two-dimensional bit arrays of the same size as the spreadsheet document, denoted A and B; traverse all cells in the spreadsheet document, marking the corresponding position in A as 1 if the cell has content and as 0 otherwise; traverse all cells in the spreadsheet document and mark B according to the border lines of the cells; where a value in B is 1, set the value at the corresponding position in A to 1; and acquire the business table coordinates in the spreadsheet document according to the updated A.
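The construction of A and B can be sketched as follows. This is a simplified illustration: plain lists of 0/1 integers stand in for bit arrays, the border information is assumed to arrive as a set of bordered sides per cell, and only the corner rule of claim 2 is used to mark B (the further propagation steps of claim 3 are omitted). All names are hypothetical.

```python
# A corner "has two border lines" when both sides meeting at it are bordered.
CORNERS = (("top", "left"), ("top", "right"),
           ("bottom", "left"), ("bottom", "right"))

def has_bordered_corner(borders):
    """True if at least one corner of the cell is formed by two border lines."""
    return any(a in borders and b in borders for a, b in CORNERS)

def build_occupancy(values, borders):
    """Return the updated A: 1 where a cell has content or is marked in B."""
    rows, cols = len(values), len(values[0])
    a = [[1 if values[r][c] not in (None, "") else 0 for c in range(cols)]
         for r in range(rows)]
    b = [[1 if has_bordered_corner(borders[r][c]) else 0 for c in range(cols)]
         for r in range(rows)]
    # Where B is 1, set the same position in A to 1.
    return [[a[r][c] | b[r][c] for c in range(cols)] for r in range(rows)]

values = [["x", None], [None, None]]
borders = [[set(), {"top", "right"}], [{"left"}, set()]]
updated_a = build_occupancy(values, borders)
print(updated_a)
```

Marking bordered but empty cells in A keeps a framed table together as one region even when some of its cells have no content.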
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims (8)

1. A method of extracting structured information from a spreadsheet document, comprising:
acquiring all business tables in the spreadsheet document through an isolated table identification algorithm;
performing layout analysis on the business table;
extracting contents from the business table according to the layout analysis result, and performing corresponding conversion processing to obtain structured information;
wherein acquiring all business tables in the spreadsheet document through the isolated table identification algorithm comprises:
establishing two two-dimensional bit arrays of the same size as the spreadsheet document, denoted A and B;
traversing all cells in the spreadsheet document, and marking the corresponding position in A as 1 if the cell has content, and as 0 otherwise;
traversing all cells in the spreadsheet document, and marking B according to the border lines of the cells;
if a value in B is 1, setting the value at the same position in A to 1;
and acquiring the business table coordinates in the spreadsheet document according to the updated A.
2. The method of claim 1, wherein traversing all cells in the spreadsheet document and marking B according to the border lines of the cells comprises:
traversing all cells in the spreadsheet document, and if at least one of the four corners of a cell has two border lines, marking the corresponding position in B as 1.
3. The method of claim 2, further comprising, after traversing all cells in the spreadsheet document and marking the corresponding position in B as 1 if at least one of the four corners of a cell has two border lines:
step S132, traversing all cells in the spreadsheet document again, and if a cell has a border line, its corresponding value in B is 0, and at least one of its four adjacent cells (above, below, left, and right) has the value 1 in B, marking the position of the cell in B as 1;
step S133, traversing all cells in the spreadsheet document again, and if the corresponding value of a cell in B is 0 and, in a 2 × 2 region containing the cell, the corresponding values of the other three cells in B are all 1, marking the cell in B as 1 and incrementing a counter by 1;
step S134, if the counter is not 0, clearing the counter and executing step S133 again.
4. The method of claim 2, wherein acquiring the business table coordinates in the spreadsheet document according to the updated A comprises:
performing a reduction operation on the updated A to obtain LA;
and acquiring the business table coordinates in the spreadsheet document according to LA.
5. The method of claim 4, wherein performing the reduction operation on the updated A to obtain LA comprises:
traversing the columns of A starting from the leftmost column, and when a column containing a value of 1 is found, recording its column coordinate X1 and terminating the traversal;
traversing the columns of A starting from the rightmost column, and when a column containing a value of 1 is found, recording its column coordinate X2 and terminating the traversal;
traversing the rows of A starting from the topmost row, and when a row containing a value of 1 is found, recording its row coordinate Y1 and terminating the traversal;
traversing the rows of A starting from the bottommost row, and when a row containing a value of 1 is found, recording its row coordinate Y2 and terminating the traversal;
extracting the data at positions [X1, X2, Y1, Y2] in A to form a two-dimensional bit array LA, and determining the coordinate mapping relation between LA and A according to X1, X2, Y1, and Y2.
6. The method of claim 5, wherein acquiring the business table coordinates in the spreadsheet document according to LA comprises:
if all values in LA are 1, the spreadsheet document contains only one table, whose business table coordinates are [X1, X2, Y1, Y2];
otherwise, detecting whether the cell in column X1 and row Y1 of the spreadsheet document is empty, and if not, detecting the remaining cells to the right until an empty cell is detected, and recording the column coordinate of the empty cell as X3;
detecting the cells of column X1 from top to bottom until an empty cell is detected, recording the row coordinate of the empty cell as the maximum row coordinate of column X1, and continuing with the next column until column X3 is reached;
if the maximum of all the maximum row coordinates is Y3, the business table coordinates are [X1, X3, Y1, Y3], and the content at the positions corresponding to [X1, X3, Y1, Y3] in LA is set to 0 to obtain a new LA;
and acquiring the business table coordinates in the spreadsheet document according to the updated LA until all business tables in the spreadsheet document have been extracted.
7. The method of claim 1, wherein performing layout analysis on the business table comprises:
detecting a header part in the business table;
extracting multi-dimensional information from the part of the business table other than the header;
and judging the table layout according to the extracted multi-dimensional information.
8. An apparatus for extracting structured information from a spreadsheet document, comprising:
a business table acquisition module configured to acquire all business tables in the spreadsheet document through an isolated table identification algorithm;
a table layout analysis module configured to perform layout analysis on the business table;
a table information extraction module configured to extract contents from the business table according to the layout analysis result and perform corresponding conversion processing to obtain structured information;
wherein the business table acquisition module is specifically configured to:
establish two two-dimensional bit arrays of the same size as the spreadsheet document, denoted A and B;
traverse all cells in the spreadsheet document, and mark the corresponding position in A as 1 if the cell has content, and as 0 otherwise;
traverse all cells in the spreadsheet document, and mark B according to the border lines of the cells;
if a value in B is 1, set the value at the same position in A to 1;
and acquire the business table coordinates in the spreadsheet document according to the updated A.
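
For illustration only, the reduction operation recited in claim 5 can be sketched as a crop of A to the bounding box of its 1 values. The sketch assumes A is a non-empty rectangular list of 0/1 rows and uses zero-based coordinates; the function name `shrink` and the `None` result for an all-zero array are assumptions not found in the claims.

```python
def shrink(a):
    """Crop A to the smallest rectangle containing every 1.

    Returns (LA, (x1, x2, y1, y2)), or None if A contains no 1.
    LA[i][j] corresponds to A[y1 + i][x1 + j].
    """
    height, width = len(a), len(a[0])

    def col_has_one(x):
        return any(a[y][x] for y in range(height))

    def row_has_one(y):
        return any(a[y])

    # Each scan stops at the first column or row that contains a 1.
    x1 = next((x for x in range(width) if col_has_one(x)), None)
    if x1 is None:
        return None
    x2 = next(x for x in reversed(range(width)) if col_has_one(x))
    y1 = next(y for y in range(height) if row_has_one(y))
    y2 = next(y for y in reversed(range(height)) if row_has_one(y))

    la = [row[x1:x2 + 1] for row in a[y1:y2 + 1]]
    return la, (x1, x2, y1, y2)

a = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(shrink(a))
```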

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611245472.9A CN106709032B (en) 2016-12-29 2016-12-29 Method and device for extracting structured information in electronic form document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611245472.9A CN106709032B (en) 2016-12-29 2016-12-29 Method and device for extracting structured information in electronic form document

Publications (2)

Publication Number Publication Date
CN106709032A CN106709032A (en) 2017-05-24
CN106709032B true CN106709032B (en) 2019-12-20

Family

ID=58904022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611245472.9A Active CN106709032B (en) 2016-12-29 2016-12-29 Method and device for extracting structured information in electronic form document

Country Status (1)

Country Link
CN (1) CN106709032B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108170697B (en) * 2017-07-12 2021-08-20 信号旗智能科技(上海)有限公司 International trade file processing method and system and server
CN110889310B (en) * 2018-09-07 2023-05-09 深圳市赢时胜信息技术股份有限公司 Financial document information intelligent extraction system and method
CN110969000B (en) * 2018-09-30 2024-05-03 北京国双科技有限公司 Data merging processing method and device
CN110377604B (en) * 2019-07-23 2022-06-24 北京小米移动软件有限公司 Method, device and medium for extracting form information
CN110489423B (en) * 2019-08-26 2021-10-08 北京香侬慧语科技有限责任公司 Information extraction method and device, storage medium and electronic equipment
CN110516048A (en) * 2019-09-02 2019-11-29 苏州朗动网络科技有限公司 The extracting method, equipment and storage medium of list data in pdf document
CN110888965A (en) * 2019-10-22 2020-03-17 深圳市迪博企业风险管理技术有限公司 Document data extraction method and device
CN110866217A (en) * 2019-10-24 2020-03-06 长城计算机软件与***有限公司 Cross report recognition method and device, storage medium and electronic equipment
CN110968667B (en) * 2019-11-27 2023-04-18 广西大学 Periodical and literature table extraction method based on text state characteristics
CN111966734A (en) * 2020-03-30 2020-11-20 北京来也网络科技有限公司 Data processing method and electronic equipment of spreadsheet combined with RPA and AI
CN112307030B (en) * 2020-11-05 2023-12-26 金蝶软件(中国)有限公司 Dimension combination acquisition method and related equipment
CN112381143B (en) * 2020-11-13 2023-12-05 新长城科技有限公司 Automatic variable classification method and system based on machine learning
CN112328589B (en) * 2020-11-28 2021-08-17 河北省科学技术情报研究院(河北省科技创新战略研究院) Electronic form data granulation and index standardization processing method
CN114417798A (en) * 2022-01-19 2022-04-29 广州天维信息技术股份有限公司 Document structured extraction method and device, computer equipment and storage medium
CN114973259A (en) * 2022-03-03 2022-08-30 北京电解智科技有限公司 Information extraction method, device and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620593A (en) * 2008-06-30 2010-01-06 国际商业机器公司 Resolve the method and the electronic form server of the content of electronic spreadsheet
CN103198069A (en) * 2012-01-06 2013-07-10 株式会社理光 Method and device for extracting relational table
CN103279455A (en) * 2013-06-28 2013-09-04 中国农业银行股份有限公司 Spreadsheet style processing method and device
CN104731813A (en) * 2013-12-23 2015-06-24 珠海金山办公软件有限公司 Form file display method and system
CN106156239A (en) * 2015-04-27 2016-11-23 ***通信集团公司 A kind of form abstracting method and device

Also Published As

Publication number Publication date
CN106709032A (en) 2017-05-24

Similar Documents

Publication Publication Date Title
CN106709032B (en) Method and device for extracting structured information in electronic form document
CN110968667B (en) Periodical and literature table extraction method based on text state characteristics
CN107622230B (en) PDF table data analysis method based on region identification and segmentation
US7937653B2 (en) Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents
CN104142961B (en) The logic processing device of composite diagram and logical process method in format document
CN101770446B (en) Method and system for identifying form in layout file
AU2009281901B2 (en) Segmenting printed media pages into articles
KR101394723B1 (en) Reconstruction of lists in a document
US7852499B2 (en) Captions detector
US20150095769A1 (en) Layout Analysis Method And System
CN110516221B (en) Method, equipment and storage medium for extracting chart data in PDF document
CN106777259A (en) The method and device of structured message in adaptive decimation HTML Table labels
CN109710771B (en) Table information extraction method, device and storage medium
CN110704570A (en) Continuous page layout document structured information extraction method
Klampfl et al. A comparison of two unsupervised table recognition methods from digital scientific articles
CN111797630A (en) PDF-format-paper-oriented biomedical entity identification method
CN112380812B (en) Method, device, equipment and storage medium for extracting incomplete frame line table of PDF (Portable document Format)
CN106777281B (en) Data processing method and device for improving stability and usability of web crawler
WO2019041442A1 (en) Method and system for structural extraction of figure data, electronic device, and computer readable storage medium
CN104598462A (en) Method and device for extracting structural data
CN115828874A (en) Industry table digital processing method based on image recognition technology
CN113962201A (en) Document structuralization and extraction method for documents
CN113408323B (en) Extraction method, device and equipment of table information and storage medium
Kasar et al. Table information extraction and structure recognition using query patterns
CN109062921B (en) Method and system for extracting ship tray management information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 518000 2203/2204, Building 1, Huide Building, Beizhan Community, Minzhi Street, Longhua District, Shenzhen, Guangdong

Patentee after: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.

Address before: 518000 units J and K, 12 / F, block B, building 7, Baoneng Science Park, Qinghu Industrial Zone, Qingxiang Road, Longhua New District, Shenzhen City, Guangdong Province

Patentee before: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.

CP02 Change in the address of a patent holder