CN114021543B - Document comparison analysis method and system based on table structure analysis

Info

Publication number
CN114021543B
Authority
CN
China
Prior art keywords
document
difference
data
content
comparison
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210003662.9A
Other languages
Chinese (zh)
Other versions
CN114021543A (en)
Inventor
郑飞鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Real Intelligence Technology Co ltd
Original Assignee
Hangzhou Real Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Real Intelligence Technology Co ltd
Priority to CN202210003662.9A
Publication of CN114021543A
Application granted
Publication of CN114021543B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention belongs to the technical field of data processing and relates in particular to a document comparison analysis method and system based on table structure analysis. The method comprises: S1, receiving source files of various types and uniformly converting them into PDF files; S2, for the different types of content in the PDF file, extracting, segmenting and recognizing each type with a different tool to obtain table data and non-table data carrying text content, coordinate information and table structure; and S3, comparing the table data and the non-table data separately to obtain the out-of-table text differences and the table differences. The system comprises a file conversion module, a file identification module and a data comparison module. The invention focuses on comparison at the level of document content and semantics, can compare the structure and semantics of tables during document comparison, and offers good comparison results, low resource usage and accurate character recognition.

Description

Document comparison analysis method and system based on table structure analysis
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a document comparison analysis method and system based on table structure analysis.
Background
Today information is updated rapidly and iteratively, and businesses, governments, public institutions and individuals all handle more and more documents in their daily work. In practice a document often changes hands many times. Each handover may introduce some deviation, so that the final document differs considerably from the original and affects the related business. For example, a business contract goes through multiple revisions between the initial and the final version; besides content changes, each revision may cause subtle format changes because the people involved use different document editing tools. The contracting parties may also need to transmit the contract by fax, as printed matter and so on, which involves converting a text document to and from other carriers such as electronic images, paper documents and scans. Because the accuracy of current technologies such as OCR is limited, the conversion process has a certain probability of causing content loss, content changes and similar problems.
Currently, the most common way to deal with such conversion errors is manual checking. When a contract is long, manual review by the contract examiner is inefficient.
Therefore, it is very important to develop a computer program that accepts multiple document formats, analyzes document content accurately, describes the content differences between versions and between source files accurately, and performs the comparison efficiently.
However, the existing document comparison technology has the following disadvantages:
1. document comparison based on image template matching places overly strict requirements on the content structure of the document;
2. the comparison results for documents that mix images, text and tables are poor;
3. OCR recognition consumes considerable resources and has a certain probability of character recognition errors.
Therefore, it is very important to design a document comparison analysis method and system based on table structure analysis that focuses on comparison at the level of document content and semantics, can compare the structure and semantics of tables during document comparison, and offers good comparison results, low resource usage and accurate character recognition.
For example, Chinese patent application No. CN202110644806.4 describes a method, an apparatus and a storage medium for fast comparison of long documents. For two long documents to be compared, the method comprises the following steps: S1, parsing the two documents to form tree-shaped document structures; S2, splitting the two documents into two groups of content blocks according to the tree-shaped document structures; S3, establishing a mapping between the two groups of content blocks to form a plurality of mapping pairs; S4, running multiple tasks in parallel, each task comparing the two content blocks of one mapping pair word by word to find the differences. Although this improves the speed of long-document comparison, it lacks the ability to compare table structure and semantics, so differences in the table structure of the actual documents are not reflected.
Disclosure of Invention
To overcome the problems of the existing document comparison technology, namely strict and limiting requirements on the structure of document content, poor comparison results, high resource usage and a certain probability of character recognition errors, the invention provides a document comparison analysis method and system based on table structure analysis. The method and system focus on comparison at the level of document content and semantics, can compare the structure and semantics of tables during document comparison, and offer good comparison results, low resource usage and accurate character recognition.
In order to achieve the purpose, the invention adopts the following technical scheme:
The document comparison analysis method based on table structure analysis comprises the following steps:
S1, receiving source files of various types and uniformly converting them into PDF files;
S2, for the different types of content in the PDF file, extracting, segmenting and recognizing each type with a different tool to obtain table data and non-table data carrying text content, coordinate information and table structure;
S3, comparing the table data and the non-table data separately to obtain the out-of-table text differences and the table differences.
Preferably, the out-of-table text difference and the table difference comprise reasons for difference generation, difference content and document positions of the differences; the reasons for the difference include addition, deletion and modification; the document position where the difference is located includes a page number and XY coordinates.
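As an illustration only, a difference record with the three parts named above (reason, content, position) could be modelled as follows. The class and field names are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from enum import Enum


class Reason(Enum):
    # The three reasons for a difference named in the text.
    ADDED = "added"
    DELETED = "deleted"
    MODIFIED = "modified"


@dataclass(frozen=True)
class Difference:
    reason: Reason   # why the difference arose
    content: str     # the differing content
    page: int        # page number where the difference is located
    x: float         # X coordinate on that page
    y: float         # Y coordinate on that page


# Hypothetical example values, chosen only to show the shape of a record.
example = Difference(Reason.MODIFIED, "total price", page=2, x=72.0, y=540.5)
```

Keeping the page number and coordinates on every record is what would later allow a front end to draw each difference at its position in the PDF.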
Preferably, step S2 includes a PDF file parsing process comprising the following steps:
S21, parsing the PDF file generated in step S1 with a PDF file parsing tool to obtain the table data;
S22, extracting content from the PDF file in reading order with the PDF file parsing tool;
S23, if the extracted content is a character, judging whether the character belongs to the table data of step S21; if not, appending the character to the end of the character array; if so, skipping the character and proceeding to step S24;
S24, if the extracted content is a picture, recognizing the picture with OCR and appending the recognized characters, in order, to the end of the character array of step S23;
S25, the final recognition result is a character array and a table array formed from the table data; each table object in the table array records the position of the table relative to the character array.
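The loop in steps S22 to S25 can be sketched as below. This is a minimal sketch under stated assumptions: the `Item` type, the `collect` function and the `ocr` callback are hypothetical stand-ins, and a real PDF parsing tool and OCR engine would supply the items and the recognition.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Item:
    kind: str                       # "text" or "image" (hypothetical item kinds)
    payload: str                    # the characters, or an image identifier
    table_id: Optional[int] = None  # set when the text lies inside a table from S21


def collect(items: list, ocr: Callable[[str], str]):
    chars = []       # character array holding the out-of-table text
    table_pos = {}   # table_id -> index in chars where the table sits
    for item in items:                    # S22: items arrive in reading order
        if item.kind == "text":
            if item.table_id is None:     # S23: not table text, append it
                chars.extend(item.payload)
            else:                         # S23: table text, skip it, remember where
                table_pos.setdefault(item.table_id, len(chars))
        elif item.kind == "image":        # S24: recognize the picture, append result
            chars.extend(ocr(item.payload))
    return chars, table_pos               # S25: character array plus table positions


items = [Item("text", "ab"), Item("text", "x", table_id=0),
         Item("image", "img1"), Item("text", "c")]
chars, table_pos = collect(items, ocr=lambda image_id: "ZZ")  # fake OCR for the demo
```

Recording each table as an index into the character array is one simple way to keep the "relative position relationship" that step S25 asks for.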
Preferably, the table data comparison in step S3 includes the following steps:
S31, extracting the table structure features and generating a hash from the table structure features plus the table content; tables in the source document and the converted PDF document that have the same hash are marked as "complete matching"; among the remaining tables, tables in the source document and the converted PDF document with similar structural features are compared by traversal, and tables whose content similarity exceeds 60% are marked as "partial matching"; tables that still cannot be matched are marked as deleted/added;
tables with similar structural features are tables for which the structure feature matrix of the table in one document is completely contained in the structure feature matrix of the table in the other document, or for which the identical parts overlap by more than 80%;
S32, for the "complete matching" and "partial matching" tables, assigning each group a placeholder mark composed of special characters, such that within one comparison task the placeholder marks of any two groups of "complete matching" or "partial matching" tables share no character, and every character in a placeholder mark is an uncommon character;
S33, mapping the unmatched feature blocks in a "partial matching" table back to the original table cells and marking those cells as table additions or deletions; reading the matched feature blocks in the "partial matching" table row by row and mapping them to the original table cells, then comparing the cells in the first row of the original cell area using an edit distance algorithm with the substitution operation removed, to obtain and mark the column additions and deletions in that cell area;
S34, for the columns that remain successfully matched in the feature blocks of the "partial matching" table, comparing the first column of the corresponding original table cells using the edit distance algorithm with the substitution operation removed, to obtain and mark the row additions and deletions in that cell area; for the remaining cells, concatenating the text of all cells in row-first, column-second reading order and obtaining the cell content changes with a general edit distance algorithm.
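The matching in step S31 can be sketched as follows. The patent gives the thresholds (60% content similarity, containment or more than 80% structural overlap) but not the exact measures, so both measures here are assumptions: content similarity uses `difflib.SequenceMatcher.ratio()`, and structural overlap is taken as the share of the smaller matrix's rows that align in order with rows of the other matrix. All function names are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher


def table_hash(features, text):
    # Hash of structure features plus table content (S31).
    raw = repr(features) + "\x00" + text
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def structure_overlap(fa, fb):
    # Assumed measure: fraction of the smaller matrix's rows that align, in order,
    # with rows of the other matrix. 1.0 means the smaller one is fully contained.
    rows_a, rows_b = [tuple(r) for r in fa], [tuple(r) for r in fb]
    sm = SequenceMatcher(None, rows_a, rows_b, autojunk=False)
    matched = sum(block.size for block in sm.get_matching_blocks())
    return matched / max(1, min(len(rows_a), len(rows_b)))


def match_tables(tables_a, tables_b):
    """Each table is a (features, text) pair. Returns index pairs and lists:
    (complete, partial, deleted_from_a, added_in_b)."""
    complete, partial, deleted = [], [], []
    free_b = set(range(len(tables_b)))
    by_hash = {}
    for j, (f, t) in enumerate(tables_b):
        by_hash.setdefault(table_hash(f, t), []).append(j)
    pending = []
    for i, (f, t) in enumerate(tables_a):          # pass 1: identical hashes
        hits = [j for j in by_hash.get(table_hash(f, t), []) if j in free_b]
        if hits:
            complete.append((i, hits[0]))
            free_b.discard(hits[0])
        else:
            pending.append(i)
    for i in pending:                               # pass 2: similar structure + content
        fa, ta = tables_a[i]
        hit = None
        for j in sorted(free_b):
            fb, tb = tables_b[j]
            similar = SequenceMatcher(None, ta, tb, autojunk=False).ratio()
            if structure_overlap(fa, fb) > 0.8 and similar > 0.6:
                hit = j
                break
        if hit is None:
            deleted.append(i)
        else:
            partial.append((i, hit))
            free_b.discard(hit)
    return complete, partial, deleted, sorted(free_b)


# Hypothetical tables: (feature matrix, concatenated cell text).
doc_a = [([["N", "N"]], "name price apple 3"),
         ([["H"], ["N"]], "purchase list item one pencil")]
doc_b = [([["N"]], "item one pencil"),
         ([["N", "N"]], "name price apple 3"),
         ([["N"]], "zzz")]
result = match_tables(doc_a, doc_b)
```

Running the cheap hash pass first means the more expensive pairwise traversal only touches tables that actually changed.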
Preferably, the non-table data comparison in step S3 includes the following steps:
S35, concatenating the non-table data of the source document and of the converted PDF document into one character string each, inserting the table placeholder marks obtained in step S32 into each string at the table-to-text relative index positions, comparing the two strings with an edit distance algorithm to obtain the differences in the out-of-table text of the two documents, and deriving from those differences any change in the position of a table relative to the out-of-table text;
S36, obtaining the final document differences, namely the out-of-table text differences and the table differences.
Preferably, the table differences comprise three granularities:
whole-table difference: table added/deleted; table block difference: table block added/deleted; cell content difference: cell content added/deleted/modified.
Preferably, the table structure feature matrix in step S31 is generated according to the following rules:
S311, determining the size of a two-dimensional array according to the minimum granularity of the cells in the table;
S312, filling the elements of the two-dimensional array according to the following rules:
the head cell of a horizontally merged cell is H, and the rest are _H;
the head cell of a vertically merged cell is V, and the rest are _V;
a cell that is not merged is N;
the head cell of a cell merged both horizontally and vertically is D, and the rest are _D;
S313, compressing the two-dimensional array formed in step S312 according to the following rules:
columns are compressed from right to left, and rows from bottom to top;
a column/row is compressed when every letter in it satisfies the following rule:
the letter is identical to the letter on its left/above it, or the two letters are the continuation and the head cell of a merged cell;
deleting the compressed column/row from the array, and repeating step S313 until the two-dimensional array cannot be compressed further;
S314, the result of step S313 is the table structure feature matrix.
The invention also provides a document comparison analysis system based on table structure analysis, comprising:
a file conversion module for receiving source files of various types and uniformly converting them into PDF files;
a file identification module for extracting, segmenting and recognizing the different types of content in the PDF file, each with a different tool, to obtain table data and non-table data carrying text content, coordinate information and table structure;
a data comparison module for comparing the table data and the non-table data separately to obtain the out-of-table text differences and the table differences.
Preferably, the file identification module includes:
a table data identification module for parsing the PDF file and obtaining the table data;
a non-table data identification module for parsing the PDF file and obtaining the non-table data.
Preferably, the data comparison module includes:
a table data comparison module for obtaining the differences between the table data of the source document and of the converted PDF document;
a non-table data comparison module for obtaining the differences between the non-table data of the source document and of the converted PDF document.
Compared with the prior art, the invention has the following beneficial effects: (1) the invention introduces a mechanism that classifies and recognizes characters, tables and pictures, so that pictures and tables are brought into the comparison and the scope of application of document comparison is widened; (2) table comparison is extended from plain cell text comparison to three levels, namely the whole table, table blocks and cell content, so that in most real business scenarios the method detects the addition and deletion of whole tables, of table columns and rows, and of content in matched cells; (3) the invention presents the structural changes a user has made to the original table more intuitively.
Drawings
FIG. 1 is a flowchart of the PDF file parsing process in the document comparison analysis method based on table structure analysis according to the present invention;
FIG. 2 is a flowchart of the table data comparison in the document comparison analysis method based on table structure analysis according to the present invention;
FIG. 3 is a flowchart of the non-table data comparison in the document comparison analysis method based on table structure analysis according to the present invention;
FIG. 4 is a schematic diagram of a two-dimensional array according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the table structure feature matrix obtained by compressing the array of FIG. 4;
FIG. 6 is a schematic diagram of the text content of document A according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the text content of document B according to an embodiment of the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention, the following description will explain the embodiments of the present invention with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
Example 1:
The document comparison analysis method based on table structure analysis comprises the following steps:
S1, receiving source files of various types and uniformly converting them into PDF files;
S2, for the different types of content in the PDF file, extracting, segmenting and recognizing each type with a different tool to obtain table data and non-table data carrying text content, coordinate information and table structure;
S3, comparing the table data and the non-table data separately to obtain the out-of-table text differences and the table differences.
Different types of files are uniformly converted into PDF because the PDF format keeps the document layout stable: whether the file is moved across system platforms or printed, its structure is not disturbed. Moreover, both pictures and commonly used Word (Microsoft Office word-processing software) documents can be converted into PDF, and unifying the file type facilitates uniform subsequent processing. PDF can also be displayed in a Web front end, so that, combined with the comparison result output by the invention, document differences can be shown on a Web page as wireframe overlays.
Further, as shown in FIG. 1, step S2 includes a PDF file parsing process comprising the following steps:
S21, parsing the PDF file generated in step S1 with a PDF file parsing tool to obtain the table data;
S22, extracting content from the PDF file in reading order with the PDF file parsing tool;
S23, if the extracted content is a character, judging whether the character belongs to the table data of step S21; if not, appending the character to the end of the character array; if so, skipping the character and proceeding to step S24;
S24, if the extracted content is a picture, recognizing the picture with OCR and appending the recognized characters, in order, to the end of the character array of step S23;
S25, the final recognition result is a character array and a table array formed from the table data; each table object in the table array records the position of the table relative to the character array.
Further, as shown in FIG. 2, the table data comparison in step S3 includes the following steps:
S31, extracting the table structure features and generating a hash from the table structure features plus the table content; tables in the source document and the converted PDF document that have the same hash are marked as "complete matching"; among the remaining tables, tables in the source document and the converted PDF document with similar structural features are compared by traversal, and tables whose content similarity exceeds 60% are marked as "partial matching"; tables that still cannot be matched are marked as deleted/added;
tables with similar structural features are tables for which the structure feature matrix of the table in one document is completely contained in the structure feature matrix of the table in the other document, or for which the identical parts overlap by more than 80%;
S32, for the "complete matching" and "partial matching" tables, assigning each group a placeholder mark composed of special characters, such that within one comparison task the placeholder marks of any two groups of "complete matching" or "partial matching" tables share no character, and every character in a placeholder mark is an uncommon character;
S33, mapping the unmatched feature blocks in a "partial matching" table back to the original table cells and marking those cells as table additions or deletions; reading the matched feature blocks in the "partial matching" table row by row and mapping them to the original table cells, then comparing the cells in the first row of the original cell area using an edit distance algorithm with the substitution operation removed, to obtain and mark the column additions and deletions in that cell area;
S34, for the columns that remain successfully matched in the feature blocks of the "partial matching" table, comparing the first column of the corresponding original table cells using the edit distance algorithm with the substitution operation removed, to obtain and mark the row additions and deletions in that cell area; for the remaining cells, concatenating the text of all cells in row-first, column-second reading order and obtaining the cell content changes with a general edit distance algorithm.
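An edit distance restricted to insertions and deletions is equivalent to aligning the two sequences along a longest common subsequence, so the header-row and first-column comparisons of S33 and S34 can be sketched as below. This is one possible reading of "edit distance with the substitution operation removed"; the function name `align` and all cell values other than "order" and "quotation" are hypothetical.

```python
def align(seq_a, seq_b):
    """Align two sequences using insertions and deletions only (no substitution).

    Returns (pairs, only_a, only_b) with 0-based indices: matched index pairs,
    items present only in seq_a, and items present only in seq_b.
    """
    n, m = len(seq_a), len(seq_b)
    # lcs[i][j] = length of the longest common subsequence of seq_a[i:] and seq_b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if seq_a[i] == seq_b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    pairs, only_a, only_b = [], [], []
    i = j = 0
    while i < n and j < m:
        if seq_a[i] == seq_b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            only_a.append(i)
            i += 1
        else:
            only_b.append(j)
            j += 1
    only_a.extend(range(i, n))
    only_b.extend(range(j, m))
    return pairs, only_a, only_b


# S33: compare the first (header) rows to find added/deleted columns.
cols = align(["order", "name", "quantity", "price"], ["name", "quantity", "price"])
# S34: compare the first column of the data rows to find added/deleted rows.
rows = align(["pen", "ink"], ["pen", "quotation", "ink"])
```

Dropping substitution matters here: a renamed header should not silently pair two unrelated columns, so unmatched cells surface as one deletion plus one addition.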
Further, as shown in FIG. 3, the non-table data comparison in step S3 includes the following steps:
S35, concatenating the non-table data of the source document and of the converted PDF document into one character string each, inserting the table placeholder marks obtained in step S32 into each string at the table-to-text relative index positions, comparing the two strings with an edit distance algorithm to obtain the differences in the out-of-table text of the two documents, and deriving from those differences any change in the position of a table relative to the out-of-table text;
S36, obtaining the final document differences, namely the out-of-table text differences and the table differences.
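Placeholder generation (S32) and insertion into the out-of-table text (S35) can be sketched as follows. The patent does not say which "uncommon characters" are used; drawing them from the Unicode Private Use Area (U+E000 to U+F8FF) is an assumption made here, and `make_placeholders` and `splice` are hypothetical names.

```python
def make_placeholders(group_count, length=4, start=0xE000):
    # Each table group gets its own run of code points, so no two marks share a character.
    if start + group_count * length > 0xF8FF + 1:
        raise ValueError("not enough private-use code points")
    return [
        "".join(chr(start + g * length + k) for k in range(length))
        for g in range(group_count)
    ]


def splice(chars, table_positions, placeholders):
    """Insert each group's mark into the out-of-table text.

    chars: the character array of out-of-table text.
    table_positions: {group index: index in chars where the table sits}.
    """
    out, last = [], 0
    for group, pos in sorted(table_positions.items(), key=lambda kv: kv[1]):
        out.append("".join(chars[last:pos]))
        out.append(placeholders[group])
        last = pos
    out.append("".join(chars[last:]))
    return "".join(out)


marks = make_placeholders(2)
text = splice(list("abcd"), {0: 2, 1: 4}, marks)
```

Because the marks share no characters with each other or with ordinary text, a character-level diff cannot partially match one mark against another, which keeps table moves distinguishable from text edits.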
Further, the table differences comprise three granularities:
whole-table difference: table added/deleted; table block difference: table block added/deleted; cell content difference: cell content added/deleted/modified.
Further, the table structure feature matrix in step S31 is generated according to the following rules:
S311, determining the size of a two-dimensional array according to the minimum granularity of the cells in the table, taking Table 1 below as an example;
Table 1. Product sales and revenue of a company over two seasons
[Table 1 image not reproduced]
The two-dimensional array corresponding to this table has 8 rows and 5 columns;
S312, filling the elements of the two-dimensional array according to the following rules:
the head cell of a horizontally merged cell is H, and the rest are _H;
the head cell of a vertically merged cell is V, and the rest are _V;
a cell that is not merged is N;
the head cell of a cell merged both horizontally and vertically is D, and the rest are _D;
this yields the two-dimensional array shown in FIG. 4;
S313, compressing the two-dimensional array formed in step S312 according to the following rules:
columns are compressed from right to left, and rows from bottom to top;
a column/row is compressed when every letter in it satisfies the following rule:
the letter is identical to the letter on its left/above it, or the two letters are the continuation and the head cell of a merged cell;
deleting the compressed column/row from the array, and repeating step S313 until the two-dimensional array cannot be compressed further;
S314, the result of step S313 is the table structure feature matrix; the result for Table 1 is shown in FIG. 5.
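Steps S311 to S314 can be sketched as below. Because Table 1 and FIGS. 4 and 5 are not reproduced here, the example uses a hypothetical table with a title row spanning three columns. The compression rule is implemented under one interpretation, stated as an assumption: a column (row) is dropped when each of its letters equals the letter to its left (above it), or is the `_X` continuation of a head `X` directly to its left (above it).

```python
def build_matrix(rows, cols, merged):
    """S311/S312. merged: list of (row, col, rowspan, colspan) for merged cells."""
    grid = [["N"] * cols for _ in range(rows)]
    for r, c, rs, cs in merged:
        if rs == 1 and cs == 1:
            continue  # not actually merged, stays N
        letter = "D" if rs > 1 and cs > 1 else "H" if cs > 1 else "V"
        for i in range(r, r + rs):
            for j in range(c, c + cs):
                grid[i][j] = "_" + letter
        grid[r][c] = letter  # head cell of the merged area
    return grid


def _redundant(cur, prev, heads):
    # cur repeats prev, or cur continues the merged cell whose head is prev.
    return cur == prev or (prev in heads and cur == "_" + prev)


def compress(grid):
    """S313: repeat column (right to left) and row (bottom to top) compression."""
    grid = [row[:] for row in grid]
    changed = True
    while changed:
        changed = False
        for j in range(len(grid[0]) - 1, 0, -1):
            if all(_redundant(row[j], row[j - 1], ("H", "D")) for row in grid):
                for row in grid:
                    del row[j]
                changed = True
        for i in range(len(grid) - 1, 0, -1):
            if all(_redundant(c, p, ("V", "D")) for c, p in zip(grid[i], grid[i - 1])):
                del grid[i]
                changed = True
    return grid


titled = build_matrix(3, 3, [(0, 0, 1, 3)])   # title row spanning 3 columns
features = compress(titled)                    # S314: the feature matrix
plain_a = compress(build_matrix(3, 4, []))     # unmerged 3x4 table
plain_b = compress(build_matrix(4, 3, []))     # unmerged 4x3 table
```

Under this reading, unmerged tables of different sizes compress to the same matrix, which agrees with the worked example further on, where materials tables of 4 columns by 3 rows and 3 columns by 4 rows have fully overlapping feature matrices.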
The invention also provides a document comparison analysis system based on table structure analysis, comprising:
a file conversion module for receiving source files of various types and uniformly converting them into PDF files;
a file identification module for extracting, segmenting and recognizing the different types of content in the PDF file, each with a different tool, to obtain table data and non-table data carrying text content, coordinate information and table structure;
a data comparison module for comparing the table data and the non-table data separately to obtain the out-of-table text differences and the table differences.
Further, the file identification module includes:
a table data identification module for parsing the PDF file and obtaining the table data;
a non-table data identification module for parsing the PDF file and obtaining the non-table data.
Further, the data comparison module includes:
a table data comparison module for obtaining the differences between the table data of the source document and of the converted PDF document;
a non-table data comparison module for obtaining the differences between the non-table data of the source document and of the converted PDF document.
Based on the technical scheme of the invention, the operation flow in a concrete implementation is described below using document A and document B shown in FIG. 6 and FIG. 7.
The differences between document A and document B are:
1. The item table in document A contains the table title "Purchase Entries List" as one row, whereas in document B this title is located outside the table.
2. The item table in document A lies between descriptions 2 and 3, whereas the item table in document B lies after description 3.
3. The materials table in document A contains the "order" column but no "quotation" row, whereas the materials table in document B does not contain the "order" column but does contain a "quotation" row.
The specific operation flow is as follows:
1. Document content is identified and extracted from document A and document B.
The text content extracted from document A is the main title and the text outside the tables; the table content is the item table (with title) and the materials table.
The text content extracted from document B is the main title, the item table title and the text outside the tables; the table content is the item table (without title) and the materials table.
2. Table feature matrices are generated from the tables of document A and document B.
The feature matrix of the item table in document A is:
[feature matrix image not reproduced]
The feature matrix of the materials table in document A is:
[feature matrix image not reproduced]
The feature matrix of the item table in document B is:
[feature matrix image not reproduced]
The feature matrix of the materials table in document B is:
[feature matrix image not reproduced]
3. Because no table in document A is identical to a table in document B in both structure and in-table text, the hashes generated from table structure plus table text differ, and no pair of tables reaches a "complete matching" relationship. Traversing the tables of the two documents and comparing content similarity and structural features readily gives: the item table of document A is associated with the item table of document B as a "partial matching" pair, and the materials table of document A is associated with the materials table of document B as a "partial matching" pair.
4. Special placeholders PH_1 (item table) and PH_2 (materials table) are generated for the two groups of associated tables. (Because the actual placeholders are long and contain invisible characters, they are referred to here as PH_1 and PH_2.)
5. Compare the item table structure feature matrices of document A and document B: the matrix of document A contains one more [structure image not reproduced] structure than that of document B. Mapping this structure back to the item table of document A shows that it corresponds to the title row. This row is marked as added.
6. The processing of the overlapped part of the structural characteristics takes a material table as an example: because the structural feature matrices of the material tables in document A and document B are completely overlapped, the material tables with 4 columns and 3 rows (document A) and 3 columns and 4 rows (document B) can be obtained by mapping the structural feature matrices back to the original tables.
7. Comparing the first row of the two table structures obtained in the step 6, the document A compares the material table in the document B with more 'sequence' columns, that is, the columns 2, 3 and 4 of the material table in the document A respectively form corresponding relations with the columns 1, 2 and 3 of the material table in the document B. The column is marked as new.
8. And (3) excluding the first row of the material table, and comparing the document A with the first column of the association column of the data row of the material table in the document B, namely comparing the 2 nd column, the 2 nd to 3 rd rows of the material table of the document A with the 1 st column and the 2 nd to 4 th rows of the material table of the document B. The available document a has one row of "quotation" deleted compared to the document B materials table. And rows 2 and 3 of the document A materials table are associated with rows 2 and 4 of the document B materials table, respectively. Line 3 of the document B materials table is marked as deleted.
9. From the row and column associations of the material tables in documents A and B, obtain the cell associations between the two tables. Then compare the contents of the associated cells in turn using an edit distance algorithm to obtain the difference points.
10. Insert the special placeholders obtained in step 4 at the corresponding positions in the list formed by each document's out-of-table text, according to the order of the tables and the out-of-table text in the document. Concatenate the out-of-table text lists of document A and document B into character strings and compare them using an edit distance algorithm to obtain the difference points: compared with document B, PH_1 is removed at the end of bidding description 2, while the text "Purchase item list" and PH_1 are added at the end of description 3.
11. From the mapping between placeholders and tables, it follows that document B moves the item table from the end of description 2 to the end of description 3.
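The column and row associations in steps 7 and 8 can be sketched with a small alignment routine. The sketch below assumes an edit distance that allows only insertions and deletions (no substitution), so two entries are associated only when they are identical. The function name `align` and the header and row values are hypothetical; the worked example only states that document A has an extra "sequence" column and that document B has an extra "quotation" row.

```python
def align(seq_a, seq_b):
    """Align two sequences with an insert/delete-only edit distance.

    Returns (pairs, only_a, only_b) using 0-based indices:
    pairs  - list of (index_in_a, index_in_b) for associated entries
    only_a - indices that exist only in seq_a
    only_b - indices that exist only in seq_b
    """
    n, m = len(seq_a), len(seq_b)
    # dist[i][j] = edit distance between seq_a[:i] and seq_b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if seq_a[i - 1] == seq_b[j - 1]:
                dist[i][j] = dist[i - 1][j - 1]
            else:
                dist[i][j] = 1 + min(dist[i - 1][j], dist[i][j - 1])

    # Walk back through the table to recover the associations.
    pairs, only_a, only_b = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or dist[i - 1][j] <= dist[i][j - 1]):
            only_a.append(i - 1)
            i -= 1
        else:
            only_b.append(j - 1)
            j -= 1
    return pairs[::-1], only_a[::-1], only_b[::-1]


# Step 7: header rows of the material tables (hypothetical header names).
header_a = ["Sequence", "Name", "Spec", "Qty"]
header_b = ["Name", "Spec", "Qty"]
col_pairs, cols_only_a, cols_only_b = align(header_a, header_b)

# Step 8: first associated column of the data rows (hypothetical values).
rows_a = ["steel", "cement"]
rows_b = ["steel", "quotation", "cement"]
row_pairs, rows_only_a, rows_only_b = align(rows_a, rows_b)
```

With 0-based indices, `col_pairs` associates columns 1, 2, 3 of document A with columns 0, 1, 2 of document B, which corresponds to columns 2 to 4 and 1 to 3 in the 1-based numbering used in the text.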
The invention introduces a mechanism for classifying and recognizing text, tables and pictures, so that pictures and tables are brought into the comparison range, which broadens the applicability of the document comparison device. Table comparison is extended from simple comparison of cell text to three dimensions: the whole table, table blocks and cell content. As a result, in most real business scenarios the method can detect the addition and deletion of whole tables, of table columns and rows, and of the content of matched cells. The invention can also show the user structural changes more intuitively on the basis of the original table.
The foregoing describes the preferred embodiments and principles of the present invention. Those skilled in the art may devise variations of the invention that fall within the spirit and scope of the appended claims.

Claims (9)

1. A document comparison analysis method based on table structure analysis, characterized by comprising the following steps:
S1, receiving source files of various types and uniformly converting them into PDF files;
S2, for the different types of content in the PDF file, extracting, segmenting and recognizing each type with a different tool, to obtain table data and non-table data carrying text content, coordinate information and table structure;
S3, comparing the table data and the non-table data separately, to finally obtain the out-of-table text differences and the table differences;
the table data comparison in step S3 comprises the following steps:
S31, extracting the table structure features and generating a hash from the table structure features plus the table content; tables with the same hash in the source document and the converted PDF document are marked as "complete matching"; among the remaining tables that are not completely matched, tables with similar structure features in the source document and the converted PDF document are compared by traversal, and tables whose content similarity exceeds 60% are marked as "partial matching"; tables that still cannot be matched are marked as deleted/added;
tables with similar structure features are tables for which the structure feature matrix of the table in either document is completely contained in the structure feature matrix of the table in the other document, or for which the identical part overlaps by more than 80%;
S32, for the "complete matching" and "partial matching" tables, allocating to each group a placeholder mark composed of special characters, such that within one comparison task the placeholder marks of any two groups of "complete matching" or "partial matching" tables share no character, and all characters in the placeholder marks are uncommon characters;
S33, mapping the unmatched feature blocks in the "partial matching" tables to the original table cells, and marking the corresponding original table cells as table additions or deletions; reading the matched feature blocks in the "partial matching" tables row by row and mapping them to the original table cells, comparing the first row of the original table cell area using an edit distance algorithm with the change (substitution) operation removed, to obtain the column additions and deletions of the corresponding original table cell area, and marking them;
S34, for the columns of the successfully matched feature blocks remaining in the "partial matching" tables, comparing the first column of the corresponding original table cells using the edit distance algorithm with the change operation removed, to obtain the row additions and deletions of the corresponding original table cell area, and marking them; and for the remaining cells, concatenating the text content of all cells in reading order, row after row, and obtaining the changes in cell content using a general edit distance algorithm.
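The table matching of step S31 can be illustrated as follows. This is a minimal sketch under several assumptions that the claim does not fix: SHA-256 is used as the hash, `difflib.SequenceMatcher.ratio()` stands in for the unspecified content similarity measure, and structural similarity is reduced to the containment case only (the "more than 80% overlap" case is omitted). The table representation (a dict with `structure` and `cells`) and all function names are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher


def table_hash(table):
    """Hash of structure features plus content (SHA-256 is an assumption)."""
    payload = repr(table["structure"]) + "\x1f" + "\x1f".join(table["cells"])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def contains(big, small):
    """True if matrix `small` occurs as a contiguous block inside `big`."""
    rows, cols = len(big), len(big[0])
    r, c = len(small), len(small[0])
    return any(
        all(big[i + d][j:j + c] == small[d] for d in range(r))
        for i in range(rows - r + 1)
        for j in range(cols - c + 1)
    )


def similar_structure(sa, sb):
    # Simplification: only the "completely contained" case of the claim.
    return contains(sa, sb) or contains(sb, sa)


def content_similarity(ta, tb):
    # Stand-in similarity measure; the claim only gives the 60% threshold.
    a, b = " ".join(ta["cells"]), " ".join(tb["cells"])
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def match_tables(tables_a, tables_b, threshold=0.6):
    """Return (index_a, index_b, label) triples; None marks a missing side."""
    results, pending_a = [], []
    free_b = set(range(len(tables_b)))
    by_hash = {}
    for j, t in enumerate(tables_b):
        by_hash.setdefault(table_hash(t), []).append(j)

    # Pass 1: identical hash means "complete" match.
    for i, t in enumerate(tables_a):
        hits = [j for j in by_hash.get(table_hash(t), []) if j in free_b]
        if hits:
            free_b.discard(hits[0])
            results.append((i, hits[0], "complete"))
        else:
            pending_a.append(i)

    # Pass 2: traverse the rest; similar structure plus similar content.
    for i in pending_a:
        for j in sorted(free_b):
            ta, tb = tables_a[i], tables_b[j]
            if (similar_structure(ta["structure"], tb["structure"])
                    and content_similarity(ta, tb) > threshold):
                free_b.discard(j)
                results.append((i, j, "partial"))
                break
        else:
            results.append((i, None, "only_in_a"))

    results += [(None, j, "only_in_b") for j in sorted(free_b)]
    return results
```

Tables reported as `only_in_a` or `only_in_b` correspond to the deleted/added tables of the claim; which side counts as "added" depends on which document is treated as the newer one.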
2. The document comparison analysis method based on table structure analysis according to claim 1, wherein the out-of-table text differences and the table differences each include the cause of the difference, the content of the difference, and the document position of the difference; the causes of difference include addition, deletion and modification; the document position of the difference includes a page number and XY coordinates.
3. The method according to claim 1, wherein step S2 comprises a PDF file parsing process comprising the following steps:
S21, using a PDF file parsing tool to parse the PDF file generated in step S1 and obtain table data;
S22, using the PDF file parsing tool to extract content from the PDF file in reading order;
S23, if the extracted content is a character, judging whether the character belongs to the table data of step S21; if not, appending the character to the end of the character array; if so, skipping the character and going to step S24;
S24, if the extracted content is a picture, recognizing the picture using OCR and appending the characters obtained by OCR, in order, to the end of the character array of step S23;
S25, the final recognition result is a character array and a table array formed from the table data; each table object in the table array contains the position of the table relative to the character array.
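The assembly of the character array and table array in steps S22 to S25 might look like the sketch below. It assumes the PDF parsing tool has already produced a list of items in reading order and that table membership can be decided from coordinates; the item format, the `inside` helper, and the `ocr` callable are hypothetical and do not correspond to any particular library.

```python
def inside(xy, bbox):
    """True if point (x, y) lies within bbox = (x0, y0, x1, y1)."""
    x, y = xy
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1


def assemble(items, tables, ocr):
    """Build the character array and table array (steps S22 to S25).

    items  - extracted content in reading order; each item is a dict with
             kind "char" (keys: text, xy) or kind "image" (key: data)
    tables - table objects from S21, each with an "id" and a "bbox"
    ocr    - callable returning the recognized text of an image (assumed)
    """
    chars = []
    positions = {}  # table id -> index in chars where the table sits
    for item in items:
        if item["kind"] == "char":
            owner = next(
                (t["id"] for t in tables if inside(item["xy"], t["bbox"])),
                None,
            )
            if owner is None:
                chars.append(item["text"])  # S23: out-of-table character
            else:
                # S23: table characters are skipped; remember where the
                # table starts relative to the out-of-table characters.
                positions.setdefault(owner, len(chars))
        elif item["kind"] == "image":
            chars.extend(ocr(item["data"]))  # S24: append OCR result
    # S25: each table object records its position in the character array.
    table_array = [dict(t, position=positions.get(t["id"])) for t in tables]
    return chars, table_array
```

Recording the index at which each table was first encountered is one simple way to keep the table-to-text relative position that the non-table comparison later relies on.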
4. The document comparison analysis method based on table structure analysis according to claim 1, wherein the non-table data comparison in step S3 comprises the following steps:
S35, concatenating the non-table data of the source document and of the converted PDF document each into a character string, inserting the table placeholder marks obtained in step S32 into the character string according to the relative index positions of tables and characters, and comparing the two character strings of the two documents using an edit distance algorithm to obtain the difference points of the out-of-table text in the two documents, the difference points also reflecting differences caused by changes in the relative position of tables and out-of-table text;
S36, obtaining the final document differences, namely the out-of-table text differences and the table differences.
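The placeholder mechanism of steps S32 and S35 can be sketched as below. The sketch makes two assumptions that the claims do not state: placeholder characters are drawn from the Unicode Private Use Area (starting at U+E000) as one way of obtaining "uncommon" characters, and `difflib.SequenceMatcher` is used as a stand-in for the edit distance algorithm. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def make_placeholders(n_groups, width=4, base=0xE000):
    """One placeholder per table group; no character is shared (S32)."""
    return [
        "".join(chr(base + g * width + k) for k in range(width))
        for g in range(n_groups)
    ]


def splice(chars, table_positions, placeholders):
    """Insert each group's placeholder at its index in the character array.

    table_positions maps group index -> index in `chars`.
    """
    out = list(chars)
    # Insert from the back so earlier indices stay valid.
    for group, pos in sorted(table_positions.items(),
                             key=lambda kv: kv[1], reverse=True):
        out[pos:pos] = [placeholders[group]]
    return "".join(out)


def moved_groups(text_a, text_b, placeholders):
    """Groups whose placeholder is removed in one place and added in another."""
    removed, inserted = [], []
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.append(text_a[i1:i2])
        if tag in ("insert", "replace"):
            inserted.append(text_b[j1:j2])
    return [
        g for g, ph in enumerate(placeholders)
        if any(ph in s for s in removed) and any(ph in s for s in inserted)
    ]


placeholders = make_placeholders(2)
text = list("Desc2.Desc3.")
doc_a = splice(text, {0: 6}, placeholders)   # table after "Desc2."
doc_b = splice(text, {0: 12}, placeholders)  # table after "Desc3."
moved = moved_groups(doc_a, doc_b, placeholders)
```

Because every placeholder consists of characters that do not occur in ordinary text or in any other placeholder, a change in a table's position shows up in the string comparison as a removed placeholder in one place and an added placeholder in another.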
5. The document comparison analysis method based on table structure analysis according to claim 1 or 4, wherein the table differences comprise 3 granularities:
whole-table difference: table addition/deletion; table block difference: table block addition/deletion; cell content difference: cell content addition/deletion/change.
6. The document comparison analysis method based on table structure analysis according to claim 1, wherein the table structure feature matrix in step S31 is generated according to the following rules:
S311, determining the size of a two-dimensional array according to the minimum cell granularity in the table;
S312, filling the elements of the two-dimensional array according to the following rules:
the head cell of a horizontally merged cell is H, and the rest are _H;
the head cell of a vertically merged cell is V, and the rest are _V;
an unmerged cell is N;
the head cell of a cell merged both horizontally and vertically is D, and the rest are _D;
S313, compressing the two-dimensional array formed in step S312 according to the following rules:
columns are compressed from right to left, and rows are compressed from bottom to top;
a column/row is compressed when all of its letters satisfy the following rule:
the letter is identical to the letter on its left/above it, or the two are a merged cell and its head cell;
deleting the compressed column/row from the array, and repeating step S313 until the two-dimensional array can no longer be compressed;
S314, the table structure feature matrix is finally obtained through step S313.
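The generation rule of steps S311 to S314 can be sketched as follows. The compression condition in the claim is terse, so the sketch adopts one reading of it as an assumption: a letter is absorbed by its left (or upper) neighbor if the two are identical, or if the letter is a continuation marker such as `_H` and the neighbor is its head `H`. The input format (grid size plus a list of merged regions) and the function names are hypothetical, and a non-empty table is assumed.

```python
def build_matrix(n_rows, n_cols, merges):
    """Fill the two-dimensional array (S311, S312).

    merges - list of (row, col, row_span, col_span) for merged cells,
             with 0-based coordinates on the minimum-granularity grid.
    """
    m = [["N"] * n_cols for _ in range(n_rows)]
    for r, c, row_span, col_span in merges:
        if row_span > 1 and col_span > 1:
            letter = "D"      # merged in both directions
        elif col_span > 1:
            letter = "H"      # horizontally merged
        elif row_span > 1:
            letter = "V"      # vertically merged
        else:
            continue          # 1x1 region: not actually merged
        for i in range(r, r + row_span):
            for j in range(c, c + col_span):
                m[i][j] = "_" + letter
        m[r][c] = letter      # head cell of the merged region
    return m


def _absorbed(cell, neighbor):
    """Assumed reading of the rule: same letter, or continuation of a head."""
    return cell == neighbor or (cell.startswith("_") and cell[1:] == neighbor)


def compress(m):
    """Compress columns right to left and rows bottom to top (S313)."""
    changed = True
    while changed:
        changed = False
        for j in range(len(m[0]) - 1, 0, -1):
            if all(_absorbed(row[j], row[j - 1]) for row in m):
                for row in m:
                    del row[j]
                changed = True
        for i in range(len(m) - 1, 0, -1):
            if all(_absorbed(a, b) for a, b in zip(m[i], m[i - 1])):
                del m[i]
                changed = True
    return m


# A plain 3-row, 4-column table and a plain 4-row, 3-column table.
plain_a = compress(build_matrix(3, 4, []))
plain_b = compress(build_matrix(4, 3, []))
# A table whose first row is a title merged across all 3 columns.
titled = compress(build_matrix(3, 3, [(0, 0, 1, 3)]))
```

Under this reading, two plain tables of different sizes reduce to the same feature matrix, while a table with a merged title row keeps one extra structure for that row.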
7. A document comparison analysis system based on table structure analysis, for implementing the document comparison analysis method based on table structure analysis according to any one of claims 1 to 6, wherein the system comprises:
a file conversion module for receiving source files of various types and uniformly converting them into PDF files;
a file identification module for extracting, segmenting and recognizing the different types of content in the PDF file, each with a different tool, to obtain table data and non-table data carrying text content, coordinate information and table structure;
and a data comparison module for comparing the table data and the non-table data separately, to finally obtain the out-of-table text differences and the table differences.
8. The system of claim 7, wherein the file identification module comprises:
a table data identification module for parsing the PDF file and obtaining table data;
and a non-table data identification module for parsing the PDF file and obtaining non-table data.
9. The system of claim 7, wherein the data comparison module comprises:
a table data comparison module for obtaining the differences between the table data of the source document and of the converted PDF document;
and a non-table data comparison module for obtaining the differences between the non-table data of the source document and of the converted PDF document.
CN202210003662.9A 2022-01-05 2022-01-05 Document comparison analysis method and system based on table structure analysis Active CN114021543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210003662.9A CN114021543B (en) 2022-01-05 2022-01-05 Document comparison analysis method and system based on table structure analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210003662.9A CN114021543B (en) 2022-01-05 2022-01-05 Document comparison analysis method and system based on table structure analysis

Publications (2)

Publication Number Publication Date
CN114021543A CN114021543A (en) 2022-02-08
CN114021543B true CN114021543B (en) 2022-04-22

Family

ID=80069729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210003662.9A Active CN114021543B (en) 2022-01-05 2022-01-05 Document comparison analysis method and system based on table structure analysis

Country Status (1)

Country Link
CN (1) CN114021543B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114637845B (en) * 2022-03-11 2023-04-14 上海弘玑信息技术有限公司 Model testing method, device, equipment and storage medium
CN116052193B (en) * 2023-04-03 2023-06-30 杭州实在智能科技有限公司 RPA interface dynamic form picking and matching method and system
TWI839304B (en) * 2023-09-15 2024-04-11 中國信託商業銀行股份有限公司 File comparison method and system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543614A (en) * 2018-11-22 2019-03-29 厦门商集网络科技有限责任公司 A kind of this difference of full text comparison method and equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10311076B1 (en) * 2016-10-26 2019-06-04 Open Invention Network, Llc Automated file acquisition, identification, extraction and transformation
CN112115111A (en) * 2019-06-20 2020-12-22 上海怀若智能科技有限公司 OCR-based document version management method and system
CN111738224B (en) * 2020-07-28 2020-12-08 浙江明度智控科技有限公司 Intelligent analysis method, system and storage medium for medicine document content
CN113468864A (en) * 2021-06-09 2021-10-01 广西电网有限责任公司 Method and device for quickly comparing long documents and storage medium
CN113361257B (en) * 2021-06-29 2022-10-11 深圳壹账通智能科技有限公司 PDF document analysis method, system, electronic device and storage medium

Also Published As

Publication number Publication date
CN114021543A (en) 2022-02-08

Similar Documents

Publication Publication Date Title
CN114021543B (en) Document comparison analysis method and system based on table structure analysis
CN108960223B (en) Method for automatically generating voucher based on intelligent bill identification
US9633257B2 (en) Method and system of pre-analysis and automated classification of documents
US9063953B2 (en) System and methods for creation and use of a mixed media environment
US9357098B2 (en) System and methods for use of voice mail and email in a mixed media environment
US8086039B2 (en) Fine-grained visual document fingerprinting for accurate document comparison and retrieval
WO2022057707A1 (en) Text recognition method, image recognition classification method, and document recognition processing method
US20110188759A1 (en) Method and System of Pre-Analysis and Automated Classification of Documents
US20040247206A1 (en) Image processing method and image processing system
JP2004334339A (en) Information processor, information processing method, and storage medium, and program
US20040213458A1 (en) Image processing method and system
JP4785655B2 (en) Document processing apparatus and document processing method
US11615244B2 (en) Data extraction and ordering based on document layout analysis
CN108197119A (en) The archives of paper quality digitizing solution of knowledge based collection of illustrative plates
Borovikov A survey of modern optical character recognition techniques
CN103996055A (en) Identification method based on classifiers in image document electronic material identification system
CN116580414A (en) Contract document difference detection method and device based on ICR character matrix
CN111860524A (en) Intelligent classification device and method for digital files
CN112464907A (en) Document processing system and method
JP6856916B1 (en) Information processing equipment, information processing methods and information processing programs
US20200311059A1 (en) Multi-layer word search option
CN100444194C (en) Automatic extraction device, method and program of essay title and correlation information
Tomaschek Evaluation of off-the-shelf OCR technologies
Clausner et al. Unearthing the recent past: digitising and understanding statistical information from census tables
Yacoub et al. Document digitization lifecycle for complex magazine collection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant