CN114021543B

CN114021543B - Document comparison analysis method and system based on table structure analysis

Info

Publication number: CN114021543B
Application number: CN202210003662.9A
Authority: CN
Inventors: 郑飞鹏
Original assignee: Hangzhou Real Intelligence Technology Co., Ltd.
Current assignee: Hangzhou Real Intelligence Technology Co., Ltd.
Priority date: 2022-01-05
Filing date: 2022-01-05
Publication date: 2022-04-22
Anticipated expiration: 2042-01-05
Also published as: CN114021543A

Abstract

The invention belongs to the technical field of data processing, and in particular relates to a document comparison analysis method and system based on table structure analysis. The method comprises the steps of: S1, receiving source files of various types and uniformly converting them into PDF files; S2, for the different types of content in the PDF file, extracting, segmenting and recognizing each type with a different tool to obtain table data and non-table data carrying text content, coordinate information and table structure; and S3, comparing the table data and the non-table data separately to obtain the out-of-table text differences and the table differences. The system comprises a file conversion module, a file identification module and a data comparison module. The invention focuses on comparing documents at the content and semantic level, can compare the structure and semantics of tables during document comparison, and offers a good comparison effect, low resource usage and accurate character recognition.

Description

Document comparison analysis method and system based on table structure analysis

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a document comparison analysis method and system based on table structure analysis.

Background

Today, information is updated rapidly and iteratively, and businesses, governments, public institutions and individuals alike encounter and use more and more documents in production and daily life. In practice, a document often changes hands many times. Each handoff may introduce some deviation, so that the final document differs considerably from the original and the related business is affected. For example, a business contract goes through multiple revisions between the initial draft and the final version; besides changes in content, each revision may introduce subtle changes in format because the people involved use different document editing tools. The contracting parties may also need to transmit the contract by fax, as printed matter, and so on, which involves converting the text document to and from other carrier forms such as electronic images, paper documents and scans. Because technologies such as OCR are currently of limited accuracy, the conversion process has a certain probability of causing content to be lost or altered.

Currently, the most common way to deal with such conversion errors is manual checking. When a contract is long, manual review by the contract examiner is inefficient.

Therefore, it is very important to develop a computer program that accepts multiple document formats, parses document content accurately, accurately describes the content differences between different versions and different source files, and performs the comparison efficiently.

However, existing document comparison technology has the following disadvantages:

1. Document comparison based on image template matching places overly strict requirements on the content structure of the document;

2. The comparison effect on documents that mix images, text and tables is poor;

3. OCR recognition consumes considerable resources and has a certain probability of character recognition errors.

Therefore, it is very important to design a document comparison analysis method and system based on table structure analysis that focuses on comparing documents at the content and semantic level, can compare the structure and semantics of tables during document comparison, and offers a good comparison effect, low resource usage and accurate character recognition.

For example, Chinese patent application No. CN202110644806.4 describes a method, an apparatus and a storage medium for rapidly comparing long documents. For two long documents to be compared, the comparison method includes the following steps: S1, parsing the two documents to form tree-shaped document structures; S2, splitting the two documents into two groups of content blocks according to the tree-shaped document structures; S3, establishing a mapping between the two groups of content blocks to form a plurality of mapping pairs; S4, running multiple tasks in parallel, each task comparing the two content blocks of one mapping pair word by word to find the difference points. Although this method speeds up the comparison of long documents, it cannot compare table structure and semantics, so differences in table structure between the actual documents are not reflected.

Disclosure of Invention

To overcome the problems of the existing document comparison technology, namely strict requirements on the structure of document content, limited applicability, poor comparison effect, high resource usage and a certain probability of character recognition errors, the invention provides a document comparison analysis method and system based on table structure analysis that focuses on comparing documents at the content and semantic level, can compare the structure and semantics of tables during document comparison, and offers a good comparison effect, low resource usage and accurate character recognition.

In order to achieve this purpose, the invention adopts the following technical scheme:

The document comparison analysis method based on table structure analysis comprises the following steps:

S1, receiving source files of various types and uniformly converting them into PDF files;

S2, for the different types of content in the PDF file, extracting, segmenting and recognizing each type with a different tool to obtain table data and non-table data carrying text content, coordinate information and table structure;

S3, comparing the table data and the non-table data separately to obtain the out-of-table text differences and the table differences.

Preferably, the out-of-table text differences and the table differences each comprise the reason for the difference, the content of the difference and the position of the difference in the document; the reasons for a difference include addition, deletion and modification; the position of a difference in the document includes a page number and XY coordinates.
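The difference record described above can be pictured as a small data structure. The following is a minimal sketch only; the class and field names (`DiffReason`, `Difference`, `page`, `x`, `y`) are assumptions made for illustration and do not come from the patent text.

```python
from dataclasses import dataclass
from enum import Enum


class DiffReason(Enum):
    """The three reasons for a difference named in the text."""
    ADDED = "added"
    DELETED = "deleted"
    MODIFIED = "modified"


@dataclass(frozen=True)
class Difference:
    """One reported difference (hypothetical field names)."""
    reason: DiffReason  # addition, deletion or modification
    content: str        # the differing text
    page: int           # page number on which the difference lies
    x: float            # X coordinate on that page
    y: float            # Y coordinate on that page


# Example record with made-up values.
diff = Difference(DiffReason.MODIFIED, "unit price", page=2, x=72.0, y=540.5)
```

Carrying the page number and coordinates in every record is what would let a front end draw a box around each difference.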

Preferably, step S2 includes a PDF file parsing process, which includes the following steps:

S21, parsing the PDF file generated in step S1 with a PDF file parsing tool to obtain the table data;

S22, extracting the content of the PDF file in reading order with the PDF file parsing tool;

S23, if the extracted content is a character, judging whether the character belongs to the table data obtained in step S21; if not, appending the character to the end of the character array; if so, skipping the character and going to step S24;

S24, if the extracted content is a picture, recognizing the picture with OCR and appending the characters obtained by OCR, in order, to the end of the character array of step S23;

S25, the final recognition result is the character array and a table array formed from the table data; each table object in the table array records the relative position of the table in the character array.
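Steps S22 to S25 can be sketched as a single pass over already-extracted content. This is an illustrative sketch under stated assumptions, not the patented implementation: the types `Char`, `Picture` and `Table`, the function `parse`, and the `ocr` callable are all hypothetical, and no real PDF or OCR library is called. The input is assumed to be a list of items in reading order, with table membership (S21) already known.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union


@dataclass
class Char:
    text: str
    table_id: Optional[int] = None  # id of the S21 table containing this character, if any


@dataclass
class Picture:
    data: bytes


@dataclass
class Table:
    table_id: int
    position: int = -1  # index in the character array at which the table occurs


def parse(
    items: List[Union[Char, Picture]],
    tables: List[Table],
    ocr: Callable[[bytes], str],
) -> Tuple[List[str], List[Table]]:
    """Walk the items in reading order and build the character and table arrays."""
    chars: List[str] = []
    by_id = {t.table_id: t for t in tables}
    for item in items:
        if isinstance(item, Picture):
            # S24: OCR the picture and append its characters in order.
            chars.extend(ocr(item.data))
        elif item.table_id is None:
            # S23: a character outside every table goes to the end of the array.
            chars.append(item.text)
        else:
            # S23: a table character is skipped; remember where its table sits.
            table = by_id[item.table_id]
            if table.position < 0:
                table.position = len(chars)
    return chars, tables


# Usage with a stand-in OCR function that always returns "OCR".
items = [Char("A"), Char("x", table_id=0), Picture(b"..."), Char("B")]
chars, tables = parse(items, [Table(0)], ocr=lambda data: "OCR")
```

Here `chars` holds only out-of-table characters, and `tables[0].position` records that the table sits after the first character, which is the relative position information S25 calls for.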

Preferably, the table data comparison in step S3 includes the following steps:

S31, extracting the table structure features and generating a hash from the table structure features plus the table content; tables in the source document and the converted PDF document that have the same hash are marked as 'complete match'; among the remaining tables, tables in the source document and the converted PDF document that have similar structural features are compared by traversal, and those whose content similarity exceeds 60% are marked as 'partial match'; the tables that still cannot be matched are marked as deleted/added;

tables with similar structural features are tables for which the structure feature matrix of the table in one document is completely contained in the structure feature matrix of the table in the other document, or for which the identical parts of the two matrices overlap by more than 80%;

S32, for the 'complete match' and 'partial match' tables, assigning each group a placeholder mark composed of special characters, such that within one comparison task the placeholder marks of any two groups of 'complete match' or 'partial match' tables share no character, and every character in a placeholder mark is an uncommon character;

S33, mapping the unmatched feature blocks of a 'partial match' table to the original table cells and marking the corresponding original table cells as table additions or deletions; reading the matched feature blocks of the 'partial match' table row by row and mapping them to the original table cells, comparing the first row of the original table cell region with an edit distance algorithm from which the modify operation has been removed to obtain the column additions and deletions of the corresponding original table cell region, and marking them;

S34, within the remaining columns of the successfully matched feature blocks of the 'partial match' table, comparing the first column of the corresponding original table cells with the edit distance algorithm from which the modify operation has been removed to obtain the row additions and deletions of the corresponding original table cell region, and marking them; for the remaining cells, concatenating the text content of all cells in reading order, row first and then column, and obtaining the changes in cell content with a general edit distance algorithm.
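The two-pass matching of step S31 can be sketched as follows. This is a simplified illustration, not the patented implementation. It assumes that each table is a `(matrix, cells)` tuple, that SHA-256 serves as the hash, and that `difflib.SequenceMatcher.ratio()` serves as the content similarity measure (the text does not name either). The structural-similarity rule (containment or more than 80% overlap) is not implemented; a caller-supplied predicate stands in for it, and the example uses plain equality.

```python
import hashlib
from difflib import SequenceMatcher


def table_hash(matrix, cells):
    """Hash of structure features plus content (SHA-256 is an assumption)."""
    return hashlib.sha256(repr((matrix, cells)).encode("utf-8")).hexdigest()


def content_similarity(cells_a, cells_b):
    """Similarity in [0, 1] of the concatenated cell text (assumed measure)."""
    return SequenceMatcher(None, "".join(cells_a), "".join(cells_b)).ratio()


def match_tables(source, target, similar, threshold=0.6):
    """Return (matches, deleted source indices, added target indices)."""
    matches, remaining = [], []
    free = list(range(len(target)))
    # Pass 1: identical hashes are complete matches.
    for i, table in enumerate(source):
        h = table_hash(*table)
        hit = next((j for j in free if table_hash(*target[j]) == h), None)
        if hit is None:
            remaining.append(i)
        else:
            matches.append((i, hit, "complete"))
            free.remove(hit)
    # Pass 2: similar structure and content similarity above the threshold.
    deleted = []
    for i in remaining:
        matrix, cells = source[i]
        hit = next(
            (j for j in free
             if similar(matrix, target[j][0])
             and content_similarity(cells, target[j][1]) > threshold),
            None,
        )
        if hit is None:
            deleted.append(i)
        else:
            matches.append((i, hit, "partial"))
            free.remove(hit)
    return matches, deleted, free


# Usage with made-up tables; every feature matrix here is a 2x2 grid of "N".
grid = (("N", "N"), ("N", "N"))
src = [(grid, ("a", "b", "c", "d")), (grid, ("name", "price", "bolt", "10"))]
tgt = [(grid, ("name", "price", "bolt", "12")), (grid, ("a", "b", "c", "d")),
       (grid, ("x", "y", "z", "w"))]
matches, deleted, added = match_tables(src, tgt, similar=lambda a, b: a == b)
```

Running the cheap hash comparison first means the more expensive similarity scan only touches tables that actually changed.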

Preferably, the non-table data comparison in step S3 includes the following steps:

S35, concatenating the non-table data of the source document and of the converted PDF document into one character string each, inserting the table placeholder marks obtained in step S32 into each string at the relative index positions of the tables among the characters, and comparing the two strings with an edit distance algorithm to obtain the difference points of the out-of-table text of the two documents; the difference points also reveal differences caused by changes in the position of a table relative to the out-of-table text;

S36, the final document differences are the out-of-table text differences and the table differences.
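The placeholder idea of steps S32 and S35 can be sketched as below. This is a minimal sketch under assumptions: placeholder characters are drawn from the Unicode Private Use Area as one possible source of "uncommon" characters, the helper names are hypothetical, and Python's `difflib.SequenceMatcher` is used in place of the edit distance algorithm the text refers to (it yields insert/delete/replace operations but is not guaranteed to be a minimal edit script).

```python
from difflib import SequenceMatcher

PUA_START = 0xE000  # start of the Unicode Private Use Area (assumed character source)


def make_placeholders(count, width=2):
    """Return `count` placeholder strings that share no character with each other."""
    return [
        "".join(chr(PUA_START + k * width + i) for i in range(width))
        for k in range(count)
    ]


def build_string(segments, placeholders):
    """Join out-of-table text and table placeholders in reading order.

    `segments` holds str items (text) and int items (index of a matched table group).
    """
    return "".join(placeholders[s] if isinstance(s, int) else s for s in segments)


def text_changes(a, b):
    """List the non-equal operations needed to turn string `a` into string `b`."""
    ops = SequenceMatcher(None, a, b, autojunk=False).get_opcodes()
    return [(tag, a[i1:i2], b[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


# Usage: the same table (group 0) sits after "Note 2." in one document
# and after "Note 3." in the other.
ph = make_placeholders(2)
doc_a = build_string(["Note 2.", 0, "Note 3."], ph)
doc_b = build_string(["Note 2.", "Note 3.", 0], ph)
changes = text_changes(doc_a, doc_b)
```

Because the placeholder characters occur nowhere in ordinary text, a deleted placeholder in one position paired with an inserted one elsewhere can be read back as a moved table.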

Preferably, the table differences have 3 granularities:

whole-table differences: addition/deletion of a table; table block differences: addition/deletion of a table block; cell content differences: addition/deletion/modification of cell content.

Preferably, the generation rule of the table structure feature matrix in step S31 is as follows:

S311, determining the size of a two-dimensional array according to the minimum granularity of the cells in the table;

S312, filling the elements of the two-dimensional array according to the following rules:

the head cell of a horizontally merged cell is H, and the remaining cells are _H;

the head cell of a vertically merged cell is V, and the remaining cells are _V;

an unmerged cell is N;

the head cell of a cell merged both horizontally and vertically is D, and the remaining cells are _D;

S313, compressing the two-dimensional array formed in step S312 according to the following rules:

columns are compressed from right to left, and rows are compressed from bottom to top;

a column/row is compressed when all of its letters satisfy the following rule:

each letter of the column/row is identical to the letter on its left/above it, or the two letters are a merged cell and its head cell;

deleting the compressed column/row from the array, and repeating step S313 until the two-dimensional array cannot be compressed further;

S314, the array resulting from step S313 is the table structure feature matrix.
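Steps S311 to S314 can be sketched as follows. This is one possible reading of the rules, offered as a sketch rather than the definitive algorithm. It assumes merged cells are given as `(row, col, rowspan, colspan)` tuples, and it reads the compression rule as: a cell is redundant if it equals its left (or upper) neighbour, or if it is a continuation marker whose neighbour is the head of the same kind of merge. The function names are hypothetical.

```python
def fill_matrix(rows, cols, merges):
    """S311-S312: build the letter grid from merged-cell spans."""
    m = [["N"] * cols for _ in range(rows)]
    for r0, c0, rowspan, colspan in merges:
        if rowspan == 1 and colspan == 1:
            continue  # not a merged cell
        letter = "D" if rowspan > 1 and colspan > 1 else "H" if colspan > 1 else "V"
        for r in range(r0, r0 + rowspan):
            for c in range(c0, c0 + colspan):
                m[r][c] = "_" + letter
        m[r0][c0] = letter  # the head cell carries the bare letter
    return m


def _redundant(cell, neighbour, continuations):
    return cell == neighbour or (cell in continuations and neighbour == cell[1:])


def _compress_one_column(m):
    for c in range(len(m[0]) - 1, 0, -1):  # right to left
        if all(_redundant(row[c], row[c - 1], ("_H", "_D")) for row in m):
            for row in m:
                del row[c]
            return True
    return False


def _compress_one_row(m):
    for r in range(len(m) - 1, 0, -1):  # bottom to top
        if all(_redundant(m[r][c], m[r - 1][c], ("_V", "_D")) for c in range(len(m[0]))):
            del m[r]
            return True
    return False


def compress(matrix):
    """S313-S314: delete redundant columns/rows until nothing changes."""
    m = [row[:] for row in matrix]  # work on a copy
    while _compress_one_column(m) or _compress_one_row(m):
        pass
    return m


# Usage: a 3x3 table whose first row is one cell merged across all columns.
titled = fill_matrix(3, 3, [(0, 0, 1, 3)])
titled_features = compress(titled)
plain_features = compress(fill_matrix(3, 4, []))
```

Under this reading, a table with no merged cells collapses to a single `N` regardless of its size, so two plain tables with different row and column counts still have the same structure features.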

The invention also provides a document comparison analysis system based on table structure analysis, which comprises:

a file conversion module, used for receiving source files of various types and uniformly converting them into PDF files;

a file identification module, used for extracting, segmenting and recognizing the different types of content in the PDF file with different tools to obtain table data and non-table data carrying text content, coordinate information and table structure;

a data comparison module, used for comparing the table data and the non-table data separately to obtain the out-of-table text differences and the table differences.

Preferably, the file identification module includes:

a table data identification module, used for parsing the PDF file to obtain the table data;

a non-table data identification module, used for parsing the PDF file to obtain the non-table data.

Preferably, the data comparison module comprises:

a table data comparison module, used for obtaining the differences between the table data of the source document and of the converted PDF document;

a non-table data comparison module, used for obtaining the differences between the non-table data of the source document and of the converted PDF document.

Compared with the prior art, the invention has the following beneficial effects: (1) the invention introduces a mechanism that classifies and recognizes text, tables and pictures, so that pictures and tables are brought within the scope of comparison and the range of application of document comparison is widened; (2) table comparison is extended from simple comparison of cell text to comparison at three levels, namely the whole table, the table block and the cell content, so that in most real business scenarios the method can detect the addition and deletion of whole tables, the addition and deletion of table columns and rows, and the addition and deletion of the content of matched cells; (3) the invention shows more intuitively the structural changes a user has made relative to the original table.

Drawings

FIG. 1 is a flow chart of a PDF file parsing process in a document comparison analysis method based on table structure parsing according to the present invention;

FIG. 2 is a flowchart of table data comparison in a document comparison analysis method based on table structure parsing according to the present invention;

FIG. 3 is a flow chart of non-table data comparison in the document comparison analysis method based on table structure analysis according to the present invention;

FIG. 4 is a schematic diagram of a two-dimensional array according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of the compressed table structure feature matrix of FIG. 4;

FIG. 6 is a schematic diagram of the text content of document A according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of the text content of document B according to an embodiment of the present invention.

Detailed Description

In order to more clearly illustrate the embodiments of the present invention, the following description will explain the embodiments of the present invention with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.

Example 1:

Files of different types are uniformly converted into PDF because the PDF format keeps the document layout stable: whether a file is moved across system platforms or printed, its structure does not become disordered. Moreover, both pictures and commonly used Word documents (Microsoft Office word-processing software) can be converted into PDF, so unifying the file type facilitates uniform subsequent processing. In addition, PDF can be displayed in a Web front end, so the difference points between documents can be shown on a Web page as outlined boxes, based on the comparison result output by the invention.

Further, as shown in fig. 1, step S2 includes a PDF file parsing process, which includes the following steps:

Further, as shown in fig. 2, the table data comparison in step S3 includes the following steps:

Further, as shown in FIG. 3, the non-table data comparison in step S3 includes the following steps:

Further, the table difference contains 3 granularities:

Further, the generating rule of the table structure feature matrix in step S31 is as follows:

S311, determining the size of a two-dimensional array according to the minimum granularity of the cells in the table, taking Table 1 below as an example;

Table 1. Sales volume and revenue of a company's products over two quarters

The two-dimensional array corresponding to this table has 8 rows and 5 columns;

the head cell of a horizontally merged cell is H, and the remaining cells are _H;

the head cell of a vertically merged cell is V, and the remaining cells are _V;

an unmerged cell is N;

this yields the two-dimensional array shown in FIG. 4;

S314, the array resulting from step S313 is the table structure feature matrix; the result for Table 1 is shown in FIG. 5.

Further, the file identification module comprises:

Further, the data comparison module comprises:

Based on the technical scheme of the invention, the operation flow in a concrete implementation is described below using document A and document B, shown in FIG. 6 and FIG. 7:

The difference points between document A and document B comprise:

1. The item table in document A contains the table title "Purchase Item List" as one of its rows, whereas in document B this title sits outside the table.

2. The item table in document A lies between descriptions 2 and 3, whereas the item table in document B lies after description 3.

3. The material table in document A contains the "order" column but no "quotation" row, whereas the material table in document B contains the "quotation" row but no "order" column.

The specific operation flow is as follows:

1. Document content is recognized and extracted from document A and document B.

The text content extracted from document A is the main title and the text outside the tables; the table content is the item table (with title) and the material table.

The text content extracted from document B is the main title, the item table title and the text outside the tables; the table content is the item table (without title) and the material table.

2. Table feature matrices are generated for the tables of document A and document B.

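The feature matrix of the document A item table is: [matrix image omitted]; the feature matrix of the document A material table is: [matrix image omitted];

the feature matrix of the document B item table is: [matrix image omitted]; the feature matrix of the document B material table is: [matrix image omitted].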

3. Since neither the table structure nor the in-table text of any table is exactly the same in documents A and B, the hashes generated from table structure plus table text are not equal, so no pair of tables reaches a 'complete match'. Traversing the tables of the two documents and comparing content similarity and structural features readily gives: the item table of document A is associated with the item table of document B as a 'partial match'; the material table of document A is associated with the material table of document B as a 'partial match'.

4. Special placeholders PH_1 (item table) and PH_2 (material table) are generated for the two groups of associated tables. (The actual special placeholders are long and contain invisible characters, so they are referred to here as PH_1 and PH_2.)

5. Comparing the item table structure feature matrices of document A and document B shows that the matrix of document A has one more structure than the matrix of document B. Mapping this structure back to the item table of document A shows that it corresponds to the title row. This row is marked as added.

6. Processing of the overlapping part of the structural features, taking the material table as an example: because the structure feature matrices of the material tables in document A and document B overlap completely, mapping them back to the original tables yields a material table of 4 columns and 3 rows (document A) and one of 3 columns and 4 rows (document B).

7. Comparing the first rows of the two tables obtained in step 6 shows that the material table in document A has one more column, "order", than the material table in document B; that is, columns 2, 3 and 4 of the material table in document A correspond to columns 1, 2 and 3 of the material table in document B, respectively. The extra column is marked as added.

8. Excluding the first row of each material table, the first associated column of the data rows is compared between document A and document B, that is, column 2, rows 2 to 3 of the material table of document A is compared with column 1, rows 2 to 4 of the material table of document B. This shows that, compared with the material table of document B, document A has the "quotation" row deleted, and that rows 2 and 3 of the document A material table are associated with rows 2 and 4 of the document B material table, respectively. Row 3 of the document B material table is marked as deleted.

9. The row and column associations of the material tables in documents A and B give the cell associations between the two tables. The contents of associated cells are then compared in turn with an edit distance algorithm to find their difference points.

10. The special placeholders obtained in step 4 are inserted at the corresponding positions in the list formed by the out-of-table text of each document, following the order of tables and out-of-table text in that document. The out-of-table text lists of document A and document B are each concatenated into a character string, and the two strings are compared with an edit distance algorithm to obtain the difference points: compared with document A, document B has PH_1 removed at the end of bidding description 2, and has the text "Purchase Item List" and PH_1 added at the end of description 3.

11. From the mapping between placeholders and tables, it follows that document B moves the item table from the end of description 2 to the end of description 3.
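The column and row alignment in steps 7 and 8 can be sketched with an edit distance restricted to insertions and deletions, which is equivalent to aligning on a longest common subsequence. This is an illustrative sketch: the function `align` is hypothetical, and all cell values other than the "order" column and the "quotation" row ("Name", "Qty", "Price", "Bolt", "Nut") are invented for the example, since the source does not reproduce the tables. Document B is treated as the baseline, matching the markings in steps 7 and 8.

```python
def align(old, new):
    """Align two sequences using only insertions and deletions.

    Returns (pairs, deleted, added): index pairs of matched items,
    indices present only in `old`, and indices present only in `new`.
    """
    n, m = len(old), len(new)
    # lcs[i][j] = length of the longest common subsequence of old[i:] and new[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    pairs, deleted, added = [], [], []
    i = j = 0
    while i < n and j < m:
        if old[i] == new[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            deleted.append(i)  # item exists only in `old`
            i += 1
        else:
            added.append(j)    # item exists only in `new`
            j += 1
    deleted.extend(range(i, n))
    added.extend(range(j, m))
    return pairs, deleted, added


# Step 7: compare header rows (document B as baseline, document A as the other side).
header_b = ["Name", "Qty", "Price"]
header_a = ["Order", "Name", "Qty", "Price"]
col_pairs, cols_deleted, cols_added = align(header_b, header_a)

# Step 8: compare the first associated column of the data rows.
rows_b = ["Bolt", "Quotation", "Nut"]
rows_a = ["Bolt", "Nut"]
row_pairs, rows_deleted, rows_added = align(rows_b, rows_a)
```

With zero-based indices, `cols_added == [0]` corresponds to the added "order" column of document A, and `rows_deleted == [1]` corresponds to the "quotation" row (row 3 of the document B table once the header row is counted).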

The invention introduces a mechanism that classifies and recognizes text, tables and pictures, so that pictures and tables are brought within the scope of comparison and the range of application of document comparison is widened; table comparison is extended from simple comparison of cell text to comparison at three levels, namely the whole table, the table block and the cell content, so that in most real business scenarios the method can detect the addition and deletion of whole tables, the addition and deletion of table columns and rows, and the addition and deletion of the content of matched cells; the invention shows more intuitively the structural changes a user has made relative to the original table.

The foregoing has outlined rather broadly the preferred embodiments and principles of the present invention and it will be appreciated that those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.

Claims

1. A document comparison analysis method based on table structure analysis, characterized by comprising the following steps:

S3, comparing the table data and the non-table data separately to obtain the out-of-table text differences and the table differences;

the table data comparison in step S3 includes the following steps:

2. The document comparison analysis method based on table structure analysis according to claim 1, wherein the out-of-table text differences and the table differences each include the reason for the difference, the content of the difference and the position of the difference in the document; the reasons for a difference include addition, deletion and modification; the position of a difference in the document includes a page number and XY coordinates.

3. The document comparison analysis method based on table structure analysis according to claim 1, wherein step S2 comprises a PDF file parsing process, which comprises the following steps:

4. The document comparison analysis method based on table structure analysis according to claim 1, wherein the non-table data comparison in step S3 comprises the following steps:

5. The document comparison analysis method based on table structure analysis according to claim 1 or 4, wherein the table differences have 3 granularities:

6. The document comparison analysis method based on table structure analysis according to claim 1, wherein the generation rule of the table structure feature matrix in step S31 is as follows:

the head cell of a horizontally merged cell is H, and the remaining cells are _H;

the head cell of a vertically merged cell is V, and the remaining cells are _V;

an unmerged cell is N;

S314, the array resulting from step S313 is the table structure feature matrix.

7. A document comparison analysis system based on table structure analysis, for implementing the document comparison analysis method based on table structure analysis according to any one of claims 1 to 6, wherein the system comprises:

8. The system of claim 7, wherein the file identification module comprises:

9. The system of claim 7, wherein the data comparison module comprises: