CN113536751B

CN113536751B - Processing method and device of form data, electronic equipment and storage medium

Info

Publication number: CN113536751B
Application number: CN202110738835.7A
Authority: CN
Inventors: 李晨辉; 胡腾; 陈永锋
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2023-09-22
Anticipated expiration: 2041-06-30
Also published as: CN113536751A

Abstract

The application discloses a method and an apparatus for processing table data, an electronic device, and a storage medium. It relates to the field of computers, and in particular to artificial intelligence fields such as natural language processing and knowledge graphs. The specific implementation scheme is as follows: identifying a target table to be processed to determine a target style parameter and a first content list corresponding to the target table; extracting candidate reference tables from a plurality of reference tables based on the target style parameter; determining the content similarity and the position similarity between the target table and a candidate reference table according to the first content list and a second content list corresponding to the candidate reference table; determining the candidate reference table as a target reference table when the content similarity and the position similarity meet preset conditions; and determining the association relation between the first contents in the target table according to the association relation between the second contents in the target reference table. The method enables complex tables to be structured.

Description

Processing method and device of form data, electronic equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, in particular to artificial intelligence fields such as natural language processing and knowledge graphs, and more specifically to a method and an apparatus for processing table data, an electronic device, and a storage medium.

Background

Tables are widely used in everyday life and take many different forms. Apart from a few simple tables, most are semi-structured and cannot be directly understood and used by a computer system.

Therefore, how to convert a complex semi-structured table into structured information that a computer system can use directly is a problem to be solved.

Disclosure of Invention

The application provides a method and an apparatus for processing table data, an electronic device, and a storage medium.

According to an aspect of the present application, there is provided a method for processing table data, including:

identifying a target table to be processed to determine a target style parameter corresponding to the target table and a first content list, wherein the first content list comprises a plurality of first contents and position parameters corresponding to each first content;

extracting candidate reference tables from a plurality of reference tables based on the target style parameters;

determining content similarity and position similarity between the target table and the candidate reference table according to the first content list and a second content list corresponding to the candidate reference table;

determining the candidate reference table as a target reference table when the content similarity and the position similarity meet preset conditions;

and determining the association relation between the first contents in the target table according to the association relation between the second contents in the target reference table.

According to another aspect of the present application, there is provided an apparatus for processing table data, including: an identifying module, configured to identify a target table to be processed so as to determine a target style parameter and a first content list corresponding to the target table, wherein the first content list comprises a plurality of first contents and a position parameter corresponding to each first content;

an extraction module, configured to extract candidate reference tables from a plurality of reference tables based on the target style parameter;

a first determining module, configured to determine the content similarity and the position similarity between the target table and the candidate reference table according to the first content list and a second content list corresponding to the candidate reference table;

a second determining module, configured to determine the candidate reference table as a target reference table when the content similarity and the position similarity meet preset conditions;

and a third determining module, configured to determine the association relation between the first contents in the target table according to the association relation between the second contents in the target reference table.

According to another aspect of the present application, there is provided an electronic apparatus including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the above embodiments.

According to another aspect of the present application, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method according to the above-described embodiments.

According to another aspect of the application, a computer program product is provided, comprising a computer program which, when being executed by a processor, implements a method according to the above-described embodiments.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.

Drawings

The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:

FIG. 1 is a flowchart of a method for processing table data according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating another method for processing table data according to an embodiment of the present application;

FIG. 3 is a flowchart illustrating another method for processing table data according to an embodiment of the present application;

FIG. 4 is a flowchart illustrating another method for processing table data according to an embodiment of the present application;

FIG. 5 is a flowchart illustrating another method for processing table data according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a reference table recognition process according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a process for processing a table to be processed by using a reference table according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a table data processing device according to an embodiment of the present application;

FIG. 9 is a block diagram of an electronic device for implementing a method of processing table data according to an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The following describes a method, an apparatus, an electronic device, and a storage medium for processing table data according to an embodiment of the present application with reference to the accompanying drawings.

Artificial intelligence is the discipline of using computers to simulate certain human mental processes and intelligent behaviors (e.g., learning, reasoning, thinking, planning, etc.), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies include computer vision, speech recognition, natural language processing, deep learning, big data processing, knowledge graph technology, and the like.

NLP (Natural Language Processing) is an important direction in the fields of computer science and artificial intelligence. The content of NLP research includes, but is not limited to, the following branch fields: text classification, information extraction, automatic summarization, intelligent question answering, topic recommendation, machine translation, topic word recognition, knowledge base construction, deep text representation, named entity recognition, text generation, text analysis (lexical, syntactic, grammatical, etc.), speech recognition and synthesis, and the like.

FIG. 1 is a flowchart of a method for processing table data according to an embodiment of the present application.

The processing method of the table data in the embodiment of the application can be executed by the processing device of the table data provided in the embodiment of the application, and the device can be configured in the electronic equipment to determine the association relationship among the contents in the target table according to the association relationship among the contents in the similar reference table so as to realize the structuring of the complex table.

As shown in fig. 1, the processing method of the table data includes:

Step 101, identifying a target table to be processed to determine a target style parameter and a first content list corresponding to the target table.

In the present application, the target table to be processed may be identified row by row or column by column to determine the number of rows and columns it contains, the content of each cell, and so on. The style parameter and the first content list corresponding to the target table can then be determined from the identified content.

The style parameter may be the type of the table (such as a key-value table, a relational table, etc.), or may be the number of rows and columns of the target table. The first content list includes a plurality of first contents and a position parameter corresponding to each first content. A first content is the content of a cell in the target table, and the position parameter indicates the position of that first content in the target table.

For example, a table is identified and determined to have 2 rows and 2 columns. It can be regarded as a 2×2 cell distribution matrix whose 4 elements take the values 0, 1, 2, and 3 in sequence (for example, numbered row by row from left to right, or column by column), and the content list of the table is 0: name; 1: gender; 2: birthplace; 3: contact information. Here, "0: name" indicates that the content at the table position corresponding to the element with value 0 is "name", that is, the position parameter corresponding to the content "name" in the table is 0.
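The example above can be sketched as follows. This is a minimal illustration, assuming the table has already been parsed into a list of rows, that the style parameter is simply the row and column count, and that positions are numbered row by row from left to right. The function name `build_content_list` is hypothetical and does not come from the application.

```python
from typing import Dict, List, Tuple


def build_content_list(rows: List[List[str]]) -> Tuple[Tuple[int, int], Dict[int, str]]:
    """Return (style parameter, content list) for a rectangular table.

    Assumption: the style parameter is the (row count, column count) pair and
    position parameters are assigned row by row, left to right, starting at 0.
    """
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    content_list: Dict[int, str] = {}
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            # Position parameter of the cell in row r, column c.
            content_list[r * n_cols + c] = cell
    return (n_rows, n_cols), content_list


style, contents = build_content_list(
    [["name", "gender"], ["birthplace", "contact information"]]
)
print(style)     # (2, 2)
print(contents)  # {0: 'name', 1: 'gender', 2: 'birthplace', 3: 'contact information'}
```

A dictionary keyed by position parameter is only one possible representation; a list of (position, content) pairs would serve equally well.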

Step 102, extracting candidate reference tables from the plurality of reference tables based on the target style parameters.

Tables with different styles have relatively low similarity. Therefore, in the present application, tables whose style parameters are the same as the target style parameter of the target table can be extracted from the plurality of reference tables as candidate reference tables. Screening candidate reference tables with the same style in this way improves calculation efficiency.

Step 103, determining the content similarity and the position similarity between the target table and the candidate reference table according to the first content list and the second content list corresponding to the candidate reference table.

In practical applications, tables with the same style may still differ in content. In the present application, the content similarity and the position similarity between each first content in the first content list and each second content in the second content list can be determined according to the first content list and the second content list corresponding to the candidate reference table. The sum of the content similarities between the first contents and the second contents is used as the content similarity between the target table and the candidate reference table, and the sum of the position similarities between the first contents and the second contents is used as the position similarity between the target table and the candidate reference table.

The content similarity between the first content and the second content may be determined by calculating the Euclidean distance between the vectors corresponding to the first content and the second content respectively; the position similarity between the first content and the second content may be determined according to the difference between the position parameter corresponding to the first content and the position parameter corresponding to the second content.
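The text states that the Euclidean distance between content vectors is used, but it does not say how the vectors are obtained or how a distance is turned into a similarity. The sketch below therefore assumes the vectors are supplied by some embedding step outside its scope, and it assumes the mapping similarity = 1 / (1 + distance). Both are illustrative choices, not the application's stated method.

```python
import math
from typing import Sequence


def content_similarity(vec_a: Sequence[float], vec_b: Sequence[float]) -> float:
    """Map the Euclidean distance between two content vectors to (0, 1].

    Assumption: similarity = 1 / (1 + distance). The source only states that
    the Euclidean distance is used, not how it is converted.
    """
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same dimension")
    distance = math.dist(vec_a, vec_b)
    return 1.0 / (1.0 + distance)


print(content_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical vectors)
print(content_similarity([0.0, 0.0], [3.0, 4.0]))  # distance is 5, so 1/6
```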

For example, if the position parameter of a first content in the target table is 0 and the position parameter of a second content in the candidate reference table is 2, the position difference between the two contents is 2, and the reciprocal of the difference can be used as the position similarity. If the position parameters of the two contents are the same, the position similarity can be regarded as 1.
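The reciprocal rule in this example can be written as a small helper. The function name `position_similarity` is hypothetical; the rule itself follows the text.

```python
def position_similarity(pos_a: int, pos_b: int) -> float:
    """Reciprocal of the position difference; identical positions give 1."""
    diff = abs(pos_a - pos_b)
    return 1.0 if diff == 0 else 1.0 / diff


print(position_similarity(0, 2))  # 0.5
print(position_similarity(3, 3))  # 1.0
```

Under this rule a difference of 1 also yields 1.0, the same value as identical positions. An implementation that needs to tell the two cases apart would have to adjust the formula.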

Step 104, determining the candidate reference table as a target reference table when the content similarity and the position similarity meet preset conditions.

In the application, the candidate reference table can be further screened according to the content similarity and the position similarity between the target table and the candidate reference table.

In a specific implementation, the candidate reference table can be determined to be the target reference table when the content similarity meets its preset condition and the position similarity also meets its preset condition. That is, the content similarity and the position similarity each have their own preset condition.

Alternatively, the total similarity between the target table and the candidate reference table can be calculated from the content similarity, the position similarity, and their corresponding weights, and the candidate reference table is determined as the target reference table when the total similarity is greater than a preset threshold.
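The two screening variants described for step 104 can be sketched as follows. The minimum values, the weights, and the threshold are hypothetical placeholders, and treating each "preset condition" as a lower bound is an assumption, since the application does not fix any of them.

```python
def meets_separate_conditions(content_sim: float, position_sim: float,
                              content_min: float, position_min: float) -> bool:
    """First variant: each similarity has its own preset condition (a minimum here)."""
    return content_sim >= content_min and position_sim >= position_min


def total_similarity(content_sim: float, position_sim: float,
                     w_content: float = 0.5, w_position: float = 0.5) -> float:
    """Second variant: weighted sum of the two similarities (weights assumed)."""
    return w_content * content_sim + w_position * position_sim


def is_target_reference(content_sim: float, position_sim: float,
                        threshold: float = 0.6) -> bool:
    """Accept the candidate when the total similarity exceeds the threshold."""
    return total_similarity(content_sim, position_sim) > threshold


print(meets_separate_conditions(0.8, 0.5, content_min=0.7, position_min=0.4))  # True
print(is_target_reference(0.8, 0.5))  # True  (total is about 0.65)
print(is_target_reference(0.2, 0.5))  # False (total is about 0.35)
```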

Step 105, determining the association relationship between the first contents in the target table according to the association relationship between the second contents in the target reference table.

After the target reference table is determined, the association relationship between the first contents in the target table can be determined according to the association relationship between the second contents in the target reference table. The association relationship between contents may be a relationship between an attribute and its value, or may be a relationship between attributes.

In the present application, for each first content in the target table, the second content in the target reference table with the highest similarity to it can be found, and the association relationship between that second content and the other second contents can be used to determine the association relationship between the first content and the other first contents in the target table.

For example, the content "name" in the target table has the highest similarity with the content "name" in the target reference table, and in the target reference table the value corresponding to "name" is located below "name". Therefore, the value corresponding to "name" in the target table can be obtained from the cell below "name" in the target table.

In the embodiment of the application, the target style parameter and the first content list corresponding to the target table are determined by identifying the target table to be processed, the candidate reference table is extracted from the reference table according to the target style parameter, the content similarity and the position similarity between the target table and the candidate reference table are determined according to the first content list and the second content list corresponding to the candidate reference table, the candidate reference table is determined as the target reference table under the condition that the content similarity and the position similarity meet the preset condition, and the association relationship between the first contents in the target table is determined according to the association relationship between the second contents in the target reference table. Therefore, the association relation among the contents in the target table is determined according to the association relation among the contents in the reference table with high similarity, so that the structuring of the complex table is realized.

In one embodiment of the present application, the style parameter may be a cell distribution matrix, and the method shown in FIG. 2 may be used to determine the target style parameter and the first content list corresponding to the target table. FIG. 2 is a flowchart illustrating another method for processing table data according to an embodiment of the present application.

As shown in fig. 2, identifying a target table to be processed to determine a target style parameter and a first content list corresponding to the target table includes:

Step 201, identifying the target table to determine the distribution state of each first cell in the target table and the first content in each first cell.

In the present application, the target table can be identified row by row or column by column. When a cell is identified, its position (such as which row and which column it is in) is recorded and its content is extracted, so that the distribution state of each first cell in the target table and the first content in each first cell can be determined.

The distribution state of a first cell can be understood as which row and which column the first cell occupies in the target table. The distribution state of the first cell can therefore be used to characterize the location of the first cell in the target table.

Step 202, determining a target distribution matrix corresponding to the target table according to the distribution state of each first cell and a preset distribution matrix element value mode.

According to the application, the number of rows and columns of the target distribution matrix corresponding to the target table can be determined according to the distribution state of each first cell, and the value of each element in the target distribution matrix can be determined according to the preset distribution matrix element value mode. Thus, the target distribution matrix corresponding to the target table can be determined.

The element value mode of the distribution matrix refers to the way in which each element in the distribution matrix is assigned a value. For example, the values 0, 1, 2, 3, 4, and so on may be assigned row by row from left to right, or column by column from top to bottom.

For example, the target table is identified as a 2×3 distribution matrix, where the values of the elements are 0 for the 1st row and 1st column, 1 for the 1st row and 2nd column, 2 for the 1st row and 3rd column, 3 for the 2nd row and 1st column, 4 for the 2nd row and 2nd column, and 5 for the 2nd row and 3rd column.

The value mode of the elements in the distribution matrix and the values themselves may be set as needed, and the present application is not limited thereto.
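As a minimal sketch of a preset element value mode, the function below numbers the cells of a rectangular table either row by row or column by column. The function name `distribution_matrix` and the mode labels "row" and "column" are hypothetical, and merged cells are not handled.

```python
from typing import List


def distribution_matrix(n_rows: int, n_cols: int, mode: str = "row") -> List[List[int]]:
    """Number the cells of an n_rows x n_cols table starting from 0.

    mode="row": left to right within a row, rows taken top to bottom.
    mode="column": top to bottom within a column, columns taken left to right.
    """
    if mode == "row":
        return [[r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]
    if mode == "column":
        return [[c * n_rows + r for c in range(n_cols)] for r in range(n_rows)]
    raise ValueError(f"unknown mode: {mode}")


print(distribution_matrix(2, 3))            # [[0, 1, 2], [3, 4, 5]]
print(distribution_matrix(2, 3, "column"))  # [[0, 2, 4], [1, 3, 5]]
```

Whichever mode is chosen, the target table and the reference tables would need to use the same one for their position parameters to be comparable.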

Step 203, generating a first content list according to the value of each element in the target distribution matrix and the first content in each first cell.

In the present application, since the value of each element in the target distribution matrix represents the position of the first cell corresponding to that element, the position of each first content in the target table can be determined from the element values and the first content of each first cell. The first content list is then generated from the plurality of first contents and the position of each first content in the target table.

The first content list comprises a plurality of first contents and position parameters corresponding to each first content. The location parameter may be an element value indicating a location of the first cell where the first content is located.

For example, the target distribution matrix corresponding to the target table is 2×3, where the element value of the 1st row and 1st column is 0 and the content at that position in the target table is "name", and the element value of the 1st row and 2nd column is 1 and the content at that position in the target table is "gender". The position parameter corresponding to the content "name" is then 0, and the position parameter corresponding to the content "gender" is 1.

In the embodiment of the application, if the style parameter is a cell distribution matrix, when identifying a target table to be processed to determine a target style parameter and a first content list corresponding to the target table, the distribution state of each first cell in the target table and the first content in each first cell can be determined by identifying the target table, the target distribution matrix corresponding to the target table is determined according to the distribution state of each first cell and a preset value mode of a distribution matrix element, and the first content list is generated according to the value of each element in the target distribution matrix and the first content in each first cell. Therefore, when the table is identified, the distribution matrix and the content list corresponding to the table can be determined according to the determined distribution state of each cell and the content of each cell, so that the pattern of the table and the position of each content are determined by utilizing the distribution matrix and the content list together, the complexity of the matrix element value is reduced, and the space is saved.

Further, when extracting candidate reference tables from the plurality of reference tables based on the target style parameter, the reference tables whose reference distribution matrix has the same number of rows and the same number of columns as the target distribution matrix may be extracted as candidate reference tables. Screening the reference tables by the number of rows and columns of the target distribution matrix in this way improves table processing efficiency.
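The screening by row and column count can be sketched as below, assuming each reference table is stored under a name together with its distribution matrix. The names `extract_candidates`, `ref_a`, `ref_b`, and `ref_c` are made up for illustration.

```python
from typing import Dict, List, Tuple

Matrix = List[List[int]]


def shape(matrix: Matrix) -> Tuple[int, int]:
    """Return (number of rows, number of columns) of a distribution matrix."""
    return len(matrix), (len(matrix[0]) if matrix else 0)


def extract_candidates(target_matrix: Matrix,
                       reference_matrices: Dict[str, Matrix]) -> List[str]:
    """Keep the reference tables whose matrix has the same shape as the target."""
    target_shape = shape(target_matrix)
    return [name for name, matrix in reference_matrices.items()
            if shape(matrix) == target_shape]


references = {
    "ref_a": [[0, 1, 2], [3, 4, 5]],
    "ref_b": [[0, 1], [2, 3]],
    "ref_c": [[0, 1, 2], [3, 4, 5]],
}
print(extract_candidates([[0, 1, 2], [3, 4, 5]], references))  # ['ref_a', 'ref_c']
```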

In practical applications, the contents in a table are generally of various types. To improve table processing efficiency, in one embodiment of the present application, when determining the content similarity and the position similarity between the target table and the candidate reference table, the content similarity and the position similarity between each first content and the target second content in the candidate reference table may be determined. FIG. 3 is a flowchart illustrating another method for processing table data according to an embodiment of the present application.

As shown in fig. 3, determining the content similarity and the position similarity between the target table and the candidate reference table includes:

Step 301, determining the target second content to be matched according to the type to which each second content belongs.

In the application, the second content list corresponding to the candidate reference table can comprise the type of each second content, and the second content with the type of the target type can be determined as the target second content to be matched according to the type of each second content in the candidate reference table. Wherein the target second content may be one or more.

The type of the content in the table may include an attribute (key), a value (value), and the like, or may be other types. For example, second content whose type is key may be regarded as the target second content to be matched.

Step 302, matching each first content in the first content list with the target second content to obtain a first similarity between each first content and the target second content.

After determining the target second content to be matched, each first content in the first content list can be matched with the target second content, and the first similarity between each first content and the target second content is calculated. When there are a plurality of target second contents, the first similarity between each first content and each target second content is calculated.

The method for calculating the first similarity is described in the above embodiments, and will not be described herein.

Step 303, determining a position offset parameter between any first content and the target second content according to the position parameter corresponding to any first content and the position parameter corresponding to the target second content when the first similarity between any first content and the target second content is greater than the threshold.

To further increase the efficiency of the table processing, the first similarity between each first content and the target second content may be compared to a threshold. When the first similarity between any first content and the target second content is greater than the threshold value, the similarity between the first content and the target second content is higher, and the position offset parameter between any first content and the target second content can be determined according to the position parameter corresponding to any first content and the position parameter corresponding to the target second content.

That is, the position offset parameter is determined only for a first content whose similarity with the target second content is greater than the threshold. This reduces the amount of calculation and improves processing efficiency.

Step 304, determining the content similarity and the position similarity between the target table and the candidate reference table according to the first similarity and the position offset parameter corresponding to each first content in the first content list.

In the present application, the content similarity and the position similarity between the target table and the candidate reference table can be determined according to the first similarity and the position offset parameter corresponding to each first content whose first similarity with the target second content is greater than the threshold.

When calculating the content similarity between the target table and the candidate reference table, the first similarities greater than the threshold may be summed and used as the content similarity between the two tables. Similarly, when calculating the position similarity between the target table and the candidate reference table, the position offset parameters corresponding to the first contents whose first similarity with the target second content is greater than the threshold may be summed and used as the position similarity between the two tables.
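Steps 301 to 304 can be sketched together as follows. Several details are assumptions made for illustration: the similarity function is passed in by the caller, with a toy exact-match function used in the example. The position offset parameter is taken to be the reciprocal of the position difference, with 1 for equal positions, by analogy with the earlier position similarity example. The threshold 0.8 and the type label "key" are placeholders.

```python
from typing import Callable, Dict, List, Tuple


def table_similarity(
    first_contents: Dict[int, str],
    second_contents: List[Tuple[int, str, str]],  # (position, content, type)
    similarity: Callable[[str, str], float],
    target_type: str = "key",
    threshold: float = 0.8,
) -> Tuple[float, float]:
    """Return (content similarity, position similarity) between two tables."""
    # Step 301: keep only second contents of the target type.
    targets = [(pos, text) for pos, text, kind in second_contents
               if kind == target_type]

    content_sim = 0.0
    position_sim = 0.0
    for pos_1, text_1 in first_contents.items():
        for pos_2, text_2 in targets:
            # Step 302: first similarity between the two contents.
            first_sim = similarity(text_1, text_2)
            # Step 303: only pairs above the threshold get a position term.
            if first_sim > threshold:
                diff = abs(pos_1 - pos_2)
                offset_term = 1.0 if diff == 0 else 1.0 / diff  # assumed form
                # Step 304: accumulate both sums.
                content_sim += first_sim
                position_sim += offset_term
    return content_sim, position_sim


def exact_match(a: str, b: str) -> float:
    """Toy similarity used only for this example."""
    return 1.0 if a == b else 0.0


target = {0: "name", 1: "gender", 2: "Li", 3: "male"}
reference = [(0, "name", "key"), (1, "gender", "key"),
             (2, "Zhang", "value"), (3, "female", "value")]
print(table_similarity(target, reference, exact_match))  # (2.0, 2.0)
```

Restricting the comparison to second contents of one type and skipping pairs below the threshold is what keeps the number of position calculations small.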

After the content similarity and the position similarity between the target table and the candidate reference table are calculated, the target reference table can be determined based on the content similarity and the position similarity between the target table and the candidate reference table, and the association relationship between the first contents in the target table is determined by using the target reference table.

In the embodiment of the present application, the second content list may include the type to which each second content belongs. When determining the content similarity and the position similarity between the target table and the candidate reference table according to the first content list and the second content list corresponding to the candidate reference table, the target second content to be matched may be determined according to the type to which each second content belongs, and the first similarity between each first content and the target second content may be obtained. If the first similarity between any first content and the target second content is greater than the threshold, the position offset parameter between that first content and the target second content may be calculated, and the content similarity and the position similarity between the target table and the candidate reference table may be determined according to the first similarity and the position offset parameter corresponding to each first content in the first content list. Because the target second content to be matched is selected by type and each first content is matched only against the target second content, the amount of calculation is reduced and the processing efficiency is improved.

FIG. 4 is a flowchart illustrating another method for processing table data according to an embodiment of the present application.

As shown in fig. 4, the processing method of the table data includes:

Step 401, identifying a target table to be processed, so as to determine a target style parameter and a first content list corresponding to the target table.

Step 402, extracting candidate reference tables from the plurality of reference tables based on the target style parameters.

Step 403, determining the content similarity and the position similarity between the target table and the candidate reference table according to the first content list and the second content list corresponding to the candidate reference table.

Step 404, determining the candidate reference table as the target reference table when the content similarity and the position similarity satisfy the preset condition.

In the present application, steps 401 to 404 are similar to steps 101 to 104 described above, and thus are not repeated here.

Step 405, determining target second content according to the type of each second content.

In the present application, the second content list corresponding to the reference table may include a type and an association direction to which each second content belongs. Wherein the association direction may be used to represent a positional relationship between each second content and other second content.

For example, the content "name" is of type key and its association direction is downward, that is, the value of "name" lies below "name". Correspondingly, the association direction of the value of "name" can be considered to be upward.

Since the types to which the second contents belong are various, in order to improve the calculation efficiency, the second content of which the type is the target type may be determined as the target second content according to the type to which each of the second contents belongs. For example, since the key affects the style of the table, the second content of the type key in the target reference table can be determined as the target second content.

Step 406, determining a target first content whose similarity to the target second content is greater than a threshold.

After determining the target second content in the target reference table, the similarity between each first content and the target second content can be calculated, and the first content with the similarity larger than the threshold value with the target second content is determined as the target first content.

Step 407, acquiring the target value associated with the target first content from the association direction of the target table based on the association direction corresponding to the target second content.

After the target first content and the target second content are determined, the association direction corresponding to the target second content can be used as the association direction corresponding to the target first content, and the target value associated with the target first content can be obtained from the association direction of the target table.

For example, suppose the target second content is the key "mobile phone number", whose association direction is downward, and the target first content whose similarity to "mobile phone number" is greater than the threshold is "contact information". The value corresponding to "contact information" can then be obtained by moving downward from the position of "contact information" in the target table. It can be understood that "contact information" may correspond to a plurality of values.

In the application, the second content whose type is key can be used as the target second content, and the value corresponding to each key in the target table can be determined according to the association direction of the target second content, so that all (key, value) pairs in the target table can be obtained. After the (key, value) pairs are acquired, the target table may be stored in (key, value) form.
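Steps 405 to 407 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the grid representation of the target table, the direction names, the helper names, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions.

```python
from difflib import SequenceMatcher

# Unit steps (row delta, column delta) for each association direction.
STEPS = {"down": (1, 0), "up": (-1, 0), "right": (0, 1), "left": (0, -1)}


def similarity(a: str, b: str) -> float:
    """Assumed similarity measure between two cell texts."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extract_key_values(target_grid, reference_contents, threshold=0.6):
    """Return {key text in target table: [values found along the direction]}.

    target_grid:        list of rows, each a list of cell texts.
    reference_contents: list of dicts with "text", "type" and "direction".
    """
    # Step 405: second contents of type "key" are the target second contents.
    target_keys = [c for c in reference_contents if c["type"] == "key"]

    pairs = {}
    for ref in target_keys:
        dr, dc = STEPS[ref["direction"]]
        for r, row in enumerate(target_grid):
            for c, text in enumerate(row):
                # Step 406: keep first contents similar enough to the key.
                if similarity(text, ref["text"]) <= threshold:
                    continue
                # Step 407: walk from the key along its association direction.
                values = []
                rr, cc = r + dr, c + dc
                while 0 <= rr < len(target_grid) and 0 <= cc < len(target_grid[rr]):
                    values.append(target_grid[rr][cc])
                    rr, cc = rr + dr, cc + dc
                pairs[text] = values
    return pairs
```

Because the walk continues to the edge of the grid, a single key may collect several values, consistent with the remark that one key may correspond to a plurality of values.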

In the embodiment of the application, the second content list may include the type and the association direction to which each second content belongs. After the target reference table is determined, when determining the association relationship between the first contents in the target table according to the association relationship between the second contents in the target reference table, the target second content may be determined according to the type to which each second content belongs, the target first content whose similarity to the target second content is greater than a threshold value may be determined, and the target value associated with the target first content may be acquired along the association direction in the target table according to the association direction corresponding to the target second content. Because the target second content is selected by type, and the value associated with each matching first content is obtained along the association direction of the target second content, the amount of calculation is reduced and the table processing efficiency is improved.

In practical applications, there may be an association relationship between each attribute (key) in the table, for example, the association relationship between "name" and "place of birth" is "birth".

Further, after the target first content with the similarity with the target second content being greater than the threshold value is determined, the target first content is taken as a starting point, the first content in the other direction perpendicular to the association direction corresponding to the target second content is obtained from the target table, and the association relationship between the target first content and the first content in the other direction is determined according to the semantics respectively corresponding to the target first content and the first content in the other direction.

For example, suppose the target first content is "name", and the association direction corresponding to the target second content whose similarity to it is greater than the threshold is downward. Starting from "name", the first content in the horizontal direction, such as "graduation institution" to the left, can be obtained from the table, and the association relationship between "name" and "graduation institution" can be determined to be "graduated from" according to the semantics respectively corresponding to "name" and "graduation institution".

When determining the association relationship, an entity with the highest similarity to the target first content and an entity with the highest similarity to the first content in the other direction can be determined from the knowledge graph according to the semantics corresponding to the target first content and the semantics corresponding to the first content in the other direction, and the association relationship between the target first content and the first content in the other direction is determined according to the relationship between the two entities.

In the application, according to the association direction corresponding to the second content whose type is attribute (key), the association relationship between the first contents whose type is attribute can be determined from the target table, that is, the association relationship between the attributes can be determined.
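The knowledge-graph lookup described above can be pictured with a toy example. The graph contents, the alias lists, and the function names below are entirely hypothetical, and string similarity from `difflib.SequenceMatcher` stands in for whatever semantic similarity a real system would use.

```python
from difflib import SequenceMatcher

# Hypothetical toy knowledge graph: (head entity, tail entity) -> relation.
KNOWLEDGE_GRAPH = {
    ("person", "school"): "graduated from",
    ("person", "place"): "born in",
}

# Hypothetical surface forms for each entity in the toy graph.
ENTITY_ALIASES = {
    "person": ["name", "full name"],
    "school": ["graduation institution", "university"],
    "place": ["place of birth", "birthplace"],
}


def nearest_entity(text: str) -> str:
    """Return the entity whose best alias is most similar to the cell text."""
    def score(entity: str) -> float:
        return max(
            SequenceMatcher(None, text.lower(), alias).ratio()
            for alias in ENTITY_ALIASES[entity]
        )
    return max(ENTITY_ALIASES, key=score)


def relation_between(target_first_content: str, other_first_content: str):
    """Map both cell texts to entities and look up the relation between them."""
    head = nearest_entity(target_first_content)
    tail = nearest_entity(other_first_content)
    return KNOWLEDGE_GRAPH.get((head, tail))  # None if no relation is known
```

With this toy data, `relation_between("Name", "Graduation institution")` yields the relation stored for the (person, school) pair.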

In the embodiment of the application, after determining the target first content with the similarity with the target second content being greater than the threshold value, the first content in the other direction perpendicular to the association direction can be obtained from the target table by taking the target first content as a starting point, and the association relationship between the target first content and the first content in the other direction can be determined according to the semantics respectively corresponding to the target first content and the first content in the other direction. Therefore, the association relation between the first content of the target and the first content in the direction perpendicular to the association direction can be determined by utilizing the association direction corresponding to the second content of the target, so that the intellectualization of the table processing is improved.

In one embodiment of the present application, before extracting candidate reference tables from a plurality of reference tables based on the target style parameters, the method shown in fig. 5 may be used to obtain the style parameters corresponding to each reference table, the second content list, and the association relationship between the respective second contents. Fig. 5 is a flowchart illustrating another method for processing table data according to an embodiment of the present application.

As shown in fig. 5, before extracting the candidate reference table from the plurality of reference tables based on the target style parameter, the method further includes:

Step 501, acquiring a labeling data set, wherein the labeling data set includes a plurality of reference tables and the type and the association direction of each second content in each reference table.

In the present application, a labeling data set may be acquired, which may include a plurality of reference tables, and for each reference table, the type and the associated direction to which each second content belongs have been labeled.

The type may include key, value, other, and the like. The association direction of a second content whose type is key may be understood as the direction in which its value lies.

Step 502, identify each reference table to determine the style parameter and the second content list corresponding to each reference table.

In the present application, the method for identifying each reference table is similar to the method for identifying the target table, and therefore will not be described herein.

Step 503, determining the association relationship between the second contents in the second content list according to the type and the association direction of each second content in each reference table.

In the application, the target second content with the type being the target type can be determined according to the type of each second content in each reference table, for example, the second content with the type being the key is used as the target second content.

Then, taking the target second content as the starting point, the second content lying in the association direction in the reference table may be determined as the value associated with the target second content.

In addition, the second content in the other direction perpendicular to the association direction can be acquired from the reference table by taking the target second content as the starting point, and the association relationship between the target second content and the second content in the other direction can be determined according to the semantics respectively corresponding to the target second content and the second content in the other direction. Thus, the association relationship between the attributes in the reference table can be determined.

In the embodiment of the application, before candidate reference tables are extracted from a plurality of reference tables based on the target style parameters, each reference table in the labeling data set can be identified to determine the style parameter of each reference table, the second content list, and the association relationship among the second contents. Other tables can then be structured by using the reference tables, so that complex tables can be processed quickly.

The following further describes a method for processing table data according to an embodiment of the present application with reference to fig. 6 and 7. Fig. 6 is a schematic diagram of a reference table recognition process according to an embodiment of the present application. Fig. 7 is a schematic diagram of a process for processing a table to be processed by using a reference table according to an embodiment of the present application.

In fig. 6, a labeling data set is obtained. In the labeling data set, each cell in each reference table is labeled with its type (key, value, or other), and each key is labeled with its corresponding association direction.

For each reference table, keys whose association direction is upward or downward are parsed and marked as xkey; keys whose association direction is leftward or rightward are parsed and marked as ykey; cells labeled as other are parsed and marked as buildinjkey; and the remaining cells in the reference table are marked as value. The reference table is also parsed to obtain its cell structure, namely its cell distribution matrix. After parsing is completed, the reference table is encrypted and stored.
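The marking rule for reference-table cells can be sketched as a small function. The input encoding (type strings such as "key" and "other", direction strings such as "down") and the function name are assumptions; the returned mark names are copied verbatim from the text.

```python
def mark_cell(cell_type, direction=None):
    """Map an annotated cell to the mark used when parsing a reference table.

    cell_type: "key", "value" or "other" (assumed annotation vocabulary).
    direction: "up", "down", "left" or "right"; only meaningful for keys.
    """
    if cell_type == "key":
        if direction in ("up", "down"):
            return "xkey"  # value lies above or below the key
        if direction in ("left", "right"):
            return "ykey"  # value lies to the left or right of the key
        raise ValueError(f"key cell needs a valid direction, got {direction!r}")
    if cell_type == "other":
        return "buildinjkey"  # mark name as it appears in the text
    return "value"  # every remaining cell is treated as a value
```

Splitting keys into vertical and horizontal groups makes the later lookup direct: the mark alone tells the parser whether to scan a column or a row for the associated values.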

In fig. 7, a table to be processed is obtained, and the table to be processed is identified, i.e. parsed, to obtain a cell distribution matrix corresponding to the table to be processed.

For each reference table obtained in fig. 6, the cell distribution matrix of the table to be processed is matched against the cell distribution matrix of the reference table to calculate a similarity score p_m; the content of each cell in the table to be processed is matched against the keys in the reference table to calculate a similarity score p_ij; and a position deviation score p_f of the content of each cell relative to the key in the reference table is calculated.

Then, a matching score between each reference table and the table to be processed is calculated as p = α × p_m + (1 − α) × p_ij × p_f, where α represents a weight that can be set as desired.

After the matching score corresponding to each reference table is obtained, the reference table with the highest matching score can be selected as the reference table of the current table to be processed, namely the target reference table, and each key and its association direction in the table to be processed are determined according to the keys and the corresponding association directions in the target reference table. Then, starting from the position of each key, the associated value is acquired along the association direction, and other keys are acquired along the direction perpendicular to the association direction. The table to be processed is then stored in a (key, value) structure.
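The matching score can be written directly as a function. The function name and the default weight of 0.5 are assumptions made for illustration; the text leaves the choice of α open, and the range check assumes α is meant to be a weight between 0 and 1.

```python
def matching_score(p_m, p_ij, p_f, alpha=0.5):
    """Combine structure, content and position scores into one matching score.

    p_m:   similarity between the two cell distribution matrices.
    p_ij:  similarity between cell contents and reference keys.
    p_f:   position deviation score of the matched contents.
    alpha: weight of the structural score, assumed to lie in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_m + (1.0 - alpha) * p_ij * p_f
```

For instance, with p_m = 1.0, p_ij = 0.8, p_f = 0.5 and α = 0.5, the score is 0.5 × 1.0 + 0.5 × 0.8 × 0.5 = 0.7.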
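Selecting the target reference table then amounts to taking the highest-scoring entry. In the sketch below, the dictionary layout, the table identifiers, and the function name are hypothetical; the score reuses the weighted formula given in the text.

```python
def select_target_reference(scores_by_table, alpha=0.5):
    """Return the id of the reference table with the highest matching score.

    scores_by_table maps a reference-table id to a tuple (p_m, p_ij, p_f).
    Returns None when no reference table is available.
    """
    if not scores_by_table:
        return None

    def score(item):
        p_m, p_ij, p_f = item[1]
        # p = alpha * p_m + (1 - alpha) * p_ij * p_f
        return alpha * p_m + (1.0 - alpha) * p_ij * p_f

    best_id, _ = max(scores_by_table.items(), key=score)
    return best_id


# Illustrative scores for three hypothetical reference tables.
example_scores = {
    "resume_a": (1.0, 0.9, 0.8),
    "resume_b": (1.0, 0.6, 0.9),
    "invoice": (0.5, 0.2, 0.1),
}
best = select_target_reference(example_scores)
```

A practical system might additionally require the best score to exceed a minimum value before accepting the match, in line with the preset condition mentioned in step 404; that check is omitted here.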

In order to achieve the above embodiment, the embodiment of the present application further provides a processing device for table data. Fig. 8 is a schematic structural diagram of a table data processing device according to an embodiment of the present application.

As shown in fig. 8, the processing apparatus 800 for table data includes:

the identifying module 810 is configured to identify a target table to be processed, so as to determine a target style parameter and a first content list corresponding to the target table, where the first content list includes a plurality of first contents and a position parameter corresponding to each of the first contents;

an extraction module 820, configured to extract candidate reference tables from a plurality of reference tables based on the target style parameters;

a first determining module 830, configured to determine, according to the first content list and a second content list corresponding to the candidate reference table, content similarity and position similarity between the target table and the candidate reference table;

a second determining module 840, configured to determine the candidate reference table as a target reference table if the content similarity and the position similarity satisfy a preset condition;

and a third determining module 850, configured to determine an association relationship between each first content in the target table according to the association relationship between each second content in the target reference table.

In one possible implementation manner of the embodiment of the present application, the style parameter is a cell distribution matrix, and the identifying module 810 is configured to:

identifying the target table to determine the distribution state of each first cell in the target table and the first content in each first cell;

determining a target distribution matrix corresponding to the target table according to the distribution state of each first cell and a preset distribution matrix element value mode;

and generating the first content list according to the value of each element in the target distribution matrix and the first content in each first cell.

In a possible implementation manner of the embodiment of the present application, the extracting module 820 is configured to:

extracting, from the plurality of reference tables, candidate reference tables whose corresponding reference distribution matrix has the same number of rows and the same number of columns as the target distribution matrix.
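The shape-based filtering performed by the extraction module can be sketched as below. Representing each distribution matrix as a non-empty rectangular list of lists, and keying the reference matrices by a table id, are assumptions made for illustration.

```python
def extract_candidates(target_matrix, reference_matrices):
    """Return ids of reference tables whose matrix shape equals the target's.

    target_matrix:      non-empty rectangular list of lists.
    reference_matrices: dict mapping a table id to its distribution matrix.
    """
    rows, cols = len(target_matrix), len(target_matrix[0])
    return [
        table_id
        for table_id, matrix in reference_matrices.items()
        if len(matrix) == rows and len(matrix[0]) == cols
    ]
```

Comparing only row and column counts is a cheap pre-filter: it discards structurally incompatible reference tables before any content matching is attempted.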

In a possible implementation manner of the embodiment of the present application, the second content list includes a type to which each second content belongs, and the first determining module 830 is configured to:

determining target second contents to be matched according to the type of each second content;

matching each first content in the first content list with the target second content to obtain a first similarity between each first content and the target second content;

determining a position offset parameter between any first content and the target second content according to the position parameter corresponding to any first content and the position parameter corresponding to the target second content under the condition that the first similarity between any first content and the target second content is larger than a threshold value;

and determining the content similarity and the position similarity between the target table and the candidate reference table according to the first similarity and the position offset parameter corresponding to each first content in the first content list.

In one possible implementation manner of the embodiment of the present application, the second content list includes a type and an association direction to which each second content belongs, and the third determining module 850 includes:

a first determining unit, configured to determine a target second content according to a type to which each second content belongs;

a second determining unit configured to determine a target first content having a similarity with the target second content greater than a threshold value;

and the first acquisition unit is used for acquiring a target value associated with the target first content from the association direction of the target table based on the association direction corresponding to the target second content.

In a possible implementation manner of the embodiment of the present application, the third determining module 850 further includes:

a second acquisition unit configured to acquire, from the target table, first content in another direction perpendicular to the association direction, with the target first content as a start point;

and the third determining unit is used for determining the association relation between the target first content and the first content in the other direction according to the semantics corresponding to the target first content and the first content in the other direction respectively.

In one possible implementation manner of the embodiment of the present application, the apparatus may further include:

an acquisition module, configured to acquire a labeling data set, wherein the labeling data set includes a plurality of reference tables and the type and the association direction of each second content in each reference table;

the identifying module 810 is further configured to identify each of the reference tables to determine a style parameter and a second content list corresponding to each of the reference tables;

the third determining module 850 is further configured to determine an association relationship between the second contents in the second content list according to the type and the association direction to which each second content in each reference table belongs.

It should be noted that, the explanation of the foregoing embodiment of the method for processing table data is also applicable to the processing device for processing table data in this embodiment, and thus will not be repeated here.

In the embodiment of the application, the target style parameter and the first content list corresponding to the target table are determined by identifying the target table to be processed, the candidate reference table is extracted from the reference table according to the target style parameter, the content similarity and the position similarity between the target table and the candidate reference table are determined according to the first content list and the second content list corresponding to the candidate reference table, the candidate reference table is determined as the target reference table under the condition that the content similarity and the position similarity meet the preset condition, and the association relationship between the first contents in the target table is determined according to the association relationship between the second contents in the target reference table. Therefore, the association relationship among the contents in the target table is determined according to the association relationship among the contents in the reference table with high similarity, so that the structuring of the complex table is realized.

According to embodiments of the present application, the present application also provides an electronic device, a readable storage medium and a computer program product.

FIG. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement an embodiment of the application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the applications described and/or claimed herein.

As shown in fig. 9, the device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 902 or a computer program loaded from a storage unit 908 into a RAM (Random Access Memory) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An I/O (Input/Output) interface 905 is also connected to the bus 904.

Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.

The computing unit 901 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, for example, the processing method of the table data. For example, in some embodiments, the processing method of the table data may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the above-described processing method of the table data may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the processing method of the table data in any other suitable way (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, an FPGA (Field Programmable Gate Array), an ASIC (Application-Specific Integrated Circuit), an ASSP (Application-Specific Standard Product), an SOC (System on Chip), a CPLD (Complex Programmable Logic Device), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present application may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of the present application, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, RAM, ROM, EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a LAN (Local Area Network), a WAN (Wide Area Network), the internet, and blockchain networks.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.

According to an embodiment of the present application, there is also provided a computer program product, wherein instructions in the computer program product, when executed by a processor, perform the method for processing table data according to the above embodiment of the present application.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, so long as the desired result of the technical solution of the present disclosure is achieved, and the present disclosure is not limited herein.

The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims

1. A method of processing form data, comprising:

identifying a target table to be processed to determine a target style parameter and a first content list corresponding to the target table, wherein the style parameter is a cell distribution matrix, identifying the target table to determine the distribution state of each first cell in the target table and the first content in each first cell, and determining a target distribution matrix corresponding to the target table according to the distribution state of each first cell and a preset distribution matrix element value mode; generating a first content list according to the value of each element in the target distribution matrix and the first content in each first cell, wherein the first content list comprises a plurality of first contents and position parameters corresponding to each first content;

extracting candidate reference tables from a plurality of reference tables based on the target style parameters, wherein candidate reference tables, of which the number of rows of the corresponding reference distribution matrix is the same as that of the target distribution matrix and the number of columns of the reference distribution matrix is the same as that of the target distribution matrix, are extracted from the plurality of reference tables;

determining target second contents to be matched according to the type of each second content in a second content list corresponding to the candidate reference table; matching each first content in the first content list with the target second content to obtain a first similarity between each first content and the target second content; determining a position offset parameter between any first content and the target second content according to the position parameter corresponding to any first content and the position parameter corresponding to the target second content under the condition that the first similarity between any first content and the target second content is larger than a threshold value; determining content similarity and position similarity between the target table and the candidate reference table according to first similarity and position offset parameters corresponding to each first content in the first content list;

2. The method of claim 1, wherein the second content list includes a type and an association direction to which each second content belongs, and the determining, according to the association relationship between the second contents in the target reference table, the association relationship between the first contents in the target table includes:

determining target second content according to the type of each second content;

determining target first content with similarity with the target second content being greater than a threshold value;

and acquiring a target value associated with the target first content from the association direction of the target table based on the association direction corresponding to the target second content.
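As a rough illustration of acquiring the target value along the association direction, the following Python sketch walks a distribution matrix from the position of the target first content. The direction encoding (`"right"`, `"down"`), the matrix convention (each slot holds a cell index, -1 for empty) and the function name are assumptions, not part of the claim.

```python
def value_in_direction(matrix, texts, key_pos, direction):
    """Return the content of the first different cell met when moving from
    key_pos in the given direction, or None if the table edge is reached.

    matrix: distribution matrix whose slots hold cell indices (assumed rule).
    texts: cell contents indexed by cell index.
    direction: "right" or "down" (hypothetical encoding of the
    association direction).
    """
    d_row, d_col = {"right": (0, 1), "down": (1, 0)}[direction]
    r, c = key_pos
    key_idx = matrix[r][c]
    r, c = r + d_row, c + d_col
    while 0 <= r < len(matrix) and 0 <= c < len(matrix[0]):
        idx = matrix[r][c]
        # Skip slots still covered by the (possibly merged) key cell.
        if idx >= 0 and idx != key_idx:
            return texts[idx]
        r, c = r + d_row, c + d_col
    return None


# "Name" is a merged cell spanning two columns in the first row.
matrix = [[0, 0, 1],
          [2, 3, 3]]
texts = ["Name", "Alice", "Age", "30"]
name_value = value_in_direction(matrix, texts, (0, 0), "right")
age_value = value_in_direction(matrix, texts, (1, 0), "right")
```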

3. The method of claim 2, wherein after the determining the target first content having a similarity to the target second content greater than a threshold value, further comprising:

acquiring first content in the other direction perpendicular to the association direction from the target table by taking the target first content as a starting point;

and determining the association relation between the target first content and the first content in the other direction according to the semantics corresponding to the target first content and the first content in the other direction respectively.

4. The method of claim 1, wherein prior to said extracting candidate reference tables from a plurality of reference tables based on said target style parameter, further comprising:

acquiring a labeling data set, wherein the labeling data set comprises a plurality of reference tables, and the type and the association direction of each second content in each reference table;

identifying each reference table to determine a style parameter and a second content list corresponding to each reference table;

and determining the association relation among the second contents in the second content list according to the type and the association direction of each second content in each reference table.
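One possible way to derive association relations from a labeled reference table is sketched below in Python. The label values (`"key"`, `"value"`), the direction encoding and the rule of pairing a key with the adjacent entry in its association direction are assumptions made for illustration only; the claim does not fix any of them.

```python
def build_associations(labeled):
    """Pair each labeled key with the value next to it in its direction.

    labeled: list of dicts with "text", "pos" (row, col), "type" and
    "direction" fields; all field names and values are hypothetical.
    """
    step = {"right": (0, 1), "down": (1, 0)}
    by_pos = {entry["pos"]: entry for entry in labeled}
    pairs = []
    for entry in labeled:
        if entry["type"] != "key":
            continue
        d_row, d_col = step[entry["direction"]]
        neighbour = by_pos.get((entry["pos"][0] + d_row,
                                entry["pos"][1] + d_col))
        if neighbour is not None and neighbour["type"] == "value":
            pairs.append((entry["text"], neighbour["text"]))
    return pairs


labeled_reference = [
    {"text": "Name", "pos": (0, 0), "type": "key", "direction": "right"},
    {"text": "<name>", "pos": (0, 1), "type": "value", "direction": None},
    {"text": "Age", "pos": (1, 0), "type": "key", "direction": "right"},
    {"text": "<age>", "pos": (1, 1), "type": "value", "direction": None},
]
associations = build_associations(labeled_reference)
```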

5. A processing apparatus of form data, comprising:

the identifying module is used for identifying a target table to be processed so as to determine a target style parameter and a first content list corresponding to the target table, wherein the target style parameter is a cell distribution matrix, and the identifying comprises: identifying the target table to determine the distribution state of each first cell in the target table and the first content in each first cell; determining a target distribution matrix corresponding to the target table according to the distribution state of each first cell and a preset rule for assigning distribution matrix element values; and generating the first content list according to the value of each element in the target distribution matrix and the first content in each first cell, wherein the first content list comprises a plurality of first contents and a position parameter corresponding to each first content;

the extraction module is used for extracting candidate reference tables from a plurality of reference tables based on the target style parameter, wherein the extracted candidate reference tables are those whose reference distribution matrix has the same number of rows and the same number of columns as the target distribution matrix;

the first determining module is used for determining target second content to be matched according to the type of each second content in the second content list corresponding to the candidate reference table; matching each first content in the first content list with the target second content to obtain a first similarity between each first content and the target second content; when the first similarity between any first content and the target second content is greater than a threshold value, determining a position offset parameter between that first content and the target second content according to the position parameter corresponding to that first content and the position parameter corresponding to the target second content; determining the content similarity and the position similarity between the target table and the candidate reference table according to the first similarity and the position offset parameter corresponding to each first content in the first content list;

6. The apparatus of claim 5, wherein the second content list includes a type and an association direction to which each second content belongs, and the third determining module includes:

7. The apparatus of claim 6, wherein the third determining module further comprises:

8. The apparatus of claim 5, further comprising:

the identifying module is further used for identifying each reference table to determine a style parameter and a second content list corresponding to each reference table;

the third determining module is further configured to determine an association relationship between the second contents in the second content list according to the type and the association direction of each second content in each reference table.

9. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.

10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-4.