CN113536751A

CN113536751A - Processing method and device of table data, electronic equipment and storage medium

Info

Publication number: CN113536751A
Application number: CN202110738835.7A
Authority: CN
Inventors: 李晨辉; 胡腾; 陈永锋
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2021-10-22
Anticipated expiration: 2041-06-30
Also published as: CN113536751B

Abstract

The application discloses a method and device for processing table data, electronic equipment and a storage medium, and relates to the field of computers, in particular to artificial intelligence fields such as natural language processing and knowledge graphs. The specific implementation scheme is as follows: identifying a target table to be processed to determine a target style parameter and a first content list corresponding to the target table; extracting candidate reference tables from a plurality of reference tables based on the target style parameter; determining the content similarity and the position similarity between the target table and the candidate reference table according to the first content list and a second content list corresponding to the candidate reference table; under the condition that the content similarity and the position similarity meet preset conditions, determining the candidate reference table as a target reference table; and determining the association relation among the first contents in the target table according to the association relation among the second contents in the target reference table. The method can realize the structuring of complex tables.

Description

Processing method and device of table data, electronic equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, in particular to artificial intelligence fields such as natural language processing and knowledge graphs, and more particularly to a method and an apparatus for processing table data, an electronic device, and a storage medium.

Background

Tables are widely used in daily life. They come in many varieties, and apart from a few simple tables, most are semi-structured and cannot be directly understood or used by a computer system.

Therefore, how to convert a complex semi-structured table into structured information that a computer system can use directly is a problem to be solved urgently.

Disclosure of Invention

The application provides a processing method and device of table data, electronic equipment and a storage medium.

According to an aspect of the present application, there is provided a method for processing table data, including:

identifying a target table to be processed to determine a target style parameter and a first content list corresponding to the target table, wherein the first content list comprises a plurality of first contents and a position parameter corresponding to each first content;

extracting a candidate reference table from a plurality of reference tables based on the target style parameter;

determining content similarity and position similarity between the target table and the candidate reference table according to the first content list and a second content list corresponding to the candidate reference table;

determining the candidate reference table as a target reference table under the condition that the content similarity and the position similarity meet preset conditions;

and determining the association relation among the first contents in the target table according to the association relation among the second contents in the target reference table.

According to another aspect of the present application, there is provided a table data processing apparatus including: an identification module, configured to identify a target table to be processed so as to determine a target style parameter and a first content list corresponding to the target table, wherein the first content list comprises a plurality of first contents and a position parameter corresponding to each first content;

an extraction module for extracting candidate reference tables from a plurality of reference tables based on the target style parameter;

a first determining module, configured to determine content similarity and position similarity between the target table and the candidate reference table according to the first content list and a second content list corresponding to the candidate reference table;

a second determining module, configured to determine the candidate reference table as a target reference table under the condition that the content similarity and the position similarity meet preset conditions;

and a third determining module, configured to determine the association relation among the first contents in the target table according to the association relation among the second contents in the target reference table.

According to another aspect of the present application, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the above embodiments.

According to another aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to the above-described embodiments.

According to another aspect of the present application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method according to the above embodiments.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

fig. 1 is a schematic flowchart of a method for processing table data according to an embodiment of the present disclosure;

fig. 2 is a schematic flowchart of another table data processing method according to an embodiment of the present application;

fig. 3 is a schematic flowchart of another table data processing method according to an embodiment of the present application;

fig. 4 is a schematic flowchart of another table data processing method according to an embodiment of the present application;

fig. 5 is a schematic flowchart of another table data processing method according to an embodiment of the present application;

fig. 6 is a schematic diagram of an identification process of a reference table according to an embodiment of the present application;

fig. 7 is a schematic diagram illustrating a process of processing a table to be processed by using a reference table according to an embodiment of the present application;

fig. 8 is a schematic structural diagram of a table data processing apparatus according to an embodiment of the present application;

fig. 9 is a block diagram of an electronic device for implementing a method for processing table data according to an embodiment of the present application.

Detailed Description

The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

A processing method, an apparatus, an electronic device, and a storage medium of table data according to an embodiment of the present application are described below with reference to the drawings.

Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.), and it involves both hardware and software technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies include computer vision, speech recognition, natural language processing, deep learning, big data processing, knowledge graph technology and the like.

NLP (Natural Language Processing) is an important direction in the fields of computer science and artificial intelligence, and the content of NLP research includes but is not limited to the following branch fields: text classification, information extraction, automatic summarization, intelligent question answering, topic recommendation, machine translation, subject word recognition, knowledge base construction, deep text representation, named entity recognition, text generation, text analysis (lexical, syntactic, grammatical, etc.), speech recognition and synthesis, and the like.

Fig. 1 is a flowchart illustrating a method for processing table data according to an embodiment of the present application.

The table data processing method in the embodiment of the present application can be executed by the table data processing device provided in the embodiment of the present application, and the device can be configured in an electronic device, and determine the association relationship between the contents in the target table according to the association relationship between the contents in the similar reference table, so as to implement the structuring of the complex table.

As shown in fig. 1, the method for processing table data includes:

step 101, identifying a target table to be processed to determine a target style parameter and a first content list corresponding to the target table.

In the present application, the target table to be processed may be identified row by row or column by column to determine the number of rows and columns it contains, the content of each cell, and so on. The style parameter and the first content list corresponding to the target table can then be determined from the identified content.

The style parameter may be the type of the table (e.g., a key-value type, a relational type, etc.), or may be the number of rows and the number of columns of the target table. The first content list includes a plurality of first contents and a position parameter corresponding to each first content. A first content is the content of a cell in the target table, and the position parameter indicates the position of that first content in the target table.

For example, suppose a table is identified and determined to have 2 rows and 2 columns. It may be regarded as a 2 × 2 cell distribution matrix whose 4 elements take the values 0, 1, 2, and 3 in sequence (row by row from left to right, or column by column from top to bottom, and the like), and the content list of the table includes 0: name; 1: gender; 2: radix rehmanniae; 3: contact information. Here, "0: name" indicates that the content at the table position corresponding to the element with the value 0 is "name", that is, the position parameter corresponding to the content "name" in the table is 0.

Step 102, extracting candidate reference tables from the plurality of reference tables based on the target style parameter.

When two tables differ in style, their similarity is relatively low. Therefore, in the present application, tables whose style parameter is the same as the target style parameter of the target table can be extracted from the plurality of reference tables as candidate reference tables. Screening candidate reference tables of the same style in this way reduces the number of tables to be compared and improves calculation efficiency.

Step 103, determining the content similarity and the position similarity between the target table and the candidate reference table according to the first content list and the second content list corresponding to the candidate reference table.

In practical applications, tables with the same style may still differ in content. In the present application, the content similarity and the position similarity between each first content in the first content list and each second content in the second content list can be determined according to the first content list and the second content list corresponding to the candidate reference table. The sum of the content similarities between the first contents and the second contents is used as the content similarity between the target table and the candidate reference table, and the sum of the position similarities between the first contents and the second contents is used as the position similarity between the target table and the candidate reference table.

The content similarity between the first content and the second content can be determined by calculating the Euclidean distance between two vectors according to the vectors respectively corresponding to the first content and the second content; the position similarity between the first content and the second content may be determined according to a difference between a position parameter corresponding to the first content and a position parameter corresponding to the second content.

For example, if the position parameter of a certain first content in the target table is 0 and the position parameter of a certain second content in the candidate reference table is 2, the position difference between the two contents is 2, and the reciprocal of the difference (here 1/2) can be taken as the position similarity. If the position parameters of the two contents are the same, the position similarity can be considered to be 1.
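The two per-content measures described above can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, and the mapping from Euclidean distance to a similarity score, 1 / (1 + distance), is an assumption, since the description names the distance but not how it is converted into a similarity.

```python
import math


def content_similarity(vec_a, vec_b):
    """Similarity of two content vectors derived from their Euclidean distance.

    The 1 / (1 + distance) mapping is an assumption made for this sketch.
    """
    distance = math.dist(vec_a, vec_b)
    return 1.0 / (1.0 + distance)


def position_similarity(pos_a, pos_b):
    """Reciprocal of the position difference; identical positions score 1."""
    diff = abs(pos_a - pos_b)
    return 1.0 if diff == 0 else 1.0 / diff


# Position parameters 0 and 2 differ by 2, so the similarity is 1/2.
example = position_similarity(0, 2)
```

Under this scheme a difference of 1 also yields a similarity of 1, the same as identical positions; whether that is intended is not stated in the description.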

Step 104, determining the candidate reference table as the target reference table under the condition that the content similarity and the position similarity meet preset conditions.

In the application, the candidate reference table can be further screened according to the content similarity and the position similarity between the target table and the candidate reference table.

In implementation, the candidate reference table can be determined as the target reference table when the content similarity meets its corresponding preset condition and the position similarity also meets its corresponding preset condition. That is, the content similarity and the position similarity each have their own preset condition.

Alternatively, the total similarity between the target table and the candidate reference table may be calculated from the content similarity, the position similarity, and their corresponding weights, and the candidate reference table is determined as the target reference table when the total similarity is greater than a preset threshold.
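Both selection rules can be sketched as below. The function names, the default weights of 0.5, and the form of the separate conditions (simple lower bounds) are assumptions made for illustration; the description does not fix any of them.

```python
def meets_separate_conditions(content_sim, position_sim, min_content, min_position):
    """First variant: each similarity must satisfy its own preset condition.

    The conditions are modelled here as lower bounds, which is an assumption.
    """
    return content_sim >= min_content and position_sim >= min_position


def total_similarity(content_sim, position_sim, w_content=0.5, w_position=0.5):
    """Second variant: weighted sum of the two similarities."""
    return w_content * content_sim + w_position * position_sim


def is_target_reference(content_sim, position_sim, threshold,
                        w_content=0.5, w_position=0.5):
    """A candidate becomes the target reference table when the total exceeds the threshold."""
    total = total_similarity(content_sim, position_sim, w_content, w_position)
    return total > threshold
```

The weighted variant lets a strong content match compensate for a weaker position match, while the first variant requires both to pass independently.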

Step 105, determining the association relation among the first contents in the target table according to the association relation among the second contents in the target reference table.

After the target reference table is determined, the association relationship between the first contents in the target table may be determined according to the association relationship between the second contents in the target reference table. The association relationship between contents may be a relationship between an attribute and a value, or a relationship between attributes.

In the present application, for a first content in the target table, the second content in the target reference table with the highest similarity to it is found, and the association relationship between that second content and the other second contents is used to determine the association relationship between the first content and the other first contents in the target table.

For example, the content "name" in the target table has the highest similarity with the content "name" in the target reference table, and the value corresponding to the content "name" in the target reference table is below the "name", so that the value corresponding to the content "name" in the target table can be obtained downward from the content "name" in the target table.
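The "name" example can be sketched as follows. The direction names, the offset table, and the sample cell values are made up for illustration, and exact string equality stands in for the "highest similarity" match described above.

```python
# (row, column) offsets for each association direction; names are illustrative.
DIRECTION_OFFSETS = {"down": (1, 0), "up": (-1, 0), "right": (0, 1), "left": (0, -1)}


def value_in_direction(table, row, col, direction):
    """Return the cell adjacent to (row, col) in the given direction, or None."""
    d_row, d_col = DIRECTION_OFFSETS[direction]
    r, c = row + d_row, col + d_col
    if 0 <= r < len(table) and 0 <= c < len(table[r]):
        return table[r][c]
    return None


# Hypothetical target table and the directions learned from a reference table.
target_table = [["name", "gender"], ["Li Lei", "male"]]
reference_directions = {"name": "down", "gender": "down"}

# Transfer the reference table's associations onto the target table.
pairs = {}
for r, row in enumerate(target_table):
    for c, cell in enumerate(row):
        if cell in reference_directions:
            pairs[cell] = value_in_direction(target_table, r, c, reference_directions[cell])
```

With this sample data, `pairs` maps "name" to "Li Lei" and "gender" to "male", which is the attribute-value structure the step is meant to recover.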

In the embodiment of the application, a target style parameter and a first content list corresponding to a target table are determined by identifying the target table to be processed; a candidate reference table is extracted from the reference tables according to the target style parameter; the content similarity and position similarity between the target table and the candidate reference table are determined according to the first content list and a second content list corresponding to the candidate reference table; the candidate reference table is determined as the target reference table under the condition that the content similarity and the position similarity meet preset conditions; and the association relation between the first contents in the target table is determined according to the association relation between the second contents in the target reference table. Therefore, the association relation among the contents in the target table is determined according to the association relation among the contents in a highly similar reference table, and the structuring of the complex table is realized.

In an embodiment of the present application, the style parameter may be a cell distribution matrix, and when determining the target style parameter and the first content list corresponding to the target table, the method shown in fig. 2 may also be adopted. Fig. 2 is a schematic flowchart of another table data processing method according to an embodiment of the present application.

As shown in fig. 2, identifying the target table to be processed to determine the target style parameter and the first content list corresponding to the target table includes:

step 201, identifying the target table to determine the distribution state of each first cell in the target table and the first content in each first cell.

In the present application, the target table may be identified row by row or column by column. When a cell is identified, its position (for example, which row it is in and which cell of that row it is) is recorded and its content is extracted, so that the distribution state of each first cell in the target table and the first content in each first cell can be determined.

The distribution state of the first cell may be understood as the row and column of the first cell in the target table. It can be seen that the distribution state of the first cell can be used to characterize the location of the first cell in the target table.

Step 202, determining a target distribution matrix corresponding to the target table according to the distribution state of each first cell and a preset distribution matrix element value mode.

In the application, the number of rows and the number of columns of the target distribution matrix corresponding to the target table can be determined according to the distribution state of each first cell, and the value of each element in the target distribution matrix can be determined according to a preset distribution matrix element value mode. Thus, the target distribution matrix corresponding to the target table can be determined.

The distribution matrix element value mode refers to a value mode of each element in the distribution matrix. For example, values of 0,1,2,3, 4 and the like may be sequentially taken from left to right row by row, or values of 0,1,2,3, 4 and the like may be sequentially taken from top to bottom column by column.

For example, the target table is identified as a 2 × 3 distribution matrix, in which the element in the 1st row and 1st column has the value 0, the element in the 1st row and 2nd column has the value 1, the element in the 1st row and 3rd column has the value 2, the element in the 2nd row and 1st column has the value 3, the element in the 2nd row and 2nd column has the value 4, and the element in the 2nd row and 3rd column has the value 5.

It should be noted that, the value mode of each element and the value size of each element in the distribution matrix may be set as needed, which is not limited in the present application.

Step 203, a first content list is generated according to the value of each element in the target distribution matrix and the first content in each first cell.

In this application, since the value of each element in the target distribution matrix may be used to represent the position of the first cell corresponding to each element, the position of each first content in the target table may be determined according to the value of each element in the target distribution matrix and the first content of each first cell, and the first content list may be generated by using a plurality of first contents and the position of each first content in the target table.

The first content list comprises a plurality of first contents and a position parameter corresponding to each first content. The location parameter may be an element value indicating a location of the first cell where the first content is located.

For example, the target distribution matrix corresponding to the target table is a 2 × 3 matrix, where the element in the 1st row and 1st column has the value 0 and the content at that position in the target table is "name", and the element in the 1st row and 2nd column has the value 1 and the content at that position in the target table is "gender". Then the position parameter corresponding to the content "name" is 0, and the position parameter corresponding to the content "gender" is 1.
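Steps 201 to 203 can be sketched as below for the row-by-row numbering used in the example. The function names are hypothetical, the table is assumed to be rectangular with no merged cells, and the sample cell values other than "name" and "gender" are made up.

```python
def build_distribution_matrix(n_rows, n_cols):
    """Number the cells 0, 1, 2, ... row by row, from left to right."""
    return [[r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]


def build_content_list(table):
    """Return the distribution matrix and a {position parameter: content} mapping."""
    n_rows, n_cols = len(table), len(table[0])
    matrix = build_distribution_matrix(n_rows, n_cols)
    content_list = {
        matrix[r][c]: table[r][c]
        for r in range(n_rows)
        for c in range(n_cols)
    }
    return matrix, content_list


# Hypothetical 2 x 3 target table.
table = [["name", "gender", "age"], ["Li Lei", "male", "30"]]
matrix, content_list = build_content_list(table)
```

Here `matrix` is `[[0, 1, 2], [3, 4, 5]]`, and `content_list` maps 0 to "name" and 1 to "gender", matching the position parameters in the example above.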

In the embodiment of the application, if the style parameter is a cell distribution matrix, then when the target table to be processed is identified to determine the target style parameter and the first content list, the target table may be identified to determine the distribution state of each first cell and the first content in each first cell; the target distribution matrix corresponding to the target table is determined according to the distribution state of each first cell and a preset distribution matrix element value mode; and the first content list is generated according to the value of each element in the target distribution matrix and the first content in each first cell. Therefore, when the table is identified, the distribution matrix and the content list corresponding to the table can be determined from the distribution state and content of each cell, so that the style of the table and the position of each content are represented jointly by the distribution matrix and the content list, which reduces the complexity of assigning matrix element values and saves storage space.

Further, when extracting the candidate reference table from the plurality of reference tables based on the target style parameter, a reference table whose reference distribution matrix has the same number of rows and the same number of columns as the target distribution matrix may be extracted as the candidate reference table. Screening the reference tables by the row and column counts of the target distribution matrix in this way improves table processing efficiency.
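The screening by row and column count can be sketched as follows. The dictionary layout of a reference table (a "name" and a "matrix" key) and the function names are assumptions made for illustration.

```python
def matrix_shape(matrix):
    """Return (number of rows, number of columns) of a distribution matrix."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    return n_rows, n_cols


def extract_candidates(target_matrix, reference_tables):
    """Keep only reference tables whose distribution matrix has the target's shape."""
    target_shape = matrix_shape(target_matrix)
    return [
        ref for ref in reference_tables
        if matrix_shape(ref["matrix"]) == target_shape
    ]


# Hypothetical reference tables with precomputed distribution matrices.
references = [
    {"name": "ref_a", "matrix": [[0, 1, 2], [3, 4, 5]]},
    {"name": "ref_b", "matrix": [[0, 1], [2, 3]]},
]
candidates = extract_candidates([[0, 1, 2], [3, 4, 5]], references)
```

Only `ref_a` survives the filter, so the more expensive content and position comparisons run on a single table instead of two.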

In practical applications, the contents in a table generally belong to various types. To improve table processing efficiency, in an embodiment of the present application, the content similarity and the position similarity between the target table and the candidate reference table may be determined according to the content similarity and the position similarity between each first content and a target second content in the candidate reference table. Fig. 3 is a flowchart illustrating another table data processing method according to an embodiment of the present application.

As shown in fig. 3, determining the content similarity and the position similarity between the target table and the candidate reference table includes:

step 301, determining target second content to be matched according to the type of each second content.

In the application, the second content list corresponding to the candidate reference table may include a type to which each second content belongs, and the second content whose type is the target type may be determined according to the type to which each second content belongs in the candidate reference table, and may be used as the target second content to be matched. The target second content may be one or more.

The type to which the content in the table belongs may include an attribute (key), a value (value), and the like, or may be other types. For example, the second content of which the type is key to which the second content belongs may be used as the target second content to be matched.

Step 302, each first content in the first content list is matched with a target second content to obtain a first similarity between each first content and the target second content.

After the target second content to be matched is determined, each first content in the first content list may be matched with the target second content, and a first similarity between each first content and the target second content is calculated. When there are multiple target second contents, a first similarity between each first content and each target second content is calculated.

The method for calculating the first similarity is as described in the above embodiments, and will not be described herein again.

Step 303, determining a position offset parameter between any first content and the target second content according to the position parameter corresponding to any first content and the position parameter corresponding to the target second content when the first similarity between any first content and the target second content is greater than the threshold.

To further increase the efficiency of the table processing, the first similarity between each first content and the target second content may be compared to a threshold. When the first similarity between any first content and the target second content is greater than the threshold, this indicates that the two are relatively similar, and the position offset parameter between that first content and the target second content may be determined according to the position parameter corresponding to the first content and the position parameter corresponding to the target second content.

That is, the position offset parameter is determined only for first contents whose similarity with the target second content is greater than the threshold. Therefore, the calculation amount is reduced, and the processing efficiency is improved.

Step 304, determining the content similarity and the position similarity between the target table and the candidate reference table according to the first similarity and the position offset parameter corresponding to each first content in the first content list.

In the present application, the content similarity and the position similarity between the target table and the candidate reference table may be determined according to the first similarity and the position offset parameter of each first content whose first similarity with the target second content is greater than the threshold.

When calculating the content similarity between the target table and the candidate reference table, the first similarities greater than the threshold may be summed as the content similarity between the two tables. Likewise, when calculating the position similarity between the target table and the candidate reference table, the position offset parameters of the first contents whose first similarity with the target second content is greater than the threshold may be summed as the position similarity between the two tables.
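Steps 301 to 304 can be sketched as below. Several parts are assumptions made for illustration: the second content list is modelled as a mapping from position parameter to (content, type), the standard-library `difflib.SequenceMatcher` ratio stands in for the vector-based first similarity, the position offset parameter is taken as the absolute difference of position parameters, and the threshold of 0.8 is arbitrary.

```python
from difflib import SequenceMatcher


def first_similarity(a, b):
    """Stand-in string similarity in [0, 1]; not the vector-based measure in the text."""
    return SequenceMatcher(None, a, b).ratio()


def table_similarities(first_list, second_list, threshold=0.8, target_type="key"):
    """Sum first similarities and position offsets over matches above the threshold.

    first_list:  {position parameter: content}
    second_list: {position parameter: (content, type)}
    """
    # Step 301: keep only second contents of the target type.
    targets = [(pos, content) for pos, (content, kind) in second_list.items()
               if kind == target_type]
    content_sim = 0.0
    position_sum = 0
    for pos1, content1 in first_list.items():
        for pos2, content2 in targets:
            sim = first_similarity(content1, content2)  # step 302
            if sim > threshold:                         # step 303
                content_sim += sim
                position_sum += abs(pos1 - pos2)
    return content_sim, position_sum                    # step 304


first = {0: "name", 1: "gender", 2: "Li Lei", 3: "male"}
second = {0: ("name", "key"), 1: ("gender", "key"),
          2: ("Zhang San", "value"), 3: ("female", "value")}
result = table_similarities(first, second)
```

Following the text literally, the summed offsets are returned as the position measure; a sum of 0 then means every matched key sits at the same position in both tables. How that sum is compared against the preset condition is not specified in the description.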

After the content similarity and the position similarity between the target table and the candidate reference table are calculated, the target reference table can be determined based on the content similarity and the position similarity between the target table and the candidate reference table, and the association relationship between the first contents in the target table is determined by using the target reference table.

In this embodiment, the second content list may include the type to which each second content belongs. When determining the content similarity and the position similarity between the target table and the candidate reference table according to the first content list and the second content list corresponding to the candidate reference table, the target second content to be matched may be determined according to the type to which each second content belongs, and the first similarity between each first content and the target second content is obtained. When the first similarity between any first content and the target second content is greater than a threshold, the position offset parameter between that first content and the target second content is calculated, and the content similarity and the position similarity between the target table and the candidate reference table are determined according to the first similarity and the position offset parameter corresponding to each first content in the first content list. Therefore, the target second content to be matched is selected by type and each first content is matched only against it, so that the calculation amount is reduced and the processing efficiency is improved.

Fig. 4 is a flowchart illustrating another table data processing method according to an embodiment of the present application.

As shown in fig. 4, the method for processing table data includes:

step 401, identifying a target table to be processed to determine a target style parameter and a first content list corresponding to the target table.

Step 402, extracting candidate reference tables from the plurality of reference tables based on the target style parameter.

Step 403, determining the content similarity and the position similarity between the target table and the candidate reference table according to the first content list and the second content list corresponding to the candidate reference table.

Step 404, determining the candidate reference table as the target reference table under the condition that the content similarity and the position similarity meet the preset conditions.

In the present application, steps 401 to 404 are similar to steps 101 to 104, and thus are not described herein again.

Step 405, determining the target second content according to the type of each second content.

In this application, the second content list corresponding to the reference table may include the type to which each second content belongs and its association direction. The association direction may be used to represent the positional relationship between each second content and other second contents.

For example, if the type to which the content "name" belongs is key and its association direction is downward, the value of "name" is located below "name"; conversely, for the value of "name", the association direction may be considered to be upward.

Since the second contents may belong to various types, in order to improve the calculation efficiency, the second contents whose type is the target type may be determined as the target second contents according to the type to which each second content belongs. For example, since the keys affect the style of the table, the second contents whose type is key in the target reference table may be determined as the target second contents.

Step 406, determining the target first content whose similarity with the target second content is greater than the threshold value.

After the target second contents in the target reference table are determined, the similarity between each first content and the target second contents can be calculated, and the first content whose similarity with the target second content is greater than the threshold value is determined as the target first content.
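Steps 405 and 406 can be illustrated with the short sketch below. The dictionary layout of each content, the function names, the `difflib.SequenceMatcher` similarity, and the threshold of 0.6 are assumptions made for the example; the application does not specify them.

```python
from difflib import SequenceMatcher


def select_target_second(second_list, target_type="key"):
    # Step 405: keep only the second contents whose type is the target type.
    return [c for c in second_list if c["type"] == target_type]


def find_target_first(first_list, target_second, threshold=0.6):
    # Step 406: pair each target second content with the first contents similar to it.
    matches = []
    for second in target_second:
        for first in first_list:
            sim = SequenceMatcher(None, first["text"], second["text"]).ratio()
            if sim > threshold:
                matches.append((first, second))
    return matches
```

Filtering by type first means that only keys of the reference table are compared with the first contents, which is where the reduction in calculation amount comes from.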

Step 407, acquiring a target value associated with the target first content from the association direction of the target table based on the association direction corresponding to the target second content.

After the target first content and the target second content are determined, the association direction corresponding to the target second content may be used as the association direction corresponding to the target first content, and the target value associated with the target first content may be obtained from the association direction of the target table.

For example, if the target second content is "mobile phone number", its type is key, and its corresponding association direction is downward, and the target first content whose similarity to "mobile phone number" is greater than the threshold value is "contact details", then the value corresponding to "contact details" may be obtained by searching downward from the position of "contact details" in the target table. It is understood that "contact details" may correspond to a plurality of values.

In the application, the second content whose type is key can be used as the target second content, and the value corresponding to each key in the target table can be determined according to the association direction of the target second content, so that all (key, value) pairs in the target table can be obtained. After the (key, value) pairs are acquired, the target table may be stored in the form of (key, value).
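Step 407 can be sketched as a walk from each matched key along its association direction. The sketch assumes the target table is available as a grid of strings, that empty strings denote empty cells, and that collection stops at the table boundary; the names `STEPS`, `collect_values`, and `to_key_value` are hypothetical.

```python
# Row/column step for each association direction (direction names are assumptions).
STEPS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def collect_values(grid, row, col, direction):
    """Walk from the key cell at (row, col) along `direction`, collecting non-empty cells."""
    d_row, d_col = STEPS[direction]
    values = []
    r, c = row + d_row, col + d_col
    while 0 <= r < len(grid) and 0 <= c < len(grid[r]):
        if grid[r][c]:
            values.append(grid[r][c])
        r, c = r + d_row, c + d_col
    return values


def to_key_value(grid, keys):
    """`keys` maps the (row, col) of each matched key to its association direction."""
    return {grid[r][c]: collect_values(grid, r, c, d) for (r, c), d in keys.items()}
```

Because one key may correspond to a plurality of values, each key maps to a list rather than to a single value in this sketch.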

In this embodiment of the application, the second content list may include the type to which each second content belongs and its association direction. After the target reference table is determined, when the association relationship between the first contents in the target table is determined according to the association relationship between the second contents in the target reference table, the target second content may be determined according to the type to which each second content belongs, the target first content whose similarity with the target second content is greater than a threshold value is determined, and a target value associated with the target first content is obtained from the association direction of the target table according to the association direction corresponding to the target second content. Therefore, the target second content is determined according to the type of each second content, and the value associated with the first content whose similarity with the target second content is greater than the threshold value is obtained according to the association direction of the target second content, so that the calculation amount is reduced and the table processing efficiency is improved.

In practical applications, there may be an association relationship between the attributes (keys) in the table, for example, the association relationship between "name" and "place of birth" is "place of birth".

Further, after the target first content whose similarity to the target second content is greater than the threshold is determined, the target first content may be used as a starting point, the first content in another direction perpendicular to the association direction corresponding to the target second content is acquired from the target table, and the association relationship between the target first content and the first content in the other direction is determined according to the semantics corresponding to the target first content and to the first content in the other direction, respectively.

For example, if the target first content is "name" and the association direction corresponding to the target second content "name", whose similarity to it is greater than the threshold, is downward, then the first content "graduation colleges" in the horizontal direction, for example to the left, may be obtained from the table by taking "name" as the starting point, and the association relationship between "name" and "graduation colleges" may be determined as "graduation".

When determining the association relationship, according to the semantics corresponding to the target first content and the semantics corresponding to the first content in the other direction, the entity with the highest similarity to the target first content and the entity with the highest similarity to the first content in the other direction are determined from the knowledge graph, and the association relationship between the target first content and the first content in the other direction is determined according to the relationship between the two entities.
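The lookup of an association relationship between two attributes can be sketched as below. The toy knowledge graph (`ENTITIES`, `RELATIONS`), the restriction to immediately adjacent cells, and the use of string similarity in place of semantic similarity are assumptions made only for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical toy knowledge graph: entity names and one relation between them.
ENTITIES = ["name", "graduation colleges", "place of birth"]
RELATIONS = {("name", "graduation colleges"): "graduation"}


def perpendicular_neighbors(grid, row, col, direction):
    """Return non-empty cells adjacent to (row, col), perpendicular to `direction`."""
    if direction in ("up", "down"):
        steps = [(0, -1), (0, 1)]  # look left and right
    else:
        steps = [(-1, 0), (1, 0)]  # look up and down
    found = []
    for d_row, d_col in steps:
        r, c = row + d_row, col + d_col
        if 0 <= r < len(grid) and 0 <= c < len(grid[r]) and grid[r][c]:
            found.append(grid[r][c])
    return found


def link_entity(text, entities):
    # Assumption: string similarity stands in for semantic similarity to graph entities.
    return max(entities, key=lambda e: SequenceMatcher(None, text, e).ratio())


def relation_between(first, other, entities, relations):
    """Look up the relation between the entities closest to the two contents."""
    a, b = link_entity(first, entities), link_entity(other, entities)
    return relations.get((a, b)) or relations.get((b, a))
```

A production system would likely replace `link_entity` with an embedding-based or dictionary-based entity linker; the sketch only shows where that step fits.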

In the application, the association relationship between the first contents whose type is attribute (key) can be determined from the target table according to the association direction corresponding to the second contents whose type is attribute (key); that is, the association relationship between the attributes can be determined.

In this embodiment of the application, after the target first content whose similarity to the target second content is greater than the threshold is determined, the first content in another direction perpendicular to the association direction may be acquired from the target table with the target first content as a starting point, and the association relationship between the target first content and the first content in the other direction is determined according to the semantics corresponding to the target first content and to the first content in the other direction, respectively. Therefore, the association relationship between the target first content and the first content in the direction perpendicular to the association direction can be determined by using the association direction corresponding to the target second content, which makes the table processing more intelligent.

In an embodiment of the present application, before extracting candidate reference tables from the plurality of reference tables based on the target style parameter, the method shown in fig. 5 may be adopted to obtain, for each reference table, the style parameter, the second content list, and the association relationship between the second contents. Fig. 5 is a flowchart illustrating another table data processing method according to an embodiment of the present application.

As shown in fig. 5, before extracting the candidate reference table from the plurality of reference tables based on the target style parameter, the method further includes:

Step 501, obtaining an annotation data set, where the annotation data set includes a plurality of reference tables and the type and the association direction of each second content in each reference table.

In the present application, an annotation data set may be obtained, where the annotation data set may include a plurality of reference tables, and each reference table is already annotated with the type to which each second content belongs and its association direction.

The type may be key, value, or another type, and the association direction of a second content whose type is key may be understood as the direction in which its value is located.

Step 502, identifying each reference table to determine a style parameter and a second content list corresponding to each reference table.

In the present application, the identification method for each reference table is similar to the identification method for the target table, and therefore, the detailed description thereof is omitted here.

Step 503, determining the association relationship between the second contents in the second content list according to the type and the association direction to which each second content in each reference table belongs.

In this application, the target second content with the type being the target type may be determined according to the type to which each second content in each reference table belongs, for example, the second content with the type being key is used as the target second content.

Then, with the target second content as a starting point, the second content located in the association direction in the reference table may be determined as the value associated with the target second content.

In addition, the target second content may be used as a starting point, the second content in the other direction perpendicular to the association direction is acquired from the reference table, and the association relationship between the target second content and the second content in the other direction is determined according to the semantics corresponding to the target second content and the second content in the other direction. Thus, the association relationship between the attributes in the reference table can be determined.

In the embodiment of the application, before the candidate reference table is extracted from the plurality of reference tables based on the target style parameter, the style parameter, the second content list, and the association relationship between the second contents of each reference table can be determined by identifying each reference table in the annotation data set, so that other tables can be structured by using the reference tables, complex tables can be processed, and the processing speed is high.

The following describes a method for processing table data according to an embodiment of the present application with reference to fig. 6 and 7. Fig. 6 is a schematic diagram of an identification process of a reference table according to an embodiment of the present application. Fig. 7 is a schematic diagram of a process of processing a table to be processed by using a reference table according to an embodiment of the present application.

In fig. 6, an annotation data set is obtained, where each cell in each reference table in the annotation data set is annotated with its type (key, value, other), and each key is annotated with its corresponding association direction.

Each reference table is analyzed as follows: a key whose association direction is upward or downward is marked as xkey; a key whose association direction is leftward or rightward is marked as ykey; a cell annotated as other is marked as build_key; and the remaining cells in the reference table are marked as value. Then, the reference table is analyzed to obtain a cell distribution matrix, which represents the cell structure of the reference table. The analyzed reference table is then encrypted and stored.
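The marking rule of fig. 6 can be written as a small mapping. In this sketch each annotated cell is assumed to be a `(type, direction)` tuple, and the function names are hypothetical; only the marker names xkey, ykey, build_key, and value come from the description.

```python
def mark_cell(cell_type, direction=None):
    """Map one annotated cell to the marker used when analyzing a reference table."""
    if cell_type == "key":
        if direction in ("up", "down"):
            return "xkey"
        if direction in ("left", "right"):
            return "ykey"
        raise ValueError("a key cell needs an association direction")
    if cell_type == "other":
        return "build_key"
    return "value"


def mark_table(annotated):
    """`annotated` is a grid of (type, direction) tuples; returns a grid of markers."""
    return [[mark_cell(cell_type, direction) for cell_type, direction in row]
            for row in annotated]
```

Raising an error for a key without a direction is a design choice of the sketch; it surfaces incomplete annotations early rather than silently treating such cells as values.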

In fig. 7, the table to be processed is obtained, and the table to be processed is identified, that is, analyzed, so as to obtain the cell distribution matrix corresponding to the table to be processed.

For each reference table obtained in fig. 6, the cell distribution matrix corresponding to the table to be processed is matched with the cell distribution matrix corresponding to the reference table, and a similarity score p_m is calculated; the content of each cell in the table to be processed is matched with the keys in the reference table, and a similarity score p_ij is calculated; at the same time, a position offset score p_f of the content of each cell relative to the key in the reference table is calculated.

Then, the matching score between each reference table and the table to be processed is calculated as p = α × p_m + (1 - α) × p_ij × p_f, where α represents a weight that can be determined as needed.
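As a worked illustration of the matching score, the sketch below evaluates the weighted combination for one reference table. The function name, the default weight of 0.5, and the check that the weight lies in [0, 1] are assumptions; the description only states that the weight can be determined as needed.

```python
def matching_score(p_m, p_ij, p_f, alpha=0.5):
    """Combine the matrix score p_m with the content score p_ij and offset score p_f."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return alpha * p_m + (1 - alpha) * p_ij * p_f
```

For example, with p_m = 1.0, p_ij = 0.8, p_f = 0.5 and a weight of 0.5, the score is 0.5 × 1.0 + 0.5 × 0.8 × 0.5 = 0.7. A larger weight emphasizes the structural match of the cell distribution matrices, while a smaller weight emphasizes the content and position match.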

After the matching score corresponding to each reference table is obtained, the reference table with the highest matching score can be selected as the reference table of the current table to be processed, namely the target reference table, and each key and its association direction in the table to be processed are determined according to the keys and the corresponding association directions in the target reference table. Thereafter, starting from the position of each key, the associated value is acquired in the association direction of the key, and the other keys are acquired in the direction perpendicular to the association direction. After that, the table to be processed is stored in a (key, value) structure.

In order to implement the foregoing embodiments, an apparatus for processing table data is further provided in the embodiments of the present application. Fig. 8 is a schematic structural diagram of a table data processing apparatus according to an embodiment of the present application.

As shown in fig. 8, the apparatus 800 for processing table data includes:

an identifying module 810, configured to identify a target table to be processed to determine a target style parameter and a first content list corresponding to the target table, where the first content list includes a plurality of first contents and a location parameter corresponding to each of the first contents;

an extraction module 820, configured to extract a candidate reference table from a plurality of reference tables based on the target style parameter;

a first determining module 830, configured to determine content similarity and position similarity between the target table and the candidate reference table according to the first content list and a second content list corresponding to the candidate reference table;

a second determining module 840, configured to determine the candidate reference table as a target reference table when the content similarity and the location similarity satisfy a preset condition;

a third determining module 850, configured to determine, according to the association relationship between the second contents in the target reference table, the association relationship between the first contents in the target table.

In a possible implementation manner of the embodiment of the present application, the style parameter is a cell distribution matrix, and the identifying module 810 is configured to:

identifying the target table to determine the distribution state of each first cell in the target table and the first content in each first cell;

determining a target distribution matrix corresponding to the target table according to the distribution state of each first cell and a preset rule for assigning values to the elements of the distribution matrix;

and generating the first content list according to the value of each element in the target distribution matrix and the first content in each first cell.

In a possible implementation manner of the embodiment of the present application, the extraction module 820 is configured to, based on the target style parameter:

extract, from the plurality of reference tables, candidate reference tables whose corresponding reference distribution matrices have the same number of rows and the same number of columns as the target distribution matrix.
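The shape-based pre-filter performed by the extraction module can be sketched as follows. The representation of each distribution matrix as a list of rows and the mapping from a reference table name to its matrix are assumptions made for the example.

```python
def shape(matrix):
    """Return (number of rows, number of columns) of a matrix given as a list of rows."""
    return len(matrix), (len(matrix[0]) if matrix else 0)


def extract_candidates(target_matrix, reference_matrices):
    """Keep the reference tables whose distribution matrix has the target's shape."""
    target_shape = shape(target_matrix)
    return [name for name, matrix in reference_matrices.items()
            if shape(matrix) == target_shape]
```

Comparing only row and column counts is inexpensive, so reference tables that cannot match are discarded before the costlier content and position similarities are computed.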

In a possible implementation manner of the embodiment of the present application, the second content list includes a type to which each second content belongs, and the first determining module 830 is configured to:

determining target second content to be matched according to the type of each second content;

matching each first content in the first content list with the target second content to obtain a first similarity between each first content and the target second content;

under the condition that the first similarity between any first content and the target second content is larger than a threshold value, determining a position offset parameter between any first content and the target second content according to a position parameter corresponding to any first content and a position parameter corresponding to the target second content;

and determining the content similarity and the position similarity between the target table and the candidate reference table according to the first similarity and the position offset parameter corresponding to each first content in the first content list.

In a possible implementation manner of this embodiment of the application, the second content list includes the type to which each second content belongs and its association direction, and the third determining module 850 includes:

a first determining unit, configured to determine a target second content according to a type to which each of the second contents belongs;

a second determining unit, configured to determine a target first content whose similarity with the target second content is greater than a threshold;

a first obtaining unit, configured to obtain, based on an association direction corresponding to the target second content, a target value associated with the target first content from the association direction of the target table.

In a possible implementation manner of this embodiment of the present application, the third determining module 850 further includes:

a second obtaining unit configured to obtain, from the target table, a first content in another direction perpendicular to the association direction, with the target first content as a starting point;

and a third determining unit, configured to determine, according to semantics corresponding to the target first content and the first content in the other direction, an association relationship between the target first content and the first content in the other direction.

In a possible implementation manner of the embodiment of the present application, the apparatus may further include:

an acquisition module, configured to acquire an annotation data set, where the annotation data set includes a plurality of reference tables and the type and the association direction of each second content in each reference table;

the identifying module 810 is further configured to identify each reference table to determine a style parameter and a second content list corresponding to each reference table;

the third determining module 850 is further configured to determine an association relationship between the second contents in the second content list according to the type and the association direction to which each second content in each reference table belongs.

It should be noted that the explanation of the foregoing embodiment of the method for processing table data is also applicable to the apparatus for processing table data of this embodiment, and therefore, the description thereof is omitted here.

There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.

FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.

As shown in fig. 9, the device 900 includes a computing unit 901 that can perform various appropriate actions and processes in accordance with a computer program stored in a ROM (Read-Only Memory) 902 or a computer program loaded from a storage unit 908 into a RAM (Random Access Memory) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An I/O (Input/Output) interface 905 is also connected to the bus 904.

A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 901 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable processor, controller, microcontroller, and the like. The computing unit 901 performs the respective methods and processes described above, such as the processing method of table data. For example, in some embodiments, the processing method of table data may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the processing method of table data described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the processing method of table data by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, FPGAs (Field Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application-Specific Standard Products), SOCs (Systems On Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this application, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, and a blockchain network.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the high management difficulty and weak service scalability of a conventional physical host and a VPS (Virtual Private Server). The server may also be a server of a distributed system, or a server incorporating a blockchain.

According to an embodiment of the present application, there is also provided a computer program product; when instructions in the computer program product are executed by a processor, the processing method of table data proposed by the above-mentioned embodiments of the present application is performed.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A processing method of table data comprises the following steps:

2. The method of claim 1, wherein the style parameter is a cell distribution matrix, and the identifying the target table to be processed to determine the target style parameter and the first content list corresponding to the target table comprises:

3. The method of claim 2, wherein said extracting a candidate reference table from a plurality of reference tables based on the target style parameter comprises:

4. The method of claim 1, wherein the second content list includes a type to which each second content belongs, and the determining the content similarity and the location similarity between the target table and the candidate reference table according to the first content list and the second content list corresponding to the candidate reference table comprises:

5. The method according to any one of claims 1 to 4, wherein the second content list includes a type and an association direction to which each second content belongs, and the determining the association between each first content in the target table according to the association between each second content in the target reference table includes:

determining target second content according to the type of each second content;

determining target first content with similarity degree greater than threshold value with the target second content;

and acquiring a target value associated with the target first content from the association direction of the target table based on the association direction corresponding to the target second content.

6. The method of claim 5, wherein after the determining the target first content having a similarity to the target second content greater than a threshold, further comprising:

taking the target first content as a starting point, and acquiring first content in another direction perpendicular to the association direction from the target table;

and determining the association relation between the target first content and the first content in the other direction according to the semantics corresponding to the target first content and the first content in the other direction respectively.

7. The method of any of claims 1-4, wherein prior to said extracting candidate reference tables from a plurality of reference tables based on the target style parameter, further comprising:

acquiring an annotated data set, wherein the annotated data set comprises a plurality of reference tables and the type and the association direction of each second content in each reference table;

identifying each reference table to determine a style parameter and a second content list corresponding to each reference table;

and determining the association relation among the second contents in the second content list according to the type and the association direction of each second content in each reference table.
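The preparation of reference tables in claim 7 could look roughly like the sketch below. The field names ("pos", "text", "type", "direction"), the "key" label, and the function name are assumptions. The style parameter is approximated as a 0/1 grid of occupied positions, which is only a stand-in for the cell distribution matrix named in the claims.

```python
# Assumed encoding of an association direction as a (row, col) step.
OFFSETS = {"right": (0, 1), "down": (1, 0)}


def build_reference_entry(annotated_cells):
    """annotated_cells: list of dicts with 'pos' (row, col), 'text', 'type'
    and 'direction', as they might appear in an annotated data set."""
    by_pos = {cell["pos"]: cell for cell in annotated_cells}
    rows = max(r for r, _ in by_pos) + 1
    cols = max(c for _, c in by_pos) + 1
    # Stand-in style parameter: 1 where a cell exists, 0 elsewhere.
    style = [[1 if (r, c) in by_pos else 0 for c in range(cols)] for r in range(rows)]
    # Association relation: link each "key" to the content in its direction.
    associations = []
    for cell in annotated_cells:
        if cell["type"] != "key":
            continue
        dr, dc = OFFSETS[cell["direction"]]
        neighbour = by_pos.get((cell["pos"][0] + dr, cell["pos"][1] + dc))
        if neighbour is not None:
            associations.append((cell["text"], neighbour["text"]))
    return {"style": style, "contents": annotated_cells, "associations": associations}
```

Building these entries once, ahead of time, means that processing a new target table only requires comparing it against stored style parameters and content lists.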

8. A processing apparatus of table data, comprising:

an identification module, configured to identify a target table to be processed so as to determine a target style parameter and a first content list corresponding to the target table, wherein the first content list comprises a plurality of first contents and a position parameter corresponding to each first content;

9. The apparatus of claim 8, wherein the style parameter is a cell distribution matrix, and the identification module is configured to:

10. The apparatus of claim 9, wherein the extraction module is configured to, based on the target style parameter:

11. The apparatus of claim 8, wherein the second content list includes a type to which each second content belongs, and the first determining module is configured to:

12. The apparatus according to any one of claims 8 to 11, wherein the second content list includes a type and an association direction of each second content, and the third determining module comprises:

13. The apparatus of claim 12, wherein the third determining module further comprises:

14. The apparatus of any of claims 8-11, further comprising:

the identification module is further configured to identify each reference table to determine a style parameter and a second content list corresponding to each reference table;

the third determining module is further configured to determine an association relationship between the second contents in the second content list according to the type and the association direction of each second content in each reference table.

15. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.

16. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.

17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.