CN110362620B

CN110362620B - Table data structuring method based on machine learning

Info

Publication number: CN110362620B
Application number: CN201910623601.0A
Authority: CN
Inventors: 廖闻剑; 李曙光; 宋万军; 姜广栋; 杨万刚; 尹若成
Original assignee: Nanjing Fiberhome Telecommunication Technologies Co ltd
Current assignee: Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority date: 2019-07-11
Filing date: 2019-07-11
Publication date: 2021-04-06
Anticipated expiration: 2039-07-11
Also published as: CN110362620A

Abstract

The invention relates to a table data structuring method based on machine learning. The number of occurrences of the object in each cell is counted over a large number of sample spreadsheets to form a dictionary table. A score is then obtained for each cell of the spreadsheet to be processed by combining the number of times the object in that cell occurs in the spreadsheet to be processed with the number recorded for that object in the dictionary table. Taking the cell score as the minimum unit, the header rows or header columns of the spreadsheet to be processed are identified by comparing rows and columns, which yields the header items, and the data items are then extracted and structured based on those header items. The method overcomes the defects of the prior art, which relies on rules, recognizes only horizontal headers, and cannot recognize multiple headers, and it structures spreadsheet data accurately and efficiently.

Description

Table data structuring method based on machine learning

Technical Field

The invention relates to a table data structuring method based on machine learning, and belongs to the technical field of table data structuring.

Background

Spreadsheets are among the most commonly used computer software tools. In the prior art, for a Sheet (spreadsheet) with unknown content, the data item of each cell can only be read after the file is opened, in the following steps:

(1) opening an Excel file by using an interface;

(2) reading the Sheet in the Excel file by using an interface;

(3) reading the cells in the Sheet by using an interface.

When this method is executed, the meaning of each data item is unknown, so the data cannot be structured. Because the meaning of a data item is described by the table header, the data cannot be understood without knowing the header. To structure table data, some existing work therefore assumes that the header is located in the first row of the table. Under this assumption the header can be extracted first and the data afterwards, which completes the structuring of the table data. The steps are as follows:

(1) opening an Excel file by using an interface;

(2) reading the Sheet in the Excel file by using an interface;

(3) reading a first row of cells in the Sheet by using an interface to serve as a header;

(4) reading the data corresponding to each header item column by column to complete the data structuring.
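The first-row assumption described in steps (1) to (4) can be sketched as follows. This is a minimal illustration rather than any particular prior-art implementation: the sheet is assumed to be already loaded into a list of rows, and the function name `structure_with_first_row_header` is hypothetical.

```python
from typing import Any, Dict, List


def structure_with_first_row_header(rows: List[List[Any]]) -> List[Dict[str, Any]]:
    """Treat the first row as the header and map every later row onto it."""
    if not rows:
        return []
    header = [str(cell) for cell in rows[0]]
    # Pair each header item with the cell in the same column.
    return [dict(zip(header, row)) for row in rows[1:]]


# A sheet whose header really is in the first row works as expected.
good_sheet = [
    ["name", "phone"],
    ["Alice", "13800000000"],
    ["Bob", "13900000000"],
]
good_records = structure_with_first_row_header(good_sheet)

# A sheet with a title row above the header is misjudged:
# the title row is taken as the header and the real header becomes data.
titled_sheet = [
    ["Staff list", None],
    ["name", "phone"],
    ["Alice", "13800000000"],
]
bad_records = structure_with_first_row_header(titled_sheet)
```

In the second call the real header row ends up as the first "record", which is the kind of misjudgment the first-row assumption produces.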

This assumption has obvious defects: the extracted header can only be a horizontal header, and it must be in the first row. Misjudgment occurs for a table with a vertical header, a table whose header is not in the first row, and a table containing multiple header rows. Some work therefore refines the procedure with prior knowledge to handle a header that is not in the first row. The steps are as follows:

(1) opening an Excel file by using an interface;

(2) reading the Sheet in the Excel file by using an interface;

(3) sequentially reading the data of each row and each column in the Sheet by using an interface until recognizable data is encountered (identified by rule matching, for example a mobile phone number, an identity card number, or a bank card number), then searching in order from that row and column for the first row that does not match the rules, and using that row as the header.

This method also has problems: misjudgment still occurs for tables with vertical headers and for tables with multiple headers, and the header cannot be recognized at all for a table that contains no recognizable data.
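One possible reading of the rule-based approach is sketched below. The two regular expressions are assumed formats chosen only for illustration, and `find_header_row_by_rules` is a hypothetical name. The sketch scans rows from the top, stops at the first row containing rule-matching data, and takes the row directly above it as the header.

```python
import re
from typing import Any, List, Optional

# Hypothetical rules for "recognizable" data; a real rule set would be broader.
KNOWN_DATA_PATTERNS = [
    re.compile(r"^1\d{10}$"),       # 11-digit mobile number (assumed format)
    re.compile(r"^\d{17}[\dXx]$"),  # 18-character identity card number (assumed format)
]


def looks_like_known_data(cell: Any) -> bool:
    """Return True if the cell text matches one of the rules."""
    text = "" if cell is None else str(cell).strip()
    return any(p.match(text) for p in KNOWN_DATA_PATTERNS)


def find_header_row_by_rules(rows: List[List[Any]]) -> Optional[int]:
    """Return the index of the presumed header row, or None if it cannot be found."""
    for i, row in enumerate(rows):
        if any(looks_like_known_data(c) for c in row):
            # Every earlier row contains no rule-matching data,
            # so the row directly above is the nearest non-matching row.
            return i - 1 if i > 0 else None
    return None  # no recognizable data anywhere in the sheet


with_phone = [["Staff list", ""], ["name", "phone"], ["Alice", "13800000000"]]
without_known_data = [["name", "city"], ["Alice", "Nanjing"]]
header_index = find_header_row_by_rules(with_phone)
no_header = find_header_row_by_rules(without_known_data)
```

The second sheet illustrates the limitation noted above: with no rule-matching data, the search returns nothing even though the sheet has an obvious header.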

Disclosure of Invention

The technical problem to be solved by the invention is to provide a table data structuring method based on machine learning that can accurately identify the header items in a spreadsheet and efficiently structure the data items in the spreadsheet based on those header items.

To solve this technical problem, the invention adopts the following technical scheme: a table data structuring method based on machine learning, used for structuring the data items in a spreadsheet to be processed, characterized by comprising the following steps:

Step A, counting the number of occurrences of the object in each cell across a preset number of sample spreadsheets, obtaining each object and its corresponding number, constructing a dictionary table, and then entering step B;

Step B, for each cell in the spreadsheet to be processed, counting the number of times, count, that the object in the cell appears in the spreadsheet to be processed, and then entering step C;

Step C, for each cell in the spreadsheet to be processed, obtaining the number c that corresponds to the object in the cell in the dictionary table, wherein c is 0 if the dictionary table does not contain the object, and then entering step D;
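Steps A to C can be sketched as follows. The sketch assumes that the "object" of a cell is simply its content, that sheets are lists of rows, and that empty cells are skipped. These choices and the function names are illustrative assumptions, not details stated in the text.

```python
from collections import Counter
from typing import Any, Iterable, List

Sheet = List[List[Any]]


def build_dictionary(sample_sheets: Iterable[Sheet]) -> Counter:
    """Step A: count how often each cell object occurs across the sample sheets."""
    dictionary: Counter = Counter()
    for sheet in sample_sheets:
        for row in sheet:
            for cell in row:
                if cell not in (None, ""):  # assumption: empty cells are ignored
                    dictionary[cell] += 1
    return dictionary


def count_in_sheet(sheet: Sheet) -> Counter:
    """Step B: count how often each cell object occurs within the sheet to be processed."""
    return Counter(cell for row in sheet for cell in row if cell not in (None, ""))


def lookup(dictionary: Counter, cell: Any) -> int:
    """Step C: the dictionary number c for a cell object, 0 when the object is absent."""
    return dictionary.get(cell, 0)


samples = [
    [["name", "age"], ["Alice", 30]],
    [["name", "city"], ["Bob", "Nanjing"]],
]
dictionary = build_dictionary(samples)

target = [["name", "age"], ["Carol", 30], ["Dave", 30]]
counts = count_in_sheet(target)
```

Here "name" appears in both samples, so its dictionary number is 2, while an object that never appeared in the samples, such as "salary", looks up as 0.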

Step D, for each cell in the spreadsheet to be processed, according to the following formula:

obtaining the score corresponding to the cell, and then entering step E;

Step E, for each row in the spreadsheet to be processed, obtaining the sum of the scores of the cells in the row as the score corresponding to the row;

meanwhile, for each column in the spreadsheet to be processed, obtaining the sum of the scores of the cells in the column as the score corresponding to the column;

thereby obtaining the scores respectively corresponding to each row and each column in the spreadsheet to be processed, and then entering step F;
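The row and column sums of step E can be illustrated with a small sketch. It takes a rectangular grid of per-cell scores as input. The numbers in the example grid are made up for illustration and do not come from the step D formula, which is not reproduced in this text.

```python
from typing import List, Tuple


def row_and_column_scores(cell_scores: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Step E: sum the cell scores along each row and along each column.

    Assumes a rectangular grid (every row has the same number of cells).
    """
    row_scores = [sum(row) for row in cell_scores]
    col_scores = [sum(col) for col in zip(*cell_scores)]
    return row_scores, col_scores


# Made-up cell scores: the first row scores much higher than the others.
grid = [
    [5.0, 4.0, 6.0],
    [0.5, 0.0, 1.0],
    [0.5, 1.0, 0.0],
]
row_scores, col_scores = row_and_column_scores(grid)
```

With this grid the first row stands out against the other rows, while the three columns score similarly to one another.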

Step F, clustering the rows in the spreadsheet to be processed according to the scores respectively corresponding to the rows, obtaining for each row cluster the average of the scores of the rows in that cluster as the score of the row cluster, and selecting the row cluster with the highest score as the candidate row cluster;

meanwhile, clustering the columns in the spreadsheet to be processed according to the scores respectively corresponding to the columns, obtaining for each column cluster the average of the scores of the columns in that cluster as the score of the column cluster, and selecting the column cluster with the highest score as the candidate column cluster;

then entering step G;
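The selection part of step F can be sketched as follows, assuming the clusters have already been formed and are given as lists of row (or column) indices. The function name `best_cluster` is hypothetical.

```python
from typing import List, Sequence


def best_cluster(clusters: Sequence[List[int]], scores: Sequence[float]) -> List[int]:
    """Return the cluster whose members have the highest average score."""

    def mean_score(cluster: List[int]) -> float:
        return sum(scores[i] for i in cluster) / len(cluster)

    return max(clusters, key=mean_score)


# Made-up row scores: rows 0 and 3 look like header rows.
example_scores = [15.0, 1.5, 1.5, 14.0]
example_clusters = [[1, 2], [0, 3]]
candidate = best_cluster(example_clusters, example_scores)
```

The same function applies unchanged to column clusters and column scores.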

Step G, selecting the row with the highest score from the rows in the candidate row cluster, and obtaining, from the score of that row, the average score of the non-empty cells in the row as the row cell average score;

meanwhile, selecting the column with the highest score from the columns in the candidate column cluster, and obtaining, from the score of that column, the average score of the non-empty cells in the column as the column cell average score;

then entering step H;
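A minimal sketch of step G follows. It reads "average score of the non-empty cells" as the score of the best line divided by its number of non-empty cells. The zero-division guard and the function name are assumptions added for the sketch.

```python
from typing import Any, List, Sequence, Tuple


def cell_average_of_best_line(
    cluster: Sequence[int],
    line_scores: Sequence[float],
    lines: Sequence[Sequence[Any]],
) -> Tuple[int, float]:
    """Pick the highest-scoring line in the cluster and average its score
    over its non-empty cells. A "line" is a row or a column."""
    best = max(cluster, key=lambda i: line_scores[i])
    non_empty = sum(1 for cell in lines[best] if cell not in (None, ""))
    if non_empty == 0:
        return best, 0.0  # assumption: an all-empty line averages to 0
    return best, line_scores[best] / non_empty


sheet_rows: List[List[Any]] = [["name", "age", ""], ["Alice", 30, ""]]
made_up_row_scores = [9.0, 1.0]
best_row, row_cell_avg = cell_average_of_best_line([0], made_up_row_scores, sheet_rows)

# For columns, pass the transposed sheet as the lines.
sheet_cols = [list(col) for col in zip(*sheet_rows)]
```

Dividing by the number of non-empty cells keeps a sparse header row from being penalized against a densely filled header column, and vice versa.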

Step H, if the row cell average score is greater than the column cell average score, the rows in the candidate row cluster are the header rows of the spreadsheet to be processed, the header items are obtained from them, and step J is entered;

if the row cell average score is smaller than the column cell average score, the columns in the candidate column cluster are the header columns of the spreadsheet to be processed, the header items are obtained from them, and step J is entered;

Step J, reading each data item in the spreadsheet to be processed according to the header items of the spreadsheet to be processed, thereby structuring the table data.
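Steps H and J can be sketched as below. How data rows are attached to header rows is an assumption of this sketch: each header row is taken to label the rows beneath it until the next header row. A vertical layout is handled by transposing the sheet. The text does not say what happens when the two averages are equal, so the sketch raises an error in that case.

```python
from typing import Any, Dict, List, Sequence

Sheet = List[List[Any]]


def structure_by_header_rows(sheet: Sheet, header_rows: Sequence[int]) -> List[Dict[Any, Any]]:
    """Assumed reading of step J: a header row labels the rows below it,
    up to the next header row."""
    header_set = set(header_rows)
    records: List[Dict[Any, Any]] = []
    header = None
    for i, row in enumerate(sheet):
        if i in header_set:
            header = row
        elif header is not None:
            # Skip columns whose header item is empty.
            records.append({h: v for h, v in zip(header, row) if h not in (None, "")})
    return records


def structure(sheet: Sheet, row_avg: float, col_avg: float,
              header_rows: Sequence[int], header_cols: Sequence[int]) -> List[Dict[Any, Any]]:
    """Step H: choose header rows or header columns, then structure the data."""
    if row_avg > col_avg:
        return structure_by_header_rows(sheet, header_rows)
    if row_avg < col_avg:
        transposed = [list(col) for col in zip(*sheet)]
        return structure_by_header_rows(transposed, header_cols)
    raise ValueError("equal averages: this case is not covered by the text")


horizontal = [["name", "age"], ["Alice", 30], ["Bob", 25]]
vertical = [["name", "Alice", "Bob"], ["age", 30, 25]]
records_h = structure(horizontal, 4.5, 1.0, header_rows=[0], header_cols=[])
records_v = structure(vertical, 1.0, 4.5, header_rows=[], header_cols=[0])
```

Both layouts yield the same list of records, which is the point of deciding the header orientation before reading the data.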

As a preferred technical scheme of the invention: in step A, after the dictionary table is constructed, the dictionary table is updated by the following steps I to II, and then step B is entered;

Step I, obtaining the maximum number value among the numbers corresponding to the objects in the dictionary table, and entering step II;

Step II, for each object in the dictionary table, executing the following steps II-1 to II-2 to update the number corresponding to the object, thereby updating the dictionary table;

Step II-1, judging whether the object belongs to a preset header item set; if so, setting the number corresponding to the object to the maximum number value; otherwise, entering step II-2;

Step II-2, judging whether the object belongs to a preset data item set; if so, setting the number corresponding to the object to 0; otherwise, leaving the number corresponding to the object unchanged.
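The dictionary update of steps I to II can be sketched as follows. The preset header item set and data item set are passed in as plain Python sets, and the example contents are invented for illustration.

```python
from typing import Any, Dict, Set


def update_dictionary(dictionary: Dict[Any, int],
                      header_items: Set[Any],
                      data_items: Set[Any]) -> Dict[Any, int]:
    """Raise known header items to the maximum number and set known data items to 0."""
    if not dictionary:
        return {}
    max_count = max(dictionary.values())  # Step I
    updated: Dict[Any, int] = {}
    for obj, number in dictionary.items():  # Step II
        if obj in header_items:             # Step II-1
            updated[obj] = max_count
        elif obj in data_items:             # Step II-2
            updated[obj] = 0
        else:
            updated[obj] = number           # unchanged
    return updated


raw = {"name": 40, "phone": 12, "Alice": 7, "Nanjing": 3}
updated = update_dictionary(raw, header_items={"phone"}, data_items={"Alice"})
```

Because step II-1 is checked first, an object listed in both preset sets would be treated as a header item.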

As a preferred technical scheme of the invention: in step F, the rows in the spreadsheet to be processed are clustered according to the scores respectively corresponding to the rows by the following steps Fa-1 to Fa-3;

Step Fa-1, obtaining the minimum row score and the maximum row score among the scores respectively corresponding to the rows in the spreadsheet to be processed, and entering step Fa-2;

Step Fa-2, dividing the span from the minimum row score to the maximum row score according to a preset number of row score grades to obtain the row score intervals, and then entering step Fa-3;

Step Fa-3, assigning each row in the spreadsheet to be processed to a row score interval according to the score corresponding to the row, wherein each row score interval that contains at least one row of the spreadsheet to be processed is a row cluster;

meanwhile, the columns in the spreadsheet to be processed are clustered according to the scores respectively corresponding to the columns by the following steps Fb-1 to Fb-3;

Step Fb-1, obtaining the minimum column score and the maximum column score among the scores respectively corresponding to the columns in the spreadsheet to be processed, and entering step Fb-2;

Step Fb-2, dividing the span from the minimum column score to the maximum column score according to a preset number of column score grades to obtain the column score intervals, and then entering step Fb-3;

Step Fb-3, assigning each column in the spreadsheet to be processed to a column score interval according to the score corresponding to the column, wherein each column score interval that contains at least one column of the spreadsheet to be processed is a column cluster.
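The interval-based clustering of steps Fa-1 to Fa-3 (and equally Fb-1 to Fb-3) can be sketched as below. The text only says the span is divided according to preset score grades, so splitting it into equal-width intervals is an assumption of this sketch. The `levels` parameter and the function name are likewise hypothetical.

```python
from typing import Dict, List, Sequence


def cluster_by_score_intervals(scores: Sequence[float], levels: int) -> List[List[int]]:
    """Split [min, max] into `levels` equal-width intervals (assumed) and
    return the non-empty intervals as clusters of indices, lowest interval first."""
    lo, hi = min(scores), max(scores)  # Fa-1 / Fb-1
    if hi == lo:
        return [list(range(len(scores)))]  # all scores equal: one cluster
    width = (hi - lo) / levels             # Fa-2 / Fb-2
    buckets: Dict[int, List[int]] = {}
    for i, s in enumerate(scores):         # Fa-3 / Fb-3
        # The maximum score would land one past the end, so clamp it
        # into the last interval.
        k = min(int((s - lo) / width), levels - 1)
        buckets.setdefault(k, []).append(i)
    return [buckets[k] for k in sorted(buckets)]


made_up_scores = [15.0, 1.5, 1.5, 14.0]
clusters = cluster_by_score_intervals(made_up_scores, levels=3)
```

With three levels, the middle interval receives no rows and is therefore not a cluster, matching the rule that only intervals containing at least one row or column count as clusters.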

Compared with the prior art, the table data structuring method based on machine learning adopting the above technical scheme has the following technical effects:

The method counts the number of occurrences of the object in each cell over a large number of sample spreadsheets to form a dictionary table. It obtains a score for each cell of the spreadsheet to be processed by combining the number of times the object in that cell occurs in the spreadsheet to be processed with the number recorded for that object in the dictionary table. Taking the cell score as the minimum unit, it identifies the header rows or header columns of the spreadsheet to be processed by comparing rows and columns, which yields the header items, and then extracts and structures the data items based on those header items. This overcomes the defects of the prior art, which relies on rules, recognizes only horizontal headers, and cannot recognize multiple headers, and structures spreadsheet data accurately and efficiently.

Drawings

FIG. 1 is a schematic diagram of the table data structuring method based on machine learning designed by the invention.

Detailed Description

The following description will explain embodiments of the present invention in further detail with reference to the accompanying drawings.

The invention designs a table data structuring method based on machine learning, used for structuring the data items in a spreadsheet to be processed. In a specific practical application, the following steps A to J are executed.

Step A, counting the number of occurrences of the object in each cell across a preset number of sample spreadsheets, obtaining each object and its corresponding number, constructing a dictionary table, updating the dictionary table by the following steps I to II, and then entering step B.

Step I, obtaining the maximum number value among the numbers corresponding to the objects in the dictionary table, and then entering step II.

Step II, for each object in the dictionary table, executing the following steps II-1 to II-2 to update the number corresponding to the object, thereby updating the dictionary table.

Step B, for each cell in the spreadsheet to be processed, counting the number of times, count, that the object in the cell appears in the spreadsheet to be processed, and then entering step C.

Step C, for each cell in the spreadsheet to be processed, obtaining the number c that corresponds to the object in the cell in the dictionary table, wherein c is 0 if the dictionary table does not contain the object, and then entering step D.

Step D, for each cell in the spreadsheet to be processed, obtaining the score corresponding to the cell, and then entering step E.

Step E, obtaining the scores respectively corresponding to each row and each column in the spreadsheet to be processed, and then entering step F.

Step F, clustering the rows in the spreadsheet to be processed according to the scores respectively corresponding to the rows by the following steps Fa-1 to Fa-3, obtaining for each row cluster the average of the scores of the rows in that cluster as the score of the row cluster, and selecting the row cluster with the highest score as the candidate row cluster.

Step Fa-3, assigning each row in the spreadsheet to be processed to a row score interval according to the score corresponding to the row, wherein each row score interval that contains at least one row of the spreadsheet to be processed is a row cluster.

Meanwhile, clustering the columns in the spreadsheet to be processed according to the scores respectively corresponding to the columns by the following steps Fb-1 to Fb-3, obtaining for each column cluster the average of the scores of the columns in that cluster as the score of the column cluster, and selecting the column cluster with the highest score as the candidate column cluster.

After the candidate row cluster and the candidate column cluster are obtained, step G is entered.

Step H is then entered.

In the table data structuring method based on machine learning designed by the above technical scheme, the number of occurrences of the object in each cell is counted over a large number of sample spreadsheets to form a dictionary table. A score is obtained for each cell of the spreadsheet to be processed by combining the number of times the object in that cell occurs in the spreadsheet to be processed with the number recorded for that object in the dictionary table. Taking the cell score as the minimum unit, the header rows or header columns of the spreadsheet to be processed are identified by comparing rows and columns, which yields the header items, and the data items are then extracted and structured based on those header items. The method thereby overcomes the defects of the prior art, which relies on rules, recognizes only horizontal headers, and cannot recognize multiple headers, and structures spreadsheet data accurately and efficiently.

The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims

1. A table data structuring method based on machine learning, used for structuring data items in a spreadsheet to be processed, characterized by comprising the following steps:

obtaining a score corresponding to the cell, and then entering the step E;

then entering step G;

then entering step H;

2. The table data structuring method based on machine learning according to claim 1, characterized in that: in step A, after the dictionary table is constructed, the dictionary table is updated by the following steps I to II, and then step B is entered;

3. The table data structuring method based on machine learning according to claim 1, characterized in that: in step F, the rows in the spreadsheet to be processed are clustered according to the scores respectively corresponding to the rows by the following steps Fa-1 to Fa-3;