CN110362620B - Table data structuring method based on machine learning - Google Patents

Table data structuring method based on machine learning

Info

Publication number
CN110362620B
Authority
CN
China
Prior art keywords
score, row, processed, column, cell
Prior art date
Legal status
Active
Application number
CN201910623601.0A
Other languages
Chinese (zh)
Other versions
CN110362620A (en)
Inventor
廖闻剑
李曙光
宋万军
姜广栋
杨万刚
尹若成
Current Assignee
Nanjing Fiberhome Telecommunication Technologies Co ltd
Original Assignee
Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority to CN201910623601.0A
Publication of CN110362620A
Application granted
Publication of CN110362620B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a table data structuring method based on machine learning. The method counts the objects found in the cells of a large number of sample spreadsheets to form a dictionary table. It then scores each cell of the spreadsheet to be processed by combining the number of times the object in the cell occurs in that spreadsheet with the number recorded for the object in the dictionary table. Taking the cell score as the minimum unit, the method compares rows and columns to identify the header rows or header columns of the spreadsheet to be processed, obtains each header item, and extracts and structures the data items based on the header items. This overcomes the defects of the prior art, which depends on rules, recognizes only horizontal headers, and cannot recognize multiple headers, and it structures spreadsheet data accurately and efficiently.

Description

Table data structuring method based on machine learning
Technical Field
The invention relates to a table data structuring method based on machine learning, and belongs to the technical field of table data structuring.
Background
The spreadsheet is one of the most commonly used computer software tools. In the prior art, for a Sheet (spreadsheet) whose content is unknown, the data item in each cell can only be read after the file is opened, with the following steps:
(1) opening an Excel file by using an interface;
(2) reading the Sheet in the Excel file by using an interface;
(3) reading the cells in the Sheet by using an interface.
When this method is executed, the meaning of each data item is unknown, so the data cannot be structured. The meaning of a data item is described by the table header, so the data cannot be understood without knowing the header. To structure table data, some work therefore assumes that the header is in the first row of the table; under this assumption the header is extracted first and the data afterwards, which completes the structuring. The steps are as follows:
(1) opening an Excel file by using an interface;
(2) reading the Sheet in the Excel file by using an interface;
(3) reading the first row of cells in the Sheet by using an interface and using it as the header;
(4) reading the data corresponding to each header item column by column to complete the data structuring.
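The first-row assumption in steps (1) to (4) can be sketched as follows. This is a minimal illustration in Python that works on an in-memory list of rows instead of a real Excel interface; the function name `structure_by_first_row` and the sample data are hypothetical and do not come from the patent. The second call shows how a title row above the real header is mistaken for the header.

```python
def structure_by_first_row(rows):
    """Treat the first row as the header and pair every later row with it."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


# Works when the header really is the first row.
sheet = [
    ["name", "phone"],
    ["Alice", "13800000000"],
    ["Bob", "13900000000"],
]
records = structure_by_first_row(sheet)

# Breaks when a title row precedes the header: the title becomes the header
# and the real header is read as a data row.
titled_sheet = [["Contact list", ""]] + sheet
bad_records = structure_by_first_row(titled_sheet)
```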
This assumption has obvious defects: the extracted header can only be a horizontal header, and it must be in the first row. Misjudgment occurs for tables with a vertical header, tables whose header is not in the first row, and tables containing multiple header rows. Some work therefore refines the approach with prior knowledge to handle headers that are not in the first row, with the following steps:
(1) opening an Excel file by using an interface;
(2) reading the Sheet in the Excel file by using an interface;
(3) reading the data of each row and each column in the Sheet in sequence by using an interface until recognizable data is encountered (identified by rule matching, for example a mobile phone number, an identity card number, or a bank card number), then searching from that row and column for the first row that does not match the rules, and using that row as the header;
(4) reading the data corresponding to each header item column by column to complete the data structuring.
This method also has problems: misjudgment still occurs for tables with a vertical header or with multiple headers in one table, and the header cannot be recognized in a table that contains no recognizable data.
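A rough sketch of the rule-based variant, under simplifying assumptions: the only rule is a regular expression for an 11-digit string starting with 1 (chosen purely for illustration), and the header is taken to be the row directly above the first row that matches. The names `PHONE_RULE`, `row_matches_rule` and `find_header_by_rule` are hypothetical. The last call reproduces the failure case for tables without recognizable data.

```python
import re

# Illustrative rule only: an 11-digit string starting with 1.
PHONE_RULE = re.compile(r"^1\d{10}$")


def row_matches_rule(row):
    """Return True if any cell in the row matches the known-data rule."""
    return any(PHONE_RULE.match(str(cell)) for cell in row)


def find_header_by_rule(rows):
    """Return the index of the row just above the first rule-matching row.

    Returns None when no row matches, or when the first row already matches,
    because no header can be located in either case.
    """
    for index, row in enumerate(rows):
        if row_matches_rule(row):
            return index - 1 if index > 0 else None
    return None


sheet = [
    ["Contact list", ""],
    ["name", "phone"],
    ["Alice", "13800000000"],
]
header_index = find_header_by_rule(sheet)

# No cell matches any rule, so the header cannot be found.
no_known_data = [["item", "colour"], ["apple", "red"]]
missing = find_header_by_rule(no_known_data)
```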
Disclosure of Invention
The technical problem to be solved by the invention is to provide a table data structuring method based on machine learning that accurately identifies the header items in a spreadsheet and, based on each header item, efficiently structures the data items in the spreadsheet.
To solve this technical problem, the invention adopts the following technical scheme: a table data structuring method based on machine learning, used for structuring the data items in a spreadsheet to be processed, characterized by comprising the following steps:
step A, count the occurrences of the object in each cell across a preset number of sample spreadsheets, obtain each object and the number corresponding to it, construct a dictionary table, and proceed to step B;
step B, for each cell in the spreadsheet to be processed, count the number of times, count, that the object in the cell appears in the spreadsheet to be processed, and then proceed to step C;
step C, for each cell in the spreadsheet to be processed, obtain the number c that the dictionary table records for the object in the cell, where c is 0 if the dictionary table does not contain the object, and then proceed to step D;
step D, for each cell in the spreadsheet to be processed, apply the following formula:
[Formula image BDA0002126321900000021 is not reproduced in this text; it defines the cell score, score, in terms of count and c.]
to obtain the score corresponding to the cell, and then proceed to step E;
step E, for each row in the spreadsheet to be processed, take the sum of the scores of the cells in the row as the score of the row;
meanwhile, for each column in the spreadsheet to be processed, take the sum of the scores of the cells in the column as the score of the column;
the scores of every row and every column in the spreadsheet to be processed are thus obtained, and then proceed to step F;
step F, cluster the rows of the spreadsheet to be processed according to their scores to obtain row clusters, take the average of the scores of the rows in each row cluster as the score of that row cluster, and select the row cluster with the highest score as the candidate row cluster;
meanwhile, cluster the columns of the spreadsheet to be processed according to their scores to obtain column clusters, take the average of the scores of the columns in each column cluster as the score of that column cluster, and select the column cluster with the highest score as the candidate column cluster;
then proceed to step G;
step G, select the row with the highest score in the candidate row cluster and, from the score of that row, obtain the average score of its non-empty cells as the row cell average score;
meanwhile, select the column with the highest score in the candidate column cluster and, from the score of that column, obtain the average score of its non-empty cells as the column cell average score;
then proceed to step H;
step H, if the row cell average score is greater than the column cell average score, the rows in the candidate row cluster are the header rows of the spreadsheet to be processed, the header items are obtained from them, and the method proceeds to step J;
if the row cell average score is smaller than the column cell average score, the columns in the candidate column cluster are the header columns of the spreadsheet to be processed, the header items are obtained from them, and the method proceeds to step J;
step J, read each data item in the spreadsheet to be processed according to the header items, thereby structuring the table data.
As a preferred technical scheme of the invention: in step A, after the dictionary table is constructed, the dictionary table is updated through the following steps I to II, and then the method proceeds to step B;
step I, obtain the maximum value among the numbers corresponding to the objects in the dictionary table, and proceed to step II;
step II, for each object in the dictionary table, execute the following steps II-1 to II-2 to update the number corresponding to the object, thereby updating the dictionary table;
step II-1, judge whether the object belongs to a preset header item set; if so, set the number corresponding to the object to the maximum value; otherwise, proceed to step II-2;
step II-2, judge whether the object belongs to a preset data item set; if so, set the number corresponding to the object to 0; otherwise, leave the number corresponding to the object unchanged.
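Steps I to II can be sketched as below, assuming the dictionary table is a plain Python `dict` from object to number and that the preset header item set and data item set are ordinary sets. The function name `update_dictionary` and the sample values are hypothetical.

```python
def update_dictionary(dictionary, header_items, data_items):
    """Boost known header items to the maximum and zero known data items."""
    if not dictionary:
        return {}
    max_count = max(dictionary.values())      # step I
    updated = {}
    for obj, number in dictionary.items():    # step II
        if obj in header_items:               # step II-1
            updated[obj] = max_count
        elif obj in data_items:               # step II-2
            updated[obj] = 0
        else:
            updated[obj] = number             # left unchanged
    return updated


dictionary = {"name": 40, "phone": 25, "Alice": 3, "total": 7}
updated = update_dictionary(
    dictionary, header_items={"phone"}, data_items={"Alice"}
)
```

Because the header check in step II-1 runs first, an object that appears in both preset sets is treated as a header item.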
As a preferred technical scheme of the invention: in step F, the rows of the spreadsheet to be processed are clustered according to their scores through the following steps Fa-1 to Fa-3;
step Fa-1, obtain the minimum row score and the maximum row score among the scores of the rows in the spreadsheet to be processed, and proceed to step Fa-2;
step Fa-2, divide the span from the minimum row score to the maximum row score according to the preset row score grades to obtain the row score intervals, and then proceed to step Fa-3;
step Fa-3, assign each row of the spreadsheet to be processed to a row score interval according to its score; each row score interval that contains at least one row of the spreadsheet to be processed is a row cluster;
meanwhile, the columns of the spreadsheet to be processed are clustered according to their scores through the following steps Fb-1 to Fb-3;
step Fb-1, obtain the minimum column score and the maximum column score among the scores of the columns in the spreadsheet to be processed, and proceed to step Fb-2;
step Fb-2, divide the span from the minimum column score to the maximum column score according to the preset column score grades to obtain the column score intervals, and then proceed to step Fb-3;
step Fb-3, assign each column of the spreadsheet to be processed to a column score interval according to its score; each column score interval that contains at least one column of the spreadsheet to be processed is a column cluster.
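The interval-based clustering of steps Fa-1 to Fa-3 (and, identically, Fb-1 to Fb-3) might look like the sketch below. It assumes that the preset score grades are given as a number of equal-width intervals, `levels`; the text does not say how the span is divided, so equal widths are an assumption, as is the function name `cluster_by_score_intervals`.

```python
def cluster_by_score_intervals(scores, levels):
    """Group indices into equal-width score intervals between min and max.

    Equal-width intervals are an assumption; the source only says the span
    is divided according to preset score grades.
    """
    if not scores:
        return []
    low, high = min(scores), max(scores)      # step Fa-1
    if high == low:
        return [list(range(len(scores)))]
    width = (high - low) / levels             # step Fa-2
    buckets = {}
    for index, score in enumerate(scores):    # step Fa-3
        # The maximum score is placed in the last interval.
        level = min(int((score - low) / width), levels - 1)
        buckets.setdefault(level, []).append(index)
    # Only intervals that actually contain rows become clusters.
    return [buckets[level] for level in sorted(buckets)]


row_scores = [9.0, 8.5, 1.0, 1.2, 0.8, 1.1]
clusters = cluster_by_score_intervals(row_scores, levels=3)
```

In this example the middle interval receives no rows, so only two clusters are returned: the four low-scoring rows and the two high-scoring rows.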
Compared with the prior art, the table data structuring method based on machine learning has the following technical effects:
The method counts the objects found in the cells of a large number of sample spreadsheets to form a dictionary table, and scores each cell of the spreadsheet to be processed by combining the number of times the object in the cell occurs in that spreadsheet with the number recorded for the object in the dictionary table. Taking the cell score as the minimum unit, it compares rows and columns to identify the header rows or header columns of the spreadsheet to be processed, obtains each header item, and extracts and structures the data items based on the header items. This overcomes the defects of the prior art, which depends on rules, recognizes only horizontal headers, and cannot recognize multiple headers, and it structures spreadsheet data accurately and efficiently.
Drawings
FIG. 1 is a schematic diagram of the table data structuring method based on machine learning designed by the invention.
Detailed Description
The following description will explain embodiments of the present invention in further detail with reference to the accompanying drawings.
The invention designs a table data structuring method based on machine learning for structuring the data items in a spreadsheet to be processed. In practical application, the following steps A to J are executed.
Step A, count the occurrences of the object in each cell across a preset number of sample spreadsheets, obtain each object and the number corresponding to it, construct a dictionary table, update the dictionary table through the following steps I to II, and then proceed to step B.
Step I, obtain the maximum value among the numbers corresponding to the objects in the dictionary table, and then proceed to step II.
Step II, for each object in the dictionary table, execute the following steps II-1 to II-2 to update the number corresponding to the object, thereby updating the dictionary table.
Step II-1, judge whether the object belongs to a preset header item set; if so, set the number corresponding to the object to the maximum value; otherwise, proceed to step II-2;
step II-2, judge whether the object belongs to a preset data item set; if so, set the number corresponding to the object to 0; otherwise, leave the number corresponding to the object unchanged.
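The counting part of step A can be illustrated with `collections.Counter`. The sketch assumes that an "object" is simply the content of a cell and that empty cells are skipped; both are assumptions, since the text does not define them. The name `build_dictionary` and the sample sheets are hypothetical.

```python
from collections import Counter


def build_dictionary(sample_sheets):
    """Count how often each cell object occurs across all sample sheets."""
    dictionary = Counter()
    for sheet in sample_sheets:
        for row in sheet:
            for cell in row:
                # Skipping empty cells is an assumption of this sketch.
                if cell is not None and cell != "":
                    dictionary[cell] += 1
    return dictionary


samples = [
    [["name", "phone"], ["Alice", "13800000000"]],
    [["name", "city"], ["Bob", "Nanjing"], ["", None]],
]
dictionary = build_dictionary(samples)
```

Objects that recur across many sample sheets, such as header labels, end up with large numbers, while one-off data values stay near 1.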
Step B, for each cell in the spreadsheet to be processed, count the number of times, count, that the object in the cell appears in the spreadsheet to be processed, and then proceed to step C.
Step C, for each cell in the spreadsheet to be processed, obtain the number c that the dictionary table records for the object in the cell, where c is 0 if the dictionary table does not contain the object, and then proceed to step D.
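Steps B and C produce two numbers per cell: count, from the spreadsheet being processed, and c, from the dictionary table. A small sketch, assuming the dictionary table is a `dict` and the sheet is a list of rows; the function name `count_and_lookup` is hypothetical.

```python
from collections import Counter


def count_and_lookup(sheet, dictionary):
    """Return, per cell, the pair (count in this sheet, c from the dictionary)."""
    occurrences = Counter(cell for row in sheet for cell in row)   # step B
    return [
        # Objects missing from the dictionary table get c = 0 (step C).
        [(occurrences[cell], dictionary.get(cell, 0)) for cell in row]
        for row in sheet
    ]


dictionary = {"name": 40, "phone": 25}
sheet = [
    ["name", "phone"],
    ["Alice", "13800000000"],
    ["Alice", "13900000000"],
]
pairs = count_and_lookup(sheet, dictionary)
```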
Step D, for each cell in the spreadsheet to be processed, apply the following formula:
[Formula image BDA0002126321900000051 is not reproduced in this text; it defines the cell score, score, in terms of count and c.]
to obtain the score corresponding to the cell, and then proceed to step E.
Step E, for each row in the spreadsheet to be processed, take the sum of the scores of the cells in the row as the score of the row;
meanwhile, for each column in the spreadsheet to be processed, take the sum of the scores of the cells in the column as the score of the column;
the scores of every row and every column in the spreadsheet to be processed are thus obtained, and then proceed to step F.
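Step E is a pair of sums over a grid of cell scores. The sketch below assumes a rectangular grid and uses made-up cell scores, since the scoring formula itself is not available in this text; `row_and_column_scores` is a hypothetical name.

```python
def row_and_column_scores(cell_scores):
    """Sum cell scores along each row and along each column."""
    row_scores = [sum(row) for row in cell_scores]
    # zip(*grid) transposes the grid so that each tuple is one column.
    column_scores = [sum(column) for column in zip(*cell_scores)]
    return row_scores, column_scores


# Hypothetical cell scores for a sheet with 3 rows and 2 columns.
cell_scores = [
    [4.0, 3.0],
    [0.5, 0.0],
    [0.5, 1.0],
]
row_scores, column_scores = row_and_column_scores(cell_scores)
```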
Step F, according to the scores of the rows in the spreadsheet to be processed, cluster the rows through the following steps Fa-1 to Fa-3 to obtain row clusters, take the average of the scores of the rows in each row cluster as the score of that row cluster, and select the row cluster with the highest score as the candidate row cluster.
Step Fa-1, obtain the minimum row score and the maximum row score among the scores of the rows in the spreadsheet to be processed, and proceed to step Fa-2;
step Fa-2, divide the span from the minimum row score to the maximum row score according to the preset row score grades to obtain the row score intervals, and then proceed to step Fa-3;
step Fa-3, assign each row of the spreadsheet to be processed to a row score interval according to its score; each row score interval that contains at least one row of the spreadsheet to be processed is a row cluster.
Meanwhile, according to the scores of the columns in the spreadsheet to be processed, cluster the columns through the following steps Fb-1 to Fb-3 to obtain column clusters, take the average of the scores of the columns in each column cluster as the score of that column cluster, and select the column cluster with the highest score as the candidate column cluster.
Step Fb-1, obtain the minimum column score and the maximum column score among the scores of the columns in the spreadsheet to be processed, and proceed to step Fb-2;
step Fb-2, divide the span from the minimum column score to the maximum column score according to the preset column score grades to obtain the column score intervals, and then proceed to step Fb-3;
step Fb-3, assign each column of the spreadsheet to be processed to a column score interval according to its score; each column score interval that contains at least one column of the spreadsheet to be processed is a column cluster.
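Once clusters exist, choosing the candidate cluster in step F reduces to comparing cluster averages. A brief sketch, assuming each cluster is a list of row (or column) indices; the name `select_candidate_cluster` and the sample clusters are hypothetical.

```python
from statistics import mean


def select_candidate_cluster(clusters, scores):
    """Return the cluster whose member scores have the highest average."""
    return max(clusters, key=lambda members: mean(scores[i] for i in members))


row_scores = [9.0, 8.5, 1.0, 1.2, 0.8, 1.1]
# Hypothetical clusters, each a list of row indices.
row_clusters = [[2, 3, 4, 5], [0, 1]]
candidate = select_candidate_cluster(row_clusters, row_scores)
```

Here rows 0 and 1 average 8.75 against roughly 1.0 for the other cluster, so they form the candidate row cluster.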
After the candidate row cluster and the candidate column cluster are obtained, proceed to step G.
Step G, select the row with the highest score in the candidate row cluster and, from the score of that row, obtain the average score of its non-empty cells as the row cell average score;
meanwhile, select the column with the highest score in the candidate column cluster and, from the score of that column, obtain the average score of its non-empty cells as the column cell average score;
then proceed to step H.
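Step G can be read as dividing the score of the best row (or column) by its number of non-empty cells. The sketch below follows that reading, which is an interpretation of the text; it treats `None` and the empty string as empty, and the name `average_cell_score` is hypothetical. The same function serves columns if the columns are passed in as lists.

```python
def average_cell_score(candidate, line_scores, lines):
    """Divide the best line's score by its number of non-empty cells.

    `candidate` holds indices into `lines`, which may be rows or columns.
    """
    best = max(candidate, key=lambda index: line_scores[index])
    non_empty = [cell for cell in lines[best] if cell not in (None, "")]
    if not non_empty:
        return 0.0
    return line_scores[best] / len(non_empty)


rows = [
    ["name", "phone", ""],
    ["Alice", "13800000000", "x"],
]
row_scores = [7.0, 1.5]
row_cell_average = average_cell_score([0], row_scores, rows)
```

Normalizing by the non-empty cell count keeps a long row from outscoring a short column merely because it has more cells.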
Step H, if the row cell average score is greater than the column cell average score, the rows in the candidate row cluster are the header rows of the spreadsheet to be processed, the header items are obtained from them, and the method proceeds to step J;
if the row cell average score is smaller than the column cell average score, the columns in the candidate column cluster are the header columns of the spreadsheet to be processed, the header items are obtained from them, and the method proceeds to step J.
Step J, read each data item in the spreadsheet to be processed according to the header items, thereby structuring the table data.
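Steps H and J can be sketched as a single function that picks the orientation and then pairs data with header items. Several details are assumptions of this sketch and not statements of the patent: a tie between the two averages falls back to header rows, each header line governs the lines after it until the next header line, and the output is a list of dictionaries. The name `structure_table` is hypothetical.

```python
def structure_table(sheet, header_indices, row_average, column_average):
    """Return records keyed by header items, reading by rows or by columns."""
    if row_average < column_average:
        # Header columns: transpose so the same line-wise logic applies.
        sheet = [list(column) for column in zip(*sheet)]
    # Tie handling is not specified in the source; rows are used here.
    header_set = set(header_indices)
    records, header = [], None
    for index, line in enumerate(sheet):
        if index in header_set:
            header = line          # a new header starts a new block
        elif header is not None:
            records.append(dict(zip(header, line)))
    return records


# A sheet with a vertical header in column 0.
vertical_sheet = [
    ["name", "Alice", "Bob"],
    ["phone", "13800000000", "13900000000"],
]
records = structure_table(
    vertical_sheet, header_indices=[0], row_average=1.0, column_average=3.0
)
```

Transposing first means one code path handles both horizontal and vertical headers, which mirrors the symmetric treatment of rows and columns in steps E to H.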
With the table data structuring method based on machine learning designed by the above technical scheme, the objects found in the cells of a large number of sample spreadsheets are counted to form a dictionary table, and each cell of the spreadsheet to be processed is scored by combining the number of times the object in the cell occurs in that spreadsheet with the number recorded for the object in the dictionary table. Taking the cell score as the minimum unit, rows and columns are compared to identify the header rows or header columns of the spreadsheet to be processed and to obtain each header item, and the data items are then extracted and structured based on the header items. This overcomes the defects of the prior art, which depends on rules, recognizes only horizontal headers, and cannot recognize multiple headers, and it structures spreadsheet data accurately and efficiently.
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (3)

1. A table data structuring method based on machine learning, used for structuring the data items in a spreadsheet to be processed, characterized by comprising the following steps:
step A, count the occurrences of the object in each cell across a preset number of sample spreadsheets, obtain each object and the number corresponding to it, construct a dictionary table, and proceed to step B;
step B, for each cell in the spreadsheet to be processed, count the number of times, count, that the object in the cell appears in the spreadsheet to be processed, and then proceed to step C;
step C, for each cell in the spreadsheet to be processed, obtain the number c that the dictionary table records for the object in the cell, where c is 0 if the dictionary table does not contain the object, and then proceed to step D;
step D, for each cell in the spreadsheet to be processed, apply the following formula:
[Formula image FDA0002126321890000011 is not reproduced in this text; it defines the cell score, score, in terms of count and c.]
to obtain the score corresponding to the cell, and then proceed to step E;
step E, for each row in the spreadsheet to be processed, take the sum of the scores of the cells in the row as the score of the row;
meanwhile, for each column in the spreadsheet to be processed, take the sum of the scores of the cells in the column as the score of the column;
the scores of every row and every column in the spreadsheet to be processed are thus obtained, and then proceed to step F;
step F, cluster the rows of the spreadsheet to be processed according to their scores to obtain row clusters, take the average of the scores of the rows in each row cluster as the score of that row cluster, and select the row cluster with the highest score as the candidate row cluster;
meanwhile, cluster the columns of the spreadsheet to be processed according to their scores to obtain column clusters, take the average of the scores of the columns in each column cluster as the score of that column cluster, and select the column cluster with the highest score as the candidate column cluster;
then proceed to step G;
step G, select the row with the highest score in the candidate row cluster and, from the score of that row, obtain the average score of its non-empty cells as the row cell average score;
meanwhile, select the column with the highest score in the candidate column cluster and, from the score of that column, obtain the average score of its non-empty cells as the column cell average score;
then proceed to step H;
step H, if the row cell average score is greater than the column cell average score, the rows in the candidate row cluster are the header rows of the spreadsheet to be processed, the header items are obtained from them, and the method proceeds to step J;
if the row cell average score is smaller than the column cell average score, the columns in the candidate column cluster are the header columns of the spreadsheet to be processed, the header items are obtained from them, and the method proceeds to step J;
step J, read each data item in the spreadsheet to be processed according to the header items, thereby structuring the table data.
2. The table data structuring method based on machine learning according to claim 1, characterized in that: in step A, after the dictionary table is constructed, the dictionary table is updated through the following steps I to II, and then the method proceeds to step B;
step I, obtain the maximum value among the numbers corresponding to the objects in the dictionary table, and proceed to step II;
step II, for each object in the dictionary table, execute the following steps II-1 to II-2 to update the number corresponding to the object, thereby updating the dictionary table;
step II-1, judge whether the object belongs to a preset header item set; if so, set the number corresponding to the object to the maximum value; otherwise, proceed to step II-2;
step II-2, judge whether the object belongs to a preset data item set; if so, set the number corresponding to the object to 0; otherwise, leave the number corresponding to the object unchanged.
3. The table data structuring method based on machine learning according to claim 1, characterized in that: in step F, the rows of the spreadsheet to be processed are clustered according to their scores through the following steps Fa-1 to Fa-3;
step Fa-1, obtain the minimum row score and the maximum row score among the scores of the rows in the spreadsheet to be processed, and proceed to step Fa-2;
step Fa-2, divide the span from the minimum row score to the maximum row score according to the preset row score grades to obtain the row score intervals, and then proceed to step Fa-3;
step Fa-3, assign each row of the spreadsheet to be processed to a row score interval according to its score; each row score interval that contains at least one row of the spreadsheet to be processed is a row cluster;
meanwhile, the columns of the spreadsheet to be processed are clustered according to their scores through the following steps Fb-1 to Fb-3;
step Fb-1, obtain the minimum column score and the maximum column score among the scores of the columns in the spreadsheet to be processed, and proceed to step Fb-2;
step Fb-2, divide the span from the minimum column score to the maximum column score according to the preset column score grades to obtain the column score intervals, and then proceed to step Fb-3;
step Fb-3, assign each column of the spreadsheet to be processed to a column score interval according to its score; each column score interval that contains at least one column of the spreadsheet to be processed is a column cluster.
CN201910623601.0A 2019-07-11 2019-07-11 Table data structuring method based on machine learning Active CN110362620B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910623601.0A CN110362620B (en) 2019-07-11 2019-07-11 Table data structuring method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910623601.0A CN110362620B (en) 2019-07-11 2019-07-11 Table data structuring method based on machine learning

Publications (2)

Publication Number Publication Date
CN110362620A (en) 2019-10-22
CN110362620B (en) 2021-04-06

Family

ID=68218702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910623601.0A Active CN110362620B (en) 2019-07-11 2019-07-11 Table data structuring method based on machine learning

Country Status (1)

Country Link
CN (1) CN110362620B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12038982B2 (en) 2021-10-08 2024-07-16 Beijing Baidu Netcom Science Technology Co., Ltd. Method of extracting table information, electronic device, and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523420B (en) * 2020-04-14 2023-07-07 南京烽火星空通信发展有限公司 Header classification and header column semantic recognition method based on multi-task deep neural network
CN113010503A (en) * 2021-03-01 2021-06-22 广州智筑信息技术有限公司 Engineering cost data intelligent analysis method and system based on deep learning
CN113901214B (en) * 2021-10-08 2023-11-17 北京百度网讯科技有限公司 Method and device for extracting form information, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1741020A (en) * 2005-09-29 2006-03-01 北京勤哲软件技术有限责任公司 Method for storing electronic table unit lattice content with relational data base
CN102799574A (en) * 2012-06-29 2012-11-28 无锡永中软件有限公司 Data partitioning and merging method for electronic forms
CN106156239A (en) * 2015-04-27 2016-11-23 ***通信集团公司 A kind of form abstracting method and device
CN108009264A (en) * 2017-12-14 2018-05-08 北京航天测控技术有限公司 A kind of comparative approach of versions of data for Excel format files
CN109522452A (en) * 2018-11-13 2019-03-26 南京烽火星空通信发展有限公司 A kind of processing method of magnanimity semi-structured data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040107205A1 (en) * 2002-12-03 2004-06-03 Lockheed Martin Corporation Boolean rule-based system for clustering similar records

Also Published As

Publication number Publication date
CN110362620A (en) 2019-10-22

Similar Documents

Publication Publication Date Title
CN110362620B (en) Table data structuring method based on machine learning
US9785830B2 (en) Methods for automatic structured extraction of data in OCR documents having tabular data
CN109101477B (en) Enterprise field classification and enterprise keyword screening method
CN105260751B (en) A kind of character recognition method and its system
CN105261109A (en) Identification method of prefix letter of banknote
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN107704539A (en) The method and device of extensive text message batch structuring
CN100390815C (en) Template optimized character recognition method and system
CN107526721B (en) Ambiguity elimination method and device for comment vocabularies of e-commerce products
CN112651323B (en) Chinese handwriting recognition method and system based on text line detection
CN100501764C (en) Character recognition system and method
CN112395881B (en) Material label construction method and device, readable storage medium and electronic equipment
CN111340020A (en) Formula identification method, device, equipment and storage medium
CN112016481A (en) Financial statement information detection and identification method based on OCR
CN110287493B (en) Risk phrase identification method and device, electronic equipment and storage medium
CN102221976A (en) Method for quickly inputting words based on incomplete identification
CN112084308A (en) Method, system and storage medium for text type data recognition
CN113807158A (en) PDF content extraction method, device and equipment
Joseph et al. Feature extraction and classification techniques of MODI script character recognition
CN111340032A (en) Character recognition method based on application scene in financial field
CN109472020B (en) Feature alignment Chinese word segmentation method
CN1084502C (en) Method and device for recognition of similar writing
CN114511857A (en) OCR recognition result processing method, device, equipment and storage medium
CN113723501A (en) Maximum diversity clustering construction method of pathogenic microorganism reference knowledge base
CN113361666A (en) Handwritten character recognition method, system and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant