CN114970475A - excel table analysis method, system, equipment and storage medium - Google Patents


Info

Publication number
CN114970475A
Authority
CN
China
Prior art keywords
header
scanning
list
similarity
excel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210584452.3A
Other languages
Chinese (zh)
Inventor
彭麒菱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Merchants Bank Co Ltd
Original Assignee
China Merchants Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Merchants Bank Co Ltd
Priority to CN202210584452.3A
Publication of CN114970475A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18 Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an Excel table analysis method, system, device and storage medium. The method comprises: obtaining an Excel table; based on the transition states of a preset scanning state machine, judging the header similarity of each data row in the Excel table with a pre-constructed header similarity model to obtain header information and table body information; and, if the header information contains multi-layer nested headers, merging them to obtain the header hierarchy of the Excel table. The application addresses the technical problems of low Excel table analysis efficiency and high development and maintenance cost.

Description

excel table analysis method, system, equipment and storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular to an Excel table analysis method, system, device and storage medium.
Background
In the process of enterprise digitization, how to analyze existing Excel tables is a problem faced by small and medium-sized enterprises. At present, Excel tables are mainly analyzed in one of two ways: customized analysis and analysis based on regular-expression matching. Customized analysis requires dedicated parsing for the tables of each client, and whenever the table structure changes the corresponding parsing code must be modified, so the workload is large and the maintenance cost is high. Analysis based on regular-expression matching derives a regular expression by fitting existing tables and then uses that expression to match and analyze other, unknown tables. However, different tables may require different regular expressions, which results in low Excel table analysis efficiency and high development and maintenance costs.
Disclosure of Invention
The main purpose of the present application is to provide an Excel table analysis method, system, device and storage medium that solve the technical problems of low Excel table analysis efficiency and high development and maintenance cost in the prior art.
To achieve the above purpose, the present application provides an Excel table analysis method, which includes:
acquiring an Excel table;
based on the transition states of a preset scanning state machine, performing header similarity judgment on each data row in the Excel table through a pre-constructed header similarity model to obtain header information and table body information;
and, if the header information contains multi-layer nested headers, merging the multi-layer nested headers to obtain the header hierarchy of the Excel table.
The present application further provides an Excel table analysis system, which is a virtual system and includes:
an acquisition module, used for acquiring the Excel table;
a judging module, used for performing header similarity judgment on each data row in the Excel table through a pre-constructed header similarity model, based on the transition states of a preset scanning state machine, to obtain header information and table body information;
and a header merging module, used for merging the multi-layer nested headers to obtain the header hierarchy of the Excel table if the header information contains multi-layer nested headers.
The present application further provides an Excel table analysis device, which is a physical device and includes a memory, a processor, and an Excel table analysis program stored on the memory; when executed by the processor, the program implements the steps of the Excel table analysis method described above.
The present application also provides a storage medium, which is a computer-readable storage medium storing an Excel table analysis program; when executed by a processor, the program implements the steps of the Excel table analysis method described above.
The present application provides an Excel table analysis method, system, device and storage medium. The prior art analyzes Excel tables either by customized parsing or by regular expressions fitted to the tables. In contrast, the present application first obtains the Excel table, then performs header similarity judgment on each data row through a pre-constructed header similarity model, based on the transition states of a preset scanning state machine, to obtain header information and table body information, and, if the header information contains multi-layer nested headers, merges them to obtain the header hierarchy of the Excel table. By combining the transition states of the scanning state machine with the header similarity model, the method judges row by row whether each data row belongs to the header, so the header information and table body information are analyzed quickly and accurately, including for complex table structures with multi-layer headers. Even if the structure of the Excel table changes, each data row can still be judged directly by the scanning state machine and the header similarity model, without customized development for each type of table. This improves the efficiency of Excel table analysis and effectively reduces the maintenance cost.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application.
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a schematic flowchart of a first embodiment of the Excel table analysis method of the present application;
Fig. 2 is a schematic flowchart of a second embodiment of the Excel table analysis method of the present application;
Fig. 3 is a schematic flowchart of a third embodiment of the Excel table analysis method of the present application;
Fig. 4 is a schematic diagram of the header structure of an Excel table with multi-layer nested headers;
Fig. 5 is a schematic diagram of reading the header text of each row and the cell positions corresponding to the header text;
Fig. 6 is a schematic structural diagram of the Excel table analysis device in the hardware operating environment according to an embodiment of the present application;
Fig. 7 is a functional module diagram of the Excel table analysis system of the present application.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In a first embodiment of the Excel table analysis method of the present application, referring to Fig. 1, the Excel table analysis method includes:
Step S10, acquiring an Excel table;
In this embodiment, it should be noted that an Excel table includes one or more of a header area, a body area, and a remark area.
Step S20, based on the transition states of a preset scanning state machine, performing header similarity judgment on each data row in the Excel table through a pre-constructed header similarity model to obtain header information and table body information;
In this embodiment, it should be noted that the transition states include an initial filtering state, a header scanning state, a body scanning state, and a termination state. The initial filtering state is the state of the preset scanning state machine when it starts to scan the data rows of the Excel table. The header similarity model is obtained by iterative training on header text from business history table data collected in advance.
It should be further noted that the header information includes the header text of each row and the cell positions corresponding to the header text. The headers of an Excel table are generally continuous: no blank rows or other interference data appear between the layers of a multi-layer header. Likewise, the rows of the table body are contiguous, with no blank rows between them, and no header row appears again within the body.
Since the header may sit in a data row in the middle of the Excel table, that is, the preceding data rows may be blank rows or interference data, the Excel table needs to be filtered. Specifically, when the preset scanning state machine is in the initial filtering state, the initial data row of the Excel table is scanned first. If the initial data row is empty, it is skipped and the next data row is scanned. Otherwise, header similarity judgment is performed on the data row through the pre-constructed header similarity model to obtain a header similarity. If the header similarity is smaller than the preset similarity threshold, the row is judged to be interference data rather than a header, so it is skipped and the next data row is scanned. These steps repeat until the header similarity predicted by the header similarity model is greater than or equal to the preset similarity threshold, at which point the cell information of that data row is recorded and the state of the preset scanning state machine changes to the header scanning state.
When the preset scanning state machine is in the header scanning state, similarity judgment is performed on the cell information in the next data row of the Excel table through the header similarity model to obtain the corresponding header similarity. If the header similarity is still not less than the preset similarity threshold, the preset scanning state machine remains in the header scanning state, the cell information of that data row is recorded, and the same judgment is applied to the following data row. This repeats until the header similarity of the next data row is smaller than the preset similarity threshold or all data rows of the Excel table have been scanned. In addition, if all data rows have been scanned and the predicted header similarity has never fallen below the preset similarity threshold, the Excel table contains only header information and no table body content. In that case the preset scanning state machine jumps to the termination state and finishes scanning the Excel table, and the header information is determined from the cell information of the data rows whose header similarity is not less than the preset similarity threshold.
Further, if the header similarity of the next data row of the Excel table is smaller than the preset similarity threshold, the cell information of that data row is recorded and the state of the preset scanning state machine changes to the body scanning state. When the preset scanning state machine is in the body scanning state, the Excel table is scanned row by row starting from the data row whose header similarity first fell below the preset similarity threshold. During this row-by-row scan, the cell information in the next data row is input into the header similarity model, which outputs the header similarity. If the header similarity is smaller than the preset similarity threshold, the currently scanned data row is still a body row of the table, so the preset scanning state machine stays in the body scanning state and the cell information of that data row is recorded. When all data rows have been scanned, the preset scanning state machine enters the termination state. Alternatively, if the first empty row is encountered during scanning, or the header similarity is not less than the preset similarity threshold, no further body content follows, and the preset scanning state machine enters the termination state and finishes scanning the Excel table. The table body information is then determined from the cell information of the data rows whose header similarity is smaller than the preset similarity threshold.
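The state transitions described above can be sketched in Python as follows. This is only an illustrative sketch, not the applicant's implementation: the names `ScanState`, `scan_rows` and `toy_similarity` are hypothetical, the threshold value 0.5 is arbitrary, and `toy_similarity` (the share of non-numeric cells in a row) merely stands in for the trained header similarity model. The text does not say how a blank row is handled in the header scanning state; the sketch assumes it ends the scan.

```python
from enum import Enum, auto
from typing import Callable, List, Sequence, Tuple

Row = Sequence[str]


class ScanState(Enum):
    INITIAL_FILTER = auto()
    HEADER_SCAN = auto()
    BODY_SCAN = auto()
    TERMINATED = auto()


def is_blank(row: Row) -> bool:
    return all(cell.strip() == "" for cell in row)


def toy_similarity(row: Row) -> float:
    """Hypothetical stand-in for the trained model: share of non-numeric cells."""
    cells = [cell.strip() for cell in row if cell.strip()]
    if not cells:
        return 0.0
    non_numeric = sum(1 for c in cells if not c.replace(".", "", 1).isdigit())
    return non_numeric / len(cells)


def scan_rows(
    rows: Sequence[Row],
    similarity: Callable[[Row], float] = toy_similarity,
    threshold: float = 0.5,  # arbitrary illustrative value
) -> Tuple[List[List[str]], List[List[str]]]:
    state = ScanState.INITIAL_FILTER
    header_rows: List[List[str]] = []
    body_rows: List[List[str]] = []
    for row in rows:
        if state is ScanState.INITIAL_FILTER:
            # Skip blank rows and interference data before the header.
            if is_blank(row) or similarity(row) < threshold:
                continue
            header_rows.append(list(row))
            state = ScanState.HEADER_SCAN
        elif state is ScanState.HEADER_SCAN:
            if is_blank(row):  # assumption: a blank row ends the scan
                state = ScanState.TERMINATED
                break
            if similarity(row) >= threshold:
                header_rows.append(list(row))
            else:
                body_rows.append(list(row))  # first body row
                state = ScanState.BODY_SCAN
        elif state is ScanState.BODY_SCAN:
            # A blank row or a header-like row means the body has ended.
            if is_blank(row) or similarity(row) >= threshold:
                state = ScanState.TERMINATED
                break
            body_rows.append(list(row))
    return header_rows, body_rows


rows = [
    ["", "", ""],             # blank row, skipped
    ["2022", "", ""],         # interference data, skipped
    ["Name", "Pay", ""],      # header row
    ["", "Base", "Bonus"],    # header row
    ["Alice", "1000", "200"], # body row
    ["Bob", "1200", "150"],   # body row
    ["", "", ""],             # first blank row after the body ends the scan
    ["Remark", "", ""],       # never reached
]
headers, body = scan_rows(rows)
```

With a real model, only the `similarity` argument would change; the transition logic stays the same.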
Step S30, if the header information contains multi-layer nested headers, merging the multi-layer nested headers to obtain the header hierarchy of the Excel table.
In this embodiment, a multi-layer nested header is header information in which merged cells exist across different data rows or within the same data row of the Excel table. For example, Fig. 4 is a schematic diagram of the header structure of an Excel table with multi-layer nested headers: the header structure has three layers, and a complex multi-layer nested header is built by merging cells. Specifically, if the header information contains multi-layer nested headers, the multi-layer nested headers are analyzed and merged to obtain the header hierarchy of the Excel table.
With this scheme, the transition states of the scanning state machine are combined with the header similarity model to judge row by row whether each data row in the Excel table belongs to the header. The header information and table body information of the Excel table can therefore be analyzed quickly and accurately, including for complex table structures with multi-layer headers. Even if the structure of the Excel table changes, each data row can still be judged directly by the scanning state machine and the header similarity model, without customized development for each type of table, which improves the efficiency of Excel table analysis and effectively reduces the maintenance cost.
Further, referring to Fig. 2, based on the first embodiment of the present application, in another embodiment of the present application, the step of performing header similarity judgment on each data row in the Excel table through a pre-constructed header similarity model, based on the transition states of the preset scanning state machine, to obtain header information and table body information includes:
Step A10, when the preset scanning state machine is in the initial filtering state, if the initial data row of the Excel table is not empty, inputting the cell information in the initial data row into the header similarity model and outputting the header similarity;
Step A20, if the header similarity is not less than the preset similarity threshold, changing the state of the preset scanning state machine to the header scanning state;
In this embodiment, specifically, when the Excel table is first scanned, the preset scanning state machine is in the initial filtering state. It scans the initial data row of the Excel table and determines whether that row is empty. If the row is empty, it is skipped and the next data row is scanned, until a non-empty data row is found. If the row is not empty, the cell information in the row is judged by the header similarity model to obtain a header similarity, which is then compared with the preset similarity threshold. If the header similarity is smaller than the threshold, the row is skipped in order to filter out interference data, such as summary and remark data outside the table, and scanning continues with the next data row, to which the same judgment is applied when it is not empty. Once the header similarity is not less than the preset similarity threshold, the cell information of that data row is recorded and the state of the preset scanning state machine changes to the header scanning state.
Step A30, inputting the cell information in the next data row of the Excel table into the header similarity model, outputting the header similarity, and recording the cell information of each data row whose header similarity is not less than the preset similarity threshold, until the header similarity of the next data row of the Excel table is less than the preset similarity threshold or all data rows of the Excel table have been scanned;
Step A40, determining the header information based on the cell information of the data rows whose header similarity is not less than the preset similarity threshold;
Step A50, if the header similarity is smaller than the preset similarity threshold, changing the state of the preset scanning state machine to the body scanning state;
In this embodiment, specifically, when the preset scanning state machine is in the header scanning state, the cell information in the data row is judged by the header similarity model to obtain the header similarity, and it is determined whether the header similarity is greater than or equal to the preset similarity threshold. If so, the cell information of that data row is recorded and the method returns to the step of inputting the cell information in the next data row of the Excel table into the header similarity model and outputting the header similarity. This repeats until all data rows in the Excel table have been scanned or the header similarity is smaller than the preset similarity threshold. The header information of the Excel table is then determined from the recorded cell information of the data rows whose header similarity is greater than or equal to the preset similarity threshold.
In addition, when all data rows in the Excel table have been scanned and the header similarity is still greater than or equal to the preset similarity threshold, the Excel table has a header but no table body, so the preset scanning state machine jumps to the termination state and finishes scanning. When the header similarity is smaller than the preset similarity threshold, the scanning of the header information in the Excel table is complete, so the state of the preset scanning state machine changes to the body scanning state and the cell information of the data row whose header similarity is smaller than the preset similarity threshold is recorded, in order to scan the table body of the Excel table.
Step A60, inputting the cell information in the next data row of the Excel table into the header similarity model, outputting the header similarity, and recording the cell information of each data row whose header similarity is smaller than the preset similarity threshold, until all data rows of the Excel table have been scanned;
Step A70, determining the table body information based on the cell information of the data rows whose header similarity is smaller than the preset similarity threshold.
In this embodiment, specifically, when the preset scanning state machine is in the body scanning state, the cell information in the data row is judged by the header similarity model to obtain the header similarity, and it is determined whether the header similarity is smaller than the preset similarity threshold. If so, the cell information of that data row is recorded, the preset scanning state machine remains in the body scanning state, and the method returns to the step of inputting the cell information in the next data row of the Excel table into the header similarity model and outputting the header similarity. This repeats until all data rows in the Excel table have been scanned or the header similarity is not less than the preset similarity threshold, whereupon the preset scanning state machine enters the termination state. The table body information of the Excel table is then determined from the recorded cell information of all data rows whose header similarity is smaller than the preset similarity threshold.
With this scheme, the state of the preset scanning state machine and the header similarity predicted by the header similarity model together make it possible to filter out interference data and identify the boundaries of the table, so that the header information and table body information of the Excel table are identified without customized development for each type of table. When the table structure changes, the algorithm needs almost no modification, which effectively reduces the maintenance workload.
Further, referring to Fig. 3, based on the first embodiment of the present application, in another embodiment of the present application, the step of merging the multi-layer nested headers to obtain the header hierarchy of the Excel table includes:
Step B10, reading the cell information of each row in the multi-layer nested header according to a preset cell reading rule to obtain the header reading result of each row;
In this embodiment, specifically, following the way cell data is represented in an Excel table, when the header information contains a merged cell, whether it spans different data rows or lies within a single data row, the header text is kept in the first cell of the merged cell and the data of the remaining cells in the merged cell is set to empty, which gives the splitting result. Based on the multi-layer nested header and the splitting result, the header text in the cells of each row and its corresponding position are read to determine the header reading result of each row. For example, referring to Fig. 4 and Fig. 5, where Fig. 5 is a schematic diagram of reading the header text of each row and the cell positions corresponding to the header text, and the symbol '' indicates that the data of a cell is empty, the header reading result of each row is as follows:
First row: ['header A', 'header B', '', '', '', 'header H', 'header I', '']
Second row: ['', 'header C', 'header D', '', 'header G', '', 'header J', 'header K']
Third row: ['', '', 'header E', 'header F', '', '', '', '']
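As a rough illustration of this reading rule, the Python sketch below blanks every cell of a merged region except its first (top-left) cell. The function name `read_header_rows`, the tuple layout of a merged range, and the exact merged regions are assumptions: the regions are inferred from the three reading results above rather than taken from Fig. 4 itself, and the input grid is assumed to repeat the region's text in every cell it covers.

```python
from typing import List, Sequence, Tuple

# (first_row, first_col, last_row, last_col), 0-based and inclusive.
MergedRange = Tuple[int, int, int, int]


def read_header_rows(
    grid: Sequence[Sequence[str]], merged: Sequence[MergedRange]
) -> List[List[str]]:
    """Keep the text of the first cell of each merged region, blank the rest."""
    rows = [list(row) for row in grid]
    for r1, c1, r2, c2 in merged:
        for r in range(r1, r2 + 1):
            for c in range(c1, c2 + 1):
                if (r, c) != (r1, c1):
                    rows[r][c] = ""
    return rows


A, B, C, D, E, F, G, H, I, J, K = (f"header {x}" for x in "ABCDEFGHIJK")

# Assumed layout: every cell of a merged region repeats the region's text.
grid = [
    [A, B, B, B, B, H, I, I],
    [A, C, D, D, G, H, J, K],
    [A, C, E, F, G, H, J, K],
]
merged_ranges = [
    (0, 0, 2, 0),  # header A spans three rows
    (0, 1, 0, 4),  # header B spans four columns
    (1, 1, 2, 1),  # header C spans two rows
    (1, 2, 1, 3),  # header D spans two columns
    (1, 4, 2, 4),  # header G spans two rows
    (0, 5, 2, 5),  # header H spans three rows
    (0, 6, 0, 7),  # header I spans two columns
    (1, 6, 2, 6),  # header J spans two rows
    (1, 7, 2, 7),  # header K spans two rows
]
reading = read_header_rows(grid, merged_ranges)
```

Under these assumptions, `reading` reproduces the three header reading results listed above.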
Step B20, constructing a merging result list, a determination list and a scan header list, wherein the determination list initially stores the header reading result of the first row, and the scan header list scans the header reading results of each row in sequence;
In this embodiment, the merging result list stores the result of each merge and is initially empty, the determination list initially stores the header reading result of the first row, and the scan header list scans the header reading results of each row in sequence. For example, following the example of step B10, the scan header list is ['header A', 'header B', '', '', '', 'header H', 'header I', ''] and the determination list is likewise ['header A', 'header B', '', '', '', 'header H', 'header I', ''].
Step B30, if the currently scanned determination list and the header reading result in the scan header list both have empty cell data at the same position, filling the empty cells in the current scan header list to obtain an updated scan header list;
In this embodiment, specifically, the cell information at the same position in the currently scanned determination list and in the header reading result of the scan header list is compared one by one. If the cells at the same position are both empty, then, starting from the position of the empty cell, the current scan header list is searched toward the start of the list for a non-empty cell, and the data of the first non-empty cell found is filled into the empty cell, giving the updated scan header list. For example, following the example above, the data of the third, fourth, fifth and eighth cells is empty in both the scan header list and the determination list. For the third, fourth and fifth cells, the first non-empty cell found is header B, so header B is filled into the third, fourth and fifth cells of the scan header list. For the eighth cell, the first non-empty cell found is header I. The updated scan header list is therefore ['header A', 'header B', 'header B', 'header B', 'header B', 'header H', 'header I', 'header I'].
In addition, if no position has empty cell data in both the currently scanned determination list and the header reading result of the scan header list, the scan header list does not need to be filled.
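The filling rule of step B30 can be sketched in Python as below. The function name `fill_scan_list` is hypothetical, and the sketch assumes that an empty cell is represented by the empty string and that the search for a non-empty cell runs toward the start of the list, as in the example above.

```python
from typing import List, Sequence


def fill_scan_list(determination: Sequence[str], scan: Sequence[str]) -> List[str]:
    """Fill cells that are empty in both lists with the nearest non-empty
    cell found toward the start of the scan header list."""
    filled = list(scan)
    for i, (d, s) in enumerate(zip(determination, scan)):
        if d == "" and s == "":
            for k in range(i - 1, -1, -1):
                if scan[k] != "":
                    filled[i] = scan[k]
                    break
    return filled


first_row = ["header A", "header B", "", "", "", "header H", "header I", ""]
# For the first row the determination list equals the scan header list.
updated = fill_scan_list(first_row, first_row)
```

Cells that are empty only in the scan header list (for example under a header that spans several rows) are left untouched, because the determination list is not empty at that position.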
Step B40, performing element splicing on the cell information in the merging result list and the scan header list, storing the spliced header result in the merging result list, and scanning the header reading result of the next row to obtain the scan header list for the next row;
In this embodiment, specifically, the cell information at the same position in the scan header list and the merging result list is merged: if the header texts at the same position are the same, they are merged into one header text, and if the header texts at the same position are different, they are joined with a separator. For example, after the first row the merging result list is ['header A', 'header B', 'header B', 'header B', 'header B', 'header H', 'header I', 'header I'].
Further, scanning continues with the header reading result of the next row to obtain the corresponding scan header list. For example, the scan header list obtained for the next row is ['', 'header C', 'header D', '', 'header G', '', 'header J', 'header K'].
Step B50, performing element splicing on the header reading result in the scan header list of the next row and the header reading result in the determination list at the same cell position to obtain a new determination list, and, based on the new determination list and the scan header list of the next row, returning to the step of filling the empty cells in the current scan header list if the currently scanned determination list and the header reading result in the scan header list both have empty cell data at the same position, so as to obtain an updated scan header list and then an updated merging result list, until the header reading results of all data rows have been scanned and the final merging result list is obtained;
Step B60, determining the header hierarchy based on the final merging result list.
In this embodiment, specifically, the cell information at the same position in the scan header list of the next row and in the determination list is spliced: if the header texts at the same position are the same, they are merged into one header text, and if they are different, the header texts are spliced together. It should be noted that the header texts can be spliced with a separator or spliced directly, and that if the cells at the same position are both empty, the result remains empty. For example, the new determination list is ['header A', 'header Bheader C', 'header D', '', 'header G', 'header H', 'header Iheader J', 'header K'].
Further, it is judged whether there are cells at the same position whose data is empty in both the currently scanned judgment list and the header reading result in the scan header list, and the method returns to the step of: if there are cells at the same position whose data is empty in both the currently scanned judgment list and the header reading result in the scan header list, filling the empty cells in the current scan header list. That is, if the new judgment list and the scan header list obtained by the next scan both have empty data at the same cell position, the scan header list is filled; the updated scan header list is, for example, ['', 'header C', 'header D', 'header G', 'header J', 'header K'].
Furthermore, element splicing is performed on the cell information in the updated scan header list and the merging result list to obtain the updated merging result list. For example, if the merging result list is ['header A', 'header B', 'header H', 'header I', ] and the scan header list is ['', 'header C', 'header D', 'header G', 'header J', 'header K', ], the updated merging result list is ['header A', 'header B##header C', 'header B##header D', 'header B##header G', 'header J', 'header K']. The header hierarchy of the excel table is then determined based on the final merging result list, so that a downstream application can correctly restore the header hierarchy of the excel table by splitting on the ## separator.
With the above scheme, when the header information contains multi-layer nested headers, a complex table structure including multi-layer headers can be parsed accurately; and when the table structure changes, the algorithm does not need to be adjusted, so the accuracy of header merging is not affected.
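As a non-limiting illustration, the merging procedure described above can be sketched in Python as follows. The function names (splice, fill_empty_cells, merge_header_rows), the "##" separator constant, and the two sample rows are assumptions made for this sketch. In particular, the sample rows are a hypothetical reconstruction in which 'header B' spans three columns and 'header I' spans two columns; they are not taken verbatim from the embodiment.

```python
SEPARATOR = "##"  # assumed separator between header levels


def splice(upper, lower, sep=SEPARATOR):
    """Splice two header texts found at the same cell position."""
    if lower == "" or upper == lower:
        return upper          # nothing new to add, or identical texts
    if upper == "":
        return lower
    return upper + sep + lower


def fill_empty_cells(scan, judgment):
    """Fill cells that are empty in BOTH lists from the nearest
    non-empty cell to their left in the scan header list."""
    filled = list(scan)
    for i in range(len(filled)):
        if filled[i] == "" and judgment[i] == "":
            for j in range(i - 1, -1, -1):
                if filled[j] != "":
                    filled[i] = filled[j]
                    break
    return filled


def merge_header_rows(rows, sep=SEPARATOR):
    """Merge multi-row header reading results into one list of header paths."""
    if not rows:
        return []
    width = len(rows[0])
    judgment = [""] * width   # becomes the first row after the first splice
    merged = [""] * width     # merging result list
    for row in rows:
        # Update the judgment list with the row being scanned.
        judgment = [splice(j, c, sep) for j, c in zip(judgment, row)]
        # Fill only cells that are empty in both lists (horizontal merges).
        scan = fill_empty_cells(row, judgment)
        # Splice the filled scan header list into the merging result list.
        merged = [splice(m, s, sep) for m, s in zip(merged, scan)]
    return merged


# Hypothetical two-row header (empty strings are split merged cells).
EXAMPLE_ROWS = [
    ["header A", "header B", "", "", "header H", "header I", ""],
    ["", "header C", "header D", "header G", "", "header J", "header K"],
]
print(merge_header_rows(EXAMPLE_ROWS))
```

In this sketch the judgment list only serves to tell a cell left empty by a horizontal merge (empty in every row scanned so far, so it is filled from the left) from a cell left empty by a vertical merge (a row above already supplied text, so it is not filled). A downstream consumer could recover the levels of each column with `path.split("##")`.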
Further, before the step of performing header similarity judgment on each data line in the excel table through a pre-constructed header similarity model based on the transition state of the preset scanning state machine to obtain header information and table text information, the excel table analysis method further includes:
step C10, acquiring business history table data, and extracting a header training corpus from the business history table data;
and step C20, performing iterative training on the header similarity model to be trained based on the header training corpus to obtain the header similarity model.
In this embodiment, the header similarity model includes natural language models such as NLP (Natural Language Processing) and LSI (Latent Semantic Indexing) models, and the business history table data is table data corresponding to different businesses, for example, a payroll table.
Specifically, historical table data corresponding to different businesses is collected, the relevant header texts are extracted from it, and the header texts are used as the header training corpus. The header training corpus is input into the header similarity model to be trained to obtain a training result, and the model is iteratively trained based on the training result and the real labels corresponding to the header training corpus to obtain the header similarity model. In the subsequent judging process, a data row (row) in the excel table is used as the input of the header similarity model, and the model returns a value in the interval [0, 1]: the closer the value is to 1, the higher the probability that the input text array is a header; the closer the value is to 0, the lower that probability. The header similarity model can thus judge the probability that a group of texts as a whole is a header, which makes it possible to distinguish the header, the body text, and other interfering data in the excel table.
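Purely to illustrate the interface described above (a data row in, a value in [0, 1] out), the following Python sketch uses a deliberately simple stand-in: it collects a vocabulary of header texts from historical tables and scores a row by the fraction of its non-empty cells found in that vocabulary. This is not the LSI or other natural language model of the embodiment; the function names and the sample payroll headers are assumptions made for the sketch.

```python
def _normalize(cell):
    """Return a lower-cased, stripped text for a cell; None becomes ''."""
    return "" if cell is None else str(cell).strip().lower()


def build_header_vocabulary(header_rows):
    """Collect the distinct header texts seen in historical header rows."""
    vocabulary = set()
    for row in header_rows:
        for cell in row:
            text = _normalize(cell)
            if text:
                vocabulary.add(text)
    return vocabulary


def header_similarity(row, vocabulary):
    """Score a data row in [0, 1]; values near 1 suggest a header row."""
    texts = [t for t in (_normalize(c) for c in row) if t]
    if not texts:
        return 0.0  # an empty row is never treated as a header
    hits = sum(1 for t in texts if t in vocabulary)
    return hits / len(texts)


# Hypothetical historical payroll headers used as the training corpus.
VOCABULARY = build_header_vocabulary([
    ["Name", "Department", "Base salary", "Bonus"],
    ["Employee ID", "Name", "Net pay"],
])
```

A trained semantic model would generalize to header texts it has not seen verbatim, which this lookup-based stand-in cannot do; the sketch only fixes the shape of the call that the scanning state machine relies on.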
Referring to fig. 6, fig. 6 is a schematic structural diagram of an excel table parsing device of a hardware operating environment according to an embodiment of the present application.
As shown in fig. 6, the excel table parsing apparatus may include: a processor 1001, such as a CPU, a memory 1005, and a communication bus 1002. The communication bus 1002 is used for realizing connection communication between the processor 1001 and the memory 1005. The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a memory device separate from the processor 1001 described above.
Optionally, the excel form parsing device may further include a rectangular user interface, a network interface, a camera, an RF (Radio Frequency) circuit, a sensor, an audio circuit, a WiFi module, and the like. The rectangular user interface may comprise a display screen (Display) and an input sub-module such as a keyboard (Keyboard), and may optionally also comprise a standard wired interface and a wireless interface. The network interface may optionally include a standard wired interface and a wireless interface (e.g., a WiFi interface).
Those skilled in the art will appreciate that the excel form parsing apparatus architecture shown in fig. 6 does not constitute a limitation of the excel form parsing apparatus and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 6, a memory 1005, which is a kind of computer storage medium, may include an operating system, a network communication module, and an excel table parsing program therein. The operating system is a program for managing and controlling hardware and software resources of the excel form analysis equipment and supports the operation of the excel form analysis program and other software and/or programs. The network communication module is used for realizing communication among the components in the memory 1005 and communication with other hardware and software in the excel form analysis system.
In the excel form analysis apparatus shown in fig. 6, the processor 1001 is configured to execute an excel form analysis program stored in the memory 1005, and implement the steps of the excel form analysis method described in any one of the above.
The specific implementation of the excel form analysis device of the application is basically the same as that of each embodiment of the excel form analysis method, and is not described herein again.
In addition, referring to fig. 7, fig. 7 is a functional module schematic diagram of an excel form parsing system according to the present application, and the present application further provides an excel form parsing system, where the excel form parsing system includes:
the acquisition module is used for acquiring the excel table;
the judging module is used for judging the header similarity of each data line in the excel table through a pre-constructed header similarity model based on the transition state of a preset scanning state machine to obtain header information and table text information;
and the header merging module is used for merging the header information to obtain the header hierarchical structure of the excel table if the header information has a multi-layer nested header.
Optionally, the determining module is further configured to:
when the preset scanning state machine is in the initial filtering state, if an initial data row of the excel table is not empty, inputting cell information in the initial data row into the header similarity model, and outputting header similarity;
if the header similarity is not less than a preset similarity threshold, changing the state of the preset scanning state machine into the header scanning state;
inputting cell information in a next data row of the excel table into the header similarity model, outputting header similarity, and recording cell information of a data row corresponding to which the header similarity is not less than a preset similarity threshold until the header similarity of the next data row of the excel table is less than the preset similarity threshold, or scanning all data rows of the excel table;
determining the header information based on the cell information of the data line corresponding to the header similarity not less than the preset similarity threshold;
if the header similarity is smaller than a preset similarity threshold, changing the state of the preset scanning state machine into the text scanning state;
inputting cell information in the next data row of the excel table into the header similarity model, outputting header similarity, and recording cell information of the data row corresponding to the header similarity smaller than a preset similarity threshold until all data rows of the excel table are scanned;
and determining the table text information based on the cell information of the data rows whose header similarity is smaller than the preset similarity threshold.
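The state transitions listed above can be sketched in Python as follows. The names ScanState and scan_rows, the default threshold of 0.5, and the scoring callable passed in as `score` are assumptions made for this sketch. One further assumption is labeled in the code: a non-empty row that scores below the threshold while the machine is still in the initial filtering state is skipped as interfering data.

```python
from enum import Enum, auto


class ScanState(Enum):
    INITIAL_FILTER = auto()
    HEADER_SCAN = auto()
    TEXT_SCAN = auto()


def is_empty_row(row):
    """A row is empty when every cell is None or blank."""
    return all(c is None or str(c).strip() == "" for c in row)


def scan_rows(rows, score, threshold=0.5):
    """Split rows into header rows and body-text rows.

    `score` is any callable mapping a row to a header similarity in [0, 1].
    """
    state = ScanState.INITIAL_FILTER
    header_rows, text_rows = [], []
    for row in rows:
        if state is ScanState.INITIAL_FILTER:
            if is_empty_row(row):
                continue
            if score(row) >= threshold:
                state = ScanState.HEADER_SCAN
                header_rows.append(row)
            # Assumption: other rows seen before the header are skipped.
        elif state is ScanState.HEADER_SCAN:
            if score(row) >= threshold:
                header_rows.append(row)
            else:
                state = ScanState.TEXT_SCAN
                text_rows.append(row)
        else:  # ScanState.TEXT_SCAN
            if score(row) < threshold:
                text_rows.append(row)
    return header_rows, text_rows
```

Passing the scorer in as a parameter keeps the state machine independent of how the header similarity model is built, so the model can be retrained or replaced without touching the scanning logic.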
Optionally, the header merging module is further configured to:
reading the cell information of each line in the multilayer nested header according to a preset cell reading rule to obtain a header reading result of each line;
constructing a merging result list, a judgment list and a scanning header list, wherein the judgment list initially stores header reading results of a first row, and the scanning header list sequentially scans the header reading results of each row;
if the data of the cells at the same position exist in the judgment list of the current scanning and the header reading result in the scanning header list are null, filling the cells in the null position in the current scanning header list to obtain an updated scanning header list;
element splicing is carried out on the cell information in the merging result list and the scanning header list, the splicing header result is stored in the merging result list, and scanning of the next row of header reading results is carried out to obtain a scanning header list of the next row of scanning;
element splicing of the same cell position is carried out on the header reading result in the scanning header list of next line scanning and the header reading result in the judgment list to obtain a new judgment list, and based on the new judgment list and the scanning header list of next line scanning, the execution steps are returned: if the data of the cells at the same position exist in the judgment list of the current scanning and the header reading results in the scanning header list are empty, filling the empty cells in the current scanning header list to obtain an updated scanning header list so as to obtain an updated merging result list until the header reading results of all data rows are scanned to obtain a final merging result list;
determining the header hierarchy based on the final merge result list.
Optionally, the header merging module is further configured to:
if merged cells containing header text exist across different data rows or within the same data row, retaining the header text of the first cell of the merged cells and setting the data of the remaining cells of the merged cells to empty, so as to obtain a splitting result;
determining a header reading result of each row based on the multi-layer nested headers and the splitting result.
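The splitting of merged cells can be illustrated with the following Python sketch. It assumes the header area has already been read into a list of rows and that each merged region is known as a tuple (top, left, bottom, right) of 0-based inclusive indices; the function name and this tuple layout are hypothetical and are not prescribed by the embodiment.

```python
def split_merged_cells(grid, merged_ranges):
    """Keep the text of the first (top-left) cell of each merged region
    and set every other cell of the region to an empty string."""
    result = [list(row) for row in grid]  # do not modify the input
    for top, left, bottom, right in merged_ranges:
        first_text = result[top][left]
        for r in range(top, bottom + 1):
            for c in range(left, right + 1):
                result[r][c] = ""
        result[top][left] = first_text
    return result


# Hypothetical header: 'header A' spans two rows, 'header B' spans three columns.
GRID = [
    ["header A", "header B", "header B", "header B"],
    ["header A", "header C", "header D", "header G"],
]
MERGED = [(0, 0, 1, 0), (0, 1, 0, 3)]
print(split_merged_cells(GRID, MERGED))
```

The rows produced this way serve as the header reading results consumed by the merging step, where empty cells are either filled from the left or left empty depending on the judgment list.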
Optionally, the header merging module is further configured to:
based on the position of the empty cell, querying forward for non-empty cells in the current scan header list, and filling the data of the first non-empty cell found into the empty cell to obtain the updated scan header list.
Optionally, the excel table parsing system is further configured to:
acquiring business history table data, and extracting a header training corpus from the business history table data;
and performing iterative training on the similarity model of the header to be trained based on the header training corpus to obtain the similarity model of the header.
The specific implementation of the excel form analysis system of the application is basically the same as that of the above excel form analysis method, and is not described herein again.
The embodiment of the application provides a storage medium which is a computer-readable storage medium, wherein the computer-readable storage medium stores one or more programs, and the one or more programs can be further executed by one or more processors to realize the steps of the excel table parsing method.
The specific implementation manner of the computer-readable storage medium of the present application is substantially the same as that of each embodiment of the excel table parsing method, and is not described herein again.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims (10)

1. An excel table analysis method, characterized in that the excel table analysis method comprises:
acquiring an excel table;
based on the transition state of a preset scanning state machine, performing header similarity judgment on each data line in the excel table through a pre-constructed header similarity model to obtain header information and table text information;
and if the header information has a plurality of layers of nested headers, merging the plurality of layers of nested headers to obtain a header hierarchical structure of the excel table.
2. The excel table analysis method according to claim 1, wherein the transition states of the preset scanning state machine include an initial filtering state, a header scanning state and a text scanning state.
3. The excel table analysis method according to claim 2, wherein the step of performing header similarity judgment on each data row in the excel table through a pre-constructed header similarity model based on the transition state of a preset scanning state machine to obtain header information and table text information comprises:
when the preset scanning state machine is in the initial filtering state, if an initial data row of the excel table is not empty, inputting cell information in the initial data row into the header similarity model, and outputting header similarity;
if the header similarity is not less than a preset similarity threshold, changing the state of the preset scanning state machine into the header scanning state;
inputting cell information in a next data row of the excel table into the header similarity model, outputting header similarity, and recording cell information of a data row corresponding to which the header similarity is not less than a preset similarity threshold until the header similarity of the next data row of the excel table is less than the preset similarity threshold, or scanning all data rows of the excel table;
determining the header information based on the cell information of the data line corresponding to the header similarity not less than the preset similarity threshold;
if the header similarity is smaller than a preset similarity threshold, changing the state of the preset scanning state machine into the text scanning state;
inputting cell information in a next data row of the excel table into the header similarity model, outputting header similarity, and recording cell information of a data row with header similarity smaller than a preset similarity threshold until all data rows of the excel table are scanned;
and determining the table text information based on the cell information of the data rows whose header similarity is smaller than the preset similarity threshold.
4. The excel table analysis method according to claim 1, wherein the step of merging the multiple layers of nested headers to obtain a header hierarchy of the excel table comprises:
reading the cell information of each line in the multilayer nested header according to a preset cell reading rule to obtain a header reading result of each line;
constructing a merging result list, a judgment list and a scanning header list, wherein the judgment list initially stores header reading results of a first row, and the scanning header list sequentially scans the header reading results of each row;
if the data of the cells at the same position exist in the judgment list of the current scanning and the header reading result in the scanning header list are null, filling the cells in the null position in the current scanning header list to obtain an updated scanning header list;
element splicing is carried out on the cell information in the merging result list and the scanning header list, the splicing header result is stored in the merging result list, and scanning of the next row of header reading results is carried out to obtain a scanning header list of the next row of scanning;
element splicing of the same cell position is carried out on the header reading result in the scanning header list of next line scanning and the header reading result in the judgment list to obtain a new judgment list, and based on the new judgment list and the scanning header list of next line scanning, the execution steps are returned: if the data of the cells at the same position exist in the judgment list of the current scanning and the header reading results in the scanning header list are empty, filling the empty cells in the current scanning header list to obtain an updated scanning header list so as to obtain an updated merging result list until the header reading results of all data rows are scanned to obtain a final merging result list;
determining the header hierarchy based on the final merge result list.
5. The excel table analysis method according to claim 4, wherein the step of reading cell information of each row in the multi-layer nested header according to a preset cell reading rule to obtain a header reading result of each row comprises:
if merged cells containing header text exist across different data rows or within the same data row, retaining the header text of the first cell of the merged cells and setting the data of the remaining cells of the merged cells to empty, so as to obtain a splitting result;
determining a header reading result of each row based on the multi-layer nested headers and the splitting result.
6. The excel table analysis method according to claim 4, wherein the step of filling the empty cells in the current scan header list to obtain an updated scan header list comprises:
based on the position corresponding to the empty cell, forward querying non-empty cells in the current scanning header list, and filling the queried data of the first non-empty cell into the empty cell to obtain the updated scanning header list.
7. The excel table analysis method according to claim 1, wherein before the step of performing header similarity judgment on each data row in the excel table through a pre-constructed header similarity model based on the transition state of a preset scanning state machine to obtain header information and table text information, the excel table analysis method further comprises:
acquiring business history table data and extracting a header training corpus from the business history table data;
and performing iterative training on the similarity model of the header to be trained based on the header training corpus to obtain the similarity model of the header.
8. An excel form parsing system, said excel form parsing system comprising:
the acquisition module is used for acquiring the excel table;
the judging module is used for judging the header similarity of each data line in the excel table through a pre-constructed header similarity model based on the transition state of a preset scanning state machine to obtain header information and table text information;
and the header merging module is used for merging the header information to obtain the header hierarchical structure of the excel table if the header information has a plurality of layers of nested headers.
9. An excel form parsing apparatus, characterized in that the excel form parsing apparatus comprises: a memory, a processor and an excel table parsing program stored on the memory,
the excel table parsing program is executed by the processor to implement the excel table analysis method according to any one of claims 1 to 7.
10. A storage medium which is a computer-readable storage medium, wherein the computer-readable storage medium has an excel table parsing program stored thereon, and the excel table parsing program is executed by a processor to implement the excel table analysis method according to any one of claims 1 to 7.
CN202210584452.3A 2022-05-27 2022-05-27 excel table analysis method, system, equipment and storage medium Pending CN114970475A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210584452.3A CN114970475A (en) 2022-05-27 2022-05-27 excel table analysis method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210584452.3A CN114970475A (en) 2022-05-27 2022-05-27 excel table analysis method, system, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114970475A true CN114970475A (en) 2022-08-30

Family

ID=82956372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210584452.3A Pending CN114970475A (en) 2022-05-27 2022-05-27 excel table analysis method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114970475A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115390853A (en) * 2022-09-14 2022-11-25 北京虎蜥信息技术有限公司 Structured analysis method, system, terminal and storage medium for multi-source process file
CN115563111A (en) * 2022-09-27 2023-01-03 国网江苏省电力有限公司超高压分公司 Method and system for configuring dynamic model of converter station system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination