CN112632927A - Table fragment link restoration method and system based on semantic processing

Publication number
CN112632927A
CN112632927A CN202011621485.8A CN202011621485A CN112632927A CN 112632927 A CN112632927 A CN 112632927A CN 202011621485 A CN202011621485 A CN 202011621485A CN 112632927 A CN112632927 A CN 112632927A
Authority
CN
China
Prior art keywords
processing
text
context
segments
semantic
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011621485.8A
Other languages
Chinese (zh)
Inventor
金鑫
李鹏辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Alphainsight Technology Co ltd
Original Assignee
Shanghai Alphainsight Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-12-30
Filing date
2020-12-30
Publication date
2021-04-09
Application filed by Shanghai Alphainsight Technology Co ltd
Priority to CN202011621485.8A
Publication of CN112632927A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/174 Form filling; Merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a table fragment link restoration method based on semantic processing, which comprises the following steps: S100, performing structured extraction on a table to obtain table segments; S200, preprocessing the table segments extracted in step S100; S300, using an LSTM deep learning model to learn the semantic information of the table context and of the data in the table, so as to judge whether adjacent table segments should be linked; S400, performing rule verification on the model processing result, and restoring the table segments that need to be linked. The method uses the LSTM deep learning model for representation learning, automatically mines the semantic information contained in the table context and in the table data, intelligently identifies whether table segments separated by a line wrap or page break in a PDF document should be link-restored, and performs link restoration on each such group of table segments.

Description

Table fragment link restoration method and system based on semantic processing
Technical Field
The invention belongs to the technical field of table text processing, and particularly relates to a table fragment link restoration method and system based on semantic processing.
Background
In recent years, deep learning techniques have been widely used in fields such as natural language processing, image processing, and autonomous driving, where they perform significantly better than conventional methods.
In the field of natural language processing, the deep learning technology can capture deep grammar and semantic information by encoding text characters in a high-dimensional space, thereby providing a technical basis for realizing high-level application in the field of natural language processing from the aspect of semantics.
In text information processing, tables appear in a large number of different styles, and the prior art still has many problems in extracting table information. When a table is broken across pages or lines, it is difficult to determine whether a row has been wrapped by relying only on ruling lines or simple rules. For tables without ruling lines, it is difficult for a computer to judge accurately whether two adjacent rows belong to the same cell.
Disclosure of Invention
1. Technical problem to be solved by the invention
The invention aims to solve the problem that existing table processing methods have difficulty judging accurately whether adjacent cells can be merged.
2. Technical scheme
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
The invention relates to a table fragment link restoration method based on semantic processing, which comprises the following steps:
S100, performing structured extraction on the table to obtain table segments;
S200, preprocessing the table segments extracted in step S100;
S300, using an LSTM deep learning model to learn the semantic information of the table context and of the data in the table, so as to judge whether adjacent table segments should be linked;
S400, performing rule verification on the model processing result, and restoring the table segments that need to be linked.
Preferably, step S100 extracts the table segments from the table according to the table structure.
Preferably, the preprocessing in step S200 comprises extracting and cleaning the context of each table segment, extracting the cell text in the table segment, converting the extracted and combined cell sequence into text, and cleaning the text to remove invalid content from both the context of the table segment and the table itself.
Preferably, step S300 comprises the following steps:
S310, obtaining context word vectors: using word2vec to learn the corresponding vectors for the context of each table segment;
S320, obtaining in-table text word vectors: using word2vec to learn the corresponding vectors for the in-table text of each table segment;
S330, word vector splicing: splicing the context word vectors and the in-table text word vectors;
S340, model processing: applying bidirectional LSTM processing to the text through the LSTM deep learning model, learning the semantic information of the text, and obtaining the semantic features of the table segments;
S350, restoration judgment: using a linear classifier to judge whether each group of spliced table segments should be link-restored.
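The text does not give equations for steps S330 to S350. As a hedged illustration only, one conventional formulation uses the standard LSTM recurrences. The symbols u_j (context word vectors), v_k (in-table text word vectors), the weight matrices W, the biases b, and the softmax output layer are assumptions introduced here and are not part of the disclosure:

```latex
\begin{aligned}
x_{1:T} &= (u_1,\dots,u_m,\,v_1,\dots,v_n), \qquad T = m + n \\
i_t &= \sigma\!\left(W_i\,[x_t; h_{t-1}] + b_i\right) \\
f_t &= \sigma\!\left(W_f\,[x_t; h_{t-1}] + b_f\right) \\
o_t &= \sigma\!\left(W_o\,[x_t; h_{t-1}] + b_o\right) \\
\tilde{c}_t &= \tanh\!\left(W_c\,[x_t; h_{t-1}] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t) \\
z &= [\overrightarrow{h}_T;\, \overleftarrow{h}_1] \\
\hat{y} &= \operatorname{softmax}\!\left(W_y z + b_y\right)
\end{aligned}
```

Here the backward direction applies the same recurrences to the reversed sequence with its own parameters, z concatenates the final state of each direction, and the linear layer followed by softmax plays the role of the linear classifier of S350.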
Preferably, the rule check on the model processing result in step S400 comprises checking the merged-cell information and applying rule-based correction to erroneous model predictions.
Preferably, the text cleaning that removes invalid content from the context of the table segment and from the table specifically consists of deleting meaningless punctuation marks.
Preferably, the restoration judgment in step S350 takes the preprocessed and vectorized table information as input, and the linear classifier judges the relationship between the two tables as follows:
judging whether the two tables are the same table, and if not, performing no link restoration;
if they are the same table, judging whether the last row of the previous table and the first row of the next table are the same row; if they are not the same row, the two tables are spliced directly; if they are the same row, the last row of the previous table is kept and the first row of the next table is merged into it.
A table fragment link restoration system based on semantic processing, which is used for executing the above method, comprises:
a table extraction module, used for performing structured extraction on a table to obtain table segments;
a preprocessing module, used for preprocessing the extracted table segments;
a model processing module, used for judging whether adjacent table segments should be linked according to the semantic information of the table context and of the data in the table;
and a checking and recovering module, used for performing rule verification on the model processing result and restoring the table segments that need to be linked.
Preferably, the model processing module comprises a context word vector obtaining unit, a table text word vector obtaining unit, a word vector splicing unit, a processing unit and a judging unit.
Preferably, the context word vector obtaining unit is configured to use word2vec to learn a corresponding context word vector for the context of each table segment; the table text word vector obtaining unit is configured to use word2vec to learn a corresponding table text word vector for the in-table text of each table segment.
Preferably, the word vector splicing unit is used for splicing the context word vectors and the text word vectors in the table; the processing unit is used for performing bidirectional LSTM processing on the text through the LSTM deep learning model, learning semantic information of the text and acquiring semantic features of the table segments; and the judging unit is used for judging whether each group of spliced table segments should be subjected to link restoration through a linear classifier.
3. Advantageous effects
Compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:
The table fragment link restoration method based on semantic processing comprises the following steps: S100, performing structured extraction on a table to obtain table segments; S200, preprocessing the table segments extracted in step S100; S300, using an LSTM deep learning model to learn the semantic information of the table context and of the data in the table, so as to judge whether adjacent table segments should be linked; S400, performing rule verification on the model processing result, and restoring the table segments that need to be linked. The method uses the LSTM deep learning model for representation learning, automatically mines the semantic information contained in the table context and in the table data, intelligently identifies whether table segments separated by a line wrap or page break in a PDF document should be link-restored, and performs link restoration on each such group of table segments.
Drawings
FIG. 1 is a flow chart of a table fragment link recovery method based on semantic processing according to the present invention;
FIG. 2 is a schematic structural diagram of a table fragment link recovery system based on semantic processing according to the present invention.
The reference numerals in the schematic drawings illustrate:
100. a table extraction module;
200. a preprocessing module;
300. a model processing module; 310. a context word vector acquisition unit; 320. a table text word vector obtaining unit; 330. a word vector splicing unit; 340. a processing unit; 350. a judgment unit;
400. and a checking and recovering module.
Detailed Description
In order to facilitate an understanding of the invention, the invention will now be described more fully hereinafter with reference to the accompanying drawings, in which several embodiments of the invention are shown, but which may be embodied in many different forms and are not limited to the embodiments described herein, but rather are provided for the purpose of providing a more thorough disclosure of the invention.
It will be understood that when an element is referred to as being "secured to" another element, it can be directly on the other element or intervening elements may also be present; when an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present; the terms "vertical," "horizontal," "left," "right," and the like as used herein are for illustrative purposes only.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs; the terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention; as used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Example 1
Referring to FIGS. 1 and 2, the table fragment link restoration method based on semantic processing according to this embodiment comprises the following steps:
S100, performing structured extraction on the table to obtain table segments;
S200, preprocessing the table segments extracted in step S100;
S300, using an LSTM deep learning model to learn the semantic information of the table context and of the data in the table, so as to judge whether adjacent table segments should be linked;
S400, performing rule verification on the model processing result, and restoring the table segments that need to be linked.
Specifically, step S100 extracts the table segments from the table according to the table structure.
Specifically, the preprocessing in step S200 comprises extracting and cleaning the context of each table segment, extracting the cell text in the table segment, converting the extracted and combined cell sequence into text, and cleaning the text to remove invalid content from both the context of the table segment and the table itself.
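As a minimal sketch of the preprocessing in step S200, the following Python code flattens the cells of a segment into text and cleans both the context and the cell text. The function names, the dictionary layout, and above all the choice of which punctuation counts as "meaningless" (here, runs of two or more filler marks such as dot leaders) are assumptions made for illustration; the text does not specify them.

```python
import re

# Assumption: "meaningless punctuation" is not defined in the text. This
# hypothetical pattern treats runs of two or more filler marks (dot leaders,
# dashes, underscores, asterisks, pipes) as noise left over from PDF layout.
NOISE = re.compile(r"[.\-_·…~*|]{2,}")
WHITESPACE = re.compile(r"\s+")


def clean_text(text):
    """Replace noise runs with a space, then collapse repeated whitespace."""
    text = NOISE.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def cells_to_text(rows, sep=" "):
    """Convert a segment's cell sequence (list of rows of strings) to text."""
    cells = [cell for row in rows for cell in row if cell and cell.strip()]
    return clean_text(sep.join(cells))


def preprocess_segment(context_before, rows, context_after):
    """Return the cleaned context and the cleaned in-table text of a segment."""
    return {
        "context": clean_text(context_before + " " + context_after),
        "table_text": cells_to_text(rows),
    }
```

Single punctuation marks, such as the decimal point in an amount or a leading minus sign, are left untouched by this pattern, so numeric cell values survive the cleaning.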
Specifically, step S300 comprises the following steps:
S310, obtaining context word vectors: using word2vec to learn the corresponding vectors for the context of each table segment;
S320, obtaining in-table text word vectors: using word2vec to learn the corresponding vectors for the in-table text of each table segment;
S330, word vector splicing: splicing the context word vectors and the in-table text word vectors;
S340, model processing: applying bidirectional LSTM processing to the text through the LSTM deep learning model, learning the semantic information of the text, and obtaining the semantic features of the table segments;
S350, restoration judgment: using a linear classifier to judge whether each group of spliced table segments should be link-restored.
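Steps S310 to S350 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: the word vectors come from a toy lookup table standing in for trained word2vec embeddings, all weights are random and untrained, "splicing" is taken to mean joining the two vector sequences end to end, and the classifier is simplified to a single link / no-link probability. All names (`embed`, `LSTMCell`, `link_probability`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def embed(tokens, table, dim):
    """S310/S320 stand-in: look up a vector per token (zeros if unknown)."""
    return [table.get(tok, [0.0] * dim) for tok in tokens]


class LSTMCell:
    """A standard LSTM cell with random, untrained weights."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        width = input_dim + hidden_dim
        # Four gates (input, forget, candidate, output), each acting on [x; h].
        self.w = [[[rng.uniform(-0.1, 0.1) for _ in range(width)]
                   for _ in range(hidden_dim)] for _ in range(4)]
        self.b = [[0.0] * hidden_dim for _ in range(4)]

    def step(self, x, h, c):
        z = x + h  # list concatenation, i.e. [x; h]
        pre = [[sum(wi * zi for wi, zi in zip(row, z)) + bj
                for row, bj in zip(w, b)] for w, b in zip(self.w, self.b)]
        i = [sigmoid(v) for v in pre[0]]
        f = [sigmoid(v) for v in pre[1]]
        g = [math.tanh(v) for v in pre[2]]
        o = [sigmoid(v) for v in pre[3]]
        c_next = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_next = [ov * math.tanh(cv) for ov, cv in zip(o, c_next)]
        return h_next, c_next

    def final_state(self, sequence):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in sequence:
            h, c = self.step(x, h, c)
        return h


def link_probability(context_vecs, table_vecs, fwd, bwd, clf_w, clf_b):
    sequence = context_vecs + table_vecs                    # S330: splice
    features = (fwd.final_state(sequence)
                + bwd.final_state(sequence[::-1]))          # S340: BiLSTM
    score = sum(w * x for w, x in zip(clf_w, features)) + clf_b
    return sigmoid(score)                                   # S350: classifier


# Usage with toy data; every value below is made up for the example.
rng = random.Random(0)
dim, hidden = 4, 3
vocab = {tok: [rng.uniform(-1, 1) for _ in range(dim)]
         for tok in ["table", "continued", "rent", "total"]}
fwd, bwd = LSTMCell(dim, hidden, rng), LSTMCell(dim, hidden, rng)
clf_w = [rng.uniform(-0.1, 0.1) for _ in range(2 * hidden)]
p = link_probability(embed(["table", "continued"], vocab, dim),
                     embed(["rent", "total"], vocab, dim),
                     fwd, bwd, clf_w, 0.0)
```

Because the weights are untrained, the value of `p` carries no meaning; the sketch only shows how the data flows from word vectors through the two LSTM directions into the linear classifier.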
Specifically, the rule check on the model processing result in step S400 comprises checking the merged-cell information and applying rule-based correction to erroneous model predictions.
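The text does not state which rules are applied in step S400. Purely as a hypothetical example of a rule that uses merged-cell information, the sketch below vetoes a predicted link when the boundary rows of the two segments do not span the same number of grid columns, counting each merged cell by its column span. The cell representation `(text, colspan)` and the rule itself are assumptions.

```python
def row_width(row):
    """Grid columns occupied by a row; each cell is a (text, colspan) pair."""
    return sum(colspan for _text, colspan in row)


def verify_link(predicted_link, prev_segment, next_segment):
    """Hypothetical rule check: keep a predicted link only if the last row of
    the previous segment and the first row of the next one span the same
    number of grid columns, taking merged cells into account."""
    if not predicted_link:
        return False
    if not prev_segment or not next_segment:
        return False
    return row_width(prev_segment[-1]) == row_width(next_segment[0])
```

This rule only corrects false positives. A rule that rescues links the model missed would need additional signals that the text does not describe.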
The text cleaning that removes invalid content from the context of the table segment and from the table specifically consists of deleting meaningless punctuation marks.
The restoration judgment in step S350 takes the preprocessed and vectorized table information as input, and the linear classifier judges the relationship between the two tables as follows:
judging whether the two tables are the same table, and if not, performing no link restoration;
if they are the same table, judging whether the last row of the previous table and the first row of the next table are the same row; if they are not the same row, the two tables are spliced directly; if they are the same row, the last row of the previous table is kept and the first row of the next table is merged into it.
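The three outcomes described above can be sketched as follows. The `Relation` enum, the function names, and the cell-by-cell concatenation used to "merge" the first row of the next table into the last row of the previous table are assumptions; the text does not say how the two rows are combined.

```python
from enum import Enum


class Relation(Enum):
    DIFFERENT_TABLE = 0
    SAME_TABLE_NEW_ROW = 1
    SAME_TABLE_SAME_ROW = 2


def _pad(row, width):
    """Pad a row with empty cells so both rows have the same length."""
    return row + [""] * (width - len(row))


def merge_rows(last_row, first_row, sep=" "):
    """Assumed merge: join the two rows cell by cell, column-aligned."""
    width = max(len(last_row), len(first_row))
    pairs = zip(_pad(last_row, width), _pad(first_row, width))
    return [a + sep + b if a and b else a + b for a, b in pairs]


def restore(prev_rows, next_rows, relation):
    """Return the restored table, or None when no link restoration applies."""
    if relation is Relation.DIFFERENT_TABLE:
        return None
    if relation is Relation.SAME_TABLE_NEW_ROW:
        return prev_rows + next_rows  # splice directly
    merged = merge_rows(prev_rows[-1], next_rows[0])
    return prev_rows[:-1] + [merged] + next_rows[1:]
```

For Chinese text, where wrapped cell content continues without a space, passing `sep=""` to `merge_rows` would be the more natural choice.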
The method of this embodiment uses the LSTM deep learning model for representation learning, automatically mines the semantic information contained in the table context and in the table data, intelligently identifies whether table segments separated by a line wrap or page break in a PDF document should be link-restored, and performs link restoration on each such group of table segments.
A table fragment link restoration system based on semantic processing, which is used for executing the above method, comprises:
a table extraction module 100, used for performing structured extraction on a table to obtain table segments;
a preprocessing module 200, used for preprocessing the extracted table segments;
a model processing module 300, used for judging whether adjacent table segments should be linked according to the semantic information of the table context and of the data in the table;
and a checking and recovering module 400, used for performing rule verification on the model processing result and restoring the table segments that need to be linked.
Specifically, the model processing module 300 includes a context word vector obtaining unit 310, a table text word vector obtaining unit 320, a word vector splicing unit 330, a processing unit 340, and a determining unit 350.
Specifically, the context word vector obtaining unit 310 is configured to obtain a corresponding context word vector by using word2vec learning for a context of each table segment; the table text word vector obtaining unit 320 is configured to obtain a corresponding table text word vector for the text in the table of each table segment by using word2vec learning.
Specifically, the word vector splicing unit 330 is configured to splice the context word vectors and the text word vectors in the table; the processing unit 340 is configured to perform bidirectional LSTM processing on the text through the LSTM deep learning model, learn semantic information of the text, and obtain semantic features of the table segments; the judging unit 350 is configured to judge, through a linear classifier, whether each group of table segments after being spliced should be subjected to link recovery.
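One way to picture how modules 100 to 400 cooperate is the following sketch, in which each module is an interchangeable callable. The class name, the field names, the `Segment` type, and the left-to-right folding over adjacent segments are assumptions for illustration and are not taken from the figures.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Segment = List[str]  # hypothetical: a segment is reduced to a list of strings


@dataclass
class LinkRestorationSystem:
    extract: Callable[[str], List[Segment]]                # module 100
    preprocess: Callable[[Segment], Segment]               # module 200
    predict_link: Callable[[Segment, Segment], bool]       # module 300
    # Module 400: returns the merged segment, or None if no link is restored.
    verify_and_restore: Callable[[Segment, Segment, bool], Optional[Segment]]

    def run(self, document: str) -> List[Segment]:
        segments = [self.preprocess(s) for s in self.extract(document)]
        if not segments:
            return []
        restored = [segments[0]]
        for nxt in segments[1:]:
            link = self.predict_link(restored[-1], nxt)
            merged = self.verify_and_restore(restored[-1], nxt, link)
            if merged is None:
                restored.append(nxt)   # keep as a separate table
            else:
                restored[-1] = merged  # fold into the previous table
        return restored
```

Comparing each new segment with the most recently restored table, rather than with the raw previous segment, lets a table that spans three or more pages be folded into a single result.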
The above-mentioned embodiments only express a certain implementation mode of the present invention, and the description thereof is specific and detailed, but not construed as limiting the scope of the present invention; it should be noted that, for those skilled in the art, without departing from the concept of the present invention, several variations and modifications can be made, which are within the protection scope of the present invention; therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A table fragment link restoration method based on semantic processing, characterized by comprising the following steps:
S100, performing structured extraction on the table to obtain table segments;
S200, preprocessing the table segments extracted in step S100;
S300, using an LSTM deep learning model to learn the semantic information of the table context and of the data in the table, so as to judge whether adjacent table segments should be linked;
S400, performing rule verification on the model processing result, and restoring the table segments that need to be linked.
2. The table fragment link recovery method based on semantic processing as claimed in claim 1, wherein: the step S100 is to extract table segments in the table according to the table structure.
3. The table fragment link restoration method based on semantic processing as claimed in claim 1, wherein: the preprocessing in step S200 comprises extracting and cleaning the context of each table segment, extracting the cell text in the table segment, converting the extracted and combined cell sequence into text, and cleaning the text to remove invalid content from both the context of the table segment and the table itself.
4. The table fragment link restoration method based on semantic processing according to claim 1, wherein step S300 comprises the following steps:
S310, obtaining context word vectors: using word2vec to learn the corresponding vectors for the context of each table segment;
S320, obtaining in-table text word vectors: using word2vec to learn the corresponding vectors for the in-table text of each table segment;
S330, word vector splicing: splicing the context word vectors and the in-table text word vectors;
S340, model processing: applying bidirectional LSTM processing to the text through the LSTM deep learning model, learning the semantic information of the text, and obtaining the semantic features of the table segments;
S350, restoration judgment: using a linear classifier to judge whether each group of spliced table segments should be link-restored.
5. The table fragment link restoration method based on semantic processing as claimed in claim 1, wherein: the rule check on the model processing result in step S400 comprises checking the merged-cell information and applying rule-based correction to erroneous model predictions.
6. The table fragment link restoration method based on semantic processing as claimed in claim 3, wherein: the text cleaning specifically comprises deleting meaningless punctuation marks from the context of the table segments and from the invalid content in the table.
7. The table fragment link restoration method based on semantic processing as claimed in claim 4, wherein: the restoration judgment in step S350 takes the preprocessed and vectorized table information as input, and the linear classifier judges the relationship between the two tables as follows:
judging whether the two tables are the same table, and if not, performing no link restoration;
if they are the same table, judging whether the last row of the previous table and the first row of the next table are the same row; if they are not the same row, the two tables are spliced directly; if they are the same row, the last row of the previous table is kept and the first row of the next table is merged into it.
8. A table fragment link restoration system based on semantic processing, characterized in that: the system is used for executing the method of any one of the preceding claims 1-7, the system comprising:
a table extraction module (100), used for performing structured extraction on a table to obtain table segments;
a preprocessing module (200), used for preprocessing the extracted table segments;
a model processing module (300), used for judging whether adjacent table segments should be linked according to the semantic information of the table context and of the data in the table;
and a checking and recovering module (400), used for performing rule verification on the model processing result and restoring the table segments that need to be linked.
9. The system for recovering table fragment link based on semantic processing according to claim 8, wherein: the model processing module (300) comprises a context word vector obtaining unit (310), a table text word vector obtaining unit (320), a word vector splicing unit (330), a processing unit (340) and a judging unit (350).
10. The table fragment link restoration system based on semantic processing according to claim 9, wherein: the context word vector obtaining unit (310) is used for learning a corresponding context word vector for the context of each table segment by using word2vec; the table text word vector obtaining unit (320) is used for learning a corresponding table text word vector for the in-table text of each table segment by using word2vec; the word vector splicing unit (330) is used for splicing the context word vectors and the in-table text word vectors; the processing unit (340) is used for applying bidirectional LSTM processing to the text through the LSTM deep learning model, learning the semantic information of the text, and obtaining the semantic features of the table segments; the judging unit (350) is used for judging, through a linear classifier, whether each group of spliced table segments should be link-restored.
CN202011621485.8A 2020-12-30 2020-12-30 Table fragment link restoration method and system based on semantic processing Pending CN112632927A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011621485.8A CN112632927A (en) 2020-12-30 2020-12-30 Table fragment link restoration method and system based on semantic processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011621485.8A CN112632927A (en) 2020-12-30 2020-12-30 Table fragment link restoration method and system based on semantic processing

Publications (1)

Publication Number Publication Date
CN112632927A true CN112632927A (en) 2021-04-09

Family

ID=75287668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011621485.8A Pending CN112632927A (en) 2020-12-30 2020-12-30 Table fragment link restoration method and system based on semantic processing

Country Status (1)

Country Link
CN (1) CN112632927A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688693A (en) * 2021-07-29 2021-11-23 上海浦东发展银行股份有限公司 Adjacent table processing method and device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107818075A (en) * 2017-10-16 2018-03-20 平安科技(深圳)有限公司 Form data structuring extracting method, electronic equipment and computer-readable recording medium
CN109344399A (en) * 2018-09-14 2019-02-15 重庆邂智科技有限公司 A kind of Text similarity computing method based on the two-way lstm neural network of stacking
CN109460730A (en) * 2018-11-03 2019-03-12 上海犀语科技有限公司 A kind of analysis method that table skips and device
WO2019075970A1 (en) * 2017-10-16 2019-04-25 平安科技(深圳)有限公司 Line wrap recognition method for table information, electronic device, and computer-readable storage medium
CN112100426A (en) * 2020-09-22 2020-12-18 哈尔滨工业大学(深圳) Method and system for searching general table information based on visual and text characteristics

Similar Documents

Publication Publication Date Title
CN110363252B (en) End-to-end trend scene character detection and identification method and system
CN107145479B (en) Text semantic-based chapter structure analysis method
CN110287480B (en) Named entity identification method, device, storage medium and terminal equipment
CN111027562B (en) Optical character recognition method based on multiscale CNN and RNN combined with attention mechanism
CN112836052B (en) Automobile comment text viewpoint mining method, equipment and storage medium
CN109284503B (en) Translation statement ending judgment method and system
CN109299470B (en) Method and system for extracting trigger words in text bulletin
CN110991175B (en) Method, system, equipment and storage medium for generating text in multi-mode
CN106446072A (en) Webpage content processing method and apparatus
CN109948518B (en) Neural network-based PDF document content text paragraph aggregation method
CN106372053B (en) Syntactic analysis method and device
CN111639185B (en) Relation information extraction method, device, electronic equipment and readable storage medium
CN112632927A (en) Table fragment link restoration method and system based on semantic processing
CN114821613A (en) Extraction method and system of table information in PDF
CN117373591A (en) Disease identification method and device for electronic medical record, electronic equipment and storage medium
CN109460730B (en) Analysis method and device for line and page changing of table
CN116595023A (en) Address information updating method and device, electronic equipment and storage medium
CN113051869B (en) Method and system for realizing identification of text difference content by combining semantic recognition
CN114611529B (en) Intention recognition method and device, electronic equipment and storage medium
CN114429106B (en) Page information processing method and device, electronic equipment and storage medium
KR102569381B1 (en) System and Method for Machine Reading Comprehension to Table-centered Web Documents
CN115186683A (en) Cross-modal translation-based attribute-level multi-modal emotion classification method
CN114579796A (en) Machine reading understanding method and device
KR101126186B1 (en) Apparatus and Method for disambiguation of morphologically ambiguous Korean verbs, and Recording medium thereof
CN115410207B (en) Detection method and device for vertical text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210409