CN110532346B - Method and device for extracting elements in document - Google Patents

Method and device for extracting elements in document

Info

Publication number
CN110532346B
Authority
CN
China
Prior art keywords
document
template
extracted
matching
subscript
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910650428.3A
Other languages
Chinese (zh)
Other versions
CN110532346A (en)
Inventor
王笑添
纪传俊
陈运文
纪达麒
高翔
罗巧梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Datagrand Information Technology Shanghai Co ltd
Original Assignee
Datagrand Information Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Datagrand Information Technology Shanghai Co ltd
Priority to CN201910650428.3A
Publication of CN110532346A
Application granted
Publication of CN110532346B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/338Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for extracting elements from a document. The method comprises the following steps: annotating a template document, and generating the template document together with the subscript information of its annotations; matching the template document against the document to be extracted to generate matching pairs; defining, according to the subscript information of the annotations and of the matching pairs, front and rear boundaries in the template document and front and rear boundaries in the document to be extracted; replacing the content between the front and rear boundaries in the template document with the content between the front and rear boundaries in the document to be extracted; and outputting the resulting template document content and its subscript information within the annotation as the extracted elements. The beneficial effect of the method is that historical annotation data is used effectively and the field values in a document to be extracted that shares the same template are extracted accurately, which makes the method well suited to documents with fixed templates, such as contracts.

Description

Method and device for extracting elements in document
Technical Field
The invention belongs to the field of intelligent document processing, and particularly relates to a method and a device for extracting elements in a document.
Background
Manually reading and reviewing large volumes of documents such as contracts and legal instruments is time-consuming and labor-intensive. Automatic extraction can significantly reduce this manual burden, so that business personnel save time and can focus on more important matters.
Traditional text extraction uses machine learning methods such as CRF and deep learning, with supervised learning on large amounts of annotated data. These methods therefore place requirements on the volume of training data, and a large amount of manual annotation is needed before an effective model can be put into use. In addition, their extraction accuracy is usually well below 1, so business personnel must still check whether each extraction result is correct.
An extraction method is therefore needed that achieves good results without a large amount of annotated data. If its accuracy is also very close to 1, the burden of checking extraction results is greatly reduced. The comparison extraction method is one method that addresses both problems.
The comparison extraction method uses a set of historical documents and extracts values by comparing a historical document with the document to be extracted and finding the differences. It needs little manually annotated data: only one document per template has to be annotated, which greatly reduces the annotation workload. Moreover, if the method finds a most similar template that satisfies the similarity threshold, the extraction accuracy can approach 1, so manual checking of the extraction results is largely unnecessary and the checking burden on business personnel is greatly reduced.
Disclosure of Invention
In view of the problems in the prior art, the invention provides a method and a device for extracting elements from a document. Some embodiments of the invention achieve good extraction results without a large amount of annotated data, and their accuracy, which is very close to 1, greatly reduces the burden on business personnel of checking the extraction results.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A method of extracting elements from a document, the method comprising: annotating a template document, and generating the template document together with the subscript information of its annotations; matching the template document against the document to be extracted to generate matching pairs; defining, according to the subscript information of the annotations and of the matching pairs, front and rear boundaries in the template document and front and rear boundaries in the document to be extracted; replacing the content between the front and rear boundaries in the template document with the content between the front and rear boundaries in the document to be extracted; and outputting the resulting template document content and its subscript information within the annotation as the extracted elements.
Preferably, defining the front and rear boundaries in the template document and in the document to be extracted comprises: when the annotation start subscript lies between the match start subscript and the match end subscript in the template, using the match end subscript in the template as the front boundary of the template document and the match end subscript in the document to be extracted as the front boundary of the document to be extracted; and when the match start subscript in the template lies between the annotation start subscript and the annotation end subscript, using the match start subscript in the template as the rear boundary of the template document and the match start subscript in the document to be extracted as the rear boundary of the document to be extracted.
Preferably, annotating the template document comprises: replacing each annotated part with a placeholder and updating the subscript information of the template document.
Preferably, the subscript information of the template document is updated according to a cumulative offset, which is calculated by accumulating the length differences between the placeholders and the annotated parts they replace.
Preferably, annotating the template document comprises: calculating the similarity between each of a plurality of documents and the document to be extracted, and selecting the document most similar to the document to be extracted as the template document.
Preferably, calculating the similarity between the plurality of documents and the document to be extracted comprises: normalizing the full text of the document and of the document to be extracted; splitting the document and the document to be extracted on punctuation marks to obtain a short-sentence list for each; de-duplicating each short-sentence list and removing short sentences that are empty strings to obtain a short-sentence set of the document and a short-sentence set of the document to be extracted; calculating the difference between the short-sentence set of the document to be extracted and the short-sentence set of the document; and calculating the similarity from the number of elements in this difference and the number of elements in the short-sentence set of the document to be extracted.
Preferably, selecting the document most similar to the document to be extracted as the template document comprises: presetting a threshold, and selecting the document with the highest similarity as the template document when its similarity is greater than the threshold; otherwise, reporting that no usable template document is available.
Preferably, matching the template document against the document to be extracted comprises: using SequenceMatcher from the difflib library of Python as the matching algorithm.
An apparatus for extracting elements from a document, the apparatus comprising:
an annotation module, which annotates the template document and generates the template document together with the subscript information of its annotations;
a matching module, which matches the template document against the document to be extracted to generate matching pairs;
a demarcation module, which defines front and rear boundaries in the template document and front and rear boundaries in the document to be extracted according to the subscript information of the annotations and of the matching pairs;
a replacing module, which replaces the content between the front and rear boundaries in the template document with the content between the front and rear boundaries in the document to be extracted; and
an output module, which outputs the resulting template document content and its subscript information within the annotation as the extracted elements.
Compared with the prior art, the invention has the beneficial effects that:
1. The method provides an extraction approach that differs both from machine learning methods such as CRF and deep learning and from rule-based extraction. It accurately extracts field contents from documents with fixed templates, such as contracts, and can be combined with other extraction approaches to improve the final extraction result, which is of practical significance;
2. A large amount of annotation is not needed: only one document per template has to be annotated, which greatly reduces the workload of annotators;
3. Compared with machine learning methods such as CRF and deep learning, model training is very fast;
4. The similarity algorithm matches the most similar template efficiently;
5. By finding a template that reaches the similarity threshold, field contents are extracted accurately. The result is more trustworthy than that of a machine learning method, the accuracy is close to 1, and the review burden on business personnel is greatly reduced.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a model training process for one field.
FIG. 2 is a schematic diagram of a process for traversing all label start and end subscripts.
FIG. 3 is a flow chart of a method for extracting elements from a document.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
In the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
As shown in fig. 1, the present embodiment provides a method that extracts field values by comparing annotated historical documents with the document to be extracted and finding the parts that differ. First, a comparison extraction model is trained on the annotated historical documents; the model is then used to predict on the document to be extracted and to extract the values of its fields.
1. Model definition
The model needs to know the full text of all candidate history documents for one document type, and all the labels and positions of fields in each candidate history document. Thus, we can design a model of the document type as follows:
{
history document 1,
history document 2,
history document 3,
……
}
Each history document model is as follows:
{
"content": UTF-8 full text content (preprocessed; contains no line feeds and no u'\4\3'),
"labels": {
label information 1 of the field,
label information 2 of the field,
label information 3 of the field,
……
}
}
Each piece of label information of a field is as follows:
(
"text": the annotated text content,
"start": the start subscript of the annotated text in the full text
)
This design is simple, but when the document to be extracted contains text identical to the annotated text in the template, the SequenceMatcher used later in model prediction may misalign. The design can therefore be improved by replacing every annotated text in the label information with u'\4\3', replacing the annotated text in the full text with u'\4\3' as well, and updating all annotation start subscripts accordingly. The new model of each history document is as follows:
{
"content": UTF-8 full text content (all annotated text replaced with u'\4\3'),
"labels": {
label information 1 of the field,
label information 2 of the field,
label information 3 of the field,
……
}
}
Each piece of label information of a field is as follows:
(
"text": u'\4\3',
"start": the new start subscript of the annotated text in the full text
)
Once all of this information is available, the model can be used for prediction.
2. Model training
Training the model generates, for each template (history document), the processed full text and all of its annotation information. To process a template, its full-text content is obtained first and a copy is made; all annotations are then traversed. In each iteration, the current annotated text is replaced with u'\4\3', the start subscript of the annotation is updated by the accumulated start-subscript offset, the accumulated offset itself is updated, and the corresponding annotated text in the full-text copy is replaced with u'\4\3'. After every template has been processed, a model of the document type is generated from all templates and persisted to disk, which completes model training.
A schematic diagram of a model training process for one field is shown in fig. 1, and a schematic diagram of a process for traversing all label start and end subscripts is shown in fig. 2.
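The training step described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name build_template_model and the English sample text are hypothetical, the annotation spans are assumed to be sorted, non-overlapping, half-open [start, end) pairs, and persistence to disk is omitted.

```python
PLACEHOLDER = "\u0004\u0003"  # the two-character placeholder u'\4\3' from the text


def build_template_model(content, spans):
    """Replace each annotated span with the placeholder and re-index the labels.

    `spans` is a list of half-open (start, end) index pairs into `content`,
    assumed sorted and non-overlapping (an assumption of this sketch).
    """
    labels = []
    pieces = []
    last_end = 0    # end of the previous annotation in the original text
    len_delta = 0   # accumulated start-subscript offset from earlier replacements
    for start, end in spans:
        # The new start subscript is the old one shifted by the accumulated offset.
        labels.append({"text": PLACEHOLDER, "start": start + len_delta})
        # Each replacement changes the text length by len(placeholder) - len(annotation).
        len_delta += len(PLACEHOLDER) - (end - start)
        pieces.append(content[last_end:start])
        pieces.append(PLACEHOLDER)
        last_end = end
    pieces.append(content[last_end:])  # text after the last annotation
    return {
        "content_with_label": "".join(pieces),
        "content_without_label": content,
        "labels": labels,
    }


# Hypothetical template: "Alice Smith" occupies [9, 20) and "Bob" occupies [31, 34).
model = build_template_model("Party A: Alice Smith. Party B: Bob.", [(9, 20), (31, 34)])
```

Because the first annotation is eleven characters long and the placeholder only two, the second label's start subscript moves from 31 to 22 in the processed text, which is the effect of the accumulated offset.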
3. Model prediction
The general flow of model prediction is as follows. For a document to be extracted, the most similar template is found first. If the similarity is below the threshold, template matching has failed and no data is extracted. If the similarity reaches the threshold, the document to be extracted is matched in full against the most similar template, every matching pair is collected, and the corresponding values of all fields are extracted with the comparison extraction algorithm.
1. Most similar template matching
To improve performance and the user experience in real scenarios, the algorithm that finds the most similar template must not take too long. A simple and efficient approach is to calculate the similarity from the set difference of short sentences, which suits documents with regular templates such as contracts. The specific method is as follows:
All templates are traversed and the similarity between each template and the document to be extracted is calculated. The template with the highest similarity is then selected, and its similarity is compared with the user-specified similarity threshold. If the threshold is not reached, matching of the most similar template has failed, and the method returns directly without extracting from the document; if it is reached, processing continues with the next step.
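As a rough sketch of this selection step, assuming a similarity function is supplied by the caller: the names pick_template and word_overlap are hypothetical, and word_overlap is only a toy stand-in for the short-sentence similarity described in this section. The sketch treats a similarity equal to the threshold as a successful match, which is one reading of "reaches the threshold".

```python
def pick_template(templates, new_text, similarity, threshold):
    """Return (best_template, best_score); best_template is None when matching fails."""
    best_template, best_score = None, float("-inf")
    for template in templates:
        score = similarity(template, new_text)
        if score > best_score:
            best_template, best_score = template, score
    if best_score < threshold:
        return None, best_score  # no template reaches the threshold: skip extraction
    return best_template, best_score


def word_overlap(template, new_text):
    """Toy stand-in similarity: share of the new text's words found in the template."""
    new_words = set(new_text.split())
    if not new_words:
        return 0.0
    return len(new_words & set(template.split())) / len(new_words)


templates = ["sales contract between A and B", "lease agreement for office space"]
best, score = pick_template(templates, "sales contract between C and D", word_overlap, 0.6)
nothing, low = pick_template(templates, "employment offer letter", word_overlap, 0.6)
```

Here the first call selects the sales contract template, while the second call finds no template above the threshold and returns None in place of a template.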
The step of calculating the similarity between a template and a document to be extracted is as follows:
1) Normalize the full text of the template and of the document to be extracted (remove white space, digits, decimal points, percent signs, thousands separators, dashes, amounts written in upper- or lower-case numerals, units and the like), then split each text on punctuation marks to obtain a short-sentence list for the template and one for the document to be extracted. The punctuation marks include Chinese and English commas, semicolons, colons, parentheses, square brackets, book-title marks, enumeration commas, slashes, periods and the like.
2) Convert each short-sentence list to a set to de-duplicate it, and remove empty-string elements, to obtain the short-sentence sets of the template and of the document to be extracted, called set_template and set_new respectively.
3) Calculate the set difference between the document to be extracted and the template:
diff_set_new_minus_template = set_new - set_template
4) Calculate the similarity between the document to be extracted and the template:
similarity = 1 - float(len(diff_set_new_minus_template)) / len(set_new)
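Steps 1) to 4) can be sketched as below. The normalization and punctuation patterns are simplified assumptions chosen for an English example (only white space, digits and percent signs are stripped, and letters are lower-cased); they are not the full character lists of the method. The helper name phrase_set is hypothetical, and returning 0.0 for an empty document is likewise an assumption.

```python
import re

_STRIP = re.compile(r"[\s\d%]+")  # simplified normalization (assumption)
_SPLIT = re.compile(r"[,;:()\[\]/.，；：（）【】《》、。]")  # assumed punctuation set


def phrase_set(text):
    """Normalize the text, split it on punctuation, and return the non-empty phrases."""
    normalized = _STRIP.sub("", text).lower()
    return {phrase for phrase in _SPLIT.split(normalized) if phrase}


def similarity(template_text, new_text):
    """1 - |set_new - set_template| / |set_new|, following steps 3) and 4)."""
    set_template = phrase_set(template_text)
    set_new = phrase_set(new_text)
    if not set_new:
        return 0.0  # assumption: an empty document matches no template
    diff_set_new_minus_template = set_new - set_template
    return 1 - len(diff_set_new_minus_template) / len(set_new)


template = "This is a sales contract. Party A: Alice. Party B: Bob. Price: 100 yuan."
new_doc = "This is a sales contract. Party A: Carol. Party B: Dave. Price: 250 yuan."
score = similarity(template, new_doc)
```

In this example the new document yields seven short sentences, of which only the two party names are missing from the template, so the similarity is 1 - 2/7. Note that the measure is asymmetric: it is normalized by the short-sentence set of the document to be extracted, not by that of the template.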
A cache decorator can be added to the similarity function. For example, after importing with "from cachetools import cached, LRUCache", the function can be decorated with
@cached(cache=LRUCache(maxsize=10000)), which avoids recomputing the similarity against the same template for different fields of the same document to be extracted. For this purpose, a key "content_without_label" is defined in the template; its value is the full text of the template without placeholder substitution, and it is identical across the different fields of the same template.
2. Generating full text matching information
The comparison extraction algorithm in the next step requires the full-text matching information of the template and the document to be extracted. The difflib library of Python provides SequenceMatcher, which can be used directly to obtain this information, namely the two start subscripts and the matching length of every matching pair. This is the most time-consuming step of prediction in the comparison extraction model.
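A small sketch of obtaining the matching pairs with difflib follows. The wrapper name matching_blocks and the sample strings are hypothetical. Passing autojunk=False is an assumption of this sketch rather than something the text specifies: it disables difflib's heuristic that treats very frequent elements as junk in sequences of 200 or more items.

```python
from difflib import SequenceMatcher


def matching_blocks(template_text, new_text):
    """Return (template_start, new_start, length) triples, skipping empty blocks."""
    matcher = SequenceMatcher(None, template_text, new_text, autojunk=False)
    # get_matching_blocks() always ends with a dummy block of length 0; drop it here.
    return [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size > 0]


template = "Party A: Alice. Party B: Bob."
new_doc = "Party A: Carol. Party B: Dave."
blocks = matching_blocks(template, new_doc)
```

Each triple states that template[a:a+size] equals new_doc[b:b+size], and the triples are returned in increasing order of position, which is what the forward and reverse boundary searches below rely on.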
3. Comparison and extraction algorithm
For a document to be extracted and its most similar template, comparison extraction is performed one label at a time. The basic idea for each label is to first determine, within the annotated field interval, the front and rear boundaries in both the template and the document to be extracted. The part of the template between the front and rear boundaries is then replaced with the content between the front and rear boundaries of the document to be extracted, and this content is spliced with the annotated text before the front boundary and the annotated text after the rear boundary; the three parts together form the field value to be extracted. Several boundary conditions must be handled correctly.
When locating the front boundary, the annotation data may be wrong or no boundary may be found; in that case an empty extraction result is returned for this label. If the end subscript of the label is smaller than the front boundary, the annotated part matches completely, that is, the value of the field in the document to be extracted is identical to the annotation in the template. The annotated text is then returned directly as the field value, together with its start subscript in the document to be extracted.
If the front boundary is found, the rear boundary is located next. If the annotation data is wrong or no rear boundary can be found, an empty extraction result is returned.
If both a front boundary and a rear boundary matching the annotated interval are found, the offset between the annotation start subscript and the front boundary in the template is calculated and added to the front boundary in the document to be extracted, which gives the field start subscript in the document to be extracted. The text before the front boundary within the annotated interval of the template, the text between the front and rear boundaries in the document to be extracted, and the text after the rear boundary within the annotated interval of the template are then spliced together to give the extracted text for this label, that is, the field value. The field start subscript and the field value are all the information to be extracted for one label of the document to be extracted. A schematic diagram of the flow of the comparison extraction algorithm is shown in fig. 3.
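The splicing step can be sketched as follows, assuming both boundary pairs have already been located and the label interval is half-open. The function name splice_field, its parameter names and the sample strings are hypothetical.

```python
def splice_field(template, new_doc, label_start, label_end,
                 template_front, new_front, template_rear, new_rear):
    """Return (field_start, field_value) for one label, given both boundary pairs."""
    # Offset between the label start and the front boundary in the template (<= 0).
    offset = label_start - template_front
    field_start = new_front + offset
    field_value = (
        template[label_start:template_front]   # label text before the front boundary
        + new_doc[new_front:new_rear]          # differing text from the new document
        + template[template_rear:label_end]    # label text after the rear boundary
    )
    return field_start, field_value


# Hypothetical example: the label "Alice Smith" occupies [7, 18) in the template.
template = "Buyer: Alice Smith."
new_doc = "Buyer: Alan Smith."
# "Buyer: Al" matches up to index 9 in both texts (front boundaries 9 and 9);
# " Smith." matches from index 12 in the template and 11 in the new document.
start, value = splice_field(template, new_doc, 7, 18, 9, 9, 12, 11)
```

In this example the three spliced parts are "Al" (from the template), "an" (from the new document) and " Smith" (from the template), which give the field value "Alan Smith" starting at subscript 7 of the new document.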
3.1 front boundary positioning
If the annotation begins at the very start of the template's full text, so that no text precedes it, the field value to be extracted is taken to lie at the very start of the document to be extracted, and the front boundaries of both the template and the document to be extracted are set directly to subscript 0. Otherwise, the front boundaries of the template and the document to be extracted are obtained with the front and rear boundary positioning auxiliary function.
Because the search runs forward, the sequence of matching pairs is passed in forward order.
The matching condition function passed in is:
lambda label_start, label_end, template_match_start, template_match_end, \
new_match_start, new_match_end: \
(template_match_start < label_start and label_start <= template_match_end)
The return function passed in is:
lambda label_start, label_end, template_match_start, template_match_end, \
new_match_start, new_match_end: \
(template_match_end, new_match_end)
Here, label_start is the annotation start subscript in the template, label_end is the annotation end subscript in the template, template_match_start is the match start subscript of a matching pair in the template, template_match_end is the match end subscript of a matching pair in the template, new_match_start is the match start subscript of a matching pair in the document to be extracted, and new_match_end is the match end subscript of a matching pair in the document to be extracted.
3.2 rear boundary positioning
If the annotation ends at the very end of the template's full text, so that no text follows it, the field value to be extracted is taken to lie at the very end of the document to be extracted, and the rear boundaries of the template and of the document to be extracted are both set to the full-text length of the document. Otherwise, the rear boundaries of the template and the document to be extracted are obtained with the front and rear boundary positioning auxiliary function.
Because the search runs backward, the sequence of matching pairs is passed in reverse order.
The matching condition function passed in is:
lambda label_start, label_end, template_match_start, template_match_end, \
new_match_start, new_match_end: \
(label_start <= template_match_start and template_match_start <= label_end)
The return function passed in is:
lambda label_start, label_end, template_match_start, template_match_end, \
new_match_start, new_match_end: \
(template_match_start, new_match_start)
The definitions of label_start, label_end, template_match_start, template_match_end, new_match_start and new_match_end are the same as in front boundary positioning.
3.3 front and rear boundary positioning auxiliary function
The front and rear boundary positioning auxiliary function returns the result produced by a designated function the first time the matching condition is satisfied. It takes the following parameters:
1) ordered_matching_blocks: the ordered sequence of SequenceMatcher matching blocks, in either forward or reverse order.
2) func_find: a callback function; when it returns true, the matching condition is considered satisfied and the auxiliary function returns a result.
3) func_get: a callback function; the auxiliary function returns whatever this function returns.
4) label_start: the annotation start subscript in the template.
5) label_end: the annotation end subscript in the template.
The auxiliary function traverses the matching pair sequence ordered_matching_blocks in order. Because the list of matching pairs returned by SequenceMatcher contains a pair whose matching length is 0, the loop skips to the next iteration when it encounters such a pair.
In each iteration, the match start subscripts in the template and in the document to be extracted and the matching text length are obtained, and the match end subscripts in the template and in the document to be extracted are calculated from them. Six parameters are then passed to the callback func_find: the annotation start subscript label_start in the template, the annotation end subscript label_end in the template, the match start subscript template_match_start in the template, the match end subscript template_match_end in the template, the match start subscript new_match_start in the document to be extracted, and the match end subscript new_match_end in the document to be extracted. If func_find returns true, the same six parameters are passed to the callback func_get, and the result of func_get is returned as the result of the auxiliary function.
If the whole sequence ordered_matching_blocks has been traversed and no matching pair satisfies func_find, the front or rear boundary being sought has not been found, and an empty value is returned.
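The auxiliary function and its two uses can be sketched as below. The names find_boundary, front_boundary and rear_boundary are hypothetical; the callbacks follow the lambdas given in 3.1 and 3.2. Setting each rear boundary to its own document's length when the annotation ends the template is an assumption of this sketch, as is autojunk=False.

```python
from difflib import SequenceMatcher


def find_boundary(ordered_matching_blocks, func_find, func_get, label_start, label_end):
    """Return func_get(...) for the first block that satisfies func_find, else None."""
    for template_match_start, new_match_start, size in ordered_matching_blocks:
        if size == 0:
            continue  # SequenceMatcher ends its block list with a zero-length block
        template_match_end = template_match_start + size
        new_match_end = new_match_start + size
        args = (label_start, label_end, template_match_start, template_match_end,
                new_match_start, new_match_end)
        if func_find(*args):
            return func_get(*args)
    return None  # boundary not found


def front_boundary(blocks, label_start, label_end):
    if label_start == 0:
        return (0, 0)  # the annotation starts the template: no preceding text
    return find_boundary(
        blocks,
        lambda ls, le, tms, tme, nms, nme: tms < ls and ls <= tme,
        lambda ls, le, tms, tme, nms, nme: (tme, nme),
        label_start, label_end)


def rear_boundary(blocks, label_start, label_end, template_len, new_len):
    if label_end == template_len:
        return (template_len, new_len)  # assumption: each text's own length
    return find_boundary(
        list(reversed(blocks)),  # reverse order for the backward search
        lambda ls, le, tms, tme, nms, nme: ls <= tms and tms <= le,
        lambda ls, le, tms, tme, nms, nme: (tms, nms),
        label_start, label_end)


template = "Buyer: Alice Smith."   # hypothetical; label "Alice Smith" is [7, 18)
new_doc = "Buyer: Alan Smith."
blocks = SequenceMatcher(None, template, new_doc, autojunk=False).get_matching_blocks()
front = front_boundary(blocks, 7, 18)
rear = rear_boundary(blocks, 7, 18, len(template), len(new_doc))
```

For these sample strings the matching blocks are "Buyer: Al" and " Smith.", so the front boundaries are (9, 9) and the rear boundaries are (12, 11). Passing the condition and the return value as callbacks lets one traversal routine serve both searches; only the block order and the two lambdas differ.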
With the comparison extraction method, historical annotation data is used effectively to extract accurately the field values in a document to be extracted that shares the same template, which makes the method well suited to documents with fixed templates such as contracts. Training is fast and prediction is slower; the performance bottleneck of prediction is mainly the generation by SequenceMatcher of the full-text matching information between the most similar template (history document) and the document to be extracted. For different fields of the same document type, or for different document types, prediction can be accelerated with cluster-based parallel computing.
One embodiment is provided below in connection with the specific text:
1. Model training example for one template
1) Input:
Template full text content: "Confucius, given name Qiu, courtesy name Zhongni, ancestral home in the State of Song, founder of the Confucian school. Zhang Liang, courtesy name Zifang, letter, born in the State of Han."
Template annotation start and end subscript list: [[0,2), [23,25)]. The subscripts are character positions in the original Chinese text, in which "Confucius" (孔子) occupies [0,2) and "Zhang Liang" (张良) occupies [23,25).
2) Processing:
Initialize labels to an empty list.
Initialize placeholder_version_line_content = "".
Initialize last_end_in_line_content = 0.
Initialize the accumulated offset len_delta = 0.
i) First iteration: the annotation start subscript label_start is 0 and the annotation end subscript label_end is 2:
The start and end subscripts [0,2) correspond to "Confucius" (孔子, two characters in the original Chinese text).
New start subscript = 0 + len_delta = 0.
Update the accumulated offset: len_delta += (len("\u0004\u0003") - len("孔子")), giving len_delta = 0.
Append the new label text and the new label start subscript ["\u0004\u0003", 0] to labels.
Update placeholder_version_line_content +=
(line_content[last_end_in_line_content:label_start] + "\u0004\u0003"), giving placeholder_version_line_content = "\u0004\u0003".
Update last_end_in_line_content = label_end = 2.
ii) Second iteration: annotation start and end subscripts [23,25):
The start and end subscripts [23,25) correspond to "Zhang Liang" (张良, two characters in the original Chinese text).
New start subscript = 23 + len_delta = 23.
Update the accumulated offset: len_delta += (len("\u0004\u0003") - len("张良")), giving len_delta = 0.
Append the new label text and the new label start subscript ["\u0004\u0003", 23] to labels.
Update placeholder_version_line_content +=
(line_content[last_end_in_line_content:label_start] + "\u0004\u0003"), giving placeholder_version_line_content = "\u0004\u0003, given name Qiu, courtesy name Zhongni, ancestral home in the State of Song, founder of the Confucian school. \u0004\u0003".
Update last_end_in_line_content = label_end = 25.
iii) Post-treatment
Make up for the end of
Placement holder_version_line_content+ = line_content [ last_end_in_line_content: ] gets "\u0004\u0003, name hump, word sec.
U 0004/u 0003, word ovary, the letter, is produced in korea. ".
content_with_label is placeholder_version_line_content.
content_without_label is the template full text.
3) And (3) outputting:
Figure BDA0002135006400000151
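The placeholder substitution walked through above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name replace_labels_with_placeholder is hypothetical, the variable names follow the example, and the label spans are assumed to be sorted, non-overlapping half-open intervals.

```python
PLACEHOLDER = "\u0004\u0003"


def replace_labels_with_placeholder(line_content, label_spans):
    """Replace each labeled span with PLACEHOLDER and track shifted subscripts.

    label_spans is assumed to be a sorted list of non-overlapping
    half-open (start, end) pairs into line_content.
    """
    labels = []
    placeholder_version_line_content = ""
    last_end_in_line_content = 0
    len_delta = 0  # accumulated length difference from earlier replacements

    for label_start, label_end in label_spans:
        # Where this label starts in the placeholder version of the text.
        new_start = label_start + len_delta
        # Accumulate the length difference between placeholder and label.
        len_delta += len(PLACEHOLDER) - (label_end - label_start)
        labels.append([PLACEHOLDER, new_start])
        # Copy the unlabeled text since the previous label, then the placeholder.
        placeholder_version_line_content += (
            line_content[last_end_in_line_content:label_start] + PLACEHOLDER
        )
        last_end_in_line_content = label_end

    # Post-processing: append whatever follows the last label.
    placeholder_version_line_content += line_content[last_end_in_line_content:]
    return placeholder_version_line_content, labels
```

When a label is longer or shorter than the two-character placeholder, len_delta becomes nonzero and shifts every later start subscript, which is the case the accumulated offset exists to handle.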
2. Similarity calculation between a template and the document to be extracted: example
Assume the similarity threshold is 0.6. In the texts below, [amount] stands for a sum written out in capital Chinese numerals; the template's amount ends in "伍角陆分" (five jiao six fen) and the document's amount ends in "陆元整" (six yuan even).
1) Input:
Template full text: "This contract is a purchase and sale contract. Party A: Confucius. Party B: Bang Zi. Party A purchases Party B's services and shall pay RMB [amount]. This contract takes effect on the date both parties sign."
Full text of the document to be extracted: "This contract is a purchase and sale contract. Party A: Xuanzi. Party B: Han Feizi. Party A purchases Party B's services and shall pay RMB [amount]. This contract takes effect on the date both parties sign."
2) Processing:
i) Normalization: remove whitespace, digits, English periods, percent signs, thousands separators, dashes, and amounts written in capital or lowercase numerals together with their units.
Template: "This contract is a purchase and sale contract. Party A: Confucius. Party B: Bang Zi. Party A purchases Party B's services and shall pay RMB. This contract takes effect on the date both parties sign."
Document to be extracted: "This contract is a purchase and sale contract. Party A: Xuanzi. Party B: Han Feizi. Party A purchases Party B's services and shall pay RMB. This contract takes effect on the date both parties sign."
ii) Split on punctuation marks, both Chinese and English: commas, semicolons, colons, parentheses, square brackets, book-title marks, enumeration commas, slashes, and periods:
Template: ["This contract is a purchase and sale contract", "Party A", "Confucius", "Party B", "Bang Zi", "Party A purchases Party B's services and shall pay RMB", "This contract takes effect on the date both parties sign"].
Document to be extracted: ["This contract is a purchase and sale contract", "Party A", "Xuanzi", "Party B", "Han Feizi", "Party A purchases Party B's services and shall pay RMB", "This contract takes effect on the date both parties sign"].
iii) Compute the phrase similarity percentage from the set difference
De-duplicate the template phrases into a set and remove empty strings: set_template = ["This contract is a purchase and sale contract", "Party A", "Confucius", "Party B", "Bang Zi", "Party A purchases Party B's services and shall pay RMB", "This contract takes effect on the date both parties sign"].
De-duplicate the phrases of the document to be extracted into a set and remove empty strings: set_new = ["This contract is a purchase and sale contract", "Party A", "Xuanzi", "Party B", "Han Feizi", "Party A purchases Party B's services and shall pay RMB", "This contract takes effect on the date both parties sign"].
Subtract the template phrase set from the phrase set of the document to be extracted:
diff_set_new_minus_template = set_new - set_template = ["Xuanzi", "Han Feizi"].
Similarity = 1 - float(len(diff_set_new_minus_template)) / len(set_new) = 1 - 2.0/7 ≈ 0.714.
3) Output:
Similarity 0.714 (if this is the template with the highest similarity, then because it reaches the threshold of 0.6 it is the most similar template that meets the similarity threshold requirement).
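The similarity calculation above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the helper names phrase_set, similarity and most_similar_template are hypothetical, the normalization covers only numbers, whitespace and dashes (amounts written in Chinese numerals are not handled), and treating a similarity equal to the threshold as acceptable follows the example's wording ("reaches the threshold").

```python
import re

# Assumed normalization: numbers with thousands separators, decimal points
# and an optional trailing percent sign, then whitespace and dashes.
_NUMBER_RE = re.compile(r"\d+(?:[,.]\d+)*%?")
_SPACE_DASH_RE = re.compile(r"[\s\-\u2014]")
# Chinese and English punctuation used as phrase delimiters.
_SPLIT_RE = re.compile(r"[,;:()\[\]/.，；：（）【】《》、。]")


def phrase_set(text):
    """Normalize text, split it on punctuation, and return the set of
    non-empty phrases (duplicates collapse automatically)."""
    normalized = _SPACE_DASH_RE.sub("", _NUMBER_RE.sub("", text))
    return {phrase for phrase in _SPLIT_RE.split(normalized) if phrase}


def similarity(template_text, new_text):
    """1 - |set_new - set_template| / |set_new|, as in the worked example."""
    set_template = phrase_set(template_text)
    set_new = phrase_set(new_text)
    if not set_new:
        return 0.0
    diff_set_new_minus_template = set_new - set_template
    return 1 - float(len(diff_set_new_minus_template)) / len(set_new)


def most_similar_template(templates, new_text, threshold=0.6):
    """Return the highest-scoring template, or None when no template
    reaches the threshold (assumption: equality counts as reaching it)."""
    best = max(templates, key=lambda t: similarity(t, new_text), default=None)
    if best is None or similarity(best, new_text) < threshold:
        return None
    return best
```

Because whitespace is removed during normalization, English phrases are compared with their words run together; for Chinese text, which has no word spacing, the step changes nothing visible.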
3. Comparison extraction between the most similar template and the document to be extracted: example
Assume the field to be extracted is Party A. In the texts below, [amount] stands for a sum written out in capital Chinese numerals; the template's amount ends in "伍角陆分" (five jiao six fen) and the document's amount ends in "陆元整" (six yuan even).
1) Input:
Full text (with labels) of the most similar template that reaches the similarity threshold: "This contract is a purchase and sale contract. Party A: \u0004\u0003. Party B: Bang Zi. Party A purchases Party B's services and shall pay RMB [amount]. This contract takes effect on the date both parties sign."
Full text of the document to be extracted: "This contract is a purchase and sale contract. Party A: Xuanzi. Party B: Han Feizi. Party A purchases Party B's services and shall pay RMB [amount]. This contract takes effect on the date both parties sign."
2) Processing:
i) SequenceMatcher generates the matching pair information:
Template "This contract is a purchase and sale contract. Party A: " corresponds to document "This contract is a purchase and sale contract. Party A: ";
Template ". Party B: " corresponds to document ". Party B: ";
Template "zi. Party A purchases Party B's services and shall pay RMB" corresponds to document "zi. Party A purchases Party B's services and shall pay RMB" (the leading "zi" is the shared final character "子" of Party B's name);
Template "陆" (the capital numeral six in the amount) corresponds to document "陆";
Template ". This contract takes effect on the date both parties sign." corresponds to document ". This contract takes effect on the date both parties sign.".
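Matching pairs of this kind can be produced with SequenceMatcher from Python's difflib. The sketch below is an assumption about how the pairs might be collected, not the patented implementation: the function name matching_pairs and the tuple layout are hypothetical, and autojunk is disabled so that frequent characters in long documents are not discarded by difflib's heuristic.

```python
from difflib import SequenceMatcher


def matching_pairs(template, document):
    """Return (template_start, template_end, document_start, document_end)
    for every non-empty matching block, in left-to-right order."""
    matcher = SequenceMatcher(None, template, document, autojunk=False)
    return [
        (m.a, m.a + m.size, m.b, m.b + m.size)
        for m in matcher.get_matching_blocks()
        if m.size > 0  # the final block is a zero-length sentinel
    ]
```

Each tuple gives half-open subscript ranges, so template[t_start:t_end] and document[d_start:d_end] are always the same string.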
ii) Determine the front boundary (subscripts count characters of the original Chinese text):
Search the matching pairs from left to right for the first pair whose template interval (template match start subscript, template match end subscript] contains the label start subscript; the template match end subscript of that pair is the front boundary. The label start subscript is 14. In the template, the first matching pair, "This contract is a purchase and sale contract. Party A: ", has start subscript 0 and end subscript 14, and 14 lies in the interval (0, 14], so the front boundary of the template is 14; the corresponding front boundary of the document to be extracted is also 14.
iii) Determine the rear boundary:
Analogously to the front boundary, search the matching pairs from right to left for the first pair whose template match start subscript lies in the interval [label start subscript, label end subscript]; that template match start subscript is the rear boundary. The label start subscript is 14 and the label end subscript is 16. In the template, the second matching pair, ". Party B: ", has start subscript 16, and 16 lies in the interval [14, 16], so the rear boundary of the template is 16; the corresponding rear boundary of the document to be extracted is also 16.
iv) Obtain the field value and its start subscript
Extracted field value for the label
= label text[0 : template front boundary - label start subscript] + document to be extracted[document front boundary : document rear boundary] + labeled template[template rear boundary : label end subscript]
= "\u0004\u0003"[0:14-14]
+ (full text of the document to be extracted)[14:16]
+ (full text of the labeled template)[16:16]
= "Xuanzi".
Start subscript of the extracted field for the label = document front boundary - (template front boundary - label start subscript) = 14 - (14 - 14) = 14.
3) Output:
Extraction result for the field of the document to be extracted: [("Xuanzi", 14)].
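The boundary search and value assembly above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function name extract_field is hypothetical, the front-boundary interval is taken as (match start, match end] and the rear-boundary interval as [label start, label end], and the function returns None when either boundary cannot be found, a case the example does not cover.

```python
from difflib import SequenceMatcher


def extract_field(template, document, label_start, label_end):
    """Extract the document text aligned with template[label_start:label_end].

    Returns (value, start_subscript_in_document), or None if no front or
    rear boundary is found (assumed fallback, not specified in the example).
    """
    matcher = SequenceMatcher(None, template, document, autojunk=False)
    blocks = [m for m in matcher.get_matching_blocks() if m.size > 0]

    # Front boundary: first block, left to right, whose template interval
    # (start, end] contains the label start subscript.
    front = None
    for m in blocks:
        if m.a < label_start <= m.a + m.size:
            front = (m.a + m.size, m.b + m.size)
            break

    # Rear boundary: first block, right to left, whose template start
    # subscript lies in [label_start, label_end].
    rear = None
    for m in reversed(blocks):
        if label_start <= m.a <= label_end:
            rear = (m.a, m.b)
            break

    if front is None or rear is None:
        return None

    template_front, document_front = front
    template_rear, document_rear = rear
    label_text = template[label_start:label_end]
    value = (
        label_text[: template_front - label_start]
        + document[document_front:document_rear]
        + template[template_rear:label_end]
    )
    start = document_front - (template_front - label_start)
    return value, start
```

With an English rendering of the example, in which the placeholder for Party A occupies subscripts [9, 11) of the template, the call extract_field("Party A: \u0004\u0003. Party B: Bang Zi.", "Party A: Xuanzi. Party B: Han Feizi.", 9, 11) yields ("Xuanzi", 9).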
Although the invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications and improvements can be made on the basis of this disclosure without departing from the spirit and scope of the invention.

Claims (6)

1. A method of extracting elements from a document, the method comprising:
labeling a template document to generate the labeled template document and the subscript information of its labels, wherein labeling the template document comprises: replacing each labeled part with a placeholder, and updating the subscript information of the template document based on an accumulated offset, the accumulated offset being calculated by accumulating the length differences between the placeholders and the labeled parts they replace;
matching the template document with the document to be extracted to generate matching pairs;
defining, according to the subscript information of the labels and of the matching pairs, front and rear boundaries in the template document and front and rear boundaries in the document to be extracted, including: when the label start subscript lies between the matching start subscript and the matching end subscript in the template, taking the matching end subscript in the template as the front boundary of the template document and the matching end subscript in the document to be extracted as the front boundary of the document to be extracted; and when the matching start subscript in the template lies between the label start subscript and the label end subscript, taking the matching start subscript in the template as the rear boundary of the template document and the matching start subscript in the document to be extracted as the rear boundary of the document to be extracted;
replacing the content within the front and rear boundaries in the template document with the content within the front and rear boundaries in the document to be extracted;
and outputting the labels in the template document and their subscript information as the extracted elements.
2. The method of extracting elements from a document of claim 1, wherein labeling the template document comprises:
calculating the similarity between each of a plurality of documents and the document to be extracted, and selecting the document most similar to the document to be extracted as the template document.
3. The method of extracting elements from a document of claim 2, wherein calculating the similarity between the plurality of documents and the document to be extracted comprises:
normalizing the full text of the document and of the document to be extracted;
splitting the document and the document to be extracted on punctuation marks to obtain a phrase list for each;
de-duplicating the phrase lists of the document and of the document to be extracted, and removing phrases that are empty strings, to obtain a phrase set of the document and a phrase set of the document to be extracted;
calculating the difference between the phrase set of the document and the phrase set of the document to be extracted;
and calculating the similarity from the number of elements in that difference and the number of elements in the phrase set of the document to be extracted.
4. The method of extracting elements from a document of claim 3, wherein selecting the document most similar to the document to be extracted as the template document comprises:
presetting a threshold; when the similarity of a document is greater than the threshold, selecting the document with the highest similarity as the template document; otherwise, feeding back that no template document is available.
5. The method of extracting elements from a document of claim 1, wherein matching the template document with the document to be extracted comprises: using SequenceMatcher from Python's difflib library as the matching algorithm.
6. An apparatus for extracting elements from a document, the apparatus comprising:
a labeling module that labels a template document and generates the labeled template document and the subscript information of its labels, wherein the labeling module replaces each labeled part with a placeholder and updates the subscript information of the template document based on an accumulated offset, the accumulated offset being calculated by accumulating the length differences between the placeholders and the labeled parts they replace;
a matching module that matches the template document with the document to be extracted to generate matching pairs; a boundary-defining module that defines, according to the subscript information of the labels and of the matching pairs, front and rear boundaries in the template document and front and rear boundaries in the document to be extracted, wherein, when the label start subscript lies between the matching start subscript and the matching end subscript in the template, the matching end subscript in the template is taken as the front boundary of the template document and the matching end subscript in the document to be extracted is taken as the front boundary of the document to be extracted; and when the matching start subscript in the template lies between the label start subscript and the label end subscript, the matching start subscript in the template is taken as the rear boundary of the template document and the matching start subscript in the document to be extracted is taken as the rear boundary of the document to be extracted;
a replacing module that replaces the content within the front and rear boundaries in the template document with the content within the front and rear boundaries in the document to be extracted; and
an output module that outputs the labels in the template document and their subscript information as the extracted elements.
CN201910650428.3A 2019-07-18 2019-07-18 Method and device for extracting elements in document Active CN110532346B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910650428.3A CN110532346B (en) 2019-07-18 2019-07-18 Method and device for extracting elements in document

Publications (2)

Publication Number Publication Date
CN110532346A CN110532346A (en) 2019-12-03
CN110532346B true CN110532346B (en) 2023-04-28

Family

ID=68660614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910650428.3A Active CN110532346B (en) 2019-07-18 2019-07-18 Method and device for extracting elements in document

Country Status (1)

Country Link
CN (1) CN110532346B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8620079B1 (en) * 2011-05-10 2013-12-31 First American Data Tree Llc System and method for extracting information from documents
WO2018142266A1 (en) * 2017-01-31 2018-08-09 Mocsy Inc. Information extraction from documents
CN108763368A (en) * 2018-05-17 2018-11-06 爱因互动科技发展(北京)有限公司 The method for extracting new knowledge point
CN109685056A (en) * 2019-01-04 2019-04-26 达而观信息科技(上海)有限公司 Obtain the method and device of document information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
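Construction of a cloud-platform-based intelligent generation model for flood-control documents; Jiang Peng et al.; 《水利信息化》 (Water Resources Informatization); 2013-06-25 (No. 03); full text *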

Also Published As

Publication number Publication date
CN110532346A (en) 2019-12-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant