CN110532346B

CN110532346B - Method and device for extracting elements in document

Info

Publication number: CN110532346B
Application number: CN201910650428.3A
Authority: CN
Inventors: 王笑添; 纪传俊; 陈运文; 纪达麒; 高翔; 罗巧梅
Original assignee: Datagrand Information Technology Shanghai Co ltd
Current assignee: Datagrand Information Technology Shanghai Co ltd
Priority date: 2019-07-18
Filing date: 2019-07-18
Publication date: 2023-04-28
Anticipated expiration: 2039-07-18
Also published as: CN110532346A

Abstract

The invention discloses a method and a device for extracting elements in a document. The method comprises the following steps: annotating a template document, and generating the template document and the index (subscript) information of its annotations; matching the template document with the document to be extracted to generate matching pairs; defining, according to the index information of the annotations and of the matching pairs, front and rear boundaries in the template document and front and rear boundaries in the document to be extracted; replacing the content between the front and rear boundaries in the template document with the content between the front and rear boundaries in the document to be extracted; and outputting the annotations in the template document and their index information as the extracted elements. The method makes effective use of historical annotation data, accurately extracts the field values of a document to be extracted that follows the same template, and is well suited to documents with fixed templates, such as contracts.

Description

Method and device for extracting elements in document

Technical Field

The invention belongs to the field of intelligent document processing, and particularly relates to a method and a device for extracting elements in a document.

Background

Manually reading and reviewing large volumes of documents such as contracts and legal instruments is time-consuming and labor-intensive. Extracting information from such documents automatically can significantly reduce the manual workload, so that business personnel save time and can focus on more important matters.

Traditional text extraction uses machine learning methods such as CRF and deep learning, which perform supervised learning on large amounts of annotated data. These methods therefore place requirements on the amount of training data, and a large amount of manual annotation is needed before an effective model can be trained and put into use. In addition, the extraction accuracy is usually well below 1, so business personnel still have to check whether the extraction results are correct.

An extraction method is therefore needed that achieves good extraction results without a large amount of annotated data. If its accuracy is also very close to 1, the burden on business personnel of checking the extraction results can be greatly reduced. The comparison extraction method is one extraction method capable of solving both problems.

The comparison extraction method extracts field values by comparing the document to be extracted against a set of historical documents and finding the differences. It requires little manually annotated data: only one document needs to be annotated per template, which greatly reduces the annotation workload. Moreover, if the method finds a most similar template that meets the similarity threshold, the extraction accuracy can approach 1, the results largely no longer need manual checking, and the review burden on business personnel is greatly reduced.

Disclosure of Invention

In view of the problems in the prior art, the invention provides a method and a device for extracting elements in a document. Some embodiments of the invention achieve good extraction results without a large amount of annotated data, and their accuracy, being very close to 1, can greatly reduce the burden on business personnel of checking the extraction results.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

A method of extracting elements in a document, the method comprising: annotating a template document, and generating the template document and the index (subscript) information of its annotations; matching the template document with the document to be extracted to generate matching pairs; defining, according to the index information of the annotations and of the matching pairs, front and rear boundaries in the template document and front and rear boundaries in the document to be extracted; replacing the content between the front and rear boundaries in the template document with the content between the front and rear boundaries in the document to be extracted; and outputting the annotations in the template document and their index information as the extracted elements.

Preferably, defining the front and rear boundaries in the template document and in the document to be extracted includes: when the annotation start index lies between the match start index and the match end index in the template, using the match end index in the template as the front boundary of the template document, and the match end index in the document to be extracted as the front boundary of the document to be extracted; and when the match start index in the template lies between the annotation start index and the annotation end index, using the match start index in the template as the rear boundary of the template document, and the match start index in the document to be extracted as the rear boundary of the document to be extracted.

Preferably, annotating the template document comprises: replacing the annotated portions with placeholders and updating the index information of the template document.

Preferably, the index information of the template document is updated based on a cumulative offset, which is calculated by accumulating the length differences between each placeholder and the annotated portion it replaces.

Preferably, annotating the template document comprises: calculating the similarity between each of a plurality of documents and the document to be extracted, and selecting the document most similar to the document to be extracted as the template document.

Preferably, calculating the similarity between the plurality of documents and the document to be extracted includes: normalizing the full text of the document and of the document to be extracted; splitting the document and the document to be extracted at punctuation marks to obtain their respective phrase lists; de-duplicating each phrase list and removing phrases that are empty strings, to obtain the phrase set of the document and the phrase set of the document to be extracted; calculating the difference between the phrase set of the document and the phrase set of the document to be extracted; and calculating the similarity from the number of elements in that difference and the number of elements in the phrase set of the document to be extracted.

Preferably, selecting the document most similar to the document to be extracted as the template document includes: presetting a threshold, and selecting the document with the highest similarity as the template document when its similarity is greater than the threshold; otherwise, returning a message that no usable template document is available.

Preferably, matching the template document with the document to be extracted includes: using SequenceMatcher from the difflib library of Python as the matching algorithm.

An apparatus for extracting elements from a document, the apparatus comprising:

an annotation module, which annotates the template document and generates the template document and the index information of its annotations;

a matching module, which matches the template document with the document to be extracted to generate matching pairs;

a demarcation module, which defines front and rear boundaries in the template document and front and rear boundaries in the document to be extracted according to the index information of the annotations and of the matching pairs;

a replacing module, which replaces the content between the front and rear boundaries in the template document with the content between the front and rear boundaries in the document to be extracted; and

an output module, which outputs the annotations in the template document and their index information as the extracted elements.

Compared with the prior art, the invention has the beneficial effects that:

1. the method provides a means of extraction that differs both from machine learning methods such as CRF and deep learning and from rule-based extraction; it can accurately extract field contents from documents with fixed templates, such as contracts, and combining several extraction methods can effectively improve the final extraction result, which is of practical significance;

2. a large amount of annotation is not needed, and only one document needs to be annotated per template, so that the workload of annotators is greatly reduced;

3. compared with machine learning methods such as CRF and deep learning, model training is very fast;

4. the similarity algorithm can match the most similar template efficiently;

5. by finding a template that reaches the similarity threshold, the field content can be extracted accurately; the result is more trustworthy than that of a machine learning method, the accuracy is close to 1, and the review burden on business personnel can be greatly reduced.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a model training process for one field.

FIG. 2 is a schematic diagram of a process for traversing all label start and end subscripts.

FIG. 3 is a flow chart of a method for extracting elements from a document.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.

In the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the present invention.

As shown in fig. 1, the present embodiment provides a method for extracting field values by comparing historical document annotation data with the document to be extracted and finding the parts that differ. First, a comparison extraction model is trained on the historical document annotation data; the model is then used for prediction on the document to be extracted, extracting the values of its fields.

1. Model definition

The model needs to know the full text of all candidate history documents for one document type, and all the labels and positions of fields in each candidate history document. Thus, we can design a model of the document type as follows:

{
    model of history document 1,
    model of history document 2,
    model of history document 3,
    ……
}

Each history document model is as follows:

{
    "content": UTF-8 full text content (processed; contains no line feeds and no u'\4\3'),
    "labels": {
        annotation information 1 of the field,
        annotation information 2 of the field,
        annotation information 3 of the field,
        ……
    }
}

Each piece of annotation information of a field is as follows:

(
    "text": the annotated text content,
    "start": the start subscript of the annotated text in the full text
)

This method is simple, but when the document to be extracted contains text identical to the annotated text in the template, the SequenceMatcher used later in model prediction may produce misaligned matches. The method can therefore be improved by replacing every annotation text in the annotation information with u'\4\3', replacing the annotated text in the full document with u'\4\3', and updating all annotation start subscripts. The new model of each history document is as follows:

{
    "content": UTF-8 full text content (all annotated text is replaced with u'\4\3'),
    "labels": {
        annotation information 1 of the field,
        annotation information 2 of the field,
        annotation information 3 of the field,
        ……
    }
}

Each piece of annotation information of a field is as follows:

(
    "text": u'\4\3',
    "start": the new start subscript of the annotated text in the full text
)

Once all of this information is available, the model can be used for prediction.

2. Model training

Training the model generates, for each template (history document), the processed full text and all of its annotation information. To process a template, first obtain its full-text content and make a copy, then traverse all annotation data. In each iteration, replace the current annotation text with u'\4\3', update the start subscript of the annotation text by the accumulated start subscript offset, update the accumulated offset, and replace the corresponding annotation text in the full-text copy with u'\4\3'. After every template has been processed, a model of the document type is generated from all templates and persisted to disk, which completes model training.

A schematic diagram of a model training process for one field is shown in fig. 1, and a schematic diagram of a process for traversing all label start and end subscripts is shown in fig. 2.
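The training step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name build_template_model, the dictionary layout and the English sample text are hypothetical, and the annotated spans are assumed to be sorted, non-overlapping, half-open [start, end) intervals.

```python
PLACEHOLDER = "\u0004\u0003"  # the two control characters used as placeholder in the text


def build_template_model(content, spans):
    """Replace each annotated span [start, end) of `content` with PLACEHOLDER.

    Assumption of this sketch: `spans` is sorted and non-overlapping.
    """
    labels = []
    parts = []
    last_end = 0   # end of the previous annotation in the original text
    len_delta = 0  # accumulated shift caused by earlier replacements
    for start, end in spans:
        # The new start subscript is the old one shifted by the accumulated offset.
        labels.append({"text": PLACEHOLDER, "start": start + len_delta})
        # Accumulate the length difference between placeholder and replaced text.
        len_delta += len(PLACEHOLDER) - (end - start)
        parts.append(content[last_end:start])
        parts.append(PLACEHOLDER)
        last_end = end
    parts.append(content[last_end:])  # remaining tail after the last annotation
    return {
        "content_with_label": "".join(parts),
        "content_without_label": content,
        "labels": labels,
    }


# Hypothetical sample: "Alice" occupies [9, 14) and "Bob" occupies [25, 28).
model = build_template_model("Party A: Alice. Party B: Bob.", [(9, 14), (25, 28)])
print(model["labels"])
```

In this sample the first replacement shortens the text by three characters, so the second annotation moves from subscript 25 to 22, which is the accumulated offset at work.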

3. Model prediction

The general flow of model prediction is as follows. For a document to be extracted, first find the most similar template. If the similarity is below the threshold, template matching has failed and no data is extracted. If the similarity reaches the threshold, perform full-text matching between the document to be extracted and the most similar template to obtain the matching pairs, and then use the comparison extraction algorithm to extract the values of all fields.

1. Most similar template matching

To keep performance high and the user experience good in real scenarios, the algorithm that finds the most similar template must not take too long. A simple and efficient approach is to compute the similarity from the set difference of phrases, which suits documents with regular templates, such as contracts. The specific method is as follows:

Traverse all templates and calculate the similarity between each template and the document to be extracted. Take the template with the highest similarity and check whether its similarity reaches the similarity threshold specified by the user. If it does not, most-similar-template matching has failed, and the method returns directly without extracting anything from the document; if it does, processing continues with the next step.
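The traversal and threshold check just described might look like the sketch below. The names select_template and toy_similarity are hypothetical; toy_similarity is only a word-overlap stand-in so that the example runs on its own, and it is not the phrase-set similarity of this document. The sketch treats "reaches the threshold" as greater than or equal, which is an assumption.

```python
def toy_similarity(template_text, new_text):
    """Stand-in similarity: share of the new document's words found in the template."""
    new_words = set(new_text.split())
    if not new_words:
        return 0.0
    return len(new_words & set(template_text.split())) / len(new_words)


def select_template(templates, new_text, threshold, similarity=toy_similarity):
    """Return (name, score) of the most similar template, or (None, score) on failure."""
    best_name, best_score = None, -1.0
    for name, template_text in templates.items():
        score = similarity(template_text, new_text)
        if score > best_score:
            best_name, best_score = name, score
    # Matching fails when no template reaches the user-specified threshold.
    if best_name is None or best_score < threshold:
        return None, best_score
    return best_name, best_score


templates = {
    "sale": "this is a sale contract",
    "lease": "this is a lease contract",
}
new_text = "this is a lease contract today"
print(select_template(templates, new_text, 0.6))
print(select_template(templates, new_text, 0.9))
```

With a threshold of 0.6 the "lease" template is selected; with 0.9 no template qualifies and the caller would skip extraction.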

The step of calculating the similarity between a template and a document to be extracted is as follows:

1) Normalize the full text of the template and of the document to be extracted (removing white space, digits, English periods, percent signs, per-mille signs, dashes, amounts written in uppercase or lowercase numerals, units and the like), and split each text at punctuation marks to obtain the phrase list of the template and the phrase list of the document to be extracted. The punctuation marks include Chinese and English commas, semicolons, colons, parentheses, square brackets, book title marks, enumeration commas, slashes, periods and the like.

2) Convert the phrase lists of the template and of the document to be extracted into sets to remove duplicates, and remove empty-string elements, to obtain the phrase sets of the template and of the document to be extracted, called set_template and set_new respectively.

3) Calculate the difference between the phrase set of the document to be extracted and the phrase set of the template:

diff_set_new_minus_template = set_new - set_template

4) Calculate the similarity between the document to be extracted and the template:

similarity = 1 - float(len(diff_set_new_minus_template)) / len(set_new)
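Steps 1) to 4) can be sketched as follows. This is a simplified illustration: the regular expressions and the helper name phrase_set are assumptions, the normalization only strips white space, digits, percent and per-mille signs and dashes (it does not handle amounts in numerals or units), and the sample texts are hypothetical English stand-ins.

```python
import re

# Simplified normalization: white space, digits, percent/per-mille signs, dashes.
_NORMALIZE = re.compile(r"[\s\d%‰\-]+")
# Split points: a selection of English and Chinese punctuation marks.
_SPLIT = re.compile(r"[,;:()\[\]/.，；：（）【】《》、。]")


def phrase_set(text):
    """Normalize, split at punctuation, de-duplicate and drop empty strings."""
    normalized = _NORMALIZE.sub("", text)
    return {phrase for phrase in _SPLIT.split(normalized) if phrase}


def similarity(template_text, new_text):
    set_template = phrase_set(template_text)
    set_new = phrase_set(new_text)
    if not set_new:  # guard against division by zero for an empty document
        return 0.0
    diff_set_new_minus_template = set_new - set_template
    return 1 - len(diff_set_new_minus_template) / len(set_new)


template = ("This is a sale contract. Party A: Alice. Party B: Bob. "
            "Party A pays RMB 1200. Effective on signing.")
new_doc = ("This is a sale contract. Party A: Carol. Party B: Dave. "
           "Party A pays RMB 66. Effective on signing.")
print(round(similarity(template, new_doc), 3))
```

Here the new document yields seven phrases, two of which (the party names) are absent from the template, so the similarity is 1 - 2/7. Because the difference is taken from the new document's side only, extra phrases that appear only in the template do not lower the score.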

A cache decorator can be added to the similarity function, for example by importing it with from cachetools import cached, LRUCache and then applying @cached(cache=LRUCache(maxsize=10000)), to avoid recomputing the similarity to the same template for different fields of the same document to be extracted. For this purpose, a key "content_without_label" is provided in the template; its value is the full text of the template without annotation placeholders, which is identical for the same template across different fields.
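The effect of such a cache can be illustrated with the standard library alone. The sketch below deliberately swaps in functools.lru_cache for the cachetools decorator named in the text, so that it has no third-party dependency; the function cached_similarity, the call counter and the period-only phrase splitting are hypothetical simplifications.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the similarity is actually computed


@lru_cache(maxsize=10000)
def cached_similarity(content_without_label, new_text):
    """Phrase-set similarity, simplified to splitting at periods only."""
    calls["count"] += 1
    set_template = {p for p in content_without_label.split(".") if p.strip()}
    set_new = {p for p in new_text.split(".") if p.strip()}
    if not set_new:
        return 0.0
    return 1 - len(set_new - set_template) / len(set_new)


template = "Sale contract. Party A pays. Effective on signing."
document = "Sale contract. Party A pays later. Effective on signing."

# Two fields of the same document ask for the same template similarity.
first = cached_similarity(template, document)
second = cached_similarity(template, document)
print(first, second, calls["count"])
```

The cache key is the pair of argument strings, which is why the text keeps the unannotated full text identical across fields: identical arguments hit the cache, so the second call does no work.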

2. Generating full text matching information

The comparison extraction algorithm in the next step needs the full-text matching information of the template and the document to be extracted. The difflib library of Python provides SequenceMatcher, which can be used directly to obtain this information, namely the two start subscripts and the match length of each matching pair. This is the most time-consuming step of prediction with the comparison extraction model.
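As a small illustration of this step, the sketch below runs difflib.SequenceMatcher on a hypothetical template (with one placeholder) and a hypothetical document. Passing autojunk=False is an assumption of this sketch, chosen so that difflib's junk heuristic for long sequences does not interfere; the source does not state which setting is used.

```python
from difflib import SequenceMatcher

PLACEHOLDER = "\u0004\u0003"

# Hypothetical texts: the template's Party A value has been replaced by the placeholder.
template = "Party A: " + PLACEHOLDER + ". Party B: Bob."
new_doc = "Party A: Carol. Party B: Bob."

matcher = SequenceMatcher(None, template, new_doc, autojunk=False)
blocks = matcher.get_matching_blocks()

for block in blocks:
    # block.a: start in template, block.b: start in document, block.size: match length
    print(block.a, block.b, block.size, repr(template[block.a:block.a + block.size]))
```

get_matching_blocks() always ends with a dummy block of size 0, which is why the boundary search has to skip zero-length matching pairs.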

3. Comparison and extraction algorithm

For a document to be extracted and its most similar template, comparison extraction is performed one annotation at a time. The basic idea for each annotation is first to determine the front boundary and the rear boundary of the field interval to be extracted, in both the template and the document to be extracted; then to replace the part of the template between the front and rear boundaries with the content between the front and rear boundaries of the document to be extracted; and finally to splice three pieces of text (the text of the annotation interval before the front boundary, the replaced content, and the text of the annotation interval after the rear boundary) to obtain the field value to be extracted. Several boundary conditions must be handled correctly.

The front boundary is searched for first. If it cannot be found, for example because the annotation data is wrong, an empty extraction result is returned for this annotation. If the end subscript of the annotation is smaller than the front boundary, the annotated part is matched completely, that is, the value of the field in the document to be extracted is identical to the annotation in the template; the annotation text is then returned directly as the field value, together with its start subscript in the document to be extracted.

Once the front boundary has been found, the rear boundary is searched for. If it cannot be found, for example because the annotation data is wrong, an empty extraction result is returned.

If matching front and rear boundaries are found within the annotation interval, calculate the offset between the front boundary and the annotation start subscript in the template, and add this offset to the front boundary in the document to be extracted to obtain the field start subscript in the document to be extracted. Then splice the template text of the annotation interval before the front boundary, the text between the front and rear boundaries in the document to be extracted, and the template text of the annotation interval after the rear boundary, to obtain the extracted text for this annotation of the field in the document to be extracted, namely the field value. The field start subscript and the field value are all the information to be extracted from the document for a given annotation. A schematic diagram of the flow of the comparison extraction algorithm is shown in fig. 3.
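The offset calculation and three-part splice can be sketched as below, assuming the front and rear boundaries have already been located. The function name splice_field and the sample texts are hypothetical, and the boundaries in the example are supplied by hand rather than computed. The example uses a template whose annotation was not replaced by a placeholder, so that part of the annotated text ("Alice ") also matches the document.

```python
def splice_field(template, new_doc, label_start, label_end, front, rear):
    """Return (field_start, field_value) for one annotation.

    front = (template_front, new_front) and rear = (template_rear, new_rear)
    are the boundaries in the template and in the document to be extracted.
    """
    template_front, new_front = front
    template_rear, new_rear = rear
    # Offset of the annotation start relative to the front boundary in the template.
    offset = label_start - template_front
    field_start = new_front + offset
    field_value = (
        template[label_start:template_front]  # annotated text before the front boundary
        + new_doc[new_front:new_rear]         # differing content from the document
        + template[template_rear:label_end]   # annotated text after the rear boundary
    )
    return field_start, field_value


template = "Party A: Alice Smith. End"
new_doc = "Party A: Alice Jones. End"
# The annotation "Alice Smith" occupies [9, 20) in the template.
# Hand-supplied boundaries: "Party A: Alice " matches up to 15, ". End" matches from 20.
result = splice_field(template, new_doc, 9, 20, front=(15, 15), rear=(20, 20))
print(result)
```

The matched prefix "Alice " is taken from the template, only "Jones" comes from the document, and the start subscript is shifted back from the front boundary to where the field actually begins.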

3.1 front boundary positioning

If the annotation starts at the very beginning of the template, with no text before it, the field is taken to be the first field value in the document to be extracted, and the front boundaries of the template and of the document to be extracted are both set to subscript 0. Otherwise, the front boundaries of the template and of the document to be extracted are obtained with the front and rear boundary positioning auxiliary function.

Because the search runs in forward order, the sequence of matching pairs passed in is in forward order.

The matching condition function passed in is:

lambda label_start, label_end, template_match_start, template_match_end, \
        new_match_start, new_match_end: \
    (template_match_start < label_start and label_start <= template_match_end)

The return function passed in is:

lambda label_start, label_end, template_match_start, template_match_end, \
        new_match_start, new_match_end: \
    (template_match_end, new_match_end)

wherein, label_start represents the label starting index in the template, label_end represents the label ending index in the template, template_match_start represents the match starting index of a match pair in the template, template_match_end represents the match ending index of a match pair in the template, new_match_start represents the match starting index of a match pair in the document to be extracted, new_match_end represents the match ending index of a match pair in the document to be extracted.

3.2 rear boundary positioning

If the annotation ends at the very end of the template, with no text after it, the field is taken to be the last field value in the document to be extracted, and the rear boundaries of the template and of the document to be extracted are each set to the full-text length of the respective document. Otherwise, the rear boundaries of the template and of the document to be extracted are obtained with the front and rear boundary positioning auxiliary function.

Because the search runs in reverse order, the sequence of matching pairs passed in is in reverse order.

The matching condition function passed in is:

lambda label_start, label_end, template_match_start, template_match_end, \
        new_match_start, new_match_end: \
    (label_start <= template_match_start and template_match_start <= label_end)

The return function passed in is:

lambda label_start, label_end, template_match_start, template_match_end, \
        new_match_start, new_match_end: \
    (template_match_start, new_match_start)

The definitions of label_start, label_end, template_match_start, template_match_end, new_match_start and new_match_end are the same as in front boundary positioning.

3.3 front and rear boundary positioning auxiliary function

The front and rear boundary positioning auxiliary function returns the result produced by the designated function the first time the matching condition is satisfied. The following parameters must be passed in:

1) ordered_matching_blocks: the ordered sequence of matching blocks from SequenceMatcher, in either forward or reverse order.

2) func_find: a callback function; when it returns true, the matching condition is considered satisfied and the auxiliary function returns a result.

3) func_get: a callback function; the auxiliary function returns whatever this function returns.

4) label_start: the annotation start subscript in the template.

5) label_end: the annotation end subscript in the template.

The auxiliary function traverses the matching pair sequence ordered_matching_blocks in order. Because the list of matching pairs returned by SequenceMatcher contains a matching pair whose match length is 0, the loop skips to the next iteration whenever it encounters such a pair.

In each iteration, the match start index in the template, the match start index in the document to be extracted and the length of the matched text are obtained, and from them the match end index in the template and the match end index in the document to be extracted are calculated. The following 6 parameters are then passed to the callback func_find: the annotation start index label_start in the template, the annotation end index label_end in the template, the match start index template_match_start in the template, the match end index template_match_end in the template, the match start index new_match_start in the document to be extracted and the match end index new_match_end in the document to be extracted. If func_find returns true, the same 6 parameters are passed to the callback func_get, and the return value of func_get is returned as the result of the auxiliary function.

If the whole sequence ordered_matching_blocks has been traversed and no matching pair satisfies the callback func_find, the front or rear boundary being searched for has not been found, and an empty result is returned.
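Put together, the auxiliary function and the two pairs of callbacks could be sketched as follows. The function name find_boundary, the shortened callback parameter names and the sample texts are hypothetical; the callbacks mirror the conditions given for front and rear boundary positioning, and autojunk=False is an assumption of this sketch.

```python
from difflib import SequenceMatcher


def find_boundary(ordered_matching_blocks, func_find, func_get, label_start, label_end):
    """Return func_get(...) for the first block satisfying func_find, else None."""
    for template_match_start, new_match_start, size in ordered_matching_blocks:
        if size == 0:  # SequenceMatcher ends its block list with a zero-length block
            continue
        args = (label_start, label_end,
                template_match_start, template_match_start + size,
                new_match_start, new_match_start + size)
        if func_find(*args):
            return func_get(*args)
    return None


# Callbacks: ls/le = annotation start/end, ts/te = match start/end in the template,
# ns/ne = match start/end in the document to be extracted.
front_find = lambda ls, le, ts, te, ns, ne: ts < ls <= te
front_get = lambda ls, le, ts, te, ns, ne: (te, ne)
rear_find = lambda ls, le, ts, te, ns, ne: ls <= ts <= le
rear_get = lambda ls, le, ts, te, ns, ne: (ts, ns)

template = "Party A: \u0004\u0003. Party B: Bob."  # placeholder occupies [9, 11)
new_doc = "Party A: Carol. Party B: Bob."
label_start, label_end = 9, 11

blocks = SequenceMatcher(None, template, new_doc, autojunk=False).get_matching_blocks()
front = find_boundary(blocks, front_find, front_get, label_start, label_end)
rear = find_boundary(list(reversed(blocks)), rear_find, rear_get, label_start, label_end)

field_value = (template[label_start:front[0]]
               + new_doc[front[1]:rear[1]]
               + template[rear[0]:label_end])
print(front, rear, field_value)
```

Passing the conditions as callbacks lets one traversal routine serve both searches; only the order of the block list and the two small functions differ between the front and rear boundary.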

With the comparison extraction method, historical annotation data can be used effectively to extract accurately the field values of a document to be extracted that follows the same template, and the method is well suited to documents with fixed templates, such as contracts. The method trains quickly but predicts more slowly; the performance bottleneck of prediction is mainly the step in which SequenceMatcher generates the full-text matching information of the most similar template (history document) and the document to be extracted. For different fields of the same document type, or for different document types, parallel computing on a cluster can be used to speed up prediction.

One embodiment is provided below in connection with the specific text:

1. Example of model training for one template

1) Input:

Template full text content (line_content): "Confucius, given name Qiu, courtesy name Zhongni, ancestral home in the State of Song, founder of the Confucian school. Zhang Liang, courtesy name Zifang, letter, born in the State of Han." The subscripts below refer to character positions in the original Chinese text, in which "Confucius" (孔子) and "Zhang Liang" (张良) are each two characters long.

List of annotation start and end subscripts in the template: [[0, 2), [23, 25)].

2) Processing:

Initialize labels to an empty list.

Initialize placeholder_version_line_content to an empty string.

Initialize last_end_in_line_content = 0.

Initialize the accumulated offset len_delta = 0.

i) First iteration: the annotation start subscript label_start is 0 and the annotation end subscript label_end is 2:

The start and end subscripts [0, 2) correspond to "Confucius" (孔子).

New start subscript = 0 + len_delta = 0.

Update the accumulated offset: len_delta += (len("\u0004\u0003") - len("孔子")), giving len_delta = 0.

Append the new annotation text and the new annotation start subscript ["\u0004\u0003", 0] to labels.

Update placeholder_version_line_content += (line_content[last_end_in_line_content:label_start] + "\u0004\u0003"), giving "\u0004\u0003".

Update last_end_in_line_content = label_end = 2.

ii) Second iteration: the annotation start and end subscripts are [23, 25):

The start and end subscripts [23, 25) correspond to "Zhang Liang" (张良).

New start subscript = 23 + len_delta = 23.

Update the accumulated offset: len_delta += (len("\u0004\u0003") - len("张良")), giving len_delta = 0.

Append the new annotation text and the new annotation start subscript ["\u0004\u0003", 23] to labels.

Update placeholder_version_line_content += (line_content[last_end_in_line_content:label_start] + "\u0004\u0003"), giving "\u0004\u0003, given name Qiu, courtesy name Zhongni, ancestral home in the State of Song, founder of the Confucian school. \u0004\u0003".

Update last_end_in_line_content = label_end = 25.

iii) Post-processing

Append the remaining tail: placeholder_version_line_content += line_content[last_end_in_line_content:], giving "\u0004\u0003, given name Qiu, courtesy name Zhongni, ancestral home in the State of Song, founder of the Confucian school. \u0004\u0003, courtesy name Zifang, letter, born in the State of Han.".

content_with_label is placeholder_version_line_content.

content_without_label is the template full text.

3) Output:

2. similarity calculation example of template and document to be extracted

The similarity threshold is assumed to be 0.6.

1) Input:

and (3) full text of the template: the present contract is a purchase and sale contract. Party a: and (5) a hole. And B, formula: and (5) a bang Zi. Party A purchases party B to pay one kilo-two of RMB, and the one kilo-two of RMB is collected for Wu Jiaoliu minutes. The contract is validated on the date of the two parties' endorsement. ".

The full text of the document to be extracted: the present contract is a purchase and sale contract. Party a: and (3) semen (Xuezi). And B, formula: han Feizi. Party a purchases party b to pay the rmb Liuliu and pick up Liu Yuan whole. The contract is validated on the date of the two parties' endorsement. ".

2) And (3) treatment:

i) Normalization: removing white space, number, english dot, percentage, thousand, dash, case and case amount and unit

And (3) a template: the present contract is a purchase and sale contract. Party a: and (5) a hole. And B, formula: and (5) a bang Zi. Party A purchases party B to pay the RMB. The contract is validated on the date of the two parties' endorsement. ".

Document to be extracted: the present contract is a purchase and sale contract. Party a: and (3) semen (Xuezi). And B, formula: han Feizi. Party A purchases party B to pay the RMB. The contract is validated on the date of the two parties' endorsement. ".

ii) dividing according to punctuation marks, wherein the punctuation marks comprise commas, semicolons, colon, parentheses, square brackets, signature numbers, pause numbers, slashes and periods of Chinese and English:

and (3) a template: the [ "present contract is a purchase and sale contract. "party A", "hole", "party B", "Monascus", and "party A purchases party B to pay RMB. ", the present contract is validated on the date of the two parties signing. "].

Document to be extracted: the [ "present contract is a purchase and sale contract. "party a", "party b", "Han Feizi", "party a purchases party b service to pay for the renminbi. ", the present contract is validated on the date of the two parties signing. "].

iii) Calculate the phrase similarity percentage using the set difference:

Convert the template phrase list to a set (de-duplicating it) and remove empty strings: set_template = ["This contract is a purchase and sale contract", "Party A", "Kongzi", "Party B", "Bangzi", "Party A purchases Party B's services and pays RMB", "This contract takes effect on the date it is signed by both parties"].

Convert the phrase list of the document to be extracted to a set (de-duplicating it) and remove empty strings: set_new = ["This contract is a purchase and sale contract", "Party A", "Xuanzi", "Party B", "Han Feizi", "Party A purchases Party B's services and pays RMB", "This contract takes effect on the date it is signed by both parties"].

Subtract the template phrase set from the phrase set of the document to be extracted:

diff_set_new_minus_template = set_new - set_template =

["Xuanzi", "Han Feizi"].

Similarity

= 1 - float(len(diff_set_new_minus_template)) / len(set_new) = 1 - 2.0/7 ≈ 0.714.
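The similarity calculation above can be reproduced with Python sets. The function name `phrase_similarity` and the handling of an empty phrase set are assumptions added for illustration; the formula itself is the one given in the text.

```python
def phrase_similarity(template_phrases, new_phrases):
    """Similarity = 1 - |set_new - set_template| / |set_new|."""
    # De-duplicate and drop empty strings.
    set_template = {p for p in template_phrases if p}
    set_new = {p for p in new_phrases if p}
    if not set_new:
        return 0.0  # assumption: an empty document matches nothing
    diff_set_new_minus_template = set_new - set_template
    return 1 - float(len(diff_set_new_minus_template)) / len(set_new)


common = [
    "This contract is a purchase and sale contract",
    "Party A",
    "Party B",
    "Party A purchases Party B's services and pays RMB",
    "This contract takes effect on the date it is signed by both parties",
]
template_phrases = common + ["Kongzi", "Bangzi"]
new_phrases = common + ["Xuanzi", "Han Feizi"]
print(round(phrase_similarity(template_phrases, new_phrases), 3))
```

Note that the measure is asymmetric: only phrases of the document to be extracted that are missing from the template lower the score.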

3) Output:

Similarity: 0.714 (if this template has the highest similarity and the threshold of 0.6 is reached, it is the most similar template that satisfies the similarity threshold requirement).
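Selecting the most similar template against a threshold might look like the sketch below. The function name `select_template`, the dictionary input and the `None` return value for "no template available" are assumptions; the comparison uses "reaches the threshold" (greater than or equal) as in this example, whereas a strict "greater than" would be an equally simple variant.

```python
def select_template(similarities, threshold=0.6):
    """Return the name of the most similar template, or None if none qualifies.

    `similarities` maps a hypothetical template name to its similarity score.
    """
    if not similarities:
        return None
    best = max(similarities, key=similarities.get)
    return best if similarities[best] >= threshold else None


print(select_template({"purchase_contract": 0.714, "lease_contract": 0.25}))
```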

3. Comparing the most similar template with the document to be extracted and performing the extraction, taking the following as an example:

Assume that the field to be extracted is Party A.

1) Input:

Full text (with labels) of the most similar template that reaches the similarity threshold: "This contract is a purchase and sale contract. Party A: \u0004\u0003. Party B: Bangzi. Party A purchases Party B's services and pays one kilo-two of RMB, and the one kilo-two of RMB is collected for Wu Jiaoliu minutes. This contract takes effect on the date it is signed by both parties."

2) Processing:

i) SequenceMatcher generates the matching pair information:

Template "This contract is a purchase and sale contract. Party A:" corresponds to document to be extracted "This contract is a purchase and sale contract. Party A:";

Template ". Party B:" corresponds to document to be extracted ". Party B:";

Template "zi. Party A purchases Party B's services and pays RMB" corresponds to document to be extracted "zi. Party A purchases Party B's services and pays RMB";

Template "Liu" corresponds to document to be extracted "Liu";

Template ". This contract takes effect on the date it is signed by both parties." corresponds to document to be extracted ". This contract takes effect on the date it is signed by both parties.".
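Matching pairs of this kind can be obtained from `difflib.SequenceMatcher`. The sketch below uses shortened English strings rather than the full contract text, so its subscripts differ from those of the worked example; the helper name `matching_pairs` is an assumption.

```python
from difflib import SequenceMatcher


def matching_pairs(template: str, document: str):
    """Return (template_start, document_start, size) for each matched block."""
    matcher = SequenceMatcher(None, template, document, autojunk=False)
    # The final block returned by get_matching_blocks() is a zero-length sentinel.
    return [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size > 0]


# Shortened sample strings; the labeled value is replaced by the placeholder.
template = "Party A: \u0004\u0003. Party B: Bangzi."
document = "Party A: Xuanzi. Party B: Han Feizi."

for t_start, d_start, size in matching_pairs(template, document):
    print(repr(template[t_start:t_start + size]), "<->",
          repr(document[d_start:d_start + size]))
```

Each tuple gives the start subscripts of a matched block in the template and in the document; the corresponding end subscripts are the start plus the size.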

ii) Determining the front boundary:

Searching from left to right, find the first matching pair for which the label start subscript lies in the interval [template match start subscript, template match end subscript]; the template match end subscript of that pair is the front boundary. Here the label start subscript is 14. The first matching pair, "This contract is a purchase and sale contract. Party A:", has start subscript 0 and end subscript 14 in the template, and 14 lies in the interval [0, 14], so the front boundary of the template is 14; the front boundary of the corresponding document to be extracted is likewise 14.

iii) Determining the rear boundary:

Similarly to the front boundary, searching from right to left, find the first matching pair whose template match start subscript lies in the interval [label start subscript, label end subscript]; that template match start subscript is the rear boundary. Here the label start subscript is 14 and the label end subscript is 16. The second matching pair, ". Party B:", has start subscript 16 in the template, and 16 lies in the interval [14, 16], so the rear boundary of the template is 16; the rear boundary of the corresponding document to be extracted is likewise 16.
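The two boundary searches can be sketched as follows, assuming the matching pairs are given as (template start, document start, size) tuples sorted by position. The function name `find_boundaries` and the tuple layout are illustrative assumptions, and the size of the second pair below is chosen only for the demonstration.

```python
def find_boundaries(pairs, label_start, label_end):
    """Return ((template_front, doc_front), (template_rear, doc_rear))."""
    front = rear = None
    # Front boundary: scan left to right for the first pair whose template
    # interval [start, end] contains the label start subscript.
    for t_start, d_start, size in pairs:
        if t_start <= label_start <= t_start + size:
            front = (t_start + size, d_start + size)
            break
    # Rear boundary: scan right to left for the first pair whose template
    # start subscript lies in [label_start, label_end].
    for t_start, d_start, size in reversed(pairs):
        if label_start <= t_start <= label_end:
            rear = (t_start, d_start)
            break
    return front, rear


# Subscripts from the worked example: the first pair spans [0, 14] and the
# second pair starts at 16 in both texts.
print(find_boundaries([(0, 0, 14), (16, 16, 4)], label_start=14, label_end=16))
```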

iv) Obtaining the field value and its start subscript:

Extracted field value corresponding to the label

= label text[0 : template front boundary - label start subscript] + document to be extracted[document front boundary : document rear boundary] + labeled template[template rear boundary : label end subscript]

= "\u0004\u0003"[0:14-14]

+ "This contract is a purchase and sale contract. Party A: Xuanzi. Party B: Han Feizi. Party A purchases Party B's services and pays RMB Liuliu and pick up Liu Yuan whole. This contract takes effect on the date it is signed by both parties."[14:16]

+ "This contract is a purchase and sale contract. Party A: \u0004\u0003. Party B: Bangzi. Party A purchases Party B's services and pays one kilo-two of RMB, and the one kilo-two of RMB is collected for Wu Jiaoliu minutes. This contract takes effect on the date it is signed by both parties."[16:16]

= "Xuanzi".

Start subscript of the extracted field corresponding to the label = document front boundary - (template front boundary - label start subscript) = 14 - (14 - 14) = 14.

3) Output:

Extraction result for the field of the document to be extracted: [("Xuanzi", 14)].
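Assembling the field value and its start subscript from the boundaries can be sketched as below. The function name `extract_field` and its parameter names are assumptions; the sample strings are shortened English stand-ins, so the resulting subscript is 9 rather than the 14 of the worked example.

```python
def extract_field(label_text, template, document,
                  label_start, label_end,
                  template_front, template_rear,
                  doc_front, doc_rear):
    """Combine the three slices described in step iv into (value, start)."""
    value = (
        label_text[0:template_front - label_start]   # labeled text before the front boundary
        + document[doc_front:doc_rear]               # content taken from the document
        + template[template_rear:label_end]          # labeled text after the rear boundary
    )
    start = doc_front - (template_front - label_start)
    return value, start


template = "Party A: \u0004\u0003. Party B: Bangzi."
document = "Party A: Xuanzi. Party B: Han Feizi."
print(extract_field("\u0004\u0003", template, document,
                    label_start=9, label_end=11,
                    template_front=9, template_rear=11,
                    doc_front=9, doc_rear=15))
```

In this example the first and third slices are empty because the matched text reaches right up to both ends of the label; they contribute only when part of the labeled text itself was matched.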

Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications and improvements can be made on the basis of this disclosure without departing from the spirit and scope of the invention.

Claims

1. A method of extracting elements from a document, the method comprising:

labeling a template document to generate the template document and the subscript information of its labels, wherein labeling the template document comprises: replacing each labeled part with a placeholder, and updating the subscript information of the template document based on an accumulated offset, the accumulated offset being calculated by accumulating the length differences between each placeholder and the labeled part it replaces;

matching the template document with the document to be extracted to generate a matching pair;

defining, according to the subscript information of the labels and of the matching pairs, front and rear boundaries in the template document and front and rear boundaries in the document to be extracted, including: when the label start subscript lies between the match start subscript and the match end subscript in the template, taking the match end subscript in the template as the front boundary of the template document and the match end subscript in the document to be extracted as the front boundary of the document to be extracted; and when the match start subscript in the template lies between the label start subscript and the label end subscript, taking the match start subscript in the template as the rear boundary of the template document and the match start subscript in the document to be extracted as the rear boundary of the document to be extracted;

replacing the content in the front and rear boundaries in the template document with the content in the front and rear boundaries in the document to be extracted;

and outputting the template document and the subscript information in the label as extracted elements.

2. The method of extracting elements from a document of claim 1, wherein labeling the template document comprises:

calculating the similarities between a plurality of documents and the document to be extracted, and selecting the document most similar to the document to be extracted as the template document.

3. The method of extracting elements from documents of claim 2, wherein said calculating similarities between a plurality of documents and the document to be extracted comprises:

normalizing the full text of the document and the document to be extracted;

dividing the document and the document to be extracted according to punctuation marks to obtain respective short sentence lists of the document and the document to be extracted;

respectively de-duplicating the short sentence lists of the document and the document to be extracted, and removing short sentences which are empty character strings to obtain a short sentence set of the document and a short sentence set of the document to be extracted;

calculating the difference between the short sentence set of the document and the short sentence set of the document to be extracted;

and calculating the similarity according to the element number of the difference between the short sentence set of the document and the short sentence set of the document to be extracted and the element number of the short sentence set of the document to be extracted.

4. A method of extracting elements from a document according to claim 3, wherein selecting a document most similar to the document to be extracted as a template document comprises:

presetting a threshold, and selecting the document with the highest similarity as the template document when its similarity is greater than the threshold; otherwise, feeding back information that no template document is available.

5. The method of extracting elements from a document according to claim 1, wherein matching the template document with the document to be extracted comprises: using SequenceMatcher from Python's difflib library as the matching algorithm.

6. An apparatus for extracting elements from a document, the apparatus comprising:

a labeling module, which labels the template document and generates the template document and the subscript information of its labels, wherein the labeling module replaces each labeled part with a placeholder and updates the subscript information of the template document based on an accumulated offset, the accumulated offset being calculated by accumulating the length differences between each placeholder and the labeled part it replaces;

a matching module, which matches the template document with the document to be extracted to generate matching pairs; a defining module, which defines front and rear boundaries in the template document and front and rear boundaries in the document to be extracted according to the subscript information of the labels and of the matching pairs, wherein, when the label start subscript lies between the match start subscript and the match end subscript in the template, the match end subscript in the template is taken as the front boundary of the template document and the match end subscript in the document to be extracted is taken as the front boundary of the document to be extracted; and when the match start subscript in the template lies between the label start subscript and the label end subscript, the match start subscript in the template is taken as the rear boundary of the template document and the match start subscript in the document to be extracted is taken as the rear boundary of the document to be extracted;