CN113762160A

CN113762160A - Date extraction method and device, computer equipment and storage medium

Info

Publication number: CN113762160A
Application number: CN202111049925.1A
Authority: CN
Inventors: 程佳宇; 陈永红; 张军涛; 王国鹏
Original assignee: Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd
Current assignee: Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd
Priority date: 2021-09-08
Filing date: 2021-09-08
Publication date: 2021-12-07
Also published as: WO2023035332A1

Abstract

The invention discloses a date extraction method and apparatus, a computer device, and a storage medium. The method comprises: acquiring a file image containing a date to be extracted, and preprocessing the file image; performing OCR recognition on the file image, and acquiring a target text segment containing the date to be extracted by combining the associated information of the date to be extracted; labeling the target text segment by using an NER technique, and outputting a date text segment; performing classification prediction on the date text segment through a classification model, and correcting and post-processing the date text segment based on the classification prediction result; and acquiring a target element of the date to be extracted according to the correction and post-processing results, and extracting the date according to the target element. By combining the associated information of the date to be extracted, the method locates the text segment in which that date appears, and it recognizes and labels the file image and the text segment through OCR and NER techniques, so that the extraction precision and extraction efficiency of the date can be improved.

Description

Date extraction method and device, computer equipment and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a date extraction method and apparatus, a computer device, and a storage medium.

Background

During the review of various contracts, the material to be handled manually tends to have two distinct characteristics: (1) contract types and the elements they cover vary across industries, including but not limited to real estate, medical care, manufacturing, and procurement, which raises the threshold for manually reviewing the related materials and increases the difficulty of the review work; (2) there are many similar elements, and the documents contain interference such as handwriting, intermixed seals, and watermarks, which makes it harder to extract the elements accurately. Existing ways of extracting the various dates in a contract generally fall into two categories:

The first method sorts out positioning rules for keywords or key sentences based on business logic, and then matches date formats that meet the requirements, using regular expressions and the like, to obtain the candidate dates. When there are several candidate dates, the final target element value is selected by applying the relevant business rules.

The second method performs date element extraction with deep learning, that is, the target value corresponding to a date is obtained through deep learning model prediction.

The drawback of the first method is that, although the accuracy of the extracted date can be guaranteed to some extent, the method is not robust: once the contract style or the context in which a date is expressed changes, the expected extraction effect can no longer be achieved.

In the second method, a contract contains many date-type elements, such as the start date, completion date, contract date, and validity period, and several of them often appear together. This makes it difficult for the model to identify the true target element, so the extraction accuracy is poor.

Disclosure of Invention

The embodiment of the invention provides a date extraction method, a date extraction device, computer equipment and a storage medium, and aims to improve the extraction precision and extraction efficiency of dates.

In a first aspect, an embodiment of the present invention provides a date extraction method, including:

acquiring a file image containing a date to be extracted, and preprocessing the file image;

performing OCR recognition on the preprocessed file image, and acquiring a target text segment containing the date to be extracted by combining the associated information of the date to be extracted;

labeling the target text segment by using an NER technology, and outputting to obtain a date text segment;

performing classification prediction on the date text segment through a classification model, and correcting and post-processing the date text segment based on the classification prediction result;

and acquiring a target element of the date to be extracted according to the correction and post-processing results, and extracting the date according to the target element.

In a second aspect, an embodiment of the present invention provides a date extracting apparatus, including:

the preprocessing unit is used for acquiring a file image containing a date to be extracted and preprocessing the file image;

the first acquisition unit is used for carrying out OCR recognition on the preprocessed file image and acquiring a target text segment containing the date to be extracted by combining the associated information of the date to be extracted;

the label labeling unit is used for labeling the target text segment by using an NER technique and outputting a date text segment;

the post-processing unit is used for performing classification prediction on the date text segment through a classification model, and correcting and post-processing the date text segment based on the classification prediction result;

and the date extraction unit is used for acquiring the target element of the date to be extracted according to the correction and post-processing result and extracting the date according to the target element.

In a third aspect, an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the date extraction method according to the first aspect when executing the computer program.

In a fourth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the date extraction method according to the first aspect is implemented.

The embodiment of the invention provides a date extraction method and apparatus, a computer device, and a storage medium. The method comprises: acquiring a file image containing a date to be extracted, and preprocessing the file image; performing OCR recognition on the preprocessed file image, and acquiring a target text segment containing the date to be extracted by combining the associated information of the date to be extracted; labeling the target text segment by using an NER technique, and outputting a date text segment; performing classification prediction on the date text segment through a classification model, and correcting and post-processing the date text segment based on the classification prediction result; and acquiring a target element of the date to be extracted according to the correction and post-processing results, and extracting the date according to the target element. By combining the associated information of the date to be extracted, the embodiment locates the text segment in which that date appears, and it recognizes and labels the file image and the text segment through OCR and NER techniques, so that the target element of the date to be extracted can be acquired accurately and the date can be extracted with high accuracy and efficiency.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is a schematic flow chart of a date extraction method according to an embodiment of the present invention;

Fig. 2 is a schematic sub-flow chart of a date extraction method according to an embodiment of the present invention;

Fig. 3 is a schematic block diagram of a date extraction apparatus according to an embodiment of the present invention;

Fig. 4 is a schematic block diagram of a sub-unit of a date extraction apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

Referring to fig. 1, fig. 1 is a schematic flow chart of a date extraction method according to an embodiment of the present invention, which specifically includes: steps S101 to S105.

S101, acquiring a file image containing a date to be extracted, and preprocessing the file image;

S102, performing OCR recognition on the preprocessed file image, and acquiring a target text segment containing the date to be extracted by combining the associated information of the date to be extracted;

S103, labeling the target text segment by using an NER technique, and outputting a date text segment;

S104, performing classification prediction on the date text segment through a classification model, and correcting and post-processing the date text segment based on the classification prediction result;

S105, acquiring the target element of the date to be extracted according to the correction and post-processing results, and extracting the date according to the target element.

In this embodiment, preprocessing the file image containing the date to be extracted removes interference such as noise, so that the text in the file image can be recognized smoothly by OCR, and the target text segment containing the date to be extracted is then extracted from the recognized text by combining the associated information of that date. The target text segment is then labeled by using an NER (named entity recognition) technique to obtain the date text segment. On this basis, classification prediction is performed on the date text segment by a classification model, and correction, post-processing, and similar operations are applied according to the classification prediction result to obtain the target element corresponding to the date to be extracted. Finally, the date is extracted according to the target element.

In this embodiment, the text segment in which the date to be extracted appears is located by combining the associated information of the date to be extracted, which reduces the interference of other date elements in the file when the value of the target date element is extracted and thereby improves accuracy. The file image and the text segment are recognized and labeled through OCR and NER techniques, so that the target element of the date to be extracted can be acquired accurately and the date extracted, improving both extraction precision and extraction efficiency. The date to be extracted in this embodiment may be a contract date in a contract file, or another date such as a commencement or completion date or an effective date, as determined by the actual scenario.
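Steps S101 to S105 can be read as a chain of stages in which each stage consumes the output of the previous one. The following is a minimal sketch of that composition only; the class name, field names, and the toy stand-in stages are all hypothetical and are not taken from this specification. Real stages would wrap image preprocessing, OCR, NER, and classification models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DatePipeline:
    """Chains the five stages S101-S105; every stage is a pluggable callable (hypothetical names)."""
    preprocess: Callable[[bytes], bytes]        # S101: clean the file image
    locate_segment: Callable[[bytes], str]      # S102: OCR plus associated information
    label_dates: Callable[[str], List[str]]     # S103: NER labeling, yields date text segments
    correct: Callable[[List[str]], List[str]]   # S104: classification, correction, post-processing
    select_target: Callable[[List[str]], str]   # S105: choose the target element

    def run(self, image: bytes) -> str:
        cleaned = self.preprocess(image)
        segment = self.locate_segment(cleaned)
        candidates = self.label_dates(segment)
        corrected = self.correct(candidates)
        return self.select_target(corrected)


# Toy stand-ins so the sketch runs end to end; they are NOT real OCR or NER.
pipeline = DatePipeline(
    preprocess=lambda img: img,
    locate_segment=lambda img: img.decode("utf-8"),
    label_dates=lambda text: [w for w in text.split() if w[:1].isdigit()],
    correct=lambda dates: [d.replace("/", "-") for d in dates],
    select_target=lambda dates: max(dates),  # assumed rule: keep the latest date
)
result = pipeline.run(b"Signed on 2021/09/08 effective 2021/10/01")
print(result)
```

Keeping each stage behind a plain callable makes it possible to swap, for example, the selection rule in S105 without touching the other stages.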

In one embodiment, the step S101 includes:

carrying out direction correction processing on the file image;

detecting a seal or a watermark in the file image by using YOLOv5;

and removing the detected seal and watermark through a generative adversarial network.

In this embodiment, in the file image preprocessing stage, the orientation of the file image is corrected based on image preprocessing techniques, and the seals (mainly square seals, round seals, joint seals, stamp-tax seals, and other seals) and watermarks in the file image are detected. Once a seal or watermark is detected, it is removed by a GAN (generative adversarial network), so as to reduce the effect of such noise on the accuracy of element extraction. In addition, a detected seal can also serve as a feature for identifying the signature page. In a specific application scenario, approximately 500 real samples of contracts bearing seals were selected, a script was written to generate more than 100,000 seal images, and seal removal was implemented based on the GAN, with the correction rate reaching 34% (correction rate = recognition accuracy after seal removal minus recognition accuracy before seal removal).
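The correction rate defined above is a plain difference of two accuracies. A minimal sketch of the arithmetic follows; the function name is hypothetical and the two accuracy values are invented purely to illustrate the calculation, they are not measurements from this specification.

```python
def correction_rate(acc_before: float, acc_after: float) -> float:
    """Recognition accuracy after seal removal minus recognition accuracy before it."""
    for acc in (acc_before, acc_after):
        if not 0.0 <= acc <= 1.0:
            raise ValueError("accuracies must be fractions in [0, 1]")
    return acc_after - acc_before


# Hypothetical accuracies, chosen only to show the arithmetic.
rate = correction_rate(acc_before=0.60, acc_after=0.85)
print(f"correction rate: {rate:.0%}")
```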

In one embodiment, the step S102 includes:

performing character recognition on the file image through a print OCR technology;

positioning the associated information of the date to be extracted based on the character recognition result, and taking the positioning result as the target text segment; the associated information is page information corresponding to the date to be extracted or keyword information associated with the date to be extracted.

In this embodiment, since the clarity and recognizability of the file image are greatly improved by preprocessing, character recognition can be performed on the file image using print OCR. During character recognition, the target text segment corresponding to the date to be extracted is obtained according to the associated information (such as page information or keywords) of that date. The file image may contain interfering dates besides the date to be extracted; for example, if the date to be extracted is the contract date, the interfering dates may be a completion date, an effective date, and so on. At the same time, the date to be extracted usually appears in a fixed position; for example, the signing date usually appears on three types of pages, namely the cover page, the first page, and the signature page. Therefore, this embodiment uses the page information of the date to be extracted as auxiliary positioning information, so that the positioning accuracy of the target text segment can be improved.
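The positioning idea can be sketched as a filter over OCR output. In the sketch below, the `OcrLine` type, the page-type labels, and the keyword list are all assumptions made for illustration. The specification allows either page information or keyword information as the associated information; this sketch requires both, which is only one possible design choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrLine:
    page_type: str  # hypothetical page label: "cover", "first", "signature", "body"
    text: str       # text recognized by print OCR on that page


# Assumed configuration for a signing date: pages where it usually appears
# and keywords that usually accompany it.
TARGET_PAGES = {"cover", "first", "signature"}
KEYWORDS = ("signing date", "date of signing", "signed on")


def locate_target_segments(lines: List[OcrLine]) -> List[str]:
    """Keep OCR lines that sit on an expected page type and mention a keyword."""
    hits = []
    for line in lines:
        lowered = line.text.lower()
        if line.page_type in TARGET_PAGES and any(k in lowered for k in KEYWORDS):
            hits.append(line.text)
    return hits


ocr_lines = [
    OcrLine("cover", "Signing date: 8 September 2021"),
    OcrLine("body", "Completion date: 1 March 2022"),
    OcrLine("body", "The signing date shall not affect the warranty period."),
    OcrLine("signature", "Signed on 2021-09-08 by Party A"),
]
segments = locate_target_segments(ocr_lines)
print(segments)
```

The completion date is dropped because it matches no keyword, and the body-page mention of the signing date is dropped because of its page type, which is the kind of interference the page information is meant to suppress.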

In one embodiment, the step S103 includes:

extracting text features of the target text segment by using a BERT pre-trained model;

extracting target characteristics required by entity identification from the text characteristics through a Bi-LSTM network;

and decoding the target characteristics by adopting a conditional random field to obtain a corresponding labeling sequence, and outputting the labeling sequence as the date text segment.

In this embodiment, the target text segment is labeled based on the NER technique to obtain the date text segment. Specifically, text features in the target text segment are extracted by using a BERT pre-trained model, and feature vectors are constructed from the text features; the features required for entity recognition are then extracted from these vectors through a Bi-LSTM network (bidirectional long short-term memory recurrent neural network), and a decoding operation is performed based on a conditional random field (CRF) to obtain the date text segment.
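The specification does not state how the CRF decoding is carried out; a common choice for a linear-chain CRF is Viterbi decoding, sketched below in plain Python. The emission scores stand in for the per-token outputs of the Bi-LSTM, and all numbers, like the single forbidden transition, are invented for illustration.

```python
from typing import Dict, List


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the highest-scoring label sequence given per-token emission scores
    and label-to-label transition scores (both treated as log-space scores)."""
    labels = list(emissions[0])
    best = [dict(emissions[0])]      # best[t][y]: best score of a path ending in y at t
    back: List[Dict[str, str]] = []  # back[t-1][y]: previous label on that best path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


LABELS = ["O", "B_sign", "I_sign"]
# All transitions cost nothing except O -> I_sign, which is effectively forbidden.
transitions = {a: {b: 0.0 for b in LABELS} for a in LABELS}
transitions["O"]["I_sign"] = -1e4

# Made-up emission scores for three tokens.
emissions = [
    {"O": 2.0, "B_sign": 0.0, "I_sign": 0.0},
    {"O": 0.0, "B_sign": 1.0, "I_sign": 1.5},
    {"O": 0.0, "B_sign": 0.5, "I_sign": 2.0},
]
decoded = viterbi_decode(emissions, transitions)
print(decoded)
```

Taking the best label per token independently would give O, I_sign, I_sign here; the transition scores are what push the second token to B_sign, which is the reason for putting a CRF layer on top of the Bi-LSTM outputs.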

Of course, before prediction and labeling, the NER model may be trained and optimized with a training sample set to improve labeling efficiency and accuracy. For example, for labeling the signing date, 3000 real contract samples were selected, the extracted text was augmented to about 300,000 samples, and the whole NER training process was completed based on the BERT pre-trained model + Bi-LSTM network + CRF, where the labels corresponding to the signing date are B_sign (the starting character of the signing date) and I_sign (the characters of the signing date other than the starting character). The trained NER model then predicts a label for each text token, and the text segments whose predicted labels are B_sign and I_sign are extracted and returned as candidate signing date values.
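Turning the predicted B_sign and I_sign labels into candidate signing dates amounts to collecting label runs. The sketch below assumes the tokens and labels are already aligned one to one; the function name and the example tokens are hypothetical.

```python
from typing import List


def extract_sign_spans(tokens: List[str], labels: List[str]) -> List[str]:
    """Collect maximal runs that start with B_sign and continue with I_sign."""
    spans: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == "B_sign":
            if current:                      # a new span starts right after another
                spans.append("".join(current))
            current = [token]
        elif label == "I_sign" and current:
            current.append(token)
        else:                                # "O", or an I_sign with no opening B_sign
            if current:
                spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans


# Tokens are joined without separators, as for character- or subword-level tokens.
tokens = ["Signed", "on", "2021", "-", "09", "-", "08", "."]
labels = ["O", "O", "B_sign", "I_sign", "I_sign", "I_sign", "I_sign", "O"]
candidates = extract_sign_spans(tokens, labels)
print(candidates)
```

An I_sign that is not preceded by a B_sign is treated like O in this sketch; a stricter or more lenient policy is equally possible.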

In one embodiment, as shown in fig. 2, the step S104 includes: steps S201 to S204.

S201, acquiring the corresponding text boxes in the date text segment;

S202, performing binary classification on each text box by using a support vector machine to judge whether the text box is a handwritten image;

S203, if the text box is judged to be a handwritten image, recognizing the handwritten image through handwriting OCR, and correcting and post-processing the recognition result;

S204, if the text box is judged not to be a handwritten image, continuing to correct and post-process the date text segment.

In this embodiment, after the date text segment has been recognized by print OCR, the text boxes it contains are obtained, and a support vector machine (SVM) marks whether the image region corresponding to each text box is a handwritten image. If it is, the text in the handwritten image is recognized by handwriting OCR. It can be understood that handwriting OCR and print OCR differ in their recognition targets: handwriting OCR recognizes handwritten data, whereas print OCR recognizes printed data.
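The branching in S202 to S204 can be sketched as a routing function. The classifier and the handwriting OCR engine are passed in as callables because the specification does not fix their interfaces; the `TextBox` type and the toy stand-ins below are hypothetical and do not perform real classification or recognition.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TextBox:
    print_ocr_text: str  # text already produced by print OCR for this box
    image: bytes         # cropped image region of the box


def recognize_boxes(boxes: List[TextBox],
                    is_handwritten: Callable[[bytes], bool],
                    handwriting_ocr: Callable[[bytes], str]) -> List[str]:
    """Re-recognize handwritten boxes with handwriting OCR; keep print OCR text otherwise."""
    texts = []
    for box in boxes:
        if is_handwritten(box.image):                 # S202: binary classifier (an SVM in the source)
            texts.append(handwriting_ocr(box.image))  # S203: handwriting OCR
        else:
            texts.append(box.print_ocr_text)          # S204: keep the print OCR result
    return texts


# Toy stand-ins: a byte prefix plays the role of the classifier decision.
boxes = [
    TextBox("2021-09-08", b"PRINT"),
    TextBox("2O2l-O9-O8", b"HW2021-09-08"),
]
texts = recognize_boxes(
    boxes,
    is_handwritten=lambda img: img.startswith(b"HW"),
    handwriting_ocr=lambda img: img[2:].decode("ascii"),
)
print(texts)
```

Both branches then feed the same correction and post-processing step described in the text.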

In an embodiment, the step S104 further includes:

performing date format verification, text error correction, and unified format processing on the date text segment;

and auditing the date text based on the scene of the date to be extracted.

This embodiment covers the post-processing and business rule stage. Post-processing includes date format checking, text error correction (mainly of visually similar characters), format unification, and the like. Business rules are configured according to customized requirements, for example taking the latest of several contract dates as the final target value. The customized requirements are mainly driven by audit requirements, and different audit scenarios impose different requirements on the contract; for example, one scenario may require the final contract date field to be the latest of several contract dates, while another may require it to be the contract date on the cover page.
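A minimal sketch of this stage is given below, assuming ISO 8601 as the unified format, a small hand-written table of characters that OCR tends to confuse with digits, and the "latest date" business rule. The table, the regular expression, and the function names are illustrative assumptions rather than details from this specification.

```python
import re
from datetime import date
from typing import List, Optional

# Assumed table of characters that are easily confused with digits.
# It is applied to short date segments only, not to arbitrary prose.
CONFUSABLE = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})
# Year, month, day separated by any non-digit characters (-, /, 年, 月, ...).
DATE_RE = re.compile(r"(\d{4})\D+(\d{1,2})\D+(\d{1,2})")


def normalize_date(text: str) -> Optional[date]:
    """Correct confusable characters, check the format, and return a date or None."""
    match = DATE_RE.search(text.translate(CONFUSABLE))
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)  # rejects impossible dates such as month 13
    except ValueError:
        return None


def pick_latest(candidates: List[str]) -> Optional[str]:
    """Business rule example: among several contract dates, keep the latest one."""
    valid = [d for d in map(normalize_date, candidates) if d is not None]
    return max(valid).isoformat() if valid else None


target = pick_latest(["2O21年9月8日", "2021/08/30", "2021-13-01"])
print(target)
```

A cover-page rule would replace `pick_latest` with a function that selects by page information instead of by value, leaving `normalize_date` unchanged.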

In an embodiment, the performing date format check, text error correction and unified format processing on the date text field includes:

and calculating an error correction score probability value of the date text by using an N-gram model, and correcting the date text based on the error correction score probability value.

In this embodiment, the N-gram is an algorithm based on a statistical language model. The basic idea is to slide a window of size N over the content of the date text byte by byte, forming a sequence of byte fragments of length N. Each byte fragment is called a gram; the occurrence frequency of all grams is counted and filtered according to a preset threshold to form a list of key grams, that is, the vector feature space of the date text, in which each gram is one feature vector dimension. The model assumes that the occurrence of the Nth word depends only on the previous N-1 words and not on any other words, so the probability of a whole sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained directly from a corpus by counting how many times the N words occur together.
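The scoring idea can be sketched with the smallest useful case, N = 2. The sketch below works on characters rather than bytes, uses add-one smoothing so that unseen pairs do not get zero probability, and is trained on a three-line toy corpus; all of these are assumptions made for illustration. It then chooses between two candidate readings that differ only in a pair of visually similar characters.

```python
import math
from collections import Counter
from typing import Iterable, List


class CharBigramModel:
    """Character bigram model (N = 2) with add-one smoothing."""

    def __init__(self, corpus: Iterable[str]) -> None:
        self.unigrams: Counter = Counter()  # how often each character is a context
        self.bigrams: Counter = Counter()   # how often each (previous, current) pair occurs
        self.vocab = set()
        for sentence in corpus:
            chars = ["<s>"] + list(sentence) + ["</s>"]
            self.vocab.update(chars)
            self.unigrams.update(chars[:-1])
            self.bigrams.update(zip(chars, chars[1:]))

    def log_prob(self, sentence: str) -> float:
        """Sum of log P(current | previous), i.e. the log of the product of probabilities."""
        chars = ["<s>"] + list(sentence) + ["</s>"]
        v = len(self.vocab)
        return sum(
            math.log((self.bigrams[(p, c)] + 1) / (self.unigrams[p] + v))
            for p, c in zip(chars, chars[1:])
        )


def best_correction(model: CharBigramModel, candidates: List[str]) -> str:
    """Pick the candidate string that the model scores as most probable."""
    return max(candidates, key=model.log_prob)


# Toy corpus of well-formed date strings.
corpus = ["2021年9月8日", "2020年12月31日", "2021年10月1日"]
model = CharBigramModel(corpus)
# "曰" is visually similar to "日" and is a plausible OCR confusion.
fixed = best_correction(model, ["2021年9月8曰", "2021年9月8日"])
print(fixed)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied; the ranking of candidates is the same as with the raw product.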

Fig. 3 is a schematic block diagram of a date extracting apparatus 300 according to an embodiment of the present invention, where the apparatus 300 includes:

the preprocessing unit 301 is configured to acquire a file image including a date to be extracted, and preprocess the file image;

a first obtaining unit 302, configured to perform OCR recognition on the preprocessed file image, and obtain a target text segment including a date to be extracted in combination with associated information of the date to be extracted;

a label labeling unit 303, configured to label the target text segment by using an NER technique, and output the target text segment to obtain a date text segment;

a post-processing unit 304, configured to perform classification prediction on the date text segment through a classification model, and correct and post-process the date text segment based on the classification prediction result;

a date extracting unit 305, configured to acquire a target element of a date to be extracted according to the correction and post-processing result, and extract a date according to the target element.

In one embodiment, the preprocessing unit 301 includes:

the correcting unit is used for carrying out direction correction processing on the file image;

the detection unit is used for detecting the seal or the watermark in the file image by using YOLOv5;

and the removing unit is used for removing the detected seal and watermark through a generative adversarial network.

In an embodiment, the first obtaining unit 302 includes:

the character recognition unit is used for carrying out character recognition on the file image through a print OCR technology;

the positioning unit is used for positioning the associated information of the date to be extracted based on the character recognition result and taking the positioning result as the target text segment; the associated information is page information corresponding to the date to be extracted or keyword information associated with the date to be extracted.

In one embodiment, the label labeling unit 303 includes:

the first extraction unit is used for extracting text features of the target text segment by using a BERT pre-trained model;

the second extraction unit is used for extracting target characteristics required by entity identification from the text characteristics through a Bi-LSTM network;

and the decoding output unit is used for decoding the target characteristics by adopting a conditional random field to obtain a corresponding labeling sequence and outputting the labeling sequence as the date text segment.

In one embodiment, as shown in fig. 4, the post-processing unit 304 includes:

a second obtaining unit 401, configured to obtain the corresponding text boxes in the date text segment;

a determining unit 402, configured to perform binary classification on each text box by using a support vector machine to determine whether the text box is a handwritten image;

a handwriting recognition unit 403, configured to, if it is determined that the text box is a handwritten image, recognize the handwritten image by using handwriting OCR, and correct and post-process the recognition result;

and a correcting unit 404, configured to continue to correct and post-process the date text segment if it is determined that the text box is not a handwritten image.

In one embodiment, the post-processing unit 304 further comprises:

the verification processing unit is used for performing date format verification, text error correction, and unified format processing on the date text segment;

and the auditing unit is used for auditing the date text based on the scene of the date to be extracted.

In one embodiment, the verification processing unit includes:

and the probability value calculating unit is used for calculating the error correction score probability value of the date text by using an N-gram model and correcting the date text based on the error correction score probability value.

Since the embodiments of the apparatus portion and the method portion correspond to each other, please refer to the description of the embodiments of the method portion for the embodiments of the apparatus portion, which is not repeated here.

Embodiments of the present invention also provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed, the steps provided by the above embodiments can be implemented. The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The embodiment of the present invention further provides a computer device, which may include a memory and a processor, where the memory stores a computer program, and the processor may implement the steps provided in the above embodiments when calling the computer program in the memory. Of course, the computer device may also include various network interfaces, power supplies, and the like.

The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A date extraction method, comprising:

2. The date extraction method according to claim 1, wherein the acquiring of the document image containing the date to be extracted, and the preprocessing of the document image, comprise:

carrying out direction correction processing on the file image;

3. The date extraction method according to claim 1, wherein the OCR recognition of the preprocessed file image and the obtaining of the target text segment containing the date to be extracted in combination with the associated information of the date to be extracted, comprises:

4. The date extraction method according to claim 1, wherein the labeling the target text segment by using the NER technique and outputting the obtained date text segment includes:

5. The date extraction method according to claim 1, wherein the classifying and predicting the date text segment by the classification model and performing correction and post-processing on the date text segment based on the classification and prediction result comprises:

acquiring the corresponding text boxes in the date text segment;

performing binary classification on each text box by using a support vector machine to judge whether the text box is a handwritten image;

if the text box is judged to be the handwritten image, recognizing the handwritten image through a handwritten OCR technology, and correcting and post-processing a recognition result;

and if the text box is judged not to be the handwritten image, continuing to correct and post-process the date text segment.

6. The date extraction method according to claim 1, wherein the classifying and predicting the date text segment by the classification model and performing the correction and post-processing on the date text segment based on the classification and prediction result further comprises:

and auditing the date text based on the scene of the date to be extracted.

7. The date extraction method of claim 6, wherein the date format check, text error correction and unified format processing of the date text segment comprises:

8. A date extraction device characterized by comprising:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the date extraction method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, implements the date extraction method according to any one of claims 1 to 7.