CN111046864A - Method and system for automatically extracting five elements of contract scanning piece - Google Patents

Method and system for automatically extracting five elements of contract scanning piece

Info

Publication number
CN111046864A
Authority
CN
China
Prior art keywords
contract
elements
module
scanning piece
extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911286082.XA
Other languages
Chinese (zh)
Inventor
王洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yingjia Yunchuang Technology Shenzhen Co Ltd
Original Assignee
Yingjia Yunchuang Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yingjia Yunchuang Technology Shenzhen Co Ltd
Priority to CN201911286082.XA
Publication of CN111046864A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Character Discrimination (AREA)
  • Character Input (AREA)

Abstract

The invention discloses a method and a system for automatically extracting five elements of a contract scan. The method comprises the following steps: step one, previewing the contract scan online; step two, finding the positions of the five elements in the contract scan and capturing the content of each element as a separate screenshot; step three, after the clipboard receives the screenshot, pushing it for OCR recognition; step four, writing the OCR recognition result to the clipboard; and step five, pasting the extraction result and checking its correctness. The advantages are that the review of contract elements is completed during extraction, so extraction and review proceed simultaneously and the accuracy of the extraction result is more dependable; only the key parts are captured for OCR recognition, which avoids waiting for OCR of unnecessary content when the contract has many pages; and, by monitoring the computer's clipboard, a picture is captured and copied and text is pasted out, which assists the extraction of contract elements.

Description

Method and system for automatically extracting five elements of contract scanning piece
Technical Field
The invention relates to the technical field of OCR, in particular to a method and a system for automatically extracting five elements of a contract scanning piece.
Background
OCR (Optical Character Recognition) is a technology that, for printed characters, optically converts the characters in a paper document into a black-and-white dot-matrix image file, and then uses recognition software to convert the characters in the image into text that word-processing software can further edit and process.
An OCR system converts an image so that the graphics in it are preserved, while table data and text in the image are uniformly turned into computer text. This reduces the storage required for image data, allows the recognized text to be reused and analyzed, and saves the labor and time of keyboard entry.
In the prior art, a contract auditing service uses OCR to convert the image file of a contract scan into a text file, and then extracts the five elements of the contract by combining their positions in the contract content, that is, by using the position coordinates of the five elements together with keyword features (such as contracting party, contractor, first party, second party), so as to complete the review of the key elements.
The problems with this approach are:
1. it depends heavily on the content and format of the contract template, so the extraction results show larger errors across different types of contracts;
2. the positions of the five elements in the contract template must be defined in advance, and many rules must be predefined when there are multiple services and multiple contract types;
3. an extraction template must be selected at extraction time, and when there are many service types and contract types, choosing the corresponding template consumes considerable labor;
4. because the OCR result can differ from the original contract text, both the recognized keyword and the recognized element content following it may not match the original, so the correctness of the extraction result still has to be checked manually after extraction;
5. extraction operates on text content, but OCR is relatively time-consuming, so when a contract runs to several hundred pages the wait before extraction is very long.
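Problem 4 can be illustrated with a minimal sketch of keyword-based extraction. The keyword pattern, the sample text and the function name are hypothetical; the source names keywords such as "first party" and "second party" but gives no concrete rules. The point is only that a single misread character in a keyword silently drops the element that follows it.

```python
import re

# Hypothetical keyword rule: a line starting with "Party A" or "Party B",
# followed by a colon and the element content.
KEYWORD_PATTERN = re.compile(r"^(Party A|Party B)\s*:\s*(.+)$", re.MULTILINE)


def extract_by_keyword(ocr_text: str) -> dict[str, str]:
    """Return {keyword: value} for every line that starts with a known keyword."""
    return {m.group(1): m.group(2).strip() for m in KEYWORD_PATTERN.finditer(ocr_text)}


# Hypothetical OCR output for the same page, once clean and once with the
# capital "B" misread as the digit "8".
clean_text = "Party A: Example Trading Co.\nParty B: Sample Logistics Ltd."
noisy_text = "Party A: Example Trading Co.\nParty 8: Sample Logistics Ltd."

print(extract_by_keyword(clean_text))  # both parties found
print(extract_by_keyword(noisy_text))  # the second party is silently missing
```

Nothing in the noisy result signals that an element was lost, which is why a manual check is still needed after this kind of extraction.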
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method and a system for automatically extracting five elements of a contract scanning piece.
The technical scheme of the invention is as follows:
A method for automatically extracting five elements of a contract scan comprises the following steps:
step one, previewing the contract scan online;
step two, finding the positions of the five elements in the contract scan and capturing the content of each element as a separate screenshot;
step three, after the clipboard receives the screenshot, pushing it for OCR recognition;
step four, writing the OCR recognition result to the clipboard;
and step five, pasting the extraction result and checking its correctness.
In step one, the contract scan is an image file.
In step two, the five elements in the contract scan comprise:
(1) both parties should have the qualification and capacity to perform legal acts;
(2) the parties reach a consistent expression of intent on a voluntary basis;
(3) the subject matter and content of the contract must be legal;
(4) each party to the contract must provide compensation to the other;
(5) the contract must conform to the form prescribed by law.
In step three, OCR recognition means recognizing the text content in the captured picture.
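Steps three and four amount to a hand-off on the clipboard: a picture goes in and text comes out. The sketch below shows that hand-off with an in-memory stand-in for the clipboard and a stub in place of OCR; `InMemoryClipboard`, `process_clipboard` and `fake_ocr` are illustrative names, not part of the source. In a real tool the picture might be read with Pillow's `ImageGrab.grabclipboard()` and recognized with pytesseract's `image_to_string`, but the patent names no library, so that pairing is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class InMemoryClipboard:
    """Stand-in for the system clipboard: holds either image bytes or text."""
    content: Optional[Union[bytes, str]] = None


def process_clipboard(clipboard: InMemoryClipboard,
                      ocr: Callable[[bytes], str]) -> bool:
    """Steps three and four: if the clipboard holds a picture, run OCR on it
    and replace the picture with the recognized text.
    Returns True when a picture was processed, False otherwise."""
    if not isinstance(clipboard.content, bytes):
        return False  # empty clipboard, or it already holds text
    clipboard.content = ocr(clipboard.content)
    return True


def fake_ocr(image: bytes) -> str:
    """Hypothetical OCR stub: treats the image bytes as UTF-8 text."""
    return image.decode("utf-8")


clipboard = InMemoryClipboard()
# Step two: the user's screenshot lands on the clipboard (sample content).
clipboard.content = "Party A: Example Trading Co.".encode("utf-8")
# Steps three and four: the picture is recognized and the text written back.
process_clipboard(clipboard, fake_ocr)
# Step five: pasting now yields text instead of a picture.
pasted = clipboard.content
```

Checking the content type before running OCR keeps the routine from reprocessing its own text output on the next pass.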
A system for automatically extracting five elements of a contract scan comprises a preview module, a screenshot module, a clipboard monitoring module and an OCR module;
the preview module is used for quickly previewing the contract scan and finding the positions of the five elements in it;
the screenshot module is connected with the preview module and captures the content of each of the five elements in the contract scan as a separate screenshot;
the clipboard monitoring module is connected with the screenshot module and passes the received screenshots of the five contract elements to the OCR module;
the OCR module is connected with the clipboard monitoring module, recognizes the text content in the captured picture through OCR recognition, and returns the recognized text content to the clipboard monitoring module; finally, the extraction result is pasted so that its correctness can be verified.
The contract scan is an image file, and the preview module supports opening image files online.
The five elements in the contract scan comprise:
(1) both parties should have the qualification and capacity to perform legal acts;
(2) the parties reach a consistent expression of intent on a voluntary basis;
(3) the subject matter and content of the contract must be legal;
(4) each party to the contract must provide compensation to the other;
(5) the contract must conform to the form prescribed by law.
Compared with the prior art, the invention has the following beneficial effects:
1. the review of contract elements is completed during extraction, so extraction and review proceed simultaneously and the accuracy of the extraction result is more dependable and more reassuring;
2. only the key parts are captured for OCR recognition, which avoids waiting for OCR of unnecessary content when the contract has many pages;
3. by monitoring the computer's clipboard, a picture is captured and copied and text is pasted out, which assists the extraction of contract elements.
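The second benefit can be made concrete with a little arithmetic. All figures below are hypothetical and chosen only to show the shape of the comparison; the source gives no timings. The cost of whole-document OCR grows with the page count, while the cost of recognizing five captured regions does not depend on it.

```python
def full_document_ocr_seconds(pages: int, seconds_per_page: float) -> float:
    """Time to OCR every page before extraction can start."""
    return pages * seconds_per_page


def cropped_ocr_seconds(crops: int, seconds_per_crop: float) -> float:
    """Time to OCR only the captured element regions."""
    return crops * seconds_per_crop


# Hypothetical figures, not measurements.
PAGES = 300
SECONDS_PER_PAGE = 2.0
SECONDS_PER_CROP = 0.5

full = full_document_ocr_seconds(PAGES, SECONDS_PER_PAGE)  # 300 * 2.0
cropped = cropped_ocr_seconds(5, SECONDS_PER_CROP)         # 5 * 0.5
print(f"whole document: {full} s, five crops: {cropped} s")
```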
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the method for automatically extracting five elements of a contract scan according to embodiment one of the present invention;
Fig. 2 is a schematic block diagram of the system for automatically extracting five elements of a contract scan according to embodiment two of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Example one
Referring to Fig. 1, this embodiment provides a method for automatically extracting five elements of a contract scan, comprising the following steps:
Step one, previewing the contract scan online. The contract scan is an image file.
Step two, finding the positions of the five elements in the contract scan and capturing the content of each element as a separate screenshot. The five elements in the contract scan comprise:
(1) both parties should have the qualification and capacity to perform legal acts;
(2) the parties reach a consistent expression of intent on a voluntary basis;
(3) the subject matter and content of the contract must be legal;
(4) each party to the contract must provide compensation to the other;
(5) the contract must conform to the form prescribed by law.
Step three, after the clipboard receives the screenshot, pushing it for OCR recognition. OCR recognition here means recognizing the text content in the captured picture.
Step four, writing the OCR recognition result (that is, the text content of the screenshot) to the clipboard.
Step five, pasting the extraction result and checking its correctness. The pasted result is the text content of the screenshot, so checking the extraction result becomes simple, clear and less error-prone.
With this scheme, the review of contract elements is completed during extraction, so extraction and review proceed simultaneously and the accuracy of the extraction result is more dependable; only the key parts are captured for OCR recognition, which avoids waiting for OCR of unnecessary content when the contract has many pages; and, by monitoring the computer's clipboard, a picture is captured and copied and text is pasted out, which assists the extraction of contract elements.
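Step five pairs each pasted result with a human check. One way to keep track of that is a small record with five slots, each holding the pasted text and a confirmation flag. The sketch below is an assumption about how such bookkeeping could look; `ElementEntry`, `ExtractionSheet` and the sample text are illustrative and do not come from the source.

```python
from dataclasses import dataclass, field


@dataclass
class ElementEntry:
    text: str = ""
    confirmed: bool = False


@dataclass
class ExtractionSheet:
    """Holds the five pasted elements and the reviewer's confirmation of each."""
    entries: dict[int, ElementEntry] = field(
        default_factory=lambda: {i: ElementEntry() for i in range(1, 6)}
    )

    def paste(self, index: int, text: str) -> None:
        # Pasting new text always resets the confirmation for that element.
        self.entries[index] = ElementEntry(text=text, confirmed=False)

    def confirm(self, index: int) -> None:
        # The reviewer confirms only after comparing the text with the scan.
        if not self.entries[index].text:
            raise ValueError(f"element {index} has no pasted text to confirm")
        self.entries[index].confirmed = True

    def pending(self) -> list[int]:
        """Indices of elements that are still empty or unconfirmed."""
        return [i for i, entry in self.entries.items() if not entry.confirmed]


sheet = ExtractionSheet()
sheet.paste(1, "Party A: Example Trading Co.")  # hypothetical pasted text
sheet.confirm(1)
print(sheet.pending())  # elements still waiting for extraction or review
```

Resetting the flag on every paste means a corrected re-extraction cannot inherit an earlier confirmation.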
Example two
Referring to Fig. 2, this embodiment provides a system for automatically extracting five elements of a contract scan, comprising a preview module 1, a screenshot module 2, a clipboard monitoring module 3 and an OCR module 4;
the preview module 1 is used for quickly previewing the contract scan and finding the positions of the five elements in it;
the screenshot module 2 is connected with the preview module 1 and captures the content of each of the five elements in the contract scan as a separate screenshot;
the clipboard monitoring module 3 is connected with the screenshot module 2 and passes the received screenshots of the five contract elements to the OCR module 4;
the OCR module 4 is connected with the clipboard monitoring module 3, recognizes the text content in the captured picture through OCR recognition, and returns the recognized text content to the clipboard monitoring module 3. Finally, the extraction result is pasted so that its correctness can be verified; the pasted result is the text content of the screenshot, so checking the extraction result becomes simple, clear and less error-prone.
The contract scan is an image file, and the preview module 1 supports opening image files online.
The five elements in the contract scan comprise:
(1) both parties should have the qualification and capacity to perform legal acts;
(2) the parties reach a consistent expression of intent on a voluntary basis;
(3) the subject matter and content of the contract must be legal;
(4) each party to the contract must provide compensation to the other;
(5) the contract must conform to the form prescribed by law.
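The connections between the four modules can be sketched as plain classes wired in the order the embodiment describes. This is an illustration under simplifying assumptions, not the patented implementation: a page is modelled as rows of characters instead of pixels, the OCR module is a stub, and all class and method names are hypothetical.

```python
from typing import Optional, Union

Image = tuple[str, ...]  # stand-in for pixel data: one string per row


class PreviewModule:
    """Module 1: holds the pages of the contract scan for viewing."""
    def __init__(self, pages: list[list[str]]):
        self.pages = pages

    def page(self, number: int) -> list[str]:
        return self.pages[number]


class ScreenshotModule:
    """Module 2: captures a rectangular region of a previewed page."""
    def __init__(self, preview: PreviewModule):
        self.preview = preview

    def capture(self, page: int, top: int, bottom: int, left: int, right: int) -> Image:
        rows = self.preview.page(page)[top:bottom]
        return tuple(row[left:right] for row in rows)


class OcrModule:
    """Module 4: stub recognizer; real OCR would decode pixels."""
    def recognize(self, image: Image) -> str:
        return "\n".join(row.strip() for row in image)


class ClipboardMonitorModule:
    """Module 3: receives screenshots, forwards them to OCR, keeps the text."""
    def __init__(self, ocr: OcrModule):
        self.ocr = ocr
        self.content: Optional[Union[Image, str]] = None

    def copy(self, data: Image) -> None:
        self.content = data

    def poll(self) -> bool:
        """Process the clipboard once; True if a screenshot was recognized."""
        if isinstance(self.content, tuple):
            self.content = self.ocr.recognize(self.content)
            return True
        return False

    def paste(self) -> Optional[str]:
        return self.content if isinstance(self.content, str) else None


# Hypothetical one-page contract scan.
page = [
    "SALES CONTRACT          ",
    "Party A: Example Co.    ",
    "Party B: Sample Ltd.    ",
]
preview = PreviewModule([page])
screenshot = ScreenshotModule(preview)
monitor = ClipboardMonitorModule(OcrModule())

monitor.copy(screenshot.capture(page=0, top=1, bottom=2, left=0, right=24))
monitor.poll()
result = monitor.paste()
```

Only the captured region reaches the OCR module, so the rest of the page is never recognized.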
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A method for automatically extracting five elements of a contract scan, characterized by comprising the following steps:
step one, previewing the contract scan online;
step two, finding the positions of the five elements in the contract scan and capturing the content of each element as a separate screenshot;
step three, after the clipboard receives the screenshot, pushing it for OCR recognition;
step four, writing the OCR recognition result to the clipboard;
and step five, pasting the extraction result and checking its correctness.
2. The method for automatically extracting five elements of a contract scan according to claim 1, wherein in step one, the contract scan is an image file.
3. The method for automatically extracting five elements of a contract scan according to claim 1, wherein in step two, the five elements in the contract scan comprise:
(1) both parties should have the qualification and capacity to perform legal acts;
(2) the parties reach a consistent expression of intent on a voluntary basis;
(3) the subject matter and content of the contract must be legal;
(4) each party to the contract must provide compensation to the other;
(5) the contract must conform to the form prescribed by law.
4. The method for automatically extracting five elements of a contract scan according to claim 1, wherein in step three, the OCR recognition recognizes the text content in the captured picture.
5. A system for automatically extracting five elements of a contract scan, characterized by comprising a preview module, a screenshot module, a clipboard monitoring module and an OCR module;
the preview module is used for quickly previewing the contract scan and finding the positions of the five elements in it;
the screenshot module is connected with the preview module and captures the content of each of the five elements in the contract scan as a separate screenshot;
the clipboard monitoring module is connected with the screenshot module and passes the received screenshots of the five contract elements to the OCR module;
the OCR module is connected with the clipboard monitoring module, recognizes the text content in the captured picture through OCR recognition, and returns the recognized text content to the clipboard monitoring module; finally, the extraction result is pasted so that its correctness can be verified.
6. The system according to claim 5, wherein the contract scan is an image file, and the preview module supports opening image files online.
7. The system for automatically extracting five elements of a contract scan according to claim 5, wherein the five elements in the contract scan comprise:
(1) both parties should have the qualification and capacity to perform legal acts;
(2) the parties reach a consistent expression of intent on a voluntary basis;
(3) the subject matter and content of the contract must be legal;
(4) each party to the contract must provide compensation to the other;
(5) the contract must conform to the form prescribed by law.
CN201911286082.XA 2019-12-13 2019-12-13 Method and system for automatically extracting five elements of contract scanning piece Pending CN111046864A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911286082.XA CN111046864A (en) 2019-12-13 2019-12-13 Method and system for automatically extracting five elements of contract scanning piece

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911286082.XA CN111046864A (en) 2019-12-13 2019-12-13 Method and system for automatically extracting five elements of contract scanning piece

Publications (1)

Publication Number Publication Date
CN111046864A true CN111046864A (en) 2020-04-21

Family

ID=70236336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911286082.XA Pending CN111046864A (en) 2019-12-13 2019-12-13 Method and system for automatically extracting five elements of contract scanning piece

Country Status (1)

Country Link
CN (1) CN111046864A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021232593A1 (en) * 2020-05-22 2021-11-25 平安国际智慧城市科技股份有限公司 Product protocol character recognition-based method and apparatus for recognizing malicious terms, and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919014A (en) * 2019-01-28 2019-06-21 平安科技(深圳)有限公司 OCR recognition methods and its electronic equipment
CN110222692A (en) * 2019-05-21 2019-09-10 深圳壹账通智能科技有限公司 A kind of contract method of calibration and relevant device

Similar Documents

Publication Publication Date Title
US8520889B2 (en) Automated generation of form definitions from hard-copy forms
US20050289182A1 (en) Document management system with enhanced intelligent document recognition capabilities
US9384389B1 (en) Detecting errors in recognized text
CN112052749A (en) Archive filing method and device, electronic equipment and computer readable storage medium
AU2015203150A1 (en) System and method for data extraction and searching
CN109598228B (en) Method and system for electronically recording and archiving paper files
CN108304815B (en) Data acquisition method, device, server and storage medium
CN105718554A (en) Document collaboration conversion method and system
US8953228B1 (en) Automatic assignment of note attributes using partial image recognition results
Kaur Text recognition applications for mobile devices
CN113850060A (en) Civil aviation document data identification and entry method and system
CN115116068A (en) Archive intelligent filing system based on OCR
KR100673198B1 (en) Image inputing system
CN111046864A (en) Method and system for automatically extracting five elements of contract scanning piece
WO2024012209A1 (en) Image recognition-based service processing method and apparatus, and storage medium
CN116343210B (en) File digitization management method and device
CN110059184B (en) Operation error collection and analysis method and system
CN116758550A (en) Text recognition method and device for form image, electronic equipment and storage medium
CN112348024A (en) Image-text identification method and system based on deep learning optimization network
CN112149673A (en) Multifunctional test rack based on optical recognition technology
CN112100630A (en) Identification method for confidential document
Panchal et al. Design and implementation of android application to extract text from images by using tesseract for English and Hindi
CN115471855A (en) Contract checking system and method based on 5G and optical character recognition
US11093191B2 (en) Information processing apparatus and non-transitory computer readable medium capable of extracting a job flow without use of attribute information included in job entries
CN117010842A (en) Substation two-ticket archiving method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination