CN112149679B

CN112149679B - Method and device for extracting document elements based on OCR character recognition

Info

Publication number: CN112149679B
Application number: CN202011015420.9A
Authority: CN
Inventors: 张朝壹; 李志芳; 侯文君; 邓倩楠; 李旭明; 陈毅彬
Original assignee: Beijing Leadal Technology Development Co ltd; Beijing Zhonghong Lida Xinchuang Technology Co ltd
Current assignee: Beijing Leadal Technology Development Co ltd; Beijing Zhonghong Lida Xinchuang Technology Co ltd
Priority date: 2020-09-24
Filing date: 2020-09-24
Publication date: 2022-09-23
Anticipated expiration: 2040-09-24
Also published as: CN112149679A

Abstract

The invention relates to a method and a device for extracting document elements based on OCR character recognition, belongs to the technical field of intelligent file processing, and solves the problems that the existing method wastes manpower and time and has low efficiency. The method comprises the following steps: scanning a paper official document file containing the official document element information to obtain an electronic official document file; dynamically generating an algorithm selection box based on the electronic official document file, and acquiring a corresponding algorithm based on the algorithm selection box; acquiring document element information in the electronic document file based on an algorithm; and storing the acquired document element information to the corresponding field position of the document element form in the service information processing system. The method is simple and easy to implement, realizes the rapid extraction of the document elements, saves manpower and cost, and improves efficiency.

Description

Method and device for extracting document elements based on OCR character recognition

Technical Field

The invention relates to the technical field of intelligent file processing, in particular to a method and a device for extracting document elements based on OCR character recognition.

Background

At present, incoming and outgoing official documents are typically scanned into electronic files by a scanner and then uploaded to the relevant information system for handling. Document elements often need to be extracted during handling, and in current office practice this is done mainly by hand: staff identify the document elements in each document and manually enter them into the business information processing system.

This manual approach has several drawbacks: the workload is heavy, mistakes are easy to make, and the work is highly repetitive; as the volume of documents to be processed grows, the labor and time costs become very large. Moreover, document content, and document element information in particular, must be highly accurate, so even minor data-entry errors are unacceptable.

Disclosure of Invention

In view of the foregoing analysis, embodiments of the present invention provide a method and an apparatus for extracting document elements based on OCR character recognition, so as to solve the problems of manpower and time waste and low efficiency of the existing method.

In one aspect, an embodiment of the present invention provides a method for extracting document elements based on OCR character recognition, including the following steps:

scanning a paper official document containing the official document element information to obtain an electronic official document;

dynamically generating an algorithm selection box based on the electronic official document file, and acquiring a corresponding algorithm based on the algorithm selection box; acquiring document element information in the electronic document file based on the algorithm;

and storing the acquired official document element information to a corresponding field position of an official document element form in a service information processing system.

Further, the algorithm comprises a coordinate region positioning method and a text rule positioning method; the acquiring of the document elements in the electronic document file based on the algorithm comprises:

dynamically generating a coordinate area positioning template rule selection frame based on the obtained coordinate area positioning method, obtaining a corresponding coordinate area positioning template rule based on the coordinate area positioning template rule selection frame, and obtaining document elements in the electronic document file according to the coordinate area positioning template rule; or,

dynamically generating a text template rule selection box based on the obtained text rule positioning method, obtaining a corresponding text template rule based on the text template rule selection box, and obtaining document elements in the electronic document file according to the text template rule.

Further, the coordinate area positioning template rule is obtained by:

scanning the paper official document template to obtain a plurality of corresponding electronic official document templates; the paper official document templates are paper official documents of various different categories;

selecting a rectangular area containing document elements in each electronic document template, and extracting coordinate range values, page numbers and font information of all document elements in the rectangular area by adopting an OCR (optical character recognition) technology;

and obtaining various coordinate area positioning template rules based on the coordinate range values, page numbers and font information of all document elements in each electronic document template, and storing the coordinate area positioning template rules in a database.
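The last step above, storing one coordinate area positioning template rule per document category in a database, could be sketched as follows. This is only an illustration under assumptions: the source does not name a database or a schema, so the SQLite table `region_rules`, the JSON layout of a rule, and the function names are all hypothetical.

```python
import json
import sqlite3
from typing import List, Optional


def open_rule_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the (hypothetical) rule database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS region_rules ("
        "category TEXT PRIMARY KEY, regions TEXT NOT NULL)"
    )
    return conn


def save_rule(conn: sqlite3.Connection, category: str, regions: List[dict]) -> None:
    """Store the rule for one document category, replacing any earlier version."""
    conn.execute(
        "INSERT OR REPLACE INTO region_rules (category, regions) VALUES (?, ?)",
        (category, json.dumps(regions)),
    )
    conn.commit()


def load_rule(conn: sqlite3.Connection, category: str) -> Optional[List[dict]]:
    """Return the stored rule for a category, or None if no rule exists."""
    row = conn.execute(
        "SELECT regions FROM region_rules WHERE category = ?", (category,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# Assumed rule layout: one entry per document element, holding the
# coordinate range, page number and font information named in the text.
notice_rule = [
    {"element": "title", "page": 1, "box": [90, 40, 300, 80], "font": "SimSun"},
    {"element": "security level", "page": 1, "box": [0, 0, 80, 30], "font": "SimHei"},
]
```

Keying the table by document category mirrors the statement that the templates are paper official documents of various different categories, each yielding its own rule.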

Further, the text template rule is obtained by:

obtaining an official document element extraction rule based on each paper official document template;

and generating a text template rule based on the extraction rule of the official document elements, and storing the text template rule in a database.

Further, the official document elements comprise main sending, title, security level, subject word, copy sending, issuing, undertaking unit, contact person and contact telephone; wherein the extraction rule of the official document elements comprises:

acquiring the "main sending" element based on the first paragraph in the electronic document file that ends with ":";

acquiring a 'title' element based on a paragraph before a paragraph where a 'main sending' element is located in an electronic document file;

acquiring the "subject word" element based on the paragraph in the electronic document file that contains "subject word:";

acquiring the "copy sending" element based on the paragraph in the electronic document file that contains "copy:" and ends with ".";

acquiring an 'issuing' element based on a paragraph containing 'issuing' in the electronic document file;

based on the last paragraph of the electronic official document, the elements of "undertaking organization", "contact person" and "contact phone" are obtained.

On the other hand, the embodiment of the invention provides a device for extracting document elements based on OCR character recognition, which comprises:

the scanner is used for scanning paper official document files containing official document element information to obtain electronic official document files;

the document element extraction module is used for dynamically generating an algorithm selection box according to the electronic document file and acquiring a corresponding algorithm based on the algorithm selection box; acquiring document element information in the electronic document file based on the algorithm;

and the official document element storage module is used for storing the acquired official document element information to the corresponding field position of the official document element form in the service information processing system.

dynamically generating a text template rule selection box based on the obtained text rule positioning method, obtaining a corresponding text template rule based on the text template rule selection box, and obtaining the official document elements in the electronic official document file according to the text template rule.

Further, the official document element extraction module obtains the coordinate area positioning template rule by:

Further, the official document element extraction module obtains the text template rule by the following method:

acquiring the "main sending" element based on the first paragraph in the electronic document file that ends with ":";

acquiring the "subject word" element based on the paragraph in the electronic document file that contains "subject word:";

acquiring the "copy sending" element based on the paragraph in the electronic document file that contains "copy:" and ends with ".";

acquiring an 'issuing' element based on a paragraph containing 'issuing' in the electronic official document;

Compared with the prior art, the invention can realize at least one of the following beneficial effects:

1. The paper document is digitized by a scanner to obtain an electronic document; the document elements in the document are then extracted by the coordinate area positioning method or the text rule positioning method and stored in the service information processing system.

2. By adopting either of the two algorithms, the coordinate area positioning method or the text rule positioning method, the system extracts document elements from the document file automatically. The method is simple and easy to implement; it addresses the waste of manpower and time and the low efficiency of manual extraction, saving labor and time and improving the efficiency of extracting document elements.

3. Obtaining the coordinate area positioning template rule or the text template rule in advance from the paper official document templates makes later extraction of document elements from official document files more convenient and faster.

In the invention, the technical schemes can be combined with each other to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.

Drawings

The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.

FIG. 1 is a flow diagram of a method for extracting document elements based on OCR text recognition in one embodiment;

FIG. 2 is a diagram showing the construction of an apparatus for extracting document elements based on OCR character recognition in another embodiment;

reference numerals:

100-scanner, 200-document element extraction module and 300-document element storage module.

Detailed Description

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.

The existing manual method of extracting document elements involves a heavy workload, is error-prone and highly repetitive, and incurs large labor and time costs as the volume of processed documents grows; its accuracy is poor and its efficiency low. The method and device for extracting document elements based on OCR character recognition address these problems by automating document element extraction, improving both the accuracy and the efficiency of extraction, and therefore have high practical value.

A specific embodiment of the present invention discloses a method for extracting document elements based on OCR character recognition, as shown in FIG. 1. The method includes the following steps S1 to S3.

Step S1: scanning the paper official document containing the official document element information to obtain the electronic official document. The document from which elements are to be extracted is generally a paper document, so it must first be scanned by a scanner into an electronic document before the document elements can be extracted.

Step S2: dynamically generating an algorithm selection box based on the electronic official document file, and acquiring a corresponding algorithm based on the algorithm selection box; and acquiring the document element information in the electronic document file based on the algorithm. Preferably, the algorithm comprises a coordinate area positioning method and a text rule positioning method. Specifically, after the electronic document file is obtained by scanning, the system dynamically generates an algorithm selection box in response to the user's document extraction start instruction. Through the algorithm selection box, the user selects either the coordinate area positioning method or the text rule positioning method to extract the document elements in the electronic document file. Which method to select depends on the user's actual situation; either method can extract the document elements rapidly.
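The selection step in S2 can be pictured as a registry of extraction algorithms from which the selection box is built and through which the user's choice is dispatched. The sketch below is a minimal illustration, not the patented implementation: the registry, the function names and the stub extractors are assumptions introduced here.

```python
from typing import Callable, Dict, List

# Hypothetical extractor signature: recognized document text in, elements out.
Extractor = Callable[[str], Dict[str, str]]


def extract_by_coordinate_area(document_text: str) -> Dict[str, str]:
    # Stub standing in for the coordinate area positioning method.
    return {"method": "coordinate_area"}


def extract_by_text_rule(document_text: str) -> Dict[str, str]:
    # Stub standing in for the text rule positioning method.
    return {"method": "text_rule"}


# Registry of the two algorithms named in the description.
ALGORITHMS: Dict[str, Extractor] = {
    "coordinate_area": extract_by_coordinate_area,
    "text_rule": extract_by_text_rule,
}


def build_selection_box() -> List[str]:
    """Options offered to the user; generated from the registry at call time."""
    return sorted(ALGORITHMS)


def run_selected(choice: str, document_text: str) -> Dict[str, str]:
    """Dispatch to the algorithm the user picked in the selection box."""
    if choice not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {choice!r}")
    return ALGORITHMS[choice](document_text)
```

Building the options from the registry rather than hard-coding them is one way to read "dynamically generating": adding a further algorithm would change the selection box without touching the dispatch code.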

Acquiring document elements in the electronic document file based on the algorithm, wherein the method comprises the following steps:

Specifically, after the algorithm selection box is dynamically generated based on the electronic document file, the user can select the coordinate area positioning method. A coordinate area positioning template rule selection box is then dynamically generated, the corresponding coordinate area positioning template rule is obtained through that selection box, and the system automatically obtains the document elements in the electronic document file according to the rule. The coordinate area positioning template rules are generated in advance and comprise a plurality of rules corresponding to different types of documents; in a specific implementation, the user selects from the pre-generated rules according to the type of the document from which elements are to be extracted. Alternatively, after the algorithm selection box is dynamically generated, the user can select the text rule positioning method. A text template rule selection box is then dynamically generated, the corresponding text template rule is obtained through that selection box, and the system automatically obtains the document elements in the electronic document file according to the text template rule.

By adopting either of the two algorithms, the coordinate area positioning method or the text rule positioning method, the system extracts document elements from the document file automatically. The method is simple and easy to implement; it addresses the waste of manpower and time and the low efficiency of manual extraction, saving labor and time and improving the efficiency of extracting document elements.

Preferably, the coordinate region positioning template rule is obtained in advance by:

scanning the paper official document template to obtain a plurality of corresponding electronic official document templates; wherein the paper official document templates are paper official documents of various different categories;

and obtaining various coordinate area positioning template rules based on the coordinate range values, page numbers and font information of all document elements in each electronic document template, and storing the coordinate area positioning template rules into a database.

Because the coordinates of the document elements are relatively fixed for each type of official document, a coordinate area positioning template rule can be generated from the coordinates, page numbers and font information of the document elements and then used to extract the elements from documents of that type. Specifically, each type of paper official document template is first scanned by a scanner to obtain the corresponding electronic official document templates, which are displayed on an interface. A rectangular area containing document elements is then selected with the mouse in each electronic document template, and OCR (optical character recognition) is used to extract the coordinate range values, page numbers and font information of all document elements in the rectangular area. The coordinate range values, page numbers and font information of the document elements for each type of document form one coordinate area positioning template rule; the resulting plurality of rules is stored in a database. After the user selects the corresponding coordinate area positioning template rule, the system locates the specific positions in the official document according to the coordinate information of the document elements and automatically acquires the document elements at those positions.
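Applying a coordinate area positioning template rule to a scanned document can be sketched as follows, assuming an OCR engine has already returned words with page numbers and bounding boxes. The `OcrWord` and `RegionRule` types and the centre-point containment test are assumptions made for illustration; the source does not specify how text is matched to a region, and font information is omitted here for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class OcrWord:
    """One recognized word with its page and bounding box (hypothetical type)."""
    text: str
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class RegionRule:
    """Coordinate range and page number for one document element."""
    element: str
    page: int
    box: Tuple[float, float, float, float]  # x0, y0, x1, y1


def extract_by_regions(words: List[OcrWord], rules: List[RegionRule]) -> Dict[str, str]:
    """Collect, for each rule, the words whose centre lies inside the rule's box."""
    result: Dict[str, str] = {}
    for rule in rules:
        rx0, ry0, rx1, ry1 = rule.box
        hits = []
        for w in words:
            cx = (w.x0 + w.x1) / 2
            cy = (w.y0 + w.y1) / 2
            if w.page == rule.page and rx0 <= cx <= rx1 and ry0 <= cy <= ry1:
                hits.append(w)
        # Simplified reading order: top to bottom, then left to right.
        hits.sort(key=lambda w: (w.y0, w.x0))
        result[rule.element] = " ".join(w.text for w in hits)
    return result
```

Sorting by the top edge assumes words on one line share the same `y0`; real OCR output usually needs line grouping with a tolerance before ordering.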

Obtaining the coordinate area positioning template rules in advance from the paper official document templates makes later extraction of document elements from official document files more convenient and faster.

Preferably, the text template rule is obtained by:

Specifically, the text template rule is central to extracting document elements with the text rule positioning method. Therefore, before that method is used, the document element extraction rules need to be summarized from each type of paper document template, generated into text template rules and stored in a database, so that document elements can later be extracted automatically.

Preferably, the official document elements comprise main sending, title, security level, subject word, copy sending, issuing, undertaking unit, contact person and contact telephone; wherein the extraction rule of the document elements comprises:

Specifically, each paper official document template comprises document elements such as main sending, title, security level, subject word, copy sending, issuing, undertaking unit, contact person and contact telephone. For each official document, the "main sending" element is the text of the first paragraph in the document that ends with ":"; extracting the text before the ":" in that paragraph yields the "main sending" element;

the "title" element is the text of the paragraph immediately before the paragraph where the "main sending" element is located; extracting the whole text of that paragraph yields the "title" element;

the "security level" element is the text of the first paragraph; extracting the whole text of the first paragraph yields the "security level" element;

the "subject word" element is the text of the paragraph in the document that contains "subject word:"; extracting the text after "subject word:" in that paragraph yields the "subject word" element;

the "copy sending" element is the text of the paragraph in the document that contains "copy:" and ends with "."; extracting the text after "copy:" and before the final "." in that paragraph yields the "copy sending" element;

the "issuing" element is the text of the paragraph in the document that contains "issue" and ends with "issue"; extracting the text before "issue" in that paragraph yields the "issuing" element;

the "undertaking unit", "contact person" and "contact telephone" elements are in the last paragraph of the document text; extracting the text after "undertaking unit:", "contact person:" and "telephone:" respectively yields the "undertaking unit" element, the "contact person" element and the "contact telephone" element.
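The extraction rules listed above can be sketched as plain string operations over the recognized paragraphs. This is a rough illustration under stated assumptions: the marker strings are English stand-ins for whatever labels are actually printed on the documents, the function name is hypothetical, and the last-paragraph fields are assumed to run up to the next marker or the end of the paragraph.

```python
import re
from typing import Dict, List

# English stand-ins for the markers printed on the document (assumptions).
SUBJECT_MARK = "subject word:"
COPY_MARK = "copy:"
ISSUE_MARK = "issue"
LAST_PARAGRAPH_MARKS = {
    "undertaking unit": "undertaking unit:",
    "contact person": "contact person:",
    "contact telephone": "telephone:",
}


def extract_by_text_rules(paragraphs: List[str]) -> Dict[str, str]:
    paras = [p.strip() for p in paragraphs if p.strip()]
    found: Dict[str, str] = {}
    if not paras:
        return found

    # Security level: the whole first paragraph.
    found["security level"] = paras[0]

    # Main sending: first paragraph ending with ":"; title: the paragraph before it.
    for i, p in enumerate(paras):
        if p.endswith(":"):
            found["main sending"] = p[:-1].strip()
            if i > 0:
                found["title"] = paras[i - 1]
            break

    for p in paras:
        # Subject word: text after the marker.
        if SUBJECT_MARK in p and "subject word" not in found:
            found["subject word"] = p.split(SUBJECT_MARK, 1)[1].strip()
        # Copy sending: text after the marker and before the final ".".
        if COPY_MARK in p and p.endswith(".") and "copy sending" not in found:
            found["copy sending"] = p.split(COPY_MARK, 1)[1][:-1].strip()
        # Issuing: text before the trailing marker.
        if p.endswith(ISSUE_MARK) and "issuing" not in found:
            found["issuing"] = p[: -len(ISSUE_MARK)].strip()

    # Last paragraph: text after each marker, up to the next marker or the end.
    stop = "|".join(re.escape(m) for m in LAST_PARAGRAPH_MARKS.values())
    for name, mark in LAST_PARAGRAPH_MARKS.items():
        m = re.search(re.escape(mark) + r"\s*(.*?)\s*(?=" + stop + r"|$)", paras[-1])
        if m:
            found[name] = m.group(1)
    return found
```

Keeping the markers in module-level constants means a different document category can reuse the same logic with its own labels, which is one plausible reading of storing text template rules per template.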

Obtaining the text template rules in advance from the paper official document templates makes later extraction of document elements from official document files more convenient and faster.

Step S3: storing the obtained official document element information to the corresponding field position of the official document element form in the service information processing system. Specifically, after the document elements in the document file are extracted by the coordinate area positioning method or the text rule positioning method in step S2, the extracted document elements can be saved to the corresponding field positions of the document element form in the service information processing system for later operation and review by the operator.
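Step S3 amounts to mapping each extracted element onto a field of the document element form. The sketch below models the form as a dictionary; the field names in `FIELD_MAP` and the function `fill_form` are hypothetical, since the source does not describe the form's schema or the service information processing system's interface.

```python
from typing import Dict

# Hypothetical mapping from element names to form field names.
FIELD_MAP = {
    "title": "doc_title",
    "main sending": "main_recipient",
    "security level": "security_level",
    "subject word": "subject_words",
    "copy sending": "copy_recipients",
    "issuing": "issued_by",
    "undertaking unit": "undertaking_unit",
    "contact person": "contact_person",
    "contact telephone": "contact_phone",
}


def fill_form(form: Dict[str, str], elements: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the form with extracted elements written to their fields."""
    filled = dict(form)
    for element, value in elements.items():
        field = FIELD_MAP.get(element)
        # Skip elements with no mapped field and fields the form does not have.
        if field is not None and field in filled:
            filled[field] = value
    return filled
```

Returning a copy leaves the original form untouched, so an operator could review the filled values before they are committed.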

Compared with the prior art, the method for extracting the document elements based on OCR character recognition provided by the embodiment electronizes the paper document through the scanner to obtain the electronic document, and further adopts the coordinate area positioning method or the text rule positioning method to extract the document elements in the document, and stores the document elements in the service information processing system.

Another embodiment of the present invention discloses a device for extracting document elements based on OCR character recognition, as shown in FIG. 2. The device comprises: a scanner 100, used for scanning paper official documents containing official document element information to obtain electronic official document files; a document element extraction module 200, which is started according to a user instruction and, once started, is used for dynamically generating an algorithm selection box according to the electronic document file, acquiring a corresponding algorithm based on the algorithm selection box, and acquiring document element information in the electronic document file based on the algorithm; and a document element storage module 300, configured to store the obtained document element information to the corresponding field position of the document element form in the service information processing system.

The device for extracting the official document elements based on OCR character recognition adopts the scanner to electronize the paper official document to obtain the electronic official document, further adopts a coordinate region positioning method or a text rule positioning method to extract the official document elements in the official document, and stores the official document elements in the business information processing system.

Preferably, the algorithm comprises a coordinate region positioning method and a text rule positioning method; the specific process of obtaining the document elements in the electronic document based on the algorithm is referred to in the embodiment of the method, and is not repeated here.

Those skilled in the art will appreciate that all or part of the processes for implementing the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, for instructing the relevant hardware. The computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims

1. A method for extracting official document elements based on OCR character recognition is characterized by comprising the following steps:

based on the electronic official document file, dynamically generating an algorithm selection box according to a document extraction start instruction from the user, and acquiring a corresponding algorithm based on the algorithm selection box; the algorithm comprises a coordinate area positioning method and a text rule positioning method; acquiring the document element information in the electronic document file based on the algorithm, wherein the method comprises the following steps:

dynamically generating a coordinate area positioning template rule selection frame based on the obtained coordinate area positioning method, obtaining a corresponding coordinate area positioning template rule based on the coordinate area positioning template rule selection frame, positioning to a specific position in a document file according to coordinate information of document elements in the coordinate area positioning template rule, and automatically obtaining document elements at a corresponding position in an electronic document file; or,

dynamically generating a text template rule selection box based on the obtained text rule positioning method, obtaining a corresponding text template rule based on the text template rule selection box, and obtaining an official document element in the electronic official document file according to the text template rule;

obtaining the coordinate region positioning template rule by:

obtaining various coordinate area positioning template rules based on the coordinate range values, page numbers and font information of all document elements in each electronic document template, and storing the coordinate area positioning template rules to a database;

obtaining the text template rule by:

generating a text template rule based on the extraction rule of the official document elements, and storing the text template rule to a database;

the official document elements comprise main sending, title, security level, subject word, copy sending, issuing, undertaking unit, contact person and contact telephone; wherein the extraction rule of the official document elements comprises:

acquiring a 'title' element based on a paragraph before a paragraph in which a 'main delivery' element in an electronic document is located;

acquiring the "copy sending" element based on the paragraph in the electronic document file that contains "copy:" and ends with ".";

acquiring elements of a undertaking unit, a contact person and a contact telephone based on the last paragraph of the electronic official document file;

2. An apparatus for extracting document elements based on OCR character recognition, comprising:

the scanner is used for scanning the paper official document containing the official document element information to obtain an electronic official document;

the official document element extraction module is used for dynamically generating an algorithm selection box according to an official document extraction starting instruction of a user based on the electronic official document file, and acquiring a corresponding algorithm based on the algorithm selection box; the algorithm comprises a coordinate area positioning method and a text rule positioning method; acquiring the document element information in the electronic document file based on the algorithm, wherein the method comprises the following steps:

dynamically generating a coordinate area positioning template rule selection frame based on the obtained coordinate area positioning method, obtaining a corresponding coordinate area positioning template rule based on the coordinate area positioning template rule selection frame, positioning to a specific position in the official document file according to coordinate information of the official document elements in the coordinate area positioning template rule, and automatically obtaining the official document elements at the corresponding position in the electronic official document file; or,

the official document element extraction module obtains the coordinate area positioning template rule through the following method:

the official document element extraction module obtains the text template rule by the following method: