CN112149679B - Method and device for extracting document elements based on OCR character recognition - Google Patents

Method and device for extracting document elements based on OCR character recognition

Info

Publication number
CN112149679B
CN112149679B CN202011015420.9A CN202011015420A CN112149679B CN 112149679 B CN112149679 B CN 112149679B CN 202011015420 A CN202011015420 A CN 202011015420A CN 112149679 B CN112149679 B CN 112149679B
Authority
CN
China
Prior art keywords
document
official document
electronic
rule
official
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011015420.9A
Other languages
Chinese (zh)
Other versions
CN112149679A (en
Inventor
张朝壹
李志芳
侯文君
邓倩楠
李旭明
陈毅彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Leadal Technology Development Co ltd
Beijing Zhonghong Lida Xinchuang Technology Co ltd
Original Assignee
Beijing Leadal Technology Development Co ltd
Beijing Zhonghong Lida Xinchuang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Leadal Technology Development Co ltd and Beijing Zhonghong Lida Xinchuang Technology Co ltd
Priority to CN202011015420.9A
Publication of CN112149679A
Application granted
Publication of CN112149679B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/28 Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
    • G06V30/287 Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet of Kanji, Hiragana or Katakana characters

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Character Input (AREA)

Abstract

The invention relates to a method and a device for extracting document elements based on OCR character recognition, belongs to the technical field of intelligent file processing, and solves the problems that the existing method wastes manpower and time and has low efficiency. The method comprises the following steps: scanning a paper official document file containing the official document element information to obtain an electronic official document file; dynamically generating an algorithm selection box based on the electronic official document file, and acquiring a corresponding algorithm based on the algorithm selection box; acquiring document element information in the electronic document file based on an algorithm; and storing the acquired document element information to the corresponding field position of the document element form in the service information processing system. The method is simple and easy to implement, realizes the rapid extraction of the document elements, saves manpower and cost, and improves efficiency.

Description

Method and device for extracting document elements based on OCR character recognition
Technical Field
The invention relates to the technical field of intelligent file processing, in particular to a method and a device for extracting document elements based on OCR character recognition.
Background
At present, traditional document receiving and dispatching mainly involves scanning paper documents into electronic documents with a scanner and uploading them to the relevant information system for handling. Document elements often need to be extracted during handling, and in current office work this is mainly done manually: staff identify the document elements in the file and key the document content elements into the service information processing system.
This manual approach involves a heavy workload, is error-prone and highly repetitive, and its labor and time costs become very large as the volume of documents grows. Moreover, the document content, and the document element information in particular, must be highly accurate; not even a slight entry error is acceptable.
Disclosure of Invention
In view of the foregoing analysis, embodiments of the present invention provide a method and an apparatus for extracting document elements based on OCR character recognition, so as to solve the problems of manpower and time waste and low efficiency of the existing method.
In one aspect, an embodiment of the present invention provides a method for extracting document elements based on OCR character recognition, including the following steps:
scanning a paper official document containing the official document element information to obtain an electronic official document;
dynamically generating an algorithm selection box based on the electronic official document file, and acquiring a corresponding algorithm based on the algorithm selection box; acquiring document element information in the electronic document file based on the algorithm;
and storing the acquired official document element information to a corresponding field position of an official document element form in a service information processing system.
Further, the algorithm comprises a coordinate region positioning method and a text rule positioning method; the acquiring of the document elements in the electronic document file based on the algorithm comprises:
dynamically generating a coordinate area positioning template rule selection box based on the obtained coordinate area positioning method, obtaining a corresponding coordinate area positioning template rule based on the coordinate area positioning template rule selection box, and obtaining document elements in the electronic document file according to the coordinate area positioning template rule; or,
dynamically generating a text template rule selection box based on the obtained text rule positioning method, obtaining a corresponding text template rule based on the text template rule selection box, and obtaining document elements in the electronic document file according to the text template rule.
Further, the coordinate area positioning template rule is obtained by:
scanning the paper official document template to obtain a plurality of corresponding electronic official document templates; the paper official document templates are paper official documents of various different categories;
selecting a rectangular area containing document elements in each electronic document template, and extracting coordinate range values, page numbers and font information of all document elements in the rectangular area by adopting an OCR (optical character recognition) technology;
and obtaining various coordinate area positioning template rules based on the coordinate range values, page numbers and font information of all document elements in each electronic document template, and storing the coordinate area positioning template rules in a database.
Further, the text template rule is obtained by:
obtaining an official document element extraction rule based on each paper official document template;
and generating a text template rule based on the extraction rule of the official document elements, and storing the text template rule in a database.
Further, the official document elements comprise main sending, title, security level, subject term, copy sending, issuing, undertaking unit, contact person and contact telephone; wherein the extraction rules for the official document elements comprise:
acquiring the "main sending" element based on the first paragraph in the electronic document file that ends with ":";
acquiring the "title" element based on the paragraph immediately before the paragraph containing the "main sending" element in the electronic document file;
acquiring the "subject term" element based on the paragraph in the electronic document file that contains "subject term:";
acquiring the "copy sending" element based on the paragraph in the electronic document file that contains "copy sending:" and ends with ".";
acquiring the "issuing" element based on the paragraph in the electronic document file that contains "issuing";
acquiring the "undertaking unit", "contact person" and "contact telephone" elements based on the last paragraph of the electronic document file.
In another aspect, an embodiment of the present invention provides a device for extracting document elements based on OCR character recognition, comprising:
a scanner, configured to scan paper official document files containing official document element information to obtain electronic official document files;
a document element extraction module, configured to dynamically generate an algorithm selection box according to the electronic document file, acquire a corresponding algorithm based on the algorithm selection box, and acquire document element information in the electronic document file based on the algorithm;
and an official document element storage module, configured to store the acquired official document element information to the corresponding field position of the official document element form in the service information processing system.
Further, the algorithm comprises a coordinate region positioning method and a text rule positioning method; the acquiring of the document elements in the electronic document file based on the algorithm comprises:
dynamically generating a coordinate area positioning template rule selection box based on the obtained coordinate area positioning method, obtaining a corresponding coordinate area positioning template rule based on the coordinate area positioning template rule selection box, and obtaining document elements in the electronic document file according to the coordinate area positioning template rule; or,
dynamically generating a text template rule selection box based on the obtained text rule positioning method, obtaining a corresponding text template rule based on the text template rule selection box, and obtaining the official document elements in the electronic official document file according to the text template rule.
Further, the official document element extraction module obtains the coordinate area positioning template rule by:
scanning the paper official document template to obtain a plurality of corresponding electronic official document templates; the paper official document templates are paper official documents of various different categories;
selecting a rectangular area containing document elements in each electronic document template, and extracting coordinate range values, page numbers and font information of all document elements in the rectangular area by adopting an OCR (optical character recognition) technology;
and obtaining various coordinate area positioning template rules based on the coordinate range values, page numbers and font information of all document elements in each electronic document template, and storing the coordinate area positioning template rules in a database.
Further, the official document element extraction module obtains the text template rule by the following method:
obtaining an official document element extraction rule based on each paper official document template;
and generating a text template rule based on the extraction rule of the official document elements, and storing the text template rule in a database.
Further, the official document elements comprise main sending, title, security level, subject term, copy sending, issuing, undertaking unit, contact person and contact telephone; wherein the extraction rules for the official document elements comprise:
acquiring the "main sending" element based on the first paragraph in the electronic document file that ends with ":";
acquiring the "title" element based on the paragraph immediately before the paragraph containing the "main sending" element in the electronic document file;
acquiring the "subject term" element based on the paragraph in the electronic document file that contains "subject term:";
acquiring the "copy sending" element based on the paragraph in the electronic document file that contains "copy sending:" and ends with ".";
acquiring the "issuing" element based on the paragraph in the electronic document file that contains "issuing";
acquiring the "undertaking unit", "contact person" and "contact telephone" elements based on the last paragraph of the electronic document file.
Compared with the prior art, the invention can realize at least one of the following beneficial effects:
1. In the method for extracting document elements based on OCR character recognition, a scanner digitizes the paper document to obtain an electronic document file, and a coordinate area positioning method or a text rule positioning method is then used to extract the document elements from the file and store them in the service information processing system.
2. By adopting either of the two algorithms, the coordinate area positioning method or the text rule positioning method, the system extracts the document elements from the document file automatically. The method is simple and easy to implement; it avoids the wasted manpower and time and the low efficiency of manual extraction, and it improves the efficiency of extracting document elements.
3. The coordinate area positioning template rule or the text template rule is derived in advance from the paper official document templates, which makes the later extraction of official document elements from official document files more convenient and faster.
In the invention, the technical schemes can be combined with each other to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a flow diagram of a method for extracting document elements based on OCR text recognition in one embodiment;
FIG. 2 is a diagram showing the construction of an apparatus for extracting document elements based on OCR character recognition in another embodiment;
Reference numerals:
100: scanner; 200: document element extraction module; 300: document element storage module.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
The existing approach of manually extracting document elements involves a heavy workload, is error-prone and highly repetitive, incurs very large labor and time costs as the volume of processed documents grows, and is both inaccurate and inefficient. The method and device for extracting document elements based on OCR character recognition address the poor accuracy and low efficiency of manual extraction: they automate document element extraction, improve the accuracy and efficiency of extraction, and have high practical value.
A specific embodiment of the present invention discloses a method for extracting document elements based on OCR character recognition, as shown in fig. 1. The method includes the following steps S1 to S3.
Step S1: scanning the paper official document file containing the official document element information to obtain the electronic official document file. Generally, the document from which document elements are to be extracted is a paper document, so it must first be scanned by a scanner to obtain an electronic document file from which the elements can be extracted.
Step S2: dynamically generating an algorithm selection box based on the electronic official document file, and acquiring a corresponding algorithm based on the algorithm selection box; and acquiring the document element information in the electronic document file based on the algorithm. Preferably, the algorithm comprises a coordinate area positioning method and a text rule positioning method. Specifically, after the electronic document file is obtained by scanning, the system dynamically generates an algorithm selection box in response to the user's instruction to start document extraction. Through the algorithm selection box the user can select either the coordinate area positioning method or the text rule positioning method to extract the document elements in the electronic document file. Which of the two is selected depends on the user's actual situation; both methods achieve rapid extraction of the document elements.
Acquiring the document elements in the electronic document file based on the algorithm comprises:
dynamically generating a coordinate area positioning template rule selection box based on the obtained coordinate area positioning method, obtaining a corresponding coordinate area positioning template rule based on the coordinate area positioning template rule selection box, and obtaining document elements in the electronic document file according to the coordinate area positioning template rule; or,
dynamically generating a text template rule selection box based on the obtained text rule positioning method, obtaining a corresponding text template rule based on the text template rule selection box, and obtaining the official document elements in the electronic official document file according to the text template rule.
Specifically, after the algorithm selection box is dynamically generated based on the electronic document file, the user can select the coordinate area positioning method. A coordinate area positioning template rule selection box is then dynamically generated, a corresponding coordinate area positioning template rule is obtained based on that selection box, and the system automatically obtains the document elements in the electronic document file according to the coordinate area positioning template rule. The coordinate area positioning template rules are generated in advance and comprise a plurality of rules corresponding to different types of documents; in a specific implementation, the user selects from the pre-generated rules according to the type of the document whose elements are to be extracted. Alternatively, after the algorithm selection box is dynamically generated, the user can select the text rule positioning method. A text template rule selection box is then dynamically generated, a corresponding text template rule is obtained based on that selection box, and the system automatically obtains the official document elements in the electronic official document file according to the text template rule.
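The two-level selection described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the names ALGORITHMS, RULE_DB, algorithm_options, rule_options and select_rule, and the sample template names, are hypothetical, and an in-memory dictionary stands in for the rule database.

```python
from typing import Dict, List, Tuple

# Hypothetical identifiers for the two positioning methods named in the text.
ALGORITHMS = ("coordinate_area", "text_rule")

# Stand-in for the rule database: template rule names per algorithm (illustrative).
RULE_DB: Dict[str, List[str]] = {
    "coordinate_area": ["notice_template", "request_template"],
    "text_rule": ["default_text_template"],
}


def algorithm_options() -> List[str]:
    """Entries offered in the first (algorithm) selection box."""
    return list(ALGORITHMS)


def rule_options(algorithm: str) -> List[str]:
    """Entries offered in the second selection box, generated from the chosen algorithm."""
    if algorithm not in RULE_DB:
        raise ValueError(f"unknown algorithm: {algorithm}")
    return list(RULE_DB[algorithm])


def select_rule(algorithm: str, rule_name: str) -> Tuple[str, str]:
    """Validate the user's two choices and return them as (algorithm, rule name)."""
    if rule_name not in rule_options(algorithm):
        raise ValueError(f"rule {rule_name!r} is not defined for {algorithm!r}")
    return (algorithm, rule_name)
```

The second box is built only after the first choice is known, which mirrors the "dynamically generated" selection boxes in the description.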
By adopting any one of the two algorithms of the coordinate region positioning method and the text rule positioning method, the automation of extracting the document elements in the document file by the system is realized, the method is simple and easy to implement, the problems that manual document element extraction wastes manpower and time and the extraction efficiency is low are solved, the manpower and the time are saved, and the efficiency of extracting the document elements is improved.
Preferably, the coordinate region positioning template rule is obtained in advance by:
scanning the paper official document template to obtain a plurality of corresponding electronic official document templates; wherein the paper official document templates are paper official documents of various different categories;
selecting a rectangular area containing document elements in each electronic document template, and extracting coordinate range values, page numbers and font information of all document elements in the rectangular area by adopting an OCR (optical character recognition) technology;
and obtaining various coordinate area positioning template rules based on the coordinate range values, page numbers and font information of all document elements in each electronic document template, and storing the coordinate area positioning template rules into a database.
Because the coordinates of the official document elements are relatively fixed for each type of official document, a coordinate area positioning template rule can be generated from the coordinates, page numbers and font information of the official document elements and then used to extract the elements from an official document file. Specifically, each type of paper official document template is first scanned by a scanner to obtain the corresponding electronic official document templates, which are displayed on an interface. A rectangular area containing official document elements is then selected with a mouse in each electronic official document template, and OCR (optical character recognition) technology is used to extract the coordinate range values, page numbers and font information of all official document elements in the rectangular area. The coordinate range values, page numbers and font information of the elements for each type of official document form one coordinate area positioning template rule, and the resulting plurality of coordinate area positioning template rules is stored in a database. After the user selects the corresponding coordinate area positioning template rule, the system locates the specific positions in the official document file according to the coordinate information of the official document elements and automatically acquires the elements at those positions.
Because the coordinate area positioning template rules are derived in advance from the paper official document templates, official document elements can later be extracted from official document files more conveniently and more quickly.
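As an illustration of coordinate-based positioning, the sketch below collects OCR words whose centre falls inside the rectangle stored for each element. It assumes the OCR engine returns words with a page number and a bounding box; the types OcrWord and ElementRegion and the function extract_by_coordinates are hypothetical names, font information is omitted for brevity, and the reading order is a naive top-to-bottom, left-to-right sort.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


@dataclass(frozen=True)
class OcrWord:
    text: str
    page: int
    box: Box


@dataclass(frozen=True)
class ElementRegion:
    page: int
    box: Box


def inside(word_box: Box, region_box: Box) -> bool:
    """True when the centre of the word box lies within the region box."""
    x0, y0, x1, y1 = word_box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    rx0, ry0, rx1, ry1 = region_box
    return rx0 <= cx <= rx1 and ry0 <= cy <= ry1


def extract_by_coordinates(
    words: List[OcrWord], rule: Dict[str, ElementRegion]
) -> Dict[str, str]:
    """Apply one coordinate area template rule: element name -> text found in its region."""
    result: Dict[str, str] = {}
    for name, region in rule.items():
        hits = [w for w in words if w.page == region.page and inside(w.box, region.box)]
        hits.sort(key=lambda w: (w.box[1], w.box[0]))  # naive reading order
        result[name] = " ".join(w.text for w in hits)  # use "" as separator for Chinese text
    return result
```

Matching on the word centre rather than requiring full containment tolerates small scanning offsets; this is a design choice of the sketch, not something the source specifies.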
Preferably, the text template rule is obtained by:
obtaining an official document element extraction rule based on each paper official document template;
and generating a text template rule based on the extraction rule of the official document elements, and storing the text template rule in a database.
Specifically, the text template rule is the key input for extracting document elements with the text rule positioning method. Therefore, before the text rule positioning method is used, the document element extraction rules need to be summarized for each type of paper document template, turned into a text template rule and stored in a database, so that the document elements can be extracted automatically later.
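The source does not say which database is used or how a rule is serialized. As one possible sketch, the following stores template rules as JSON text in SQLite; the table name template_rule, the column layout and the function names are assumptions made for illustration.

```python
import json
import sqlite3
from typing import Any, Dict


def open_rule_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the rule store; the schema is a hypothetical layout."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS template_rule ("
        "name TEXT PRIMARY KEY, kind TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return conn


def save_rule(conn: sqlite3.Connection, name: str, kind: str, body: Dict[str, Any]) -> None:
    """Insert or overwrite one template rule; kind is e.g. 'text_rule' or 'coordinate_area'."""
    conn.execute(
        "INSERT OR REPLACE INTO template_rule (name, kind, body) VALUES (?, ?, ?)",
        (name, kind, json.dumps(body, ensure_ascii=False)),
    )
    conn.commit()


def load_rule(conn: sqlite3.Connection, name: str) -> Dict[str, Any]:
    """Fetch a rule by name; raises KeyError when no such rule is stored."""
    row = conn.execute(
        "SELECT body FROM template_rule WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    return json.loads(row[0])
```

Keeping both rule kinds in one table lets the selection boxes be populated with a single query filtered by kind.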
Preferably, the official document elements comprise main sending, title, security level, subject term, copy sending, issuing, undertaking unit, contact person and contact telephone; wherein the extraction rules for the document elements comprise:
acquiring the "main sending" element based on the first paragraph in the electronic document file that ends with ":";
acquiring the "title" element based on the paragraph immediately before the paragraph containing the "main sending" element in the electronic document file;
acquiring the "subject term" element based on the paragraph in the electronic document file that contains "subject term:";
acquiring the "copy sending" element based on the paragraph in the electronic document file that contains "copy sending:" and ends with ".";
acquiring the "issuing" element based on the paragraph in the electronic document file that contains "issuing";
acquiring the "undertaking unit", "contact person" and "contact telephone" elements based on the last paragraph of the electronic document file.
Specifically, each paper official document template contains official document elements such as main sending, title, security level, subject term, copy sending, issuing, undertaking unit, contact person and contact telephone. For each official document file, the "main sending" element is located in the first paragraph of the document text that ends with ":"; extracting the text before the ":" in that paragraph yields the "main sending" element;
the "title" element is the text of the paragraph immediately before the paragraph containing the "main sending" element; extracting the full text of that paragraph yields the "title" element;
the "security level" element is the text of the first paragraph; extracting the full text of the first paragraph yields the "security level" element;
the "subject term" element is located in the paragraph of the document text that contains "subject term:"; extracting the text after "subject term:" in that paragraph yields the "subject term" element;
the "copy sending" element is located in the paragraph of the document text that contains "copy sending:" and ends with "."; extracting the text after the ":" and before the "." in that paragraph yields the "copy sending" element;
the "issuing" element is located in the paragraph of the document text that contains "issuing" and ends with "issuing"; extracting the text before "issuing" in that paragraph yields the "issuing" element;
the "undertaking unit", "contact person" and "contact telephone" elements are located in the last paragraph of the document text; extracting the text after "undertaking unit:", "contact person:" and "telephone:" respectively yields the "undertaking unit", "contact person" and "contact telephone" elements.
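The text rules listed above can be sketched as a single function over the OCR output split into paragraphs. This is an illustrative reading of the rules, not the patented code: the function name extract_by_text_rules is hypothetical, the English marker strings stand in for the markers used in the original-language documents, and the input is assumed to be a list of single-line paragraph strings.

```python
import re
from typing import Dict, List

COLONS = (":", "\uff1a")  # ASCII colon and full-width colon

# Labels expected in the last paragraph (assumed English stand-ins).
LAST_PARAGRAPH_LABELS = {
    "undertaking unit": "undertaking unit:",
    "contact person": "contact person:",
    "contact telephone": "telephone:",
}


def extract_by_text_rules(paragraphs: List[str]) -> Dict[str, str]:
    """Apply the paragraph-based rules and return element name -> extracted text."""
    paras = [p.strip() for p in paragraphs if p.strip()]
    out: Dict[str, str] = {}
    if not paras:
        return out

    # Security level: the whole first paragraph.
    out["security level"] = paras[0]

    # Main sending: first paragraph ending with a colon; title: the paragraph before it.
    for i, p in enumerate(paras):
        if p.endswith(COLONS):
            out["main sending"] = p[:-1].strip()
            if i > 0:
                out["title"] = paras[i - 1]
            break

    for p in paras:
        if "subject term:" in p and "subject term" not in out:
            out["subject term"] = p.split("subject term:", 1)[1].strip()
        elif "copy sending:" in p and p.endswith("."):
            out["copy sending"] = p.split("copy sending:", 1)[1][:-1].strip()
        elif p.endswith("issuing"):
            out["issuing"] = p[: -len("issuing")].strip()

    # Last paragraph: text after each label, up to the next label or the end.
    last = paras[-1]
    any_label = "|".join(re.escape(v) for v in LAST_PARAGRAPH_LABELS.values())
    for name, label in LAST_PARAGRAPH_LABELS.items():
        m = re.search(re.escape(label) + r"(.*?)(?=" + any_label + r"|$)", last)
        if m:
            out[name] = m.group(1).strip()
    return out
```

Elements whose marker is absent are simply left out of the result, so a caller can tell which fields still need manual entry.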
Because the text template rule is derived in advance from the paper official document templates, official document elements can later be extracted from official document files more conveniently and more quickly.
Step S3: storing the obtained official document element information to the corresponding field position of the official document element form in the service information processing system. Specifically, after the document elements in the document file are extracted by the coordinate area positioning method or the text rule positioning method in step S2, the extracted document elements are saved to the corresponding field positions of the document element form in the service information processing system, so that operators can work with and review them later.
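Step S3 amounts to mapping each extracted element onto a form field. The sketch below assumes the form is a plain dictionary and that a fixed mapping from element names to field names exists; FIELD_MAP, its field names and fill_form are hypothetical, since the source does not describe the form's structure.

```python
from typing import Dict, Optional

# Hypothetical mapping from element names to form field names.
FIELD_MAP: Dict[str, str] = {
    "main sending": "main_recipient",
    "title": "title",
    "security level": "security_level",
    "subject term": "subject_term",
    "copy sending": "copy_recipients",
    "issuing": "issuer",
    "undertaking unit": "undertaking_unit",
    "contact person": "contact_person",
    "contact telephone": "contact_phone",
}


def fill_form(
    form: Dict[str, Optional[str]],
    elements: Dict[str, str],
    field_map: Dict[str, str] = FIELD_MAP,
) -> Dict[str, Optional[str]]:
    """Return a copy of the form with extracted elements written to their fields.

    Elements without a mapped field, or whose field is not on the form, are skipped.
    """
    filled = dict(form)
    for element, value in elements.items():
        field = field_map.get(element)
        if field is not None and field in filled:
            filled[field] = value
    return filled
```

Returning a copy leaves the original form untouched, which makes it easy to show the operator a preview before saving.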
Compared with the prior art, the method for extracting document elements based on OCR character recognition provided by this embodiment digitizes the paper document with a scanner to obtain an electronic document file, then uses the coordinate area positioning method or the text rule positioning method to extract the document elements from the file, and stores them in the service information processing system.
Another embodiment of the present invention discloses a device for extracting document elements based on OCR character recognition, as shown in fig. 2. The device comprises: a scanner 100, configured to scan paper official document files containing official document element information to obtain electronic official document files; a document element extraction module 200, which is started according to a user instruction and, once started, is configured to dynamically generate an algorithm selection box according to the electronic document file, acquire a corresponding algorithm based on the algorithm selection box, and acquire document element information in the electronic document file based on the algorithm; and a document element storage module 300, configured to store the obtained document element information to the corresponding field position of the document element form in the service information processing system.
The device for extracting official document elements based on OCR character recognition uses the scanner to digitize the paper official document and obtain the electronic official document file, then uses the coordinate area positioning method or the text rule positioning method to extract the official document elements from the file, and stores them in the service information processing system.
Preferably, the algorithm comprises a coordinate area positioning method and a text rule positioning method; the specific process of obtaining the document elements in the electronic document file based on the algorithm is described in the method embodiment and is not repeated here.
Those skilled in the art will appreciate that all or part of the processes of the methods in the embodiments described above can be implemented by a computer program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium. The computer-readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (2)

1. A method for extracting official document elements based on OCR character recognition is characterized by comprising the following steps:
scanning a paper official document containing the official document element information to obtain an electronic official document;
based on the electronic official document file, dynamically generating an algorithm selection box according to an official document extraction start instruction of a user, and acquiring a corresponding algorithm based on the algorithm selection box; the algorithm comprises a coordinate area positioning method and a text rule positioning method; and acquiring the official document element information in the electronic official document file based on the algorithm, which comprises:
dynamically generating a coordinate area positioning template rule selection frame based on the obtained coordinate area positioning method, obtaining a corresponding coordinate area positioning template rule based on the coordinate area positioning template rule selection frame, positioning to a specific position in a document file according to coordinate information of document elements in the coordinate area positioning template rule, and automatically obtaining document elements at a corresponding position in an electronic document file; or,
dynamically generating a text template rule selection box based on the obtained text rule positioning method, obtaining a corresponding text template rule based on the text template rule selection box, and obtaining an official document element in the electronic official document file according to the text template rule;
obtaining the coordinate region positioning template rule by:
scanning the paper official document template to obtain a plurality of corresponding electronic official document templates; the paper official document templates are paper official documents of various different categories;
selecting a rectangular area containing document elements in each electronic document template, and extracting coordinate range values, page numbers and font information of all document elements in the rectangular area by adopting an OCR (optical character recognition) technology;
obtaining various coordinate area positioning template rules based on the coordinate range values, page numbers and font information of all document elements in each electronic document template, and storing the coordinate area positioning template rules to a database;
obtaining the text template rule by:
obtaining an official document element extraction rule based on each paper official document template;
generating a text template rule based on the extraction rule of the official document elements, and storing the text template rule to a database;
the official document elements comprise a main sending unit, a title, a security level, a subject word, a copying unit, a signing unit, an undertaking unit, a contact person and a contact telephone; wherein the extraction rule of the official document elements comprises:
acquiring the 'main sending' element based on the first paragraph in the electronic official document file that ends with ':';
acquiring the 'title' element based on the paragraph preceding the paragraph in which the 'main sending' element is located in the electronic official document file;
acquiring the 'subject word' element based on the paragraph in the electronic official document file that contains 'subject word:';
acquiring the 'copy' element based on the paragraph in the electronic official document file that contains 'copy:' and ends with '.';
acquiring the 'issuing' element based on the paragraph in the electronic official document file that contains 'issuing';
acquiring the 'undertaking unit', 'contact person' and 'contact telephone' elements based on the last paragraph of the electronic official document file;
and storing the acquired official document element information to a corresponding field position of an official document element form in a service information processing system.
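As a non-limiting illustration of the text rule positioning recited above, the paragraph-based extraction rules can be sketched in Python as follows. The function name, the result keys and the English label strings (`"Subject words"`, `"Copy"`, `"Issued"`) are placeholders assumed for this example; an actual text template rule would use the label strings that appear in the official documents being processed. Because the rules do not state how the last paragraph is divided into undertaking unit, contact person and contact telephone, the sketch returns that paragraph whole.

```python
from typing import Dict, List, Optional

COLONS = (":", "：")      # ASCII and full-width colon
FULL_STOPS = (".", "。")  # ASCII and ideographic full stop


def _first(paragraphs: List[str], predicate) -> Optional[str]:
    """Return the first paragraph satisfying predicate, or None."""
    return next((p for p in paragraphs if predicate(p)), None)


def extract_by_text_rules(
    paragraphs: List[str],
    subject_label: str = "Subject words",  # placeholder label (assumption)
    copy_label: str = "Copy",              # placeholder label (assumption)
    issue_label: str = "Issued",           # placeholder label (assumption)
) -> Dict[str, Optional[str]]:
    paras = [p.strip() for p in paragraphs if p.strip()]
    result: Dict[str, Optional[str]] = dict.fromkeys(
        ["main_sending", "title", "subject_word", "copy", "issuing", "last_paragraph"]
    )

    def has_label(p: str, label: str) -> bool:
        return any(label + colon in p for colon in COLONS)

    # Main sending unit: first paragraph that ends with a colon.
    main_idx = next((i for i, p in enumerate(paras) if p.endswith(COLONS)), None)
    if main_idx is not None:
        result["main_sending"] = paras[main_idx].rstrip(":：")
        # Title: the paragraph immediately before the main sending paragraph.
        if main_idx > 0:
            result["title"] = paras[main_idx - 1]

    # Subject word: paragraph containing the subject label followed by a colon.
    result["subject_word"] = _first(paras, lambda p: has_label(p, subject_label))
    # Copy: paragraph containing the copy label and ending with a full stop.
    result["copy"] = _first(
        paras, lambda p: has_label(p, copy_label) and p.endswith(FULL_STOPS)
    )
    # Issuing: paragraph containing the issuing label.
    result["issuing"] = _first(paras, lambda p: issue_label in p)
    # Undertaking unit, contact person and telephone: taken from the last
    # paragraph, which is returned whole here.
    if paras:
        result["last_paragraph"] = paras[-1]
    return result


doc = [
    "Notice on Strengthening Archive Management",
    "All municipal departments:",
    "Body text of the notice.",
    "Subject words: archives, management",
    "Copy: Municipal Archives Bureau.",
    "Issued by the General Office on 24 September 2020",
    "Undertaking unit: Secretariat  Contact: Zhang  Tel: 010-0000000",
]
extracted = extract_by_text_rules(doc)
```

The sample paragraphs in `doc` are invented for the example; any element whose rule matches no paragraph is left as `None`.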
2. An apparatus for extracting document elements based on OCR character recognition, comprising:
the scanner is used for scanning a paper official document containing official document element information to obtain an electronic official document;
the official document element extraction module is used for dynamically generating an algorithm selection box based on the electronic official document file according to an official document extraction start instruction of a user, and acquiring a corresponding algorithm based on the algorithm selection box; the algorithm comprises a coordinate area positioning method and a text rule positioning method; and acquiring the official document element information in the electronic official document file based on the algorithm, which comprises:
dynamically generating a coordinate area positioning template rule selection frame based on the obtained coordinate area positioning method, obtaining a corresponding coordinate area positioning template rule based on the coordinate area positioning template rule selection frame, positioning to a specific position in the official document file according to coordinate information of the official document elements in the coordinate area positioning template rule, and automatically obtaining the official document elements at the corresponding position in the electronic official document file; or,
dynamically generating a text template rule selection box based on the obtained text rule positioning method, obtaining a corresponding text template rule based on the text template rule selection box, and obtaining an official document element in the electronic official document file according to the text template rule;
the official document element extraction module obtains the coordinate area positioning template rule through the following method:
scanning the paper official document template to obtain a plurality of corresponding electronic official document templates; the paper official document templates are paper official documents of various different categories;
selecting a rectangular area containing document elements in each electronic document template, and extracting coordinate range values, page numbers and font information of all document elements in the rectangular area by adopting an OCR (optical character recognition) technology;
obtaining various coordinate area positioning template rules based on the coordinate range values, page numbers and font information of all document elements in each electronic document template, and storing the coordinate area positioning template rules to a database;
the official document element extraction module obtains the text template rule by the following method:
obtaining an official document element extraction rule based on each paper official document template;
generating a text template rule based on the extraction rule of the official document elements, and storing the text template rule to a database;
the official document elements comprise a main sending unit, a title, a security level, a subject word, a copying unit, a signing unit, an undertaking unit, a contact person and a contact telephone; wherein the extraction rule of the official document elements comprises:
acquiring the 'main sending' element based on the first paragraph in the electronic official document file that ends with ':';
acquiring the 'title' element based on the paragraph preceding the paragraph in which the 'main sending' element is located in the electronic official document file;
acquiring the 'subject word' element based on the paragraph in the electronic official document file that contains 'subject word:';
acquiring the 'copy' element based on the paragraph in the electronic official document file that contains 'copy:' and ends with '.';
acquiring the 'issuing' element based on the paragraph in the electronic official document file that contains 'issuing';
acquiring the 'undertaking unit', 'contact person' and 'contact telephone' elements based on the last paragraph of the electronic official document file;
and the official document element storage module is used for storing the acquired official document element information to the corresponding field position of the official document element form in the service information processing system.
CN202011015420.9A 2020-09-24 2020-09-24 Method and device for extracting document elements based on OCR character recognition Active CN112149679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011015420.9A CN112149679B (en) 2020-09-24 2020-09-24 Method and device for extracting document elements based on OCR character recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011015420.9A CN112149679B (en) 2020-09-24 2020-09-24 Method and device for extracting document elements based on OCR character recognition

Publications (2)

Publication Number Publication Date
CN112149679A CN112149679A (en) 2020-12-29
CN112149679B CN112149679B (en) 2022-09-23

Family

ID=73896625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011015420.9A Active CN112149679B (en) 2020-09-24 2020-09-24 Method and device for extracting document elements based on OCR character recognition

Country Status (1)

Country Link
CN (1) CN112149679B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7328797B2 (en) * 2019-06-05 2023-08-17 株式会社日立製作所 Terminal device, character recognition system and character recognition method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325011A (en) * 2020-01-21 2020-06-23 西安工程大学 Method for making and automatically matching administrative documents divided by main elements

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081615A (en) * 2009-11-28 2011-06-01 东莞市万维网络科技信息有限公司 Archives arranging and digital processing system based on information resource planning of archives
CN107862303B (en) * 2017-11-30 2019-04-26 平安科技(深圳)有限公司 Information identifying method, electronic device and the readable storage medium storing program for executing of form class diagram picture
CN111428725A (en) * 2020-04-13 2020-07-17 北京令才科技有限公司 Data structuring processing method and device and electronic equipment

Also Published As

Publication number Publication date
CN112149679A (en) 2020-12-29

Similar Documents

Publication Publication Date Title
US6243501B1 (en) Adaptive recognition of documents using layout attributes
JP5090369B2 (en) Automated processing using remotely stored templates (method for processing forms, apparatus for processing forms)
JP4829920B2 (en) Form automatic embedding method and apparatus, graphical user interface apparatus
US20160055376A1 (en) Method and system for identification and extraction of data from structured documents
US11151367B2 (en) Image processing apparatus and image processing program
JPH0683879A (en) Method and device for labelling document for preservation, handling and introduction
CN101561725B (en) Method and system of fast handwriting input
US11749008B2 (en) Image processing apparatus and image processing program
CN112560411A (en) Intelligent personnel information input method and system
CN115828874A (en) Industry table digital processing method based on image recognition technology
CN112149679B (en) Method and device for extracting document elements based on OCR character recognition
JP4983464B2 (en) Form image processing apparatus and form image processing program
CN111177387A (en) User list information processing method, electronic device and computer readable storage medium
JP2000322417A (en) Device and method for filing image and storage medium
CN115146583A (en) Self-host structured extraction and association method and device for terms and storage medium
CN114495138A (en) Intelligent document identification and feature extraction method, device platform and storage medium
CN113378526A (en) PDF paragraph processing method, device, storage medium and equipment
US10606928B2 (en) Assistive technology for the impaired
CN112632934B (en) Method for restoring table picture into editable WORD file table based on proportion calculation
Gribomont OCR with Google Vision API and Tesseract
CN113283226B (en) Editing method, device, equipment and medium of online document template
JP7501255B2 (en) Document search system, document search method and program
CN115640952B (en) Method and system for importing and uploading data
TW388016B (en) Method and apparatus for character recognition interface
JP2001312691A (en) Method/device for processing picture and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220322

Address after: 100048 703-1, 6th floor, building 8, yard 50, Xisanhuan North Road, Haidian District, Beijing

Applicant after: Beijing Zhonghong Lida Xinchuang Technology Co.,Ltd.

Applicant after: BEIJING LEADAL TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: 100048 703-1, 6th floor, building 8, yard 50, Xisanhuan North Road, Haidian District, Beijing

Applicant before: Beijing Zhonghong Lida Xinchuang Technology Co.,Ltd.

GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100048 703-1, 6th floor, building 8, yard 50, Xisanhuan North Road, Haidian District, Beijing

Patentee after: Beijing Zhonghong Lida Xinchuang Technology Co.,Ltd.

Patentee after: BEIJING LEADAL TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: 100048 703-1, 6th floor, building 8, yard 50, Xisanhuan North Road, Haidian District, Beijing

Patentee before: Beijing Zhonghong Lida Xinchuang Technology Co.,Ltd.

Patentee before: BEIJING LEADAL TECHNOLOGY DEVELOPMENT Co.,Ltd.

CP03 Change of name, title or address

Address after: Room 701, Building 8, Yard 50, North West Third Ring Road, Haidian District, Beijing 100048

Patentee after: BEIJING LEADAL TECHNOLOGY DEVELOPMENT Co.,Ltd.

Patentee after: Beijing Zhonghong Lida Xinchuang Technology Co.,Ltd.

Address before: 100048 703-1, 6th floor, building 8, yard 50, Xisanhuan North Road, Haidian District, Beijing

Patentee before: Beijing Zhonghong Lida Xinchuang Technology Co.,Ltd.

Patentee before: BEIJING LEADAL TECHNOLOGY DEVELOPMENT Co.,Ltd.