CN110751143A - Electronic invoice information extraction method and electronic equipment - Google Patents

Electronic invoice information extraction method and electronic equipment

Info

Publication number
CN110751143A
Authority
CN
China
Prior art keywords
electronic invoice
value
added tax
picture
tax electronic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910915157.XA
Other languages
Chinese (zh)
Inventor
王志鹏
朱西华
张胜娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Wanwei Information Technology Co Ltd
Original Assignee
China Telecom Wanwei Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Wanwei Information Technology Co Ltd
Priority to CN201910915157.XA
Publication of CN110751143A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63: Scene text, e.g. street names

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Character Input (AREA)

Abstract

The embodiment of the invention discloses an electronic invoice information extraction method and electronic equipment, which recognize text in designated areas of a value-added tax invoice and thereby improve the efficiency of bill information recognition. The method comprises the following steps: selecting a value-added tax electronic invoice of a target type, selecting a standard picture of that invoice as a reference picture, and making a custom identification template; importing a value-added tax electronic invoice file; preprocessing and geometrically correcting the target value-added tax electronic invoice file, and scaling and cropping it to a uniform standard size to obtain a processed value-added tax electronic invoice file; aligning the processed value-added tax electronic invoice file with the reference picture; cropping out, as regions of interest (ROIs), the regions to be identified in the aligned picture that correspond to the template picture, to obtain the ROI intercepted region; recognizing the ROI intercepted region to obtain a recognition result; and proofreading and structuring the result, then outputting and storing it in a specified format.

Description

Electronic invoice information extraction method and electronic equipment
Technical Field
The invention relates to the technical field of image recognition, in particular to an electronic invoice information extraction method and electronic equipment.
Background
In daily work, a company's financial staff must process many invoices every day, including electronic invoices as well as paper special value-added tax invoices, general invoices, transportation invoices and the like. Entering invoices manually is time-consuming and error-prone, and keeping both the paper originals and scanned electronic copies adds considerably to the financial staff's workload. The automated processing of bills is therefore a problem that urgently needs solving.
The conventional optical character recognition (OCR) workflow comprises the following stages: image input, preprocessing, layout analysis, row and column segmentation, character recognition, and post-processing correction. Every stage strongly affects the recognition result: if one stage is handled improperly, the output of the next stage is directly affected, and the final recognition result is degraded as a consequence.
At present, bill recognition technology mainly relies on traditional OCR to convert an unstructured bill image into structured data and thereby extract the bill information. The main implementation is as follows: locating the information line of each item on the bill, locating the information frames to be recognized among all the information frames, and recognizing all the text in those frames by means of character segmentation and OCR. This approach suits bills with tidy layouts and little content. When the layout is complex and the content is extensive, preprocessing and recognition take longer, and positioning errors occur easily and cause recognition to fail.
In addition, although electronic invoices have become widespread in recent years, they are generally downloaded in Portable Document Format (PDF). Enterprise financial reimbursement, however, requires only paper copies and electronic picture copies, and most bill recognition technologies operate on picture formats. For electronic invoices, financial staff therefore still have to capture the invoice face manually and convert it into a picture format, an operation that is both tedious and prone to omissions.
Disclosure of Invention
The embodiment of the invention provides an electronic invoice information extraction method and electronic equipment, which set up a custom identification template and, by converting PDF electronic invoices into picture format and applying end-to-end OCR based on deep learning, recognize text in designated areas of a value-added tax invoice, thereby improving the efficiency of bill information recognition and the office efficiency of financial staff.
In view of this, a first aspect of the present invention provides an electronic invoice information extraction method, which may include:
selecting a value-added tax electronic invoice of a target type, selecting a standard picture of the value-added tax electronic invoice of the target type as a reference picture, and making a custom identification template;
importing a value-added tax electronic invoice file, wherein the value-added tax electronic invoice file comprises a value-added tax electronic invoice file in PDF format and a first value-added tax electronic invoice file in picture format, and each value-added tax electronic invoice file is named in a uniform format;
performing structural analysis on the PDF value-added tax electronic invoice file, and converting the PDF value-added tax electronic invoice file into a second value-added tax electronic invoice file in a picture format;
preprocessing and geometrically correcting a target value-added tax electronic invoice file, and processing the target value-added tax electronic invoice file into a uniform standard size through scaling and cutting to obtain a processed value-added tax electronic invoice file, wherein the target value-added tax electronic invoice file comprises a first value-added tax electronic invoice file in a picture format and a second value-added tax electronic invoice file in the picture format;
aligning the processed value-added tax electronic invoice file with the reference picture;
cropping out, as regions of interest (ROIs), the regions to be identified in the aligned picture that correspond to the template picture, to obtain the ROI intercepted region;
performing character detection and character recognition operation on the ROI intercepted area by utilizing an end-to-end character recognition algorithm based on deep learning to obtain a recognition result;
and proofreading the recognition result, structuring it, and outputting and storing it in a specified format.
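The final structuring step can be pictured with a minimal sketch. It assumes that each template region yields one recognized string, and that the "specified format" is JSON; the text does not name a format, and the label names below are hypothetical.

```python
import json

def structure_results(labels, texts):
    """Pair each template label with its recognized text (assumed one string per region)."""
    if len(labels) != len(texts):
        raise ValueError("one recognized string is expected per template region")
    # Strip stray whitespace that recognition may leave at the ends of a string.
    return {label: text.strip() for label, text in zip(labels, texts)}

def to_json(record):
    # ensure_ascii=False keeps Chinese characters readable in the stored output.
    return json.dumps(record, ensure_ascii=False, sort_keys=True)

# Hypothetical labels and recognized strings, for illustration only.
record = structure_results(["invoice_code", "total_amount"], [" 012345678901 ", "106.00"])
serialized = to_json(record)
```

Keeping the label-to-text mapping explicit makes later proofreading rules (for example, format checks per label) easy to attach.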
Optionally, in some embodiments of the present invention, importing the value-added tax electronic invoice file supports uploading a single file or multiple files; and after recognition is finished, the recognition results of the imported value-added tax electronic invoice files are stored separately according to the file attributes.
Optionally, in some embodiments of the present invention, the selecting the target type of the value-added tax electronic invoice and the selecting the standard picture of the target type of the value-added tax electronic invoice as a reference picture, and making a custom identification template includes:
selecting a standard value-added tax electronic invoice picture that is complete, clear, correct and free of stains, to serve as the reference picture for making the custom identification template;
selecting 4 corners on an external frame of a table on a picture of the standard value-added tax electronic invoice as reference points for picture alignment transformation, and storing coordinate points of the 4 corners;
and selecting an area to be identified in the electronic invoice according to the requirement, storing the coordinate position of the area to be identified and the content represented by the area as the label information of the structured result, and obtaining the custom identification template.
Optionally, in some embodiments of the present invention, the preprocessing and the geometric correction processing on the target value-added tax electronic invoice file include:
carrying out gray processing on the target value-added tax electronic invoice file, and converting the target value-added tax electronic invoice file into a single-channel gray image;
performing smooth noise reduction processing on the single-channel gray level image by adopting Gaussian filtering processing to obtain a smooth noise reduction image;
extracting 4 outer frame straight lines in the smooth noise reduction image through Hough transformation, and further calculating to obtain coordinate positions of 4 corner points of the outer frame; comparing the coordinate positions of the 4 angular points with the 4 angular points in the custom identification template, and performing geometric correction through multi-level perspective transformation to obtain a corrected picture;
and comparing against the reference picture, cropping away redundant boundary regions from the corrected picture so that its size is consistent with that of the custom identification template, and cropping out the ROI regions to obtain a picture of the region to be identified.
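The corner computation in the steps above can be sketched as follows. The sketch assumes the four outer-frame lines are already available in the usual Hough normal form (rho, theta), where x·cos(theta) + y·sin(theta) = rho, and that they have been sorted into top, bottom, left and right; the function names are illustrative and not taken from the text.

```python
import math

def intersect(line_a, line_b):
    """Intersect two lines in Hough normal form (rho, theta).

    Returns (x, y), or None when the lines are parallel.
    """
    rho1, theta1 = line_a
    rho2, theta2 = line_b
    a1, b1 = math.cos(theta1), math.sin(theta1)
    a2, b2 = math.cos(theta2), math.sin(theta2)
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-9:
        return None
    # Cramer's rule on the 2x2 system a*x + b*y = rho.
    x = (rho1 * b2 - rho2 * b1) / det
    y = (a1 * rho2 - a2 * rho1) / det
    return (x, y)

def frame_corners(top, bottom, left, right):
    """Corner order: top-left, top-right, bottom-right, bottom-left."""
    return [intersect(top, left), intersect(top, right),
            intersect(bottom, right), intersect(bottom, left)]
```

For example, horizontal lines at y = 10 and y = 110 together with vertical lines at x = 5 and x = 205 give the corners (5, 10), (205, 10), (205, 110) and (5, 110), up to floating-point rounding.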
Optionally, in some embodiments of the present invention, the deep learning-based end-to-end OCR technology relies on two deep learning models: a text detection model (CTPN) and a text recognition model (CRNN).
Optionally, in some embodiments of the present invention, the text detection model is obtained as follows: collecting pictures of various scenes that contain different text, manually annotating the text areas, and dividing the annotated samples into a training set and a test set at a ratio of 9:1; and training the model with the CTPN algorithm from deep learning. The detection model can detect the text areas in the picture of the area to be identified line by line, and locate and display them visually as rectangular boxes.
Optionally, in some embodiments of the present invention, the text recognition model is obtained as follows: collecting Chinese and English corpora, generating a text recognition sample set of fixed character length, and training the model with the CRNN algorithm from deep learning. The recognition model can recognize the text in the picture of the area to be identified without text line segmentation or character segmentation, and outputs the text as a character string.
A second aspect of the present invention provides an electronic device, which may include:
the acquisition module is used for selecting the value-added tax electronic invoice of the target type, selecting a standard picture of the value-added tax electronic invoice of the target type as a reference picture, and making a self-defined identification template; importing a value-added tax electronic invoice file, wherein the value-added tax electronic invoice file comprises a value-added tax electronic invoice file in a PDF format and a first value-added tax electronic invoice file in a picture format, and each value-added tax electronic invoice file is named as a uniform format;
the processing module is used for carrying out structural analysis on the PDF value-added tax electronic invoice files and converting the PDF value-added tax electronic invoice files into second value-added tax electronic invoice files in a picture format; preprocessing and geometrically correcting a target value-added tax electronic invoice file, and processing the target value-added tax electronic invoice file into a uniform standard size through scaling and cutting to obtain a processed value-added tax electronic invoice file, wherein the target value-added tax electronic invoice file comprises a first value-added tax electronic invoice file in a picture format and a second value-added tax electronic invoice file in the picture format;
aligning the processed value-added tax electronic invoice file with the reference picture; cutting the region to be identified in the aligned picture corresponding to the template picture into the region of interest ROI to obtain the ROI intercepted region; performing character detection and character recognition operation on the ROI intercepted area by utilizing an end-to-end character recognition algorithm based on deep learning to obtain a recognition result; and checking the identification result, performing structuring processing, and outputting and storing in a specified format.
Optionally, in some embodiments of the present invention, importing the value-added tax electronic invoice file supports uploading a single file or multiple files; and after recognition is finished, the recognition results of the imported value-added tax electronic invoice files are stored separately according to the file attributes.
Optionally, in some embodiments of the present invention,
the acquisition module is specifically used for selecting a standard value-added tax electronic invoice picture which is complete, clear, correct and pollution-free and is used as a reference picture for manufacturing a custom identification template; selecting 4 corners on an external frame of a table on a picture of the standard value-added tax electronic invoice as reference points for picture alignment transformation, and storing coordinate points of the 4 corners; and selecting an area to be identified in the electronic invoice according to the requirement, storing the coordinate position of the area to be identified and the content represented by the area as the label information of the structured result, and obtaining the custom identification template.
Optionally, in some embodiments of the present invention,
the processing module is specifically used for carrying out gray processing on the target value-added tax electronic invoice file and converting the target value-added tax electronic invoice file into a single-channel gray image; performing smooth noise reduction processing on the single-channel gray level image by adopting Gaussian filtering processing to obtain a smooth noise reduction image; extracting 4 outer frame straight lines in the smooth noise reduction image through Hough transformation, and further calculating to obtain coordinate positions of 4 corner points of the outer frame; comparing the coordinate positions of the 4 angular points with the 4 angular points in the custom identification template, and performing geometric correction through multi-level perspective transformation to obtain a corrected picture; and comparing the reference picture, cutting off redundant boundary regions and ROI regions in the corrected picture, keeping the size of the corrected picture consistent with that of the self-defined identification template, and obtaining a picture of the region to be identified.
Optionally, in some embodiments of the present invention, the deep learning-based end-to-end OCR technology relies on two deep learning models: a text detection model (CTPN) and a text recognition model (CRNN).
Optionally, in some embodiments of the present invention,
the text detection model is obtained as follows: collecting pictures of various scenes that contain different text, manually annotating the text areas, and dividing the annotated samples into a training set and a test set at a ratio of 9:1; and training the model with the CTPN algorithm from deep learning. The detection model can detect the text areas in the picture of the area to be identified line by line, and locate and display them visually as rectangular boxes.
Optionally, in some embodiments of the present invention,
the text recognition model is obtained as follows: collecting Chinese and English corpora, generating a text recognition sample set of fixed character length, and training the model with the CRNN algorithm from deep learning. The recognition model can recognize the text in the picture of the area to be identified without text line segmentation or character segmentation, and outputs the text as a character string.
A third aspect of the present invention provides an electronic device, which may include:
a transceiver, a processor, and a memory, wherein the transceiver, the processor, and the memory are connected by a bus;
the memory is used for storing operation instructions;
the transceiver is used for selecting the value-added tax electronic invoice of the target type, selecting the standard picture of the value-added tax electronic invoice of the target type as a reference picture, and making a self-defined identification template; importing a value-added tax electronic invoice file, wherein the value-added tax electronic invoice file comprises a value-added tax electronic invoice file in a PDF format and a first value-added tax electronic invoice file in a picture format, and each value-added tax electronic invoice file is named as a uniform format;
the processor is configured to invoke the operation instructions and execute the steps of the electronic invoice information extraction method described in the first aspect or in any optional implementation of the first aspect of the embodiment of the present invention.
A fourth aspect of the present invention provides a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for extracting electronic invoice information as described in the first aspect and any optional implementation manner of the first aspect in the embodiment of the present invention.
According to the technical scheme, the embodiment of the invention has the following advantages:
in the embodiment of the invention, a value-added tax electronic invoice of a target type is selected, a standard picture of the value-added tax electronic invoice of the target type is selected as a reference picture, and a user-defined identification template is manufactured; importing a value-added tax electronic invoice file, wherein the value-added tax electronic invoice file comprises a value-added tax electronic invoice file in a PDF format and a first value-added tax electronic invoice file in a picture format, and each value-added tax electronic invoice file is named as a uniform format; performing structural analysis on the PDF value-added tax electronic invoice file, and converting the PDF value-added tax electronic invoice file into a second value-added tax electronic invoice file in a picture format; preprocessing and geometrically correcting a target value-added tax electronic invoice file, and processing the target value-added tax electronic invoice file into a uniform standard size through scaling and cutting to obtain a processed value-added tax electronic invoice file, wherein the target value-added tax electronic invoice file comprises a first value-added tax electronic invoice file in a picture format and a second value-added tax electronic invoice file in the picture format; aligning the processed value-added tax electronic invoice file with the reference picture; cutting the region to be identified in the aligned picture corresponding to the template picture into the region of interest ROI to obtain the ROI intercepted region; performing character detection and character recognition operation on the ROI intercepted area by utilizing an end-to-end character recognition algorithm based on deep learning to obtain a recognition result; and checking the identification result, performing structuring processing, and outputting and storing in a specified format. 
Through the technology of converting the PDF format electronic invoice into the image format and the end-to-end OCR technology based on deep learning, the text recognition of the designated area of the value-added tax invoice is realized, the recognition efficiency of bill information is further improved, and the office efficiency of financial staff is improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments and the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them.
FIG. 1 is a schematic diagram of an embodiment of an extraction method of electronic invoice information in an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for extracting electronic invoice information according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a value-added tax electronic invoice customization template in an embodiment of the invention;
FIG. 4 is a diagram of an embodiment of an electronic device in an embodiment of the invention;
FIG. 5 is a schematic diagram of another embodiment of the electronic device in the embodiment of the present invention.
Detailed Description
The embodiment of the invention provides an electronic invoice information extraction method and electronic equipment, which set up a custom identification template and, by converting PDF electronic invoices into picture format and applying end-to-end OCR based on deep learning, recognize text in designated areas of a value-added tax invoice, thereby improving the efficiency of bill information recognition and the office efficiency of financial staff.
In order to make the technical solutions of the present invention better understood by those skilled in the art, the technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. The embodiments based on the present invention should fall into the protection scope of the present invention.
As shown in fig. 1, which is a schematic diagram of an embodiment of an extraction method of electronic invoice information in an embodiment of the present invention, the extraction method may include:
101. Selecting a value-added tax electronic invoice of the target type, selecting a standard picture of the value-added tax electronic invoice of the target type as a reference picture, and making a custom identification template.
The selecting the value-added tax electronic invoice of the target type and the selecting the standard picture of the value-added tax electronic invoice of the target type as the reference picture, and making the custom identification template may include: selecting a standard value-added tax electronic invoice picture, wherein the standard value-added tax electronic invoice picture is complete, clear, correct and pollution-free and is used as a reference picture for manufacturing a custom identification template; selecting 4 corners on an external frame of a table on a picture of the standard value-added tax electronic invoice as reference points for picture alignment transformation, and storing coordinate points of the 4 corners; and selecting an area to be identified in the electronic invoice according to the requirement, storing the coordinate position of the area to be identified and the content represented by the area as the label information of the structured result, and obtaining the custom identification template.
Illustratively, a custom template is made as follows. A complete, clear, correct and stain-free standard picture of the value-added tax invoice is selected as the reference picture for making the template. The alignment reference points of the reference picture, namely the 4 corner points of the table's outer frame, are calibrated manually and their coordinate positions are stored. The text areas to be recognized are selected manually, the attribute each area represents (its label information) is annotated, and the coordinate position of each rectangular area is stored. Finally, the custom identification template is named and saved.
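One way to picture the stored template is a small data structure that holds the 4 reference corners and the labeled rectangular regions. The sketch below is only an assumption about how such a template might be represented: the class names, field names, JSON storage and coordinate values are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class Region:
    label: str   # attribute the region represents, e.g. "invoice_number"
    x: int       # left edge of the rectangle in reference-picture pixels
    y: int       # top edge of the rectangle
    width: int
    height: int

@dataclass
class Template:
    name: str
    corners: list                       # four (x, y) corners of the table's outer frame
    regions: list = field(default_factory=list)

    def __post_init__(self):
        if len(self.corners) != 4:
            raise ValueError("a template needs exactly 4 reference corners")

    def add_region(self, label, x, y, width, height):
        self.regions.append(Region(label, x, y, width, height))

    def to_json(self):
        # asdict converts the nested Region objects into plain dictionaries.
        return json.dumps(asdict(self), ensure_ascii=False)

# Hypothetical coordinates, for illustration only.
template = Template("vat_e_invoice", [(40, 60), (1160, 60), (1160, 700), (40, 700)])
template.add_region("invoice_number", 900, 20, 200, 30)
saved = template.to_json()
```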
Fig. 2 is a design flow chart of the method for extracting electronic invoice information according to the embodiment of the present invention. Fig. 3 is a schematic diagram of a value-added tax electronic invoice customization template in the embodiment of the present invention.
102. Importing a value-added tax electronic invoice file, wherein the value-added tax electronic invoice file comprises a value-added tax electronic invoice file in PDF format and a first value-added tax electronic invoice file in picture format, and each value-added tax electronic invoice file is named in a uniform format.
It can be understood that importing the value-added tax electronic invoice file supports uploading a single file or multiple files; and after recognition is finished, the recognition results of the imported value-added tax electronic invoice files are stored separately according to the file attributes.
Illustratively, value-added tax electronic invoice files can be imported singly or in batches, in picture formats (JPG and PNG) or in PDF format (text version and scanned version), and the name of each file should follow a fixed, uniform format. For example, the name may take the form "owner_department_project_reimbursement category_index value", so that the system can automatically classify and summarize items by name.
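The uniform naming rule lends itself to a small parser. The sketch below assumes that the five name parts are separated by single underscores and contain no underscores themselves; the field names and the sample file name are hypothetical.

```python
from pathlib import PurePath

# Field order follows the naming pattern in the text; the English names are assumptions.
FIELDS = ("owner", "department", "project", "category", "index")
SUPPORTED = {".jpg", ".png", ".pdf"}

def parse_invoice_name(filename):
    """Split a uniformly named invoice file into its classification fields."""
    path = PurePath(filename)
    if path.suffix.lower() not in SUPPORTED:
        raise ValueError(f"unsupported file type: {path.suffix}")
    parts = path.stem.split("_")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} underscore-separated parts")
    return dict(zip(FIELDS, parts))

info = parse_invoice_name("Wang_Finance_ProjectA_Travel_001.pdf")
```

Rejecting badly named files at import time keeps the later classification and summary steps from silently grouping invoices under the wrong project.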
103. Performing structural analysis on the PDF value-added tax electronic invoice file, and converting it into a value-added tax electronic invoice file in picture format.
Illustratively, structural analysis is performed on each uploaded PDF electronic invoice, and each page is converted separately into a picture, so that one picture contains the content of only one invoice. Each resulting picture keeps the name of the PDF file with a sub-index value appended after the index value, which facilitates later per-project amount statistics and summaries.
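The page naming rule can be sketched without committing to a particular PDF rendering library. The sketch assumes that the sub-index is a 1-based page number joined with an underscore and that pages are saved as PNG; the text fixes neither detail.

```python
from pathlib import PurePath

def page_image_names(pdf_name, page_count, extension=".png"):
    """Name one picture per PDF page by appending a sub-index to the file's stem."""
    stem = PurePath(pdf_name).stem
    return [f"{stem}_{page}{extension}" for page in range(1, page_count + 1)]

names = page_image_names("Wang_Finance_ProjectA_Travel_001.pdf", 2)
```

Here `names` holds "Wang_Finance_ProjectA_Travel_001_1.png" and "Wang_Finance_ProjectA_Travel_001_2.png", so every page stays traceable to its source file.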
104. Preprocessing and geometrically correcting a target value-added tax electronic invoice file, and scaling and cropping it to a uniform standard size to obtain a processed value-added tax electronic invoice file; aligning the processed value-added tax electronic invoice file with the reference picture; and cropping out, as regions of interest (ROIs), the regions to be identified in the aligned picture that correspond to the template picture, to obtain the ROI intercepted region.
The target value-added tax electronic invoice file comprises a first value-added tax electronic invoice file in the picture format and a second value-added tax electronic invoice file in the picture format.
The preprocessing and geometric correction processing of the target value-added tax electronic invoice file may include:
carrying out gray processing on the target value-added tax electronic invoice file, and converting the target value-added tax electronic invoice file into a single-channel gray image; performing smooth noise reduction processing on the single-channel gray level image by adopting Gaussian filtering processing to obtain a smooth noise reduction image; extracting 4 outer frame straight lines in the smooth noise reduction image through Hough transformation, and further calculating to obtain coordinate positions of 4 corner points of the outer frame; comparing the coordinate positions of the 4 angular points with the 4 angular points in the custom identification template, and performing geometric correction through multi-level perspective transformation to obtain a corrected picture; and comparing the reference picture, cutting off redundant boundary areas in the corrected picture, and keeping the size of the corrected picture consistent with that of the custom identification template.
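The geometric correction rests on a perspective transform determined by the 4 detected corners and the 4 template corners. The following pure-Python sketch solves for a single 3x3 perspective matrix from four point pairs; it does not reproduce the "multi-level" scheme mentioned in the text, whose details are not given, and the function names and coordinates are illustrative.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(rhs[i])] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("corner points are degenerate")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def perspective_matrix(src, dst):
    """3x3 matrix mapping each of the 4 src points onto the matching dst point."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]

def warp_point(m, point):
    """Apply the perspective matrix to one (x, y) point."""
    x, y = point
    w = m[2][0] * x + m[2][1] * y + m[2][2]
    return ((m[0][0] * x + m[0][1] * y + m[0][2]) / w,
            (m[1][0] * x + m[1][1] * y + m[1][2]) / w)

# Hypothetical detected corners and template corners, for illustration only.
detected = [(12, 8), (410, 20), (400, 300), (5, 290)]
reference = [(0, 0), (400, 0), (400, 300), (0, 300)]
matrix = perspective_matrix(detected, reference)
```

In practice an image library would apply the same matrix to every pixel; the sketch only shows how the matrix itself follows from the four corner pairs.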
Exemplarily, the image to be recognized is subjected to preprocessing and geometric correction processing, where the preprocessing is as follows: carrying out graying processing on the invoice picture to convert the invoice picture into a single-channel image; adopting Gaussian filtering processing to carry out smooth noise reduction processing on the gray level image; extracting 4 outer frame straight lines in the invoice image through Hough transformation, and further calculating to obtain coordinate positions of 4 corner points of the outer frame; comparing the 4 corner points with the 4 corner points in the template, and performing geometric correction through multi-stage perspective transformation; and comparing the reference picture, cutting off redundant boundary regions and ROI regions in the corrected picture, keeping the size of the corrected picture consistent with that of the self-defined identification template, and obtaining a picture of the region to be identified.
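Once the picture is aligned with the template, ROI cropping reduces to slicing by the stored rectangle coordinates. The sketch below treats a grayscale picture as a list of pixel rows and assumes each region is stored as (x, y, width, height); both representations are assumptions made for illustration.

```python
def crop_roi(image, x, y, width, height):
    """Return the rectangle starting at column x, row y from a list-of-rows image."""
    if x < 0 or y < 0 or y + height > len(image) or x + width > len(image[0]):
        raise ValueError("region lies outside the aligned picture")
    return [row[x:x + width] for row in image[y:y + height]]

def crop_all(image, regions):
    """Crop every labeled region; regions maps label -> (x, y, width, height)."""
    return {label: crop_roi(image, *box) for label, box in regions.items()}

# A tiny 4x6 test picture whose pixel value encodes its row and column.
picture = [[row * 10 + col for col in range(6)] for row in range(4)]
rois = crop_all(picture, {"amount": (1, 2, 3, 2)})
```

Because the coordinates come from the template, the same regions apply to every aligned invoice of that type.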
105. Carrying out character detection and character recognition on the cropped ROI area by using a deep-learning-based end-to-end character recognition algorithm to obtain a recognition result.
The deep-learning-based end-to-end character recognition (OCR) technology relies on two deep learning models: a character detection model and a character recognition model.
The character detection model is obtained as follows: pictures from various scenes containing different characters are collected, the character areas are calibrated manually, and the samples are divided into a training set and a test set at a ratio of 9:1; the model is trained with the CTPN algorithm in deep learning. The detection model can detect the character areas in the picture of the area to be identified line by line, and locate each character area and display it visually with a rectangular frame.
The character recognition model is obtained as follows: Chinese and English corpora are collected, a character recognition sample set with a fixed number of characters per sample is generated, and the model is trained with the CRNN algorithm in deep learning. The recognition model can recognize the character information in the picture of the area to be identified without text line segmentation or character segmentation, and outputs the character information as a character string.
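As a hedged illustration of how a character string can be produced without character segmentation: CRNN as commonly described ends in a CTC transcription layer, although the embodiment does not name CTC, so its use here is an assumption. The sketch below shows greedy CTC decoding over per-frame scores; the name `ctc_greedy_decode`, the small character set, and the hand-made scores are hypothetical.

```python
def ctc_greedy_decode(frame_scores, charset, blank=0):
    """Greedy CTC decoding.

    frame_scores: one list of class scores per time step (image column).
    charset: class index -> character; the entry at `blank` is unused.
    Repeated indices are collapsed first, then blanks are dropped.
    """
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    chars = []
    previous = None
    for index in best:
        if index != previous and index != blank:
            chars.append(charset[index])
        previous = index
    return "".join(chars)


def one_hot(index, size):
    """Helper for the example: a score vector that peaks at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Assumed toy character set; index 0 is the CTC blank.
CHARSET = ["", "0", "1", "2", "5", "."]
# Best class per frame: 1 1 _ 2 . . _ 5 (using the characters they stand for).
frames = [one_hot(i, len(CHARSET)) for i in [2, 2, 0, 3, 5, 5, 0, 4]]
decoded = ctc_greedy_decode(frames, CHARSET)
```

The blank class is what allows a genuinely doubled character to be output: two identical indices separated by a blank are kept as two characters, while adjacent identical indices are merged into one.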
Exemplarily, with the text detection (CTPN) model, the preprocessed and geometrically corrected picture is taken as input and text detection is performed to obtain the positioning information of the text lines; building this model may include:
collecting relevant bill pictures and natural-scene pictures containing characters; calibrating the character areas in the pictures with the labeling tool LabelImg; constructing a CTPN model, dividing the calibrated sample set into a training set and a verification set at a ratio of 9:1, and training the network model; if the model converges, saving it; if not, stopping training, adjusting the parameters, and retraining until the algorithm converges.
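For illustration only, the 9:1 division of a calibrated sample set may be sketched as below. The shuffling step, the fixed seed, and the file names are assumptions added for reproducibility of the example and are not specified by the embodiment.

```python
import random


def split_samples(samples, train_ratio=0.9, seed=0):
    """Shuffle the samples and split them into (training, verification)."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = int(round(len(items) * train_ratio))
    return items[:cut], items[cut:]


# Hypothetical annotated picture names standing in for the calibrated set.
annotated = ["bill_%03d.png" % i for i in range(100)]
train_set, val_set = split_samples(annotated)
```

Using a seeded generator keeps the split stable between the training run and any later retraining after parameter adjustment, so that convergence is compared on the same verification samples.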
With the text recognition (CRNN) model, the cropped image of each detected text line region is taken as the input of the text recognition model, and text recognition is performed to obtain a text information character string; building this model may include: extracting a bill background picture from the bills, adding the relevant fonts that appear on the bills, and generating training samples with a fixed number of characters by manual synthesis; adding noise so that the training samples are closer to real data; constructing a CRNN model, dividing the calibrated sample set into a training set and a verification set at a ratio of 9:1, and training the network model; if the model converges, saving it; if not, stopping training, adjusting the parameters, and retraining until the algorithm converges.
106. Checking the recognition result, performing structuring processing, and outputting and saving the result in a specified format.
Illustratively, structured text information is constructed from the obtained text information and the label information of each area, and the structured text information is output and saved in a specified format.
It can be understood that the structuring process looks up the label of each rectangular area in the template, assigns the text information recognized in the corresponding rectangular area of the picture to be identified to that label, and thereby forms the structured data. For example, the character string output by CRNN character recognition is converted into the format specified by the user and output; for the amount part of an invoice, if both the amount in words (uppercase) and the amount in figures (lowercase) are recognized, the uppercase amount is output to the uppercase column and the lowercase amount is output to the lowercase column.
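The structuring described above may be sketched, purely as an assumption-laden example, as a mapping from template labels to recognized strings. The label names, the region identifiers, the JSON output format, and the rule used to tell an uppercase amount from a lowercase one (every character belongs to the set of Chinese financial numerals and units) are hypothetical choices made for this sketch.

```python
import json

# Characters that appear when an invoice amount is written in words.
UPPERCASE_AMOUNT_CHARS = set("零壹贰叁肆伍陆柒捌玖拾佰仟万亿圆元角分整")


def is_uppercase_amount(text):
    """True when every non-space character is a financial numeral or unit."""
    chars = [c for c in text if not c.isspace()]
    return bool(chars) and all(c in UPPERCASE_AMOUNT_CHARS for c in chars)


def structure_results(region_labels, recognized_text):
    """Build a structured record.

    region_labels: {region_id: label} taken from the template.
    recognized_text: {region_id: [recognized strings]} from recognition.
    """
    record = {}
    for region_id, label in region_labels.items():
        strings = recognized_text.get(region_id, [])
        if label == "amount":
            # Route each amount string to the uppercase or lowercase column.
            for text in strings:
                column = ("amount_uppercase" if is_uppercase_amount(text)
                          else "amount_lowercase")
                record[column] = text
        else:
            record[label] = " ".join(strings)
    return record


labels = {"r1": "invoice_number", "r2": "amount"}
recognized = {"r1": ["12345678"], "r2": ["壹佰贰拾叁圆整", "¥123.00"]}
record = structure_results(labels, recognized)
output = json.dumps(record, ensure_ascii=False)
```

Keeping the labels in the template rather than in the code means that a different invoice type only requires a different template, which matches the custom-template approach of the embodiment.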
In the embodiment of the invention, a value-added tax electronic invoice of a target type is selected, a standard picture of the value-added tax electronic invoice of the target type is selected as a reference picture, and a user-defined identification template is manufactured; importing a value-added tax electronic invoice file, wherein the value-added tax electronic invoice file comprises a value-added tax electronic invoice file in a PDF format and a first value-added tax electronic invoice file in a picture format, and each value-added tax electronic invoice file is named as a uniform format; performing structural analysis on the PDF value-added tax electronic invoice file, and converting the PDF value-added tax electronic invoice file into a second value-added tax electronic invoice file in a picture format; preprocessing and geometrically correcting a target value-added tax electronic invoice file, and processing the target value-added tax electronic invoice file into a uniform standard size through scaling and cutting to obtain a processed value-added tax electronic invoice file, wherein the target value-added tax electronic invoice file comprises a first value-added tax electronic invoice file in a picture format and a second value-added tax electronic invoice file in the picture format; aligning the processed value-added tax electronic invoice file with the reference picture; cutting the region to be identified in the aligned picture corresponding to the template picture into the region of interest ROI to obtain the ROI intercepted region; performing character detection and character recognition operation on the ROI intercepted area by utilizing an end-to-end character recognition algorithm based on deep learning to obtain a recognition result; and checking the identification result, performing structuring processing, and outputting and storing in a specified format. 
With the invention, the identification area template can be set according to the user's own needs, electronic invoices in PDF format and picture format can be imported directly, and a structured result in a custom format is output based on the deep-learning-based end-to-end text recognition model. This greatly improves the office efficiency of financial staff, shortens the reimbursement cycle, ensures recognition accuracy, and helps promote office automation.
It can be understood that adding structure analysis of PDF-format electronic invoice files removes the tedious work of manually taking screenshots of electronic invoices; adding the custom template function allows the picture to be positioned quickly and accurately for the best recognition effect and allows regions of interest to be extracted according to the user's own needs; and performing end-to-end recognition of the text content with deep learning algorithms avoids the complicated design and inefficient processing flow of traditional OCR methods and greatly improves recognition accuracy.
As shown in fig. 4, which is a schematic diagram of an embodiment of an electronic device provided in an embodiment of the present invention, the electronic device may include:
the acquiring module 401 is configured to select a value-added tax electronic invoice of a target type, select a standard picture of the value-added tax electronic invoice of the target type as a reference picture, and make a custom identification template; importing a value-added tax electronic invoice file, wherein the value-added tax electronic invoice file comprises a value-added tax electronic invoice file in a PDF format and a first value-added tax electronic invoice file in a picture format, and each value-added tax electronic invoice file is named as a uniform format;
the processing module 402 is used for performing structure analysis on the value-added tax electronic invoice files of the PDF class and converting the value-added tax electronic invoice files into second value-added tax electronic invoice files in a picture format; preprocessing and geometrically correcting a target value-added tax electronic invoice file, and processing the target value-added tax electronic invoice file into a uniform standard size through scaling and cutting to obtain a processed value-added tax electronic invoice file, wherein the target value-added tax electronic invoice file comprises a first value-added tax electronic invoice file in a picture format and a second value-added tax electronic invoice file in the picture format; aligning the processed value-added tax electronic invoice file with the reference picture; cutting the region to be identified in the aligned picture corresponding to the template picture into the region of interest ROI to obtain the ROI intercepted region; performing character detection and character recognition operation on the ROI intercepted area by utilizing an end-to-end character recognition algorithm based on deep learning to obtain a recognition result; and checking the identification result, performing structuring processing, and outputting and storing in a specified format.
Optionally, in some embodiments of the present invention, importing the value-added tax electronic invoice file supports uploading a single file or multiple files; after recognition is finished, the recognition results of the imported value-added tax electronic invoice files are stored separately according to file attributes. Optionally, in some embodiments of the present invention,
the obtaining module 401 is specifically configured to select a standard value-added tax electronic invoice picture, where the standard value-added tax electronic invoice picture is complete, clear, correct, and pollution-free and is used as a reference picture for making a custom identification template; selecting 4 corners on an external frame of a table on a picture of the standard value-added tax electronic invoice as reference points for picture alignment transformation, and storing coordinate points of the 4 corners; and selecting an area to be identified in the electronic invoice according to the requirement, storing the coordinate position of the area to be identified and the content represented by the area as the label information of the structured result, and obtaining the custom identification template.
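By way of a non-authoritative sketch, a custom identification template of this kind could be held in a small data structure that stores the 4 corner points and the labelled areas, and then used to cut the ROI out of an aligned picture. The class names `Region` and `Template`, the rectangle convention (x, y, width, height), and the toy pixel grid are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    label: str    # content represented by the area (structured-result label)
    x: int        # left edge in template coordinates
    y: int        # top edge in template coordinates
    width: int
    height: int


@dataclass
class Template:
    corners: list                      # 4 (x, y) corners of the table frame
    regions: list = field(default_factory=list)


def crop_regions(image, template):
    """Cut every labelled area out of an image already aligned to the template.

    image: list of rows, each row a list of pixel values.
    Returns {label: cropped rows}.
    """
    return {
        r.label: [row[r.x:r.x + r.width]
                  for row in image[r.y:r.y + r.height]]
        for r in template.regions
    }


# Toy 4x6 "picture" whose pixel value encodes its position (10*row + column).
picture = [[10 * row + col for col in range(6)] for row in range(4)]
template = Template(
    corners=[(0, 0), (5, 0), (5, 3), (0, 3)],
    regions=[Region("invoice_number", x=1, y=2, width=3, height=2)],
)
rois = crop_regions(picture, template)
```

Because the areas are stored in template coordinates, the cropping is only meaningful after the picture has been corrected to the template size.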
Optionally, in some embodiments of the present invention,
the processing module 402 is specifically configured to convert the target value-added tax electronic invoice file to grayscale to obtain a single-channel grayscale image; apply Gaussian filtering to the single-channel grayscale image for smoothing and noise reduction to obtain a smoothed, denoised image; extract the 4 straight lines of the outer frame from the smoothed, denoised image through the Hough transform, and from them calculate the coordinate positions of the 4 corner points of the outer frame; compare the coordinate positions of these 4 corner points with the 4 corner points in the custom identification template, and perform geometric correction through multi-level perspective transformation to obtain a corrected picture; and, with reference to the reference picture, crop away redundant border regions and crop out the ROI regions of the corrected picture, keeping the size of the corrected picture consistent with that of the custom identification template, to obtain a picture of the region to be identified.
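The geometric correction from 4 corner correspondences may be illustrated with the following minimal sketch. It computes a single perspective (homography) matrix that maps the detected corners onto the template corners; the multi-level scheme of the embodiment is not reproduced, and the function names and coordinates are hypothetical.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def perspective_matrix(src, dst):
    """3x3 matrix (last entry fixed to 1) mapping each src point to dst."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(h, point):
    """Apply the perspective matrix to one (x, y) point."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Assumed detected corners (skewed) and template corners (upright rectangle).
detected = [(12, 8), (410, 20), (400, 260), (5, 250)]
template_corners = [(0, 0), (400, 0), (400, 250), (0, 250)]
H = perspective_matrix(detected, template_corners)
mapped = [warp_point(H, p) for p in detected]
```

Four point pairs give exactly the eight equations needed for the eight unknown matrix entries, which is why the 4 corners of the outer frame are sufficient reference points for the alignment.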
Optionally, in some embodiments of the present invention, the deep-learning-based end-to-end character recognition (OCR) technology relies on two deep learning models: a character detection model and a character recognition model.
Optionally, in some embodiments of the present invention,
the character detection model is obtained as follows: pictures from various scenes containing different characters are collected, the character areas are calibrated manually, and the samples are divided into a training set and a test set at a ratio of 9:1; the model is trained with the CTPN algorithm in deep learning; the detection model can detect the character areas in the picture of the area to be identified line by line, and locate each character area and display it visually with a rectangular frame.
Optionally, in some embodiments of the present invention,
the character recognition model is obtained as follows: Chinese and English corpora are collected, a character recognition sample set with a fixed number of characters per sample is generated, and the model is trained with the CRNN algorithm in deep learning; the recognition model can recognize the character information in the picture of the area to be identified without text line segmentation or character segmentation, and outputs the character information as a character string.
As shown in fig. 5, which is a schematic diagram of an embodiment of an electronic device in an embodiment of the present invention, the electronic device may include:
a transceiver 501, a processor 502 and a memory 503, wherein the transceiver 501, the processor 502 and the memory 503 are connected through a bus;
the memory 503 is used for storing operation instructions;
the transceiver 501 is used for selecting a value-added tax electronic invoice of a target type, selecting a standard picture of the value-added tax electronic invoice of the target type as a reference picture, and making a custom identification template; and importing value-added tax electronic invoice files, wherein the formats supported by the value-added tax electronic invoice files comprise PDF (Portable Document Format) and picture formats, and each value-added tax electronic invoice file is named in a uniform format;
the processor 502 is configured to invoke the operation instruction to execute the steps of the method for extracting electronic invoice information shown in fig. 1 according to the embodiment of the present invention.
The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only a logical division, and other divisions are possible in practice; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for extracting electronic invoice information is characterized by comprising the following steps:
selecting a value-added tax electronic invoice of a target type, selecting a standard picture of the value-added tax electronic invoice of the target type as a reference picture, and making a custom identification template;
importing a value-added tax electronic invoice file, wherein the value-added tax electronic invoice file comprises a value-added tax electronic invoice file in a PDF format and a first value-added tax electronic invoice file in a picture format, and each value-added tax electronic invoice file is named as a uniform format;
performing structural analysis on the PDF value-added tax electronic invoice file, and converting the PDF value-added tax electronic invoice file into a second value-added tax electronic invoice file in a picture format;
preprocessing and geometrically correcting a target value-added tax electronic invoice file, and processing the target value-added tax electronic invoice file into a uniform standard size through scaling and cutting to obtain a processed value-added tax electronic invoice file, wherein the target value-added tax electronic invoice file comprises a first value-added tax electronic invoice file in a picture format and a second value-added tax electronic invoice file in the picture format;
aligning the processed value-added tax electronic invoice file with the reference picture;
cutting the region to be identified in the aligned picture corresponding to the template picture into the region of interest ROI to obtain the ROI intercepted region;
performing character detection and character recognition operation on the ROI intercepted area by utilizing an end-to-end character recognition algorithm based on deep learning to obtain a recognition result;
and checking the identification result, performing structuring processing, and outputting and storing in a specified format.
2. The method according to claim 1, wherein the imported value-added tax electronic invoice file supports single and multiple file uploads; and after the identification is finished, respectively storing the identification results of the imported value-added tax electronic invoice files according to the file attributes.
3. The method as claimed in claim 1, wherein the selecting the value-added tax electronic invoice of the target type and the selecting the standard picture of the value-added tax electronic invoice of the target type as the reference picture, and the making of the custom identification template comprises:
selecting a standard value-added tax electronic invoice picture, wherein the standard value-added tax electronic invoice picture is complete, clear, correct and pollution-free and is used as a reference picture for manufacturing a custom identification template;
selecting 4 corners on an external frame of a table on a picture of the standard value-added tax electronic invoice as reference points for picture alignment transformation, and storing coordinate points of the 4 corners;
and selecting an area to be identified in the electronic invoice according to the requirement, storing the coordinate position of the area to be identified and the content represented by the area as the label information of the structured result, and obtaining the custom identification template.
4. The method according to any one of claims 1-3, wherein the pre-processing and geometry correcting the target value-added tax electronic invoice file comprises:
carrying out gray processing on the target value-added tax electronic invoice file, and converting the target value-added tax electronic invoice file into a single-channel gray image;
performing smooth noise reduction processing on the single-channel gray level image by adopting Gaussian filtering processing to obtain a smooth noise reduction image;
extracting 4 outer frame straight lines in the smooth noise reduction image through Hough transformation, and further calculating to obtain coordinate positions of 4 corner points of the outer frame; comparing the coordinate positions of the 4 angular points with the 4 angular points in the custom identification template, and performing geometric correction through multi-level perspective transformation to obtain a corrected picture;
and comparing the reference picture, cutting off redundant boundary regions and ROI regions in the corrected picture, keeping the size of the corrected picture consistent with that of the self-defined identification template, and obtaining a picture of the region to be identified.
5. A method according to any of claims 1-3, characterized in that said deep learning based end-to-end character recognition OCR technique relies on two deep learning models, respectively: a text detection model and a text recognition model.
6. The method of claim 4,
the character detection model is as follows: collecting pictures with various scenes and containing different characters, manually calibrating character areas, and dividing the character areas into a training set and a testing set according to a ratio of 9: 1; performing model training through a CTPN algorithm in deep learning; the detection model can detect the character area in the picture of the area to be identified, detect the character area in a line mode, and position and visually display the character area in a rectangular frame.
7. The method of claim 4,
the character recognition model is as follows: collecting a Chinese language and English language database, generating a character recognition sample set containing fixed word number length, and performing model training through a CRNN algorithm in deep learning; the recognition model can recognize the character information in the picture of the area to be recognized, does not need to perform text line segmentation and character segmentation, and outputs the character information in a character string format.
8. An electronic device, comprising:
the acquisition module is used for selecting the value-added tax electronic invoice of the target type, selecting a standard picture of the value-added tax electronic invoice of the target type as a reference picture, and making a self-defined identification template; importing a value-added tax electronic invoice file, wherein the value-added tax electronic invoice file comprises a value-added tax electronic invoice file in a PDF format and a first value-added tax electronic invoice file in a picture format, and each value-added tax electronic invoice file is named as a uniform format;
the processing module is used for carrying out structural analysis on the PDF value-added tax electronic invoice files and converting the PDF value-added tax electronic invoice files into second value-added tax electronic invoice files in a picture format; preprocessing and geometrically correcting a target value-added tax electronic invoice file, and processing the target value-added tax electronic invoice file into a uniform standard size through scaling and cutting to obtain a processed value-added tax electronic invoice file, wherein the target value-added tax electronic invoice file comprises a first value-added tax electronic invoice file in a picture format and a second value-added tax electronic invoice file in the picture format; aligning the processed value-added tax electronic invoice file with the reference picture; cutting the region to be identified in the aligned picture corresponding to the template picture into the region of interest ROI to obtain the ROI intercepted region; performing character detection and character recognition operation on the ROI intercepted area by utilizing an end-to-end character recognition algorithm based on deep learning to obtain a recognition result; and checking the identification result, performing structuring processing, and outputting and storing in a specified format.
9. An electronic device, comprising:
a transceiver, a processor, and a memory, wherein the transceiver, the processor, and the memory are connected by a bus;
the memory is used for storing operation instructions;
the transceiver is used for selecting the value-added tax electronic invoice of the target type, selecting the standard picture of the value-added tax electronic invoice of the target type as a reference picture, and making a self-defined identification template; importing a value-added tax electronic invoice file, wherein the value-added tax electronic invoice file comprises a value-added tax electronic invoice file in a PDF format and a first value-added tax electronic invoice file in a picture format, and each value-added tax electronic invoice file is named as a uniform format;
the processor is used for calling the operation instruction to execute the steps of the electronic invoice information extraction method according to any one of claims 1-7.
10. A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of extraction of electronic invoice information, according to any one of claims 1 to 7.
CN201910915157.XA 2019-09-26 2019-09-26 Electronic invoice information extraction method and electronic equipment Pending CN110751143A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910915157.XA CN110751143A (en) 2019-09-26 2019-09-26 Electronic invoice information extraction method and electronic equipment

Publications (1)

Publication Number Publication Date
CN110751143A true CN110751143A (en) 2020-02-04

Family

ID=69277151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910915157.XA Pending CN110751143A (en) 2019-09-26 2019-09-26 Electronic invoice information extraction method and electronic equipment

Country Status (1)

Country Link
CN (1) CN110751143A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130230246A1 (en) * 2012-03-01 2013-09-05 Ricoh Company, Ltd. Expense Report System With Receipt Image Processing
CN107358232A (en) * 2017-06-28 2017-11-17 中山大学新华学院 Invoice recognition methods and identification and management system based on plug-in unit
CN107766809A (en) * 2017-10-09 2018-03-06 平安科技(深圳)有限公司 Electronic installation, billing information recognition methods and computer-readable recording medium
CN109214382A (en) * 2018-07-16 2019-01-15 顺丰科技有限公司 A kind of billing information recognizer, equipment and storage medium based on CRNN
CN109344838A (en) * 2018-11-02 2019-02-15 长江大学 The automatic method for quickly identifying of invoice information, system and device
CN109635627A (en) * 2018-10-23 2019-04-16 中国平安财产保险股份有限公司 Pictorial information extracting method, device, computer equipment and storage medium
CN109657665A (en) * 2018-10-31 2019-04-19 广东工业大学 A kind of invoice batch automatic recognition system based on deep learning
CN109800747A (en) * 2018-12-14 2019-05-24 平安科技(深圳)有限公司 Medical invoice recognition methods, user equipment, storage medium and device

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348441A (en) * 2019-07-10 2019-10-18 深圳市华云中盛科技有限公司 VAT invoice recognition methods, device, computer equipment and storage medium
CN110348441B (en) * 2019-07-10 2021-08-17 深圳市华云中盛科技股份有限公司 Value-added tax invoice identification method and device, computer equipment and storage medium
CN111401007A (en) * 2020-03-03 2020-07-10 厦门亿禄信息科技有限公司 Method for converting unstructured data into structured data
CN111444792B (en) * 2020-03-13 2023-05-09 安诚迈科(北京)信息技术有限公司 Bill identification method, electronic equipment, storage medium and device
CN111444792A (en) * 2020-03-13 2020-07-24 安诚迈科(北京)信息技术有限公司 Bill recognition method, electronic device, storage medium and device
CN111401266B (en) * 2020-03-19 2023-11-03 杭州易现先进科技有限公司 Method, equipment, computer equipment and readable storage medium for positioning picture corner points
CN111401266A (en) * 2020-03-19 2020-07-10 杭州易现先进科技有限公司 Method, device, computer device and readable storage medium for positioning corner points of drawing book
CN111401312B (en) * 2020-04-10 2024-04-26 深圳新致软件有限公司 PDF drawing text recognition method, system and equipment
CN111401312A (en) * 2020-04-10 2020-07-10 深圳新致软件有限公司 PDF drawing character recognition method, system and equipment
CN111539412A (en) * 2020-04-21 2020-08-14 上海云从企业发展有限公司 Image analysis method, system, device and medium based on OCR
CN111539412B (en) * 2020-04-21 2021-02-26 上海云从企业发展有限公司 Image analysis method, system, device and medium based on OCR
CN111652232B (en) * 2020-05-29 2023-08-22 泰康保险集团股份有限公司 Bill identification method and device, electronic equipment and computer readable storage medium
CN111652232A (en) * 2020-05-29 2020-09-11 泰康保险集团股份有限公司 Bill identification method and device, electronic equipment and computer readable storage medium
CN113762244A (en) * 2020-06-05 2021-12-07 北京市天元网络技术股份有限公司 Document information extraction method and device
CN111753717B (en) * 2020-06-23 2023-07-28 北京百度网讯科技有限公司 Method, device, equipment and medium for extracting structured information of text
CN111753717A (en) * 2020-06-23 2020-10-09 北京百度网讯科技有限公司 Method, apparatus, device and medium for extracting structured information of text
CN111783645A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Character recognition method and device, electronic equipment and computer readable storage medium
CN112016547A (en) * 2020-08-20 2020-12-01 上海天壤智能科技有限公司 Image character recognition method, system and medium based on deep learning
CN112036123A (en) * 2020-08-31 2020-12-04 北京奇虎鸿腾科技有限公司 PDF (Portable document Format) generation method, device and equipment based on webpage and storage medium
CN112036123B (en) * 2020-08-31 2024-05-10 三六零数字安全科技集团有限公司 PDF generation method, device, equipment and storage medium based on webpage
CN111931771A (en) * 2020-09-16 2020-11-13 深圳壹账通智能科技有限公司 Bill content identification method, device, medium and electronic equipment
WO2022057470A1 (en) * 2020-09-16 2022-03-24 深圳壹账通智能科技有限公司 Bill content recognition method and apparatus, and computer device, and medium
CN112132016A (en) * 2020-09-22 2020-12-25 平安科技(深圳)有限公司 Bill information extraction method and device and electronic equipment
CN112132016B (en) * 2020-09-22 2023-09-15 平安科技(深圳)有限公司 Bill information extraction method and device and electronic equipment
CN112651289B (en) * 2020-10-19 2023-10-13 广东工业大学 Value-added tax common invoice intelligent recognition and verification system and method thereof
CN112651289A (en) * 2020-10-19 2021-04-13 广东工业大学 Intelligent identification and verification system and method for value-added tax common invoice
CN112348022A (en) * 2020-10-28 2021-02-09 富邦华一银行有限公司 Free-form document identification method based on deep learning
CN112348022B (en) * 2020-10-28 2024-05-07 富邦华一银行有限公司 Free-form document identification method based on deep learning
CN112580618A (en) * 2020-10-30 2021-03-30 中电万维信息技术有限责任公司 Electronic license verification method based on OCR
CN112257095A (en) * 2020-11-23 2021-01-22 中电万维信息技术有限责任公司 Method for selecting alliance chain consensus node
CN112257095B (en) * 2020-11-23 2022-03-22 中电万维信息技术有限责任公司 Method for selecting alliance chain consensus node
WO2022111549A1 (en) * 2020-11-25 2022-06-02 杭州睿胜软件有限公司 Document recognition method and apparatus, and readable storage medium
CN112308036A (en) * 2020-11-25 2021-02-02 杭州睿胜软件有限公司 Bill identification method and device and readable storage medium
CN112633116B (en) * 2020-12-17 2024-02-02 西安理工大学 Method for intelligently analyzing PDF graphics context
CN112633116A (en) * 2020-12-17 2021-04-09 西安理工大学 Method for intelligently analyzing PDF (Portable document Format) image-text
CN112507973B (en) * 2020-12-29 2022-09-06 中国电子科技集团公司第二十八研究所 Text and picture recognition system based on OCR technology
CN112507973A (en) * 2020-12-29 2021-03-16 中国电子科技集团公司第二十八研究所 Text and picture recognition system based on OCR technology
CN112949455B (en) * 2021-02-26 2024-04-05 武汉天喻信息产业股份有限公司 Value-added tax invoice recognition system and method
CN112949455A (en) * 2021-02-26 2021-06-11 武汉天喻信息产业股份有限公司 Value-added tax invoice identification system and method
CN113420657A (en) * 2021-06-23 2021-09-21 平安科技(深圳)有限公司 Intelligent verification method and device, computer equipment and storage medium
CN113343663A (en) * 2021-06-29 2021-09-03 广州智选网络科技有限公司 Bill structuring method and device
CN113963147A (en) * 2021-09-26 2022-01-21 西安交通大学 Key information extraction method and system based on semantic segmentation
CN113963147B (en) * 2021-09-26 2023-09-15 西安交通大学 Key information extraction method and system based on semantic segmentation
CN113947380A (en) * 2021-10-21 2022-01-18 广东电网有限责任公司 Automatic summarizing method and device for ETC invoices
CN114494678A (en) * 2021-12-02 2022-05-13 国家计算机网络与信息安全管理中心 Character recognition method and electronic equipment
CN116824604B (en) * 2023-08-30 2023-11-21 江苏苏宁银行股份有限公司 Financial data management method and system based on image processing
CN116824604A (en) * 2023-08-30 2023-09-29 江苏苏宁银行股份有限公司 Financial data management method and system based on image processing

Similar Documents

Publication Publication Date Title
CN110751143A (en) Electronic invoice information extraction method and electronic equipment
US10489682B1 (en) Optical character recognition employing deep learning with machine generated training data
US10846553B2 (en) Recognizing typewritten and handwritten characters using end-to-end deep learning
US10915788B2 (en) Optical character recognition using end-to-end deep learning
US9552516B2 (en) Document information extraction using geometric models
JP6528147B2 (en) Accounting data entry support system, method and program
US11232300B2 (en) System and method for automatic detection and verification of optical character recognition data
US10489645B2 (en) System and method for automatic detection and verification of optical character recognition data
CN112434691A (en) HS code matching and displaying method and system based on intelligent analysis and identification and storage medium
US20170052985A1 (en) Normalizing values in data tables
US8792730B2 (en) Classification and standardization of field images associated with a field in a form
US20150169972A1 (en) Character data generation based on transformed imaged data to identify nutrition-related data or other types of data
Isheawy et al. Optical character recognition (OCR) system
US9047533B2 (en) Parsing tables by probabilistic modeling of perceptual cues
Akinbade et al. An adaptive thresholding algorithm-based optical character recognition system for information extraction in complex images
RU2597163C2 (en) Comparing documents using reliable source
CN110263000B (en) Paper document electronization and filing method
CN111241329A (en) Image retrieval-based ancient character interpretation method and device
CN115311666A (en) Image-text recognition method and device, computer equipment and storage medium
US11281901B2 (en) Document extraction system and method
US11256760B1 (en) Region adjacent subgraph isomorphism for layout clustering in document images
JPH11219394A (en) Automatic various financial chart input device
KR102561878B1 (en) Ai blue ocr reading system and method based on machine learning
CN115546817A (en) Document parsing method and device
Nair et al. A Smarter Way to Collect and Store Data: AI and OCR Solutions for Industry 4.0 Systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200204)