CN115497114A - Structured information extraction method for cigarette logistics receipt bill - Google Patents


Info

Publication number
CN115497114A
Authority
CN
China
Prior art keywords
picture
value
template
text
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211442689.4A
Other languages
Chinese (zh)
Other versions
CN115497114B (en)
Inventor
曾华
徐伟
刘永海
朱小晓
胡晓峰
李涛
周幸
曾鹏程
李�瑞
廖健
王静雅
付雯
龙涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Aimo Technology Co ltd
China National Tobacco Corp Sichuan Branch
Original Assignee
Shenzhen Aimo Technology Co ltd
China National Tobacco Corp Sichuan Branch
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Aimo Technology Co ltd, China National Tobacco Corp Sichuan Branch filed Critical Shenzhen Aimo Technology Co ltd
Priority to CN202211442689.4A
Publication of CN115497114A
Application granted
Publication of CN115497114B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/414Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/12Detection or correction of errors, e.g. by rescanning the pattern
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/418Document matching, e.g. of document images
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a structured information extraction method for cigarette logistics receipt bills, comprising a pre-labeling step and an identification step. Pre-labeling step: set a template picture standard for the bill, select a standard template picture, and label keys and values on the template picture, where a key is a fixed, unchanging keyword in the bill and a value is variable content in the bill. Identification step: determine a picture to be recognized, match the keys in the picture to be recognized against the keys of the template picture, set the text boxes other than keys in the picture to be recognized as value candidate boxes, align the picture to be recognized with the template picture according to the key correspondences, perform misalignment correction on the value candidate boxes, and extract structured information from the content near the template picture's value text boxes. The method is based on structured information extraction via template alignment and misalignment correction: only one template picture needs to be labeled per bill type, so the method is flexible and widely applicable, handles printing misalignment, and achieves high recognition accuracy.

Description

Structured information extraction method for cigarette logistics receipt bill
Technical Field
The invention relates to the field of logistics, in particular to a structured information extraction method for cigarette logistics receipt bills.
Background
In a tobacco logistics scenario, the receiver must confirm the shipper's information: the information on the bill needs to be checked against the information recorded in the system. Manual review is time-consuming and error-prone; an alternative is to automatically extract the structured information on the bill (dates, numbers, etc.) with an image recognition algorithm and compare it with the structured information recorded in the system.
At present there are two main approaches to extracting structured information from bills. One post-processes the OCR (optical character recognition) result with rules such as regular-expression matching; this is flexible but has low accuracy and, in particular, cannot handle printing misalignment. The other uses deep learning to detect the position of each field and then runs OCR on that field; this is accurate, but a large amount of data must be collected, labeled, and trained for every bill type, so it is inflexible and has limited applicability.
Disclosure of Invention
Based on the above problems, the invention provides a structured information extraction method for cigarette logistics receipt bills based on template alignment and misalignment correction. Only one template picture needs to be labeled per bill type, so the method is flexible and widely applicable; it handles printing misalignment and achieves high recognition accuracy.
The technical scheme of the invention is as follows:
a structured information extraction method for cigarette logistics receipt bills is characterized by comprising the following steps:
pre-labeling step: setting a template picture standard for the bill, selecting a standard template picture, and labeling keys and values on the template picture, wherein a key is a fixed, unchanging keyword in the bill and a value is variable content in the bill;
identification step: determining a picture to be recognized, matching the keys in the picture to be recognized with the keys of the template picture, setting the text boxes other than keys in the picture to be recognized as value candidate boxes, aligning the picture to be recognized with the template picture according to the correspondence between the keys, performing misalignment correction on the value candidate boxes, and extracting structured information from the content near the template picture's value text boxes.
The idea of the technical scheme is as follows:
A bill consists of two parts: fixed keywords, called keys (e.g. "Name"), and variable content, called values (e.g. "Zhang San"). Each bill type follows a specific layout: the key content never changes and the key positions align exactly, while the value content varies (even its length and line count can change), its position fluctuating only around a preset location.
Based on this characteristic of bills, the invention selects one standard template picture per bill type and labels its keys and values. The keys in the picture to be recognized are matched and associated with the template keys, and the template is aligned via perspective transformation, so that the corresponding structured information can be extracted near the preset value-box positions. Furthermore, to counter the interference caused by value printing misalignment, after key-based template alignment a misalignment correction is performed according to the degree of alignment between the value boxes of the picture to be recognized and the template value boxes, which greatly improves the accuracy of structured information extraction.
In the pre-labeling step, the template picture of the bill is a flat picture with no skew and no printing misalignment.
In the pre-labeling step, the step of labeling the key is as follows:
Rectangular-box labeling and text-content labeling are performed on the template picture; each tight rectangular box encloses only the keyword area.
In the pre-labeling step, the step of labeling value is as follows:
Rectangular-box labeling and field-name labeling are performed on the fields to be recognized in the template picture, apart from the labeled keys.
In the identification step, the method further comprises the following steps:
All text boxes and their text content in the picture to be recognized are detected and recognized by OCR.
In the identification step, the method further comprises the following steps:
All text boxes and text content obtained from the picture to be recognized are matched by keyword to determine whether each box belongs to a key text box of the template picture. If it does, the key in the picture to be recognized is associated with the template picture's key, forming a group of key correspondences; if not, the text box and its content are kept as a value candidate box. If not a single group of key correspondences exists, the current picture cannot be recognized.
In the identification step, the method further comprises the following steps:
The picture to be recognized is aligned with the template picture according to the correspondence between keys: the 4 vertices of each text box are extracted, and 4 groups of vertex correspondences are established for each key correspondence between the picture to be recognized and the template picture. With N groups of key correspondences, N×4 groups of vertex-coordinate correspondences are established; a homography matrix is computed from these correspondences, and the picture to be recognized is aligned with the template picture by perspective transformation.
In the identification step, the method further comprises the following steps:
All value candidate boxes are translated at least once according to a preset rule; for each displacement, the degree of alignment between the value candidate boxes and the template value boxes is computed. The displacement with the highest alignment degree is selected as the final misalignment displacement, and all value candidate boxes are corrected by that displacement, yielding the misalignment-corrected value boxes of the picture to be recognized and their contents.
In the identification step, the method further comprises the following steps:
Structured information is extracted near the template picture's value text boxes: for each value candidate box, the template value text box with the largest overlap area is found; if the overlap degree of the two exceeds a set threshold, the value candidate box is associated with that template value text box, otherwise the candidate box is ignored;
after all value candidate boxes have been associated, the candidate boxes associated with each template value text box are the content of that value field, and their text content is concatenated to obtain the extracted structured information of the field. If a template value text box has no associated value candidate box, that field cannot be identified.
Beneficial effects of the invention:
1. structured information extraction based on template alignment and misalignment correction requires only one labeled template picture per bill type; the method is flexible, widely applicable, handles printing misalignment, and achieves high recognition accuracy;
2. one standard template picture is selected per bill type to label keys and values; the keys in the picture to be recognized are matched and associated with the template keys, and the template is aligned via perspective transformation, so that the corresponding structured information can be extracted near the preset value-box positions;
3. to counter the interference caused by value printing misalignment, after key-based template alignment a misalignment correction is performed according to the degree of alignment between the values of the picture to be recognized and the template values, greatly improving the accuracy of structured information extraction.
Detailed Description
The following provides a detailed description of embodiments of the invention.
Embodiment:
a structured information extraction method for cigarette logistics receipt bills comprises the following steps:
pre-labeling step: select a standard template picture and label keys and values on it;
identification step: perform OCR detection and recognition on the picture to be recognized; associate the text boxes in the OCR result that match keys with the template keys, keeping the remaining text boxes as value candidate boxes; align the picture to be recognized with the template picture according to the key correspondences; perform misalignment correction on the value candidate boxes; and extract structured information near the template's preset value boxes.
The idea of the above embodiment is as follows:
A bill consists of two parts: fixed keywords, called keys (e.g. "Name"), and variable content, called values (e.g. "Zhang San"). Each bill type follows a specific layout: the key content never changes and the key positions align exactly, while the value content varies (even its length and line count can change), its position fluctuating only around a preset location.
Based on this characteristic of bills, the invention selects one standard template picture per bill type and labels its keys and values. The keys in the picture to be recognized are matched and associated with the template keys, and the template is aligned via perspective transformation, so that the corresponding structured information can be extracted near the preset value-box positions. Furthermore, to counter the interference caused by value printing misalignment, after key-based template alignment a misalignment correction is performed according to the degree of alignment between the value boxes of the picture to be recognized and the template value boxes, which greatly improves the accuracy of structured information extraction.
In the pre-labeling step, the selected standard template picture is a flat picture with no skew and no printing misalignment.
In the pre-labeling step, keys are the fixed, unchanging keywords in the bill; each is given a rectangular-box label (the tight rectangular box encloses only the keyword area) and a text-content label. Values are the variable content in the bill; not all content is labeled, only the fields that need to be recognized, and each such field is given a rectangular-box label and a field-name label.
In the identification step, all text boxes and text content in the picture are obtained by OCR detection and recognition.
In the identification step, keyword matching determines whether each OCR text box belongs to a key text box. If so, it is associated with the corresponding template key, forming a group of key correspondences; if not, it is kept as a value candidate box. If not a single group of key correspondences exists, the current picture cannot be recognized.
In the identification step, the picture to be recognized is aligned with the template picture according to the correspondence between keys: the 4 vertices of each text box are extracted, and 4 groups of vertex correspondences are established from each key text box and its template key; with N groups of key correspondences, N×4 groups of vertex-coordinate correspondences are established. A homography matrix is computed from the vertex-coordinate correspondences, the picture to be recognized is aligned with the template picture by perspective transformation, and the positions of the value candidate boxes are mapped onto the aligned picture by the same transformation matrix.
In the identification step, misalignment correction is performed on the value candidate boxes. After template alignment, if there is no printing misalignment the value candidate boxes fall inside the template value boxes; if there is printing misalignment, the value candidate boxes are offset and poorly aligned with the template value boxes, falling on their edges or outside them.
All value candidate boxes are translated multiple times within a certain range: with the original position as the center, the boxes are translated up, down, left, and right within a radius of 50 pixels in both the x and y directions, in steps of 10 pixels, for a total of (50/10 × 2 + 1)² = 121 translations. For each displacement, the degree of alignment between the value candidate boxes and the template value boxes is computed; the displacement with the highest alignment degree is selected as the final misalignment displacement, and all value candidate boxes are corrected by that displacement.
Let the i-th template value box be $tv_i$, with $n$ template value boxes in total, and let the j-th value candidate box be $v_j$, with $m$ candidate boxes in total. Let $\mathrm{intersection}$ denote box intersection, $\mathrm{area}$ denote area computation, and $\mathrm{bin}$ denote a binarization function (1 if the condition is satisfied, 0 otherwise). The alignment degree $\mathrm{alignment\_ratio}$ is then computed as:

$$r_{ij} = \frac{\mathrm{area}(\mathrm{intersection}(tv_i, v_j))}{\mathrm{area}(v_j)}$$

$$r_j = \max_{1 \le i \le n} r_{ij}$$

$$a_j = \mathrm{bin}(r_j > t)$$

$$\mathrm{alignment\_ratio} = \frac{1}{m} \sum_{j=1}^{m} a_j$$

where $t$ is a set threshold.
In the identification step, structured information is extracted near the value boxes preset by the template. For each value candidate box, the template value box with the largest overlap area is found; if the overlap degree of the two exceeds a set threshold, the candidate box is associated with that template value box, otherwise it is ignored. After all value candidate boxes have been associated, the candidate boxes associated with each template value box together form the content of that value field, and their text content is concatenated to obtain the extracted structured information of the field; if a template value box has no associated candidate box, the field cannot be identified. Denoting the template value box by $tv$ and the value candidate box by $v$, the overlap degree $\mathrm{overlap\_ratio}$ is computed as:

$$\mathrm{overlap\_ratio} = \frac{\mathrm{area}(\mathrm{intersection}(tv, v))}{\mathrm{area}(v)}$$
The method extracts structured information based on template alignment and misalignment correction; only one template picture needs to be labeled per bill type, so the method is flexible and widely applicable, handles printing misalignment, and achieves high recognition accuracy. A standard template picture is selected per bill type to label keys and values; the keys in the picture to be recognized are matched and associated with the template keys, and the template is aligned via perspective transformation, so that the corresponding structured information can be extracted near the preset value-box positions. To counter the interference caused by value printing misalignment, after key-based template alignment a misalignment correction is performed according to the degree of alignment between the values of the picture to be recognized and the template values, greatly improving the accuracy of structured information extraction.
The steps in the pre-labeling stage are as follows:
1. and selecting a standard template picture.
The stencil image should be as flat as possible without skewing and without misalignment of the printing.
2. Label keys and values on the template picture.
A key is a fixed, unchanging keyword in the bill; it is given a rectangular-box label (the tight rectangular box encloses only the keyword area) and a text-content label.
A value is variable content in the bill; not all content is labeled, only the fields that need to be recognized. Each such field is given a rectangular-box label (a wide rectangular box that covers the full range of positions where the field content may appear) and a field-name label (keys and values are not in one-to-one correspondence, and some values have no key at all, so the field corresponding to a value is specified directly).
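The labeling described above can be captured in a simple data structure. The sketch below is a minimal illustration in Python; the schema and names (`keys`, `values`, `box`, `field`) are assumptions for illustration, not a format specified by the invention:

```python
# One possible template annotation format. Boxes are (x1, y1, x2, y2)
# in template-image pixel coordinates.
template = {
    "keys": [
        # tight boxes around fixed keywords, with their exact text
        {"box": (40, 30, 120, 60), "text": "Name"},
        {"box": (40, 90, 140, 120), "text": "Order No."},
    ],
    "values": [
        # wide boxes covering every position where the field content may
        # appear, labeled with a field name (not necessarily paired to a key)
        {"box": (150, 25, 500, 65), "field": "consignee_name"},
        {"box": (150, 85, 500, 125), "field": "order_number"},
    ],
}

def field_names(tpl):
    """List the fields this template can extract."""
    return [v["field"] for v in tpl["values"]]
```

A template annotated this way is all the per-bill-type setup the method requires.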
Building on the embodiment above, further details are given below.
The steps in the identification phase are as follows:
1. and (3) detecting and identifying the picture to be identified by ocr.
All text boxes and text contents in the picture are detected and identified through ocr.
2. Associate the key text boxes matched from the OCR result with the template keys; the remaining text boxes are value candidate boxes.
Keyword matching determines whether each OCR text box belongs to a key text box; if so, it is associated with the corresponding template key, forming a group of key correspondences; otherwise it is kept as a value candidate box.
If not a single group of key correspondences exists, the current picture cannot be recognized; otherwise step 3 is executed.
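As a sketch, the matching in step 2 might look like the following; exact text equality is assumed here as the keyword-matching criterion, since the patent does not fix a particular matching rule:

```python
def match_keys(ocr_results, template_keys):
    """Split OCR text boxes into key correspondences and value candidates.

    ocr_results:   list of (box, text) pairs from OCR detection/recognition.
    template_keys: list of (box, text) pairs labeled on the template picture.
    Returns (correspondences, value_candidates). Each correspondence pairs an
    OCR key box with its template key box; everything else becomes a value
    candidate box. If correspondences is empty, the picture cannot be
    recognized.
    """
    key_boxes = {text: box for box, text in template_keys}
    correspondences, value_candidates = [], []
    for box, text in ocr_results:
        if text in key_boxes:
            correspondences.append((box, key_boxes[text]))
        else:
            value_candidates.append((box, text))
    return correspondences, value_candidates
```

In practice a fuzzy string comparison could replace the exact-equality test to tolerate OCR errors in the keywords.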
3. Align the picture to be recognized with the template picture according to the correspondence between keys.
The 4 vertices of each text box are extracted, and 4 groups of vertex correspondences are established from each key text box and its template key; with N groups of key correspondences, N×4 groups of vertex-coordinate correspondences are established. A homography matrix is computed from the vertex-coordinate correspondences, the picture to be recognized is aligned with the template picture by perspective transformation, and the positions of the value candidate boxes are mapped onto the aligned picture by the same transformation matrix.
The principles of homography matrices and perspective transformation are prior art and are not described in detail here.
4. Perform misalignment correction on the value candidate boxes.
After template alignment, if there is no printing misalignment the value candidate boxes should fall essentially inside the template value boxes; conversely, if there is printing misalignment, the value candidate boxes are offset to some extent and poorly aligned with the template value boxes (falling on their edges or outside them).
All value candidate boxes are translated multiple times within a certain range (for example, with the original position as the center, translated up, down, left, and right within a radius of 50 pixels in both the x and y directions, in steps of 10 pixels, for a total of (50/10 × 2 + 1)² = 121 translations). For each displacement, the degree of alignment between the value candidate boxes and the template value boxes is computed; the displacement with the highest alignment degree is selected as the final misalignment displacement, and all value candidate boxes are corrected by that displacement.
Let the i-th template value box be $tv_i$, with $n$ template value boxes in total, and let the j-th value candidate box be $v_j$, with $m$ candidate boxes in total. Let $\mathrm{intersection}$ denote box intersection, $\mathrm{area}$ denote area computation, and $\mathrm{bin}$ denote a binarization function (1 if the condition is satisfied, 0 otherwise). The alignment degree $\mathrm{alignment\_ratio}$ is then computed as:

$$r_{ij} = \frac{\mathrm{area}(\mathrm{intersection}(tv_i, v_j))}{\mathrm{area}(v_j)}$$

$$r_j = \max_{1 \le i \le n} r_{ij}$$

$$a_j = \mathrm{bin}(r_j > t)$$

$$\mathrm{alignment\_ratio} = \frac{1}{m} \sum_{j=1}^{m} a_j$$

where $t$ is a set threshold.
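The displacement search and alignment degree described above can be sketched in Python as follows. Boxes are (x1, y1, x2, y2) tuples; the threshold `t` inside the `bin` step is an assumed value, since the original equation images are not reproduced here:

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); 0 if degenerate."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def intersection(a, b):
    """Intersection box of a and b (degenerate if they do not overlap)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def alignment_ratio(template_boxes, candidate_boxes, t=0.5):
    """Fraction of candidate boxes whose best overlap with any template
    value box exceeds the threshold t (the bin(...) step)."""
    m = len(candidate_boxes)
    if m == 0:
        return 0.0
    aligned = 0
    for v in candidate_boxes:
        av = area(v)
        if av == 0:
            continue
        best = max(area(intersection(tv, v)) / av for tv in template_boxes)
        if best > t:  # bin: 1 if the condition is satisfied, else 0
            aligned += 1
    return aligned / m

def shift(box, dx, dy):
    """Translate a box by (dx, dy)."""
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)

def best_shift(template_boxes, candidate_boxes, radius=50, step=10):
    """Try all (2 * radius / step + 1)**2 = 121 displacements and return
    the one with the highest alignment ratio."""
    best, best_score = (0, 0), -1.0
    for dx in range(-radius, radius + 1, step):
        for dy in range(-radius, radius + 1, step):
            shifted = [shift(v, dx, dy) for v in candidate_boxes]
            score = alignment_ratio(template_boxes, shifted)
            if score > best_score:
                best_score, best = score, (dx, dy)
    return best
```

Applying the returned displacement to every value candidate box yields the misalignment-corrected boxes used in step 5.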
5. Extract structured information near the value boxes preset by the template.
For each value candidate box, the template value box with the largest overlap area is found; if the overlap degree of the two exceeds a set threshold (e.g. 0.6), the candidate box is associated with that template value box, otherwise it is ignored. After all value candidate boxes have been associated, the candidate boxes associated with each template value box together form the content of that value field, and their text content is concatenated to form the extracted structured information of the field. If a template value box has no associated candidate box, the field cannot be identified.
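Step 5 can be sketched as follows; the joining order and separator used for concatenation are assumptions (here candidate boxes are concatenated top-to-bottom, left-to-right with spaces):

```python
def area(box):
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def inter_area(a, b):
    """Area of the intersection of two boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))

def overlap_ratio(tv, v):
    """Overlap degree: area(intersection(tv, v)) / area(v)."""
    return inter_area(tv, v) / area(v) if area(v) else 0.0

def extract_fields(template_values, candidates, threshold=0.6):
    """template_values: list of (box, field_name) labeled on the template.
    candidates: list of (box, text), already aligned and shift-corrected.
    Returns {field_name: text or None}; None means the field was not found."""
    assigned = {field: [] for _, field in template_values}
    for v_box, text in candidates:
        # template value box with the largest overlap area with this candidate
        tv_box, field = max(template_values,
                            key=lambda tvf: inter_area(tvf[0], v_box))
        if overlap_ratio(tv_box, v_box) > threshold:
            assigned[field].append((v_box, text))
    out = {}
    for field, parts in assigned.items():
        parts.sort(key=lambda p: (p[0][1], p[0][0]))  # top-to-bottom, left-to-right
        out[field] = " ".join(t for _, t in parts) if parts else None
    return out
```

A multi-line field thus collects all of its candidate boxes into one concatenated string, while stray boxes with low overlap are dropped.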
Denoting the template value box by $tv$ and the value candidate box by $v$, the overlap degree $\mathrm{overlap\_ratio}$ is computed as:

$$\mathrm{overlap\_ratio} = \frac{\mathrm{area}(\mathrm{intersection}(tv, v))}{\mathrm{area}(v)}$$
the homography matrix is explained as follows:
The coordinates of the key text boxes labeled on the template picture and of the matched key text boxes in the picture to be recognized are obtained, and the corresponding homography matrix is established from the correspondence between these coordinates. The coordinates of the values in the picture to be recognized are then transformed into aligned coordinates by this homography matrix, yielding the aligned values of the picture to be recognized.
Specifically, for the text box corresponding to each key, the four vertex coordinates of the key text box labeled on the template picture and the four vertex coordinates of the matched key text box in the picture to be recognized are obtained, and a homography matrix is established from the correspondence between the two sets of four vertex coordinates:
$$x' = \frac{h_{11} x + h_{12} y + h_{13}}{h_{31} x + h_{32} y + 1}, \qquad y' = \frac{h_{21} x + h_{22} y + h_{23}}{h_{31} x + h_{32} y + 1} \quad (1)$$
wherein (x1, y1), (x2, y2), (x3, y3), (x4, y4) are the four vertex coordinates of the text box of the current key detected in the picture to be recognized, (x'1, y'1), (x'2, y'2), (x'3, y'3), (x'4, y'4) are the four vertex coordinates of the text box of the current key labeled on the template picture, and h11, h12, h13, h21, h22, h23, h31, h32 are the unknown parameters to be solved (h33 is normalized to 1). Substituting the corresponding coordinate values into formula (1) yields the 8 unknown parameters of the homography matrix; the solved homography matrix is then fed into an image transformation model to obtain the aligned keys of the picture to be recognized. The image transformation model of this embodiment is a perspective transformation model. The values of the matched picture to be recognized are aligned in the same way and are not described again here.
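Formula (1)'s eight unknowns can be found by rewriting each point pair as two linear equations and solving the resulting system. A minimal numpy sketch (a least-squares solve; the function names are illustrative, not from the patent):

```python
import numpy as np

def solve_homography(src_pts, dst_pts):
    """Solve h11..h32 (with h33 fixed to 1) from >= 4 point pairs.

    Each pair (x, y) -> (x', y') contributes two linear equations:
        h11*x + h12*y + h13 - h31*x*x' - h32*y*x' = x'
        h21*x + h22*y + h23 - h31*x*y' - h32*y*y' = y'
    With 4 pairs the system is exactly determined; with more pairs it is
    solved in the least-squares sense.
    """
    A, b = [], []
    for (x, y), (xp, yp) in zip(src_pts, dst_pts):
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp]); b.append(xp)
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp]); b.append(yp)
    h = np.linalg.lstsq(np.array(A, float), np.array(b, float), rcond=None)[0]
    return np.append(h, 1.0).reshape(3, 3)

def transform(H, pt):
    """Apply the perspective transform to a single point."""
    v = H @ np.array([pt[0], pt[1], 1.0])
    return (v[0] / v[2], v[1] / v[2])
```

The same matrix is applied to every value candidate box vertex to map it onto the aligned picture.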
In a real application scenario, each group of corresponding coordinate points (e.g. (x1, y1) and (x'1, y'1) above form one group, hereinafter called a point pair) contains noise: a coordinate point's position may deviate by several pixels, and feature points may even be mismatched. If only four point pairs are used to compute the homography matrix, large errors can occur; so, to make the computation more accurate, far more than four points are commonly used.
In the above embodiment, the homography matrix is computed using the four vertices of all key text boxes, with the RANSAC method, whose steps are:
(1) Randomly select 4 pairs of matching feature points from the initial matching point-pair set S as the inlier set Si, and estimate an initial homography matrix Hi;
(2) Evaluate the remaining matching point pairs in S with Hi; if the projection error of a feature point is smaller than a threshold t, add it to Si;
(3) Record the number of matching point pairs in the set Si;
(4) Repeat steps (1) to (3) until the number of iterations exceeds K;
(5) Compare the iterations to find the one with the largest number of point pairs; the model estimated from the largest consensus set is the homography matrix sought.
The above embodiments only express specific implementations of the invention; their description is relatively specific and detailed, but should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the inventive concept, and these fall within the protection scope of the invention.

Claims (9)

1. A structured information extraction method for cigarette logistics receipt bills is characterized by comprising the following steps:
pre-labeling step: setting a template picture standard for the bill, selecting a standard template picture, and labeling keys and values on the template picture, wherein a key is a fixed, unchanging keyword in the bill and a value is variable content in the bill;
identification step: determining a picture to be recognized, matching the keys in the picture to be recognized with the keys of the template picture, setting the text boxes other than keys in the picture to be recognized as value candidate boxes, aligning the picture to be recognized with the template picture according to the correspondence between the keys, performing misalignment correction on the value candidate boxes, and extracting structured information from the content near the template picture's value text boxes.
2. The structured information extraction method for cigarette logistics receipt bills according to claim 1, wherein in the pre-labeling step, the template picture standard for the bill is a flat, non-inclined, printed and non-misaligned picture.
3. The structured information extraction method for cigarette logistics receipt bills according to claim 2, wherein in the pre-labeling step, the step of labeling a key is as follows:
performing rectangular-box labeling and text-content labeling on the template picture, and setting a tight rectangular box as the keyword region.
4. The structured information extraction method for cigarette logistics receipt bills according to claim 3, wherein in the pre-labeling step, the step of labeling a value is as follows:
performing rectangular-box labeling and field-name labeling on the fields to be recognized in the template picture other than the labeled keys.
5. The structured information extraction method for cigarette logistics receipt bills according to claim 1, 2, 3 or 4, wherein the recognition step further comprises:
detecting and recognizing all text boxes and text content in the picture to be recognized by OCR.
6. The structured information extraction method for cigarette logistics receipt bills according to claim 5, wherein the recognition step further comprises:
matching all the obtained text boxes and text content in the picture to be recognized by keyword, and judging whether each text box belongs to a key text box of the template picture; if so, associating the key in the picture to be recognized with the key of the template picture to form a group of key correspondences; if not, taking the text box and its text content as a value candidate box; if not a single group of key correspondences exists, the current picture cannot be recognized.
7. The structured information extraction method for cigarette logistics receipt bills according to claim 6, wherein the recognition step further comprises:
aligning the picture to be recognized with the template picture according to the correspondence between the keys: extracting the 4 vertices of each text box, and establishing the correspondence of 4 groups of vertices for each group of keys corresponding between the picture to be recognized and the template picture;
when N groups of key correspondences exist, establishing correspondences of N×4 groups of vertex coordinates, calculating a homography matrix from the correspondences between the vertex coordinates, and aligning the picture to be recognized with the template picture through perspective transformation.
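The alignment described in this claim can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the dictionary layout mapping each key's text to its 4 box corners is an assumption, and a plain least-squares DLT fit stands in for the RANSAC estimation described in the specification.

```python
import numpy as np

def collect_correspondences(keys_img, keys_tpl):
    """Stack the 4 corner vertices of every matched key box into
    N*4 source/target point arrays. keys_img / keys_tpl map a key's
    text to its 4 corners (an assumed data layout)."""
    src, dst = [], []
    for k, corners in keys_img.items():
        if k in keys_tpl:
            src.extend(corners)
            dst.extend(keys_tpl[k])
    return np.asarray(src, float), np.asarray(dst, float)

def fit_homography(src, dst):
    """Least-squares DLT over all point pairs (the specification uses
    RANSAC; this all-inlier variant is shown for simplicity)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def warp_points(H, pts):
    """Apply the perspective transform to an (M, 2) array of points."""
    p = np.hstack([pts, np.ones((len(pts), 1))])
    q = (H @ p.T).T
    return q[:, :2] / q[:, 2:3]
```

In practice the estimated H would be passed to an image-warping routine (e.g. a perspective warp) to resample the picture to be recognized into the template's coordinate frame; `warp_points` shows the same transform applied to box coordinates.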
8. The structured information extraction method for cigarette logistics receipt bills according to claim 7, wherein the recognition step further comprises:
translating all the value candidate boxes at least once according to a preset rule, calculating the degree of alignment between the value candidate boxes and the template value boxes for each displacement, selecting the displacement with the highest degree of alignment as the final dislocation displacement, and performing dislocation correction on all the value candidate boxes according to this displacement, so as to obtain the dislocation-corrected value boxes of the picture to be recognized and their contents.
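One way the displacement search of this claim could look in code. This is illustrative only: the claim does not fix the alignment metric, so the summed best-IoU score below is an assumption, as are the function and parameter names and the (x1, y1, x2, y2) box format.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def best_shift(value_boxes, template_boxes, shifts):
    """Try each candidate displacement, score how well the shifted
    value candidate boxes align with the template value boxes, and
    return the displacement with the highest alignment score."""
    best, best_score = (0, 0), -1.0
    for dx, dy in shifts:
        score = 0.0
        for x1, y1, x2, y2 in value_boxes:
            moved = (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
            # Each box contributes its overlap with its best template match.
            score += max(iou(moved, t) for t in template_boxes)
        if score > best_score:
            best, best_score = (dx, dy), score
    return best
```

The returned displacement is then applied to every value candidate box, which is the "dislocation correction" of the claim.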
9. The structured information extraction method for cigarette logistics receipt bills according to claim 8, wherein the recognition step further comprises:
extracting structured information near the template picture value text boxes: for each value candidate box, finding the template picture value text box with the largest overlap area; if the overlap of the two is larger than a set threshold, associating the value candidate box with the corresponding template picture value text box, otherwise ignoring the value candidate box;
after the association of all value candidate boxes is completed, the value candidate boxes associated with each template picture value text box constitute the content corresponding to that value field; concatenating their text content yields the extracted structured information of the field; if a template picture value text box has no associated value candidate box, the field cannot be recognized.
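The overlap association of this claim might be sketched as follows. Illustrative only: measuring overlap as a fraction of the candidate box's own area, the left-to-right concatenation order, the box format and all names are assumptions not fixed by the claim.

```python
def overlap_area(a, b):
    """Intersection area of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0, ix2 - ix1) * max(0, iy2 - iy1)

def associate(candidates, template_fields, threshold=0.5):
    """candidates: list of (box, text); template_fields: {field: box}.
    Associates each candidate with the template value box it overlaps
    most, keeps it only if the overlap exceeds the threshold, then
    concatenates the surviving texts per field from left to right."""
    assigned = {f: [] for f in template_fields}
    for box, text in candidates:
        best_f, best_ov = None, 0.0
        for f, tbox in template_fields.items():
            ov = overlap_area(box, tbox)
            if ov > best_ov:
                best_f, best_ov = f, ov
        area = (box[2] - box[0]) * (box[3] - box[1])
        if best_f is not None and area > 0 and best_ov / area > threshold:
            assigned[best_f].append((box[0], text))  # sort key: left edge
    # A field with no associated candidates yields an empty string,
    # i.e. the "cannot be recognized" case of the claim.
    return {f: "".join(t for _, t in sorted(parts))
            for f, parts in assigned.items()}
```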
CN202211442689.4A 2022-11-18 2022-11-18 Structured information extraction method for cigarette logistics receiving bill Active CN115497114B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211442689.4A CN115497114B (en) 2022-11-18 2022-11-18 Structured information extraction method for cigarette logistics receiving bill

Publications (2)

Publication Number Publication Date
CN115497114A true CN115497114A (en) 2022-12-20
CN115497114B CN115497114B (en) 2024-03-12

Family

ID=85116135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211442689.4A Active CN115497114B (en) 2022-11-18 2022-11-18 Structured information extraction method for cigarette logistics receiving bill

Country Status (1)

Country Link
CN (1) CN115497114B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6539112B1 (en) * 1999-02-26 2003-03-25 Raf Technology, Inc. Methods and system for identifying a reference region on an image of a dropped-out form
US8724907B1 (en) * 2012-03-28 2014-05-13 Emc Corporation Method and system for using OCR data for grouping and classifying documents
CN111241974A (en) * 2020-01-07 2020-06-05 深圳追一科技有限公司 Bill information acquisition method and device, computer equipment and storage medium
US20210383107A1 (en) * 2020-06-09 2021-12-09 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, device and storage medium for recognizing bill image
CN111931784A (en) * 2020-09-17 2020-11-13 深圳壹账通智能科技有限公司 Bill recognition method, system, computer device and computer-readable storage medium
WO2022057471A1 (en) * 2020-09-17 2022-03-24 深圳壹账通智能科技有限公司 Bill identification method, system, computer device, and computer-readable storage medium
CN112699867A (en) * 2020-09-27 2021-04-23 民生科技有限责任公司 Fixed format target image element information extraction method and system
CN112613367A (en) * 2020-12-14 2021-04-06 盈科票据服务(深圳)有限公司 Bill information text box acquisition method, system, equipment and storage medium
CN112861782A (en) * 2021-03-07 2021-05-28 上海大学 Bill photo key information extraction system and method
CN115147855A (en) * 2021-03-30 2022-10-04 上海聚均科技有限公司 Method and system for carrying out batch OCR (optical character recognition) on bills
CN113158895A (en) * 2021-04-20 2021-07-23 北京中科江南信息技术股份有限公司 Bill identification method and device, electronic equipment and storage medium
CN113191348A (en) * 2021-05-31 2021-07-30 山东新一代信息产业技术研究院有限公司 Template-based text structured extraction method and tool
CN113657377A (en) * 2021-07-22 2021-11-16 西南财经大学 Structured recognition method for airplane ticket printing data image
CN113903024A (en) * 2021-09-28 2022-01-07 合肥高维数据技术有限公司 Handwritten bill numerical value information identification method, system, medium and device
CN115063784A (en) * 2022-06-08 2022-09-16 杭州未名信科科技有限公司 Bill image information extraction method and device, storage medium and electronic equipment
CN115240178A (en) * 2022-06-24 2022-10-25 深源恒际科技有限公司 Structured information extraction method and system for bill image

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116382696A (en) * 2023-03-18 2023-07-04 宝钢工程技术集团有限公司 Engineering attribute dynamic analysis and submission method based on factory object position number
CN116382696B (en) * 2023-03-18 2024-06-07 宝钢工程技术集团有限公司 Engineering attribute dynamic analysis and submission method based on factory object position number

Also Published As

Publication number Publication date
CN115497114B (en) 2024-03-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant