CN115497114A - Structured information extraction method for cigarette logistics receipt bill - Google Patents


Info

Publication number
CN115497114A
Authority
CN
China
Prior art keywords
picture
value
template
text
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211442689.4A
Other languages
Chinese (zh)
Other versions
CN115497114B (en)
Inventor
曾华
徐伟
刘永海
朱小晓
胡晓峰
李涛
周幸
曾鹏程
李�瑞
廖健
王静雅
付雯
龙涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Aimo Technology Co ltd
China National Tobacco Corp Sichuan Branch
Original Assignee
Shenzhen Aimo Technology Co ltd
China National Tobacco Corp Sichuan Branch
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Aimo Technology Co ltd, China National Tobacco Corp Sichuan Branch filed Critical Shenzhen Aimo Technology Co ltd
Priority to CN202211442689.4A
Publication of CN115497114A
Application granted
Publication of CN115497114B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/414Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/12Detection or correction of errors, e.g. by rescanning the pattern
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/418Document matching, e.g. of document images
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a structured information extraction method for cigarette logistics receipt bills, comprising a pre-labeling step and an identification step. Pre-labeling step: set a template picture standard for the bill, select a standard template picture, and label keys and values on the template picture, where a key is a fixed, unchanging keyword in the bill and a value is variable content in the bill. Identification step: determine a picture to be recognized, match the keys in the picture to be recognized against the keys of the template picture, set the text boxes other than keys in the picture to be recognized as value candidate boxes, align the picture to be recognized with the template picture according to the key correspondences, perform misalignment correction on the value candidate boxes, and extract structured information from the content near the template picture's value text boxes. The method is based on structured information extraction via template alignment and misalignment correction: only one template picture needs to be labeled per bill type, so the method is flexible and widely applicable, handles printing misalignment, and achieves high recognition accuracy.

Description

Structured information extraction method for cigarette logistics receipt bill
Technical Field
The invention relates to the field of logistics, in particular to a structured information extraction method for cigarette logistics receipt bills.
Background
In a tobacco logistics scenario, the receiver must confirm the shipper's information: the information on the bill needs to be checked against the information recorded in the system. Manual review is time-consuming and error-prone; an alternative is to automatically extract the structured information on the bill (dates, numbers, etc.) with an image recognition algorithm and compare it with the structured information recorded in the system.
At present there are two main approaches to extracting structured information from bills. One post-processes the OCR (optical character recognition) result with rules such as regular-expression matching; this is flexible but has low accuracy and, in particular, cannot handle printing misalignment. The other uses deep learning to detect the position of each field and then runs OCR on that field; this is accurate, but a large amount of data must be collected, labeled, and trained for every bill type, so it is inflexible and has limited applicability.
Disclosure of Invention
Based on the above problems, the invention provides a structured information extraction method for cigarette logistics receipt bills based on template alignment and misalignment correction. Only one template picture needs to be labeled per bill type, so the method is flexible and widely applicable; it handles printing misalignment and achieves high recognition accuracy.
The technical scheme of the invention is as follows:
a structured information extraction method for cigarette logistics receipt bills is characterized by comprising the following steps:
pre-labeling step: setting a template picture standard for the bill, selecting a standard template picture, and labeling keys and values on the template picture, wherein a key is a fixed, unchanging keyword in the bill and a value is variable content in the bill;
identification step: determining a picture to be recognized, matching the keys in the picture to be recognized with the keys of the template picture, setting the text boxes other than keys in the picture to be recognized as value candidate boxes, aligning the picture to be recognized with the template picture according to the correspondence between the keys, performing misalignment correction on the value candidate boxes, and extracting structured information from the content near the template picture's value text boxes.
The idea of the technical scheme is as follows:
A bill consists of two parts: fixed keywords, called keys (e.g. "Name"), and variable content, called values (e.g. "Zhang San"). Each bill type follows a specific layout: the key content never changes and the key positions align exactly, while the value content varies (even its length and line count can change), its position fluctuating only around a preset location.
Based on this characteristic of bills, the invention selects one standard template picture per bill type and labels its keys and values. The keys in the picture to be recognized are matched and associated with the template keys, and the template is aligned via perspective transformation, so that the corresponding structured information can be extracted near the preset value-box positions. Furthermore, to counter the interference caused by value printing misalignment, after key-based template alignment a misalignment correction is performed according to the degree of alignment between the value boxes of the picture to be recognized and the template value boxes, which greatly improves the accuracy of structured information extraction.
In the pre-labeling step, the template picture of the bill is a flat picture with no skew and no printing misalignment.
In the pre-labeling step, the step of labeling the key is as follows:
Rectangular-box labeling and text-content labeling are performed on the template picture; each tight rectangular box encloses only the keyword area.
In the pre-labeling step, the step of labeling value is as follows:
Rectangular-box labeling and field-name labeling are performed on the fields to be recognized in the template picture, apart from the labeled keys.
In the identification step, the method further comprises the following steps:
All text boxes and their text content in the picture to be recognized are detected and recognized by OCR.
In the identification step, the method further comprises the following steps:
All text boxes and text content obtained from the picture to be recognized are matched by keyword to determine whether each box belongs to a key text box of the template picture. If it does, the key in the picture to be recognized is associated with the template picture's key, forming a group of key correspondences; if not, the text box and its content are kept as a value candidate box. If not a single group of key correspondences exists, the current picture cannot be recognized.
In the identification step, the method further comprises the following steps:
The picture to be recognized is aligned with the template picture according to the correspondence between keys: the 4 vertices of each text box are extracted, and 4 groups of vertex correspondences are established for each key correspondence between the picture to be recognized and the template picture. With N groups of key correspondences, N×4 groups of vertex-coordinate correspondences are established; a homography matrix is computed from these correspondences, and the picture to be recognized is aligned with the template picture by perspective transformation.
In the identification step, the method further comprises the following steps:
All value candidate boxes are translated at least once according to a preset rule; for each displacement, the degree of alignment between the value candidate boxes and the template value boxes is computed. The displacement with the highest alignment degree is selected as the final misalignment displacement, and all value candidate boxes are corrected by that displacement, yielding the misalignment-corrected value boxes of the picture to be recognized and their contents.
In the identification step, the method further comprises the following steps:
Structured information is extracted near the template picture's value text boxes: for each value candidate box, the template value text box with the largest overlap area is found; if the overlap degree of the two exceeds a set threshold, the value candidate box is associated with that template value text box, otherwise the candidate box is ignored;
after all value candidate boxes have been associated, the candidate boxes associated with each template value text box are the content of that value field, and their text content is concatenated to obtain the extracted structured information of the field. If a template value text box has no associated value candidate box, that field cannot be identified.
Beneficial effects of the invention:
1. structured information extraction based on template alignment and misalignment correction requires only one labeled template picture per bill type; the method is flexible, widely applicable, handles printing misalignment, and achieves high recognition accuracy;
2. one standard template picture is selected per bill type to label keys and values; the keys in the picture to be recognized are matched and associated with the template keys, and the template is aligned via perspective transformation, so that the corresponding structured information can be extracted near the preset value-box positions;
3. to counter the interference caused by value printing misalignment, after key-based template alignment a misalignment correction is performed according to the degree of alignment between the values of the picture to be recognized and the template values, greatly improving the accuracy of structured information extraction.
Detailed Description
The following provides a detailed description of embodiments of the invention.
Embodiment:
a structured information extraction method for cigarette logistics receipt bills comprises the following steps:
pre-labeling step: select a standard template picture and label keys and values on it;
identification step: perform OCR detection and recognition on the picture to be recognized; associate the text boxes in the OCR result that match keys with the template keys, keeping the remaining text boxes as value candidate boxes; align the picture to be recognized with the template picture according to the key correspondences; perform misalignment correction on the value candidate boxes; and extract structured information near the template's preset value boxes.
The idea of the above embodiment is as follows:
A bill consists of two parts: fixed keywords, called keys (e.g. "Name"), and variable content, called values (e.g. "Zhang San"). Each bill type follows a specific layout: the key content never changes and the key positions align exactly, while the value content varies (even its length and line count can change), its position fluctuating only around a preset location.
Based on this characteristic of bills, the invention selects one standard template picture per bill type and labels its keys and values. The keys in the picture to be recognized are matched and associated with the template keys, and the template is aligned via perspective transformation, so that the corresponding structured information can be extracted near the preset value-box positions. Furthermore, to counter the interference caused by value printing misalignment, after key-based template alignment a misalignment correction is performed according to the degree of alignment between the value boxes of the picture to be recognized and the template value boxes, which greatly improves the accuracy of structured information extraction.
In the pre-labeling step, the selected standard template picture is a flat picture with no skew and no printing misalignment.
In the pre-labeling step, keys are the fixed, unchanging keywords in the bill; each is given a rectangular-box label (the tight rectangular box encloses only the keyword area) and a text-content label. Values are the variable content in the bill; not all content is labeled, only the fields that need to be recognized, and each such field is given a rectangular-box label and a field-name label.
In the identification step, all text boxes and text content in the picture are obtained by OCR detection and recognition.
In the identification step, keyword matching determines whether each OCR text box belongs to a key text box. If so, it is associated with the corresponding template key, forming a group of key correspondences; if not, it is kept as a value candidate box. If not a single group of key correspondences exists, the current picture cannot be recognized.
In the identification step, the picture to be recognized is aligned with the template picture according to the correspondence between keys: the 4 vertices of each text box are extracted, and 4 groups of vertex correspondences are established from each key text box and its template key; with N groups of key correspondences, N×4 groups of vertex-coordinate correspondences are established. A homography matrix is computed from the vertex-coordinate correspondences, the picture to be recognized is aligned with the template picture by perspective transformation, and the positions of the value candidate boxes are mapped onto the aligned picture by the same transformation matrix.
In the identification step, misalignment correction is performed on the value candidate boxes. After template alignment, if there is no printing misalignment the value candidate boxes fall inside the template value boxes; if there is printing misalignment, the value candidate boxes are offset and poorly aligned with the template value boxes, falling on their edges or outside them.
All value candidate boxes are translated multiple times within a certain range: with the original position as the center, the boxes are translated up, down, left, and right within a radius of 50 pixels in both the x and y directions, in steps of 10 pixels, for a total of (50/10 × 2 + 1)² = 121 translations. For each displacement, the degree of alignment between the value candidate boxes and the template value boxes is computed; the displacement with the highest alignment degree is selected as the final misalignment displacement, and all value candidate boxes are corrected by that displacement.
Let the i-th template value box be $tv_i$, with $n$ template value boxes in total, and let the j-th value candidate box be $v_j$, with $m$ candidate boxes in total. Let $\mathrm{intersection}$ denote box intersection, $\mathrm{area}$ denote area computation, and $\mathrm{bin}$ denote a binarization function (1 if the condition is satisfied, 0 otherwise). The alignment degree $\mathrm{alignment\_ratio}$ is then computed as:

$$r_{ij} = \frac{\mathrm{area}(\mathrm{intersection}(tv_i, v_j))}{\mathrm{area}(v_j)}$$

$$r_j = \max_{1 \le i \le n} r_{ij}$$

$$a_j = \mathrm{bin}(r_j > t)$$

$$\mathrm{alignment\_ratio} = \frac{1}{m} \sum_{j=1}^{m} a_j$$

where $t$ is a set threshold.
In the identification step, structured information is extracted near the value boxes preset by the template. For each value candidate box, the template value box with the largest overlap area is found; if the overlap degree of the two exceeds a set threshold, the candidate box is associated with that template value box, otherwise it is ignored. After all value candidate boxes have been associated, the candidate boxes associated with each template value box together form the content of that value field, and their text content is concatenated to obtain the extracted structured information of the field; if a template value box has no associated candidate box, the field cannot be identified. Denoting the template value box by $tv$ and the value candidate box by $v$, the overlap degree $\mathrm{overlap\_ratio}$ is computed as:

$$\mathrm{overlap\_ratio} = \frac{\mathrm{area}(\mathrm{intersection}(tv, v))}{\mathrm{area}(v)}$$
The method extracts structured information based on template alignment and misalignment correction; only one template picture needs to be labeled per bill type, so the method is flexible and widely applicable, handles printing misalignment, and achieves high recognition accuracy. A standard template picture is selected per bill type to label keys and values; the keys in the picture to be recognized are matched and associated with the template keys, and the template is aligned via perspective transformation, so that the corresponding structured information can be extracted near the preset value-box positions. To counter the interference caused by value printing misalignment, after key-based template alignment a misalignment correction is performed according to the degree of alignment between the values of the picture to be recognized and the template values, greatly improving the accuracy of structured information extraction.
The steps in the pre-labeling stage are as follows:
1. and selecting a standard template picture.
The stencil image should be as flat as possible without skewing and without misalignment of the printing.
2. Label keys and values on the template picture.
A key is a fixed, unchanging keyword in the bill; it is given a rectangular-box label (the tight rectangular box encloses only the keyword area) and a text-content label.
A value is variable content in the bill; not all content is labeled, only the fields that need to be recognized. Each such field is given a rectangular-box label (a wide rectangular box that covers the full range of positions where the field content may appear) and a field-name label (keys and values are not in one-to-one correspondence, and some values have no key at all, so the field corresponding to a value is specified directly).
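The labeling described above can be captured in a simple data structure. The sketch below is a minimal illustration in Python; the schema and names (`keys`, `values`, `box`, `field`) are assumptions for illustration, not a format specified by the invention:

```python
# One possible template annotation format. Boxes are (x1, y1, x2, y2)
# in template-image pixel coordinates.
template = {
    "keys": [
        # tight boxes around fixed keywords, with their exact text
        {"box": (40, 30, 120, 60), "text": "Name"},
        {"box": (40, 90, 140, 120), "text": "Order No."},
    ],
    "values": [
        # wide boxes covering every position where the field content may
        # appear, labeled with a field name (not necessarily paired to a key)
        {"box": (150, 25, 500, 65), "field": "consignee_name"},
        {"box": (150, 85, 500, 125), "field": "order_number"},
    ],
}

def field_names(tpl):
    """List the fields this template can extract."""
    return [v["field"] for v in tpl["values"]]
```

A template annotated this way is all the per-bill-type setup the method requires.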
Building on the embodiment above, further details are given below.
The steps in the identification phase are as follows:
1. and (3) detecting and identifying the picture to be identified by ocr.
All text boxes and text contents in the picture are detected and identified through ocr.
2. Associate the key text boxes matched from the OCR result with the template keys; the remaining text boxes are value candidate boxes.
Keyword matching determines whether each OCR text box belongs to a key text box; if so, it is associated with the corresponding template key, forming a group of key correspondences; otherwise it is kept as a value candidate box.
If not a single group of key correspondences exists, the current picture cannot be recognized; otherwise step 3 is executed.
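As a sketch, the matching in step 2 might look like the following; exact text equality is assumed here as the keyword-matching criterion, since the patent does not fix a particular matching rule:

```python
def match_keys(ocr_results, template_keys):
    """Split OCR text boxes into key correspondences and value candidates.

    ocr_results:   list of (box, text) pairs from OCR detection/recognition.
    template_keys: list of (box, text) pairs labeled on the template picture.
    Returns (correspondences, value_candidates). Each correspondence pairs an
    OCR key box with its template key box; everything else becomes a value
    candidate box. If correspondences is empty, the picture cannot be
    recognized.
    """
    key_boxes = {text: box for box, text in template_keys}
    correspondences, value_candidates = [], []
    for box, text in ocr_results:
        if text in key_boxes:
            correspondences.append((box, key_boxes[text]))
        else:
            value_candidates.append((box, text))
    return correspondences, value_candidates
```

In practice a fuzzy string comparison could replace the exact-equality test to tolerate OCR errors in the keywords.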
3. Align the picture to be recognized with the template picture according to the correspondence between keys.
The 4 vertices of each text box are extracted, and 4 groups of vertex correspondences are established from each key text box and its template key; with N groups of key correspondences, N×4 groups of vertex-coordinate correspondences are established. A homography matrix is computed from the vertex-coordinate correspondences, the picture to be recognized is aligned with the template picture by perspective transformation, and the positions of the value candidate boxes are mapped onto the aligned picture by the same transformation matrix.
The principles of homography matrices and perspective transformation are prior art and are not described in detail here.
4. Perform misalignment correction on the value candidate boxes.
After template alignment, if there is no printing misalignment the value candidate boxes should fall essentially inside the template value boxes; conversely, if there is printing misalignment, the value candidate boxes are offset to some extent and poorly aligned with the template value boxes (falling on their edges or outside them).
All value candidate boxes are translated multiple times within a certain range (for example, with the original position as the center, translated up, down, left, and right within a radius of 50 pixels in both the x and y directions, in steps of 10 pixels, for a total of (50/10 × 2 + 1)² = 121 translations). For each displacement, the degree of alignment between the value candidate boxes and the template value boxes is computed; the displacement with the highest alignment degree is selected as the final misalignment displacement, and all value candidate boxes are corrected by that displacement.
Let the i-th template value box be $tv_i$, with $n$ template value boxes in total, and let the j-th value candidate box be $v_j$, with $m$ candidate boxes in total. Let $\mathrm{intersection}$ denote box intersection, $\mathrm{area}$ denote area computation, and $\mathrm{bin}$ denote a binarization function (1 if the condition is satisfied, 0 otherwise). The alignment degree $\mathrm{alignment\_ratio}$ is then computed as:

$$r_{ij} = \frac{\mathrm{area}(\mathrm{intersection}(tv_i, v_j))}{\mathrm{area}(v_j)}$$

$$r_j = \max_{1 \le i \le n} r_{ij}$$

$$a_j = \mathrm{bin}(r_j > t)$$

$$\mathrm{alignment\_ratio} = \frac{1}{m} \sum_{j=1}^{m} a_j$$

where $t$ is a set threshold.
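The displacement search and alignment degree described above can be sketched in Python as follows. Boxes are (x1, y1, x2, y2) tuples; the threshold `t` inside the `bin` step is an assumed value, since the original equation images are not reproduced here:

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); 0 if degenerate."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def intersection(a, b):
    """Intersection box of a and b (degenerate if they do not overlap)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def alignment_ratio(template_boxes, candidate_boxes, t=0.5):
    """Fraction of candidate boxes whose best overlap with any template
    value box exceeds the threshold t (the bin(...) step)."""
    m = len(candidate_boxes)
    if m == 0:
        return 0.0
    aligned = 0
    for v in candidate_boxes:
        av = area(v)
        if av == 0:
            continue
        best = max(area(intersection(tv, v)) / av for tv in template_boxes)
        if best > t:  # bin: 1 if the condition is satisfied, else 0
            aligned += 1
    return aligned / m

def shift(box, dx, dy):
    """Translate a box by (dx, dy)."""
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)

def best_shift(template_boxes, candidate_boxes, radius=50, step=10):
    """Try all (2 * radius / step + 1)**2 = 121 displacements and return
    the one with the highest alignment ratio."""
    best, best_score = (0, 0), -1.0
    for dx in range(-radius, radius + 1, step):
        for dy in range(-radius, radius + 1, step):
            shifted = [shift(v, dx, dy) for v in candidate_boxes]
            score = alignment_ratio(template_boxes, shifted)
            if score > best_score:
                best_score, best = score, (dx, dy)
    return best
```

Applying the returned displacement to every value candidate box yields the misalignment-corrected boxes used in step 5.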
5. Extract structured information near the value boxes preset by the template.
For each value candidate box, the template value box with the largest overlap area is found; if the overlap degree of the two exceeds a set threshold (e.g. 0.6), the candidate box is associated with that template value box, otherwise it is ignored. After all value candidate boxes have been associated, the candidate boxes associated with each template value box together form the content of that value field, and their text content is concatenated to form the extracted structured information of the field. If a template value box has no associated candidate box, the field cannot be identified.
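Step 5 can be sketched as follows; the joining order and separator used for concatenation are assumptions (here candidate boxes are concatenated top-to-bottom, left-to-right with spaces):

```python
def area(box):
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def inter_area(a, b):
    """Area of the intersection of two boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))

def overlap_ratio(tv, v):
    """Overlap degree: area(intersection(tv, v)) / area(v)."""
    return inter_area(tv, v) / area(v) if area(v) else 0.0

def extract_fields(template_values, candidates, threshold=0.6):
    """template_values: list of (box, field_name) labeled on the template.
    candidates: list of (box, text), already aligned and shift-corrected.
    Returns {field_name: text or None}; None means the field was not found."""
    assigned = {field: [] for _, field in template_values}
    for v_box, text in candidates:
        # template value box with the largest overlap area with this candidate
        tv_box, field = max(template_values,
                            key=lambda tvf: inter_area(tvf[0], v_box))
        if overlap_ratio(tv_box, v_box) > threshold:
            assigned[field].append((v_box, text))
    out = {}
    for field, parts in assigned.items():
        parts.sort(key=lambda p: (p[0][1], p[0][0]))  # top-to-bottom, left-to-right
        out[field] = " ".join(t for _, t in parts) if parts else None
    return out
```

A multi-line field thus collects all of its candidate boxes into one concatenated string, while stray boxes with low overlap are dropped.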
Denoting the template value box by $tv$ and the value candidate box by $v$, the overlap degree $\mathrm{overlap\_ratio}$ is computed as:

$$\mathrm{overlap\_ratio} = \frac{\mathrm{area}(\mathrm{intersection}(tv, v))}{\mathrm{area}(v)}$$
the homography matrix is explained as follows:
The coordinates of the key text boxes labeled on the template picture and of the matched key text boxes in the picture to be recognized are obtained, and the corresponding homography matrix is established from the correspondence between these coordinates. The coordinates of the values in the picture to be recognized are then transformed into aligned coordinates by this homography matrix, yielding the aligned values of the picture to be recognized.
Specifically, for the text box corresponding to each key, the four vertex coordinates of the key text box labeled on the template picture and the four vertex coordinates of the matched key text box in the picture to be recognized are obtained, and a homography matrix is established from the correspondence between the two sets of four vertex coordinates:
$$x' = \frac{h_{11} x + h_{12} y + h_{13}}{h_{31} x + h_{32} y + 1}, \qquad y' = \frac{h_{21} x + h_{22} y + h_{23}}{h_{31} x + h_{32} y + 1} \quad (1)$$
wherein (x1, y1), (x2, y2), (x3, y3), (x4, y4) are the four vertex coordinates of the text box of the current key detected in the picture to be recognized, (x'1, y'1), (x'2, y'2), (x'3, y'3), (x'4, y'4) are the four vertex coordinates of the text box of the current key labeled on the template picture, and h11, h12, h13, h21, h22, h23, h31, h32 are the unknown parameters to be solved (h33 is normalized to 1). Substituting the corresponding coordinate values into formula (1) yields the 8 unknown parameters of the homography matrix; the solved homography matrix is then fed into an image transformation model to obtain the aligned keys of the picture to be recognized. The image transformation model of this embodiment is a perspective transformation model. The values of the matched picture to be recognized are aligned in the same way and are not described again here.
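Formula (1)'s eight unknowns can be found by rewriting each point pair as two linear equations and solving the resulting system. A minimal numpy sketch (a least-squares solve; the function names are illustrative, not from the patent):

```python
import numpy as np

def solve_homography(src_pts, dst_pts):
    """Solve h11..h32 (with h33 fixed to 1) from >= 4 point pairs.

    Each pair (x, y) -> (x', y') contributes two linear equations:
        h11*x + h12*y + h13 - h31*x*x' - h32*y*x' = x'
        h21*x + h22*y + h23 - h31*x*y' - h32*y*y' = y'
    With 4 pairs the system is exactly determined; with more pairs it is
    solved in the least-squares sense.
    """
    A, b = [], []
    for (x, y), (xp, yp) in zip(src_pts, dst_pts):
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp]); b.append(xp)
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp]); b.append(yp)
    h = np.linalg.lstsq(np.array(A, float), np.array(b, float), rcond=None)[0]
    return np.append(h, 1.0).reshape(3, 3)

def transform(H, pt):
    """Apply the perspective transform to a single point."""
    v = H @ np.array([pt[0], pt[1], 1.0])
    return (v[0] / v[2], v[1] / v[2])
```

The same matrix is applied to every value candidate box vertex to map it onto the aligned picture.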
In a real application scenario, each group of corresponding coordinate points (e.g. (x1, y1) and (x'1, y'1) above form one group, hereinafter called a point pair) contains noise: a coordinate point's position may deviate by several pixels, and feature points may even be mismatched. If only four point pairs are used to compute the homography matrix, large errors can occur; so, to make the computation more accurate, far more than four points are commonly used.
In the above embodiment, the homography matrix is computed using the four vertices of all key text boxes, with the RANSAC method, whose steps are:
(1) Randomly select 4 pairs of matching feature points from the initial matching point-pair set S as the inlier set Si, and estimate an initial homography matrix Hi;
(2) Evaluate the remaining matching point pairs in S with Hi; if the projection error of a feature point is smaller than a threshold t, add it to Si;
(3) Record the number of matching point pairs in the set Si;
(4) Repeat steps (1) to (3) until the number of iterations exceeds K;
(5) Compare the iterations to find the one with the largest number of point pairs; the model estimated from the largest consensus set is the homography matrix sought.
The above embodiments only express specific implementations of the invention; their description is relatively specific and detailed, but should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the inventive concept, and these fall within the protection scope of the invention.

Claims (9)

1. A structured information extraction method for cigarette logistics receipt bills is characterized by comprising the following steps:
pre-labeling step: setting a template picture standard for the bill, selecting a standard template picture, and labeling keys and values on the template picture, wherein a key is a fixed, unchanging keyword in the bill and a value is variable content in the bill;
identification step: determining a picture to be recognized, matching the keys in the picture to be recognized with the keys of the template picture, setting the text boxes other than keys in the picture to be recognized as value candidate boxes, aligning the picture to be recognized with the template picture according to the correspondence between the keys, performing misalignment correction on the value candidate boxes, and extracting structured information from the content near the template picture's value text boxes.
2. The structured information extraction method for cigarette logistics receipt bills according to claim 1, wherein in the pre-labeling step, the template picture standard for the bill is a flat, non-inclined, printed and non-misaligned picture.
3. The structured information extraction method for cigarette logistics receipt bills according to claim 2, wherein in the pre-labeling step, the step of labeling a key is as follows:
performing rectangular-box labeling and text-content labeling on the template picture, and setting a tight rectangular box as the keyword region.
4. The structured information extraction method for cigarette logistics receipt bills according to claim 3, wherein in the pre-labeling step, the step of labeling a value is as follows:
performing rectangular-box labeling and field-name labeling on the fields to be recognized in the template picture other than the labeled keys.
5. The structured information extraction method for cigarette logistics receipt bills according to claim 1, 2, 3 or 4, wherein the recognition step further comprises:
detecting and recognizing all text boxes and text content in the picture to be recognized by OCR.
6. The structured information extraction method for cigarette logistics receipt bills according to claim 5, wherein the recognition step further comprises:
matching all the obtained text boxes and text content in the picture to be recognized by keyword, and judging whether each text box belongs to a key text box of the template picture; if so, associating the key in the picture to be recognized with the key of the template picture to form a group of key correspondences; if not, taking the text box and its text content as a value candidate box; if not a single group of key correspondences exists, the current picture cannot be recognized.
7. The structured information extraction method for cigarette logistics receipt bills according to claim 6, wherein the recognition step further comprises:
aligning the picture to be recognized with the template picture according to the correspondence between the keys: extracting the 4 vertices of each text box, and establishing the correspondence of 4 groups of vertices for each group of keys corresponding between the picture to be recognized and the template picture;
when N groups of key correspondences exist, establishing correspondences of N×4 groups of vertex coordinates, calculating a homography matrix from the correspondences between the vertex coordinates, and aligning the picture to be recognized with the template picture through perspective transformation.
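The alignment described in this claim can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the dictionary layout mapping each key's text to its 4 box corners is an assumption, and a plain least-squares DLT fit stands in for the RANSAC estimation described in the specification.

```python
import numpy as np

def collect_correspondences(keys_img, keys_tpl):
    """Stack the 4 corner vertices of every matched key box into
    N*4 source/target point arrays. keys_img / keys_tpl map a key's
    text to its 4 corners (an assumed data layout)."""
    src, dst = [], []
    for k, corners in keys_img.items():
        if k in keys_tpl:
            src.extend(corners)
            dst.extend(keys_tpl[k])
    return np.asarray(src, float), np.asarray(dst, float)

def fit_homography(src, dst):
    """Least-squares DLT over all point pairs (the specification uses
    RANSAC; this all-inlier variant is shown for simplicity)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def warp_points(H, pts):
    """Apply the perspective transform to an (M, 2) array of points."""
    p = np.hstack([pts, np.ones((len(pts), 1))])
    q = (H @ p.T).T
    return q[:, :2] / q[:, 2:3]
```

In practice the estimated H would be passed to an image-warping routine (e.g. a perspective warp) to resample the picture to be recognized into the template's coordinate frame; `warp_points` shows the same transform applied to box coordinates.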
8. The structured information extraction method for cigarette logistics receipt bills according to claim 7, wherein the recognition step further comprises:
translating all the value candidate boxes at least once according to a preset rule, calculating the degree of alignment between the value candidate boxes and the template value boxes for each displacement, selecting the displacement with the highest degree of alignment as the final dislocation displacement, and performing dislocation correction on all the value candidate boxes according to this displacement, so as to obtain the dislocation-corrected value boxes of the picture to be recognized and their contents.
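One way the displacement search of this claim could look in code. This is illustrative only: the claim does not fix the alignment metric, so the summed best-IoU score below is an assumption, as are the function and parameter names and the (x1, y1, x2, y2) box format.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def best_shift(value_boxes, template_boxes, shifts):
    """Try each candidate displacement, score how well the shifted
    value candidate boxes align with the template value boxes, and
    return the displacement with the highest alignment score."""
    best, best_score = (0, 0), -1.0
    for dx, dy in shifts:
        score = 0.0
        for x1, y1, x2, y2 in value_boxes:
            moved = (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
            # Each box contributes its overlap with its best template match.
            score += max(iou(moved, t) for t in template_boxes)
        if score > best_score:
            best, best_score = (dx, dy), score
    return best
```

The returned displacement is then applied to every value candidate box, which is the "dislocation correction" of the claim.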
9. The structured information extraction method for cigarette logistics receipt bills according to claim 8, wherein the recognition step further comprises:
extracting structured information near the template picture value text boxes: for each value candidate box, finding the template picture value text box with the largest overlap area; if the overlap of the two is larger than a set threshold, associating the value candidate box with the corresponding template picture value text box, otherwise ignoring the value candidate box;
after the association of all value candidate boxes is completed, the value candidate boxes associated with each template picture value text box constitute the content corresponding to that value field; concatenating their text content yields the extracted structured information of the field; if a template picture value text box has no associated value candidate box, the field cannot be recognized.
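The overlap association of this claim might be sketched as follows. Illustrative only: measuring overlap as a fraction of the candidate box's own area, the left-to-right concatenation order, the box format and all names are assumptions not fixed by the claim.

```python
def overlap_area(a, b):
    """Intersection area of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0, ix2 - ix1) * max(0, iy2 - iy1)

def associate(candidates, template_fields, threshold=0.5):
    """candidates: list of (box, text); template_fields: {field: box}.
    Associates each candidate with the template value box it overlaps
    most, keeps it only if the overlap exceeds the threshold, then
    concatenates the surviving texts per field from left to right."""
    assigned = {f: [] for f in template_fields}
    for box, text in candidates:
        best_f, best_ov = None, 0.0
        for f, tbox in template_fields.items():
            ov = overlap_area(box, tbox)
            if ov > best_ov:
                best_f, best_ov = f, ov
        area = (box[2] - box[0]) * (box[3] - box[1])
        if best_f is not None and area > 0 and best_ov / area > threshold:
            assigned[best_f].append((box[0], text))  # sort key: left edge
    # A field with no associated candidates yields an empty string,
    # i.e. the "cannot be recognized" case of the claim.
    return {f: "".join(t for _, t in sorted(parts))
            for f, parts in assigned.items()}
```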
CN202211442689.4A 2022-11-18 2022-11-18 Structured information extraction method for cigarette logistics receiving bill Active CN115497114B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211442689.4A CN115497114B (en) 2022-11-18 2022-11-18 Structured information extraction method for cigarette logistics receiving bill

Publications (2)

Publication Number Publication Date
CN115497114A true CN115497114A (en) 2022-12-20
CN115497114B CN115497114B (en) 2024-03-12

Family

ID=85116135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211442689.4A Active CN115497114B (en) 2022-11-18 2022-11-18 Structured information extraction method for cigarette logistics receiving bill

Country Status (1)

Country Link
CN (1) CN115497114B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6539112B1 (en) * 1999-02-26 2003-03-25 Raf Technology, Inc. Methods and system for identifying a reference region on an image of a dropped-out form
US8724907B1 (en) * 2012-03-28 2014-05-13 Emc Corporation Method and system for using OCR data for grouping and classifying documents
CN111241974A (en) * 2020-01-07 2020-06-05 深圳追一科技有限公司 Bill information acquisition method and device, computer equipment and storage medium
US20210383107A1 (en) * 2020-06-09 2021-12-09 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, device and storage medium for recognizing bill image
CN111931784A (en) * 2020-09-17 2020-11-13 深圳壹账通智能科技有限公司 Bill recognition method, system, computer device and computer-readable storage medium
WO2022057471A1 (en) * 2020-09-17 2022-03-24 深圳壹账通智能科技有限公司 Bill identification method, system, computer device, and computer-readable storage medium
CN112699867A (en) * 2020-09-27 2021-04-23 民生科技有限责任公司 Fixed format target image element information extraction method and system
CN112613367A (en) * 2020-12-14 2021-04-06 盈科票据服务(深圳)有限公司 Bill information text box acquisition method, system, equipment and storage medium
CN112861782A (en) * 2021-03-07 2021-05-28 上海大学 Bill photo key information extraction system and method
CN115147855A (en) * 2021-03-30 2022-10-04 上海聚均科技有限公司 Method and system for carrying out batch OCR (optical character recognition) on bills
CN113158895A (en) * 2021-04-20 2021-07-23 北京中科江南信息技术股份有限公司 Bill identification method and device, electronic equipment and storage medium
CN113191348A (en) * 2021-05-31 2021-07-30 山东新一代信息产业技术研究院有限公司 Template-based text structured extraction method and tool
CN113657377A (en) * 2021-07-22 2021-11-16 西南财经大学 Structured recognition method for airplane ticket printing data image
CN113903024A (en) * 2021-09-28 2022-01-07 合肥高维数据技术有限公司 Handwritten bill numerical value information identification method, system, medium and device
CN115063784A (en) * 2022-06-08 2022-09-16 杭州未名信科科技有限公司 Bill image information extraction method and device, storage medium and electronic equipment
CN115240178A (en) * 2022-06-24 2022-10-25 深源恒际科技有限公司 Structured information extraction method and system for bill image

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116382696A (en) * 2023-03-18 2023-07-04 宝钢工程技术集团有限公司 Engineering attribute dynamic analysis and submission method based on factory object position number
CN116382696B (en) * 2023-03-18 2024-06-07 宝钢工程技术集团有限公司 Engineering attribute dynamic analysis and submission method based on factory object position number

Also Published As

Publication number Publication date
CN115497114B (en) 2024-03-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant