CN116343237A - Bill identification method based on deep learning and knowledge graph - Google Patents
- Publication number
- CN116343237A (application number CN202110883236.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- bill
- image
- character
- value
- Prior art date
- Legal status
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval of unstructured textual data
- G06F16/33—Querying; G06F16/332—Query formulation
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri; G06F16/367—Ontology
Abstract
A bill identification method based on deep learning and knowledge graphs, belonging to the field of electronic information. The system consists of a text detection module, a text recognition module, and a key information extraction module. The text detection module locates text position coordinates in the image with a text detection algorithm and passes them to the text recognition module and the key information extraction module. The text recognition module predicts the text in each coordinate region provided by the text detection module and passes the resulting text to the key information extraction module. Finally, the entity category of each text segment is predicted from its position and text content; key information such as invoice numbers and company names is refined with the aid of a bill knowledge graph and corrected against information such as company names and place names obtained from Web services (e.g., enterprise-search APIs), further improving the accuracy of bill identification.
Description
Technical Field
The invention belongs to the field of electronic information and relates to an OCR technology based on deep learning and knowledge graphs, applied to the structured recognition of various bills (invoices, train tickets, and the like).
Background
In traditional financial systems, original bills are entered manually by finance staff, which consumes considerable time and effort and is prone to input errors. Text detection and recognition technology based on computer vision provides a technical basis for structured bill recognition. However, existing methods can only recognize the characters on a bill image; they cannot understand the semantics of those characters, so the recognized text cannot be structured. In addition, real bill images suffer from faint ink marks, shifted character positions, and similar defects, which lead to low recall in text detection and low recognition accuracy. In recent years, combining text detection and recognition with key information extraction technology has offered a new approach to these problems. A key information extraction method screens the text in the bill, selects the text segments of interest, and identifies the entity attribute of each segment (for a value-added-tax invoice, entities such as the invoice number, the invoice title, the taxpayer, the billing date, and the amount). These entities and the relationships among them provide the basis for structured bill recognition. Furthermore, a knowledge graph can efficiently represent relationships between real-world entities. The invention therefore introduces a knowledge graph to model the structured and unstructured data in bills and combines it with deep learning algorithms to achieve accurate detection, recognition, and structured analysis of bill text.
Disclosure of Invention
To address the shortcomings of traditional bill identification methods, the invention designs a structured bill recognition technology based on deep learning. The system consists of a text detection module, a text recognition module, and a key information extraction module. The text detection module locates text position coordinates in the image with a text detection algorithm and passes them to the text recognition module and the key information extraction module. The text recognition module predicts the text in each coordinate region provided by the text detection module and passes the resulting text to the key information extraction module. Finally, the entity category of each text segment is predicted from its position and text content; key information such as invoice numbers and company names is refined with the aid of a bill knowledge graph and corrected against information such as company names and place names obtained from Web services (e.g., enterprise-search APIs), further improving the accuracy of bill identification. The main contributions of the invention are as follows:
(1) As shown in fig. 1, a structured bill recognition system is designed that integrates a text detection module, a text recognition module, a key information extraction module, and a knowledge-graph-based modeling and error correction module.
(2) Preprocessing steps such as seal removal and image alignment are added, improving the accuracy of model detection and recognition.
(3) To correctly group text segments with slightly larger character spacing into one text box, a text box merging algorithm based on vertical IOU and horizontal distance is designed and applied.
(4) A key information extraction pipeline based on a neural network is designed.
(5) A recognition error correction pipeline based on the knowledge graph is designed.
Typical bill recognition methods often employ template matching: for a bill with a fixed template, manually defined rules determine the spatial positions of key regions, and the corresponding text is then extracted by a text recognition algorithm. This approach still has problems. In practice, most paper invoices are produced by printing the key information a second time onto a pre-printed bill template (rather than printing the complete bill in a single pass), so the second printing is often positionally offset. Template matching then frequently loses text or matches the wrong information.
Character position offsets during bill identification seriously degrade recognition. Building on prior research, the invention therefore combines and improves a text detection algorithm based on a convolutional neural network, a text recognition algorithm based on a convolutional neural network and long short-term memory, and a key information extraction algorithm based on a graph convolutional network. In addition, a post-processing step is added to the text detection algorithm that merges two semantically related text fragments into one text box, and knowledge-graph technology is used to correct misrecognized characters, increasing recognition accuracy.
The invention takes images of real train tickets, value-added-tax invoices, and other bills as input and produces structured output of the bill content. The specific steps are as follows:
(1) Constructing bill knowledge graph
Reasonable models are built for the various bill types and their key fields so that the output can be structured and errors corrected after recognition.
(2) Template alignment
Image features are extracted from both the real invoice picture and a pre-built blank invoice template; feature points are matched according to their descriptors; an optimal transformation matrix is computed following the random sample consensus (RANSAC) principle; and the corresponding affine transformation is applied to the invoice picture so that the image matches the predefined template structure. This lays the foundation for subsequent key information extraction. An example of template alignment is shown in fig. 2.
(3) Paper image preprocessing
The paper-bill image is binarized and denoised to improve the accuracy of subsequent detection and recognition.
(4) Stamp removal
The red seal in the invoice image is removed with a threshold segmentation technique, preventing the seal from affecting the recognition result.
(5) Text detection
A convolutional neural network extracts features from the image, the probability that each position contains text is predicted from those features, and the position of each text segment in the image is obtained.
(6) Character recognition
The image is then cropped according to the coordinates obtained above to produce text-region images, and the character sequence in each region is predicted with a deep learning method.
(7) Key information extraction
A deep-learning-based method identifies which entity in the knowledge graph each text segment belongs to, using the segment's position and semantic information.
(8) Error correction of identification content using knowledge-graph database
The recognized key text is matched against entities in the knowledge graph. Whether the recognized content is correct is determined by checking that it satisfies the common characteristics of the entity and by matching it against the instance library; if it is incorrect, it is corrected according to a set of rules.
Difficulty of the invention
(1) Existing bill identification methods have low accuracy, so results must still be manually rechecked, which falls short of fully automatic entry. Solving this is a recognized difficulty in the field. The invention designs a knowledge-graph-based bill identification and error correction technology that improves accuracy, can even guarantee that certain key fields are recognized 100% correctly, and meets the requirement of fully automatic entry.
(2) The invention designs a new seal removal method that avoids the drop in recognition accuracy caused by existing removal methods, and adds an image alignment algorithm to the bill identification pipeline, effectively countering the interference that bill creases, tilted shooting angles, and similar phenomena cause in text detection and recognition.
(3) A text detection technique for bills is designed; the difficulty lies in accurately predicting text regions of varying sizes in the bill image while meeting low-latency requirements.
Drawings
FIG. 1 is a schematic diagram of the system structure of the present invention
FIG. 2 is an example of template alignment
FIG. 3 is a schematic diagram of a text region detection module for a ticket image
FIG. 4 stamp removal flow chart
Detailed Description
The core algorithm of the invention
(1) RTDNN architecture of bill text detection network
The core idea of the bill text detection network (Receipt Text Detection Neural Network, RTDNN) is to treat individual characters as the objects to be detected, rather than whole words (made up of characters), i.e., rather than treating text boxes as targets. It detects individual characters (character region score) and the connections between characters (affinity score), and then determines the final text line from those connections. In this way, the receptive field does not need to change with text length: only character-level content matters, not the entire text line. The method therefore adapts well to bill text of different sizes and lengths.
The model structure is divided into 3 parts. The first part is the Input stage. The second part is the Backbone network, which extracts image features. The third part is the Prediction module, which outputs a Region score map predicting the probability that each pixel is at the center of a character. The 3 parts are described in detail as follows:
1) In the Input stage, the image first passes through a 5×5×64 convolution layer with stride 2, then through a 3×3 max pooling layer with stride 2.
2) The Backbone network borrows the idea of residual networks (ResNet) and is composed of 4 groups of convolution modules; each term in the structure denotes width × height × number of channels.
All activation functions in the neural network designed by the invention use Leaky ReLU, which retains a small positive slope for negative inputs so that gradients can still back-propagate. The Backbone draws on classical network frameworks and design ideas from convolutional neural networks, greatly reducing computation time while still extracting image features efficiently.
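As a minimal illustration of the activation described above, the following is a numpy sketch of Leaky ReLU; the 0.01 negative slope is an assumed default, not a value specified by the patent.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # Identity for x >= 0; a small positive slope for x < 0,
    # so the gradient never vanishes entirely on the negative side.
    return np.where(x >= 0, x, negative_slope * x)
```

Unlike plain ReLU, negative inputs are scaled rather than zeroed, which is the back-propagation property the text refers to.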
3) The Prediction module consists of one average pooling layer and 4 convolution layers and finally outputs a Region score map of size H×W×1, where each value represents the probability that the point is the center of text.
The invention also designs a text box generation algorithm for the bill image, which is based on the obtained Region score map and obtains a final bill image text detection box by setting a threshold value and calculating an IOU. The algorithm is described in detail as follows:
First, pixels with a score of 0.9 or higher are selected from the Region score map; their set is denoted S1. Using breadth-first traversal, points adjacent to S1 with a score greater than 0.6 are then added to S1. The minimum bounding rectangle of each connected region in S1 is computed, and text boxes belonging to the same text segment are merged as follows: if the vertical IOU of two text boxes is at least 0.8 and their horizontal distance is less than 30 px, they are merged into one. The resulting rectangles are the text detection result for the bill image.
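The merging rule can be sketched as follows; `merge_boxes` and the helper names are illustrative, and boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples in pixels.

```python
def vertical_iou(a, b):
    """IoU of two boxes' projections onto the y-axis."""
    inter = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    union = max(a[3], b[3]) - min(a[1], b[1])
    return inter / union if union > 0 else 0.0

def horizontal_gap(a, b):
    """Horizontal distance between two boxes (0 if they overlap in x)."""
    left, right = (a, b) if a[0] <= b[0] else (b, a)
    return max(0.0, right[0] - left[2])

def merge_boxes(boxes, v_iou_thresh=0.8, max_gap=30):
    """Greedily merge boxes with vertical IoU >= 0.8 and horizontal gap < 30 px."""
    boxes = [list(b) for b in boxes]
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                if vertical_iou(a, b) >= v_iou_thresh and horizontal_gap(a, b) < max_gap:
                    boxes[i] = [min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3])]
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return [tuple(b) for b in boxes]
```

Two boxes on the same line with a small gap collapse into one; boxes on different lines are left alone because their vertical projections barely overlap.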
(2) Seal removal algorithm
The invention designs a seal removing algorithm. The algorithm flow is shown in fig. 4. The detailed steps of the algorithm are as follows:
1. The RGB image is mapped to HSV space so that red areas in the picture can be extracted more accurately.
The calculation formula is as follows (with R′ = R/255, G′ = G/255, B′ = B/255):
Cmax = max(R′, G′, B′)
Cmin = min(R′, G′, B′)
Δ = Cmax − Cmin
V = Cmax
S = 0 if Cmax = 0, otherwise Δ/Cmax
H = 0 if Δ = 0; 60° × (((G′ − B′)/Δ) mod 6) if Cmax = R′; 60° × ((B′ − R′)/Δ + 2) if Cmax = G′; 60° × ((R′ − G′)/Δ + 4) if Cmax = B′
where R, G, B are the pixel values of the R, G, and B channels of the bill image, mapped to HSV space by the steps above. H is hue, S is saturation, and V is brightness.
2. Red in HSV space lies in the ranges [0,43,46] to [10,255,255] and [156,43,46] to [180,255,255]. The whole picture is traversed; points inside these ranges are set to 255 and points outside them to 0. The picture is then eroded and dilated: erosion removes noise points, while dilation expands the red region so that no red is missed. The resulting picture is denoted Mask1.
3. The R channel of the bill image is extracted; pixels above a threshold (160) are set to 255 and pixels below it to 0, yielding Mask2.
4. The image Mask is then generated: a pixel in Mask is 255 if the corresponding pixels in both Mask1 and Mask2 are 255, and 0 otherwise.
5. The original bill image is traversed; wherever Mask is 255, the RGB value at that position is set to (255, 255, 255). The red stamp is thereby removed.
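The masking steps can be sketched in numpy as follows. This is a simplified sketch: the erosion/dilation of step 2 is omitted for brevity (`cv2.erode`/`cv2.dilate` would supply it in practice), the HSV conversion follows OpenCV's convention (H in [0,180], S and V in [0,255]), and all function names are illustrative.

```python
import numpy as np

def rgb_to_hsv_cv(img):
    """RGB (0..255) -> HSV in OpenCV's convention: H in [0,180], S,V in [0,255]."""
    rgb = img.astype(np.float64) / 255.0
    cmax = rgb.max(axis=-1)
    delta = cmax - rgb.min(axis=-1)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    h = np.zeros_like(cmax)
    nz = delta > 0
    # Piecewise hue in 60-degree sectors, halved to fit 0..180.
    idx = nz & (cmax == r)
    h[idx] = (60 * ((g[idx] - b[idx]) / delta[idx]) % 360) / 2
    idx = nz & (cmax == g) & (cmax != r)
    h[idx] = (60 * ((b[idx] - r[idx]) / delta[idx] + 2)) / 2
    idx = nz & (cmax == b) & (cmax != r) & (cmax != g)
    h[idx] = (60 * ((r[idx] - g[idx]) / delta[idx] + 4)) / 2
    s = np.where(cmax > 0, delta / np.maximum(cmax, 1e-12), 0) * 255
    return np.stack([h, s, cmax * 255], axis=-1)

def remove_red_stamp(img, r_thresh=160):
    """Mask1 (red range in HSV) AND Mask2 (R channel > threshold) -> paint white."""
    hsv = rgb_to_hsv_cv(img)
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    mask1 = (s >= 43) & (v >= 46) & (((h >= 0) & (h <= 10)) | ((h >= 156) & (h <= 180)))
    mask2 = img[..., 0] > r_thresh
    out = img.copy()
    out[mask1 & mask2] = 255
    return out
```

Red pixels are painted white while black text and non-red colors are untouched, which is the behavior the combined Mask targets.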
The method first converts the image to HSV space, extracts red, and then obtains the approximate seal area using erosion and dilation. Threshold segmentation is applied only to that area, leaving other positions untouched. Compared with using threshold segmentation alone, this greatly reduces its adverse effect on OCR and improves recognition accuracy.
(3) Character recognition algorithm based on CRNN and ACE
The architecture is shown as the bill text recognition network module in fig. 5. In the text recognition network, the bill text-region image is first preprocessed to normalize the data; the processed image is then fed into a bill-image feature extraction network that encodes the text features as a sequence; finally, a character recognizer decodes the sequence encoding to obtain the text recognition result for the bill image.
The invention decodes the feature sequence of the bill text with the Aggregation Cross-Entropy (ACE) algorithm to recognize the bill image text. The previous stage outputs T time steps. The final loss is obtained in four steps:
1) Sum the probabilities of the k-th character over all time steps: y_k = Σ_{t=1}^{T} y_k^t.
2) Normalize y_k: ȳ_k = y_k / T.
3) Normalize the label counts: N̄_k = N_k / T, where N_k is the number of occurrences of character k in the label, and the blank class absorbs the remaining T − |label| steps.
4) Compute the ACE loss: L = −Σ_k N̄_k ln ȳ_k.
This avoids the very complex and time-consuming computation of the CTC algorithm and, unlike attention-based methods, achieves a similar effect without a complex attention module, so no additional network parameters are introduced. The algorithm is therefore of great help in decoding the text feature sequence of the bill image.
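A small numpy sketch of the four-step ACE loss, under the assumption that class 0 is the blank (the patent does not fix the blank index):

```python
import numpy as np

def ace_loss(probs, label, num_classes, blank=0):
    """ACE loss: compare the time-averaged predicted class distribution
    with the normalized character counts of the label.
    probs: (T, K) per-timestep softmax outputs; label: list of class indices."""
    T = probs.shape[0]
    counts = np.zeros(num_classes)
    for c in label:
        counts[c] += 1
    counts[blank] = T - len(label)      # remaining time steps are blanks
    n_bar = counts / T                  # step 3: normalized label counts
    y_bar = probs.sum(axis=0) / T       # steps 1-2: accumulated, normalized probs
    return -np.sum(n_bar * np.log(y_bar + 1e-12))  # step 4
```

Predictions whose time-averaged distribution matches the label statistics score a lower loss than, e.g., uniform predictions, which is the aggregation property ACE relies on.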
(4) Knowledge graph-based bill ocr result error correction technology
The invention designs a text error correction method; the detailed flow is shown in figure 4. If the OCR result for an entity does not match the data in the knowledge-graph database, the entity is judged to be misrecognized, and two branch processes are applied to it:
branch 1: and calculating the similarity between each candidate word and the recognition result by using a TF-IDF algorithm on the candidate word list of the entity. And screening out words with similarity to the recognition result higher than 0.8, and marking the result set as C.
Branch 2: the goal of this branch is to predict the law of errors in the OCR process for a certain chinese character. The invention collects an error conversion mapping set, which comprises 201 mappings of text conversion errors occurring in the practical OCR process, wherein the mapping format is c- > { c1, c2 … cn }, c1 is the wrong character, and { c1, c2 … cn } is the correct character set. Through statistical analysis of error rules, the character which is identified to be wrong in OCR conversion is found to have certain similarity with the original character in the stroke structure of the font. For example, the number 1 is often identified as "[", "", "| -! "etc. The invention uses this rule to predict the word that is recognized incorrectly, replacing it with the correct word. For example, there is an entity of an amount, the true value of the amount is "3001 #", but the result of recognition by OCR is "300 sheep". In the constructed knowledge graph database, the amount is composed of numbers, decimal points and special symbols. Obviously, "]" and "sheep" do not match the data in the database, and the two characters need to be replaced with the correct characters. And (5) replacing the original character according to the error conversion mapping set, and marking the obtained character string set as S. And finding out the value with the highest similarity from the intersection of the set S and the set C as the corrected value.
In addition, a company name typically consists of three parts: a place name, a distinguishing name, and a suffix such as "Co., Ltd." The place name is compared for similarity against candidate words in the knowledge-graph database and replaced with the correct place name; the resulting string is then used as a search key and an HTTP request is sent to an http:// api. The Name field in the JSON data returned by the request contains the possibly correct company names, and the value with the highest similarity in this set is selected as the corrected value.
Detailed Description
1. Modeling various notes using knowledge graph
First, a knowledge graph is built for the various bill types commonly seen in daily life. Each invoice type serves as a main entity, and the key field types in the bill serve as its sub-entities. For each key field, the field's common characteristics are extracted as attributes of the sub-entity. For some key entities, all instances can be obtained by web crawling and similar means, screened, and stored in a database; for third-party resources, a corresponding data access interface is obtained and the data are fetched through it. The entities are then linked through appropriate relationships, completing the construction of the bill knowledge graph.
The system mainly comprises a knowledge acquisition and processing module, a knowledge storage module, and a knowledge application module. The base layer contains the knowledge acquisition and processing module; the database and cache layers contain the knowledge storage module; and the Service and API layers contain the knowledge application module.
The knowledge acquisition and processing module takes raw Excel spreadsheet data gathered from books and websites related to bill entities and, through the three processes of data cleaning, knowledge processing, and knowledge representation, produces the relation network of bills, bill entities, and the corresponding instances of those entities.
Taking a train ticket as an example, the entities to be identified on a train ticket are shown in the following table:
The knowledge storage module provides bill knowledge graph storage using the Neo4j graph database, storing the relationships between bill types and key entities. Some entities have very simple composition rules: the train ticket ID and the train number consist of letters and digits; the time consists of digits and the characters ":", "year", "month", and "day"; and the price consists of a number and a currency symbol. In addition, all instances of the departure and destination stations can be obtained through the interface of the https://www.12306.cn/index/ website. With these rules, the knowledge graph for the train ticket can be constructed.
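The composition rules above lend themselves to simple regular-expression checks. The patterns below are illustrative guesses at the exact formats (including the use of the characters 年/月/日 and the ¥ sign), not the patent's own definitions.

```python
import re

# Hypothetical per-entity patterns reflecting the composition rules above.
FIELD_PATTERNS = {
    "ticket_id": re.compile(r"^[A-Z0-9]+$"),                        # letters and digits
    "train_number": re.compile(r"^[A-Z]?\d{1,4}$"),                 # e.g. G1234
    "time": re.compile(r"^\d{4}年\d{1,2}月\d{1,2}日\d{1,2}:\d{2}$"),  # date + hh:mm
    "price": re.compile(r"^¥\d+(\.\d{1,2})?$"),                     # currency sign + number
}

def field_matches(field, value):
    """Check a recognized string against its entity's composition rule."""
    pat = FIELD_PATTERNS.get(field)
    return bool(pat and pat.fullmatch(value))
```

A recognized value that fails its field's pattern is exactly the "mismatch with the database" condition that triggers the error correction branches.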
The knowledge application module contains common user services such as user login, user registration, log management, and knowledge retrieval; knowledge retrieval returns attribute information for the key fields of a bill. The system uses a microservice architecture: based on SOA, the core service line is divided into user authentication, user permission control, bill feature-entity extraction, bill knowledge retrieval, and bill text-image recognition services, with API interfaces designed and implemented following the RESTful specification. MySQL stores user information and system log records. For scalability and high concurrency, Redis provides distributed caching. The platform's knowledge services are packaged with Docker container technology, making distributed deployment convenient and giving high portability and extensibility, and the Kubernetes platform manages the containers so that deployment, scaling, and management are automated and the system is highly available.
2. Template alignment
Image features are extracted from both the real invoice picture and the pre-built blank invoice template using the ORB feature point detector in OpenCV.
The ORB feature point detector consists of two parts:
1. Locator: this module finds points on the picture that are invariant to rotation, scaling, and affine transformation, and outputs their coordinates.
2. Descriptor: after the feature points are obtained, their properties must be described in some way; the output of this description is called the feature descriptor. The core idea of the BRIEF algorithm is to select N point pairs around the key point P in a fixed pattern and combine the grayscale comparison results of those N pairs into a descriptor.
Feature points are matched according to their descriptors, and a homography matrix is computed following the RANSAC principle. An affine transformation is applied to the invoice picture according to this matrix so that the image matches the predefined template structure, laying the foundation for subsequent key information extraction.
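In practice OpenCV's `cv2.findHomography(..., cv2.RANSAC)` or `cv2.estimateAffine2D` would be used for this step. As a self-contained sketch, the following estimates a least-squares affine transform from matched point pairs, without the outlier rejection RANSAC adds; function names are illustrative.

```python
import numpy as np

def estimate_affine(src, dst):
    """Least-squares 2x3 affine matrix M mapping src points to dst points."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)
    A = np.zeros((2 * n, 6))
    b = dst.reshape(-1)            # interleaved x0, y0, x1, y1, ...
    A[0::2, 0:2] = src             # x-equations: a*x + b*y + tx
    A[0::2, 2] = 1
    A[1::2, 3:5] = src             # y-equations: c*x + d*y + ty
    A[1::2, 5] = 1
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params.reshape(2, 3)

def apply_affine(M, pts):
    """Apply a 2x3 affine matrix to an array of (x, y) points."""
    pts = np.asarray(pts, dtype=float)
    return pts @ M[:, :2].T + M[:, 2]
```

Given clean correspondences the estimate recovers the transform exactly; with real ORB matches, RANSAC's inlier filtering is what makes this robust to mismatches.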
3. Stamp removal
The original RGB image is mapped to HSV space and the red parts are screened by their value range; the image is then processed with OpenCV's erosion and dilation operations, where erosion removes noise and dilation expands the red range to prevent red pixels from being missed. The seal removal algorithm designed by the invention then processes the bill image and removes the red seal.
4. Text detection
Features are extracted from the image with the convolutional neural network VGG-16; for each position, the probability that it contains text is predicted from the features, and the text-containing positions are combined by a text-box merging algorithm to obtain the position coordinates of all text segments in the image. The picture is then cropped according to these coordinates to obtain the text-region images.
5. Character recognition
The character recognition stage adopts the mainstream CRNN network model, as follows:
1) And converting the picture obtained in the text detection step into a picture with any width and height of 32 pixels. Then input into a CNN network composed of 7 layers of convolution layers, four layers of maximum pooling layers and two layers of Batchnormal layers, and output a feature map with the size of (512,1,40).
2) The feature Map is input to the Map-to-Sequence layer. Feature maps of size (512,1, 40) are recombined into feature vector sequences of (512, 40).
3) The feature sequence is then predicted using a bi-directional RNN (BLSTM), each feature vector in the sequence is learned, and a predicted tag distribution is output. The softmax probability distribution of all characters is obtained, which is a vector with the length of the character class number and is used as the input of the CTC layer.
4) The CTC layer takes each '-' as a separator, and merges the same and adjacent characters in the separator. And finally deleting the separator, wherein the final content is the predicted value of the text.
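The merge-and-delete rule of step 4) can be sketched as a greedy CTC decoder (an illustrative sketch, with '-' standing for the blank symbol):

```python
def ctc_decode(labels, blank='-'):
    """Collapse a per-time-step label sequence: merge identical adjacent
    characters, then drop the blank separator, as described above."""
    out = []
    prev = None
    for ch in labels:
        if ch != prev:        # merge identical adjacent characters
            if ch != blank:   # drop the '-' separator
                out.append(ch)
        prev = ch
    return ''.join(out)
```

Note that a blank between two identical characters keeps them distinct: "aa--aa" decodes to "aa", not "a".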
6. Key information extraction
The position information of each text region obtained by text detection and the semantic information obtained by text recognition are converted into vectors according to a defined mapping. A neural network is then trained that takes these two vectors as input, extracts and reasons over positional and semantic features, and outputs a probability matrix over the key-field entities in the previously constructed knowledge graph. In this way, key information extraction is converted into a text-segment classification task.
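As an illustrative sketch of this classification step (all dimensions, weight shapes, and the number of entity classes below are assumptions, not taken from the patent), one text segment's position vector and semantic vector could be scored against the entity classes with a small feed-forward pass:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify_segment(pos_vec, sem_vec, W1, b1, W2, b2):
    """Score one text segment against the knowledge-graph entity classes.

    pos_vec: normalized box geometry of the text region (hypothetical layout),
    sem_vec: embedding of the recognized text (hypothetical layout).
    """
    x = np.concatenate([pos_vec, sem_vec])  # fuse position and semantics
    h = np.maximum(0.0, W1 @ x + b1)        # ReLU hidden layer
    return softmax(W2 @ h + b2)             # probability over entity classes
```

The output row of probabilities corresponds to one row of the probability matrix described above.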
7. Error correction of identification content using knowledge-graph techniques
The recognized key text is matched against the entities in the knowledge graph. If the database contains an entity matching the text, recognition succeeded; if no entity matches, a recognition error has occurred, and the best-matching text must be selected from the pre-constructed knowledge-graph database. The TF-IDF algorithm is used for candidate text selection and is computed as follows:
1) Compute the term frequency (TF), the frequency with which a feature word occurs in a given text; the higher the frequency, the more important the feature word:
TF = (number of occurrences of the feature word in the text) / (total number of words in the text)
2) Compute the inverse document frequency (Inverse Document Frequency, IDF): if a feature word appears in many candidate texts, its ability to distinguish between candidates is low:
IDF = log(N / (1 + number of candidates containing the feature word)), where N is the total number of candidate texts
3) Compute TF-IDF:
TF-IDF = TF × IDF
This algorithm yields the similarity between the recognized text and each text in the database, and the text with the highest similarity is selected as the corrected result.
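The candidate-selection step can be sketched with character-level TF-IDF vectors and cosine similarity. This is an illustrative sketch: the exact tokenization and IDF smoothing used in the patent are not specified, so the log(N / (1 + df)) + 1 form below is an assumption.

```python
import math
from collections import Counter

def build_idf(corpus):
    """IDF over a list of candidate texts, with +1 document smoothing."""
    n = len(corpus)
    df = Counter()
    for t in corpus:
        df.update(set(t))           # character-level document frequency
    return {c: math.log(n / (1 + d)) + 1.0 for c, d in df.items()}

def tfidf_vector(text, idf):
    """Sparse TF-IDF vector of a text; unseen characters get weight 0."""
    tf = Counter(text)
    total = len(text)
    return {c: (cnt / total) * idf.get(c, 0.0) for c, cnt in tf.items()}

def cosine(a, b):
    dot = sum(a[c] * b.get(c, 0.0) for c in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(query, candidates):
    """Return the candidate text most similar to the recognized text."""
    idf = build_idf(candidates)
    qv = tfidf_vector(query, idf)
    return max(candidates, key=lambda c: cosine(qv, tfidf_vector(c, idf)))
```

A misrecognized string is thus replaced by the database text it most resembles.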
Claims (3)
1. The bill identification method based on deep learning and knowledge graph is characterized by comprising the following steps:
(1) A bill structured-recognition system is designed that integrates a text detection module, a text recognition module, a key information extraction module, and a knowledge-graph modeling and error-correction module.
(2) Seal removal and image alignment preprocessing steps are added, improving the accuracy of model detection and recognition.
(3) A text box merging algorithm based on vertical-direction IOU and lateral distance is designed and applied.
(4) A key information extraction flow based on a neural network is designed.
(5) A recognition error-correction flow based on the knowledge graph is designed.
2. The bill identification method based on deep learning and knowledge graph according to claim 1, characterized by comprising the steps of:
(1) Constructing bill knowledge graph
(2) Template alignment
Image features are extracted simultaneously from the real invoice picture and a pre-built blank invoice template. Feature points are matched according to their descriptors, an optimal transformation matrix is computed following the random sample consensus (RANSAC) principle, and the corresponding affine transformation is applied to the invoice picture so that the image matches the predefined template structure, laying the foundation for the subsequent extraction of key information.
(3) Bill image preprocessing
The bill image is binarized and denoised to improve the accuracy of subsequent detection and recognition.
(4) Stamp removal
Removing the red seal in the invoice image by using a threshold segmentation technology;
(5) Text detection
Features are extracted from the image with a convolutional neural network, the probability that each position contains a character is predicted from these features, and the position information of each text segment in the image is obtained.
(6) Character recognition
The image is cropped according to the coordinates obtained in the previous step to obtain the text-region images, and the characters in each sequence are predicted using a deep learning method.
(7) Key information extraction
A deep-learning-based method identifies, from the position information and semantic information of each text segment, which entity in the knowledge graph the segment belongs to.
(8) Error correction of identification content using knowledge-graph database
The recognized key text is matched against the entities in the knowledge graph; whether the recognized content is correct is determined by checking whether it satisfies the common characteristics of the entity and whether it matches the instance library, and incorrect content is corrected.
3. The bill identification method based on deep learning and knowledge graph according to claim 1, characterized by comprising the steps of:
(1) RTDNN architecture of bill text detection network
The model structure is divided into 3 parts. The first part is the input stage. Part 2 is the Backbone network, which is responsible for extracting image features. Part 3 is the Prediction module, which outputs a Region score map predicting the probability that each pixel lies at the center of a character. The 3 parts are described in detail as follows:
1) At the input stage, the image first passes through a 5×5×64 convolution layer with stride 2, followed by a 3×3 max-pooling layer with stride 2.
2) The Backbone network borrows the idea of the residual network (ResNet) and consists of 4 groups of convolution modules; the details of each module are as follows:
Each entry in a module's structure denotes its width, height, and number of channels.
All activation functions in the network use Leaky ReLU. The Prediction module consists of one average-pooling layer and four convolution layers and finally outputs the Region score map, which represents the probability that each point is the center of text.
The text box generation algorithm for the bill image is described in detail as follows:
First, pixels with a score of at least 0.9 are selected in the Region score map; the set of these points is denoted S1. Then, using breadth-first traversal, points adjacent to S1 with a score greater than 0.6 are added to S1. The bounding rectangle of each connected region in S1 is computed, and text boxes belonging to the same text segment are merged as follows: if the vertical-direction IOU of two text boxes is at least 0.8 and their horizontal distance is less than 30 px, the two boxes are merged into one. The resulting rectangles are the text detection result for the bill image.
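The merging rule (vertical IOU ≥ 0.8 and horizontal distance < 30 px) can be sketched as follows. Boxes are (x1, y1, x2, y2) tuples; the greedy fixed-point loop is an illustrative choice, not necessarily the patent's exact procedure:

```python
def vertical_iou(a, b):
    """IOU of two boxes projected onto the vertical axis."""
    top = max(a[1], b[1])
    bottom = min(a[3], b[3])
    inter = max(0, bottom - top)
    union = (a[3] - a[1]) + (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def h_gap(a, b):
    """Horizontal distance between two boxes (0 if they overlap)."""
    return max(a[0] - b[2], b[0] - a[2], 0)

def merge_boxes(boxes, iou_thr=0.8, gap_thr=30):
    """Repeatedly merge box pairs on the same text line until none qualify."""
    boxes = [list(b) for b in boxes]
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                if vertical_iou(a, b) >= iou_thr and h_gap(a, b) < gap_thr:
                    # Replace the pair with their bounding rectangle.
                    boxes[i] = [min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3])]
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return [tuple(b) for b in boxes]
```

Two fragments of the same line are fused into one box, while boxes on different lines (vertical IOU 0) remain separate.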
(2) Seal removal algorithm
The detailed steps of the algorithm are as follows:
1) The RGB image is mapped to HSV space to allow more accurate extraction of the red regions in the picture. The conversion formulas are:
R′ = R/255, G′ = G/255, B′ = B/255
Cmax = max(R′, G′, B′)
Cmin = min(R′, G′, B′)
Δ = Cmax − Cmin
H = 0 if Δ = 0; 60° × (((G′ − B′)/Δ) mod 6) if Cmax = R′; 60° × ((B′ − R′)/Δ + 2) if Cmax = G′; 60° × ((R′ − G′)/Δ + 4) if Cmax = B′
S = 0 if Cmax = 0, otherwise Δ/Cmax
V = Cmax
where R, G, B are the pixel values of the R, G, and B channels of the bill image, which are mapped to HSV space by the above steps. H is hue, S is saturation, and V is brightness.
2) In HSV space, red corresponds to the ranges [0,43,46]–[10,255,255] ∪ [156,43,46]–[180,255,255]. The whole picture is traversed: points within these ranges are set to 255 and all other points to 0. The picture is then eroded and dilated: erosion removes noise points, and dilation expands the red region so that no red pixels are missed. The resulting picture is denoted Mask1.
3) The R channel of the bill image is extracted; pixels above a threshold (160) are set to 255 and pixels below it to 0, yielding the image Mask2.
4) The image Mask is generated, with pixel values computed as follows: if the corresponding pixel values at a position in both Mask1 and Mask2 are 255, the value of that point is 255; otherwise it is 0.
5) The original bill image is traversed; wherever the value in Mask is 255, the RGB value at that position is set to (255, 255, 255). The red stamp is thereby removed.
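Steps 1)–5) can be sketched in NumPy as follows. The erosion and dilation of step 2) are omitted for brevity (in practice they would be applied to the red mask with a library such as OpenCV), so this is an illustrative sketch rather than the patent's implementation:

```python
import numpy as np

def rgb_to_hsv(img):
    """RGB (H, W, 3) uint8 -> OpenCV-style HSV: H in [0, 180), S and V in [0, 255]."""
    rgb = img.astype(float) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cmax = rgb.max(axis=-1)
    cmin = rgb.min(axis=-1)
    delta = cmax - cmin
    safe = np.where(delta > 0, delta, 1.0)   # avoid division by zero
    h = np.zeros_like(cmax)
    rm = (delta > 0) & (cmax == r)
    gm = (delta > 0) & (cmax == g) & ~rm
    bm = (delta > 0) & (cmax == b) & ~rm & ~gm
    h[rm] = (60.0 * (g - b) / safe)[rm] % 360
    h[gm] = (60.0 * (b - r) / safe + 120.0)[gm]
    h[bm] = (60.0 * (r - g) / safe + 240.0)[bm]
    s = np.where(cmax > 0, delta / np.where(cmax > 0, cmax, 1.0), 0.0)
    return np.stack([h / 2.0, s * 255.0, cmax * 255.0], axis=-1)

def remove_stamp(img):
    """Whiten pixels that are red in HSV (Mask1) AND bright in R (Mask2)."""
    hsv = rgb_to_hsv(img)
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    red = ((h <= 10) | (h >= 156)) & (s >= 43) & (v >= 46)   # Mask1 (pre-morphology)
    bright_r = img[..., 0] > 160                              # Mask2
    mask = red & bright_r                                     # Mask = Mask1 AND Mask2
    out = img.copy()
    out[mask] = 255   # step 5: overwrite stamp pixels with white
    return out
```

Pure red pixels are whitened while white, dark, and blue pixels pass through untouched.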
(3) Character recognition algorithm based on CRNN and ACE
In the text recognition network for the bill image, the bill text-region image is first preprocessed; the processed image is then input into the bill-image feature extraction network, which serializes and encodes the text features; finally, the serialized encoding is decoded by a character recognizer to obtain the text recognition result of the bill image.
The feature sequence of the bill text is decoded with the aggregation cross-entropy (ACE) algorithm to recognize the bill image text. From the previous step there are T output time steps. The final cross entropy is obtained in the following four steps:
1) Sum the probability of the k-th character over all time steps to obtain y_k:
y_k = Σ_{t=1..T} y_k^t, where y_k^t is the predicted probability of character k at time step t.
2) Normalize y_k by the sequence length:
ȳ_k = y_k / T
3) Normalize the character counts of the label in the same way:
C̄_k = C_k / T
4) Compute the Loss of ACE as the cross entropy between the two normalized distributions:
Loss = −Σ_k C̄_k · ln(ȳ_k)
where C_k is the number of occurrences of character k in the label.
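The four steps above can be sketched in NumPy (an illustrative implementation of aggregation cross-entropy; the small epsilon inside the logarithm is an added numerical-stability assumption, and the blank class is assumed to absorb the remaining counts):

```python
import numpy as np

def ace_loss(probs, label_counts):
    """Aggregation cross-entropy.

    probs: (T, K) per-time-step softmax output (column 0 = blank, by assumption).
    label_counts: (K,) occurrence count of each character in the label,
                  with the blank count set to T minus the label length.
    """
    T = probs.shape[0]
    y = probs.sum(axis=0)          # 1) aggregate each character over time steps
    y_bar = y / T                  # 2) normalize predictions
    c_bar = label_counts / T       # 3) normalize label counts
    return -np.sum(c_bar * np.log(y_bar + 1e-10))   # 4) cross entropy
```

A prediction whose aggregated character frequencies match the label counts exactly attains the entropy of the count distribution; any mismatch increases the loss.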
(4) Knowledge-graph-based error correction of bill OCR results
If the OCR result for an entity does not match any data in the knowledge-graph database, the entity is judged to be misrecognized, and two branch processes handle misrecognized entities:
Branch 1: the similarity between each candidate word in the entity's candidate list and the recognition result is computed with the TF-IDF algorithm. Words whose similarity to the recognition result exceeds 0.8 are retained, and the result set is denoted C.
Branch 2: the goal of this branch is to predict the law of errors in the OCR process for a certain chinese character. An error conversion mapping set is collected, which comprises a plurality of mappings of text conversion errors occurring in the actual OCR process, wherein the mapping format is c- > { c1, c2 … cn }, c1 is the wrong character, and { c1, c2 … cn } is the correct character set. And (5) replacing the original character according to the error conversion mapping set, and marking the obtained character string set as S. And finding out the value with the highest similarity from the intersection of the set S and the set C as the corrected value.
In addition, a company name is typically made up of one or more of 3 parts: a place name, other words, and a company-type suffix such as "Co., Ltd.". The place-name part is compared for similarity against candidate words in the knowledge-graph database and replaced with the correct place name; the resulting string is then used as a search key, and an HTTP request is sent to an enterprise-search interface for fuzzy search. The Name field in the JSON data returned by the request contains the possibly correct company names, and the value with the highest similarity in this set is selected as the corrected value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110883236.4A CN116343237A (en) | 2021-08-02 | 2021-08-02 | Bill identification method based on deep learning and knowledge graph |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116343237A true CN116343237A (en) | 2023-06-27 |
Family
ID=86891630
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117115839A (en) * | 2023-08-10 | 2023-11-24 | 广州方舟信息科技有限公司 | Invoice field identification method and device based on self-circulation neural network |
CN117727059A (en) * | 2024-02-18 | 2024-03-19 | 蓝色火焰科技成都有限公司 | Method and device for checking automobile financial invoice information, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||