CN110008944B

CN110008944B - OCR recognition method and device based on template matching and storage medium

Info

Publication number: CN110008944B
Application number: CN201910127136.1A
Authority: CN
Inventors: 高梁梁
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-02-20
Filing date: 2019-02-20
Publication date: 2024-02-13
Anticipated expiration: 2039-02-20
Also published as: CN110008944A

Abstract

The application discloses an OCR (optical character recognition) method and device based on template matching, a storage medium and computer equipment, and relates to the technical field of information processing. The method comprises the following steps: collecting sample document pictures of different appointed typesetting modes; performing frame selection on each sample document picture to obtain an identification template corresponding to each sample document picture; establishing an identification template database, wherein the identification template database stores identification templates corresponding to the document pictures of the samples; collecting a document picture to be identified, and identifying the border and the title of the document picture to be identified to obtain the document type of the document picture to be identified; and calling a corresponding recognition template in the recognition template database according to the document type of the document picture to be recognized, and performing OCR recognition on the document picture to be recognized. The recognition template database is established, so that the recognition template database can adapt to recognition of documents with various typesetting formats, and the accuracy of OCR recognition is improved.

Description

OCR recognition method and device based on template matching and storage medium

Technical Field

The present invention relates to the field of information processing technologies, and in particular, to an OCR recognition method and apparatus based on template matching, a storage medium, and a computer device.

Background

The optical character recognition (Optical Character Recognition, OCR) method refers to obtaining an electronic document of a paper document by an electronic device (e.g., a scanner or a digital camera), cutting character strings in the electronic document apart to form a small picture containing a single character, and then recognizing the cut text by using a certain method.

Because the character layouts of pictures to be recognized vary widely, conventional OCR recognition methods can accurately recognize only pictures with a fixed character layout, such as identity cards and bank cards, and perform poorly on pictures of other documents.

Disclosure of Invention

In view of this, the present application provides an OCR recognition method and apparatus, a storage medium, and a computer device based on template matching, which mainly aims to solve the problem of poor recognition effect of the existing OCR recognition method.

According to one aspect of the present application, there is provided an OCR recognition method based on template matching, the method comprising:

collecting sample document pictures of different appointed typesetting modes;

performing frame selection on each sample document picture to obtain an identification template corresponding to each sample document picture;

establishing an identification template database, wherein the identification template database stores identification templates corresponding to the document pictures of the samples;

collecting a document picture to be identified, and identifying the border and the title of the document picture to be identified to obtain the document type of the document picture to be identified;

and calling a corresponding recognition template in the recognition template database according to the document type of the document picture to be recognized, and performing OCR recognition on the document picture to be recognized.

Optionally, the identifying the border and the title of the document picture to be identified to obtain the document type of the document picture to be identified includes:

performing binarization processing on the document picture to be identified to obtain a binarization table image;

performing tilt correction on the binarized table image using a tilt correction algorithm based on perspective transformation;

extracting the frame of the document picture to be identified by adopting an image morphology processing method based on the binarized table image after inclination correction;

OCR is carried out on a preset area of the binarized table image after inclination correction, so that a title of the document picture to be identified is obtained;

and obtaining the document type of the document picture to be identified according to the border and the title of the document picture to be identified.
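As a rough illustration of the binarization step listed above, the sketch below thresholds a grayscale image with Otsu's method. The text does not name a thresholding algorithm, so Otsu's method, the function names and the list-of-lists image format are all assumptions made for illustration.

```python
from typing import List

Gray = List[List[int]]  # rows of 0..255 intensities (assumed image format)


def otsu_threshold(image: Gray) -> int:
    """Return the threshold that maximises the between-class variance."""
    hist = [0] * 256
    for row in image:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue  # no pixels at or below t yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the background class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image: Gray) -> List[List[int]]:
    """Map dark pixels (ink, table lines) to 1 and light background to 0."""
    t = otsu_threshold(image)
    return [[1 if v <= t else 0 for v in row] for row in image]


if __name__ == "__main__":
    sample = [[250, 250, 250], [20, 30, 25], [245, 255, 250]]
    print(binarize(sample))
```

Treating ink as 1 is a convenience for the later morphological steps, which operate on foreground pixels.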

Optionally, the performing OCR recognition on the document picture to be recognized includes:

and adopting a convolutional recurrent neural network (CRNN) model to perform OCR recognition on the document picture to be recognized.

Optionally, the convolutional recurrent neural network model includes a convolutional neural network (CNN), a bidirectional long short-term memory (LSTM) recurrent neural network, and a connectionist temporal classification (CTC) model;

the adopting the convolutional recurrent neural network model to perform OCR recognition on the document picture to be recognized comprises the following steps:

the CNN extracts the features of the identification area of the document picture to be identified and generates a feature sequence of the identification area;

the bidirectional LSTM determines a label distribution list corresponding to each feature in the feature sequence;

and the CTC model determines the characters of the identification area according to the label distribution list corresponding to each feature.

Optionally, the collecting the document picture to be identified includes:

and acquiring a document picture to be identified by a high-definition camera with an automatic shooting angle adjusting function.

Optionally, before invoking the corresponding recognition template in the recognition template database to perform OCR recognition on the document picture to be recognized, the method further includes:

adjusting the brightness and contrast of the document picture to be identified;

carrying out gray processing on the document picture to be identified;

and receiving an angle adjustment instruction of a user on the document picture to be identified after gray processing, and adjusting the angle of the document picture to be identified.

Optionally, the first sample document picture is configured with a plurality of region identification templates in the identification template database, each region identification template being used for identifying a partial region of the sample document picture.

Optionally, the framing each sample document picture includes:

carrying out integral automatic frame selection on each sample document picture to obtain an identification area of each sample document picture;

and adjusting the automatically frame-selected identification areas: merging a plurality of identification areas that were erroneously frame-selected separately into one identification area, and splitting one erroneously frame-selected identification area into a plurality of identification areas.

According to another aspect of the present application, there is provided an OCR recognition device based on template matching, the device comprising:

the sample document picture collecting unit is used for collecting sample document pictures of different appointed typesetting modes;

the identification template acquisition unit is used for carrying out frame selection on each sample document picture to obtain an identification template corresponding to each sample document picture;

the identification template database establishing unit is used for establishing an identification template database, and the identification template database stores identification templates corresponding to the document pictures of the samples;

the document type acquisition unit is used for acquiring a document picture to be identified, identifying the border and the title of the document picture to be identified and obtaining the document type of the document picture to be identified;

and the OCR recognition unit is used for calling a corresponding recognition template in the recognition template database according to the document type of the recognized document picture to be recognized so as to perform OCR recognition on the document picture to be recognized.

Optionally, the document type obtaining unit is further configured to:

Optionally, the OCR recognition unit is further configured to:

Specifically, the convolutional recurrent neural network model comprises a convolutional neural network (CNN), a bidirectional long short-term memory (LSTM) recurrent neural network and a connectionist temporal classification (CTC) model;

Optionally, the apparatus further comprises:

the picture adjusting unit is used for adjusting the brightness and the contrast of the document picture to be identified;

the gray processing unit is used for carrying out gray processing on the document picture to be identified;

and the angle adjusting unit is used for receiving an angle adjusting instruction of a user on the document picture to be identified after the gray level processing and adjusting the angle of the document picture to be identified.

Specifically, the identification template acquisition unit is further configured to:

According to yet another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described template matching-based OCR recognition method.

According to yet another aspect of the present application, there is provided a computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, the processor implementing the above-described template matching-based OCR recognition method when executing the program.

By means of the technical scheme, the OCR recognition method and device based on template matching, the storage medium and the computer equipment establish a recognition template database, can adapt to recognition of documents with different typesetting formats, and improve accuracy of OCR recognition.

In addition, the high-definition camera is also adopted to collect the picture of the document to be identified, so that the influence of shooting light and angles on OCR identification can be eliminated. And the specific region of a certain document picture can be identified based on the region identification template, so that the identification efficiency is improved.

The foregoing description is only an overview of the technical solutions of the present application, and may be implemented according to the content of the specification in order to make the technical means of the present application more clearly understood, and in order to make the above-mentioned and other objects, features and advantages of the present application more clearly understood, the following detailed description of the present application will be given.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

FIG. 1 shows a schematic flow chart of an OCR recognition method based on template matching according to an embodiment of the present application;

FIG. 2 shows a schematic diagram of a sample document provided by an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an OCR recognition device based on template matching according to an embodiment of the present application.

Detailed Description

The present application will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.

To address the poor recognition effect of existing OCR recognition methods, this embodiment provides an OCR recognition method based on template matching, which can adapt to recognition of documents in a plurality of different typesetting formats and improve the accuracy of OCR recognition. As shown in FIG. 1, the method comprises the following steps:

S11: collecting sample document pictures of different appointed typesetting modes;

In practical application, sample document pictures in different appointed typesetting modes are collected through a high-definition camera with an automatic shooting angle adjusting function.

S12: performing frame selection on each sample document picture to obtain an identification template corresponding to each sample document picture;

It should be noted that, in the embodiment of the present application, a sample document picture is collected, the range of its identification areas is determined, and an identification template corresponding to the document picture to be identified is established; the identification template includes the coordinate position and area name of each identification area.

It can be understood that in OCR recognition, the quality of image segmentation directly affects the recognition rate. When OCR recognition is performed on a wrongly segmented image, a correct recognition result often cannot be obtained. Therefore, the present application establishes a template for the document picture to be identified and records the coordinate position and area name of each identification area in the template. In addition, the printed areas of the document picture to be identified are not recognized; the area names in the template are used directly as the text of the printed areas, which improves recognition efficiency.
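A minimal sketch of how such a template might be represented is given below. It assumes that each identification area is an axis-aligned rectangle in pixel coordinates; the class names `Region` and `RecognitionTemplate`, the field names and the `crop` helper are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


@dataclass(frozen=True)
class Region:
    name: str  # area name, reused as the printed-label text
    box: Box   # coordinate position of the identification area


@dataclass(frozen=True)
class RecognitionTemplate:
    document_type: str
    regions: Tuple[Region, ...]


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Cut one identification area out of an image stored as rows of pixels."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


if __name__ == "__main__":
    template = RecognitionTemplate(
        document_type="bill",
        regions=(Region("applicant", (1, 0, 3, 2)),),
    )
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    for region in template.regions:
        print(region.name, crop(image, region.box))
```

Because the area name is stored with the coordinates, only the cropped area itself has to be passed to the recognizer.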

S13: establishing an identification template database, wherein the identification template database stores identification templates corresponding to the document pictures of the samples;

it can be understood that, the embodiment of the application establishes the recognition template database comprising the recognition templates corresponding to various sample document pictures, so that the document pictures to be recognized with different typesetting modes can be recognized more accurately.

S14: collecting a document picture to be identified, and identifying the border and the title of the document picture to be identified to obtain the document type of the document picture to be identified;

S15: and calling a corresponding recognition template in the recognition template database according to the document type of the document picture to be recognized, and performing OCR recognition on the document picture to be recognized.

It can be understood that, in the embodiment of the application, the identification areas and area names of the document picture to be identified are determined from the identification template matched with it, so the identification areas can be located accurately. OCR recognition is performed on the identification areas (in practice, the handwritten areas), and the area names in the identification template are used as the text of the printed areas.
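The flow of step S15 can be sketched roughly as follows. The database is modelled here as a plain dictionary keyed by document type, and the recognizer is passed in as a callback so that the example stays self-contained; `TEMPLATE_DB`, `recognize_document`, the document type key and the box values are all illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive
Image = List[List[int]]

# Hypothetical template database: document type -> {area name: box}.
TEMPLATE_DB: Dict[str, Dict[str, Box]] = {
    "money_application_form": {
        "applicant": (0, 0, 2, 1),
        "money amount": (0, 1, 2, 2),
    },
}


def recognize_document(
    image: Image,
    document_type: str,
    ocr: Callable[[Image], str],
    db: Dict[str, Dict[str, Box]] = TEMPLATE_DB,
) -> Dict[str, str]:
    """Look up the template for the document type and OCR each area.

    The area name from the template is used as the key, so the printed
    label itself is never passed to the recognizer.
    """
    template = db[document_type]  # raises KeyError for an unknown type
    result: Dict[str, str] = {}
    for name, (left, top, right, bottom) in template.items():
        patch = [row[left:right] for row in image[top:bottom]]
        result[name] = ocr(patch)
    return result


if __name__ == "__main__":
    # Stand-in recognizer that just joins pixel values; a real system
    # would call its OCR model here.
    def fake_ocr(patch: Image) -> str:
        return "".join(str(v) for row in patch for v in row)

    print(recognize_document([[1, 2], [3, 4]], "money_application_form", fake_ocr))
```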

According to the OCR recognition method based on the template matching, a recognition template database is established, the method can adapt to recognition of documents in various typesetting formats, and accuracy of OCR recognition is improved.

In an alternative implementation manner of the embodiment of the present application, similar to the method in fig. 1, step S14 identifies the border and the title of the document picture to be identified, and obtains the document type of the document picture to be identified, which includes:

It should be noted that the border of the document picture to be identified refers to the horizontal and vertical lines of the table in the document to be identified. The specific process for extracting the border of the document picture to be identified is as follows:

selecting a horizontal structuring element and a vertical structuring element respectively to perform a morphological opening operation on the tilt-corrected binarized table image, obtaining a table horizontal line image and a table vertical line image;

performing an AND operation on the table horizontal line image and the table vertical line image to obtain a table frame image;

and thinning the table frame image and extracting the skeleton of the table frame lines, that is, extracting the border of the document picture to be identified.
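The opening and AND steps above can be sketched in plain Python as follows. The sketch relies on the fact that opening a binary image with a 1 x k line element keeps exactly the pixels that lie in runs of at least k foreground pixels along that direction. The function names, the element length `k` and the 1-means-ink convention are assumptions, and the thinning step is omitted. Note that a pixelwise AND retains only pixels present in both line images, that is, the points where horizontal and vertical lines cross.

```python
from typing import List

Binary = List[List[int]]  # 1 = ink, 0 = background (assumed convention)


def open_horizontal(img: Binary, k: int) -> Binary:
    """Opening with a 1 x k horizontal element: keep runs of >= k ones per row."""
    out: Binary = []
    for row in img:
        new_row = [0] * len(row)
        start = None
        for x, v in enumerate(row + [0]):  # trailing 0 closes a final run
            if v and start is None:
                start = x
            elif not v and start is not None:
                if x - start >= k:
                    new_row[start:x] = [1] * (x - start)
                start = None
        out.append(new_row)
    return out


def transpose(img: Binary) -> Binary:
    return [list(col) for col in zip(*img)]


def open_vertical(img: Binary, k: int) -> Binary:
    """Opening with a k x 1 vertical element, via transposition."""
    return transpose(open_horizontal(transpose(img), k))


def combine_and(a: Binary, b: Binary) -> Binary:
    """Pixelwise AND of two binary images of equal size."""
    return [[p & q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


if __name__ == "__main__":
    table = [
        [0, 1, 0, 1, 0],  # the lone pixel at column 3 stands in for text
        [0, 1, 0, 0, 0],
        [1, 1, 1, 1, 1],  # horizontal table line
        [0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0],  # column 1 is a vertical table line
    ]
    horizontal = open_horizontal(table, 4)
    vertical = open_vertical(table, 4)
    print(combine_and(horizontal, vertical))
```

Choosing `k` longer than the widest character stroke is what removes text while keeping table lines.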

It is understood that the area in the upper middle of the document picture to be recognized contains its title. The major class of the document picture to be identified, such as the financial class or the object class, is determined from the title; the subclass, such as a money application form or a reimbursement approval form, can be further determined from the border. Both of these subclasses belong to the financial class but have different borders, and therefore correspond to different document types.
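One way to picture this two-level decision is the sketch below. It assumes the title has already been recognized as text and that a border can be summarized by the number of horizontal and vertical lines; that frame descriptor, the keyword table and the line counts are invented purely for illustration and are not taken from the application.

```python
from typing import Dict, Optional, Tuple

FrameSignature = Tuple[int, int]  # (horizontal lines, vertical lines), assumed descriptor

# Hypothetical keyword -> major class table, matched against the title.
TITLE_KEYWORDS: Dict[str, str] = {
    "application": "financial",
    "reimbursement": "financial",
}

# Hypothetical (major class, frame) -> document type table.
SUBCLASS_BY_FRAME: Dict[Tuple[str, FrameSignature], str] = {
    ("financial", (6, 4)): "money application form",
    ("financial", (8, 5)): "reimbursement approval form",
}


def classify_document(title: str, frame: FrameSignature) -> Optional[str]:
    """Pick the major class from the title, then the subclass from the border."""
    lowered = title.lower()
    major = next((cls for kw, cls in TITLE_KEYWORDS.items() if kw in lowered), None)
    if major is None:
        return None  # title did not match any known major class
    return SUBCLASS_BY_FRAME.get((major, frame))


if __name__ == "__main__":
    print(classify_document("Money Application Form", (6, 4)))
    print(classify_document("Reimbursement Approval Form", (8, 5)))
```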

In another optional implementation manner of the embodiment of the present application, the performing OCR recognition on the document picture to be recognized in step S15 includes:

Specifically, the label distribution list corresponding to a feature is a softmax vector that represents the probability of each label for that feature. After the probabilities of all the features are passed to the CTC model, the most probable label is output for each feature, and the final label sequence, namely the text of the identification area, is obtained through operations such as blank removal.
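The decoding described here resembles greedy (best-path) CTC decoding, which can be sketched as follows: take the most probable label at each step, collapse consecutive repeats, and drop the blank label. Using index 0 as the blank and the tiny two-letter alphabet are assumptions for the example; the application does not specify the label layout or the exact decoding procedure.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank label


def ctc_greedy_decode(probs: Sequence[Sequence[float]], alphabet: str) -> str:
    """Decode per-step softmax vectors into text.

    alphabet[i - 1] is the character for label i; label 0 is the blank.
    """
    best: List[int] = [max(range(len(p)), key=p.__getitem__) for p in probs]
    chars: List[str] = []
    prev = BLANK
    for label in best:
        # Emit a character only when it is not blank and not a repeat
        # of the label at the previous step.
        if label != BLANK and label != prev:
            chars.append(alphabet[label - 1])
        prev = label
    return "".join(chars)


if __name__ == "__main__":
    steps = [
        [0.1, 0.8, 0.1],  # a
        [0.1, 0.7, 0.2],  # a (repeat, collapsed)
        [0.9, 0.05, 0.05],  # blank
        [0.2, 0.6, 0.2],  # a (new character after the blank)
        [0.1, 0.1, 0.8],  # b
        [0.1, 0.2, 0.7],  # b (repeat, collapsed)
        [0.8, 0.1, 0.1],  # blank
    ]
    print(ctc_greedy_decode(steps, "ab"))
```

The blank between the two runs of `a` is what allows a doubled character to survive the collapsing of repeats.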

It can be understood that the handwritten characters are not as regular as the printed characters, so that the effect of OCR recognition of the handwritten characters is poor.

Preferably, the collecting the document picture to be identified includes:

It should be noted that the high-definition digital camera can adjust the ISO value or exposure according to the intensity of ambient light to improve the quality of the document picture. In a digital camera, ISO indicates the light sensitivity of the CCD or CMOS sensor; a higher ISO value indicates a more light-sensitive sensor.

In general, the lower the ISO value, the higher the quality of the photo and the finer its detail; the higher the ISO value, the brighter the photo. Photo quality falls and noise becomes more severe as the ISO value increases, but a high ISO value can compensate for insufficient light.

In addition, the high-definition camera can adjust the shooting angle according to the document position, so that skewed text in the document does not degrade the effect of OCR recognition.

OCR recognition requires a high quality input image, and often the user needs to provide a high quality image to have a good recognition quality. The resolution cannot be too low, the color cannot be too rich, the contrast cannot be too low, and the text on the image cannot be skewed.

Preferably, in the embodiment of the present application, before invoking a corresponding recognition template in the recognition template database to perform OCR recognition on the document picture to be recognized, the preprocessing of the document picture to be recognized further includes:

adjusting the brightness and contrast of the document picture to be identified;

carrying out gray processing on the document picture to be identified;

In an alternative embodiment, the present application uses the Tesseract-OCR open source framework to automatically adjust the brightness and contrast of the image; uses the open source cross-platform computer vision library OpenCV to convert the image to grayscale, so that the colors of the image become black and white and form a clear contrast; and displays an image interaction interface so that the user can manually correct the image, adjusting its angle through built-in functions provided by the application program so that the characters on the image are no longer skewed.
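To make the gray processing and contrast adjustment concrete without depending on either library, the sketch below uses two plain-Python stand-ins: a weighted-luminance grayscale conversion (with the common 0.299/0.587/0.114 weights) and a linear min-max contrast stretch. These are illustrative substitutes chosen for this example, not the calls made by the embodiment, and the function names are hypothetical.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def to_gray(image: List[List[RGB]]) -> List[List[int]]:
    """Convert RGB pixels to 0..255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in image
    ]


def stretch_contrast(gray: List[List[int]]) -> List[List[int]]:
    """Linearly rescale intensities so the darkest pixel maps to 0
    and the brightest to 255."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:
        return [row[:] for row in gray]  # flat image: nothing to stretch
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in gray]


if __name__ == "__main__":
    picture = [[(255, 255, 255), (0, 0, 0)]]
    print(to_gray(picture))
    print(stretch_contrast([[100, 120], [200, 200]]))
```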

In an alternative embodiment of the present application, the first sample document picture is configured with a plurality of region identification templates in the identification template database, and each region identification template is used for identifying a partial region of the sample document picture.

It should be noted that the first sample picture is any sample document picture in the recognition template database. By configuring a plurality of region identification templates for a certain sample document picture, the identification of partial regions of the document picture to be identified can be realized, and the efficiency of OCR identification is improved.

In practical application, taking FIG. 2 as an example, the sample document is a bill document configured with a region identification template A and a region identification template B. The identification area range of region identification template A includes the money amount, the money department and the contract number; the identification area range of region identification template B includes the money amount, the money department, the applicant and the department responsible person. Different region identification templates can be set according to actual service requirements.
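The relationship between a full template and its region identification templates can be sketched as a simple selection of named areas. The field names follow the example above, while the box coordinates, `FULL_TEMPLATE`, `REGION_TEMPLATES` and `select_regions` are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)

# Hypothetical coordinates for the areas of the bill document.
FULL_TEMPLATE: Dict[str, Box] = {
    "money amount": (10, 40, 200, 70),
    "money department": (10, 80, 200, 110),
    "contract number": (10, 120, 200, 150),
    "applicant": (210, 40, 400, 70),
    "department responsible person": (210, 80, 400, 110),
}

# Each region identification template lists the areas it covers.
REGION_TEMPLATES: Dict[str, Sequence[str]] = {
    "A": ("money amount", "money department", "contract number"),
    "B": ("money amount", "money department", "applicant",
          "department responsible person"),
}


def select_regions(full: Dict[str, Box], names: Sequence[str]) -> Dict[str, Box]:
    """Return only the areas named by one region identification template."""
    return {name: full[name] for name in names}


if __name__ == "__main__":
    for key, names in REGION_TEMPLATES.items():
        print(key, sorted(select_regions(FULL_TEMPLATE, names)))
```

Only the selected areas would then be cropped and recognized, which is where the efficiency gain comes from.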

In another embodiment of the present solution, in order to improve the efficiency of creating the template of the document picture to be identified, the frame selection may be further performed on each sample document picture so as to obtain an identification template corresponding to each sample document picture:

It can be understood that, in the embodiment of the present application, overall automatic frame selection is performed on the document picture to be identified, and the result of the overall automatic frame selection is then adjusted, so as to establish an identification template corresponding to the document picture to be identified, where the identification template includes the coordinate positions and area names of the identification areas. During automatic marking, a frame may be drawn anywhere inside a region, and its lines are automatically adjusted to the region boundary; a frame may also be drawn in the blank space outside the four boundaries of a region, and its lines automatically contract to the region boundary.

Specifically: where one region was erroneously marked as several regions in the overall automatic frame selection result, the erroneously marked regions are merged into one region; where several regions were erroneously marked as one region, the erroneously marked region is split into a plurality of regions.
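The two adjustments can be sketched on axis-aligned boxes as follows. Merging takes the bounding box of the wrongly separated boxes; splitting cuts one box at given x positions. Splitting along vertical cut lines only, and the function names, are simplifying assumptions for the example.

```python
from typing import Iterable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def merge_boxes(boxes: Sequence[Box]) -> Box:
    """Merge several wrongly separated boxes into their common bounding box."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


def split_box(box: Box, cuts: Iterable[int]) -> List[Box]:
    """Split one wrongly merged box at the given x positions.

    Cut positions outside the box are ignored.
    """
    left, top, right, bottom = box
    edges = [left] + sorted(c for c in cuts if left < c < right) + [right]
    return [(a, top, b, bottom) for a, b in zip(edges, edges[1:])]


if __name__ == "__main__":
    print(merge_boxes([(0, 0, 10, 5), (12, 0, 20, 5)]))
    print(split_box((0, 0, 20, 5), [8]))
```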

The OCR recognition method establishes the template matched with the document to be recognized, can adapt to recognition of the documents with different typesetting formats, and improves recognition rate and accuracy of OCR recognition. And the influence of shooting light and angles on OCR recognition can be eliminated by adopting the high-definition camera. The OCR recognition method can also recognize the specific region of a certain document picture based on the region recognition template, so that recognition efficiency is improved.

Fig. 3 is a schematic structural diagram of an OCR recognition device based on template matching according to an embodiment of the present application. As shown in fig. 3, the apparatus of the embodiment of the present application includes:

a sample document picture collecting unit 31 for collecting sample document pictures of different appointed typesetting modes;

in practical application, the sample document picture collecting unit 31 collects sample document pictures of different specified typesetting modes through a high-definition camera with an automatic shooting angle adjusting function.

An identification template obtaining unit 32, configured to perform frame selection on each sample document picture, and obtain an identification template corresponding to each sample document picture;

An identification template database creation unit 33, configured to create an identification template database, where identification templates corresponding to the respective sample document pictures are stored in the identification template database;

The document type obtaining unit 34 is configured to collect a document picture to be identified, and identify a border and a title of the document picture to be identified, so as to obtain a document type of the document picture to be identified.

And the OCR recognition unit 35 is configured to collect a document picture to be recognized, and call a corresponding recognition template in the recognition template database to perform OCR recognition on the document picture to be recognized.

The OCR recognition device based on the template matching establishes a recognition template database, can adapt to recognition of documents in various typesetting formats, and improves the accuracy of OCR recognition.

Optionally, the document type obtaining unit 34 is further configured to:

Optionally, the OCR recognition unit 35 is further configured to:

Optionally, the apparatus further comprises:

The picture adjusting unit automatically adjusts the brightness and contrast of the image by using the Tesseract-OCR open source framework; the gray processing unit uses the open source cross-platform computer vision library OpenCV to perform gray processing on the image, so that the colors of the image become black and white and form a clear contrast; the angle adjusting unit displays an image interaction interface so that the user can manually correct the image, adjusting its angle through built-in functions provided by the application program so that the characters on the image are no longer skewed.

Specifically, the recognition template acquiring unit 32 is further configured to:

The OCR recognition device establishes the template matched with the document to be recognized, can adapt to recognition of the documents with various typesetting formats, and improves recognition rate and accuracy of OCR recognition. And the influence of shooting light and angles on OCR recognition can be eliminated by adopting the high-definition camera. The OCR recognition method can also recognize the specific region of a certain document picture based on the region recognition template, so that recognition efficiency is improved.

It should be noted that, for other corresponding descriptions of each functional unit related to the OCR recognition device based on the template matching provided in the embodiment of the present application, reference may be made to corresponding descriptions in fig. 1 and fig. 2.

Based on the method shown in fig. 1, correspondingly, the embodiment of the application also provides a storage medium, on which a computer program is stored, and when the program is executed by a processor, the OCR recognition method based on template matching shown in fig. 1 is implemented.

Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and includes several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods described in various implementation scenarios of the present application.

Based on the method shown in fig. 1 and the virtual device embodiment shown in fig. 3, in order to achieve the above objective, the embodiment of the present application further provides a computer device, which may specifically be a personal computer, a server, a network device, etc., where the entity device includes a storage medium and a processor; a storage medium storing a computer program; a processor for executing a computer program to implement the above-described template matching-based OCR recognition method as shown in fig. 1.

Optionally, the computer device may also include a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, sensors, audio circuitry, Wi-Fi modules, and the like. The user interface may include a display screen (Display), an input unit such as a keyboard (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., Bluetooth interface, Wi-Fi interface), etc.

It will be appreciated by those skilled in the art that the architecture of a computer device provided in this embodiment is not limited to this physical device, but may include more or fewer components, or may be combined with certain components, or may be arranged in a different arrangement of components.

The storage medium may also include an operating system, a network communication module. An operating system is a program that manages the hardware and software resources of a computer device, supporting the execution of information handling programs, as well as other software and/or programs. The network communication module is used for realizing communication among all components in the storage medium and communication with other hardware and software in the entity equipment.

From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus necessary general hardware platforms, or may be implemented by hardware. By applying the technical scheme of the application.

It should be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In the description of the present application, numerous specific details are set forth. It may be appreciated, however, that embodiments of the present application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description. Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various inventive aspects. However, the method of this application should not be interpreted as reflecting the intent: i.e., the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.

Those skilled in the art will appreciate that the drawings are merely schematic illustrations of one preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to practice the present application. Those skilled in the art will also appreciate that the modules of an apparatus in an implementation scenario may be distributed in that apparatus as described for the scenario, or may be relocated, with corresponding changes, to one or more apparatuses different from that of the present implementation scenario. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.

The serial numbers used above are merely for description and do not indicate the relative merits of the implementation scenarios. The foregoing discloses merely a few specific implementation scenarios of the present application; the present application is not limited thereto, and any variation that a person skilled in the art can conceive of shall fall within the protection scope of the present application.

Claims

1. An OCR recognition method based on template matching, comprising:

collecting sample document pictures of different appointed typesetting modes;

invoking, according to the document type obtained by recognition for the document picture to be recognized, the corresponding recognition template in the recognition template database to perform OCR recognition on the document picture to be recognized;

wherein identifying the border and the title of the document picture to be identified to obtain the document type of the document picture to be identified comprises:

obtaining the document type of the document picture to be identified according to the border and the title of the document picture to be identified;

wherein performing OCR recognition on the document picture to be identified comprises:

performing OCR (optical character recognition) on the document picture to be recognized by adopting a convolutional recurrent neural network model;

wherein performing frame selection on each sample document picture comprises:

2. The method of claim 1, wherein the convolutional recurrent neural network model comprises a convolutional neural network (CNN), a bidirectional long short-term memory (LSTM) recurrent neural network, and a connectionist temporal classification (CTC) model;
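
For illustration only, the final CTC stage of such a model can be sketched as a best-path (greedy) decoder: the most likely label is taken for each frame, consecutive repeats are merged, and blank labels are dropped. This is a minimal sketch and not the claimed implementation; the function name `ctc_greedy_decode`, the blank index 0, and the toy two-letter alphabet are assumptions introduced for the example, and the per-frame scores stand in for the output of the CNN and LSTM layers.

```python
from typing import List, Sequence

BLANK = 0  # assumption: label 0 is the CTC blank, labels 1..N map to alphabet[0..N-1]


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring label index for every time step.
    best_path: List[int] = [
        max(range(len(scores)), key=lambda i: scores[i]) for scores in frame_scores
    ]

    decoded: List[str] = []
    previous = BLANK
    for label in best_path:
        # Emit a character only when the label changes and is not the blank.
        if label != previous and label != BLANK:
            decoded.append(alphabet[label - 1])
        previous = label
    return "".join(decoded)


# Hypothetical per-frame scores over the labels (blank, 'a', 'b').
example_scores = [
    [0.10, 0.80, 0.10],  # 'a'
    [0.10, 0.70, 0.20],  # 'a' again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank, separates the two 'a' characters
    [0.20, 0.60, 0.20],  # 'a'
    [0.10, 0.20, 0.70],  # 'b'
    [0.10, 0.10, 0.80],  # 'b' again, merged
]
decoded_text = ctc_greedy_decode(example_scores, "ab")
```

The blank frame between the second and fourth frames is what allows a doubled character to survive the merge step, which is why the blank label is part of the CTC output vocabulary.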

3. The method of claim 1, wherein prior to invoking the corresponding recognition template in the recognition template database to OCR recognize the document picture to be recognized, the method further comprises:

adjusting the brightness and contrast of the document picture to be identified;

converting the document picture to be identified to grayscale;
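
The two preprocessing steps recited above can be sketched as follows. This is a hedged illustration rather than the claimed implementation: the linear mapping `contrast * value + brightness`, the ITU-R BT.601 luma weights used for the grayscale conversion, and the function names are all assumptions made for the example, and the picture is represented as nested lists of RGB tuples so that the sketch needs no external library.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (red, green, blue), each in 0..255
ColorImage = List[List[Pixel]]
GrayImage = List[List[int]]


def clamp(value: float) -> int:
    """Round to the nearest integer and limit the result to the 0..255 range."""
    return max(0, min(255, int(round(value))))


def adjust_brightness_contrast(
    image: ColorImage, contrast: float = 1.0, brightness: float = 0.0
) -> ColorImage:
    """Apply the linear mapping contrast * value + brightness to every channel."""
    return [
        [tuple(clamp(contrast * channel + brightness) for channel in pixel) for pixel in row]
        for row in image
    ]


def to_gray(image: ColorImage) -> GrayImage:
    """Convert RGB pixels to gray levels using the BT.601 luma weights (assumed)."""
    return [
        [clamp(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


# A hypothetical one-row picture with two pixels.
picture: ColorImage = [[(100, 100, 100), (250, 0, 0)]]
adjusted = adjust_brightness_contrast(picture, contrast=1.2, brightness=10)
gray = to_gray(adjusted)
```

The steps are applied in the order in which they are recited: brightness and contrast are adjusted first, and the adjusted picture is then reduced to a single gray channel.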

4. The method of claim 1, wherein a first sample document picture is configured with a plurality of region identification templates in the identification template database, each region identification template for identifying a partial region of the sample document picture.
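
As a hedged illustration of how several region identification templates could be stored per document type and applied to a picture, consider the sketch below. The names `RegionTemplate`, `TEMPLATE_DB`, and `regions_for`, the document type "invoice", and the field names and box coordinates are all hypothetical; the picture is a nested list of gray values, and the recognition of each cropped region is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

GrayImage = List[List[int]]


@dataclass(frozen=True)
class RegionTemplate:
    """One frame-selected region of a sample document picture (hypothetical layout)."""
    field_name: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom); right and bottom exclusive


# Hypothetical identification template database keyed by document type.
TEMPLATE_DB: Dict[str, List[RegionTemplate]] = {
    "invoice": [
        RegionTemplate("title", (0, 0, 4, 1)),
        RegionTemplate("amount", (2, 1, 4, 2)),
    ],
}


def crop(image: GrayImage, box: Tuple[int, int, int, int]) -> GrayImage:
    """Cut the rectangular region described by box out of the picture."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def regions_for(document_type: str, image: GrayImage) -> Dict[str, GrayImage]:
    """Look up the templates for a document type and crop each partial region."""
    templates = TEMPLATE_DB.get(document_type)
    if templates is None:
        raise KeyError(f"no identification template for document type {document_type!r}")
    return {template.field_name: crop(image, template.box) for template in templates}


# A hypothetical 2 x 4 picture of gray values.
picture: GrayImage = [[1, 2, 3, 4], [5, 6, 7, 8]]
regions = regions_for("invoice", picture)
```

Each cropped region would then be passed to the OCR model separately, so that a single sample layout can carry several independently recognized fields.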

5. An OCR recognition device based on template matching, comprising:

the OCR recognition unit is used for invoking, according to the document type obtained by recognition for the document picture to be recognized, the corresponding recognition template in the recognition template database to perform OCR recognition on the document picture to be recognized;

the document type acquisition unit is configured to:

the OCR recognition unit is used for:

the identification template acquisition unit is used for:

6. A storage medium having stored thereon a computer program, which when executed by a processor implements the template matching based OCR recognition method of any one of claims 1 to 4.

7. A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, characterized in that the processor implements the template matching based OCR recognition method of any one of claims 1 to 4 when executing the computer program.