CN114202765A - Image text recognition method and storage medium

Image text recognition method and storage medium

Info

Publication number
CN114202765A
CN114202765A
Authority
CN
China
Prior art keywords
text
information
image
model
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111330318.2A
Other languages
Chinese (zh)
Inventor
陈江海
梁懿
苏江文
卢伟龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Information and Telecommunication Co Ltd
Fujian Yirong Information Technology Co Ltd
Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Original Assignee
State Grid Information and Telecommunication Co Ltd
Fujian Yirong Information Technology Co Ltd
Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Information and Telecommunication Co Ltd, Fujian Yirong Information Technology Co Ltd, and Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority to CN202111330318.2A
Publication of CN114202765A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to an image text recognition method and a storage medium, wherein the method comprises the following steps: S1: receiving first image text information, the image text information comprising first text information and background image information; S2: extracting the first text information and the background image information, and determining parameter information corresponding to the first text information; S3: acquiring one or more pieces of character information from a character database, and processing the acquired character information with the parameter information corresponding to the first text information to obtain second text information; S4: synthesizing the second text information and the background image information into second image text information, and inputting the second image text information into a text detection model for training. With this scheme, the amount of training data for the detection model can be effectively expanded, and the accuracy of the trained model in text detection improved.

Description

Image text recognition method and storage medium
Technical Field
The present invention relates to the field of image recognition, and in particular, to an image text recognition method and a storage medium.
Background
Image text recognition, also known as OCR (Optical Character Recognition), is a technology that recognizes the words in an image and returns them as text. OCR has evolved through several stages, from early systems that could only recognize printed English in a given font to today's systems that can recognize the characters of many languages, including handwriting.
In recent years, with the rapid development of artificial intelligence, the demand for recognizing the text content of digitized documents, such as scans and photographed images, has kept growing. Image text recognition has become an essential step in the intelligent processing of unstructured files, and it plays a vital role in many business domains, such as marketing archive recognition, audit document recognition, engineering file recognition, and electronic license recognition.
At present, several methods exist for general-purpose image text recognition, but they suffer from defects such as low recognition accuracy, low recognition speed, inability to recognize curved text, and lack of support for mixed multi-language recognition.
For example, patent application CN201911221023.4, entitled "A method for recognizing text of images in natural scenes based on pruning depth model", proposes an image text recognition method that performs feature extraction with Darknet, performs target detection with YoloV3 to locate the bounding boxes of text areas in an image, and then performs recognition. This solution has the following drawbacks: 1. adopting Darknet as the backbone results in a slow overall recognition speed; 2. YoloV3 is designed for general object detection scenarios, so its text detection accuracy is mediocre. In short, the scheme clearly suffers from low recognition speed and low recognition accuracy.
For another example, patent application CN202110584533.9, entitled "Optical character fast recognition method and system", proposes a fast character recognition method with the following basic steps: (1) text detection with the DB algorithm; (2) text recognition with the CRNN algorithm. This scheme represents the currently common OCR approach: detect with DB, recognize with CRNN. It has the following drawbacks: 1. the DB algorithm performs single-line text detection, so multi-line text requires multiple detection boxes recognized separately; its recognition rate on overlong text (e.g., longer than 25 characters) is low and needs additional measures (e.g., a sliding window) to improve accuracy; 2. although the CRNN algorithm predicts somewhat faster, its recognition accuracy is clearly lower than that of algorithms such as SRN and NRTR.
Disclosure of Invention
Therefore, a technical scheme for image text recognition is needed to solve the problems of low recognition rate and low speed in existing image text recognition methods.
To achieve the above object, in a first aspect, the present application provides an image text recognition method, including the steps of:
S1: receiving first image text information; the image text information comprises first text information and background image information;
S2: extracting the first text information and the background image information, and determining parameter information corresponding to the first text information;
S3: acquiring one or more pieces of character information from a character database, and processing the acquired character information with the parameter information corresponding to the first text information to obtain second text information;
S4: synthesizing the second text information and the background image information into second image text information, and inputting the second image text information into a text detection model for training.
As an alternative embodiment, step S3 includes:
randomly acquiring one or more pieces of character information from the character database, repeating the acquisition multiple times, and processing the acquired pieces of character information with the parameter information corresponding to the first text information to obtain multiple pieces of second text information.
As an alternative embodiment, the parameter information includes any one or more of a font, a font size, a font style, a color, a typesetting mode, and a decoration effect.
As an alternative embodiment, the first image text information includes any one or more of invoice data, tickets, business licenses, electronic itineraries, identity cards, social security cards, and bank cards.
As an alternative embodiment, the text detection model is a ResNet50_vd and SAST algorithm detection model; specifically: ResNet50_vd is adopted as the network structure, and the fully connected layer in the network structure is replaced with an FCN fully convolutional layer.
As an alternative embodiment, the loss function of the text detection model is as follows: L_total = λ1·L_tcl + λ2·L_tco + λ3·L_tvo + λ4·L_tbo
where λ1, λ2, λ3 and λ4 are weight values, and tcl, tco, tvo and tbo denote four feature maps; tcl denotes the text area where the first text information is located; tco, tvo and tbo denote pixel offsets relative to tcl; specifically: the tco feature map is the text pixel center offset relative to the tcl feature map; the tvo feature map is the pixel offset relative to the four bounding-box vertices of the text in the tcl feature map; and tbo is the offset relative to the upper and lower bounds of the tcl feature map.
As an alternative embodiment, λ1 = 1.0, λ2 = 0.5, λ3 = 0.5 and λ4 = 1.0.
As an alternative embodiment, step S4 is followed by step S5:
inputting the output result of the text detection model into a text recognition model for training; the text recognition model is a ResNet50_vd_fpn and SRN algorithm recognition model.
As an alternative embodiment, the method further comprises:
optimizing the trained model, specifically comprising: distilling, quantizing and pruning the trained model in sequence to obtain the final model.
In a second aspect, the present application also provides a storage medium storing a computer program which, when executed by a processor, performs the method steps as in the first aspect of the present application.
Different from the prior art, the image text recognition method and storage medium of the invention comprise the following steps: S1: receiving first image text information, the image text information comprising first text information and background image information; S2: extracting the first text information and the background image information, and determining parameter information corresponding to the first text information; S3: acquiring one or more pieces of character information from a character database, and processing the acquired character information with the parameter information corresponding to the first text information to obtain second text information; S4: synthesizing the second text information and the background image information into second image text information, and inputting the second image text information into a text detection model for training. In this scheme, the second text information is generated by expansion from the first text information, then synthesized with the background image information into second image text information and fed to the text detection model for training, which effectively increases the amount of training data and thereby improves the accuracy of the trained model in text detection.
Drawings
Fig. 1 is a flowchart of an image text recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for recognizing text in an image according to another embodiment of the present invention;
FIG. 3 is a flow chart of a method for image text recognition according to another embodiment of the present invention;
FIG. 4 is a flow chart of model training according to an embodiment of the present invention;
FIG. 5 is a flow chart of model optimization according to an embodiment of the present invention;
FIG. 6 is a flow chart of predictive identification according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a SAST algorithm according to an embodiment of the present invention;
fig. 8 is a schematic diagram of a GSRM model structure according to an embodiment of the present invention.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
Fig. 1 is a flowchart illustrating an image text recognition method according to an embodiment of the present invention. The method comprises the following steps: S1: receiving first image text information, the image text information comprising first text information and background image information; S2: extracting the first text information and the background image information, and determining parameter information corresponding to the first text information; S3: acquiring one or more pieces of character information from a character database, and processing the acquired character information with the parameter information corresponding to the first text information to obtain second text information; S4: synthesizing the second text information and the background image information into second image text information, and inputting the second image text information into a text detection model for training.
As an alternative embodiment, the parameter information includes any one or more of a font, a font size, a font style, a color, a typesetting mode, and a decoration effect. The typesetting mode includes the spacing between characters, the line spacing, and so on; the decoration effect can be an effect added on top of the characters, such as a shadow.
As an alternative embodiment, the first image text information includes any one or more of invoice data, tickets, business licenses, electronic itineraries, identity cards, social security cards, and bank cards. Of course, in other embodiments, the first image text information may also be other image data containing text.
In the present application, the character database refers to a dictionary database containing multiple characters; text information refers to data containing one or more characters; and background image information refers to the background corresponding to the position of the text information on the image. In general, detecting text information on an image means first detecting the text box in which the text lies and then recognizing the characters inside it, so the background image information of the application can be the background left after subtracting the character information from the text box.
In the above scheme, training data can be expanded by acquiring one or more pieces of character information from the character database and processing them with the parameter information corresponding to the first text information to obtain the second text information. Meanwhile, since the parameter information of the second text information is fully consistent with that of the first text information, feeding the synthesized data into the text detection model for training greatly improves the model's recognition speed on text with this type of parameter information and raises the accuracy of text recognition, with a remarkable effect.
As an alternative embodiment, as shown in fig. 2, step S3 comprises: S31: randomly acquiring one or more pieces of character information from the character database, repeating the acquisition multiple times, and processing the acquired pieces of character information with the parameter information corresponding to the first text information to obtain multiple pieces of second text information. Step S4 comprises: S41: combining each piece of second text information with the background image information to obtain multiple pieces of second image text information, and inputting them into the text detection model for training. Because the character information needed to generate the second text information is randomly acquired from the character database, and the acquired characters are each time synthesized with the corresponding background image information from the first image text information, multiple pieces of second text information with the same style as the first text information can be obtained; and since the second text information and the first image text information share a similar background, the trained model's rapid recognition of this type of character information is greatly enhanced.
In the present application, the number of characters contained in the second text information may be the same as or different from the number contained in the first text information. Preferably, when generating the second text information, the same number of characters as contained in the first text information can be randomly acquired from the character database, as in the sketch below.
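The following is a minimal sketch of this synthesis step: random characters are sampled from the dictionary database, rendered with the style parameters extracted from the first text information, and drawn onto the background patch. The parameter names (font_path, origin, shadow, etc.) and the contents of char_db are illustrative assumptions, not the patent's exact interface.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def synthesize_sample(background: Image.Image, char_db: list,
                      params: dict) -> Image.Image:
    """Compose one piece of 'second image text information'."""
    n_chars = params.get("num_chars", 8)  # e.g., match the source text length
    text = "".join(random.choices(char_db, k=n_chars))
    font = ImageFont.truetype(params["font_path"], params["font_size"])

    sample = background.copy()
    draw = ImageDraw.Draw(sample)
    x, y = params.get("origin", (0, 0))
    if params.get("shadow"):  # optional decoration effect
        draw.text((x + 2, y + 2), text, font=font, fill=(64, 64, 64))
    draw.text((x, y), text, font=font, fill=params.get("color", (0, 0, 0)))
    return sample

# Repeating the random sampling step multiplies the training set:
# samples = [synthesize_sample(bg, chars, style) for _ in range(1000)]
```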
As an alternative embodiment, in the present application, the text detection model is a ResNet50_vd and SAST algorithm detection model; specifically: ResNet50_vd is adopted as the network structure, and the fully connected layer in the network structure is replaced with an FCN fully convolutional layer.
For the text detection model, this application can adopt the combination of ResNet50_vd (as the backbone) and the SAST algorithm. Verification on multiple training sets shows that this combination is clearly superior to common text detection models that use ResNet34_vd, MobileNetV3 and the like as network structures combined with algorithms such as DB and EAST.
The main principle of SAST is shown in fig. 7. Specifically: ResNet50_vd is used as the network structure, the last fully connected layer is replaced with an FCN fully convolutional layer, and a semantic segmentation result of the same size as the original image is output. Feature points from feature maps at different levels are fused several times (e.g., three times) with an FPN algorithm, so that the feature network can carry more information about objects of different sizes; see the sketch below.
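A sketch of such top-down FPN-style fusion in PyTorch. The channel counts are typical ResNet50 stage widths and are assumptions; the actual network fuses SAST-specific levels.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNFusion(nn.Module):
    """Fuse three backbone feature levels top-down, as described above."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.smooth = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2)
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2)
        return self.smooth(p3)  # fused map feeding the FCN segmentation head
```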
Preferably, the output of the SAST network is divided into four parts: the tcl, tco, tvo and tbo feature maps. The loss function of the text detection model is as follows: L_total = λ1·L_tcl + λ2·L_tco + λ3·L_tvo + λ4·L_tbo, where λ1, λ2, λ3 and λ4 are weight values and tcl, tco, tvo and tbo denote the four feature maps; tcl denotes the text area where the first text information is located; tco, tvo and tbo denote pixel offsets relative to tcl; specifically: the tco feature map is the text pixel center offset relative to the tcl feature map; the tvo feature map is the pixel offset relative to the four bounding-box vertices of the text in the tcl feature map; and tbo is the offset relative to the upper and lower bounds of the tcl feature map.
Preferably, λ1 = 1.0, λ2 = 0.5, λ3 = 0.5 and λ4 = 1.0. In this application, λ1, λ2, λ3 and λ4 balance the four tasks, i.e., they make the tasks equally important in the model, so {1.0, 0.5, 0.5, 1.0} is chosen so that the four loss gradients contribute comparably during backpropagation, as in the sketch below.
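A minimal sketch of this weighted combination; the four per-task losses are assumed to be scalars already computed elsewhere by the detection head.

```python
# Four-task SAST loss combination with the weights given above.
def total_loss(l_tcl, l_tco, l_tvo, l_tbo, lambdas=(1.0, 0.5, 0.5, 1.0)):
    l1, l2, l3, l4 = lambdas
    return l1 * l_tcl + l2 * l_tco + l3 * l_tvo + l4 * l_tbo
```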
As shown in fig. 3, step S4 is followed by step S5: inputting the output result of the text detection model into a text recognition model for training; the text recognition model is a ResNet50_vd_fpn and SRN algorithm recognition model.
In this application, ResNet50_vd_fpn (as the backbone) and the SRN algorithm are adopted as the network structure and algorithm for text recognition. Verification on several public data sets shows that the effect is clearly superior to common text recognition models that use ResNet34_vd, MobileNetV3 and the like as network structures combined with algorithms such as CRNN, Rosetta, StarNet and RARE.
The main steps of an SRN are generally as follows: first, sequence features are re-encoded using the reading/writing order of the characters to obtain a preliminary recognition result; the preliminary result is then re-integrated into the sequence features, i.e., the model judges at the global level whether the result is correct, decides whether fine-tuning is needed, and obtains the recognition result again. An SRN generally consists of four parts: a backbone network, a Parallel Visual Attention Module (PVAM), a Global Semantic Reasoning Module (GSRM), and a Visual-Semantic Fusion Decoder (VSFD). The present invention adopts ResNet50_vd_fpn as the backbone network of the SRN. The PVAM is used to generate N aligned one-dimensional features G, where each feature corresponds to one character in the text and captures the aligned visual information. These N one-dimensional features G are then fed into the GSRM to capture semantic information S. Finally, the VSFD fuses the aligned visual features G and the semantic information S to predict the N characters. The GSRM model structure is shown in fig. 8. This data flow is sketched below.
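Under assumed module interfaces (the PVAM, GSRM and VSFD internals are stand-ins; only the data flow follows the description above), the forward pass looks roughly like this:

```python
import torch.nn as nn

class SRN(nn.Module):
    """Schematic SRN: backbone -> PVAM -> GSRM -> VSFD, as described above."""
    def __init__(self, backbone, pvam, gsrm, vsfd):
        super().__init__()
        self.backbone, self.pvam = backbone, pvam
        self.gsrm, self.vsfd = gsrm, vsfd

    def forward(self, image):
        feats = self.backbone(image)   # ResNet50_vd_fpn feature maps
        g = self.pvam(feats)           # N aligned one-dimensional features G
        s = self.gsrm(g)               # global semantic information S
        return self.vsfd(g, s)         # fused prediction of the N characters
```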
After text detection and text recognition are completed, the method further evaluates each trained text recognition model with the data held out from each random sample, obtaining evaluation data for each model; the model hyperparameters are then continually adjusted and the text detection/text recognition steps repeated until the best evaluation index is obtained, at which point the hyperparameters are fixed, and the model set with the highest evaluation index is saved as the initially available model.
Specifically, as shown in fig. 4, this proceeds through the following steps: first, step S41, training data preparation; then step S42, data expansion, which may be performed according to the method shown in fig. 1; then step S43, text detection/text recognition model training; then step S44, model evaluation; and finally step S45, initial model release. Through steps S41-S45, an initial training model is obtained.
As an optional embodiment, after the initial training model is obtained, it may be optimized, specifically: distilling, quantizing and pruning the trained model in sequence to obtain the final model.
As shown in fig. 5, the model optimization method is specifically as follows:
the process first proceeds to step S51 where an initial model is input. Specifically, an initial text detection model and a text recognition module obtained after last training are used as input models of the model optimization step.
Then comes step S52, model distillation. In this application, distillation uses transfer learning: the output of a pre-trained complex model (the teacher model) serves as the supervision signal for training another, simpler network (the student model). The goal of model distillation is to let the student model learn the generalization ability of the teacher model; the final result is better than a student model that merely fits the training data. Meanwhile, because the student model adopts a lightweight backbone, the model file size can be greatly reduced and the prediction speed improved. In this application, the model trained as described above is used as the teacher model, MobileNetV3 as the backbone of the student model, and softmax_with_cross_entropy_loss as the distillation loss function.
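One plausible reading of that loss, offered as a sketch rather than the patent's exact implementation: the student is trained against the teacher's softened class distribution. The temperature T is an assumed hyperparameter.

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    # Soft cross-entropy between student predictions and teacher soft targets.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean() * (T * T)
```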
Then comes step S53, model quantization. Model quantization addresses the large parameter count, heavy computation and large memory footprint of existing convolutional neural networks by quantizing the network: reducing the parameters improves speed and lowers memory use, with the ultimate goal of shrinking the model file, reducing memory occupation and improving prediction speed. In this application, a BNN algorithm is adopted for model quantization: during the forward and backward passes of training, binary weights replace the floating-point weights and activation values. The quantization formula is given as an image in the original specification (Figure BDA0003348563250000081), where x_n denotes the value of the n-th bit, represented with 8 bits.
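Since the formula itself survives only as an image, the following shows the standard BNN binarization rule (a sign function with a straight-through estimator for the gradient), offered as a plausible reconstruction rather than the patent's exact formula:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Binarize values to +/-1; pass gradients through inside [-1, 1]."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()  # straight-through estimator
```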
Then comes step S54, model pruning. Model pruning judges the importance of parameters through sensitivity analysis of the trained model's parameters and prunes the unimportant connections or filters, reducing the redundancy of the model and thereby shrinking the model file and improving prediction speed. Since most neuron activations tend toward zero and zero-activated neurons are redundant, eliminating them can greatly reduce the size and computation of the model without affecting its performance. The number of zero activations in each filter is measured by the variable APoZ (Average Percentage of Zeros) as the criterion for evaluating whether a filter is important. APoZ is defined as follows:
APoZ_c^(i) = ( Σ_k Σ_j f( O_c^(i)(k, j) = 0 ) ) / (N × M)
(the original renders this definition as an image, Figure BDA0003348563250000091; in the standard formulation, O_c^(i) is the output of the c-th filter in the i-th layer, f(·) equals 1 when its condition holds and 0 otherwise, N is the number of validation examples and M is the size of the output feature map).
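A small sketch of how APoZ can be measured per filter from post-ReLU activations, under the standard definition above:

```python
import torch

def apoz(activations: torch.Tensor) -> torch.Tensor:
    """activations: (batch, channels, H, W) post-ReLU feature maps.
    Returns one Average-Percentage-of-Zeros value per filter/channel."""
    zeros = (activations == 0).float()
    return zeros.mean(dim=(0, 2, 3))  # high APoZ -> candidate for pruning
```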
and then proceeds to step S55 for final model release. After the initial model is processed through steps S52-S54, a finally usable text detection model and a text recognition model can be obtained.
After the final model is obtained, recognition and prediction on images can be carried out. As shown in fig. 6, this comprises the following steps:
the flow first proceeds to step S61 to input an image to be recognized. The image to be recognized is particularly an image containing text information.
The process then proceeds to step S62, classification adjustment according to image orientation. This step may be implemented by an image orientation classifier that identifies whether the input image has a rotation angle, e.g., 90, 180 or 270 degrees, and automatically corrects it. Extensive practice shows that directly feeding a rotated image into the model greatly degrades recognition, because such data was not included when the model's training data was collected, and adding it would enlarge the training data fourfold and lengthen the overall training time. Judging and adjusting the image orientation therefore effectively improves the accuracy of the final text recognition. Preferably, a CNN image classification algorithm is used for this classification; a sketch follows.
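A hypothetical correction step built on such a four-way classifier; the classifier itself, the angle convention, and the preprocessing are assumptions.

```python
from PIL import Image
from torchvision.transforms.functional import to_tensor

ANGLES = [0, 90, 180, 270]  # classes of the assumed orientation classifier

def correct_orientation(img: Image.Image, classifier) -> Image.Image:
    logits = classifier(to_tensor(img).unsqueeze(0))
    idx = logits.argmax(dim=1).item()
    return img.rotate(-ANGLES[idx], expand=True)  # undo the detected rotation
```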
Then comes step S63, text detection. Specifically, the image is fed to the text detection model, which returns the set of regions where text information is located, i.e., the aforementioned text box information.
Then comes step S64, text recognition. Specifically, each text box is fed to the text recognition model, which returns the text information inside it, i.e., the text is extracted from the text box as described above.
Then comes step S65, output. The text recognition result may be displayed on the display unit. The overall prediction flow is sketched below.
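Putting steps S61-S65 together under assumed model interfaces (the detector returns box coordinates, the recognizer returns the string for a cropped region; correct_orientation is the sketch above):

```python
def recognize_image(img, orient_clf, detector, recognizer):
    img = correct_orientation(img, orient_clf)  # S62: orientation adjustment
    boxes = detector(img)                       # S63: text box detection
    results = []
    for box in boxes:                           # S64: recognize each region
        crop = img.crop(box)
        results.append((box, recognizer(crop)))
    return results                              # S65: output (box, text) pairs
```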
In a second aspect, the present application also provides a storage medium storing a computer program which, when executed by a processor, performs the method steps as in the first aspect of the present application.
Preferably, the processor is an electronic component with a data processing function, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP) or a System on Chip (SoC).
Preferably, the storage medium is an electronic component with a data storage function, including but not limited to: RAM, ROM, magnetic disks, magnetic tapes, optical disks, flash memory, USB flash drives, removable hard disks, memory cards, memory sticks, etc.
The invention provides a deep-learning-based image text recognition algorithm that builds end-to-end text detection and recognition capability from an image classification model, a text block detection model and a text recognition model, and augments training data through a custom image enhancement scheme. In addition, a series of model optimization strategies greatly improves recognition accuracy and speed.
The method of the present application has the following three advantages:
(1) The method adopts a two-stage recognition approach based on deep learning, combined with a unique data enhancement scheme, greatly improving the accuracy of text detection and text recognition.
In existing two-stage deep-learning image text recognition methods, data is often the key factor limiting the model's final metrics, because training data is hard to collect and labeling is time-consuming. Conventional image data enhancement measures, such as randomly adjusting brightness, randomly adjusting contrast, or Gaussian blur, do not improve model metrics in the image text recognition task. This scheme therefore proposes a new data enhancement approach: the text foreground style and the picture background are extracted from the existing training data, and new random text is fused with that foreground style and background to generate new training data. Extensive practical verification shows that this kind of data enhancement typically improves the final recognition accuracy by more than 10%.
(2) A series of model compression algorithms reduces the model size and improves the prediction speed.
To pursue high accuracy, prior-art text detection and recognition models are often trained with deep residual networks, such as ResNet50 or ResNet101, as the backbone, which yields large model files and slow prediction. To improve prediction speed and shrink the model file while preserving accuracy as much as possible, the invention adopts several compression methods, such as L1NormFilterPruner (L1-norm statistics) and Embedding quantization, to greatly increase prediction speed and reduce model size.
(3) A model distillation algorithm improves the generalization ability of the model and ultimately its accuracy.
A deep learning model cannot perform well in later practical applications by merely fitting the training data; it performs optimally only by learning to generalize to new data (i.e., generalization ability). The goal of model distillation is to let the student model (the new model) learn the generalization ability of the teacher model (the original model or a model ensemble), yielding better results than a student that simply fits the training data. In the invention, ResNet101 is used as the teacher network for distillation training, and the model is distilled after training, which effectively improves the generalization ability of the model and ultimately its accuracy.
It should be noted that although the above embodiments have been described herein, the invention is not limited thereto. Therefore, based on the innovative concepts of the present invention, changes and modifications to the embodiments described herein, or equivalent structures or processes derived from the content of this specification and the accompanying drawings, whether applied directly or indirectly to other related technical fields, all fall within the scope of the present invention.

Claims (10)

1. An image text recognition method, characterized in that the method comprises the steps of:
s1: receiving first image text information; the image text information comprises first text information and background image information;
s2: extracting the first text information and the background image information, and determining parameter information corresponding to the first text information;
s3: acquiring one or more pieces of character information from a character database, and processing the acquired character information by adopting parameter information corresponding to the first text information to obtain second text information;
s4: and synthesizing the second text information and the background image information into second image text information, and inputting the second image text information into a text detection model for training.
2. The image text recognition method according to claim 1, wherein step S3 includes:
randomly acquiring one or more pieces of character information from a character database, repeating the acquisition multiple times, and processing the acquired character information with the parameter information corresponding to the first text information to obtain multiple pieces of second text information.
3. The image text recognition method according to claim 1 or 2, wherein the parameter information includes any one or more of a font, a font size, a font style, a color, a layout style, and a decoration effect.
4. The image text recognition method according to claim 1 or 2, wherein the first image text information includes any one or more of invoice data, tickets, business licenses, electronic itineraries, identification cards, social security cards, and bank cards.
5. The image text recognition method of claim 1, wherein the text detection model is a ResNet50_vd and SAST algorithm detection model; specifically: ResNet50_vd is adopted as a network structure, and a fully connected layer in the network structure is replaced with an FCN fully convolutional layer.
6. The image text recognition method of claim 5, wherein the loss function of the text detection model is as follows: L_total = λ1·L_tcl + λ2·L_tco + λ3·L_tvo + λ4·L_tbo
where λ1, λ2, λ3 and λ4 are weight values, and tcl, tco, tvo and tbo denote four feature maps; tcl denotes the text area where the first text information is located; tco, tvo and tbo denote pixel offsets relative to tcl; specifically: the tco feature map is the text pixel center offset relative to the tcl feature map; the tvo feature map is the pixel offset relative to the four bounding-box vertices of the text in the tcl feature map; and tbo is the offset relative to the upper and lower bounds of the tcl feature map.
7. The image text recognition method of claim 6, wherein λ1 = 1.0, λ2 = 0.5, λ3 = 0.5 and λ4 = 1.0.
8. The image text recognition method of claim 1, wherein step S4 is followed by step S5:
inputting the output result of the text detection model into a text recognition model for training; the text recognition model is a ResNet50_vd_fpn and SRN algorithm recognition model.
9. The image text recognition method of claim 1, wherein the method further comprises:
optimizing the trained model, specifically comprising: distilling, quantizing and pruning the trained model in sequence to obtain the final model.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, carries out the method steps of any one of claims 1 to 9.
CN202111330318.2A, filed 2021-11-11, priority 2021-11-11: Image text recognition method and storage medium (Pending; published as CN114202765A)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111330318.2A CN114202765A (en) 2021-11-11 2021-11-11 Image text recognition method and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111330318.2A CN114202765A (en) 2021-11-11 2021-11-11 Image text recognition method and storage medium

Publications (1)

Publication Number Publication Date
CN114202765A 2022-03-18

Family

ID=80647285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111330318.2A Pending CN114202765A (en) 2021-11-11 2021-11-11 Image text recognition method and storage medium

Country Status (1)

Country Link
CN (1) CN114202765A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115393872A (en) * 2022-10-27 2022-11-25 腾讯科技(深圳)有限公司 Method, device and equipment for training text classification model and storage medium
CN115393872B (en) * 2022-10-27 2023-01-17 腾讯科技(深圳)有限公司 Method, device and equipment for training text classification model and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination