CN114694133A - Text recognition method based on combination of image processing and deep learning - Google Patents

Text recognition method based on combination of image processing and deep learning

Info

Publication number
CN114694133A
CN114694133A (application CN202210600210.9A)
Authority
CN
China
Prior art keywords
image
neural network
layer
deep learning
text recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210600210.9A
Other languages
Chinese (zh)
Other versions
CN114694133B (en)
Inventor
陈大龙
舒成成
陆慧雯
孟维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Howso Technology Co ltd
Original Assignee
Nanjing Howso Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Howso Technology Co ltd filed Critical Nanjing Howso Technology Co ltd
Priority to CN202210600210.9A
Publication of CN114694133A
Application granted
Publication of CN114694133B
Legal status: Active (Current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a text recognition method based on the combination of image processing and deep learning, which comprises the following steps. S1 data collection: collecting data to form a data set. S2 data set processing: inputting the data set, performing horizontal projection on the images in the data set, and performing template matching on the data in the data set to obtain a matching result image. S3 text recognition with a deep learning CRNN model: constructing a convolutional neural network, extracting features from the matching result image obtained in step S2 with the constructed network, detecting the extracted feature sequence, and then transcribing the feature sequence to obtain the final recognition result. Character recognition in special scenes is achieved by combining traditional image processing with deep learning.

Description

Text recognition method based on combination of image processing and deep learning
Technical Field
The invention relates to the technical field of deep learning, in particular to a text recognition method based on combination of image processing and deep learning.
Background
With the rapid development of communication and internet technologies, people can acquire information more quickly and conveniently through mobile devices, and this information tends to be fragmented and diversified, taking the form of text, pictures, videos and so on; among these, pictures are the most common way to acquire information quickly. Natural scene pictures often contain text areas, and such areas carry more semantic information than other graphic regions, for example shop signboards and road signs.
A text area not only helps a computer vision system detect and identify the region, but can also provide deeper information such as geographic coordinates or shop attributes. When a computer vision system tries to understand the deep semantic information of an image, the characters in the image can be exploited, which greatly advances image processing; recognizing characters in a scene is therefore a very important and popular research direction in the field of computer vision.
However, existing recognition methods suffer from problems such as poor recognition of specialized scene text, limited sequence length, high computation and storage cost, and inaccurate recognition in complex scenes.
Disclosure of Invention
The invention provides a text recognition method based on the combination of image processing and deep learning that achieves character recognition in special scenes, solves the problem of difficult recognition when samples are scarce, and offers high recognition speed and high accuracy.
To solve the above problems, the invention adopts the following technical scheme. The text recognition method based on the combination of image processing and deep learning comprises the following steps:
S1 data collection: collecting data to form a data set;
S2 data set processing: inputting the data set, performing horizontal projection on the images in the data set, and performing template matching on the data in the data set to obtain a matching result image;
S3 text recognition with a deep learning CRNN model: constructing an improved convolutional neural network (CNN), extracting features from the matching result image obtained in step S2 with the constructed CNN, and detecting the extracted feature sequence; the feature sequence is then transcribed to obtain the final recognition result.
With this technical scheme, character recognition in special scenes is achieved by combining traditional image processing with deep learning, for example where the page layout is not standard, the table contents are inconsistent, and the samples to be recognized are not generic but specific to a certain domain. The method handles these cases well and solves the problem of difficult recognition when samples are scarce. No global matching is needed: small pictures are obtained by frame searching, and a deep learning CRNN model then performs sequence prediction. The convolutional neural network CNN inside the model is improved, and feature extraction with this improved CNN addresses the problems found in traditional recognition methods, such as poor recognition of specialized scene text, limited sequence length, high computation and storage cost, and inaccurate recognition in complex scenes.
As a preferred technical solution of the present invention, the data collected in step S1 comprise a data set formed from a number of PDF files, derived partly from customer requirements and partly from data enhancement. The data set is divided into a training set used to train the model and a test set used for testing.
As a preferred embodiment of the present invention, the step S2 includes:
S21: firstly, horizontally projecting each image in the input data set: binarize the image, loop over every row, and count the number of black pixels in each row in turn; if a row contains M black pixels, that row is set to black from the first column to the M-th column;
S22: finding the area of the image that needs template matching: first perform a frame search on the input image in the corresponding area, then run a template matching operation between a preset image template and the picture to be matched obtained from the frame, yielding a matching result image. Performing the frame search on the target image first and matching the template only against the small picture obtained from the frame raises the speed; a target image pyramid is also constructed, providing multi-resolution template matching support. In addition, the maximum and minimum values in a given matrix, together with their positions, can be found, which is how the best match location is read out of the matching result.
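The following is a minimal OpenCV sketch of steps S21 and S22, assuming a grayscale page image; the Otsu binarization, the normalized-correlation score and the ROI tuple named frame are illustrative assumptions rather than details fixed by this description:

    import cv2
    import numpy as np

    def horizontal_projection(gray):
        # Step S21: binarize, count the black pixels of every row, and
        # paint each row black from column 0 to column M (M = black count).
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY | cv2.THRESH_OTSU)
        black_per_row = np.sum(binary == 0, axis=1)
        proj = np.full_like(binary, 255)
        for row, m in enumerate(black_per_row):
            proj[row, :m] = 0
        return proj, black_per_row

    def match_in_frame(image, template, frame):
        # Step S22: restrict template matching to a previously searched
        # frame (x, y, w, h) instead of matching over the whole page.
        x, y, w, h = frame
        roi = image[y:y + h, x:x + w]
        result = cv2.matchTemplate(roi, template, cv2.TM_CCOEFF_NORMED)
        # minMaxLoc returns the minimum/maximum values of the result
        # matrix together with their positions.
        min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(result)
        top_left = (x + max_loc[0], y + max_loc[1])
        return max_val, top_left

Multi-resolution matching support can then be layered on top by repeating the same call over cv2.pyrDown levels of the target image.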
As a preferred technical solution of the present invention, the CRNN model in step S3 includes a convolutional neural network CNN, a recurrent neural network RNN, and a CTC loss function, and the specific steps include:
S31 convolutional layer: constructing an improved convolutional neural network CNN and using it to extract features from the matching result image; that is, the CNN extracts sequence features from the input image through convolution and pooling operations to obtain a feature sequence;
S32 recurrent layer: predicting the feature sequence with a bidirectional recurrent neural network (BLSTM), learning each feature vector in the sequence and outputting a series of label (ground-truth) distributions, while text recognition is performed to obtain a prediction result for the input image;
S33 transcription layer: transcribing the prediction results obtained from the recurrent layer into the final tag sequence using the CTC loss function. The CRNN network consists of three parts: (1) the convolutional neural network CNN extracts a feature sequence from the input image; (2) the recurrent neural network RNN predicts from the convolution result converted into the feature sequence; (3) the transcription layer converts each frame of predictions obtained from the RNN into a character sequence. Although composed of different types of neural networks, the CRNN network can be trained end to end as a whole. The CNN is improved in three ways: DW (depthwise) convolution replaces the first two convolutional layers, reducing computation and raising speed; a residual network structure (ResNet) is added to the network backbone, which tests show gives better results; and a flooding regularization constraint module is added, which adjusts precision and reduces the loss function.
As a preferred technical solution of the present invention, the step S31 specifically includes:
S311: firstly, scaling the matching result image to a fixed height;
S312: constructing a convolutional neural network CNN;
S313: inputting the fixed-height image into the convolutional neural network CNN, which outputs sequence features after computation, thereby obtaining the feature sequence to be input to the bidirectional recurrent neural network RNN; the feature sequence consists of a number of column vectors. Since convolution and pooling are translation invariant, each column of features can be mapped to a rectangular receptive field of the original image.
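As an illustration of how the feature sequence arises, the following PyTorch sketch (the tensor shapes are hypothetical) converts a CNN feature map into the column-vector sequence consumed by the recurrent layer:

    import torch

    def feature_map_to_sequence(fmap: torch.Tensor) -> torch.Tensor:
        # fmap: (B, C, H, W) output of the CNN. Each of the W columns
        # becomes one time step of the sequence fed to the BLSTM; with
        # input height 32 and the 1 x 2 poolings described here, H has
        # typically been reduced to 1.
        b, c, h, w = fmap.shape
        return fmap.permute(3, 0, 1, 2).reshape(w, b, c * h)  # (T, B, C*H)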
As a preferred technical solution of the present invention, the step S32 specifically includes:
S321 image recognition: establishing a deep bidirectional recurrent neural network (BLSTM) on top of the convolutional layers as the recurrent layer; by predicting a label distribution y_t for each frame x_t of the feature sequence x = (x1, …, xT), the recurrent layer outputs the image prediction label distribution;
S322 character recognition: a classic Bi-LSTM plus CTC loss function framework is adopted, and a residual network ResNet is added to the input layer and the output layer of the Bi-LSTM model respectively, so that the text prediction label distribution is output. This not only effectively alleviates the gradient explosion and vanishing problems inherent to the Bi-LSTM, but also accelerates the convergence of the model. The recurrent layer has three advantages. First, the bidirectional RNN has a strong ability to capture context information within a sequence; using contextual cues for image-based sequence recognition is more stable and helpful than processing each symbol independently. Taking scene text recognition as an example, a wide character may need several consecutive frames to be fully described, and some ambiguous characters are easier to distinguish in context: 'i' and 'l', for instance, are easier to tell apart by comparing character heights than by recognizing each of them in isolation. Second, the bidirectional RNN can back-propagate the error differentials to its input, i.e. the convolutional layer, allowing the recurrent and convolutional layers to be trained together in a unified network. Third, the bidirectional RNN can operate on sequences of arbitrary length from beginning to end.
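A minimal sketch of such a recurrent layer is given below (PyTorch; the hidden size and layer count are illustrative assumptions, and the residual connections of step S322 are omitted for brevity):

    import torch.nn as nn

    class RecurrentLayer(nn.Module):
        # Two stacked bidirectional LSTMs predicting a label distribution
        # y_t for every column x_t of the feature sequence.
        def __init__(self, in_features, hidden, num_classes):
            super().__init__()
            self.blstm = nn.LSTM(in_features, hidden, num_layers=2,
                                 bidirectional=True)
            self.fc = nn.Linear(2 * hidden, num_classes)  # 2x: both directions

        def forward(self, seq):            # seq: (T, B, in_features)
            out, _ = self.blstm(seq)       # (T, B, 2*hidden)
            return self.fc(out)            # (T, B, num_classes) per-frame logits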
As a preferred technical solution of the present invention, in step S33 a flooding regularization constraint module is added to the CTC loss function: the CTC loss is first allowed to fall to the threshold flood level of the flooding method, and once the loss drops below that flood level, gradient ascent is performed instead. This prevents the overfitting phenomenon in which the test loss rebounds while the training loss approaches zero. The per-frame predicted label distribution output by the bidirectional recurrent neural network RNN is transcribed into a tag sequence, with two transcription modes: dictionary-free and dictionary-based. Transcription is the process of converting the per-frame predictions made by the RNN into a tag sequence; mathematically, it finds the tag sequence with the highest probability given the per-frame predictions. A dictionary is a set of tag sequences that constrains the prediction, like a spell-check lexicon. In the dictionary-free mode no dictionary is used at prediction time; in the dictionary-based mode the prediction selects the tag sequence with the highest probability from the dictionary.
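The dictionary-free mode can be sketched as best-path decoding: take the most probable label per frame, collapse repeats, and remove blanks. The following is a minimal illustration, not the patent's exact procedure; a dictionary-based mode would instead score candidate tag sequences from the lexicon and return the most probable one:

    import torch

    def ctc_greedy_decode(logits: torch.Tensor, blank: int = 0):
        # logits: (T, B, num_classes) per-frame distributions from the
        # recurrent layer. Best path: argmax per frame, collapse repeats,
        # drop blanks.
        best = logits.argmax(dim=2)        # (T, B)
        sequences = []
        for b in range(best.shape[1]):
            decoded, prev = [], None
            for label in best[:, b].tolist():
                if label != prev and label != blank:
                    decoded.append(label)
                prev = label
            sequences.append(decoded)
        return sequences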
As a preferred technical solution of the present invention, in step S312 the convolutional neural network CNN adopts a VGG structure as the image feature extraction network, with DW convolution in both the first and second layers; the kernel size of the third-layer and fourth-layer max-pooling is set to 1 x 2; and a BN (Batch Normalization) layer is added to the fifth and sixth convolutions. The overall model uses the VGG structure for feature extraction, and to suit the input format of the LSTM the kernel sizes of the third and fourth max-pooling layers are changed from 2 x 2 to 1 x 2; the BN layers on the later convolutions accelerate network convergence.
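A sketch of such a backbone follows (PyTorch; the channel counts mirror the common CRNN configuration and are assumptions, as is the single-channel input; the pooling kernel (2, 1) below reduces height only, which is one reading of the 1 x 2 kernel above):

    import torch.nn as nn

    def dw_block(cin, cout):
        # Depthwise-separable stand-in for a standard 3x3 convolution:
        # per-channel 3x3 conv (groups=cin) followed by a 1x1 pointwise conv.
        return nn.Sequential(
            nn.Conv2d(cin, cin, 3, 1, 1, groups=cin, bias=False),
            nn.Conv2d(cin, cout, 1, bias=False),
            nn.ReLU(inplace=True),
        )

    backbone = nn.Sequential(
        dw_block(1, 64), nn.MaxPool2d(2, 2),        # layer 1: DW convolution
        dw_block(64, 128), nn.MaxPool2d(2, 2),      # layer 2: DW convolution
        nn.Conv2d(128, 256, 3, 1, 1), nn.ReLU(True),
        nn.MaxPool2d((2, 1), (2, 1)),               # 3rd pooling: height only
        nn.Conv2d(256, 256, 3, 1, 1), nn.ReLU(True),
        nn.MaxPool2d((2, 1), (2, 1)),               # 4th pooling: height only
        nn.Conv2d(256, 512, 3, 1, 1),
        nn.BatchNorm2d(512), nn.ReLU(True),         # BN on the 5th convolution
        nn.Conv2d(512, 512, 3, 1, 1),
        nn.BatchNorm2d(512), nn.ReLU(True),         # BN on the 6th convolution
    )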
As a preferred technical solution of the present invention, the specific steps of constructing the convolutional neural network in step S312 are:
S3121: firstly, defining a pyramid level for each stage, applying depthwise_conv2d convolution to the first two layers, and applying a separate convolution kernel to each channel of in_channels independently;
S3122: using the feature activation output of the last residual structure of each stage, denoted {C2, C3, C4, C5}, corresponding to conv2, conv3, conv4, conv5 respectively, with pixel strides of {4, 8, 16, 32} relative to the original input image;
S3123: merging the feature maps of the same spatial size from the bottom-up path and the top-down path through lateral connections; this step exploits the localization detail of the lower layers;
S3124: obtaining a set of feature maps, denoted {P2, P3, P4, P5}, and feeding the feature maps {P2, P3, P4, P5} as continuous input to the text recognition part.
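The pyramid merge of steps S3122 to S3124 can be sketched as a top-down path with lateral connections (PyTorch; the channel widths are assumptions):

    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidMerge(nn.Module):
        # Top-down path with lateral connections over C2..C5
        # (strides 4, 8, 16, 32), producing P2..P5.
        def __init__(self, channels=(64, 128, 256, 512), out_ch=256):
            super().__init__()
            self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1)
                                         for c in channels)
            self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                        for _ in channels)

        def forward(self, c2, c3, c4, c5):
            lat = [l(c) for l, c in zip(self.lateral, (c2, c3, c4, c5))]
            # upsample each coarser map and add the lateral detail below it
            for i in range(len(lat) - 2, -1, -1):
                lat[i] = lat[i] + F.interpolate(lat[i + 1],
                                                size=lat[i].shape[-2:],
                                                mode="nearest")
            return tuple(s(l) for s, l in zip(self.smooth, lat))  # P2..P5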
As a preferred technical solution of the present invention, in step S311 the heights of all the test pictures are normalized to 32, and the widths are scaled proportionally at a ratio of 1:1.5 with a minimum of 100 pixels. The purpose of this processing is to speed up training on the test pictures.
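A small sketch of this normalization, interpreting the rule as a proportional resize to height 32 with a width floor of 100 pixels (one reading of the 1:1.5 wording above):

    import cv2

    def normalize_height(img, target_h=32, min_w=100):
        # Height fixed at 32; width scaled proportionally with a floor
        # of 100 pixels.
        h, w = img.shape[:2]
        new_w = max(min_w, int(w * target_h / h))
        return cv2.resize(img, (new_w, target_h))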
Compared with the prior art, the technical scheme of the invention has the following beneficial effects. The data template matching is improved so that global matching is no longer needed: a frame search is performed on the target image first, and template matching is then done against the small picture obtained from the frame, which raises speed; a target image pyramid is constructed, providing multi-resolution template matching support. The improved deep learning model CRNN uses DW convolution in the first two layers of the convolutional neural network CNN, replacing the original first two convolutional layers, which reduces computation and raises speed; a residual network structure (ResNet) is added to the network backbone, which tests show gives better results; and a flooding regularization constraint module is added to the CTC loss function, adjusting precision and reducing the loss.
Drawings
The technical scheme of the invention is further described by combining the accompanying drawings as follows:
FIG. 1 is a flow chart of the text recognition method based on image processing combined with deep learning according to the present invention;
FIG. 2 is an example of the data set collected in step S1;
FIG. 3 is the CRNN network structure used in step S3;
FIG. 4 is a flowchart of step S3;
FIG. 5a is the result (upper half) of the training data enhancement processing performed in step S3 on the data set of FIG. 2;
FIG. 5b is the result (lower half) of the same training data enhancement processing;
FIG. 6 shows OpenCV template matching against individual digit templates and the corresponding text output in step S3;
FIG. 7a is the recognition result of step S3 corresponding to FIG. 5a;
FIG. 7b is the recognition result of step S3 corresponding to FIG. 5b;
FIG. 8a shows the trend of training loss and test loss when no regularization constraint is added to the transcription layer in step S33;
FIG. 8b shows the loss trend when the flooding regularization module is added to the transcription layer in step S33;
FIG. 8c shows the actual behaviour of the loss, corresponding to FIG. 8a, when the model is trained on the cifar-10 data set;
FIG. 8d shows the actual behaviour of the loss, corresponding to FIG. 8b, when the model is trained on the cifar-10 data set.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings and embodiments, and it is to be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example: as shown in FIG. 1, the text recognition method based on the combination of image processing and deep learning includes the following steps:
S1 data collection: collecting data to form a data set;
the data collected in step S1 comprise a data set formed from a number of PDF files, derived partly from customer requirements and partly from data enhancement, as shown in FIG. 2; the data set is divided into a training set used to train the model and a test set used for testing;
S2 data set processing: inputting the data set, performing horizontal projection on the images in the data set, and performing template matching on the data in the data set to obtain a matching result image;
the specific steps of step S2 include:
S21: firstly, horizontally projecting each image in the input data set: binarize the image, loop over every row, and count the number of black pixels in each row in turn; if a row contains M black pixels, that row is set to black from the first column to the M-th column;
S22: finding the area of the image that needs template matching: first perform a frame search on the input image in the corresponding area, then run a template matching operation between a preset image template and the picture to be matched obtained from the frame to obtain a matching result image, as shown in FIG. 6. Performing the frame search first and matching against the small picture obtained from the frame raises the speed, and constructing a target image pyramid provides multi-resolution template matching support; the maximum and minimum values in a given matrix, together with their positions, can also be found;
S3 text recognition with the deep learning CRNN model: constructing an improved convolutional neural network CNN, performing feature extraction on the matching result image obtained in step S2 with the constructed CNN, detecting the extracted feature sequence, and transcribing the feature sequence to obtain the final recognition result, as shown in FIGS. 7a and 7b;
as shown in FIG. 3, the CRNN model in step S3 comprises the convolutional neural network CNN, the recurrent neural network RNN, and the CTC loss function; the specific steps are:
S31 convolutional layer: constructing an improved convolutional neural network CNN and using it to extract features from the matching result image, namely the CNN extracts sequence features of the input image through convolution and pooling operations to obtain a feature sequence;
the step S31 includes the following steps:
S311: firstly, scaling the matching result image to a fixed height; in step S311 the heights of all the test pictures are normalized to 32, and the widths are scaled proportionally at a ratio of 1:1.5 with a minimum of 100 pixels; the purpose of this processing is to speed up training on the test pictures;
S312: constructing a convolutional neural network CNN;
the specific steps of constructing the convolutional neural network CNN in step S312 are as follows:
S3121: firstly, defining a pyramid level for each stage, applying depthwise_conv2d convolution to the first two layers, and applying a separate convolution kernel to each channel of in_channels independently;
S3122: using the feature activation output of the last residual structure of each stage, denoted {C2, C3, C4, C5}, corresponding to conv2, conv3, conv4, conv5 respectively, with pixel strides of {4, 8, 16, 32} relative to the original input image;
S3123: merging the feature maps of the same spatial size from the bottom-up path and the top-down path through lateral connections, exploiting the localization detail of the lower layers;
S3124: obtaining a set of feature maps, denoted {P2, P3, P4, P5}, and feeding them as continuous input to the text recognition part;
in step S312 the convolutional neural network CNN adopts the VGG structure as the image feature extraction network, with DW convolution in both the first and second layers; the kernel size of the third-layer and fourth-layer max-pooling is set to 1 x 2; and a BN (Batch Normalization) layer is added to the fifth and sixth convolutions; the VGG structure serves as the feature extraction network, and to suit the input format of the LSTM the kernel sizes of the third and fourth max-pooling layers are changed from 2 x 2 to 1 x 2, while the added BN layers accelerate network convergence;
The bottom-up path of the convolutional neural network CNN, the top-down path, and the lateral connections make full use of both the high resolution and the strong semantics of the CNN features. The bottom-up path defines a pyramid level at each stage and uses depthwise_conv2d convolution for the first two layers, applying a separate convolution kernel to each channel of in_channels independently. An ordinary convolution on a three-channel image effectively performs a weighted sum across channels while convolving (summing first and then convolving gives the same result as convolving first and then summing): intuitively, the 3 channels are flattened into 1 channel, which is then expanded into x channels by x convolution kernels (equivalently, the 3 channels are first expanded into 3x channels by x kernels and then flattened down to x channels). depthwise_conv2d performs no such cross-channel weighted sum and convolves each channel directly, so the total number of output channels is in_channels x channel_multiplier (a small sketch of this semantics follows after this paragraph). The top-down path semantically strengthens the feature maps coming from the higher pyramid levels, generating higher-resolution features; the feature maps of the same spatial size from the two paths are then merged through the lateral connections, exploiting the localization detail of the lower layers. Finally a set of feature maps {P2, P3, P4, P5} is obtained and fed as continuous input to the text recognition part, as shown in FIGS. 5a and 5b;
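The channel-multiplier semantics can be checked with a small sketch; in PyTorch, depthwise_conv2d corresponds to a grouped convolution with groups equal to in_channels (the sizes below are arbitrary examples):

    import torch
    import torch.nn as nn

    in_channels, channel_multiplier = 3, 4
    # Grouped convolution with groups=in_channels applies channel_multiplier
    # separate kernels to each input channel with no cross-channel summation,
    # so out_channels = in_channels * channel_multiplier.
    dw = nn.Conv2d(in_channels, in_channels * channel_multiplier,
                   kernel_size=3, padding=1, groups=in_channels, bias=False)

    x = torch.randn(1, in_channels, 32, 100)  # (B, C, H, W)
    print(dw(x).shape)                        # torch.Size([1, 12, 32, 100])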
S313: inputting the fixed-height image into the convolutional neural network CNN, which outputs sequence features after computation, thereby obtaining the feature sequence to be input to the bidirectional recurrent neural network RNN; the feature sequence consists of a number of column vectors; because convolution and pooling are translation invariant, each column of features can be mapped to a rectangular receptive field of the original image;
S32 recurrent layer: predicting the feature sequence with a bidirectional recurrent neural network (BLSTM), learning each feature vector in the sequence and outputting a series of label (ground-truth) distributions, while text recognition is performed to obtain a prediction result for the input image;
the step S32 includes the following steps:
S321 image recognition: building a deep bidirectional recurrent neural network (BLSTM) on top of the convolutional layers as the recurrent layer; by predicting a label distribution y_t for each frame x_t of the feature sequence x = (x1, …, xT), the recurrent layer outputs the image prediction label distribution;
S322 character recognition: a classic Bi-LSTM plus CTC loss function framework is adopted, and a residual network ResNet is added to the input layer and the output layer of the Bi-LSTM model respectively, so that the text prediction label distribution is output. This effectively alleviates the gradient explosion and vanishing problems inherent to the Bi-LSTM and accelerates model convergence. The recurrent layer has three advantages: first, the bidirectional RNN has a strong ability to capture context information within a sequence, and using contextual cues for image-based sequence recognition is more stable and helpful than processing each symbol independently; in scene text recognition a wide character may need several consecutive frames to be fully described, and ambiguous characters such as 'i' and 'l' are easier to distinguish by comparing character heights in context than by recognizing each in isolation; second, the bidirectional RNN can back-propagate the error differentials to its input, i.e. the convolutional layer, allowing the recurrent and convolutional layers to be trained together in a unified network; third, the bidirectional RNN can operate on sequences of arbitrary length from beginning to end;
S33 transcription layer: the prediction results obtained from the recurrent layer are transcribed into the final tag sequence using the CTC loss function. The CRNN network consists of three parts: (1) the convolutional neural network part extracts a feature sequence from the input image; (2) the recurrent neural network predicts from the convolution result converted into the feature sequence; (3) the transcription layer converts each frame of predictions obtained from the recurrent neural network into a character sequence; although composed of different types of neural networks, the CRNN network can be trained end to end as a whole. In step S33 a flooding regularization constraint module is added to the CTC loss function: the loss first falls to the threshold flood level of the flooding method, and once it drops below that level, gradient ascent is performed. The per-frame predicted label distribution output by the bidirectional RNN is transcribed into a tag sequence, with two transcription modes, dictionary-free and dictionary-based; transcription converts the per-frame predictions made by the bidirectional RNN into a tag sequence, which mathematically means finding the tag sequence with the highest probability given the per-frame predictions; a dictionary is a set of tag sequences that constrains the prediction like a spell-check lexicon; in the dictionary-free mode no dictionary is used at prediction time, while in the dictionary-based mode the prediction selects the tag sequence with the highest probability.
FIGS. 8a to 8d compare the transcription layer with and without the flooding regularization constraint module: FIG. 8a shows the trend of training loss and test loss in step S33 without the regularization constraint; FIG. 8b shows the loss trend after the flooding regularization module is added to the transcription layer, where the test loss can be seen to exhibit a double-descent curve; FIG. 8c shows the actual behaviour of the loss corresponding to FIG. 8a when the model is trained on the cifar-10 data set, and FIG. 8d the behaviour corresponding to FIG. 8b. The flooding regularization constraint is simple to use; its formula is:
J~(θ) = |J(θ) − b| + b
wherein J represents the original objective function; b is the threshold flood level of the flooding method, set as the lower bound toward which the training loss descends; and θ denotes the parameters of the objective function.
In FIGS. 8a and 8c, produced without the flooding method, the shaded part shows the loss value approaching 0 without bound; FIGS. 8b and 8d show the fluctuation region of the flooding area, determined by the flooding level after the flooding method is added, which prevents the training loss from approaching 0. Note that in FIGS. 8a and 8c, A + B is a complete training process: A denotes stage A and B stage B, divided by the number of training rounds; stage B is the period in which the test loss curve rebounds; the boundary marks the normal case in which training loss and test loss converge together along a descending hyperbola, after which the test loss rebounds, no longer converges, and starts to diverge; and C expresses the training loss approaching zero.
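A minimal sketch of the flooding constraint as a wrapper around the raw CTC loss (the flood level b is a tuned hyperparameter; the value in the comment is only an example):

    import torch

    def flooded_loss(raw_loss: torch.Tensor, b: float) -> torch.Tensor:
        # Flooding: J~(theta) = |J(theta) - b| + b. While J > b this is
        # ordinary descent on J; once J < b the sign flips, turning the
        # update into gradient ascent so the training loss floats around b
        # instead of collapsing to zero.
        return (raw_loss - b).abs() + b

    # usage with the CTC objective (b is a tuned hyperparameter):
    # loss = flooded_loss(ctc_criterion(log_probs, targets,
    #                                   input_lens, target_lens), b=0.02)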
The invention builds on traditional image processing with OpenCV and the deep learning algorithm CRNN. OpenCV's built-in template matching is purely pixel-level matching and is particularly susceptible to illumination: even a slight illumination change affects it. The method here is instead based mainly on image edge gradients and has strong resistance to illumination changes and pixel shifts. The template matching algorithm is based on image gradients, realizing NCC template matching at the gradient level, with dx, dy and magnitude obtained from the Sobel gradient operator:
Gx = [ -1 0 +1 ; -2 0 +2 ; -1 0 +1 ] * A
Gy = [ -1 -2 -1 ; 0 0 0 ; +1 +2 +1 ] * A
image XY gradient:
G = sqrt(Gx^2 + Gy^2)
wherein G is the gradient of the image, Gx and Gy are respectively the gray values detected by the horizontal and vertical edge operators, and (x, y) are the coordinates of the pixel; [A] is the pixel matrix of the original image, exemplified above as a 3x3 neighbourhood whose elements A(i-1,j-1), A(i,j-1), A(i+1,j-1), A(i-1,j), A(i,j), A(i+1,j), A(i-1,j+1), A(i,j+1), A(i+1,j+1) represent the relative positions of the 9 pixels of the 3x3 matrix.
An edge image is obtained with the Canny algorithm, all contour point sets are obtained via contour discovery, and for each contour point the three values dx, dy and magnitude (dxy) are computed to generate the template information. The input image is then passed through the Sobel gradient computation and matched against this template information. The advantages are: (1) gradients strongly resist illumination interference, giving illumination-robust template matching; (2) matching on gradients cancels out small pixel shifts occurring in the target image.
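A sketch of building such gradient-level template information with OpenCV follows (the Canny thresholds are illustrative assumptions):

    import cv2
    import numpy as np

    def gradient_template_info(template_gray):
        # dx, dy and magnitude from the Sobel operator, sampled at every
        # contour point of the Canny edge image: the template information
        # used for gradient-level NCC matching.
        dx = cv2.Sobel(template_gray, cv2.CV_32F, 1, 0, ksize=3)  # Gx
        dy = cv2.Sobel(template_gray, cv2.CV_32F, 0, 1, ksize=3)  # Gy
        mag = cv2.magnitude(dx, dy)                    # sqrt(Gx^2 + Gy^2)
        edges = cv2.Canny(template_gray, 50, 150)
        contours, _ = cv2.findContours(edges, cv2.RETR_LIST,
                                       cv2.CHAIN_APPROX_NONE)
        points = np.vstack([c.reshape(-1, 2) for c in contours])
        return [(int(px), int(py), dx[py, px], dy[py, px], mag[py, px])
                for px, py in points]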
The technical scheme of the invention improves on this: global matching is no longer needed, since a frame search is performed on the target image first and template matching is then done against the small picture obtained from the frame, raising the speed; a target image pyramid is constructed, providing multi-resolution template matching support.
The deep learning algorithm CRNN was proposed in 2015 to solve image-based sequence recognition, especially the recognition of text in pictures. Its biggest characteristic is that single characters are not segmented first; recognition proceeds on the image as a sequence. The network combines a deep convolutional neural network with a recurrent neural network, hence the name convolutional recurrent neural network. The technical scheme of the invention improves this deep learning algorithm: DW convolution is used in the first two convolutional layers of the CNN, replacing the original first two layers, reducing computation and raising speed; a residual network structure (ResNet) is added to the network backbone, which tests show gives better results; and a flooding regularization constraint module is added to the CTC loss function, adjusting precision and reducing the loss.
It is obvious to those skilled in the art that the present invention is not limited to the above embodiments; variations that make insubstantial modifications to the method concept and technical scheme of the invention, or that apply the concept and scheme directly to other occasions without modification, all fall within the protection scope of the invention.

Claims (10)

1. A text recognition method based on combination of image processing and deep learning is characterized by comprising the following steps:
S1 data collection: collecting data to form a data set;
S2 data set processing: inputting a data set, performing horizontal projection on the images in the data set, and performing template matching on the data in the data set to obtain a matching result image;
S3 text recognition with a deep learning CRNN model: constructing a convolutional neural network CNN, extracting features from the matching result image obtained in step S2 by using the constructed CNN, and detecting the extracted feature sequence; and then transcribing the feature sequence to obtain a final recognition result.
2. The method for text recognition based on combination of image processing and deep learning of claim 1, wherein the data collected in step S1 includes a data set composed of several PDF files, and the data set is divided into a training set and a test set, the training set is used for training a model, and the test set is used for testing.
3. The text recognition method based on the combination of image processing and deep learning of claim 1, wherein the specific steps of step S2 include:
S21: firstly, horizontally projecting an image in an input data set: carrying out binarization on the image, looping over each row, and counting the number of black pixels in each row in turn;
S22: finding an area needing template matching in an image in the input data set: firstly carrying out frame searching on the input image in the corresponding area, and carrying out a template matching operation between a preset image template and the picture to be matched obtained from the frame, obtaining a matching result image.
4. The method for text recognition based on combination of image processing and deep learning of claim 1, wherein the CRNN model in step S3 includes a convolutional neural network CNN, a recurrent neural network RNN and a CTC loss function, and the method includes the following specific steps:
S31 convolutional layer: constructing a convolutional neural network CNN, and extracting features from the matching result image with the constructed CNN, namely the CNN extracts sequence features of an input image through convolution and pooling operations to obtain a feature sequence;
S32 recurrent layer: predicting the feature sequence by using a bidirectional recurrent neural network RNN, learning each feature vector in the sequence, outputting a series of label distributions, and simultaneously performing text recognition to obtain a prediction result of an input image;
S33 transcription layer: transcribing the prediction results obtained from the recurrent layer into the final tag sequence using the CTC loss function.
5. The text recognition method based on the combination of image processing and deep learning of claim 4, wherein the step S31 specifically comprises the steps of:
S311: firstly, scaling the matching result image to a fixed height;
S312: constructing a convolutional neural network CNN;
S313: inputting the fixed-height image into the convolutional neural network CNN, which outputs sequence features after computation, thereby obtaining the feature sequence to be input to the bidirectional recurrent neural network RNN, the feature sequence consisting of a number of column vectors.
6. The text recognition method based on the combination of image processing and deep learning of claim 5, wherein the step S32 comprises the following steps:
S321 image recognition: establishing a deep bidirectional recurrent neural network RNN on top of the convolutional layer as the recurrent layer; outputting an image prediction label distribution by predicting a label distribution y_t for each frame x_t of the feature sequence x = (x1, …, xT);
S322 character recognition: adopting a classic Bi-LSTM combined CTC loss function framework, and adding a residual network ResNet to the input layer and the output layer of the Bi-LSTM model respectively, thereby outputting the text prediction label distribution.
7. The text recognition method based on the combination of image processing and deep learning of claim 6, wherein in step S33 a flooding regularization constraint module is added to the CTC loss function, so that the CTC loss function first falls to the threshold flood level of the flooding method, and gradient ascent is performed when the loss function is lower than that flood level; and the per-frame predicted tag distribution output by the bidirectional recurrent neural network RNN is transcribed into a tag sequence, the transcription modes comprising dictionary-free transcription and dictionary-based transcription.
8. The text recognition method based on the combination of image processing and deep learning of claim 5, wherein the convolutional neural network CNN in step S312 adopts a VGG structure as the image feature extraction network, and both the first layer and the second layer adopt DW convolution; the kernel size of the third-layer and fourth-layer max-pooling is set to 1 x 2; meanwhile, a BN layer is added to the fifth-layer and sixth-layer convolutions, accelerating the training of the network.
9. The text recognition method based on the combination of image processing and deep learning of claim 8, wherein the specific steps of constructing the convolutional neural network CNN in step S312 are as follows:
S3121: firstly, defining a pyramid level in each stage, and performing depthwise_conv2d convolution on the first two layers;
S3122: using the last residual structural feature activation output per stage, denoted as {C2, C3, C4, C5}, corresponding to conv2, conv3, conv4, conv5, respectively, with {4, 8, 16, 32} pixel step size relative to the original input image;
S3123: merging the feature maps with the same spatial size from the bottom-up path and the top-down path through each lateral connection;
S3124: obtaining a set of feature maps, labeled {P2, P3, P4, P5}, and inputting the feature maps {P2, P3, P4, P5} as a continuous input to the text recognition portion.
10. The method for text recognition based on combination of image processing and deep learning of claim 5, wherein in step S311 the heights of all the test pictures are normalized to 32, and the widths are scaled proportionally at a ratio of 1:1.5 with a minimum of 100 pixels.
CN202210600210.9A 2022-05-30 2022-05-30 Text recognition method based on combination of image processing and deep learning Active CN114694133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210600210.9A CN114694133B (en) 2022-05-30 2022-05-30 Text recognition method based on combination of image processing and deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210600210.9A CN114694133B (en) 2022-05-30 2022-05-30 Text recognition method based on combination of image processing and deep learning

Publications (2)

Publication Number Publication Date
CN114694133A (en) 2022-07-01
CN114694133B (en) 2022-09-16

Family

ID=82144492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210600210.9A Active CN114694133B (en) 2022-05-30 2022-05-30 Text recognition method based on combination of image processing and deep learning

Country Status (1)

Country Link
CN (1) CN114694133B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331114A (en) * 2022-10-14 2022-11-11 青岛恒天翼信息科技有限公司 Ship identity recognition method based on ship number deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2848120A1 (en) * 2011-09-23 2013-03-28 General Electric Company Processes and systems for optical character recognition using generated font templates
CN110008909A (en) * 2019-04-09 2019-07-12 浩鲸云计算科技股份有限公司 A kind of real-time audit system of system of real name business based on AI
CN112528963A (en) * 2021-01-09 2021-03-19 江苏拓邮信息智能技术研究院有限公司 Intelligent arithmetic question reading system based on MixNet-YOLOv3 and convolutional recurrent neural network CRNN

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2848120A1 (en) * 2011-09-23 2013-03-28 General Electric Company Processes and systems for optical character recognition using generated font templates
CN110008909A (en) * 2019-04-09 2019-07-12 浩鲸云计算科技股份有限公司 A kind of real-time audit system of system of real name business based on AI
CN112528963A (en) * 2021-01-09 2021-03-19 江苏拓邮信息智能技术研究院有限公司 Intelligent arithmetic question reading system based on MixNet-YOLOv3 and convolutional recurrent neural network CRNN

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331114A (en) * 2022-10-14 2022-11-11 青岛恒天翼信息科技有限公司 Ship identity recognition method based on ship number deep learning

Also Published As

Publication number Publication date
CN114694133B (en) 2022-09-16

Similar Documents

Publication Publication Date Title
CN110334705B (en) Language identification method of scene text image combining global and local information
CN109993160B (en) Image correction and text and position identification method and system
Gao et al. Reading scene text with fully convolutional sequence modeling
CN109726657B (en) Deep learning scene text sequence recognition method
CN105608454B (en) Character detecting method and system based on text structure component detection neural network
CN108288075A (en) A kind of lightweight small target detecting method improving SSD
WO2023083280A1 (en) Scene text recognition method and device
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN112862849B (en) Image segmentation and full convolution neural network-based field rice ear counting method
CN108681735A (en) Optical character recognition method based on convolutional neural networks deep learning model
CN111738169A (en) Handwriting formula recognition method based on end-to-end network model
CN111539417B (en) Text recognition training optimization method based on deep neural network
CN111368775A (en) Complex scene dense target detection method based on local context sensing
CN111414938B (en) Target detection method for bubbles in plate heat exchanger
CN114694133B (en) Text recognition method based on combination of image processing and deep learning
He Research on text detection and recognition based on OCR recognition technology
CN115810197A (en) Multi-mode electric power form recognition method and device
CN116189162A (en) Ship plate detection and identification method and device, electronic equipment and storage medium
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
Liu et al. Cloud detection using super pixel classification and semantic segmentation
CN111753714B (en) Multidirectional natural scene text detection method based on character segmentation
Zhu et al. TransText: Improving scene text detection via transformer
CN115937852A (en) Text-driven efficient weak supervision semantic segmentation method and device
Wang et al. Multiorientation scene text detection via coarse-to-fine supervision-based convolutional networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant