CN112580507B - Deep learning text character detection method based on image moment correction

Deep learning text character detection method based on image moment correction

Info

Publication number
CN112580507B
CN112580507B (application CN202011506599.8A)
Authority
CN
China
Prior art keywords
character
loss
box
label
loss function
Prior art date
Legal status
Active
Application number
CN202011506599.8A
Other languages
Chinese (zh)
Other versions
CN112580507A (en)
Inventor
田辉 (Tian Hui)
刘其开 (Liu Qikai)
Current Assignee
Hefei High Dimensional Data Technology Co., Ltd.
Original Assignee
Hefei High Dimensional Data Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Hefei High Dimensional Data Technology Co., Ltd.
Priority to CN202011506599.8A
Publication of CN112580507A
Application granted
Publication of CN112580507B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a deep learning text character detection method based on image moment correction, which specifically comprises the following steps: preparing a data set; manually correcting inaccurately pre-labeled boxes and generating heat-map labels in Gaussian heat-map form from the boxes; defining a neural network structure and a loss function; pre-training; expanding the training sample set with actual-scene samples; performing adaptive binarization on the expanded training sample set and computing the Hu moment feature vector of each character, the mean of the vector serving as the character's auxiliary label; modifying the loss function for fine-tuning training; and model test and verification. By combining the heat-map labels and the moment feature labels into an optimized loss function, the method improves the accuracy of the character boxes and alleviates the over-segmentation and under-segmentation of character frames; preprocessing the expanded sample set compensates for the shortage of character-level annotations, giving better generalization in character detection.

Description

Deep learning text character detection method based on image moment correction
Technical Field
The invention belongs to the field of target detection, and particularly relates to a deep learning text character detection method based on image moment correction.
Background
At present, text detection is widely applied in the field of computer vision, for example in real-time translation, image retrieval, scene parsing, geolocation and navigation for the blind, and therefore has high application value and research significance for scene understanding and text analysis.
The existing text detection methods fall into the following categories:
1. Traditional image processing methods based on hand-crafted features, such as MSER (maximally stable extremal regions) and SWT (stroke width transform); these mainly handle printed fonts and print-and-scan scenes and perform poorly on natural-scene text.
2. Two-stage methods based on deep learning, which generate candidate regions, extract the corresponding features, fine-tune the network and output the corresponding text region boxes; they offer higher precision, perform well on small-scale targets and share computation, but inference is slow and the training period is long.
3. One-stage methods based on deep learning, which skip the candidate-box generation step and predict the target text region boxes end to end; inference is fast, but precision is lower than two-stage methods and small-target detection is weaker.
Most existing text detection algorithms output the position coordinates of text-line regions. For example, CTPN, a reference network among existing text detectors, is an improvement of the two-stage approach: building on Faster R-CNN, it adds modifications specific to horizontally or vertically arranged target text and outputs text-line regions. Existing text detection techniques therefore do not provide accurate character-level detection, and the information they provide is limited.
Existing character-level text detection algorithms are based on the idea of semantic segmentation: the pixel-level block heat-map label is replaced with a Gaussian center heat map, the network is optimized with a region score or a compactness score, and the final character boxes are obtained by binarizing the probability map in post-processing. Character-level detection can output the coordinates of individual character boxes as well as of text-line regions, so its output is richer and can satisfy broader customer requirements. However, existing character-level detection algorithms are affected by their parameters and by the complex Chinese text scenes they operate in, and the segmented character boxes can be over-segmented or under-segmented, corresponding respectively to the rectangular boxes and the darkened rectangular boxes shown in FIG. 4.
Disclosure of Invention
In order to solve the above problems, the invention provides a deep learning text character detection method based on image moment correction, comprising the following steps:
A: preparing a data set, pre-labeling samples randomly drawn from the data set, and storing the box of each character in each sample;
B: manually correcting inaccurately pre-labeled boxes, and generating heat-map labels in Gaussian heat-map form from the boxes;
C: defining a neural network structure and a loss function loss_cross;
D: performing preliminary pre-training with the network structure and loss function loss_cross determined in step C;
E: expanding the training sample set with actual-scene samples;
F: performing adaptive binarization on the training sample set expanded in step E, and computing the Hu moment feature vector of each character, the mean of the vector being used as the character's auxiliary label;
G: modifying the loss function by adding a regular-term branch, and performing fine-tuning training on the expanded training sample set with the modified loss function loss;
H: model test and verification: varying the parameter theta used to generate the Gaussian heat maps from the pre-labels, and plotting the accuracy curve of the character boxes under different theta thresholds, so that a suitable parameter theta is selected as required.
Further, the data set in step A mainly comprises data from ICDAR2017, ICDAR2019 and CTW, and the samples randomly drawn from the data set are pre-labeled with a public character segmentation model trained with EasyOCR.
Further, pre-labeling inaccuracy in step B specifically refers to over-segmentation or under-segmentation of the character box: over-segmentation means that the box does not enclose the whole of the current character, and under-segmentation means that the box contains characters or symbols other than the current character.
Further, in step B the box is mapped onto a two-dimensional Gaussian map by perspective transformation to generate the Gaussian heat-map label.
Further, determining the neural network structure in step C specifically comprises: the network takes samples of a preset size as input, uses a VGG16 backbone as the feature extraction network and U-net as the decoding network, and outputs a pixel score matrix representing the confidence region. The loss function loss_cross in step C is determined as follows: loss_cross is a pixel-level cross-entropy loss, i.e. a theta threshold is set on the label heat map, pixels above the threshold are considered character regions (class 1), and pixels below it non-character regions (class 0).
Further, the training sample set of actual scenes in step E is expanded by taking random screenshots, or photographs at different angles, of computer-screen interfaces containing documents, pre-labeling them with the pre-trained model, and manually correcting them in the manner of step B.
Further, the theta threshold is obtained as follows:
performing Gaussian smoothing on the heat-map label and computing its gradient map;
determining the connected regions under different thresholds with a watershed algorithm, and taking the minimum enclosing rectangle of each connected region, i.e. the character box under that threshold;
randomly sampling a number of characters, judging the accuracy of the minimum enclosing boxes under the corresponding thresholds, and taking the threshold with the highest accuracy as the theta threshold.
Further, the modified loss function loss in step G is the loss function loss_cross of step C plus an L2 loss:

loss = loss_cross + m * loss_L2

where loss_L2 = Σ_{i=1}^{m} Σ_{j=1}^{K} (y_ij − f(x_ij))² denotes the L2 loss of the sample moment features, m denotes the number of samples, K denotes the number of characters of a single sample, y_ij denotes the mean of the moment feature vector corresponding to the j-th character in the i-th sample, and f(x_ij) denotes the network's predicted mean of the moment feature vector corresponding to the j-th character in the i-th sample.
Further, the samples used in the model test and verification of step H are characters in text scenes from photographs or screenshots of arbitrarily chosen computer documents.
The invention has the following advantages:
The detection method uses image moment features to characterize the center of a single character and to provide more robust auxiliary information, i.e. the Gaussian heat map and the moment features are combined into an optimized loss function, which improves the accuracy of the character boxes; combining a segmentation task (heat-map labels) with a regression task (moment feature labels) improves the model's character detection and segmentation ability and alleviates the over-segmentation and under-segmentation of character frames. In addition, samples are synthesized from on-screen text scenes to pre-train a preliminary character text detection model; real text samples are then pre-labeled and corrected manually, the moment features of each character in the real samples are computed, and these serve as the regular term of the loss function during fine-tuning. This preprocessing compensates for the shortage of character-level annotations on the one hand, and on the other hand yields better character detection generalization in actual printed, photographed or screenshot text scenes.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates a prior art character segmentation algorithm flow diagram;
FIG. 2 shows a flow chart of a character segmentation algorithm according to an embodiment of the present invention;
FIG. 3 shows an example of a sample Gaussian heat-map label of the present invention;
FIG. 4 shows examples of the over-segmentation and under-segmentation phenomena.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Because the backgrounds of natural-scene samples are complex, the computed image moment features would be biased; the image moment feature values are therefore computed only for screenshots of computer documents or photographs of that specific scene. Moments of different orders also have different properties: raw (origin) moments or central moments used as image features cannot be guaranteed to be simultaneously invariant to translation, rotation and scale. Central moments alone provide translation invariance, normalized central moments additionally provide scale invariance, and the Hu moments built from them are invariant to translation, scale and rotation, so Hu moment vectors are used as the auxiliary information, providing the network with more prior knowledge for training.
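For reference, the standard definitions behind these invariances, which the original text does not write out, are, for an image I(x, y) with centroid (x̄, ȳ):

```latex
% Central moments: invariant to translation.
\mu_{pq} = \sum_{x}\sum_{y} (x-\bar{x})^{p}\,(y-\bar{y})^{q}\, I(x,y)

% Normalized central moments: additionally invariant to scale.
\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\,1+(p+q)/2}}

% The seven Hu moments are fixed polynomial combinations of the \eta_{pq}
% that are additionally invariant to rotation; the first one is:
h_{1} = \eta_{20} + \eta_{02}
```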
The invention discloses a method for detecting deep learning text characters based on image moment correction, which comprises the following steps:
A. Preparing the data set. The public Chinese data sets used in the method mainly comprise the ICDAR2017 data set, the ICDAR2019 data set and CTW (Chinese Text in the Wild) data. The CTW data has high diversity and complexity, including planar text, raised text, urban street-view text, rural street-view text, text under weak illumination, distant text, and partially displayed text. For each image, all Chinese characters are annotated in the data set; for each Chinese character, its character class and bounding box are annotated. Samples randomly drawn from the data set are pre-labeled with a public character segmentation model trained with EasyOCR, and the box of each character of each sample is stored.
B. A simple human-computer interaction labeling interface for fine correction, similar to an object detection labeling tool, is developed; it automatically loads pictures and the json-format labels of the corresponding pictures, and character boxes with inaccurate pre-labels are then corrected manually in a pop-up dialog. Inaccurate prediction here means the box does not fully enclose the current character (over-segmentation) or extends into adjacent characters, commas and the like (under-segmentation); specific examples are the rectangular boxes (over-segmentation) and the darkened rectangular boxes (under-segmentation) in FIG. 4. A Gaussian heat-map label is then generated from each character's box: in this step the character's box is mapped onto a two-dimensional Gaussian map by perspective transformation to represent the character's heat-map label, as in the sample Gaussian heat-map label of FIG. 3 and as sketched below.
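A minimal sketch of this label-generation step, assuming OpenCV and character boxes given as four corner points; the function name, template size and sigma are illustrative choices, not values from the patent:

```python
import cv2
import numpy as np

def make_gaussian_heatmap_label(image_shape, char_boxes, patch=64, sigma=0.35):
    """Warp a canonical 2-D Gaussian into each character box by
    perspective transform and take the pixel-wise maximum."""
    h, w = image_shape
    heatmap = np.zeros((h, w), dtype=np.float32)

    # Canonical isotropic Gaussian on a patch x patch template, peak 1 at center.
    xs, ys = np.meshgrid(np.arange(patch), np.arange(patch))
    c = (patch - 1) / 2.0
    gauss = np.exp(-((xs - c) ** 2 + (ys - c) ** 2) / (2 * (sigma * patch) ** 2))
    gauss = gauss.astype(np.float32)
    src = np.float32([[0, 0], [patch - 1, 0], [patch - 1, patch - 1], [0, patch - 1]])

    for box in char_boxes:  # box: 4 corner points, clockwise from top-left
        M = cv2.getPerspectiveTransform(src, np.float32(box))
        warped = cv2.warpPerspective(gauss, M, (w, h))
        heatmap = np.maximum(heatmap, warped)  # overlaps keep the stronger score
    return heatmap
```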
C. Defining the network structure and loss function: the network takes a sample of size h x w x 3 as input, uses a VGG16 backbone as the feature extraction network and an improved U-net as the decoding network, and outputs a pixel score matrix representing the confidence region (the specific structure is shown in FIG. 2); h denotes the height of the image input to the network, w denotes its width, and 3 is the number of RGB channels.
The loss function uses pixel-level cross-entropy loss: a theta threshold is set on the label heat map, pixels above the threshold are considered character regions (class 1), and pixels below it non-character regions (class 0).
The accuracy under different values of the parameter theta must therefore be compared so that the best parameter theta can be selected; the theta threshold is obtained by testing on actual training samples with the help of the watershed algorithm from graphics, mainly comprising the following steps:
First, Gaussian smoothing is applied to the label heat map and its gradient map is computed; then the connected regions under different thresholds are determined with the watershed algorithm, and the minimum enclosing rectangle of each connected region (i.e. the character box under that threshold) is taken; finally, a number of characters are randomly sampled, the accuracy of the minimum enclosing boxes under each threshold is judged by human inspection, and the threshold with the relatively highest accuracy is taken as the theta threshold. A simplified sketch of this threshold search follows.
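A simplified sketch of the search, assuming OpenCV; for brevity, connected components on the smoothed, thresholded map stand in for the full gradient-plus-watershed split, and all names are illustrative:

```python
import cv2
import numpy as np

def char_boxes_at_threshold(heatmap, theta):
    """Return the minimum enclosing rotated rectangle of every
    connected region of the smoothed heat map above `theta`."""
    smoothed = cv2.GaussianBlur(heatmap.astype(np.float32), (5, 5), 0)
    binary = (smoothed > theta).astype(np.uint8)
    n, labels = cv2.connectedComponents(binary)
    boxes = []
    for lbl in range(1, n):  # label 0 is the background
        ys, xs = np.where(labels == lbl)
        pts = np.column_stack([xs, ys]).astype(np.float32)
        boxes.append(cv2.boxPoints(cv2.minAreaRect(pts)))
    return boxes

# The theta threshold is then chosen by sweeping candidate values,
# e.g. np.arange(0.2, 0.8, 0.05), and keeping the one whose boxes are
# judged most accurate on a randomly sampled set of characters.
```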
D. Pre-training: preliminary pre-training is performed with the network structure and loss function defined in step C.
E. Expanding the training sample set of actual scenes: random screenshots, or photographs at different angles, are taken of computer-screen interfaces containing documents, such as web pages and Word documents; these are pre-labeled with the pre-trained model and corrected manually in the manner of step B.
F. Adaptive binarization is applied to the samples expanded in step E to obtain binary images; the Hu moment feature vector of each character is then computed, and the mean of the vector is taken as the character's auxiliary label. In theory, the moment feature means of character regions differ little from one another, while being much larger than those of non-character regions. Introducing the moment feature branch, on the one hand, tilts the model's attention toward character regions, which benefits detection; on the other hand, the moment feature mean guides the network to learn more accurate character boxes, which benefits segmentation. A sketch of this labeling step follows.
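A minimal sketch of this auxiliary-label computation, assuming OpenCV, axis-aligned character boxes, and illustrative adaptive-threshold parameters:

```python
import cv2
import numpy as np

def hu_moment_labels(image_bgr, char_boxes):
    """Adaptive binarization, then the 7-dim Hu moment vector of each
    character patch; the mean of that vector is the auxiliary label."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV, 31, 10)  # block size / offset are assumptions
    labels = []
    for (x0, y0, x1, y1) in char_boxes:
        patch = binary[y0:y1, x0:x1]
        hu = cv2.HuMoments(cv2.moments(patch, binaryImage=True)).flatten()
        labels.append(float(hu.mean()))  # vector mean used as the label
    return labels
```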
G. The loss function is modified by adding a regular-term branch, and fine-tuning training is performed on the expanded training sample set with the modified loss function. Model training uses the expanded training samples; the detail distinguishing this step from pre-training is as follows: the network's loss function is modified by adding a regular-term branch that takes the Hu moment feature vectors as auxiliary label information, and the original cross-entropy loss_cross is trained jointly with the L2 loss of the character-box moment vectors, the weight m taking a value of 0.01 to 0.05:

loss = loss_cross + m * loss_L2

where loss_L2 = Σ_{i=1}^{m} Σ_{j=1}^{K} (y_ij − f(x_ij))² denotes the L2 (least-squares) loss of the sample moment features, m denotes the number of samples, and K denotes the number of characters of a single sample; y_ij denotes the mean of the moment feature vector corresponding to the j-th character in the i-th sample and is used as the moment feature label, and f(x_ij) denotes the network's predicted mean of the moment feature vector corresponding to the j-th character in the i-th sample. A sketch of this combined loss follows.
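A sketch of the modified loss in PyTorch, under the assumption that the score branch is sigmoid-activated; note that the patent text reuses the symbol m both for the sample count inside loss_L2 and for the weighting coefficient (0.01 to 0.05), which is made an explicit `weight` argument here:

```python
import torch
import torch.nn.functional as F

def combined_loss(score_pred, heat_label, moment_pred, moment_label,
                  theta=0.4, weight=0.03):
    """loss = loss_cross + m * loss_L2, with the weight m assumed in [0.01, 0.05]."""
    # Pixel-level cross entropy: heat-map values above theta are class 1.
    target = (heat_label > theta).float()
    loss_cross = F.binary_cross_entropy(score_pred, target)
    # Regular-term branch: L2 loss between predicted and labeled
    # per-character moment-feature means, summed over samples i and
    # characters j as in the formula above.
    loss_l2 = ((moment_pred - moment_label) ** 2).sum()
    return loss_cross + weight * loss_l2
```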
H. Model test and verification: the model of the method is mainly intended to improve character detection in text scenes photographed from computer documents, so samples from that scene are used for testing and verification, and the character segmentation accuracy is measured. Since the pre-labeled heat maps are affected by the parameter theta, the accuracy under different values of theta must be compared so that the best parameter theta can be selected: the parameter theta used to generate the Gaussian heat maps from the pre-labels is varied, and the accuracy curve of the character boxes under different theta thresholds is plotted, so that a suitable parameter theta is chosen as required.
FIG. 1 illustrates a character segmentation algorithm representative of the prior art.
The input sample is scaled to h x w x 3 as the network input, and a VGG16 backbone serves as the feature extraction network; the deeper the stage of the extraction network, the more abstract the generated feature map, with each stage halving the spatial size. To fuse low-level and high-level feature information, the decoding network U-net upsamples the feature map of an output layer to the size of the feature map at some stage of the extraction network so the two can be merged, and a final 1x1 convolution layer outputs a pixel score matrix representing the character and character-link confidence regions. The main idea is to predict character detection boxes with a segmentation task: a character-link confidence matrix is added to the output branch to solve character localization in non-rectangular regions, and weakly supervised learning on synthesized character data completes the model's pre-training task, improving character segmentation in general natural scenes.
FIG. 2 shows the character segmentation algorithm of the present method.
The network structure is basically the same as above; the input sample size and the outputs differ. The input has the h x w x 3 structure, a VGG16 backbone is used as the feature extraction network with a decoding network that fuses high-level and low-level features, a 1x1 convolution layer outputs a pixel score matrix representing the character moment-mean vectors, and a fully connected branch is introduced to output the moment feature vector. The two branches combine segmentation and regression tasks, replacing the box coordinates of ordinary object detection with moment features; owing to their invariance properties, moment feature vectors are more robust in the localization and segmentation of Chinese character text, whose aspect ratios are relatively consistent. A batch of data sets relevant to the algorithm's actual application, such as text data sets of computer photograph and screenshot scenes, is constructed to address the problem of character-level text detection. Using the idea of semantic segmentation, each character is labeled with a Gaussian heat map: the higher a pixel's heat-map value, the closer that pixel is to the character's center point. A simplified sketch of this two-branch structure follows.
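A simplified PyTorch sketch of the two-branch structure; the decoder below is a stand-in without the U-net skip connections, and the channel widths, 7-dim moment output and pooling are illustrative assumptions, not the patent's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class MomentCorrectedDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = vgg16(weights=None).features        # VGG16 backbone
        self.decoder = nn.Sequential(                      # simplified decoder
            nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.score_head = nn.Conv2d(64, 1, kernel_size=1)  # 1x1 conv -> pixel score map
        self.moment_head = nn.Sequential(                  # fully connected branch
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 7),
        )

    def forward(self, x):                                  # x: (N, 3, h, w)
        feat = self.decoder(self.encoder(x))
        score = torch.sigmoid(self.score_head(feat))       # confidence-region scores
        moments = self.moment_head(feat)                   # moment feature vector
        return score, moments
```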
Although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A method for detecting deep learning text characters based on image moment correction, the method comprising the steps of:
A: preparing a data set, pre-labeling samples randomly drawn from the data set, and storing the box of each character in each sample;
B: manually correcting inaccurately pre-labeled boxes, and generating heat-map labels in Gaussian heat-map form from the boxes;
C: defining a neural network structure and a loss function loss_cross;
D: performing preliminary pre-training with the network structure and loss function loss_cross determined in step C;
E: expanding the training sample set with actual-scene samples;
F: performing adaptive binarization on the training sample set expanded in step E, and computing the Hu moment feature vector of each character, the mean of the vector being used as the character's auxiliary label;
G: modifying the loss function by adding a regular-term branch, and performing fine-tuning training on the expanded training sample set with the modified loss function loss, the modified loss function loss being the loss function loss_cross plus an L2 loss:

loss = loss_cross + m * loss_L2

where loss_L2 = Σ_{i=1}^{m} Σ_{j=1}^{K} (y_ij − f(x_ij))² denotes the L2 loss of the sample moment features, m denotes the number of samples, K denotes the number of characters of a single sample, y_ij denotes the mean of the moment feature vector corresponding to the j-th character in the i-th sample, and f(x_ij) denotes the network's predicted mean of the moment feature vector corresponding to the j-th character in the i-th sample;
H: model test and verification: varying the parameter theta used to generate the Gaussian heat maps from the pre-labels, and plotting the accuracy curve of the character boxes under different theta thresholds, so that a suitable parameter theta is selected as required, the theta threshold being obtained as follows:
performing Gaussian smoothing on the heat-map label and computing its gradient map;
determining the connected regions under different thresholds with a watershed algorithm, and taking the minimum enclosing rectangle of each connected region, i.e. the character box under that threshold;
randomly sampling a number of characters, judging the accuracy of the minimum enclosing boxes under the corresponding thresholds, and taking the threshold with the highest accuracy as the theta threshold.
2. The method for deep learning text character detection based on image moment correction as claimed in claim 1, wherein,
The data set in step A mainly comprises data from ICDAR2017, ICDAR2019 and CTW, and the samples randomly drawn from the data set are pre-labeled with a public character segmentation model trained with EasyOCR.
3. The method for deep learning text character detection based on image moment correction as claimed in claim 1, wherein,
The pre-labeling inaccuracy in step B specifically refers to over-segmentation or under-segmentation of the character box;
over-segmentation means that the box does not enclose the whole of the current character, and under-segmentation means that the box contains characters or symbols other than the current character.
4. The method for deep learning text character detection based on image moment correction as claimed in claim 1, wherein,
In step B, the box is mapped onto a two-dimensional Gaussian map by perspective transformation to generate the Gaussian heat-map label.
5. The method for deep learning text character detection based on image moment correction as claimed in claim 1, wherein,
The specific operation of determining the neural network structure in step C is as follows:
the network takes samples of a preset size as input, uses a VGG16 backbone as the feature extraction network and U-net as the decoding network, and outputs a pixel score matrix representing the confidence region;
the loss function loss_cross in step C is determined as follows:
loss_cross is a pixel-level cross-entropy loss, i.e. a theta threshold is set on the label heat map, pixels above the threshold are considered character regions (class 1), and pixels below it non-character regions (class 0).
6. The method for detecting deep learning text characters based on image moment correction according to any one of claims 1 to 5, wherein,
The training sample set of actual scenes in step E is expanded by taking random screenshots, or photographs at different angles, of computer-screen interfaces containing documents, pre-labeling them with the pre-trained model, and manually correcting them in the manner of step B.
7. The method for deep learning text character detection based on image moment correction as claimed in claim 6, wherein,
The samples used in the model test and verification of step H are characters in text scenes from photographs or screenshots of arbitrarily chosen computer documents.
CN202011506599.8A 2020-12-18 2020-12-18 Deep learning text character detection method based on image moment correction Active CN112580507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011506599.8A CN112580507B (en) 2020-12-18 2020-12-18 Deep learning text character detection method based on image moment correction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011506599.8A CN112580507B (en) 2020-12-18 2020-12-18 Deep learning text character detection method based on image moment correction

Publications (2)

Publication Number Publication Date
CN112580507A (en) 2021-03-30
CN112580507B (en) 2024-05-31

Family

ID=75136268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011506599.8A Active CN112580507B (en) 2020-12-18 2020-12-18 Deep learning text character detection method based on image moment correction

Country Status (1)

Country Link
CN (1) CN112580507B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221867A (en) * 2021-05-11 2021-08-06 北京邮电大学 Deep learning-based PCB image character detection method
CN113313720B (en) * 2021-06-30 2024-03-29 上海商汤科技开发有限公司 Object segmentation method and device
CN113743416B (en) * 2021-08-24 2024-03-05 的卢技术有限公司 Data enhancement method for non-real sample situation in OCR field
CN114579046B (en) * 2022-01-21 2024-01-02 南华大学 Cloud storage similar data detection method and system
CN117649672B (en) * 2024-01-30 2024-04-26 湖南大学 Font type visual detection method and system based on active learning and transfer learning

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899821A (en) * 2015-05-27 2015-09-09 Hefei High Dimensional Data Technology Co., Ltd. Method for erasing visible watermark of document image
WO2017185257A1 (en) * 2016-04-27 2017-11-02 Beijing Zhongke Cambricon Technology Co., Ltd. Device and method for performing Adam gradient descent training algorithm
RU2656708C1 (en) * 2017-06-29 2018-06-06 Samsung Electronics Co., Ltd. Method for separating texts and illustrations in images of documents using a descriptor of document spectrum and two-level clustering
CN108399421A (en) * 2018-01-31 2018-08-14 Nanjing University of Posts and Telecommunications A deep zero-shot classification method based on word embedding
EP3422254A1 (en) * 2017-06-29 2019-01-02 Samsung Electronics Co., Ltd. Method and apparatus for separating text and figures in document images
EP3499457A1 (en) * 2017-12-15 2019-06-19 Samsung Display Co., Ltd. System and method of defect detection on a display
CN110717492A (en) * 2019-10-16 2020-01-21 University of Electronic Science and Technology of China Method for correcting direction of character string in drawing based on joint features
WO2020046960A1 (en) * 2018-08-31 2020-03-05 Alibaba Group Holding Limited System and method for optimizing damage detection results
CN111079638A (en) * 2019-12-13 2020-04-28 Hebei Aier Industrial Internet Technology Co., Ltd. Target detection model training method, device and medium based on convolutional neural network
CN111222434A (en) * 2019-12-30 2020-06-02 Shenzhen Aixiesheng Technology Co., Ltd. Method for forensics of synthesized face images based on local binary patterns and deep learning
CN111553346A (en) * 2020-04-26 2020-08-18 Foshan Nanhai Guangdong University of Technology CNC Equipment Collaborative Innovation Institute Scene text detection method based on character region perception

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7958068B2 (en) * 2007-12-12 2011-06-07 International Business Machines Corporation Method and apparatus for model-shared subspace boosting for multi-label classification


Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
A Segmentation Algorithm for Touching Character Based on the Invariant Moments and Profile Feature; Junming Chang et al.; 2012 International Conference on Control Engineering and Communication Technology; 2012; 188-191 *
Promising Techniques for Anomaly Detection on Network Traffic; Tian, H. et al.; Computer Science and Information Systems; Nov. 2017; vol. 14, no. 3; 597-609 *
A natural scene text detection algorithm based on image moments and texture features; Yang Lingling et al.; Journal of Chinese Computer Systems; Jun. 2016; vol. 37, no. 6; 1313-1317 *
A news video text region detection and localization algorithm based on multi-scale image fusion; Zhang Hui et al.; Journal of Guizhou University (Natural Sciences); Dec. 2012; vol. 29, no. 6; 86-90 *
License plate character recognition based on stacked denoising autoencoder neural networks; Jia Wenqi et al.; Computer Engineering and Design; Mar. 2016; vol. 37, no. 3; 751-756 *
Food label text detection based on semantic segmentation; Tian Xuan et al.; Transactions of the Chinese Society for Agricultural Machinery; Aug. 2020; vol. 51, no. 8; 336-343 *
Research on copyright protection of computer games; Tian Hui; China Doctoral Dissertations Full-text Database, Social Sciences I; Sep. 2019; no. 9 (2019); G117-3 *

Also Published As

Publication number Publication date
CN112580507A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN112580507B (en) Deep learning text character detection method based on image moment correction
CN111723585B (en) Style-controllable image text real-time translation and conversion method
CN111325203B (en) American license plate recognition method and system based on image correction
WO2023134073A1 (en) Artificial intelligence-based image description generation method and apparatus, device, and medium
US11475681B2 (en) Image processing method, apparatus, electronic device and computer readable storage medium
CN103049763B (en) Context-constraint-based target identification method
CN110647829A (en) Bill text recognition method and system
CN113673338B (en) Automatic labeling method, system and medium for weak supervision of natural scene text image character pixels
CN108509881A (en) A kind of the Off-line Handwritten Chinese text recognition method of no cutting
CN110969129B (en) End-to-end tax bill text detection and recognition method
CN111368846B (en) Road ponding identification method based on boundary semantic segmentation
CN111914698B (en) Human body segmentation method, segmentation system, electronic equipment and storage medium in image
CN112287941B (en) License plate recognition method based on automatic character region perception
CN113158977B (en) Image character editing method for improving FANnet generation network
CN111738055A (en) Multi-class text detection system and bill form detection method based on same
CN112069900A (en) Bill character recognition method and system based on convolutional neural network
CN111523622B (en) Method for simulating handwriting by mechanical arm based on characteristic image self-learning
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN112070174A (en) Text detection method in natural scene based on deep learning
CN112070040A (en) Text line detection method for video subtitles
CN111507337A (en) License plate recognition method based on hybrid neural network
CN114612732A (en) Sample data enhancement method, system and device, medium and target classification method
CN114943888B (en) Sea surface small target detection method based on multi-scale information fusion
CN113361467A (en) License plate recognition method based on field adaptation
CN110991374B (en) Fingerprint singular point detection method based on RCNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant