CN111062386A - Natural scene text detection method based on depth pyramid attention and feature fusion - Google Patents


Info

Publication number
CN111062386A
Authority
CN
China
Prior art keywords
feature
network
depth
conv5
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911192949.5A
Other languages
Chinese (zh)
Other versions
CN111062386B (en)
Inventor
贾世杰 (Jia Shijie)
冯宇静 (Feng Yujing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wonderroad Magnesium Technology Co Ltd
Original Assignee
Beijing Wonderroad Magnesium Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wonderroad Magnesium Technology Co Ltd filed Critical Beijing Wonderroad Magnesium Technology Co Ltd
Priority to CN201911192949.5A priority Critical patent/CN111062386B/en
Publication of CN111062386A publication Critical patent/CN111062386A/en
Application granted granted Critical
Publication of CN111062386B publication Critical patent/CN111062386B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/60: Type of objects
    • G06V 20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 20/63: Scene text, e.g. street names
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a natural scene text detection method based on depth pyramid attention and feature fusion, namely a natural scene text detection algorithm that combines a depth pyramid attention network with feature fusion. It addresses two problems: first, that otherwise well-designed models cannot be fully exploited, which limits overall performance; and second, that long-range dependencies vanish as convolution deepens, because the convolution operation is based on local receptive fields. By combining feature fusion with the depth pyramid attention model, the method makes better use of the model, overcoming the shortcoming that many existing text detection models are structurally well designed yet under-utilized, and mitigating the loss of long-range dependencies caused by local receptive fields.

Description

Natural scene text detection method based on depth pyramid attention and feature fusion
Technical Field
The invention relates to a natural scene text detection method, and in particular to a natural scene text detection algorithm combining a depth pyramid attention network with a feature fusion technique.
Background
With the development of science and technology, demand for Internet products keeps growing, and more and more applications need the textual information contained in images. To recognize the text content of an image completely, text detection is the first and an extremely important step, and it directly affects the performance of text recognition.
Text detection in natural scenes must overcome background interference, highly variable aspect ratios, arbitrary text orientations, and the difficulty posed by small text, and is currently one of the most challenging topics in computer vision. By feature extraction approach, natural scene text detection divides into traditional methods and deep-learning-based methods. Scene pictures differ from document pictures in that they contain complex backgrounds and varying text angles, so traditional natural scene text detection methods alone can hardly separate text from background. Deep-learning text detection in natural scenes currently falls into two main categories: region-proposal-based methods and image-segmentation-based methods. Analysis of both shows that most models lack feature-level balancing, so the originally well-designed models cannot be fully exploited and overall performance is limited.
To make fuller use of the model, the invention proposes a new network that overcomes this under-utilization of otherwise well-designed models, and also addresses the loss of long-range dependencies that arises, as convolution deepens, from the local receptive field of the convolution operation.
Disclosure of Invention
The invention provides a natural scene text detection algorithm combining a depth pyramid attention network with feature fusion, solving the problem that an otherwise well-designed model cannot be fully exploited, which limits overall performance.
The technical scheme of the invention is as follows:
a natural scene text detection method based on depth pyramid attention and feature fusion comprises the following steps:
Step one, taking a public natural scene text data set as training samples;
Step two, feeding the training samples into a primary feature extraction network (the feature extraction network of PixelLink) in batches of 8 pictures; the backbone is a VGG16 network with a U-Net structure. The top-down path uses the VGG16 network, a deep network consisting of a series of 3 × 3 convolutions and max pooling. Stacking several small convolutions has two advantages over a single larger convolution kernel: fewer parameters and more non-linear transformations.
The bottom-up path is the upsampling stage; the upsampling uses bilinear interpolation.
To prevent the context information from being lost when the feature maps output by VGG16 are upsampled directly, lateral connections are employed: feature maps of the same spatial size from the top-down and bottom-up paths are fused, complementing the lost information and strengthening the feature representation after upsampling (a sketch of one such fusion step follows).
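Purely as an illustration, one such lateral fusion step might be sketched in TensorFlow 1.x (the framework used in the experiments below); the function name and the choice of element-wise addition as the fusion operation are assumptions of this sketch, not the literal implementation:

```python
import tensorflow as tf

def lateral_fuse(decoder_feat, encoder_feat):
    """One bottom-up step of the U-Net structure: bilinearly upsample
    the deeper decoder feature map to the spatial size of the encoder
    feature map from the top-down path, then fuse the two same-size
    maps to complement the context lost by direct upsampling.
    Assumes matching channel counts."""
    target_size = tf.shape(encoder_feat)[1:3]            # (H, W) to match
    upsampled = tf.image.resize_bilinear(decoder_feat, target_size)
    return upsampled + encoder_feat                      # lateral fusion
```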
Step three, taking the 4 feature map layers produced by the PixelLink feature extraction network, h4, h3, h2 and h1, upsampling all four to the size of h4 and averaging their pixel values, with the number of channels unchanged; this is called feature fusion. The upsampling uses bilinear interpolation. The feature fusion formula is:
F = (h4 + Up×2(h3) + Up×4(h2) + Up×4(h1)) / 4    (1)

where Up×2(·) and Up×4(·) denote 2-fold and 4-fold enlargement, respectively;
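A minimal sketch of equation (1), assuming TensorFlow 1.x and that the four feature maps share the same channel count (as the text requires):

```python
import tensorflow as tf

def fuse_features(h4, h3, h2, h1):
    """Equation (1): bilinearly upsample h3 (x2), h2 (x4) and h1 (x4)
    to the spatial size of h4 and average the pixel values; the
    channel dimension is unchanged."""
    size = tf.shape(h4)[1:3]                       # e.g. 64 x 64
    h3_up = tf.image.resize_bilinear(h3, size)     # Up_x2(h3)
    h2_up = tf.image.resize_bilinear(h2, size)     # Up_x4(h2)
    h1_up = tf.image.resize_bilinear(h1, size)     # Up_x4(h1)
    return (h4 + h3_up + h2_up + h1_up) / 4.0      # F
```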
Step four, taking the output of the feature fusion as the input of the depth pyramid attention model, which further refines the features so that the model is used more fully;
The depth pyramid attention model consists of three branches: a depth feature pyramid network branch, a non-linear transformation branch, and a global average pooling branch. The invention does not simply add up the extracted information; the depth feature pyramid network branch performs a refinement. The branch uses 2 convolutions of 7 × 7, 2 convolutions of 5 × 5, and 2 convolutions of 3 × 3 to extract information at different pyramid scales; convolutions with the same kernel size are connected in series, while convolutions with different kernel sizes are connected in parallel. The left-half Conv7 × 7, BN, ReLU is labeled Conv7_1 and the right-half Conv7 × 7, BN is labeled Conv7_2; similarly, the left-half Conv5 × 5, BN, ReLU is labeled Conv5_1, the right-half Conv5 × 5, BN is labeled Conv5_2, the left-half Conv3 × 3, BN, ReLU is labeled Conv3_1, and the right-half Conv3 × 3, BN is labeled Conv3_2. The refinement proceeds as follows: the feature map after feature fusion is processed by Conv7_1, Conv5_1, Conv3_1 and Conv3_2 respectively. The feature map of Conv3_2 is then upsampled, added pixel-wise to the feature map of Conv5_1, and the sum is input to Conv5_2. Finally, the feature map of Conv5_2 is upsampled, added pixel-wise to the feature map of Conv7_1, and the sum is input to Conv7_2. The upsampling uses deconvolution with a 4 × 4 kernel and stride 2, followed by BN and a ReLU activation (a hedged sketch of this branch follows);
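As an illustration, a hedged TensorFlow 1.x sketch of the depth feature pyramid network branch is given below. The patent fixes the kernel sizes, the labeled left/right-half blocks, the pixel-wise additions, and the 4 × 4 stride-2 deconvolution, but not the channel counts or the scale arrangement; this sketch assumes 128 filters and stride-2 pyramid convolutions (so that the specified deconvolutions have coarser scales to restore), in the spirit of feature pyramid attention. Function names are illustrative; BN training flags are omitted for brevity.

```python
import tensorflow as tf

def conv_bn(x, filters, kernel, strides=1, relu=True):
    """Conv + BN, optionally followed by ReLU (the left-half blocks
    Conv7_1/Conv5_1/Conv3_1 use ReLU; the right-half blocks use BN only)."""
    x = tf.layers.conv2d(x, filters, kernel, strides=strides, padding='same')
    x = tf.layers.batch_normalization(x)
    return tf.nn.relu(x) if relu else x

def up_x2(x, filters):
    """Upsampling as specified: 4x4 deconvolution, stride 2, BN, ReLU."""
    x = tf.layers.conv2d_transpose(x, filters, 4, strides=2, padding='same')
    return tf.nn.relu(tf.layers.batch_normalization(x))

def depth_feature_pyramid(f, filters=128):
    # Coarsening path (stride 2 per level is an assumption of this sketch).
    c7_1 = conv_bn(f, filters, 7, strides=2)            # Conv7_1
    c5_1 = conv_bn(c7_1, filters, 5, strides=2)         # Conv5_1
    c3_1 = conv_bn(c5_1, filters, 3, strides=2)         # Conv3_1
    c3_2 = conv_bn(c3_1, filters, 3, relu=False)        # Conv3_2
    # Refining path: upsample, add pixel values, feed the right-half conv.
    c5_2 = conv_bn(c5_1 + up_x2(c3_2, filters), filters, 5, relu=False)  # Conv5_2
    c7_2 = conv_bn(c7_1 + up_x2(c5_2, filters), filters, 7, relu=False)  # Conv7_2
    return up_x2(c7_2, filters)  # restore the resolution of F (assumption)
```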
Step five, inputting the refined feature map into the PixelLink output network;
The PixelLink output network consists mainly of two parts: the first part predicts whether each pixel is text; the second part predicts whether the pixel and its 8 surrounding pixels belong to the same text instance. Positive pixels are joined through positive links into connected components, each component being one text instance (a sketch of the two prediction heads follows);
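A sketch of the two prediction heads is given below; the 1 × 1 convolutions and the channel layout (2 channels for text/non-text, 8 × 2 channels for the links) follow the original PixelLink paper and are assumed here rather than spelled out in the patent text:

```python
import tensorflow as tf

def pixellink_output(feat):
    """Two heads on the refined feature map:
    1) pixel head: is each pixel text or non-text (2-way softmax);
    2) link head: for each of the 8 neighbours, does the neighbour
       belong to the same text instance (2-way softmax per direction)."""
    pixel_logits = tf.layers.conv2d(feat, 2, 1)          # text / non-text
    link_logits = tf.layers.conv2d(feat, 16, 1)          # 8 directions x 2
    pixel_scores = tf.nn.softmax(pixel_logits)
    s = tf.shape(link_logits)
    link_scores = tf.nn.softmax(
        tf.reshape(link_logits, [s[0], s[1], s[2], 8, 2]))
    return pixel_scores, link_scores
```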
Step six, finally, obtaining the final connected domains from the segmented text instances via minAreaRect in OpenCV's connected-domain method; connected regions whose shortest side is under 10 pixels or whose area is under 300 pixels are treated as false detections and automatically filtered out, and the bounding boxes are output at the end (a post-processing sketch follows).
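A post-processing sketch with OpenCV follows; the instance-mask input format and the helper name are assumptions, while the minAreaRect call and the thresholds (shortest side under 10 pixels, area under 300 pixels) come from the text. Note that cv2.boxPoints is the OpenCV 3 name; OpenCV 2 exposes it as cv2.cv.BoxPoints.

```python
import cv2
import numpy as np

def extract_boxes(instance_mask, min_side=10, min_area=300):
    """For each segmented text instance, take the minimum-area rotated
    rectangle and drop it as a false detection when its shortest side
    is under 10 pixels or its area under 300 pixels."""
    boxes = []
    for label in np.unique(instance_mask):
        if label == 0:                     # 0 is assumed to be background
            continue
        ys, xs = np.where(instance_mask == label)
        points = np.stack([xs, ys], axis=1).astype(np.float32)
        (cx, cy), (w, h), angle = cv2.minAreaRect(points)
        if min(w, h) < min_side or w * h < min_area:
            continue                       # filtered as false detection
        boxes.append(cv2.boxPoints(((cx, cy), (w, h), angle)))
    return boxes                           # each entry: 4 corner points
```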
The invention has the beneficial effects that:
(1) Feature fusion and the depth pyramid attention model together improve the utilization of the model, overcoming the shortcoming that many existing text detection models are structurally well designed yet cannot be fully exploited, which limits overall performance.
(2) The problem that long-range dependencies disappear as convolution deepens, because the convolution operation is based on local receptive fields, is alleviated.
(3) The method is effective for multi-scale text.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the overall network architecture of the present invention.
FIG. 3 is a partial schematic diagram of a depth pyramid attention network structure.
Detailed Description
The following further describes a specific embodiment of the present invention with reference to the drawings and technical solutions.
As shown in fig. 1, the following steps are specifically described:
Step one, taking the training split of a public natural scene text data set as training samples;
Step two, using the feature extraction network of PixelLink as the primary feature extraction network; the backbone is a VGG16 network with a U-Net structure.
The U-Net consists of a top-down path, a bottom-up path, and lateral connections.
(1) The top-down path uses the VGG16 network, a deep network consisting of a series of 3 × 3 convolutions and max pooling. Stacking several small convolutions requires fewer parameters and provides more non-linear transformations than a single larger convolution kernel.
(2) The bottom-up path is the upsampling stage; the upsampling uses bilinear interpolation.
(3) To prevent the context information from being lost when the feature maps output by VGG16 are upsampled directly, lateral connections are employed: feature maps of the same spatial size from the top-down and bottom-up paths are fused, complementing the lost information and strengthening the feature representation after upsampling.
Step three, taking the 4 feature map layers produced by the PixelLink feature extraction network, h4, h3, h2 and h1, upsampling them to the size of h4 and averaging their pixel values, with the number of channels unchanged; this process is called feature fusion. The upsampling uses bilinear interpolation. The feature fusion formula is:
F = (h4 + Up×2(h3) + Up×4(h2) + Up×4(h1)) / 4    (1)

where Up×2(·) and Up×4(·) denote 2-fold and 4-fold enlargement, respectively;
(1) Owing to hardware constraints, the training pictures are 256 × 256; accordingly h4 is 64 × 64, h3 is 32 × 32, h2 is 16 × 16, and h1 is 16 × 16.
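These sizes are consistent with the downsampling strides at which PixelLink taps its feature maps; the quick check below assumes strides of 4, 8, 16 and 16 for h4, h3, h2 and h1 respectively (h1 and h2 share stride 16 because PixelLink's converted VGG16 does not downsample after conv5; this reading is an assumption):

```python
# Spatial size of each feature map layer for a 256 x 256 input,
# given the assumed VGG16/PixelLink downsampling strides.
input_size = 256
strides = {'h4': 4, 'h3': 8, 'h2': 16, 'h1': 16}
for name, stride in strides.items():
    print(name, input_size // stride)   # h4 64, h3 32, h2 16, h1 16
```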
Step four, taking the output of the feature fusion as the input of the depth pyramid attention network, further refining the features and using the model more fully.
(1) The depth pyramid attention network consists of a depth feature pyramid network branch, a non-linear transformation branch, and a global average pooling branch. The design effort concentrates on the depth feature pyramid network branch: the branch features are not merely fused; each part within the branch refines the features further.
Step five, inputting the refined feature map into the PixelLink output network.
(1) The output network consists mainly of two parts. The first part predicts whether each pixel is text or non-text; the second part predicts whether the pixel and its 8 surrounding pixels belong to the same text instance. Positive pixels are joined through positive links into connected components, each component being one text instance.
Step six, finally, obtaining the final connected domains from the segmented text instances via minAreaRect in OpenCV's connected-domain method. Because this method is sensitive to noise and may predict noise as real text, thresholds are set to reduce false positives: connected regions whose shortest side is under 10 pixels or whose area is under 300 pixels are treated as false detections and automatically filtered out, and the bounding boxes are output at the end.
The invention is characterized in that, by combining feature fusion with the depth pyramid attention model, the utilization of the model is improved, overcoming the shortcomings that many existing text detection models are structurally well designed yet cannot be fully exploited, and that the convolution operation, being based on local receptive fields, loses long-range dependencies as convolution deepens.
The embodiments of the present invention are described in detail below with reference to the accompanying drawings. They are implemented on the premise of the technical solution of the invention and give detailed implementations and concrete operating procedures, but the scope of the invention is not limited to the following examples.
The experiments use the ICDAR2015 and ICDAR2013 data sets. The ICDAR2015 data set contains 1500 natural scene pictures at a resolution of 1280 × 720, of which 1000 are training pictures and 500 are test pictures. They differ from the images of earlier ICDAR competitions in that they were captured mainly with Google Glass, without deliberate framing, so the text may be tilted or blurred; the aim is to increase the difficulty of detection.
ICDAR2013 contains 229 training pictures and 233 test pictures. This data set is a subset of ICDAR2011 in which duplicate pictures were removed and labeling errors were corrected. It is widely used for text detection but contains only horizontal text.
The experiments were run on a computer with an Intel(R) Core i7-6700 CPU @ 3.40 GHz under Linux Ubuntu 14.04, using PyCharm and Python 2.7. The deep learning framework is tensorflow-gpu == 1.3.0, and the main libraries required are OpenCV 2, setprogram and matplotlib.
ICDAR2015 experiment: the training pictures from the ICDAR2015 data set are input at 256 × 256, and the test pictures are evaluated at 1280 × 704. The evaluation uses the R (recall), P (precision) and F values published for the ICDAR2015 challenge; the sketch below shows how F is derived from P and R.
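For reference, the F value is the harmonic mean of precision P and recall R; a one-line check reproduces the tabulated numbers:

```python
def f_value(p, r):
    """ICDAR-style F value: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Model of the invention on ICDAR2015 (Table 1): P = 0.7595, R = 0.7708
print(round(f_value(0.7595, 0.7708), 4))   # -> 0.7651
```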
Table 1 lists the R, P, F values of the model of the invention and of PixelLink on the ICDAR2015 data set:

TABLE 1  ICDAR2015 multi-oriented text detection results

Model                    Recall    Precision    F value
Model of the invention   0.7708    0.7595       0.7651
PixelLink                0.7299    0.7607       0.7450
ICDAR2013 experiment: the training pictures from the ICDAR2013 data set are input at 256 × 256, and the test pictures are evaluated at 384 × 384. The evaluation uses the R, P, F values of the protocol published for the ICDAR2013 challenge.
Table 2 lists the R, P, F values of the model of the invention and of PixelLink on the ICDAR2013 data set:

TABLE 2  ICDAR2013 horizontal text detection results

Model                    Recall    Precision    F value
Model of the invention   0.8168    0.7041       0.7563
PixelLink                0.6919    0.7508       0.7201

Claims (1)

1. A natural scene text detection method based on depth pyramid attention and feature fusion is characterized by comprising the following steps:
step one, taking a public natural scene text data set as training samples;
step two, inputting the training samples into a primary feature extraction network in batches of 8 pictures, wherein the backbone is a VGG16 network with a U-Net structure; the primary feature extraction network is the feature extraction network of PixelLink;
step three, taking the 4 feature map layers produced by the PixelLink feature extraction network, h4, h3, h2 and h1, upsampling them to the size of h4 and averaging their pixel values, with the number of channels unchanged, which is called feature fusion; the upsampling uses bilinear interpolation; the feature fusion formula is:
F = (h4 + Up×2(h3) + Up×4(h2) + Up×4(h1)) / 4    (1)

where Up×2(·) and Up×4(·) denote 2-fold and 4-fold enlargement, respectively;
step four, taking the output of the feature fusion as the input of the depth pyramid attention model, which further refines the features so that the model is used more fully;
the depth pyramid attention model consists of three branches: a depth feature pyramid network branch, a non-linear transformation branch and a global average pooling branch; the depth feature pyramid network branch uses 2 convolutions of 7 × 7, 2 convolutions of 5 × 5 and 2 convolutions of 3 × 3 so as to extract information at different pyramid scales; convolutions with the same kernel size are connected in series, and convolutions with different kernel sizes are connected in parallel; the left-half Conv7 × 7, BN, ReLU is marked as Conv7_1 and the right-half Conv7 × 7, BN is marked as Conv7_2; similarly, the left-half Conv5 × 5, BN, ReLU is marked as Conv5_1, the right-half Conv5 × 5, BN is marked as Conv5_2, the left-half Conv3 × 3, BN, ReLU is marked as Conv3_1, and the right-half Conv3 × 3, BN is marked as Conv3_2; the refinement proceeds as follows: the feature map after feature fusion is processed by Conv7_1, Conv5_1, Conv3_1 and Conv3_2 respectively; the feature map of Conv3_2 is then upsampled, added pixel-wise to the feature map of Conv5_1, and the sum is input to Conv5_2; finally, the feature map of Conv5_2 is upsampled, added pixel-wise to the feature map of Conv7_1, and the sum is input to Conv7_2; the upsampling uses deconvolution with a 4 × 4 kernel and stride 2, followed by BN and ReLU activation;
step five, inputting the refined feature map into the PixelLink output network;
the PixelLink output network consists of two parts: the first part predicts whether each pixel is text; the second part predicts whether the pixel and its 8 surrounding pixels belong to the same text instance; positive pixels are joined through positive links into connected components, each component being one text instance;
step six, finally, obtaining the final connected domains from the segmented text instances via minAreaRect in OpenCV's connected-domain method; connected regions whose shortest side is under 10 pixels or whose area is under 300 pixels are treated as false detections and automatically filtered out, and the bounding boxes are output at the end.
CN201911192949.5A 2019-11-28 2019-11-28 Natural scene text detection method based on depth pyramid attention and feature fusion Active CN111062386B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911192949.5A CN111062386B (en) 2019-11-28 2019-11-28 Natural scene text detection method based on depth pyramid attention and feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911192949.5A CN111062386B (en) 2019-11-28 2019-11-28 Natural scene text detection method based on depth pyramid attention and feature fusion

Publications (2)

Publication Number Publication Date
CN111062386A 2020-04-24
CN111062386B (en) 2023-12-29

Family

ID=70299270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911192949.5A Active CN111062386B (en) 2019-11-28 2019-11-28 Natural scene text detection method based on depth pyramid attention and feature fusion

Country Status (1)

Country Link
CN (1) CN111062386B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130204A1 (en) * 2017-10-31 2019-05-02 The University Of Florida Research Foundation, Incorporated Apparatus and method for detecting scene text in an image
WO2019192397A1 (en) * 2018-04-04 2019-10-10 Huazhong University of Science and Technology End-to-end recognition method for scene text in any shape
CN109325534A (en) * 2018-09-22 2019-02-12 Tianjin University A kind of semantic segmentation method based on two-way multi-Scale Pyramid
CN110097049A (en) * 2019-04-03 2019-08-06 Institute of Computing Technology, Chinese Academy of Sciences A kind of natural scene Method for text detection and system
CN110287960A (en) * 2019-07-02 2019-09-27 Institute of Information Engineering, Chinese Academy of Sciences The detection recognition method of curve text in natural scene image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
乔文凡; 慎利; 戴延帅; 曹云刚: "Automatic building identification in high-resolution images combining dilated-convolution residual networks and pyramid pooling" (联合膨胀卷积残差网络和金字塔池化表达的高分影像建筑物自动识别), Geography and Geo-Information Science (地理与地理信息科学), no. 05 *
常宇飞; 陈欣鹏; 王远航; 钱冰: "Scene text detection based on feature pyramids" (基于特征金字塔的场景文本检测), Journal of Information Engineering University (信息工程大学学报), no. 05 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753714B (en) * 2020-06-23 2023-09-01 Central South University Multidirectional natural scene text detection method based on character segmentation
CN111753714A (en) * 2020-06-23 2020-10-09 Central South University Multidirectional natural scene text detection method based on character segmentation
CN111898570A (en) * 2020-08-05 2020-11-06 Yancheng Institute of Technology Method for recognizing text in image based on bidirectional feature pyramid network
CN112257708A (en) * 2020-10-22 2021-01-22 Runlian Software *** (Shenzhen) Co., Ltd. Character-level text detection method and device, computer equipment and storage medium
CN112613561A (en) * 2020-12-24 2021-04-06 Harbin University of Science and Technology EAST algorithm optimization method
CN113744279A (en) * 2021-06-09 2021-12-03 Northeastern University Image segmentation method based on FAF-Net network
CN113744279B (en) * 2021-06-09 2023-11-14 Northeastern University Image segmentation method based on FAF-Net network
CN113609892A (en) * 2021-06-16 2021-11-05 Beijing University of Technology Handwritten poetry recognition method integrating deep learning with scenic spot knowledge map
CN113743291A (en) * 2021-09-02 2021-12-03 Nanjing University of Posts and Telecommunications Method and device for detecting text in multiple scales by fusing attention mechanism
CN113743291B (en) * 2021-09-02 2023-11-07 Nanjing University of Posts and Telecommunications Method and device for detecting texts in multiple scales by fusing attention mechanisms
CN115471831A (en) * 2021-10-15 2022-12-13 China University of Mining and Technology Image significance detection method based on text reinforcement learning
CN115471831B (en) * 2021-10-15 2024-01-23 China University of Mining and Technology Image saliency detection method based on text reinforcement learning
CN113822232A (en) * 2021-11-19 2021-12-21 Huazhong University of Science and Technology Pyramid attention-based scene recognition method, training method and device

Also Published As

Publication number Publication date
CN111062386B (en) 2023-12-29

Similar Documents

Publication Publication Date Title
CN111062386B (en) Natural scene text detection method based on depth pyramid attention and feature fusion
CN110428432B (en) Deep neural network algorithm for automatically segmenting colon gland image
WO2017148265A1 (en) Word segmentation method and apparatus
CN112232391B (en) Dam crack detection method based on U-net network and SC-SAM attention mechanism
CN110399840B (en) Rapid lawn semantic segmentation and boundary detection method
CN112767418B (en) Mirror image segmentation method based on depth perception
CN111275034B (en) Method, device, equipment and storage medium for extracting text region from image
Hou et al. BSNet: Dynamic hybrid gradient convolution based boundary-sensitive network for remote sensing image segmentation
CN112465759A (en) Convolutional neural network-based aeroengine blade defect detection method
CN114742799B (en) Industrial scene unknown type defect segmentation method based on self-supervision heterogeneous network
EP3895122A1 (en) Systems and methods for automated cell segmentation and labeling in immunofluorescence microscopy
CN110751154A (en) Complex environment multi-shape text detection method based on pixel-level segmentation
CN111986164A (en) Road crack detection method based on multi-source Unet + Attention network migration
Liu et al. Multi-component fusion network for small object detection in remote sensing images
Liu et al. Asflow: Unsupervised optical flow learning with adaptive pyramid sampling
Chen et al. A refined single-stage detector with feature enhancement and alignment for oriented objects
El Abbadi Scene Text detection and Recognition by Using Multi-Level Features Extractions Based on You Only Once Version Five (YOLOv5) and Maximally Stable Extremal Regions (MSERs) with Optical Character Recognition (OCR)
Gui et al. A fast caption detection method for low quality video images
CN110472490A (en) Based on the action identification method and device, storage medium and terminal for improving VGGNet
CN115565034A (en) Infrared small target detection method based on double-current enhanced network
Cloppet et al. Adaptive fuzzy model for blur estimation on document images
CN114332493A (en) Cross-dimension interactive significance detection model and detection method thereof
Callier et al. Automatic road area extraction from printed maps based on linear feature detection
CN112861860A (en) Natural scene lower word detection method based on upper and lower boundary extraction
Lad et al. LDWS-net: a learnable deep wavelet scattering network for RGB salient object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant