CN106650725A - Fully convolutional neural network-based candidate text box generation and text detection method

Info

Publication number: CN106650725A
Application number: CN201611070587.9A
Authority: CN
Inventors: 马景法; 金连文; 钟卓耀
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2016-11-29
Filing date: 2016-11-29
Publication date: 2017-05-10
Anticipated expiration: 2036-11-29
Also published as: CN106650725B

Abstract

The invention discloses a candidate text box generation and text detection method based on fully convolutional neural networks. The method comprises: generating text region candidate boxes, in which an inception-RPN takes a natural scene picture and a set of ground-truth bounding boxes marking the text regions as input and generates a controllable number of word region candidate boxes by sliding an inception network over the convolutional feature response map of a VGG16 model, assisted at each sliding position by a set of text-characteristic prior boxes; incorporating supervision information for an ambiguous text category, fusing multi-level region down-sampling information, and performing text detection; training the inception candidate box generation network and the text detection network end to end through back-propagation and stochastic gradient descent; and applying iterative candidate box voting to obtain a higher text recall in a complementary manner, and removing redundant detection boxes with a candidate box filtering algorithm. The method achieves accuracy rates of 0.83 and 0.85 on the ICDAR 2011 and 2013 robust text detection benchmark datasets, exceeding the previous best results.

Description

Candidate text box generation and text detection method based on fully convolutional neural networks

Technical field

The present invention relates to candidate text box generation and text detection in natural scene pictures, and more particularly to a candidate text box generation and text detection method based on fully convolutional neural networks.

Background technology

Text in images provides rich and accurate high-level semantic information, which is important for many potential applications such as scene understanding, image and video retrieval, and content-based recommendation systems. Text detection in natural scene pictures has therefore attracted considerable attention in the computer vision and image understanding communities. However, it remains a challenging and unsolved problem. First, the background of a text image is very complex, and regional components such as symbols, signs, bricks, and grass are difficult to distinguish from text. In addition, confounding factors such as uneven illumination, strong exposure, low contrast, blur, and low resolution make the text detection task considerably harder.

Summary of the invention

To overcome the deficiencies of the prior art, the present invention proposes a candidate text box generation and text detection method based on fully convolutional neural networks.

The technical solution of the present invention is realized as follows:

A candidate text box generation and text detection method based on fully convolutional neural networks, comprising the steps of:

S1: Generate text region candidate boxes: an inception-RPN takes a natural scene picture and a set of ground-truth bounding boxes marking the text regions as input and produces a controllable number of word region candidate boxes, by sliding an inception network over the convolutional feature response map of a VGG16 model, assisted at each sliding position by a set of text-characteristic prior boxes;

S2: Incorporate supervision information for an ambiguous text category, fuse multi-level region down-sampling information, and perform text detection;

S3: Train the inception candidate box generation network and the text detection network end to end through back-propagation and stochastic gradient descent;

S4: Apply iterative candidate box voting to obtain a higher text recall in a complementary manner, and remove redundant detection boxes using a candidate box filtering algorithm.

Further, step S1 comprises the steps of:

S11: Design the text-characteristic prior boxes;

S12: Build the inception candidate box generation network.

Further, in step S11 there are 24 kinds of text-characteristic prior boxes in total: at each sliding position the sliding-window widths are set to 32, 48, 64 and 80, and the aspect ratios to 0.2, 0.5, 0.8, 1.0, 1.2 and 1.5.

Further, in step S12 the inception candidate box generation network consists of a 3*3 convolutional layer, a 5*5 convolutional layer and a 3*3 max-pooling layer, each connected to the corresponding spatial receptive field of the Conv5_3 feature response map that serves as input.

Further, in step S2 the text category supervision information is: a candidate box whose IoU overlap is greater than or equal to 0.5 is labelled as containing text; a candidate box whose IoU overlap is greater than or equal to 0.2 and less than 0.5 is labelled as "ambiguous text"; all others are labelled as containing no text.

Further, in step S2 the multi-level region down-sampling information is: multi-level region down-sampling is performed on the Conv4_3 and Conv5_3 convolutional feature response maps of the VGG16 network to obtain two pooled features of size 512*H*W, which are concatenated and then decoded by a 512*1*1 convolutional layer.

The beneficial effects of the present invention are as follows. Compared with the prior art, the present invention proposes an inception candidate box generation network, which applies sliding windows of different sizes on the convolutional feature map, assisted at each sliding position by a set of text-characteristic prior boxes, to generate word region candidate boxes. Sliding windows of different sizes retain local information at the corresponding position while also taking contextual information into account, which helps filter out candidate boxes that contain no text; the inception candidate box generation network achieves a very high recall using only a few hundred word candidate boxes. The present invention also introduces into the text detection network additional supervision information for an ambiguous text category and fuses multi-level region down-sampling information, both of which help the text detection network learn more discriminative information for distinguishing text from complex backgrounds. Furthermore, to make better use of the models produced during training, the present invention proposes an iterative candidate box voting scheme that obtains a higher word recall in a complementary manner; the filtering algorithm used by the present invention retains the best candidate boxes and removes the redundant ones.

Description of the drawings

Fig. 1 is a flow chart of the candidate text box generation and text detection method based on fully convolutional neural networks according to the present invention.

Fig. 2 shows examples of word region candidate boxes whose IoU overlap falls within specific intervals, according to one embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

Referring to Fig. 1, the candidate text box generation and text detection method based on fully convolutional neural networks according to the present invention comprises four steps: S1, text region candidate box generation; S2, text detection; S3, end-to-end learning and optimization; S4, heuristic processing.

The role of part S1 is as follows: the inception-RPN takes a natural scene picture and a set of ground-truth bounding boxes marking the text regions as input and produces a controllable number of word region candidate boxes. To search for word region candidate boxes, we slide an inception network over the convolutional feature response map of the VGG16 model, assisted at each sliding position by a set of text-characteristic prior boxes. This can be divided into two steps: (1) designing the text-characteristic prior boxes, and (2) building the inception candidate box generation network. At each sliding position, four scales (32, 48, 64 and 80) and six aspect ratios (0.2, 0.5, 0.8, 1.0, 1.2 and 1.5) are used, giving k = 24 prior sliding windows in total. In the learning stage, a prior box whose intersection over union (IoU) with a ground-truth text box is greater than 0.5 is assigned the text label, whereas one whose IoU is less than 0.3 is assigned the background label. The inception candidate box generation network is designed with a 3*3 convolutional layer, a 5*5 convolutional layer and a 3*3 max-pooling layer, each connected to the corresponding spatial receptive field of the Conv5_3 feature response map that serves as input. In addition, to reduce the dimension, a 1*1 convolution is applied on top of the 3*3 max-pooling layer. We then concatenate the features of each part along the channel axis, and the resulting 640-dimensional feature vector is fed into two output layers: a classification layer that predicts a score for whether the region contains text, and a regression layer that refines the text region location for each kind of prior window at each sliding position.
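The prior-box design and the training labels described above can be illustrated with a minimal Python sketch. It assumes that the aspect ratio means height divided by width and that prior boxes with an IoU between 0.3 and 0.5 are simply left out of training; the text states neither point. The function names and the example ground-truth box are hypothetical.

```python
from itertools import product

WIDTHS = (32, 48, 64, 80)                       # sliding-window widths from the text
ASPECT_RATIOS = (0.2, 0.5, 0.8, 1.0, 1.2, 1.5)  # aspect ratios from the text


def prior_boxes(cx, cy):
    """Return the k = 24 prior boxes (x1, y1, x2, y2) centred on (cx, cy)."""
    boxes = []
    for w, r in product(WIDTHS, ASPECT_RATIOS):
        h = w * r  # assumption: aspect ratio = height / width
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def rpn_label(box, gt_boxes):
    """1 = text, 0 = background, None = not used for training (assumption)."""
    best = max((iou(box, g) for g in gt_boxes), default=0.0)
    if best > 0.5:
        return 1
    if best < 0.3:
        return 0
    return None


priors = prior_boxes(100, 100)
gt = [(68, 84, 132, 116)]  # hypothetical ground-truth box, 64 wide and 32 high
labels = [rpn_label(p, gt) for p in priors]
```

Here the prior box with width 64 and ratio 0.5 coincides with the hypothetical ground-truth box and receives the text label, while the smallest prior box overlaps it too little and is labelled background.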

Step S2 comprises: (1) incorporating supervision information for an ambiguous text category. This adds more reasonable supervision information, helps the classifier learn more discriminative features, identifies text regions against complex and diverse backgrounds, and filters out candidate boxes that contain no text. (2) Fusing multi-level region down-sampling information. This makes better use of multi-level convolutional features and enriches the discriminative information of each sliding window.

Much previous work on detection networks labels a candidate box as containing text if its IoU overlap is greater than 0.5, and as containing no text otherwise. This way of deciding whether a candidate box contains text is unreasonable, however, because a candidate box with an IoU overlap in the interval 0.2 to 0.5 may still contain partial or extended text information, as shown in Fig. 2. Such mixed label information can disturb the learning of the text versus non-text classification. We therefore propose the following: a candidate box whose IoU overlap is greater than or equal to 0.5 is labelled as containing text; a candidate box whose IoU overlap is greater than or equal to 0.2 and less than 0.5 is labelled as "ambiguous text"; all others are labelled as containing no text. This strategy provides more reasonable supervision information and helps the classifier learn more discriminative features, so that it can identify text against complex and diverse backgrounds and filter out candidate boxes that contain no text.
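The three-way labelling rule can be written as a small Python function. This is only a sketch: the thresholds come from the text, while the class indices and the function name are assumptions made for illustration.

```python
BACKGROUND, AMBIGUOUS_TEXT, TEXT = 0, 1, 2  # class indices are an assumption


def detection_label(iou):
    """Map a candidate box's IoU with the ground truth to one of three classes."""
    if iou >= 0.5:
        return TEXT
    if iou >= 0.2:
        return AMBIGUOUS_TEXT
    return BACKGROUND
```

For example, a candidate box with an IoU of 0.35 would be labelled as ambiguous text rather than being forced into the background class.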

To make better use of multi-level convolutional features and enrich the discriminative information of each candidate box, the present invention performs multi-level region down-sampling on the Conv4_3 and Conv5_3 convolutional feature response maps of the VGG16 network and obtains two pooled features of size 512*H*W. The concatenated features are then decoded by a 512*1*1 convolutional layer. This 1*1 convolutional layer serves to (1) combine the multi-level pooled features and learn the fusion weights during training, and (2) reduce the dimension to match the first fully connected layer of VGG16.
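The fusion step can be sketched with NumPy as follows. The region down-sampling itself is not implemented here; the two pooled features are replaced by random arrays, and the pooled size H = W = 7, the random weights and the function name are assumptions for illustration only. The sketch relies on the fact that a 1*1 convolution is a per-position linear map over channels.

```python
import numpy as np


def fuse_multilevel(pooled_conv4, pooled_conv5, weight, bias):
    """Concatenate two (C, H, W) pooled features and apply a 1x1 convolution."""
    stacked = np.concatenate([pooled_conv4, pooled_conv5], axis=0)  # (2C, H, W)
    # A 1x1 convolution mixes channels independently at every spatial position.
    return np.einsum("oc,chw->ohw", weight, stacked) + bias[:, None, None]


rng = np.random.default_rng(0)
H, W = 7, 7                                        # hypothetical pooled size
p4 = rng.standard_normal((512, H, W))              # stand-in for pooled Conv4_3
p5 = rng.standard_normal((512, H, W))              # stand-in for pooled Conv5_3
weight = rng.normal(0.0, 0.01, size=(512, 1024))   # 1x1 kernel: 1024 -> 512 channels
bias = np.zeros(512)

fused = fuse_multilevel(p4, p5, weight, bias)      # shape (512, 7, 7)
```

The output has 512 channels again, which is what allows it to be passed to layers that expect a single 512-channel pooled feature.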

Part S3: unlike the previously proposed four-step training strategy for combining RPN and Fast R-CNN, the present invention trains the inception candidate box generation network and the text detection network end to end by back-propagation and stochastic gradient descent. The shared convolutional layers are initialized from a pre-trained ImageNet classification network. The weights of the new layers are initialized from a Gaussian distribution with mean 0 and standard deviation 0.01. The base learning rate is 0.001 and is reduced to one tenth of its value every 40000 iterations. The momentum and weight decay are set to 0.9 and 0.0005, respectively.
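The training hyperparameters above amount to a step learning-rate schedule, which can be sketched in Python as follows. The constant and function names are hypothetical; only the numeric values come from the text.

```python
BASE_LR = 0.001             # base learning rate
STEP_SIZE = 40_000          # iterations between learning-rate reductions
GAMMA = 0.1                 # each reduction keeps one tenth of the rate
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005
NEW_LAYER_INIT_STD = 0.01   # std of the zero-mean Gaussian for new layers


def learning_rate(iteration):
    """Step schedule: the rate drops by a factor of 10 every 40000 iterations."""
    return BASE_LR * GAMMA ** (iteration // STEP_SIZE)
```

Under this schedule the rate stays at 0.001 for iterations 0 to 39999 and then drops by a further factor of ten at each later multiple of 40000.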

Both the inception candidate box generation network and the text detection network have two sibling output layers: a classification layer and a regression layer. Their output layers differ as follows. (1) In the inception candidate box generation network, each prior box is parameterized independently, so k = 24 prior candidate boxes must be predicted simultaneously: the classification layer outputs 2k scores indicating whether each candidate box contains text, while the regression layer outputs 4k values giving the offsets of the refined candidate boxes relative to the original ones. (2) The text detection network outputs three scores for each candidate box, corresponding to background, ambiguous text, and text, and its regression layer outputs 4 regression offsets for each text candidate box. During training we minimize the following multi-task loss function:

L(p,p^*,t,t^*)=L_cls(p,p^*)+λL_reg(t,t^*), (0.1)

The classification loss L_cls is the softmax loss, where p and p^* are the predicted and ground-truth labels, respectively. The regression loss L_reg is the smooth-L1 loss. In addition, t = {t_x, t_y, t_w, t_h} and t^* = {t^*_x, t^*_y, t^*_w, t^*_h} denote the regression offset vectors of the predicted and ground-truth boxes, respectively, where t^* is obtained from the following equations:
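The equations themselves do not appear in this text. Assuming the standard bounding-box parameterization used by R-CNN-style detectors, which is consistent with the centre, width and height symbols used for P and G, they would read:

```latex
t^*_x = \frac{G_x - P_x}{P_w}, \qquad
t^*_y = \frac{G_y - P_y}{P_h}, \qquad
t^*_w = \log\frac{G_w}{P_w}, \qquad
t^*_h = \log\frac{G_h}{P_h}
```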

Here, P = {P_x, P_y, P_w, P_h} and G = {G_x, G_y, G_w, G_h} denote the centre coordinates, width and height of the candidate box P and the ground-truth text box G, respectively. λ is the loss balancing parameter: we set λ = 3 in the inception candidate box generation network, so that it is biased towards better candidate box locations, and λ = 1 in the text detection network.
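The multi-task loss for a single candidate box can be sketched in plain Python as follows. It assumes the usual smooth-L1 definition (quadratic below an absolute difference of 1, linear above) summed over the four offsets, and applies the formula as written without any per-sample gating of the regression term; the function names are hypothetical.

```python
import math


def softmax_cross_entropy(scores, true_class):
    """Softmax loss for one candidate box, computed in a numerically stable way."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[true_class]


def smooth_l1(t, t_star):
    """Smooth-L1 loss summed over the four regression offsets."""
    total = 0.0
    for a, b in zip(t, t_star):
        d = abs(a - b)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def multitask_loss(scores, true_class, t, t_star, lam):
    """L = L_cls + lambda * L_reg for one candidate box."""
    return softmax_cross_entropy(scores, true_class) + lam * smooth_l1(t, t_star)
```

With lam=3 the regression term weighs three times as much as with lam=1, which is how the candidate box generation network is pushed towards better box locations.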

Part S4 comprises an iterative candidate box voting mechanism and a filtering algorithm. The iterative voting mechanism allows the present invention to obtain a higher text recall in a complementary manner and improves the performance of the text detection system. The filtering algorithm removes redundant detection boxes to improve accuracy.

The present invention first feeds a natural scene picture and a set of ground-truth text boxes into the inception candidate box generation network, which produces a number of word region candidate boxes. The resulting candidate boxes are then fed into a text detection network that performs text versus non-text classification and text localization; during training, this network additionally uses the supervision information for the ambiguous text category and fuses multi-level region down-sampling information. The whole system is trained end to end by back-propagation and gradient descent. To make full use of the intermediate models produced during training, the present invention uses the iterative candidate box voting mechanism to obtain a high recall of text instances in a complementary manner, improving the performance of the whole text detection system. Finally, the present invention applies a filtering algorithm that uses coordinate positions to find the inner and outer candidate boxes of each text instance, retaining the high-scoring candidate boxes and removing the low-scoring ones.
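The text gives no further detail on the filtering algorithm. One possible reading is sketched below in Python, under the assumption that "inner and outer" means that one box lies entirely inside another: of any such nested pair, only the higher-scoring box is kept. The function names and example detections are hypothetical.

```python
def contains(outer, inner):
    """True if box `inner` lies entirely inside box `outer` (x1, y1, x2, y2)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def filter_nested(detections):
    """Drop every box that is nested with a higher-scoring box.

    `detections` is a list of (box, score) pairs; ties keep the earlier entry.
    """
    keep = []
    for i, (box_i, score_i) in enumerate(detections):
        dominated = False
        for j, (box_j, score_j) in enumerate(detections):
            if i == j:
                continue
            nested = contains(box_i, box_j) or contains(box_j, box_i)
            if nested and (score_j > score_i or (score_j == score_i and j < i)):
                dominated = True
                break
        if not dominated:
            keep.append((box_i, score_i))
    return keep


dets = [
    ((10, 10, 100, 40), 0.9),   # outer box of a text instance
    ((20, 15, 60, 35), 0.6),    # inner box of the same instance, lower score
    ((200, 10, 260, 40), 0.8),  # separate text instance
]
kept = filter_nested(dets)
```

In this example the lower-scoring inner box is removed, and the two boxes that are not nested in any higher-scoring box are kept.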

The above is the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications are also regarded as falling within the protection scope of the present invention.

Claims

1. A candidate text box generation and text detection method based on fully convolutional neural networks, characterized by comprising the steps of:

S1: Generate text region candidate boxes: an inception-RPN takes a natural scene picture and a set of ground-truth bounding boxes marking the text regions as input and produces a controllable number of word region candidate boxes, by sliding an inception network over the convolutional feature response map of a VGG16 model, assisted at each sliding position by a set of text-characteristic prior boxes;

S2: Incorporate supervision information for an ambiguous text category, fuse multi-level region down-sampling information, and perform text detection;

S3: Train the inception candidate box generation network and the text detection network end to end through back-propagation and stochastic gradient descent;

S4: Apply iterative candidate box voting to obtain a higher text recall in a complementary manner, and remove redundant detection boxes using a candidate box filtering algorithm.

2. The candidate text box generation and text detection method based on fully convolutional neural networks as claimed in claim 1, characterized in that step S1 comprises the steps of:

S11: Design the text-characteristic prior boxes;

S12: Build the inception candidate box generation network.

3. The candidate text box generation and text detection method based on fully convolutional neural networks as claimed in claim 2, characterized in that in step S11 there are 24 kinds of text-characteristic prior boxes in total, wherein at each sliding position the sliding-window widths are set to 32, 48, 64 and 80, and the aspect ratios to 0.2, 0.5, 0.8, 1.0, 1.2 and 1.5.

4. The candidate text box generation and text detection method based on fully convolutional neural networks as claimed in claim 2, characterized in that in step S12 the inception candidate box generation network consists of a 3*3 convolutional layer, a 5*5 convolutional layer and a 3*3 max-pooling layer, each connected to the corresponding spatial receptive field of the Conv5_3 feature response map that serves as input.

5. The candidate text box generation and text detection method based on fully convolutional neural networks as claimed in claim 1, characterized in that in step S2 the text category supervision information is: a candidate box whose IoU overlap is greater than or equal to 0.5 is labelled as containing text; a candidate box whose IoU overlap is greater than or equal to 0.2 and less than 0.5 is labelled as "ambiguous text"; all others are labelled as containing no text.

6. The candidate text box generation and text detection method based on fully convolutional neural networks as claimed in claim 1, characterized in that in step S2 the multi-level region down-sampling information is: multi-level region down-sampling is performed on the Conv4_3 and Conv5_3 convolutional feature response maps of the VGG16 network to obtain two pooled features of size 512*H*W, which are concatenated and then decoded by a 512*1*1 convolutional layer.