CN111553349A - Scene text positioning and identifying method based on full convolution network - Google Patents

Scene text positioning and identifying method based on full convolution network

Info

Publication number
CN111553349A
CN111553349A (application CN202010340617.3A)
Authority
CN
China
Prior art keywords
text
candidate
network
box
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010340617.3A
Other languages
Chinese (zh)
Other versions
CN111553349B (en)
Inventor
杨海东
黄坤山
巴姗姗
彭文瑜
林玉山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Original Assignee
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute, Foshan Guangdong University CNC Equipment Technology Development Co. Ltd filed Critical Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Priority to CN202010340617.3A priority Critical patent/CN111553349B/en
Publication of CN111553349A publication Critical patent/CN111553349A/en
Application granted granted Critical
Publication of CN111553349B publication Critical patent/CN111553349B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a scene text positioning and identifying method based on a full convolution network. The method comprises five steps, including S1, obtaining a training set containing a plurality of training pictures marked with text positions, and S2, constructing a full convolution neural network Model for text positioning, inputting the training set into this Model for training, and iterating the Model parameters to obtain a converged text positioning network Model1, wherein the full convolution neural network Model comprises a feature extraction network, a feature fusion network and a text candidate box screening layer.

Description

Scene text positioning and identifying method based on full convolution network
Technical Field
The invention relates to the technical field of text positioning and identification, in particular to a scene text positioning and identification method based on a full convolution network.
Background
Text, as the most expressive form of information, records the rich scientific and cultural achievements of humanity and can be embedded in documents or scenes as communication information. Text in scene images can be roughly divided into two categories: artificial text and scene text. With the development of internet technology, text positioning and recognition technologies, such as license plate recognition and identification card recognition, have been widely used in daily life. Traditional OCR technology can only recognize printed text with a plain background and fixed fonts, whereas text in scene images is diverse, with irregular arrangement and non-uniform font sizes; in addition, factors such as illumination intensity or shooting angle cause font blurring, incomplete detection and other problems that strongly interfere with text detection and seriously affect its accuracy, making scene text positioning and recognition an extremely challenging task. Therefore, in order to improve the accuracy of scene text detection, a scene text positioning and identifying method based on a full convolution network is provided.
Disclosure of Invention
Aiming at the problems, the invention provides a scene text positioning and identifying method based on a full convolution network, which mainly solves the problems in the background technology.
The invention provides a scene text positioning and identifying method of a full convolution network, which comprises the following steps:
s1, acquiring a training set containing a plurality of training pictures marked with text positions;
s2, constructing a full convolution neural network Model based on text positioning, wherein the full convolution neural network Model comprises a feature extraction network, a feature fusion network and a text candidate box screening layer, inputting the training set into the full convolution neural network Model based on the text positioning for training, and iterating Model parameters to obtain a converged text positioning network Model 1;
s3, constructing a text recognition network Model, wherein the text recognition network comprises a convolutional neural network layer, an attention mechanism layer, a cyclic neural network layer and a translation layer, inputting the training set into the text recognition network Model for training, and iterating Model parameters to obtain a converged text recognition network Model 2;
s4, inputting the scene image to be subjected to text positioning and recognition into the text positioning network Model1 to obtain the text existence confidence and the text region position, and outputting the best text candidate box after screening;
s5, inputting the image containing the best text candidate box into the text recognition network Model2 to obtain a text recognition result.
In a further improvement, the feature extraction network consists of convolution layers and pooling layers and is used for extracting convolution feature maps of the input image; the feature fusion network uses multi-feature prediction layers to convolve feature layers from different stages and predict the text confidence and the text region position; the text candidate box screening layer post-processes the candidate boxes of the different text regions to obtain the position of the best text candidate box.
In a further improvement, the construction process of the full convolution neural network model based on text positioning in step S2 is as follows:
s21, extracting multi-scale features through a feature extraction network;
s22, performing multi-scale feature fusion through a feature fusion network;
and S23, screening by the text candidate box screening layer and outputting the image containing the best text candidate boxes.
In a further improvement, in step S23 each text candidate box has a confidence score; the text candidate boxes are processed to remove non-best candidate boxes and finally the image containing the best text candidate boxes is filtered out, which specifically includes:
S231, sorting all the text candidate boxes by confidence score from high to low, taking the box with the highest score as the current best candidate box a, and taking each of the remaining candidate boxes in turn as a candidate best text box b;
S232, calculating the overlap between the candidate best text box b and the current best text box a, the overlap being the ratio of the overlapping area of the two candidate boxes to the area of their union, namely:
IoU(a, b) = area(a ∩ b) / area(a ∪ b)
S233, if the IoU of b and a is greater than the threshold, b and a overlap strongly and should belong to the same text region, but the confidence score of b is lower than that of a, so the candidate best text box b is suppressed, that is, removed from the remaining candidate boxes;
and S234, repeating the above three steps to screen the candidate best text boxes b one by one; when all the remaining candidate boxes have been screened, only candidate boxes whose overlap with text candidate box a is smaller than the threshold remain, i.e. the remaining candidate boxes all belong to other text regions.
In a further improvement, the construction process of the text recognition network model in the step S3 is as follows:
s31, inputting the image output by the text positioning network Model1 into the convolutional neural network layer, and extracting a feature vector sequence of the image;
s32, calculating the association degree between all the feature vectors through an attention mechanism, converting the association degrees into probability weights, and then weighting the input sequence by these probabilities to obtain a new feature vector sequence;
s33, taking the new feature vector sequence as the input of a recurrent neural network layer, and predicting the label distribution of each frame sequence;
and S34, finally, translating the prediction of each frame sequence into a label sequence with the highest probability through a translation layer.
In a further improvement, the process of predicting the text region position in step S4 is:
s41, presetting default boxes at each position of the feature maps input to the multi-feature prediction layers, and regressing a series of multi-angle text boxes, wherein the multi-angle text boxes take two forms: quadrilaterals represented by four points, and rotated rectangles represented by the top-left point, the top-right point and the height;
s42, expressing the text confidence and the coordinate offsets of the text region candidate boxes output by the feature fusion network relative to each associated default box, in either the quadrilateral or the rotated rectangle form;
s43, regressing a real bounding box in the form of a quadrilateral or rotated rectangle from the text candidate box and the horizontal rectangle circumscribing the candidate box, with the regression distance computed as:
d_i = |b_1 - q_i| + |b_2 - q_(i+1)| + |b_3 - q_(i+2)| + |b_4 - q_(i+3)|,  i = 1
d_i = |b_1 - q_i| + |b_2 - q_((i+1)%4)| + |b_3 - q_((i+2)%4)| + |b_4 - q_((i+3)%4)|,  i = 2, 3, 4
wherein b_i and q_i (i = 1, 2, 3, 4) are respectively the four vertices of the circumscribed horizontal rectangle and of the regressed quadrilateral or rotated rectangle, and % denotes the remainder (modulo) operation;
and S44, obtaining the best text area candidate box through the screening of the text candidate box.
Compared with the prior art, the invention has the beneficial effects that:
1. the invention adopts multi-scale feature detection, uses a low-level feature map to locate a smaller text region, uses a high-level feature map to locate a larger text region, and then inputs a multi-scale feature image into a feature fusion network to obtain the confidence coefficient of the text in a scene image and the position of the text region. Compared with the method that only the last characteristic layer is used for detection, the multi-characteristic prediction layer detects the scene text image by using the plurality of characteristic layers, so that the accuracy is improved, and the robustness of text regions with different scales is enhanced.
2. The invention selects an irregular 3x5 convolution filter in the multi-feature prediction layers of the feature fusion network. The goal is to obtain a better receptive field, because text regions in scene images are generally long objects.
3. In the text recognition process, an attention mechanism is added before the recurrent neural network. The feature sequence extracted during recognition is in fact a mapping from the input image to sequence positions, but when several consecutive feature vectors of the sequence correspond to background noise between texts in the input image, recognition errors easily occur. In addition, when the text in the input image is long and the spacing between characters is large, interference from background noise of non-text regions easily occurs, so the attention mechanism is adopted to solve this problem.
4. The invention replaces the rectangular boxes of conventional object detectors with quadrilaterals and rotated rectangles in text positioning, so it can handle text in any orientation in the scene image, and images of any size can be input.
Drawings
The drawings are for illustrative purposes only and are not to be construed as limiting the patent; for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
FIG. 1 is a schematic overall flow chart of an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of the text positioning network Model1 according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a text recognition network Model2 according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a regression graph of the minimum distance of a real bounding box according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating regression results from a matched default box to a true bounding box according to an embodiment of the present invention.
Detailed Description
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted" and "connected" are to be interpreted broadly, e.g., as fixedly connected, detachably connected, or integrally connected; as mechanically or electrically connected; and as directly connected or indirectly connected through an intermediate medium, or as internal communication between two elements. The specific meaning of the above terms in the present invention can be understood by those skilled in the art according to the specific case. The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
The invention provides a scene text positioning and identifying method of a full convolution network, which overcomes the defects of the traditional text detection and identifying method and carries out real-time positioning on texts in a scene image by utilizing deep learning target detection. Firstly, acquiring a training set containing a plurality of training pictures marked with text positions; then inputting the training set into a text positioning network for training; inputting the training set into a text recognition network for training; then, inputting the scene image to be subjected to text positioning and recognition into a text positioning network model to obtain a text existence confidence coefficient and a text region position; then inputting the text area position image into a text recognition network; and finally, obtaining a text recognition result. Two trained network models are obtained through training, and a scene text positioning and identifying method with high precision and high efficiency can be realized by combining the prediction of the rear end of the model. The method specifically comprises the following steps:
s1, acquiring a training set containing a plurality of training pictures marked with text positions;
s2, constructing a full convolution neural network Model based on text positioning, wherein the full convolution neural network Model comprises a feature extraction network, a feature fusion network and a text candidate box screening layer, inputting the training set into the full convolution neural network Model based on the text positioning for training, and iterating Model parameters to obtain a converged text positioning network Model 1;
s3, constructing a text recognition network Model, wherein the text recognition network comprises a convolutional neural network layer, an attention mechanism layer, a cyclic neural network layer and a translation layer, inputting the training set into the text recognition network Model for training, and iterating Model parameters to obtain a converged text recognition network Model 2;
s4, inputting the scene image to be subjected to text positioning and recognition into the text positioning network Model1 to obtain the text existence confidence and the text region position, and outputting the best text candidate box after screening;
s5, inputting the image containing the best text candidate box into the text recognition network Model2 to obtain a text recognition result.
In a further improvement, the feature extraction network consists of convolution layers and pooling layers and is used for extracting convolution feature maps of the input image; the feature fusion network uses multi-feature prediction layers to convolve feature layers from different stages and predict the text confidence and the text region position; the text candidate box screening layer post-processes the candidate boxes of the different text regions to obtain the position of the best text candidate box.
It can be appreciated that, compared with a conventional neural network that performs multiple nonlinear mappings on the input image through a series of stacked convolutional layers, the multi-feature fusion network adopted in the invention overcomes two shortcomings of the conventional approach: (1) as the network depth increases, the features become more and more abstract and detail information is lost; (2) different feature layers correspond to different receptive fields, the deeper feature layers having larger receptive fields, so a single layer cannot fit text regions of all scales. Multi-scale feature detection is therefore adopted: a low-level feature map is used to locate smaller text regions and a high-level feature map is used to locate larger text regions, and the multi-scale feature maps are then input into the feature fusion network to obtain the text confidence and the text region positions in the scene image. Compared with detection using only the last feature layer, the multi-feature prediction layers detect the scene text image with several feature layers, which improves accuracy and enhances robustness to text regions of different scales.
It is understood that in the embodiment of the present invention, the text region in the scene image is generally a long object, and in order to obtain a better receptive field, an irregular 3x5 convolution filter is selected for the multi-feature prediction layer. Compared with the method that only the last characteristic layer is used for detection, the multi-characteristic prediction layer detects the scene text image by using the plurality of characteristic layers, so that the accuracy is improved, and the robustness of text regions with different scales is enhanced.
In a further improvement, the construction process of the full convolution neural network model based on text positioning in step S2 is as follows:
s21, extracting multi-scale features through the feature extraction network, wherein the feature extraction network is a 23-layer neural network, the feature fusion network comprises 6 feature prediction layers, the multi-feature prediction layers are convolution layers, the 6 feature prediction layers are respectively connected to the feature layers of 6 stages, namely the 13th, 15th, 17th, 19th, 21st and 23rd layers, an irregular 3x5 convolution filter is selected for the multi-feature prediction layers, and convolution kernels of different depths are used to convolve each feature layer (a sketch of these prediction layers is given after step S23);
s22, performing multi-scale feature fusion through the feature fusion network, outputting a group of vectors of fixed size to predict the text confidence and the text region position, outputting a 14-dimensional prediction vector for each quadrilateral candidate box determined to be a text region (2 dimensions for text presence, 8 dimensions for the coordinate offsets of the quadrilateral candidate box, and 4 dimensions for the coordinate offsets of its minimum circumscribed horizontal rectangle), and outputting a 7-dimensional prediction vector for each rotated rectangular candidate box determined to be a text region (2 dimensions for text presence and 5 dimensions for the coordinate offsets of the rotated rectangular candidate box);
and S23, screening by the text candidate box screening layer and outputting the image containing the best text candidate boxes.
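The patent does not prescribe a deep learning framework or concrete channel sizes; as an illustration only, the following PyTorch-style sketch (channel counts, the number of default boxes per position and the head layout are assumptions) shows how prediction layers with irregular 3x5 convolutions can be attached to six stage feature maps and emit the 14-dimensional per-default-box prediction vectors of step S22:

```python
import torch
import torch.nn as nn

class MultiFeaturePredictionHead(nn.Module):
    """Applies an irregular 3x5 convolution to each of six stage feature maps and
    predicts, per default box: 2 text/non-text scores, 8 quadrilateral offsets and
    4 circumscribed-rectangle offsets (14 values). Channel counts and the number of
    default boxes per position are illustrative assumptions, not taken from the patent."""

    def __init__(self, in_channels=(256, 256, 512, 512, 512, 512), boxes_per_position=6):
        super().__init__()
        out_channels = boxes_per_position * 14          # 2 + 8 + 4 values per default box
        self.pred_layers = nn.ModuleList([
            # kernel_size=(3, 5) with padding=(1, 2) keeps the spatial size and widens
            # the horizontal receptive field, which suits long text regions.
            nn.Conv2d(c, out_channels, kernel_size=(3, 5), padding=(1, 2))
            for c in in_channels
        ])

    def forward(self, feature_maps):
        # feature_maps: six tensors taken from different backbone stages
        # (e.g. layers 13, 15, 17, 19, 21 and 23 of the 23-layer network).
        predictions = []
        for fmap, layer in zip(feature_maps, self.pred_layers):
            p = layer(fmap)                              # (N, K*14, H, W)
            n, ck, h, w = p.shape
            k = ck // 14
            p = p.permute(0, 2, 3, 1).reshape(n, h * w * k, 14)
            predictions.append(p)
        return torch.cat(predictions, dim=1)             # one 14-dim vector per default box
```

The 7-dimensional rotated-rectangle branch of step S22 could be added analogously with an output of boxes_per_position * 7 channels.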
In a further improvement, in step S23 each text candidate box has a confidence score; the text candidate boxes are processed to remove non-best candidate boxes and finally the image containing the best text candidate boxes is filtered out (a sketch of this screening procedure is given after step S234), which specifically includes:
S231, sorting all the text candidate boxes by confidence score from high to low, taking the box with the highest score as the current best candidate box a, and taking each of the remaining candidate boxes in turn as a candidate best text box b;
S232, calculating the overlap between the candidate best text box b and the current best text box a, the overlap being the ratio of the overlapping area of the two candidate boxes to the area of their union, namely:
IoU(a, b) = area(a ∩ b) / area(a ∪ b)
S233, if the IoU of b and a is greater than the threshold (e.g. IoU > 0.6), b and a overlap strongly and should belong to the same text region, but the confidence score of b is lower than that of a, so the candidate best text box b is suppressed, that is, removed from the remaining candidate boxes;
and S234, repeating the above three steps to screen the candidate best text boxes b one by one; when all the remaining candidate boxes have been screened, only candidate boxes whose overlap with text candidate box a is smaller than the threshold remain, i.e. the remaining candidate boxes all belong to other text regions.
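The screening of steps S231-S234 is a non-maximum suppression over the candidate boxes. The following NumPy sketch illustrates it, using axis-aligned boxes for brevity and the 0.6 threshold mentioned above; the quadrilateral case would only change the overlap computation:

```python
import numpy as np

def iou(box_a, box_b):
    """Overlap of two axis-aligned boxes (x1, y1, x2, y2): intersection area / union area."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def screen_text_boxes(boxes, scores, iou_threshold=0.6):
    """S231-S234: keep one best box per text region, suppressing overlapping candidates."""
    order = np.argsort(scores)[::-1]            # S231: sort by confidence, high to low
    keep = []
    while order.size > 0:
        a = order[0]                            # current best candidate box a
        keep.append(int(a))
        remaining = []
        for b in order[1:]:                     # take each remaining candidate box b in turn
            if iou(boxes[a], boxes[b]) <= iou_threshold:   # S232/S233: suppress high overlap
                remaining.append(b)             # low overlap: likely another text region
        order = np.asarray(remaining, dtype=int)           # S234: repeat on what is left
    return keep                                 # indices of the best text candidate boxes
```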
In a further improvement, the construction process of the text recognition network model in the step S3 is as follows:
s31, inputting the image output by the text positioning network Model1 into the convolutional neural network layer, and extracting a feature vector sequence of the image;
s32, calculating the association degree between all the feature vectors through an attention mechanism, converting the association degrees into probability weights, and then weighting the input sequence by these probabilities to obtain a new feature vector sequence;
s33, taking the new feature vector sequence as the input of a recurrent neural network layer, and predicting the label distribution of each frame sequence;
and S34, finally, translating the prediction of each frame sequence into a label sequence with the highest probability through a translation layer.
It is understood that, in the embodiment of the present invention, the basic idea of the attention mechanism in step S32 is to give more attention to the necessary information so as to extract it from the input sequence; conceptually, important information is selected from a large amount of information and unimportant information is ignored. The main process is as follows: for input features x_i, i = 0, 1, ..., n, the encoder f produces the feature sequence h_i = f(x_i); the attention mechanism computes a corresponding weight α_ji for each h_i, and the weighted sum
c_j = Σ_i α_ji · h_i
is taken as the input to the decoder g, which finally decodes the output y_j = g(c_j).
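As an illustration of the weighted sum above and of the recognition pipeline of steps S31-S34, the following PyTorch-style sketch computes the weights α_ji by softmax over pairwise association scores and feeds the weighted feature sequence to a bidirectional recurrent layer; the framework, layer sizes and the dot-product scoring used to obtain the association degrees are assumptions of this sketch, not specified in the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    """Scores the association between feature vectors, turns the scores into probability
    weights alpha_ji with softmax, and returns the weighted sequence c_j = sum_i alpha_ji * h_i."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, h):                                  # h: (T, N, dim) feature sequence
        q, k = self.query(h), self.key(h)
        scores = torch.einsum('jnd,ind->nji', q, k)        # association of every pair (j, i)
        alpha = F.softmax(scores, dim=-1)                  # probability weights alpha_ji
        return torch.einsum('nji,ind->jnd', alpha, h)      # weighted sum over i

class TextRecognizer(nn.Module):
    """S31-S34: CNN feature sequence -> attention -> bidirectional RNN -> per-frame
    label distribution; a translation layer then decodes the most probable labels."""
    def __init__(self, num_classes, dim=256):
        super().__init__()
        self.cnn = nn.Sequential(                          # illustrative small backbone
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),               # collapse height into a sequence
        )
        self.attention = AttentionLayer(dim)
        self.rnn = nn.LSTM(dim, dim // 2, bidirectional=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, images):                             # images: (N, 1, H, W)
        f = self.cnn(images).squeeze(2)                    # (N, dim, T)
        h = f.permute(2, 0, 1)                             # (T, N, dim) feature vector sequence
        c = self.attention(h)                              # S32: new, attention-weighted sequence
        out, _ = self.rnn(c)                               # S33: per-frame label distribution
        return self.classifier(out)                        # logits for the translation layer (S34)
```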
In a further improvement, the process of predicting the text region position in step S4 is:
s41, presetting default boxes at each position of the feature maps input to the multi-feature prediction layers, and regressing a series of multi-angle text boxes, wherein the multi-angle text boxes take two forms: quadrilaterals represented by four points, and rotated rectangles represented by the top-left point, the top-right point and the height;
s42, expressing the text confidence and the coordinate offsets of the text region candidate boxes output by the feature fusion network relative to each associated default box, in either the quadrilateral or the rotated rectangle form;
s43, regressing a real bounding box in the form of a quadrilateral or rotated rectangle from the text candidate box and the horizontal rectangle circumscribing the candidate box, with the regression distance computed as:
d_i = |b_1 - q_i| + |b_2 - q_(i+1)| + |b_3 - q_(i+2)| + |b_4 - q_(i+3)|,  i = 1
d_i = |b_1 - q_i| + |b_2 - q_((i+1)%4)| + |b_3 - q_((i+2)%4)| + |b_4 - q_((i+3)%4)|,  i = 2, 3, 4
wherein b_i and q_i (i = 1, 2, 3, 4) are respectively the four vertices of the circumscribed horizontal rectangle and of the regressed quadrilateral or rotated rectangle, and % denotes the remainder (modulo) operation;
and S44, obtaining the best text area candidate box through the screening of the text candidate box.
It is understood that, in the embodiment of the present invention, a more specific process of predicting the position of the text region is:
(1) Each position of the feature maps in the multi-feature prediction layers has a series of preset default boxes, from which a series of multi-angle text boxes is regressed, denoted by the set {q} or {r}; the minimum circumscribed horizontal rectangle corresponding to each multi-angle text box is also output and denoted by the set {b}. A default box is written as b0 = (x0, y0, w0, h0), where (x0, y0) is the centre point of the default box and w0, h0 are its width and height, so the regressed multi-angle text box has two representation forms:
(a) a quadrilateral represented by its four points, q0 = (x_q01, y_q01, x_q02, y_q02, x_q03, y_q03, x_q04, y_q04);
(b) a rotated rectangle represented by its top-left point, top-right point and height, r0 = (x_r01, y_r01, x_r02, y_r02, h_r0).
The parameter relations between the multi-angle text box and the default box are:
x_q01 = x0 - w0/2, y_q01 = y0 - h0/2, x_q02 = x0 + w0/2, y_q02 = y0 - h0/2
x_q03 = x0 + w0/2, y_q03 = y0 + h0/2, x_q04 = x0 - w0/2, y_q04 = y0 + h0/2
x_r01 = x0 - w0/2, y_r01 = y0 - h0/2, x_r02 = x0 + w0/2, y_r02 = y0 - h0/2, h_r0 = h0
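The parameter relations above can be written as two small helper functions (plain Python; the function names are illustrative):

```python
def default_box_to_quad(x0, y0, w0, h0):
    """Quadrilateral form q0: four corner points listed clockwise from the top-left."""
    return (x0 - w0 / 2, y0 - h0 / 2,   # (x_q01, y_q01) top-left
            x0 + w0 / 2, y0 - h0 / 2,   # (x_q02, y_q02) top-right
            x0 + w0 / 2, y0 + h0 / 2,   # (x_q03, y_q03) bottom-right
            x0 - w0 / 2, y0 + h0 / 2)   # (x_q04, y_q04) bottom-left

def default_box_to_rot_rect(x0, y0, w0, h0):
    """Rotated-rectangle form r0: top-left point, top-right point and height."""
    return (x0 - w0 / 2, y0 - h0 / 2,   # (x_r01, y_r01)
            x0 + w0 / 2, y0 - h0 / 2,   # (x_r02, y_r02)
            h0)                         # h_r0
```

For example, a default box centred at (10, 20) with width 8 and height 4 yields q0 = (6, 18, 14, 18, 14, 22, 6, 22) and r0 = (6, 18, 14, 18, 4).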
(2) The text confidence and the text region position coordinate offsets output by the feature fusion network are expressed relative to each associated default box, for either the quadrilateral or the rotated rectangle form.
For a quadrilateral, the predicted values of the multi-feature prediction layer are:
(Δx, Δy, Δw, Δh, Δx1, Δy1, Δx2, Δy2, Δx3, Δy3, Δx4, Δy4, c)
and the horizontal rectangle and quadrilateral output under confidence c are computed as:
x = x0 + w0·Δx, y = y0 + h0·Δy, w = w0·e^Δw, h = h0·e^Δh
x_qn = x_q0n + w0·Δx_qn, y_qn = y_q0n + h0·Δy_qn, n = 1, 2, 3, 4
For the rotated rectangle, the predicted values of the multi-feature prediction layer are (Δx, Δy, Δw, Δh, Δx1, Δy1, Δx2, Δy2, Δh_r, c), and the rotated rectangle r = (x_r1, y_r1, x_r2, y_r2, h_r) output under confidence c is computed analogously as:
x_rn = x_r0n + w0·Δx_n, y_rn = y_r0n + h0·Δy_n, n = 1, 2, h_r = h_r0·e^(Δh_r)
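A minimal NumPy sketch of decoding a quadrilateral prediction under the formulas above follows; the decoding of the rotated rectangle would be analogous, and the function name and the packing order of the offsets are assumptions of this sketch:

```python
import numpy as np

def decode_quad(default_box, offsets):
    """default_box: (x0, y0, w0, h0); offsets: (dx, dy, dw, dh, dx1, dy1, ..., dx4, dy4).
    Returns the circumscribed horizontal rectangle (x, y, w, h) and the four
    quadrilateral vertices decoded from the predicted offsets."""
    x0, y0, w0, h0 = default_box
    dx, dy, dw, dh = offsets[:4]
    rect = (x0 + w0 * dx, y0 + h0 * dy, w0 * np.exp(dw), h0 * np.exp(dh))
    # vertices of the default box, clockwise from the top-left corner
    quad0 = [(x0 - w0 / 2, y0 - h0 / 2), (x0 + w0 / 2, y0 - h0 / 2),
             (x0 + w0 / 2, y0 + h0 / 2), (x0 - w0 / 2, y0 + h0 / 2)]
    quad = []
    for n, (qx, qy) in enumerate(quad0):
        dxn, dyn = offsets[4 + 2 * n], offsets[5 + 2 * n]
        quad.append((qx + w0 * dxn, qy + h0 * dyn))   # x_qn = x_q0n + w0*Δx_qn, etc.
    return rect, quad
```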
(3) A real bounding box in the form of a quadrilateral or rotated rectangle is regressed from the candidate box and the horizontal rectangle circumscribing the candidate box.
For a quadrilateral, the ground-truth horizontal box is the circumscribed horizontal rectangle of the candidate box, with its four points arranged clockwise and the first point at the top-left corner; the ground-truth text box is the candidate box itself, with its four vertices also arranged clockwise. Starting from the top-left corner of the horizontal box, the sum of the four point distances between the vertices of the text box and those of the horizontal box is computed clockwise (four sums in total, one for each starting vertex); the vertex corresponding to the top-left corner of the horizontal box is taken as the first point of the text box, and the second, third and fourth points are determined in the same way. The regression distance is computed as:
d_i = |b_1 - q_i| + |b_2 - q_(i+1)| + |b_3 - q_(i+2)| + |b_4 - q_(i+3)|,  i = 1
d_i = |b_1 - q_i| + |b_2 - q_((i+1)%4)| + |b_3 - q_((i+2)%4)| + |b_4 - q_((i+3)%4)|,  i = 2, 3, 4
wherein b_i and q_i (i = 1, 2, 3, 4) are respectively the four vertices of the circumscribed horizontal rectangle and of the regressed quadrilateral or rotated rectangle, and % denotes the remainder (modulo) operation; FIG. 4 shows the minimum-distance regression. The regression result from a matched default box to the real bounding box is shown in FIG. 5, in which the dotted box represents the default box matched to the real bounding box, one solid box represents the minimum circumscribed horizontal rectangle corresponding to the real bounding box, the other solid box represents the real bounding box, and the arrow indicates the regression direction.
For the rotated rectangle, the top-left and top-right points are determined in the same way as for the quadrilateral, and the height is the height of the rotated rectangle.
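The vertex ordering described in this step can be sketched as follows (NumPy; using the Euclidean point distance for |.| is an assumption of this sketch):

```python
import numpy as np

def order_quad_vertices(rect_pts, quad_pts):
    """rect_pts, quad_pts: (4, 2) arrays of vertices, both listed clockwise.
    Computes d_i for the four cyclic orderings of the quadrilateral's vertices and
    returns the ordering with minimum distance, so that the first returned vertex
    corresponds to the top-left corner of the circumscribed horizontal rectangle."""
    rect_pts = np.asarray(rect_pts, dtype=float)
    quad_pts = np.asarray(quad_pts, dtype=float)
    best_shift, best_d = 0, np.inf
    for i in range(4):
        shifted = np.roll(quad_pts, -i, axis=0)                 # start the ordering at vertex q_(i+1)
        d = np.linalg.norm(rect_pts - shifted, axis=1).sum()    # d_i = sum of vertex distances
        if d < best_d:
            best_shift, best_d = i, d
    return np.roll(quad_pts, -best_shift, axis=0)
```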
(4) And screening the text candidate boxes to obtain the best text region candidate box.
In the drawings, the positional relationships are described for illustrative purposes only and are not to be construed as limiting the present patent; it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.

Claims (6)

1. A scene text positioning and identifying method based on a full convolution network is characterized by comprising the following steps:
s1, acquiring a training set containing a plurality of training pictures marked with text positions;
s2, constructing a full convolution neural network Model based on text positioning, wherein the full convolution neural network Model comprises a feature extraction network, a feature fusion network and a text candidate box screening layer, inputting the training set into the full convolution neural network Model based on the text positioning for training, and iterating Model parameters to obtain a converged text positioning network Model 1;
s3, constructing a text recognition network Model, wherein the text recognition network comprises a convolutional neural network layer, an attention mechanism layer, a cyclic neural network layer and a translation layer, inputting the training set into the text recognition network Model for training, and iterating Model parameters to obtain a converged text recognition network Model 2;
s4, inputting the scene image to be subjected to text positioning and recognition into the text positioning network Model1 to obtain the text existence confidence and the text region position, and outputting the best text candidate box after screening;
s5, inputting the image containing the best text candidate box into the text recognition network Model2 to obtain a text recognition result.
2. The method for locating and identifying scene text based on the full convolution network as claimed in claim 1, wherein the feature extraction network consists of convolution layers and pooling layers and is used for extracting convolution feature maps of the input image; the feature fusion network uses multi-feature prediction layers to convolve feature layers from different stages and predict the text confidence and the text region position; and the text candidate box screening layer post-processes the candidate boxes of the different text regions to obtain the position of the best text candidate box.
3. The method for locating and identifying scene text based on the full convolution network as claimed in claim 1 or 2, wherein the construction process of the full convolution neural network model based on text positioning in step S2 is as follows:
s21, extracting multi-scale features through a feature extraction network;
s22, performing multi-scale feature fusion through a feature fusion network;
and S23, screening by the text candidate box screening layer and outputting the image containing the best text candidate boxes.
4. The method as claimed in claim 3, wherein in step S23 each text candidate box has a confidence score; the text candidate boxes are processed to remove non-best candidate boxes and finally the image containing the best text candidate boxes is filtered out, which specifically includes:
S231, sorting all the text candidate boxes by confidence score from high to low, taking the box with the highest score as the current best candidate box a, and taking each of the remaining candidate boxes in turn as a candidate best text box b;
S232, calculating the overlap between the candidate best text box b and the current best text box a, the overlap being the ratio of the overlapping area of the two candidate boxes to the area of their union, namely:
IoU(a, b) = area(a ∩ b) / area(a ∪ b)
S233, if the IoU of b and a is greater than the threshold, b and a overlap strongly and should belong to the same text region, but the confidence score of b is lower than that of a, so the candidate best text box b is suppressed, that is, removed from the remaining candidate boxes;
and S234, repeating the above three steps to screen the candidate best text boxes b one by one; when all the remaining candidate boxes have been screened, only candidate boxes whose overlap with text candidate box a is smaller than the threshold remain, i.e. the remaining candidate boxes all belong to other text regions.
5. The method for locating and identifying scene texts based on full convolutional network of claim 1, wherein the construction process of the text recognition network model in the step S3 is as follows:
s31, inputting the image output by the text positioning network Model1 into the convolutional neural network layer, and extracting a feature vector sequence of the image;
s32, calculating the association degree between all the feature vectors through an attention mechanism, converting the association degrees into probability weights, and then weighting the input sequence by these probabilities to obtain a new feature vector sequence;
s33, taking the new feature vector sequence as the input of a recurrent neural network layer, and predicting the label distribution of each frame sequence;
and S34, finally, translating the prediction of each frame sequence into a label sequence with the highest probability through a translation layer.
6. The method for locating and identifying scene text based on full convolutional network as claimed in claim 2, wherein the process of predicting the text region position in step S4 is as follows:
s41, presetting default boxes at each position of the feature maps input to the multi-feature prediction layers, and regressing a series of multi-angle text boxes, wherein the multi-angle text boxes take two forms: quadrilaterals represented by four points, and rotated rectangles represented by the top-left point, the top-right point and the height;
s42, expressing the text confidence and the coordinate offsets of the text region candidate boxes output by the feature fusion network relative to each associated default box, in either the quadrilateral or the rotated rectangle form;
s43, regressing a real bounding box in the form of a quadrilateral or rotated rectangle from the text candidate box and the horizontal rectangle circumscribing the candidate box, with the regression distance computed as:
d_i = |b_1 - q_i| + |b_2 - q_(i+1)| + |b_3 - q_(i+2)| + |b_4 - q_(i+3)|,  i = 1
d_i = |b_1 - q_i| + |b_2 - q_((i+1)%4)| + |b_3 - q_((i+2)%4)| + |b_4 - q_((i+3)%4)|,  i = 2, 3, 4
wherein b_i and q_i (i = 1, 2, 3, 4) are respectively the four vertices of the circumscribed horizontal rectangle and of the regressed quadrilateral or rotated rectangle, and % denotes the remainder (modulo) operation;
and S44, obtaining the best text area candidate box through the screening of the text candidate box.
CN202010340617.3A 2020-04-26 2020-04-26 Scene text positioning and identifying method based on full convolution network Active CN111553349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010340617.3A CN111553349B (en) 2020-04-26 2020-04-26 Scene text positioning and identifying method based on full convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010340617.3A CN111553349B (en) 2020-04-26 2020-04-26 Scene text positioning and identifying method based on full convolution network

Publications (2)

Publication Number Publication Date
CN111553349A true CN111553349A (en) 2020-08-18
CN111553349B CN111553349B (en) 2023-04-18

Family

ID=72003025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010340617.3A Active CN111553349B (en) 2020-04-26 2020-04-26 Scene text positioning and identifying method based on full convolution network

Country Status (1)

Country Link
CN (1) CN111553349B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560857A (en) * 2021-02-20 2021-03-26 鹏城实验室 Character area boundary detection method, equipment, storage medium and device
CN112990201A (en) * 2021-05-06 2021-06-18 北京世纪好未来教育科技有限公司 Text box detection method and device, electronic equipment and computer storage medium
CN113221885A (en) * 2021-05-13 2021-08-06 中国科学技术大学 Hierarchical modeling method and system based on whole words and radicals
CN113221884A (en) * 2021-05-13 2021-08-06 中国科学技术大学 Text recognition method and system based on low-frequency word storage memory
CN113537195A (en) * 2021-07-21 2021-10-22 北京数美时代科技有限公司 Image text recognition method and system and electronic equipment
CN116958981A (en) * 2023-05-31 2023-10-27 广东南方网络信息科技有限公司 Character recognition method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks
CN110569843A (en) * 2019-09-09 2019-12-13 中国矿业大学(北京) Intelligent detection and identification method for mine target
CN110837835A (en) * 2019-10-29 2020-02-25 华中科技大学 End-to-end scene text identification method based on boundary point detection

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks
CN110569843A (en) * 2019-09-09 2019-12-13 中国矿业大学(北京) Intelligent detection and identification method for mine target
CN110837835A (en) * 2019-10-29 2020-02-25 华中科技大学 End-to-end scene text identification method based on boundary point detection

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560857A (en) * 2021-02-20 2021-03-26 鹏城实验室 Character area boundary detection method, equipment, storage medium and device
CN112990201A (en) * 2021-05-06 2021-06-18 北京世纪好未来教育科技有限公司 Text box detection method and device, electronic equipment and computer storage medium
CN113221885A (en) * 2021-05-13 2021-08-06 中国科学技术大学 Hierarchical modeling method and system based on whole words and radicals
CN113221884A (en) * 2021-05-13 2021-08-06 中国科学技术大学 Text recognition method and system based on low-frequency word storage memory
CN113221884B (en) * 2021-05-13 2022-09-06 中国科学技术大学 Text recognition method and system based on low-frequency word storage memory
CN113221885B (en) * 2021-05-13 2022-09-06 中国科学技术大学 Hierarchical modeling method and system based on whole words and radicals
CN113537195A (en) * 2021-07-21 2021-10-22 北京数美时代科技有限公司 Image text recognition method and system and electronic equipment
CN116958981A (en) * 2023-05-31 2023-10-27 广东南方网络信息科技有限公司 Character recognition method and device
CN116958981B (en) * 2023-05-31 2024-04-30 广东南方网络信息科技有限公司 Character recognition method and device

Also Published As

Publication number Publication date
CN111553349B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN111553349B (en) Scene text positioning and identifying method based on full convolution network
CN109635883B (en) Chinese character library generation method based on structural information guidance of deep stack network
CN109948510B (en) Document image instance segmentation method and device
Ye et al. Text detection and recognition in imagery: A survey
US7480408B2 (en) Degraded dictionary generation method and apparatus
Nakamura et al. Scene text eraser
CN111414906A (en) Data synthesis and text recognition method for paper bill picture
CN111914698B (en) Human body segmentation method, segmentation system, electronic equipment and storage medium in image
CN113435240B (en) End-to-end form detection and structure identification method and system
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
RU2726185C1 (en) Detecting and identifying objects on images
JP2008530700A (en) Fast object detection method using statistical template matching
CN111553837A (en) Artistic text image generation method based on neural style migration
CN113158977B (en) Image character editing method for improving FANnet generation network
CN112949455B (en) Value-added tax invoice recognition system and method
CN111523537A (en) Character recognition method, storage medium and system
CN113033558A (en) Text detection method and device for natural scene and storage medium
US20240161304A1 (en) Systems and methods for processing images
CN113033559A (en) Text detection method and device based on target detection and storage medium
CN110570450B (en) Target tracking method based on cascade context-aware framework
CN117115824A (en) Visual text detection method based on stroke region segmentation strategy
Lee et al. Backbone alignment and cascade tiny object detecting techniques for dolphin detection and classification
CN114783042A (en) Face recognition method, device, equipment and storage medium based on multiple moving targets
Shiravale et al. Recent advancements in text detection methods from natural scene images
CN117095423B (en) Bank bill character recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant