CN110837835A - End-to-end scene text identification method based on boundary point detection - Google Patents
- Publication number
- CN110837835A (application CN201911038568.1A / CN201911038568A)
- Authority
- CN
- China
- Prior art keywords
- text
- network
- rpn
- multidirectional
- boundary point
- Prior art date
- Legal status: Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Abstract
The invention discloses an end-to-end scene text recognition method based on boundary point detection. Text features are extracted by a feature pyramid network and used by a region extraction network to generate candidate text boxes; a multidirectional rectangle detection network then detects a more accurate multidirectional bounding box for each text instance; next, the sequences of upper and lower boundary points of the text inside the multidirectional bounding box are detected; finally, the detected boundary point sequences are used to rectify text of arbitrary shape into horizontal text for a subsequent attention-based sequence recognition network, and a beam search algorithm finds the best matching word for the predicted sequence in a given dictionary, yielding the final text recognition result. The method can simultaneously detect and recognize scene text of arbitrary shape in natural images, including horizontal, multidirectional and curved text, without character-level labeling, and can be trained fully end to end.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a scene text end-to-end identification method based on boundary point detection.
Background
Scene text detection and recognition is a very active and challenging research direction in the field of computer vision, and many practical applications depend on it, such as network information security monitoring systems, intelligent transportation systems, and assistance for the visually impaired.
In most past research, scene text detection and recognition have been treated as two separate processes: first, a trained detector locates text regions in a natural scene picture; second, the detected text regions are input into a recognition module to obtain the text content. The detection and recognition tasks are highly correlated and complementary: on the one hand, the quality of the detection step determines the accuracy of recognition; on the other hand, recognition results can provide feedback to detection. Processing the two separately may therefore leave the performance of both below optimum.
Recently, various methods have provided end-to-end recognition solutions, and they can be roughly divided into two types. The first type follows a similar processing flow: a text instance is first represented as a horizontal or multidirectional bounding box, the text bounding box is detected with a detection network, and text images or features are then obtained from the image or feature map according to the detected bounding box and recognized by a subsequent text recognition network. Since text instances are described as horizontal or multidirectional bounding boxes, such schemes have difficulty handling arbitrarily shaped text. The second type consists of a text detector based on instance segmentation and a text recognizer based on character segmentation: text of arbitrary shape is detected by segmenting instance text regions, and recognized through semantic segmentation in two-dimensional space, so that irregular text instances can be recognized. However, such methods require character-level labeling, and the recognition network cannot model character sequence information. An economical and efficient end-to-end recognition method is therefore needed to handle scene text of arbitrary shape.
Disclosure of Invention
The invention aims to provide an end-to-end scene text recognition method based on boundary point detection, consisting of a text detector based on boundary point detection and a text recognizer based on attention-driven sequence recognition. Text of arbitrary shape is detected by detecting the boundary points of text instances; according to the detected boundary points of a text instance, arbitrarily shaped text is rectified into horizontal text with a thin plate spline interpolation algorithm; irregular text instances are then recognized by applying the attention-based sequence recognizer to the rectified text. The method can detect and recognize text instances of arbitrary shape and can be trained fully end to end.
In order to achieve the above object, the present invention provides an end-to-end recognition method for scene texts with arbitrary shapes, comprising the following steps:
(1) training a scene text end-to-end recognition network model based on boundary point detection, comprising the following sub-steps:
(1.1) Perform word-level labeling of the arbitrarily shaped text in all pictures of the original data set; each label consists of the clockwise vertex coordinates of the word-level polygonal text bounding box and the character sequence of the word, yielding a labeled standard training data set;
(1.2) Define the end-to-end scene text recognition network model based on boundary point detection, composed of a feature pyramid structure network, a region extraction network, a multidirectional rectangle detection network, a boundary point detection network and an attention-based sequence recognition network. Compute training labels from the labeled standard training data set of step (1.1), design the loss function, and train the boundary-point-detection-based end-to-end scene text recognition network by back-propagation, obtaining the trained recognition network model; this comprises the following substeps:
(1.2.1) Construct the end-to-end scene text recognition network model based on boundary point detection, composed of a feature pyramid structure network, a region extraction network, a multidirectional rectangle detection network, a boundary point detection network and an attention-based sequence recognition network. The feature pyramid structure network takes the ResNet-50 deep convolutional neural network as its base network, adding bottom-up connections, top-down connections and lateral connections; it extracts and fuses features of different resolutions from the input standard data set pictures. The extracted features of different scales are input into the region extraction network to obtain candidate text regions; after a region-of-interest alignment operation, candidate text regions of fixed scale are obtained. The 7×7-resolution candidate text regions extracted by the region extraction network are input into a fast region classification-regression network, which predicts through its classification branch the probability that an input candidate text region is a positive sample, providing more accurate candidate text regions, and computes through its regression branch the offset of the candidate text region relative to the real text region, adjusting the candidate region's position. The multidirectional rectangle detection network consists of 3 fully connected layers FC1, FC2 and FC3 and outputs a prediction vector of dimension 5, representing the offset of the center of the candidate text region from the center of the minimum circumscribed rectangle, the width and height of the minimum circumscribed rectangle, and its rotation angle.
The boundary point detection network consists of 4 convolutional layers Conv1, Conv2, Conv3 and Conv4 and one fully connected layer, and outputs a vector of dimension 28 representing the offsets of the 7 boundary points on each of the upper and lower boundaries of the text instance. The attention-based sequence recognition network consists of three convolutional layers and an attention model; the attention model outputs the probability distribution of the predicted character at each step.
(1.2.2) Generate, on the original image, the horizontal initial bounding boxes, multidirectional rectangular bounding boxes, and upper and lower boundary points of each text instance from the labeled standard training set and the feature maps, providing training labels for the region extraction network, the multidirectional rectangle detection network and the boundary point detection network respectively. For the labeled standard training set Itr, the ground-truth label of an input picture Itr_i contains polygons P = {P1, P2 … Pm} representing the text regions and character strings S = {s1, s2 … sm} representing the text content, where Pi is the polygonal bounding box of the i-th text region in picture Itr_i, p_ij = (x_ij, y_ij) is the coordinate of the j-th vertex of polygon Pi, m is the number of polygonal text label boxes, and si is the text content of polygon Pi.
For a given standard data set Itr, each polygon P = {p1, p2 … pm} in the data set labels is first converted into its smallest horizontal rectangular bounding box Gd(x, y, h, w), represented by the rectangle's center point (x, y), height h and width w. For the region extraction network, according to the annotation bounding boxes Gd(x, y, h, w) of the labeled data set, each pixel of every feature map output by the feature pyramid is mapped back onto the original image, and a number of initial bounding boxes are generated according to the candidate text regions predicted by the region extraction network. The Jaccard coefficient of each initial bounding box Q0 with respect to the annotation bounding boxes Gd is then computed: when the Jaccard coefficients of Q0 with all annotation boxes Gd are less than 0.5, Q0 is labeled as negative (non-text) with class label P_rpn = 0; otherwise, i.e. when at least one annotation box Gd has a Jaccard coefficient with Q0 of not less than 0.5, Q0 is labeled as positive (text) with class label P_rpn = 1, and the position offsets are computed relative to the annotation box with the largest Jaccard coefficient, according to:

x = x0 + w0·Δx
y = y0 + h0·Δy
w = w0·exp(Δw)
h = h0·exp(Δh)

where x0, y0 are the abscissa and ordinate of the center point of the initial bounding box Q0, w0, h0 are the width and height of Q0, Δx, Δy are the horizontal and vertical offsets of the center of Q0 relative to the center of Gd, and exp is the exponential function. The training label of the region extraction network is thus obtained as:

gt_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn, P_rpn)
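The four offset formulas can be sketched as a pair of framework-free Python helpers (hypothetical names, not the patent's code): `encode_offsets` inverts the formulas to produce regression targets, and `decode_offsets` applies predicted offsets to an anchor box.

```python
import math

def encode_offsets(anchor, gt):
    """Invert x = x0 + w0*dx, y = y0 + h0*dy, w = w0*exp(dw), h = h0*exp(dh)
    to obtain the regression targets (dx, dy, dw, dh) for an anchor box."""
    x0, y0, w0, h0 = anchor
    x, y, w, h = gt
    return ((x - x0) / w0, (y - y0) / h0,
            math.log(w / w0), math.log(h / h0))

def decode_offsets(anchor, deltas):
    """Apply predicted offsets to an anchor, following the formulas above."""
    x0, y0, w0, h0 = anchor
    dx, dy, dw, dh = deltas
    return (x0 + w0 * dx, y0 + h0 * dy,
            w0 * math.exp(dw), h0 * math.exp(dh))

anchor = (50.0, 40.0, 32.0, 16.0)   # (cx, cy, w, h)
gt     = (58.0, 44.0, 64.0, 24.0)
deltas = encode_offsets(anchor, gt)
# Encoding then decoding reproduces the ground-truth box
assert all(abs(a - b) < 1e-9 for a, b in zip(decode_offsets(anchor, deltas), gt))
```

Encoding width and height in log space keeps the targets scale-invariant, which is why the formulas use exp rather than a linear term.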
For the multidirectional rectangle detection network, each polygon P = {p1, p2 … pm} in the data set labels is first converted into the smallest multidirectional rectangular bounding box of the polygonal text label box, represented by the rectangle's center point (x, y), height h, width w and rotation angle θ as G_rotate(x, y, h, w, θ). With the candidate text region corrected by the region extraction network denoted G_rpn(x_rpn, y_rpn, w_rpn, h_rpn), the predicted position offsets are computed according to:

x = x_rpn + w_rpn·Δx_or
y = y_rpn + h_rpn·Δy_or
w = w_rpn·exp(Δw_or)
h = h_rpn·exp(Δh_or)

From these formulas the training label of the multidirectional rectangle detection network is obtained as:

gt_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ)
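To make the geometry concrete, the following hypothetical helper converts a multidirectional rectangle G_rotate(x, y, h, w, θ) into its four corner points. This is a sketch: the patent does not fix the sign convention of θ, so clockwise rotation in image coordinates (y axis pointing down) is assumed here.

```python
import math

def rotated_rect_corners(cx, cy, w, h, theta_deg):
    """Corners of a rectangle centred at (cx, cy) with width w and height h,
    rotated by theta degrees (clockwise in image coordinates, assumption)."""
    t = math.radians(theta_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    corners = []
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2),
                   (w / 2, h / 2), (-w / 2, h / 2)):
        corners.append((cx + dx * cos_t - dy * sin_t,
                        cy + dx * sin_t + dy * cos_t))
    return corners

# With theta = 0 this is just the axis-aligned box
assert rotated_rect_corners(10, 10, 8, 4, 0) == [(6.0, 8.0), (14.0, 8.0),
                                                 (14.0, 12.0), (6.0, 12.0)]
```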
for the boundary point detection network, the training label calculation process of the boundary point detection network is as follows:
a. setting default boundary points: based on detected multidirectional rectangular bounding boxes
Grotate(x, y, h, w, theta), rotating the rectangle clockwise by theta degrees to obtain a horizontal bounding box Ghorizon(x, y, h, w), sampling K boundary points at equal intervals on each long side of the horizontal bounding box to obtain an upper default boundary point sequence and a lower default boundary point sequence: pdu={p1,p2…pKAnd Pdd={p1,p2…pKIs of Pd=Pdu∪Pdd。
b. Generating a target boundary point:
a) first, a polygon P is divided into two sides according to a long side, P1={p1,p2…plAnd P2={pl+1,…,pmP represents a point in the polygon.
b) According to P1And P2Generating boundary points of an upper boundary and a lower boundary: ptu={p1,p2…pKAnd Ptd={p1,p2…pKIs of Pt=Ptu∪Ptd。
c. Calculating training label gt according to the following formulabp={(Δxi,Δyi),|i∈[0,2K-1)}:
Wherein the content of the first and second substances,andrespectively representing the coordinates of the ith target boundary point and the coordinates of the ith default boundary point.
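Steps a–c can be sketched as follows. This is a minimal illustration with hypothetical helper names; the equal-interval sampling follows step a, while normalizing the offsets by the box width and height is an assumption of this sketch (the patent's exact normalization is not reproduced in the text).

```python
def default_boundary_points(x, y, w, h, K=7):
    """Sample K equally spaced default points on the top and bottom long
    sides of the horizontal box G_horizon centred at (x, y)."""
    xs = [x - w / 2 + w * i / (K - 1) for i in range(K)]
    top = [(xi, y - h / 2) for xi in xs]       # P_du
    bottom = [(xi, y + h / 2) for xi in xs]    # P_dd
    return top + bottom                        # P_d = P_du ∪ P_dd, 2K points

def boundary_point_targets(default_pts, target_pts, w, h):
    """Regression targets gt_bp = {(dx_i, dy_i)}. Dividing by the box
    width/height is an assumed normalization, mirroring the RPN offsets."""
    return [((xt - xd) / w, (yt - yd) / h)
            for (xd, yd), (xt, yt) in zip(default_pts, target_pts)]

pd = default_boundary_points(32, 16, 64, 16)
assert len(pd) == 14 and pd[0] == (0.0, 8.0) and pd[6] == (64.0, 8.0)
```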
For the attention-based sequence recognition network, each text instance in the input image is annotated with a character string of length n, s_i = (c0, c1, …, c_{n−1}) with c_i ∈ {0, 1, …, 9, a, b, …, z, A, B, …, Z}, describing the text content. The training label of the recognition network is gt_recog = (onehot(c0), onehot(c1), …, onehot(c_{n−1})), where onehot(c_i) denotes the conversion of character c_i into one-hot encoded form. Combining the above, the final training label is generated as gt = {gt_rpn, gt_or, gt_bp, gt_recog};
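A minimal sketch of the one-hot label construction over the 62-character alphabet plus a stop symbol (hypothetical helper names; the 63rd class matches the recognizer's stop character described later):

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 classes
EOS = len(ALPHABET)  # index 62: the stop character, 63 classes in total

def onehot(ch):
    """One-hot encode a character over the 62-symbol alphabet plus stop."""
    v = [0] * (len(ALPHABET) + 1)
    v[ALPHABET.index(ch) if ch != '<eos>' else EOS] = 1
    return v

# gt_recog for the word "Hi9", terminated by the stop symbol
gt_recog = [onehot(c) for c in "Hi9"] + [onehot('<eos>')]
assert len(gt_recog[0]) == 63 and gt_recog[0][ALPHABET.index('H')] == 1
```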
(1.2.3) With the standard training data set Itr as input to the recognition network model, features are extracted by the feature pyramid network module: the images of the standard training data set Itr are fed into the bottom-up ResNet-50 structure of the feature pyramid network, in which a group of convolutional layers that does not change the feature map size is defined as a stage (stages {P2, P3, P4, P5, P6}), and the final output convolution feature F of each stage is extracted. The top-down connections of the feature pyramid module upsample the output convolution features of ResNet-50 to generate multi-scale upsampled features, and the lateral connection structure fuses the features of each stage upsampled in the top-down pass with the features generated in the bottom-up pass, producing the final features {F2, F3, F4, F5, F6}, as shown in fig. 3.
(1.2.4) inputting the features extracted by the feature pyramid network into a region extraction network, distributing anchor points, adjusting a feature map by using a region-of-interest alignment method, and generating a candidate text box:
for input picture ItrkExtracting 5 stage features { F2, F3, F4, F5, F6} through a feature pyramid network, and defining anchors according to stages { P2, P3, P4, P5, P6}, wherein the anchors are defined in the stagesFeature scale of different stages is 322,642,1282,2562,5122And each scale layer has 5 aspect ratios {1:5, 1:2, 1:1, 2:1, 5:1 }; thus, 25 candidate text boxes { Ftr with different scales and proportions can be extracted1,Ftr2,…,Ftr25Is denoted as FtrpSubscript p ═ 1, …, 25; in the region extraction network, the probability that each candidate text box is a correct text region bounding box is predicted to be P through classificationrpnPredicting candidate textbox offsets by regression:
Yrpn=(Δxrpn,Δyrpn,Δhrpn,Δwrpn)。
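The 25 anchor shapes (5 scales × 5 aspect ratios) can be generated as below. This is a sketch: parameterising the aspect ratio as h:w while preserving the anchor area is a common convention that the text does not spell out.

```python
def make_anchors():
    """Generate the 25 anchor shapes: one scale per pyramid stage
    {32^2, 64^2, 128^2, 256^2, 512^2} combined with the aspect ratios
    {1:5, 1:2, 1:1, 2:1, 5:1}, keeping the area fixed per scale."""
    anchors = []
    for scale in (32, 64, 128, 256, 512):
        area = scale * scale
        for r in (1 / 5, 1 / 2, 1, 2, 5):   # h:w ratios
            h = (area * r) ** 0.5
            w = area / h
            anchors.append((w, h))
    return anchors

anchors = make_anchors()
assert len(anchors) == 25
# The 1:1 anchor of the first stage is exactly 32 x 32
assert anchors[2] == (32.0, 32.0)
```

The wide 1:5 anchors suit long horizontal words, while the tall 5:1 anchors cover vertical text lines.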
Candidate text boxes predicted as correct text region bounding boxes are selected and passed to the subsequent multidirectional rectangle detection network, boundary point detection network and attention-based sequence recognition network. The correct text regions selected by the region extraction network are turned into candidate text regions of fixed 7×7 scale by a region-of-interest alignment operation, and the multidirectional rectangle prediction network predicts the multidirectional bounding box of the text instance within each fixed-scale candidate region. Specifically, the multidirectional rectangle prediction network outputs Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ), comprising 4 predicted offsets and one predicted angle; by computing the loss function and back-propagating, the network eventually learns to predict the multidirectional bounding box of the text instance.
(1.2.5) After the multidirectional bounding box of each text instance is predicted by the multidirectional rectangle prediction network, a candidate text region of fixed 7×7 scale is generated by a rotated region-of-interest alignment operation. The boundary point prediction network outputs 28 predicted regression offsets Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)}; by computing the loss function and back-propagating, the network eventually learns to predict the boundary points of the text instance.
(1.2.6) After the boundary points of each text instance are predicted by the boundary point prediction network, a sampling grid is generated by the thin plate spline interpolation algorithm, rectifying the features of arbitrarily shaped text into a horizontal feature map of fixed 16×64 scale. This feature map is input into the attention-based sequence recognition network to predict the text content. The recognition network consists of 3 convolutional layers and an RNN whose basic unit is the GRU. After the 3 convolutional layers, the text feature resolution is 2×32; at every step the RNN model outputs a probability distribution of dimension 63 (62 characters and a stop character), in which each dimension lies in [0, 1] and the values sum to 1. The predicted probability distributions of all steps, P_recog, are combined with a beam search algorithm to predict the character sequence S_q.
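The beam-search decoding over the per-step distributions can be sketched as follows (illustrative only, with hypothetical names and the alphabet reduced to 3 classes for brevity; a real decoder would also match candidates against the dictionary):

```python
import math

def beam_search(step_probs, beam_width=5, eos=62):
    """Minimal beam search over a sequence of per-step probability
    distributions, decoding them into a character-index sequence."""
    beams = [([], 0.0)]  # (sequence of class indices, log-probability)
    for probs in step_probs:
        candidates = []
        for seq, logp in beams:
            if seq and seq[-1] == eos:          # finished beams stay as-is
                candidates.append((seq, logp))
                continue
            for c, p in enumerate(probs):
                if p > 0:
                    candidates.append((seq + [c], logp + math.log(p)))
        candidates.sort(key=lambda b: b[1], reverse=True)
        beams = candidates[:beam_width]
    best = beams[0][0]
    return best[:best.index(eos)] if eos in best else best

# Toy example with classes {0, 1} and eos = 2
steps = [[0.1, 0.8, 0.1], [0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
assert beam_search(steps, eos=2) == [1, 0]
```

Unlike greedy decoding, the beam keeps several partial hypotheses alive, so a locally weaker character can still win if the rest of the sequence supports it.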
(1.2.7) Taking the training label gt computed in step (1.2.2) as the expected output of the network and the prediction labels of steps (1.2.4), (1.2.5) and (1.2.6) as the predicted output of the network, an objective loss function between expected and predicted output is designed for the network model constructed in (1.2.1). The overall objective loss function is composed of the losses of the region extraction network, the multidirectional rectangle prediction network, the boundary point prediction network and the sequence recognition network:

L(P_rpn, Y_rpn, Y_or, Y_bp, P_recog) = L_rpn(P_rpn, Y_rpn) + α1·L_or(Y_or) + α2·L_bp(Y_bp) + α3·L_recog(P_recog)

where L_rpn(P_rpn, Y_rpn) is the loss function of the region extraction network, L_or(Y_or) the loss function of the multidirectional rectangle detection network, L_bp(Y_bp) the loss function of the boundary point detection network, and L_recog(P_recog) the loss function of the sequence recognition network; α1, α2, α3 are the weight coefficients of L_or, L_bp and L_recog respectively, all simply set to 1;
According to the designed overall objective loss function, the model is trained iteratively with the back-propagation algorithm to minimize the overall loss and obtain the optimal network model. For the scene text detection and recognition task, training first iterates on a synthetic text data set (SynthText) to obtain initial network parameters; training is then performed on the real data set to fine-tune the network parameters.
(2) Character recognition is performed on the text picture to be recognized using the trained model, comprising the following substeps:
(2.1) The features extracted from the scene text picture to be detected and recognized are fed in turn into the region extraction network and the multidirectional rectangle detection network to generate multidirectional candidate text regions, which are filtered by a non-maximum suppression operation to obtain more accurate multidirectional candidate text regions. The k-th picture Itst_k of the data set Itst to be detected is input into the model trained in step (1.2); after the feature pyramid network and the region extraction network, positive candidate text regions are extracted by the region extraction network. Since the positive text quadrilaterals regressed on the different feature maps of the same test picture Itst_k usually overlap each other, non-maximum suppression is then applied to the positions of all positive text quadrilaterals, specifically: 1) a predicted text bounding box is kept if and only if its text classification score P_rcnn ≥ 0.5; 2) non-maximum suppression (NMS) at a Jaccard coefficient of 0.2 is applied to the boxes kept in the previous step, giving the finally retained positive text quadrilateral bounding boxes. Fixed-scale features are then extracted from the filtered positive text quadrilateral bounding boxes and input into the multidirectional rectangle prediction network to predict Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ).
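The two filtering steps (score threshold 0.5, then NMS at a Jaccard coefficient of 0.2) can be sketched as follows. Axis-aligned boxes are used here for simplicity, whereas the patent applies the same procedure to text quadrilaterals; the helper names are hypothetical.

```python
def iou(a, b):
    """Jaccard coefficient (IoU) of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_boxes(boxes, scores, score_thr=0.5, iou_thr=0.2):
    """Step 1: keep boxes with score >= 0.5; step 2: greedy NMS at IoU 0.2."""
    cand = sorted((p for p in zip(scores, boxes) if p[0] >= score_thr),
                  reverse=True)               # highest score first
    kept = []
    for s, b in cand:
        if all(iou(b, k) < iou_thr for k in kept):
            kept.append(b)
    return kept

boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.6]
# The second box overlaps the first (IoU 81/119) and is suppressed
assert filter_boxes(boxes, scores) == [(0.0, 0.0, 10.0, 10.0),
                                       (20.0, 20.0, 30.0, 30.0)]
```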
The predicted multidirectional text bounding box is computed from the predicted center point coordinates, width and height, and rotation angle of the multidirectional rectangle; the multidirectional text features are rotated into horizontal features according to the predicted multidirectional text bounding box and input into the boundary point detection network. The boundary point detection network predicts the regression offsets Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)} of the 7 boundary points on each of the upper and lower boundaries. Combined with the 14 preset default boundary points, the boundary point coordinates within the horizontal box are computed with the formula of (1.2.2), and the predicted boundary point coordinates are then rotated counterclockwise by θ, using the predicted rotation angle of the multidirectional rectangle, to obtain the boundary point positions in the original image.
(2.2) From the boundary points of the text instance predicted in step (2.1), a sampling grid is generated with the thin plate spline interpolation algorithm, and the text features of arbitrary shape are rectified into horizontal form. The rectified text feature resolution is 16×64, and the feature map is input into the sequence recognition network to obtain the probability distribution sequence {p0, p1, …, p_{N−1}}, where p_i is the probability distribution predicted by the RNN at each step, with dimension 63, and N = 35 is the maximum number of RNN steps. During testing, prediction stops when the prediction of step k is the stop character, and the probability distribution of the finally predicted sequence is {p0, p1, …, p_{k−1}}. According to these distributions, the class of maximum probability at each step is the character predicted at that step, yielding the final predicted character sequence S_q.
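The per-step argmax decoding with a stop character described above can be sketched as follows (hypothetical helper names; `dist` just builds a toy distribution for the example):

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def greedy_decode(step_probs, alphabet, eos_index=62, max_steps=35):
    """Test-phase decoding: at each step emit the most probable class,
    stopping as soon as the stop character (index 62) is predicted."""
    out = []
    for probs in step_probs[:max_steps]:
        k = max(range(len(probs)), key=probs.__getitem__)
        if k == eos_index:
            break
        out.append(alphabet[k])
    return "".join(out)

def dist(idx, dim=63):
    """Toy probability distribution peaked at class idx."""
    d = [0.01] * dim
    d[idx] = 0.9
    return d

seq = [dist(ALPHABET.index('h')), dist(ALPHABET.index('i')), dist(62)]
assert greedy_decode(seq, ALPHABET) == "hi"
```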
Through the technical scheme, compared with the prior art, the invention has the following technical effects:
(1) High accuracy: for the problem of recognizing arbitrarily shaped scene text, the method converts text of arbitrary shape into horizontal text by predicting the boundary points of the text, detecting text positions and recognizing text more accurately.
(2) High speed: while guaranteeing detection and recognition accuracy, the proposed detection and recognition model trains quickly, needs no multi-stage iterative training, and the whole network can be trained end to end.
(3) Strong universality: the invention discloses an end-to-end trainable text detection and recognition model that can not only detect and recognize text simultaneously, but also handle text of various shapes, including horizontal, oriented and curved text, without character-level labeling;
(4) Strong robustness: the invention can cope with changes of text scale and shape, and can simultaneously detect and recognize horizontal, oriented and curved text.
Drawings
FIG. 1 is a flowchart of a method for recognizing a scene text end-to-end based on boundary point detection according to the present invention, in which a solid arrow represents training and a dotted arrow represents testing;
FIG. 2 is a diagram of an end-to-end recognition network model for scene text based on boundary point detection according to the present invention;
FIG. 3 is a schematic diagram of a network structure of a feature pyramid structure module in an end-to-end scene text recognition model based on boundary point detection according to the present invention;
FIG. 4 is a diagram of a sequence recognition network structure based on attention mechanism in a scene text end-to-end recognition model based on boundary point detection according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The technical terms of the present invention are first explained:
ResNet-50: a neural network for classification mainly comprises 50 convolutional layers, a pooling layer and a short connecting layer. The convolution layer is used for extracting picture characteristics; the pooling layer has the functions of reducing the dimensionality of the feature vector output by the convolutional layer and reducing overfitting; the shortcut connection layer is used for transferring gradient and solving the problems of extinction and explosion gradient. The network parameters can be updated through a reverse conduction algorithm;
area extraction network: a network for generating candidate text regions is used for generating full-connection features with the height of a specific dimension on an extracted feature map by using a sliding window, generating two full-connection branch classification and regression candidate text regions according to the full-connection features, and finally generating candidate text regions with different scale proportions for a subsequent network according to different anchor points and proportions.
Jaccard coefficient: the Jaccard coefficient is used for comparing similarity and difference between limited sample sets, in the field of text detection, the Jaccard coefficient is defaulted to be equal to IOU (input/output), namely the intersection area/combination area of two frames, and describes the overlapping rate of a predicted text box and an original marked text box generated by a model, wherein the IOU is larger, the overlapping degree is higher, and the detection is more accurate.
Non-maximum suppression (NMS): a post-processing algorithm widely used in computer vision detection. According to a set threshold, overlapping detection boxes are filtered by iterating over score-sorted boxes and rejecting redundant ones, removing superfluous detection boxes to obtain the final detection result.
Thin plate spline interpolation algorithm (TPS): an interpolation method that finds a smooth surface of minimal bending energy passing through all control points. With this algorithm, text of any shape can be converted into horizontal form with minimal distortion of the characters as a whole.
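The interpolation idea can be sketched in pure Python: fit a thin plate spline through control points using the radial kernel U(r) = r² log r² plus an affine part. This is an illustrative 2-D → 1-D fit with hypothetical helper names; a full rectification fits one such spline per output coordinate and then samples a grid.

```python
import math

def U(r2):
    """TPS radial kernel U(r) = r^2 * log(r^2), with U(0) = 0."""
    return r2 * math.log(r2) if r2 > 0 else 0.0

def solve(A, b):
    """Small dense Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def tps_fit(src, dst_vals):
    """Fit a 2-D -> 1-D thin plate spline f with f(src[i]) = dst_vals[i]."""
    n = len(src)
    m = n + 3                       # kernel weights plus affine terms
    A = [[0.0] * m for _ in range(m)]
    b = [0.0] * m
    for i, (xi, yi) in enumerate(src):
        for j, (xj, yj) in enumerate(src):
            A[i][j] = U((xi - xj) ** 2 + (yi - yj) ** 2)
        A[i][n], A[i][n + 1], A[i][n + 2] = 1.0, xi, yi
        b[i] = dst_vals[i]
    for j, (xj, yj) in enumerate(src):  # orthogonality conditions
        A[n][j], A[n + 1][j], A[n + 2][j] = 1.0, xj, yj
    w = solve(A, b)
    def f(x, y):
        s = w[n] + w[n + 1] * x + w[n + 2] * y
        s += sum(w[j] * U((x - xj) ** 2 + (y - yj) ** 2)
                 for j, (xj, yj) in enumerate(src))
        return s
    return f

src = [(0, 0), (1, 0), (0, 1), (1, 1)]
fx = tps_fit(src, [0.0, 2.0, 0.0, 2.0])   # maps x -> 2x at the control points
assert all(abs(fx(px, py) - 2 * px) < 1e-6 for px, py in src)
```

Because the spline interpolates the control points exactly while minimizing bending energy elsewhere, mapping the detected boundary points onto the sides of a horizontal rectangle straightens the text with as little distortion as possible.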
As shown in fig. 1, the method for recognizing a scene text end-to-end based on boundary point detection of the present invention includes the following steps:
(1) training a scene text end-to-end recognition network model based on boundary point detection, comprising the following sub-steps:
(1.1) carrying out word-level annotation on the texts of arbitrary shape in all pictures of the original data set, where each label consists of the clockwise-ordered vertex coordinates of the word-level polygonal text bounding box and the character sequence of the word, obtaining a labeled standard training data set;
and (1.2) defining a scene text end-to-end recognition network model based on boundary point detection, wherein the model consists of a feature pyramid structure network, a region extraction network, a multidirectional rectangle detection network, a boundary point detection network and an attention-based sequence recognition network. Training labels are calculated according to the labeled standard training data set of step (1.1), a loss function is designed, and the scene text end-to-end recognition network based on boundary point detection is trained with the back-propagation method to obtain the scene text end-to-end recognition network model based on boundary point detection; this specifically comprises the following sub-steps:
(1.2.1) constructing a scene text end-to-end identification network model based on boundary point detection, wherein the identification network model consists of a characteristic pyramid structure network, a region extraction network, a multidirectional rectangular detection network, a boundary point detection network and a sequence identification network based on an attention mechanism, as shown in fig. 2; the feature pyramid structure network is shown in fig. 3, and is formed by adding a bottom-up connection, a top-down connection and a transverse connection to a base network of a ResNet-50 deep convolutional neural network, and is used for extracting features fused with different resolutions from an input standard data set picture; inputting the extracted features of different scales into a region extraction network to obtain a candidate text region, and after the alignment operation of the region of interest, obtaining the candidate text region of a fixed scale; inputting a candidate text region with the resolution of 7 multiplied by 7 extracted by a region extraction network into a rapid region classification regression network, predicting the probability that the input candidate text region is a positive sample through classification branches, providing a more accurate candidate text region, calculating the offset of the candidate text region relative to a real text region through regression branches, and adjusting the position of the candidate text region; the multidirectional rectangle detection network is composed of 3 full-connection layers FC1, FC2 and FC3, and outputs a prediction vector with dimension 5, which respectively represents the offset of the center of a candidate text region from the center of a minimum circumscribed rectangle, the width and height of the minimum circumscribed rectangle and the rotation angle of the minimum circumscribed rectangle. 
The boundary point detection network is composed of 4 convolutional layers Conv1, Conv2, Conv3 and Conv4 and a full-connection layer, and outputs a vector with dimension of 28, wherein the vector respectively represents the offset of 7 boundary points of the upper boundary and the lower boundary of the text example; the attention-based sequence recognition network is shown in fig. 4 and is composed of three convolutional layers and an attention-based model, and the attention-based model outputs a probability distribution of a predicted character at each step.
(1.2.2) generating a horizontal initial bounding box, a multidirectional rectangular bounding box and upper and lower boundary points of a text instance on the original image according to the labeled standard training set and the feature maps, providing training labels respectively for the region extraction network, the multidirectional rectangle detection network and the boundary point detection network: for the labeled standard training set Itr, the ground-truth label of an input picture contains polygons P = {P1, P2, …, Pm} representing the text regions and character strings S = {s1, s2, …, sm} representing the text content. For an input picture Itri, Pi is the polygonal bounding box of a text region in picture Itri, pij = (xij, yij) is the coordinate of the j-th vertex of polygon Pi, m denotes the number of polygonal text annotation boxes, and si is the text content of polygon Pi.
For a given standard dataset Itr, first the polygons P = {P1, P2, …, Pm} in the dataset labels are converted into the smallest horizontal rectangular bounding boxes of the polygonal text annotation boxes; a bounding box Gd(x, y, h, w) is represented by the center point (x, y) of the rectangle together with its height h and width w. For the region extraction network, according to the annotation bounding boxes Gd(x, y, h, w) of the annotated dataset, each pixel on each of the feature maps output by the feature pyramid is mapped back to the original image, and several initial bounding boxes are generated according to the candidate text regions predicted by the region extraction network. The Jaccard coefficient of each initial bounding box Q0 with respect to the annotation bounding boxes Gd is calculated: when the Jaccard coefficients of all annotation bounding boxes Gd with the initial bounding box Q0 are less than 0.5, the initial bounding box Q0 is marked as negative (non-text) with class label Prpn = 0; otherwise, i.e. there is at least one annotation bounding box Gd whose Jaccard coefficient with Q0 is not less than 0.5, Q0 is marked as positive (text) with class label Prpn = 1, and the position offsets are calculated relative to the annotation box with the maximum Jaccard coefficient, according to the following formulas:
x=x0+w0Δx
y=y0+h0Δy
w=w0exp(Δw)
h=h0exp(Δh)
where x0 and y0 are respectively the abscissa and ordinate of the center point of the initial bounding box Q0, w0 and h0 are respectively the width and height of the initial bounding box Q0, Δx and Δy are respectively the horizontal and vertical offsets of the center point of Q0 relative to the center point of Gd, and exp is the exponential function. The training label of the region extraction network is obtained as:
gtrpn=(Δxrpn,Δyrpn,Δhrpn,Δwrpn,Prpn)
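The offset parameterization above (x = x0 + w0Δx, w = w0 exp(Δw), and likewise for y and h) can be sketched as an encode/decode pair; the function names and the (cx, cy, w, h) box format are assumptions for illustration:

```python
import math

def encode_offsets(q0, gd):
    """Offsets (dx, dy, dw, dh) of annotation box gd relative to initial
    box q0; boxes are (cx, cy, w, h), matching x = x0 + w0*dx and
    w = w0*exp(dw) in the text."""
    x0, y0, w0, h0 = q0
    x, y, w, h = gd
    return ((x - x0) / w0, (y - y0) / h0,
            math.log(w / w0), math.log(h / h0))

def decode_offsets(q0, deltas):
    """Inverse transform: recover the box from q0 and predicted offsets."""
    x0, y0, w0, h0 = q0
    dx, dy, dw, dh = deltas
    return (x0 + w0 * dx, y0 + h0 * dy, w0 * math.exp(dw), h0 * math.exp(dh))
```

The same parameterization, extended with the angle θ, applies to the multidirectional rectangle offsets below.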
For the multidirectional rectangle detection network, first the polygons P = {P1, P2, …, Pm} in the dataset labels are converted into the smallest multidirectional rectangular bounding boxes of the polygonal text annotation boxes; a multidirectional rectangular bounding box Grotate(x, y, h, w, θ) is represented by the center point (x, y), height h, width w and rotation angle θ of the rectangle. The candidate text region corrected by the region extraction network is Grpn(xrpn, yrpn, wrpn, hrpn), and the predicted position offsets are calculated as follows:
x=xrpn+wrpnΔxor
y=yrpn+hrpnΔyor
w=wrpnexp(Δwor)
h=hrpnexp(Δhor)
the training label of the multidirectional rectangular detection network obtained by the formula is as follows:
gtor=(Δxor,Δyor,Δhor,Δwor,θ)
for the boundary point detection network, the training label calculation process of the boundary point detection network is as follows:
a. Setting default boundary points: according to the detected multidirectional rectangular bounding box Grotate(x, y, h, w, θ), the rectangle is rotated clockwise by θ degrees to obtain a horizontal bounding box Ghorizon(x, y, h, w); K boundary points are sampled at equal intervals on each long side of the horizontal bounding box, yielding the upper and lower default boundary point sequences Pdu = {p1, p2, …, pK} and Pdd = {p1, p2, …, pK}, with Pd = Pdu ∪ Pdd.
b. Generating a target boundary point:
a) First, the polygon P is divided into two sides along its long sides: P1 = {p1, p2, …, pl} and P2 = {pl+1, …, pm}, where p denotes a vertex of the polygon.
b) P1 and P2 are input into Algorithm 1, which generates the boundary points of the upper and lower boundaries: Ptu = {p1, p2, …, pK} and Ptd = {p1, p2, …, pK}, with Pt = Ptu ∪ Ptd.
c. The training label gtbp = {(Δxi, Δyi) | i ∈ [0, 2K)} is calculated according to the following formula:
Δxi = xti − xdi, Δyi = yti − ydi
where (xti, yti) and (xdi, ydi) respectively represent the coordinates of the i-th target boundary point and the i-th default boundary point.
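Step a above (sampling K default points at equal intervals on each long side of the horizontal box) can be sketched as follows; the function name and the center/size box format are assumptions, with K = 7 as in this document:

```python
def default_boundary_points(cx, cy, w, h, K=7):
    """Sample K equally spaced default points on each long side of the
    horizontal box Ghorizon: the upper sequence runs left-to-right along
    y = cy - h/2, the lower sequence along y = cy + h/2."""
    xs = [cx - w / 2 + w * i / (K - 1) for i in range(K)]
    upper = [(x, cy - h / 2) for x in xs]
    lower = [(x, cy + h / 2) for x in xs]
    return upper + lower  # P_d = P_du ∪ P_dd, 2K points in total
```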
For the attention-based sequence recognition network, each text instance in the input image is annotated with a corresponding character string of length n, si = {(c0, c1, …, cn−1) | ci ∈ {0, 1, …, 9, a, b, …, z, A, B, …, Z}}, to describe the text content. The training label of the recognition network is gtrecog = (onehot(c0), onehot(c1), …, onehot(cn−1)), where onehot(ci) denotes converting character ci into its one-hot encoding. Combining the above, the final training label is generated as follows: gt = {gtrpn, gtor, gtbp, gtrecog};
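The one-hot recognition label described above can be sketched as follows; the vocabulary ordering (digits, then lowercase, then uppercase, with the stop symbol appended last to give 63 dimensions) is an assumption for illustration:

```python
import string

# Assumed 62-character vocabulary: digits, lowercase, uppercase; the stop
# symbol occupies the extra index 62, giving the 63-way distribution.
VOCAB = string.digits + string.ascii_lowercase + string.ascii_uppercase

def onehot(ch):
    """One-hot encode a single character over the 63-dim vocabulary."""
    vec = [0] * (len(VOCAB) + 1)      # +1 for the stop character
    vec[VOCAB.index(ch)] = 1
    return vec

def recognition_label(text):
    """gt_recog = (onehot(c0), onehot(c1), ...) for one text instance."""
    return [onehot(c) for c in text]
```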
(1.2.3) The standard training data set Itr is used as the input of the recognition network model, and features are extracted with the feature pyramid network module: the images of the standard training data set Itr are input into the bottom-up ResNet-50 structure of the feature pyramid network, the convolutional layer units that do not change the feature map size are defined as stages (stages {P2, P3, P4, P5, P6}), and the final output convolutional features F of each stage are extracted. The top-down connections in the feature pyramid network module upsample the output convolutional features of ResNet-50 to generate multi-scale upsampled features, and the lateral connections in the feature pyramid network module fuse the features of each stage upsampled in the top-down pathway with the features generated in the bottom-up pathway, producing the final features {F2, F3, F4, F5, F6}, as shown in fig. 3.
(1.2.4) inputting the features extracted by the feature pyramid network into a region extraction network, distributing anchor points, adjusting a feature map by using a region-of-interest alignment method, and generating a candidate text box:
For an input picture Itrk, 5 stages of features {F2, F3, F4, F5, F6} are extracted through the feature pyramid network, and according to the stages {P2, P3, P4, P5, P6} the anchor scales of the different stages are defined as {32², 64², 128², 256², 512²}, with 5 aspect ratios {1:5, 1:2, 1:1, 2:1, 5:1} at each scale level. Thus 25 candidate text box sets of different scales and ratios can be extracted, {Ftr1, Ftr2, …, Ftr25}, denoted Ftrp with subscript p = 1, …, 25. In the region extraction network, the probability Prpn that each candidate text box is a correct text region bounding box is predicted through classification, and the candidate text box offsets are predicted through regression:
Yrpn=(Δxrpn,Δyrpn,Δhrpn,Δwrpn)。
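The 5 scales × 5 aspect ratios = 25 anchor shapes described above can be enumerated as follows; this is a sketch assuming each scale s denotes an anchor area of s², with the aspect ratio applied area-preservingly (the function name is illustrative):

```python
def generate_anchors(scales=(32, 64, 128, 256, 512),
                     ratios=((1, 5), (1, 2), (1, 1), (2, 1), (5, 1))):
    """Enumerate the 25 anchor shapes (w, h): each scale s has area s*s,
    and a ratio (a, b) means w:h = a:b while preserving that area."""
    anchors = []
    for s in scales:
        area = float(s * s)
        for a, b in ratios:
            h = (area * b / a) ** 0.5   # from w*h = area and w/h = a/b
            w = area / h
            anchors.append((w, h))
    return anchors
```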
The candidate text boxes predicted as correct text region bounding boxes are selected and input into the subsequent multidirectional rectangle detection network, boundary point detection network and attention-based sequence recognition network. The correct text regions selected by the region extraction network are turned into candidate text regions of fixed 7 × 7 scale through the region-of-interest alignment operation, and the multidirectional rectangle prediction network predicts the multidirectional bounding box of the text instance in each fixed-scale candidate text region. Specifically, the multidirectional rectangle prediction network predicts Yor = (Δxor, Δyor, Δhor, Δwor, θ), which comprises 4 predicted offsets and one predicted angle; by computing the loss function and back-propagating, the network finally learns to predict the multidirectional bounding box of the text instance.
(1.2.5) After the multidirectional bounding box of each text instance is predicted by the multidirectional rectangle prediction network, a candidate text region of fixed 7 × 7 scale is generated through the rotated region-of-interest alignment operation. The boundary point prediction network outputs 28 predicted regression offsets Ybp = {(Δxi, Δyi) | i ∈ [0, 14)}; by computing the loss function and back-propagating, the network finally learns to predict the boundary points of the text instance.
(1.2.6) After the boundary points of each text instance are predicted by the boundary point prediction network, a sampling grid is generated by the thin-plate spline interpolation algorithm, and the text features of arbitrary shape are rectified into a horizontal feature map of fixed 16 × 64 scale. The feature map is input into the attention-based sequence recognition network to predict the text content. As shown in fig. 4, the recognition network consists of 3 convolutional layers and an RNN whose basic unit is a GRU. After the 3 convolutional layers, the resolution of the text features is 2 × 32; at each step the RNN model outputs a probability distribution of dimension 63 (62 characters and a stop character), in which the value of each dimension lies in [0, 1] and the values sum to 1. The predicted probability distributions Precog of all steps are combined with the beam search algorithm to predict the character sequence Sq.
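A minimal beam-search sketch over per-step probability distributions follows. Note the simplifying assumptions: the distributions are treated as precomputed and independent, whereas the actual attention decoder conditions each step on the previously emitted character; placing the stop symbol at index 62 of the 63-way distribution is also an assumption:

```python
import math

def beam_search(step_probs, beam_width=3, stop_id=62):
    """Beam search over a sequence of per-step probability distributions.
    step_probs: list of probability lists. Returns the best character-id
    sequence (stop symbol excluded) by total log-probability."""
    beams = [([], 0.0)]                      # (sequence, log-probability)
    for probs in step_probs:
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == stop_id:   # finished hypothesis carries over
                candidates.append((seq, score))
                continue
            for cid, p in enumerate(probs):
                if p > 0:
                    candidates.append((seq + [cid], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]      # keep the top-k hypotheses
    best = beams[0][0]
    return [c for c in best if c != stop_id]
```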
(1.2.7) The training label gt is taken as the expected output of the network and the predicted labels as the network's predicted output, and an objective loss function between the expected output and the predicted output is designed for the constructed network model: the training label gt calculated in step (1.2.2) is taken as the expected output of the network, and the predicted labels of steps (1.2.4), (1.2.5) and (1.2.6) as the network's predicted output. For the network model constructed in (1.2.1), an objective loss function between the expected output and predicted output is designed; the overall objective loss function consists of the region extraction network, multidirectional rectangle prediction network, boundary point prediction network and sequence recognition network terms, and its expression is:
L(Prpn,Yrpn,Yor,Ybp,Precog)=Lrpn(Prpn,Yrpn)+α1Lor(Yor)+α2Lbp(Ybp)+α3Lrecog(Precog)
where Lrpn(Prpn, Yrpn) is the loss function of the region extraction network, Lor(Yor) is the loss function of the multidirectional rectangle detection network, Lbp(Ybp) is the loss function of the boundary point detection network, and Lrecog(Precog) is the loss function of the sequence recognition network; α1, α2 and α3 are respectively the weight coefficients of the loss functions Lor, Lbp and Lrecog, and are simply set to 1;
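The overall objective above is a plain weighted sum; a one-line sketch (the function name and scalar-loss inputs are illustrative assumptions, with all weights defaulting to 1 as in the text):

```python
def total_loss(l_rpn, l_or, l_bp, l_recog, a1=1.0, a2=1.0, a3=1.0):
    """Overall objective L = L_rpn + a1*L_or + a2*L_bp + a3*L_recog,
    with all weight coefficients simply set to 1 as in the text."""
    return l_rpn + a1 * l_or + a2 * l_bp + a3 * l_recog
```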
According to the designed overall objective loss function, the model is iteratively trained with the back-propagation algorithm to minimize the overall objective loss function and obtain the optimal network model. For the scene text detection and recognition task, the training process first performs iterative training on the synthetic text dataset (SynthText) to obtain initial network parameters; training is then performed on the real dataset to fine-tune the network parameters.
(2) Character recognition is performed on the text picture to be recognized with the trained model, comprising the following sub-steps:
(2.1) The features extracted from the scene text picture to be detected and recognized are input sequentially into the region extraction network and the multidirectional rectangle detection network to generate multidirectional candidate text regions, which are filtered by the non-maximum suppression operation to obtain more accurate multidirectional candidate text regions: the k-th picture Itstk of the dataset Itst to be detected is input into the model trained in step (1.2); after the feature pyramid network and the region extraction network, the region extraction network extracts the positive candidate text regions. Since the positive text quadrilaterals regressed on the different feature maps of the same test picture Itstk usually overlap one another, a non-maximum suppression operation is applied to the positions of all positive text quadrilaterals, with the following specific steps: 1) a predicted text bounding box is kept if and only if its text classification score Prcnn ≥ 0.5; 2) the non-maximum suppression operation (NMS) is applied to the text boxes kept in the previous step with a Jaccard coefficient threshold of 0.2 to obtain the finally kept positive text quadrilateral bounding boxes. Features of fixed scale are then extracted from the filtered positive text quadrilateral bounding boxes and input into the multidirectional rectangle prediction network to predict Yor = (Δxor, Δyor, Δhor, Δwor, θ).
The predicted multidirectional text bounding box is calculated from the predicted center point coordinates, width, height and rotation angle of the multidirectional rectangle; the multidirectional text features are rotated into horizontal features according to the predicted multidirectional text bounding box and input into the boundary point detection network. The boundary point detection network predicts the regression offsets Ybp = {(Δxi, Δyi) | i ∈ [0, 14)} of the 7 boundary points of each of the upper and lower boundaries. Combined with the 14 preset default boundary points, the coordinates of the boundary points in the horizontal box are calculated with the formula in (1.2.2), and the predicted boundary point coordinates are then rotated counterclockwise by θ using the predicted rotation angle of the multidirectional rectangle to obtain the boundary point positions in the original image.
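The final rotation of the boundary points back into the original image can be sketched as a rotation about the rectangle center; the function name is illustrative, and the counterclockwise direction follows the standard mathematical convention (image coordinate systems with a downward y-axis would flip the sign of θ):

```python
import math

def rotate_points(points, cx, cy, theta_deg):
    """Rotate points counterclockwise by theta_deg around the rectangle
    center (cx, cy), mapping boundary points predicted in the horizontal
    box back onto the multidirectional box in the original image."""
    t = math.radians(theta_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    out = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        out.append((cx + dx * cos_t - dy * sin_t,
                    cy + dx * sin_t + dy * cos_t))
    return out
```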
(2.2) According to the boundary points of the text instance predicted in step (2.1), a sampling grid is generated with the thin-plate spline interpolation algorithm, and the text features of arbitrary shape are rectified into a horizontal shape. The rectified text feature resolution is 16 × 64; the feature map is input into the sequence recognition network to obtain a probability distribution sequence {p0, p1, …, pN−1}, where pi denotes the probability distribution predicted by the RNN at each step, with dimension 63, and N denotes the maximum number of RNN steps, taken as 35. In the test process, prediction stops when the prediction at step k is the stop character, and the finally predicted probability distribution sequence is {p0, p1, …, pk−1}. According to the probability distributions, the class of maximum probability at each step is the current predicted character, finally yielding the predicted character sequence Sq.
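The test-time decoding described above (take the argmax at each step, stop at the stop character) can be sketched as follows; the function name, the `vocab` string and placing the stop symbol at index 62 are assumptions for illustration:

```python
def greedy_decode(step_probs, vocab, stop_id=62):
    """At each step take the class of maximum probability; stop when the
    stop character is predicted, yielding the character sequence S_q."""
    chars = []
    for probs in step_probs:
        best = max(range(len(probs)), key=lambda i: probs[i])
        if best == stop_id:
            break
        chars.append(vocab[best])
    return "".join(chars)
```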
Claims (10)
1. A scene text end-to-end identification method based on boundary point detection is characterized by comprising the following steps:
(1) training a scene text end-to-end recognition network model based on boundary point detection, comprising the following sub-steps:
(1.1) carrying out word-level annotation on the texts of arbitrary shape in all pictures of the original data set, where each label consists of the clockwise-ordered vertex coordinates of the word-level polygonal text bounding box and the character sequence of the word, obtaining a labeled standard training data set;
(1.2) defining a scene text end-to-end recognition network model based on boundary point detection, calculating training labels according to the labeled standard training data set of (1.1), designing a loss function, and training the scene text end-to-end recognition network based on boundary point detection with the back-propagation method to obtain the scene text end-to-end recognition network model based on boundary point detection; comprising the following steps:
(1.2.1) constructing a scene text end-to-end identification network model based on boundary point detection, wherein the identification network model consists of a characteristic pyramid structure network, a region extraction network, a multidirectional rectangular detection network, a boundary point detection network and a sequence identification network based on an attention mechanism;
(1.2.2) generating a horizontal initial bounding box, a multidirectional rectangular bounding box and upper and lower boundary points of a text example on an original image according to a standard training set with labels and a characteristic diagram, and respectively providing training labels for the area extraction network, the multidirectional rectangular detection network, the boundary point detection network and a sequence identification network based on an attention mechanism;
(1.2.3) using the standard training data set Itr as the input of the recognition network model, and extracting features with the feature pyramid network module;
(1.2.4) inputting the features extracted by the feature pyramid network into a region extraction network, and generating a candidate text box by using a region-of-interest alignment method to adjust a feature map through anchor point distribution; generating a candidate text region with a fixed scale of 7 multiplied by 7 by a correct text region selected by the region extraction network through region-of-interest alignment operation, and predicting a multidirectional bounding box of a text example in the candidate text region with the fixed scale by a multidirectional rectangular prediction network;
(1.2.5) after a multidirectional bounding box of each text example is predicted by the multidirectional rectangular prediction network, generating a candidate text region with a fixed scale of 7 x 7 through a rotary region-of-interest alignment operation, and finally learning and predicting boundary points of the text examples by the network;
(1.2.6) after the boundary points of each text instance are predicted by the boundary point prediction network, generating a sampling grid with the thin-plate spline interpolation algorithm, rectifying the text features of arbitrary shape into a horizontal feature map of fixed 16 × 64 scale, inputting the feature map into the attention-based sequence recognition network to predict the text content, and predicting the character sequence Sq according to all the predicted probability distributions Precog;
(1.2.7) taking the training label gt as the expected output of the network and the predicted labels as the network's predicted output, and designing an objective loss function between the expected output and the predicted output for the constructed network model;
(2) performing character recognition on the text picture to be recognized with the trained model, comprising the following sub-steps:
(2.1) sequentially inputting the extracted features of the text picture of the scene to be detected and identified into a region extraction network and a multidirectional rectangular detection network to generate a multidirectional candidate text region, and filtering the multidirectional candidate text region by carrying out non-maximum suppression operation to obtain a more accurate multidirectional candidate text region; rotating the multi-directional text characteristics into horizontal characteristics according to the predicted multi-directional text bounding boxes, and inputting the horizontal characteristics into a boundary point detection network; calculating coordinates of the boundary points in the horizontal frame by using a formula in (1.2.2) in combination with 14 preset default boundary points, and then rotating the predicted coordinates of the boundary points counterclockwise by theta by using the rotation angle of the multidirectional rectangle predicted in (2.1) to obtain the positions of the boundary points in the original image;
(2.2) generating a sampling grid with the thin-plate spline interpolation algorithm according to the boundary points of the text instance predicted in step (2.1), rectifying the text features of arbitrary shape into a horizontal shape, inputting the feature map into the sequence recognition network to obtain a probability distribution sequence, taking the class of maximum probability at each step as the current predicted character according to the probability distribution, and finally obtaining the predicted character sequence Sq.
2. The method for end-to-end recognition of scene text based on boundary point detection according to claim 1, wherein the scene text end-to-end recognition network model based on boundary point detection in step (1.2.1) is specifically:
the scene text end-to-end recognition network model based on boundary point detection consists of a feature pyramid structure network, a region extraction network, a multidirectional rectangle detection network, a boundary point detection network and an attention-based sequence recognition network; the feature pyramid structure network takes the ResNet-50 deep convolutional neural network as the base network with bottom-up connections, top-down connections and lateral connections added, and is used to extract fused features of different resolutions from the input standard dataset pictures; the extracted features of different scales are input into the region extraction network to obtain candidate text regions, and candidate text regions of fixed scale are obtained after the region-of-interest alignment operation; the candidate text regions of 7 × 7 resolution extracted by the region extraction network are input into a fast region classification-regression network, which predicts through its classification branch the probability that an input candidate text region is a positive sample, providing more accurate candidate text regions, and calculates through its regression branch the offsets of the candidate text region relative to the real text region, adjusting the position of the candidate text region; the multidirectional rectangle detection network comprises 3 fully-connected layers FC1, FC2 and FC3 and outputs a prediction vector of dimension 5, representing respectively the offset of the center of the candidate text region from the center of the minimum circumscribed rectangle, the width and height of the minimum circumscribed rectangle, and the rotation angle of the minimum circumscribed rectangle; the boundary point detection network comprises 4 convolutional layers Conv1, Conv2, Conv3 and Conv4 and one fully-connected layer and outputs a vector of dimension 28, representing respectively the offsets of the 7 boundary points of each of the upper and lower boundaries of a text instance; the attention-based sequence recognition network consists of three convolutional layers and an attention-based model, and the attention model outputs the probability distribution of the predicted character at each step.
3. The method for recognizing the scene text end-to-end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (1.2.2) is specifically as follows:
for the labeled standard training set Itr, the ground-truth label of an input picture contains polygons P = {P1, P2, …, Pm} representing the text regions and character strings S = {s1, s2, …, sm} representing the text content; for an input picture Itri, Pi is the polygonal bounding box of a text region in picture Itri, pij = (xij, yij) is the coordinate of the j-th vertex of polygon Pi, m denotes the number of polygonal text annotation boxes, and si is the text content of polygon Pi;
for a given standard dataset Itr, first the polygons P = {P1, P2, …, Pm} in the dataset labels are converted into the smallest horizontal rectangular bounding boxes of the polygonal text annotation boxes; a bounding box Gd(x, y, h, w) is represented by the center point (x, y) of the rectangle together with its height h and width w; for the region extraction network, according to the annotation bounding boxes Gd(x, y, h, w) of the annotated dataset, each pixel on each of the feature maps output by the feature pyramid is mapped back to the original image, and several initial bounding boxes are generated according to the candidate text regions predicted by the region extraction network; the Jaccard coefficient of each initial bounding box Q0 with respect to the annotation bounding boxes Gd is calculated: when the Jaccard coefficients of all annotation bounding boxes Gd with the initial bounding box Q0 are less than 0.5, the initial bounding box Q0 is marked as negative (non-text) with class label Prpn = 0; otherwise, i.e. there is at least one annotation bounding box Gd whose Jaccard coefficient with Q0 is not less than 0.5, Q0 is marked as positive (text) with class label Prpn = 1, and the position offsets are calculated relative to the annotation box with the maximum Jaccard coefficient, according to the following formulas:
x=x0+w0Δx
y=y0+h0Δy
w=w0exp(Δw)
h=h0exp(Δh)
where x0 and y0 are respectively the abscissa and ordinate of the center point of the initial bounding box Q0, w0 and h0 are respectively the width and height of the initial bounding box Q0, Δx and Δy are respectively the horizontal and vertical offsets of the center point of Q0 relative to the center point of Gd, and exp is the exponential function; the training label of the region extraction network is obtained as:
gtrpn=(Δxrpn,Δyrpn,Δhrpn,Δwrpn,Prpn)
for the multidirectional rectangle detection network, first the polygons P = {P1, P2, …, Pm} in the dataset labels are converted into the smallest multidirectional rectangular bounding boxes of the polygonal text annotation boxes; a multidirectional rectangular bounding box Grotate(x, y, h, w, θ) is represented by the center point (x, y), height h, width w and rotation angle θ of the rectangle; the candidate text region corrected by the region extraction network is Grpn(xrpn, yrpn, wrpn, hrpn), and the predicted position offsets are calculated as follows:
x=xrpn+wrpnΔxor
y=yrpn+hrpnΔyor
w=wrpnexp(Δwor)
h=hrpnexp(Δhor)
the training label of the multidirectional rectangular detection network obtained by the formula is as follows:
gtor=(Δxor,Δyor,Δhor,Δwor,θ)
for the boundary point detection network, the training label calculation process of the boundary point detection network is as follows:
a. setting default boundary points:
according to the detected multidirectional rectangular bounding box Grotate(x, y, h, w, θ), the rectangle is rotated clockwise by θ degrees to obtain a horizontal bounding box Ghorizon(x, y, h, w); K boundary points are sampled at equal intervals on each long side of the horizontal bounding box, yielding the upper and lower default boundary point sequences Pdu = {p1, p2, …, pK} and Pdd = {p1, p2, …, pK}, with Pd = Pdu ∪ Pdd;
b. Generating a target boundary point:
first, the polygon P is divided into two sides along its long sides: P1 = {p1, p2, …, pl} and P2 = {pl+1, …, pm}, where p denotes a vertex of the polygon;
the boundary points of the upper and lower boundaries are generated from P1 and P2: Ptu = {p1, p2, …, pK} and Ptd = {p1, p2, …, pK}, with Pt = Ptu ∪ Ptd;
c. The training label gtbp = {(Δxi, Δyi) | i ∈ [0, 2K)} is calculated according to the following formula:
Δxi = xti − xdi, Δyi = yti − ydi
where (xti, yti) and (xdi, ydi) respectively represent the coordinates of the i-th target boundary point and the i-th default boundary point;
for the attention-based sequence recognition network, each text instance in the input image is annotated with a corresponding character string of length n, si = {(c0, c1, …, cn−1) | ci ∈ {0, 1, …, 9, a, b, …, z, A, B, …, Z}}, to describe the text content; the training label of the recognition network is gtrecog = (onehot(c0), onehot(c1), …, onehot(cn−1)), where onehot(ci) denotes converting character ci into its one-hot encoding;
the final training label is generated as: gt = {gt_rpn, gt_or, gt_bp, gt_recog}.
4. The end-to-end scene text recognition method based on boundary point detection according to claim 1 or 2, wherein step (1.2.3) is specifically as follows:
the images of the standard training dataset I_tr are input into the bottom-up ResNet-50 backbone of the feature pyramid network; convolutional layer units that do not change the feature map size are defined as one stage (stages = {P2, P3, P4, P5, P6}), and the final output convolutional feature F of each stage is extracted. The top-down connections of the feature pyramid network module upsample the output convolutional features of ResNet-50 to generate multi-scale upsampled features, and the lateral connection structure of the module fuses the features of each stage upsampled along the top-down path with the features generated along the bottom-up path, producing the final features {F2, F3, F4, F5, F6}.
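The top-down/lateral fusion can be illustrated on a 1-D toy feature row (a deliberate simplification: a real FPN upsamples 2-D feature maps and passes laterals through 1×1 convolutions before the element-wise sum):

```python
def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a 1-D feature row,
    standing in for the top-down upsampling step."""
    return [v for v in feat for _ in range(2)]

def fpn_merge(top_down, lateral):
    """Element-wise sum of the upsampled top-down feature with the
    lateral (bottom-up) feature, as in an FPN lateral connection."""
    up = upsample2x(top_down)
    return [a + b for a, b in zip(up, lateral)]
```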
5. The end-to-end scene text recognition method based on boundary point detection according to claim 1 or 2, wherein step (1.2.4) is specifically as follows:
for an input picture I_trk, 5 stages of features {F2, F3, F4, F5, F6} are extracted through the feature pyramid network. According to the stages {P2, P3, P4, P5, P6}, the anchor scales at the different stages are defined as {32², 64², 128², 256², 512²}, and each scale layer has 5 aspect ratios {1:5, 1:2, 1:1, 2:1, 5:1}; thus candidate text boxes of 25 different scale/ratio combinations {Ftr_1, Ftr_2, …, Ftr_25} can be extracted, denoted Ftr_p with subscript p = 1, …, 25. In the region extraction network, classification predicts the probability P_rpn that each candidate text box is a correct text region bounding box, and regression predicts the candidate text box offsets Y_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn);
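The 25 anchor shapes (5 scales × 5 aspect ratios) can be enumerated as below; the convention that the ratio means h:w and that area equals scale² is an assumption, since the patent does not spell out the parameterization:

```python
def anchor_shapes(scales=(32, 64, 128, 256, 512),
                  ratios=(1/5, 1/2, 1, 2, 5)):
    """One (w, h) anchor shape per scale/ratio pair: area = scale^2
    and aspect ratio h/w = ratio (assumed convention)."""
    shapes = []
    for s in scales:
        for r in ratios:
            w = s / r ** 0.5
            shapes.append((w, w * r))
    return shapes
```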
Candidate text boxes predicted to be correct text region bounding boxes are selected and input to the subsequent multidirectional rectangle detection network, boundary point detection network and attention-based sequence recognition network. The multidirectional rectangle prediction network predicts the quantity Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ), comprising 4 predicted offsets and one predicted angle; by computing the loss function and back-propagating, the network finally learns to predict the multidirectional bounding box of each text instance.
6. The end-to-end scene text recognition method based on boundary point detection according to claim 1 or 2, wherein step (1.2.5) is specifically as follows:
after the multidirectional rectangle prediction network predicts the multidirectional bounding box of each text instance, a candidate text region with a fixed scale of 7 × 7 is generated through a rotated region-of-interest alignment operation, and the boundary point prediction network outputs 28 predicted regression offsets Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)}; by computing the loss function and back-propagating, the network finally learns to predict the boundary points of the text instance.
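Applying the 14 predicted offset pairs to the 14 default boundary points recovers the boundary point coordinates, the inverse of the label computation in step (1.2.2); a minimal sketch (again assuming unnormalized offsets):

```python
def apply_boundary_offsets(default_pts, deltas):
    """Decode boundary points: add each predicted (Δx_i, Δy_i) to
    its corresponding default boundary point (x_i^d, y_i^d)."""
    return [(xd + dx, yd + dy)
            for (xd, yd), (dx, dy) in zip(default_pts, deltas)]
```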
7. The end-to-end scene text recognition method based on boundary point detection according to claim 1 or 2, wherein step (1.2.6) is specifically as follows:
a sampling grid is generated by a thin-plate spline interpolation algorithm, and text features of arbitrary shape are rectified into horizontal features with a fixed size of 16 × 64. The recognition network consists of 3 convolutional layers and an RNN whose basic unit is the GRU; after the 3 convolutional layers, the resolution of the text features is 2 × 32. At each step, the RNN model outputs a probability distribution of dimension 63 (62 characters plus a stop symbol), where each dimension takes a value in [0, 1] and the values sum to 1. Combining the predicted probability distributions P_recog of all steps with a beam search algorithm yields the predicted character sequence S_q.
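Beam search over the per-step 63-way distributions can be sketched as follows (a generic beam search, not the patent's exact decoder; the beam width and the stop-symbol index 62 are assumptions):

```python
import math

def beam_search(prob_steps, beam_width=3, eos=62):
    """Beam search over a list of per-step probability distributions.
    A hypothesis is finished once it emits the stop symbol `eos`."""
    beams = [([], 0.0, False)]  # (sequence, log-prob, finished)
    for probs in prob_steps:
        nxt = []
        for seq, lp, done in beams:
            if done:                      # finished beams carry over
                nxt.append((seq, lp, True))
                continue
            for c, p in enumerate(probs):
                if p <= 0.0:              # skip impossible symbols
                    continue
                nxt.append((seq if c == eos else seq + [c],
                            lp + math.log(p), c == eos))
        beams = sorted(nxt, key=lambda b: -b[1])[:beam_width]
    return beams[0][0]                    # best class-index sequence
```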
8. The end-to-end scene text recognition method based on boundary point detection according to claim 1 or 2, wherein step (1.2.7) is specifically as follows:
the training label gt calculated in step (1.2.2) is taken as the expected output of the network, and the prediction labels in steps (1.2.4), (1.2.5) and (1.2.6) are taken as the network's predicted output. For the network model constructed in (1.2.1), an objective loss function between the expected output and the predicted output is designed. The overall objective loss function is composed of the losses of the region extraction network, the multidirectional rectangle prediction network, the boundary point prediction network and the sequence recognition network: L(P_rpn, Y_rpn, Y_or, Y_bp, P_recog) = L_rpn(P_rpn, Y_rpn) + α1·L_or(Y_or) + α2·L_bp(Y_bp) + α3·L_recog(P_recog), where L_rpn(P_rpn, Y_rpn) is the loss function of the region extraction network, L_or(Y_or) is the loss function of the multidirectional rectangle detection network, L_bp(Y_bp) is the loss function of the boundary point detection network, L_recog(P_recog) is the loss function of the sequence recognition network, and α1, α2, α3 are respectively the weight coefficients of the loss functions L_or, L_bp and L_recog;
according to the designed overall objective loss function, the model is iteratively trained with the back-propagation algorithm to minimize the overall objective loss and obtain the optimal network model. For the scene text detection and recognition task, the training process first performs iterative training on a synthetic text dataset to obtain initial network parameters, and then trains on the real dataset to fine-tune the network parameters.
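The weighted combination of the four branch losses is straightforward; a sketch (the weight values α1, α2, α3 are hyper-parameters the patent leaves unspecified, so the defaults here are placeholders):

```python
def total_loss(l_rpn, l_or, l_bp, l_recog, a1=1.0, a2=1.0, a3=1.0):
    """Overall objective: L = L_rpn + a1*L_or + a2*L_bp + a3*L_recog."""
    return l_rpn + a1 * l_or + a2 * l_bp + a3 * l_recog
```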
9. The end-to-end scene text recognition method based on boundary point detection according to claim 1 or 2, wherein step (2.1) is specifically as follows:
for the k-th picture I_tstk of the dataset I_tst to be detected, the picture is input into the model trained in step (1.2); after the feature pyramid network and the region extraction network, the region extraction network extracts positive candidate text regions. Since the positive text quadrilaterals regressed on the different feature maps of the same test picture I_tstk usually overlap each other, non-maximum suppression is applied to the positions of all positive text quadrilaterals, specifically: 1) a predicted text bounding box is retained if and only if its text classification score P_rcnn ≥ 0.5; 2) non-maximum suppression with a Jaccard coefficient of 0.2 is applied to the boxes retained in the previous step, yielding the final retained positive text quadrilateral bounding boxes. Fixed-scale features are then extracted from the filtered positive text quadrilateral bounding boxes and input to the multidirectional rectangle prediction network to predict Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ), and the predicted multidirectional text bounding box is computed from the predicted rotated rectangle's center coordinates, height, width and rotation angle. According to the predicted multidirectional text bounding box, the multidirectional text features are rotated into horizontal features and input into the boundary point detection network, which predicts the regression quantities Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)} of the 7 boundary points on each of the upper and lower boundaries. The coordinates of the boundary points in the horizontal frame are computed with the formula in (1.2.2) combined with the 14 preset default boundary points, and the predicted boundary point coordinates are then rotated counterclockwise by θ, using the predicted rotation angle of the rotated rectangle, to obtain the boundary point positions in the original image.
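The score filter (≥ 0.5) and NMS at Jaccard coefficient 0.2 described above can be sketched for axis-aligned boxes as follows (a standard greedy NMS; the patent's boxes are quadrilaterals, so this is a simplified illustration):

```python
def iou(a, b):
    """Jaccard coefficient of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def filter_and_nms(boxes, scores, score_thr=0.5, iou_thr=0.2):
    """Keep boxes with score >= 0.5, then greedily suppress boxes
    overlapping an already-kept box at IoU >= 0.2."""
    keep_idx = [i for i, s in enumerate(scores) if s >= score_thr]
    keep_idx.sort(key=lambda i: -scores[i])
    kept = []
    for i in keep_idx:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in kept):
            kept.append(i)
    return kept
```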
10. The end-to-end scene text recognition method based on boundary point detection according to claim 1 or 2, wherein step (2.2) is specifically as follows:
the rectified text features have a resolution of 16 × 64. The feature map is input into the sequence recognition network to obtain a probability distribution sequence {p_0, p_1, …, p_(N−1)}, where p_i denotes the probability distribution predicted by the RNN at step i and N denotes the maximum RNN step length. During testing, prediction stops when the predicted value at step k is the stop symbol, so the probability distributions of the predicted sequence are {p_0, p_1, …, p_(k−1)}; according to each step's distribution, the class with the maximum probability is taken as the current predicted character, finally yielding the predicted character sequence S_q.
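The test-time decoding described here (argmax per step, stopping at the stop symbol) can be sketched directly; the charset and stop index are parameters of the illustration:

```python
def greedy_decode(prob_steps, charset, eos_index):
    """Pick the maximum-probability class at each RNN step and stop
    as soon as the stop symbol is predicted."""
    out = []
    for probs in prob_steps:
        k = max(range(len(probs)), key=probs.__getitem__)
        if k == eos_index:
            break
        out.append(charset[k])
    return "".join(out)
```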
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911038568.1A CN110837835B (en) | 2019-10-29 | 2019-10-29 | End-to-end scene text identification method based on boundary point detection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110837835A true CN110837835A (en) | 2020-02-25 |
CN110837835B CN110837835B (en) | 2022-11-08 |
Family
ID=69575725
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911038568.1A Active CN110837835B (en) | 2019-10-29 | 2019-10-29 | End-to-end scene text identification method based on boundary point detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110837835B (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107977620A (en) * | 2017-11-29 | 2018-05-01 | 华中科技大学 | A kind of multi-direction scene text single detection method based on full convolutional network |
CN108549893A (en) * | 2018-04-04 | 2018-09-18 | 华中科技大学 | A kind of end-to-end recognition methods of the scene text of arbitrary shape |
Non-Patent Citations (2)
Title |
---|
LUO et al.: "MORAN: A Multi-Object Rectified Attention Network for scene text recognition", Pattern Recognition * |
ZHANG et al.: "Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition * |
Cited By (52)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11928872B2 (en) | 2019-11-21 | 2024-03-12 | Shanghai Goldway Intelligent Transportation System Co., Ltd. | Methods and apparatuses for recognizing text, recognition devices and storage media |
WO2021098861A1 (en) * | 2019-11-21 | 2021-05-27 | 上海高德威智能交通***有限公司 | Text recognition method, apparatus, recognition device, and storage medium |
CN111553361A (en) * | 2020-03-19 | 2020-08-18 | 四川大学华西医院 | Pathological section label identification method |
CN111476235A (en) * | 2020-03-31 | 2020-07-31 | 成都数之联科技有限公司 | Method for synthesizing 3D curved surface text picture |
CN111476235B (en) * | 2020-03-31 | 2023-04-25 | 成都数之联科技股份有限公司 | Method for synthesizing 3D curved text picture |
CN111507333A (en) * | 2020-04-21 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Image correction method and device, electronic equipment and storage medium |
CN111507333B (en) * | 2020-04-21 | 2023-09-15 | 腾讯科技(深圳)有限公司 | Image correction method and device, electronic equipment and storage medium |
CN111553349A (en) * | 2020-04-26 | 2020-08-18 | 佛山市南海区广工大数控装备协同创新研究院 | Scene text positioning and identifying method based on full convolution network |
CN111553349B (en) * | 2020-04-26 | 2023-04-18 | 佛山市南海区广工大数控装备协同创新研究院 | Scene text positioning and identifying method based on full convolution network |
WO2021232464A1 (en) * | 2020-05-20 | 2021-11-25 | 南京理工大学 | Character offset detection method and system |
CN111753714A (en) * | 2020-06-23 | 2020-10-09 | 中南大学 | Multidirectional natural scene text detection method based on character segmentation |
CN111753714B (en) * | 2020-06-23 | 2023-09-01 | 中南大学 | Multidirectional natural scene text detection method based on character segmentation |
CN111767921A (en) * | 2020-06-30 | 2020-10-13 | 上海媒智科技有限公司 | Express bill positioning and correcting method and device |
CN111753812A (en) * | 2020-07-30 | 2020-10-09 | 上海眼控科技股份有限公司 | Text recognition method and equipment |
CN111898570A (en) * | 2020-08-05 | 2020-11-06 | 盐城工学院 | Method for recognizing text in image based on bidirectional feature pyramid network |
CN112070082A (en) * | 2020-08-24 | 2020-12-11 | 西安理工大学 | Curve character positioning method based on instance perception component merging network |
CN112070082B (en) * | 2020-08-24 | 2023-04-07 | 西安理工大学 | Curve character positioning method based on instance perception component merging network |
CN112036405A (en) * | 2020-08-31 | 2020-12-04 | 浪潮云信息技术股份公司 | Detection and identification method for handwritten document text |
CN112101355B (en) * | 2020-09-25 | 2024-04-02 | 北京百度网讯科技有限公司 | Method and device for detecting text in image, electronic equipment and computer medium |
CN112101355A (en) * | 2020-09-25 | 2020-12-18 | 北京百度网讯科技有限公司 | Method and device for detecting text in image, electronic equipment and computer medium |
CN112183322A (en) * | 2020-09-27 | 2021-01-05 | 成都数之联科技有限公司 | Text detection and correction method for any shape |
CN112183322B (en) * | 2020-09-27 | 2022-07-19 | 成都数之联科技股份有限公司 | Text detection and correction method for any shape |
CN112200202A (en) * | 2020-10-29 | 2021-01-08 | 上海商汤智能科技有限公司 | Text detection method and device, electronic equipment and storage medium |
CN112101359A (en) * | 2020-11-11 | 2020-12-18 | 广州华多网络科技有限公司 | Text formula positioning method, model training method and related device |
CN112446372A (en) * | 2020-12-08 | 2021-03-05 | 电子科技大学 | Text detection method based on channel grouping attention mechanism |
CN112308051B (en) * | 2020-12-29 | 2021-10-29 | 北京易真学思教育科技有限公司 | Text box detection method and device, electronic equipment and computer storage medium |
CN112308051A (en) * | 2020-12-29 | 2021-02-02 | 北京易真学思教育科技有限公司 | Text box detection method and device, electronic equipment and computer storage medium |
CN112765955A (en) * | 2021-01-22 | 2021-05-07 | 中国人民公安大学 | Cross-modal instance segmentation method under Chinese reference expression |
CN112765955B (en) * | 2021-01-22 | 2023-05-26 | 中国人民公安大学 | Cross-modal instance segmentation method under Chinese finger representation |
CN112800801A (en) * | 2021-02-03 | 2021-05-14 | 珠海格力电器股份有限公司 | Method and device for recognizing pattern in image, computer equipment and storage medium |
CN112800801B (en) * | 2021-02-03 | 2022-11-11 | 珠海格力电器股份有限公司 | Method and device for recognizing pattern in image, computer equipment and storage medium |
CN112733822A (en) * | 2021-03-31 | 2021-04-30 | 上海旻浦科技有限公司 | End-to-end text detection and identification method |
CN112733822B (en) * | 2021-03-31 | 2021-07-27 | 上海旻浦科技有限公司 | End-to-end text detection and identification method |
CN113298167A (en) * | 2021-06-01 | 2021-08-24 | 北京思特奇信息技术股份有限公司 | Character detection method and system based on lightweight neural network model |
CN113343980B (en) * | 2021-06-10 | 2023-06-09 | 西安邮电大学 | Natural scene text detection method and system |
CN113343980A (en) * | 2021-06-10 | 2021-09-03 | 西安邮电大学 | Natural scene text detection method and system |
CN113298054B (en) * | 2021-07-27 | 2021-10-08 | 国际关系学院 | Text region detection method based on embedded spatial pixel clustering |
CN113298054A (en) * | 2021-07-27 | 2021-08-24 | 国际关系学院 | Text region detection method based on embedded spatial pixel clustering |
CN113591864A (en) * | 2021-07-28 | 2021-11-02 | 北京百度网讯科技有限公司 | Training method, device and system for text recognition model framework |
CN113807336B (en) * | 2021-08-09 | 2023-06-30 | 华南理工大学 | Semi-automatic labeling method, system, computer equipment and medium for image text detection |
CN113807336A (en) * | 2021-08-09 | 2021-12-17 | 华南理工大学 | Semi-automatic labeling method, system, computer equipment and medium for image text detection |
CN113887282A (en) * | 2021-08-30 | 2022-01-04 | 中国科学院信息工程研究所 | Detection system and method for any-shape adjacent text in scene image |
CN114155540B (en) * | 2021-11-16 | 2024-05-03 | 深圳市联洲国际技术有限公司 | Character recognition method, device, equipment and storage medium based on deep learning |
CN114155540A (en) * | 2021-11-16 | 2022-03-08 | 深圳市联洲国际技术有限公司 | Character recognition method, device and equipment based on deep learning and storage medium |
CN114266800B (en) * | 2021-12-24 | 2023-05-05 | 中设数字技术股份有限公司 | Method and system for generating multiple rectangular bounding boxes of plane graph |
CN114266800A (en) * | 2021-12-24 | 2022-04-01 | 中设数字技术股份有限公司 | Multi-rectangular bounding box algorithm and generation system for graphs |
WO2024092484A1 (en) * | 2022-11-01 | 2024-05-10 | Boe Technology Group Co., Ltd. | Computer-implemented object detection method, object detection apparatus, and computer-readable medium |
CN115482538A (en) * | 2022-11-15 | 2022-12-16 | 上海安维尔信息科技股份有限公司 | Material label extraction method and system based on Mask R-CNN |
CN116958981A (en) * | 2023-05-31 | 2023-10-27 | 广东南方网络信息科技有限公司 | Character recognition method and device |
CN116958981B (en) * | 2023-05-31 | 2024-04-30 | 广东南方网络信息科技有限公司 | Character recognition method and device |
CN116884013A (en) * | 2023-07-21 | 2023-10-13 | 江苏方天电力技术有限公司 | Text vectorization method of engineering drawing |
CN117975467A (en) * | 2024-04-02 | 2024-05-03 | 华南理工大学 | Bridge type end-to-end character recognition method |
Also Published As
Publication number | Publication date |
---|---|
CN110837835B (en) | 2022-11-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110837835B (en) | End-to-end scene text identification method based on boundary point detection | |
CN108549893B (en) | End-to-end identification method for scene text with any shape | |
US10762376B2 (en) | Method and apparatus for detecting text | |
WO2020108311A1 (en) | 3d detection method and apparatus for target object, and medium and device | |
CN113785305A (en) | Method, device and equipment for detecting inclined characters | |
CN111488826A (en) | Text recognition method and device, electronic equipment and storage medium | |
Rekha et al. | Hand gesture recognition for sign language: A new hybrid approach | |
Chiang et al. | Recognizing text in raster maps | |
CN112541491B (en) | End-to-end text detection and recognition method based on image character region perception | |
CN112446370B (en) | Method for identifying text information of nameplate of power equipment | |
CN110598690A (en) | End-to-end optical character detection and identification method and system | |
CN113435240B (en) | End-to-end form detection and structure identification method and system | |
Cao et al. | Robust vehicle detection by combining deep features with exemplar classification | |
CN112766184A (en) | Remote sensing target detection method based on multi-level feature selection convolutional neural network | |
CN111476210A (en) | Image-based text recognition method, system, device and storage medium | |
Wang et al. | Spatially prioritized and persistent text detection and decoding | |
Ghadhban et al. | Segments interpolation extractor for finding the best fit line in Arabic offline handwriting recognition words | |
Zhang et al. | A vertical text spotting model for trailer and container codes | |
CN113420648B (en) | Target detection method and system with rotation adaptability | |
Mohammad et al. | Contour-based character segmentation for printed Arabic text with diacritics | |
Turk et al. | Computer vision for mobile augmented reality | |
CN111476226B (en) | Text positioning method and device and model training method | |
Shi et al. | Fuzzy support tensor product adaptive image classification for the internet of things | |
CN115601586A (en) | Label information acquisition method and device, electronic equipment and computer storage medium | |
CN112287763A (en) | Image processing method, apparatus, device and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||