CN111553349A - Scene text positioning and identifying method based on full convolution network - Google Patents

Scene text positioning and identifying method based on full convolution network

Info

Publication number
CN111553349A
CN111553349A (application CN202010340617.3A)
Authority
CN
China
Prior art keywords
text
candidate
network
box
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010340617.3A
Other languages
Chinese (zh)
Other versions
CN111553349B (en)
Inventor
杨海东
黄坤山
巴姗姗
彭文瑜
林玉山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Original Assignee
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute, Foshan Guangdong University CNC Equipment Technology Development Co. Ltd filed Critical Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Priority to CN202010340617.3A priority Critical patent/CN111553349B/en
Publication of CN111553349A publication Critical patent/CN111553349A/en
Application granted granted Critical
Publication of CN111553349B publication Critical patent/CN111553349B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a scene text positioning and identifying method based on a full convolution network. The method comprises five steps, including S1, obtaining a training set containing a plurality of training pictures marked with text positions, and S2, constructing a full convolution neural network Model for text positioning, inputting the training set into this Model for training, and iterating the Model parameters to obtain a converged text positioning network Model1, wherein the full convolution neural network Model comprises a feature extraction network, a feature fusion network and a text candidate box screening layer.

Description

Scene text positioning and identifying method based on full convolution network
Technical Field
The invention relates to the technical field of text positioning and identification, in particular to a scene text positioning and identification method based on a full convolution network.
Background
Text, as the most expressive form of information, records the rich scientific and cultural achievements of humanity and can be embedded in documents or scenes as communication information. Text in scene images can be roughly divided into two categories: artificial text and scene text. With the development of internet technology, text positioning and recognition technologies, such as license plate recognition and identification card recognition, have been widely used in daily life. Traditional OCR technology can only recognize printed text with a plain background and fixed fonts, whereas text in scene images is diverse, with irregular arrangement and non-uniform font sizes; in addition, factors such as illumination intensity or shooting angle cause font blurring, incomplete detection and other problems that strongly interfere with text detection and seriously affect its accuracy, making scene text positioning and recognition an extremely challenging task. Therefore, in order to improve the accuracy of scene text detection, a scene text positioning and identifying method based on a full convolution network is provided.
Disclosure of Invention
Aiming at the problems, the invention provides a scene text positioning and identifying method based on a full convolution network, which mainly solves the problems in the background technology.
The invention provides a scene text positioning and identifying method of a full convolution network, which comprises the following steps:
s1, acquiring a training set containing a plurality of training pictures marked with text positions;
s2, constructing a full convolution neural network Model based on text positioning, wherein the full convolution neural network Model comprises a feature extraction network, a feature fusion network and a text candidate box screening layer, inputting the training set into the full convolution neural network Model based on the text positioning for training, and iterating Model parameters to obtain a converged text positioning network Model 1;
s3, constructing a text recognition network Model, wherein the text recognition network comprises a convolutional neural network layer, an attention mechanism layer, a cyclic neural network layer and a translation layer, inputting the training set into the text recognition network Model for training, and iterating Model parameters to obtain a converged text recognition network Model 2;
s4, inputting the scene image to be subjected to text positioning and recognition into the text positioning network Model1 to obtain the text existence confidence and the text region position, and outputting the best text candidate box after screening;
s5, inputting the image containing the best text candidate box into the text recognition network Model2 to obtain a text recognition result.
In a further improvement, the feature extraction network consists of convolution layers and pooling layers and is used for extracting convolution feature maps of the input image; the feature fusion network uses multi-feature prediction layers to convolve feature layers from different stages and predict the text confidence and the text region position; the text candidate box screening layer post-processes the candidate boxes of the different text regions to obtain the position of the best text candidate box.
In a further improvement, the construction process of the full convolution neural network model based on text positioning in step S2 is as follows:
s21, extracting multi-scale features through a feature extraction network;
s22, performing multi-scale feature fusion through a feature fusion network;
and S23, screening by the text candidate box screening layer and outputting the image containing the best text candidate boxes.
In a further improvement, in step S23 each text candidate box has a confidence score; the text candidate boxes are processed to remove non-best candidate boxes and finally the image containing the best text candidate boxes is filtered out, which specifically includes:
S231, sorting all the text candidate boxes by confidence score from high to low, taking the box with the highest score as the current best candidate box a, and taking each of the remaining candidate boxes in turn as a candidate best text box b;
S232, calculating the overlap between the candidate best text box b and the current best text box a, the overlap being the ratio of the overlapping area of the two candidate boxes to the area of their union, namely:
IoU(a, b) = area(a ∩ b) / area(a ∪ b)
S233, if the IoU of b and a is greater than the threshold, b and a overlap strongly and should belong to the same text region, but the confidence score of b is lower than that of a, so the candidate best text box b is suppressed, that is, removed from the remaining candidate boxes;
and S234, repeating the above three steps to screen the candidate best text boxes b one by one; when all the remaining candidate boxes have been screened, only candidate boxes whose overlap with text candidate box a is smaller than the threshold remain, i.e. the remaining candidate boxes all belong to other text regions.
In a further improvement, the construction process of the text recognition network model in the step S3 is as follows:
s31, inputting the image output by the text positioning network Model1 into the convolutional neural network layer, and extracting a feature vector sequence of the image;
s32, calculating the association degree between all the feature vectors through an attention mechanism, converting the association degrees into probability weights, and then weighting the input sequence by these probabilities to obtain a new feature vector sequence;
s33, taking the new feature vector sequence as the input of a recurrent neural network layer, and predicting the label distribution of each frame sequence;
and S34, finally, translating the prediction of each frame sequence into a label sequence with the highest probability through a translation layer.
In a further improvement, the process of predicting the text region position in step S4 is:
s41, presetting default boxes at each position of the feature maps input to the multi-feature prediction layers, and regressing a series of multi-angle text boxes, wherein the multi-angle text boxes take two forms: quadrilaterals represented by four points, and rotated rectangles represented by the top-left point, the top-right point and the height;
s42, expressing the text confidence and the coordinate offsets of the text region candidate boxes output by the feature fusion network relative to each associated default box, in either the quadrilateral or the rotated rectangle form;
s43, regressing a real bounding box in the form of a quadrilateral or rotated rectangle from the text candidate box and the horizontal rectangle circumscribing the candidate box, with the regression distance computed as:
d_i = |b_1 - q_i| + |b_2 - q_(i+1)| + |b_3 - q_(i+2)| + |b_4 - q_(i+3)|,  i = 1
d_i = |b_1 - q_i| + |b_2 - q_((i+1)%4)| + |b_3 - q_((i+2)%4)| + |b_4 - q_((i+3)%4)|,  i = 2, 3, 4
wherein b_i and q_i (i = 1, 2, 3, 4) are respectively the four vertices of the circumscribed horizontal rectangle and of the regressed quadrilateral or rotated rectangle, and % denotes the remainder (modulo) operation;
and S44, obtaining the best text area candidate box through the screening of the text candidate box.
Compared with the prior art, the invention has the beneficial effects that:
1. the invention adopts multi-scale feature detection, uses a low-level feature map to locate a smaller text region, uses a high-level feature map to locate a larger text region, and then inputs a multi-scale feature image into a feature fusion network to obtain the confidence coefficient of the text in a scene image and the position of the text region. Compared with the method that only the last characteristic layer is used for detection, the multi-characteristic prediction layer detects the scene text image by using the plurality of characteristic layers, so that the accuracy is improved, and the robustness of text regions with different scales is enhanced.
2. The invention selects an irregular 3x5 convolution filter in the multi-feature prediction layers of the feature fusion network. The goal is to obtain a better receptive field, because text regions in scene images are generally long objects.
3. In the text recognition process, an attention mechanism is added before the recurrent neural network. The feature sequence extracted during recognition is in fact a mapping from the input image to sequence positions, but when several consecutive feature vectors of the sequence correspond to background noise between texts in the input image, recognition errors easily occur. In addition, when the text in the input image is long and the spacing between characters is large, interference from background noise of non-text regions easily occurs, so the attention mechanism is adopted to solve this problem.
4. The invention replaces the rectangular boxes of conventional object detectors with quadrilaterals and rotated rectangles in text positioning, so it can handle text in any orientation in the scene image, and images of any size can be input.
Drawings
The drawings are for illustrative purposes only and are not to be construed as limiting the patent; for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
FIG. 1 is a schematic overall flow chart of an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of the text positioning network Model1 according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a text recognition network Model2 according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a regression graph of the minimum distance of a real bounding box according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating regression results from a matched default box to a true bounding box according to an embodiment of the present invention.
Detailed Description
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted" and "connected" are to be interpreted broadly, e.g., as fixedly connected, detachably connected, or integrally connected; as mechanically or electrically connected; and as directly connected or indirectly connected through an intermediate medium, or as internal communication between two elements. The specific meaning of the above terms in the present invention can be understood by those skilled in the art according to the specific case. The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
The invention provides a scene text positioning and identifying method of a full convolution network, which overcomes the defects of the traditional text detection and identifying method and carries out real-time positioning on texts in a scene image by utilizing deep learning target detection. Firstly, acquiring a training set containing a plurality of training pictures marked with text positions; then inputting the training set into a text positioning network for training; inputting the training set into a text recognition network for training; then, inputting the scene image to be subjected to text positioning and recognition into a text positioning network model to obtain a text existence confidence coefficient and a text region position; then inputting the text area position image into a text recognition network; and finally, obtaining a text recognition result. Two trained network models are obtained through training, and a scene text positioning and identifying method with high precision and high efficiency can be realized by combining the prediction of the rear end of the model. The method specifically comprises the following steps:
s1, acquiring a training set containing a plurality of training pictures marked with text positions;
s2, constructing a full convolution neural network Model based on text positioning, wherein the full convolution neural network Model comprises a feature extraction network, a feature fusion network and a text candidate box screening layer, inputting the training set into the full convolution neural network Model based on the text positioning for training, and iterating Model parameters to obtain a converged text positioning network Model 1;
s3, constructing a text recognition network Model, wherein the text recognition network comprises a convolutional neural network layer, an attention mechanism layer, a cyclic neural network layer and a translation layer, inputting the training set into the text recognition network Model for training, and iterating Model parameters to obtain a converged text recognition network Model 2;
s4, inputting the scene image to be subjected to text positioning and recognition into the text positioning network Model1 to obtain the text existence confidence and the text region position, and outputting the best text candidate box after screening;
s5, inputting the image containing the best text candidate box into the text recognition network Model2 to obtain a text recognition result.
In a further improvement, the feature extraction network consists of convolution layers and pooling layers and is used for extracting convolution feature maps of the input image; the feature fusion network uses multi-feature prediction layers to convolve feature layers from different stages and predict the text confidence and the text region position; the text candidate box screening layer post-processes the candidate boxes of the different text regions to obtain the position of the best text candidate box.
It can be appreciated that, compared with a conventional neural network that performs multiple nonlinear mappings on the input image through a series of stacked convolutional layers, the multi-feature fusion network adopted in the invention overcomes two shortcomings of the conventional approach: (1) as the network depth increases, the features become more and more abstract and detail information is lost; (2) different feature layers correspond to different receptive fields, the deeper feature layers having larger receptive fields, so a single layer cannot fit text regions of all scales. Multi-scale feature detection is therefore adopted: a low-level feature map is used to locate smaller text regions and a high-level feature map is used to locate larger text regions, and the multi-scale feature maps are then input into the feature fusion network to obtain the text confidence and the text region positions in the scene image. Compared with detection using only the last feature layer, the multi-feature prediction layers detect the scene text image with several feature layers, which improves accuracy and enhances robustness to text regions of different scales.
It is understood that in the embodiment of the present invention, the text region in the scene image is generally a long object, and in order to obtain a better receptive field, an irregular 3x5 convolution filter is selected for the multi-feature prediction layer. Compared with the method that only the last characteristic layer is used for detection, the multi-characteristic prediction layer detects the scene text image by using the plurality of characteristic layers, so that the accuracy is improved, and the robustness of text regions with different scales is enhanced.
In a further improvement, the construction process of the full convolution neural network model based on text positioning in step S2 is as follows:
s21, extracting multi-scale features through the feature extraction network, wherein the feature extraction network is a 23-layer neural network, the feature fusion network comprises 6 feature prediction layers, the multi-feature prediction layers are convolution layers, the 6 feature prediction layers are respectively connected to the feature layers of 6 stages, namely the 13th, 15th, 17th, 19th, 21st and 23rd layers, an irregular 3x5 convolution filter is selected for the multi-feature prediction layers, and convolution kernels of different depths are used to convolve each feature layer (a sketch of these prediction layers is given after step S23);
s22, performing multi-scale feature fusion through the feature fusion network, outputting a group of vectors of fixed size to predict the text confidence and the text region position, outputting a 14-dimensional prediction vector for each quadrilateral candidate box determined to be a text region (2 dimensions for text presence, 8 dimensions for the coordinate offsets of the quadrilateral candidate box, and 4 dimensions for the coordinate offsets of its minimum circumscribed horizontal rectangle), and outputting a 7-dimensional prediction vector for each rotated rectangular candidate box determined to be a text region (2 dimensions for text presence and 5 dimensions for the coordinate offsets of the rotated rectangular candidate box);
and S23, screening by the text candidate box screening layer and outputting the image containing the best text candidate boxes.
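The patent does not prescribe a deep learning framework or concrete channel sizes; as an illustration only, the following PyTorch-style sketch (channel counts, the number of default boxes per position and the head layout are assumptions) shows how prediction layers with irregular 3x5 convolutions can be attached to six stage feature maps and emit the 14-dimensional per-default-box prediction vectors of step S22:

```python
import torch
import torch.nn as nn

class MultiFeaturePredictionHead(nn.Module):
    """Applies an irregular 3x5 convolution to each of six stage feature maps and
    predicts, per default box: 2 text/non-text scores, 8 quadrilateral offsets and
    4 circumscribed-rectangle offsets (14 values). Channel counts and the number of
    default boxes per position are illustrative assumptions, not taken from the patent."""

    def __init__(self, in_channels=(256, 256, 512, 512, 512, 512), boxes_per_position=6):
        super().__init__()
        out_channels = boxes_per_position * 14          # 2 + 8 + 4 values per default box
        self.pred_layers = nn.ModuleList([
            # kernel_size=(3, 5) with padding=(1, 2) keeps the spatial size and widens
            # the horizontal receptive field, which suits long text regions.
            nn.Conv2d(c, out_channels, kernel_size=(3, 5), padding=(1, 2))
            for c in in_channels
        ])

    def forward(self, feature_maps):
        # feature_maps: six tensors taken from different backbone stages
        # (e.g. layers 13, 15, 17, 19, 21 and 23 of the 23-layer network).
        predictions = []
        for fmap, layer in zip(feature_maps, self.pred_layers):
            p = layer(fmap)                              # (N, K*14, H, W)
            n, ck, h, w = p.shape
            k = ck // 14
            p = p.permute(0, 2, 3, 1).reshape(n, h * w * k, 14)
            predictions.append(p)
        return torch.cat(predictions, dim=1)             # one 14-dim vector per default box
```

The 7-dimensional rotated-rectangle branch of step S22 could be added analogously with an output of boxes_per_position * 7 channels.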
In a further improvement, in step S23 each text candidate box has a confidence score; the text candidate boxes are processed to remove non-best candidate boxes and finally the image containing the best text candidate boxes is filtered out (a sketch of this screening procedure is given after step S234), which specifically includes:
S231, sorting all the text candidate boxes by confidence score from high to low, taking the box with the highest score as the current best candidate box a, and taking each of the remaining candidate boxes in turn as a candidate best text box b;
S232, calculating the overlap between the candidate best text box b and the current best text box a, the overlap being the ratio of the overlapping area of the two candidate boxes to the area of their union, namely:
IoU(a, b) = area(a ∩ b) / area(a ∪ b)
S233, if the IoU of b and a is greater than the threshold (e.g. IoU > 0.6), b and a overlap strongly and should belong to the same text region, but the confidence score of b is lower than that of a, so the candidate best text box b is suppressed, that is, removed from the remaining candidate boxes;
and S234, repeating the above three steps to screen the candidate best text boxes b one by one; when all the remaining candidate boxes have been screened, only candidate boxes whose overlap with text candidate box a is smaller than the threshold remain, i.e. the remaining candidate boxes all belong to other text regions.
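The screening of steps S231-S234 is a non-maximum suppression over the candidate boxes. The following NumPy sketch illustrates it, using axis-aligned boxes for brevity and the 0.6 threshold mentioned above; the quadrilateral case would only change the overlap computation:

```python
import numpy as np

def iou(box_a, box_b):
    """Overlap of two axis-aligned boxes (x1, y1, x2, y2): intersection area / union area."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def screen_text_boxes(boxes, scores, iou_threshold=0.6):
    """S231-S234: keep one best box per text region, suppressing overlapping candidates."""
    order = np.argsort(scores)[::-1]            # S231: sort by confidence, high to low
    keep = []
    while order.size > 0:
        a = order[0]                            # current best candidate box a
        keep.append(int(a))
        remaining = []
        for b in order[1:]:                     # take each remaining candidate box b in turn
            if iou(boxes[a], boxes[b]) <= iou_threshold:   # S232/S233: suppress high overlap
                remaining.append(b)             # low overlap: likely another text region
        order = np.asarray(remaining, dtype=int)           # S234: repeat on what is left
    return keep                                 # indices of the best text candidate boxes
```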
In a further improvement, the construction process of the text recognition network model in the step S3 is as follows:
s31, inputting the image output by the text positioning network Model1 into the convolutional neural network layer, and extracting a feature vector sequence of the image;
s32, calculating the association degree between all the feature vectors through an attention mechanism, converting the association degrees into probability weights, and then weighting the input sequence by these probabilities to obtain a new feature vector sequence;
s33, taking the new feature vector sequence as the input of a recurrent neural network layer, and predicting the label distribution of each frame sequence;
and S34, finally, translating the prediction of each frame sequence into a label sequence with the highest probability through a translation layer.
It is understood that, in the embodiment of the present invention, the basic idea of the attention mechanism in step S32 is to give more attention to the necessary information so as to extract it from the input sequence; conceptually, important information is selected from a large amount of information and unimportant information is ignored. The main process is as follows: for input features x_i, i = 0, 1, ..., n, the encoder f produces the feature sequence h_i = f(x_i); the attention mechanism computes a corresponding weight α_ji for each h_i, and the weighted sum
c_j = Σ_i α_ji · h_i
is taken as the input to the decoder g, which finally decodes the output y_j = g(c_j).
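As an illustration of the weighted sum above and of the recognition pipeline of steps S31-S34, the following PyTorch-style sketch computes the weights α_ji by softmax over pairwise association scores and feeds the weighted feature sequence to a bidirectional recurrent layer; the framework, layer sizes and the dot-product scoring used to obtain the association degrees are assumptions of this sketch, not specified in the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    """Scores the association between feature vectors, turns the scores into probability
    weights alpha_ji with softmax, and returns the weighted sequence c_j = sum_i alpha_ji * h_i."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, h):                                  # h: (T, N, dim) feature sequence
        q, k = self.query(h), self.key(h)
        scores = torch.einsum('jnd,ind->nji', q, k)        # association of every pair (j, i)
        alpha = F.softmax(scores, dim=-1)                  # probability weights alpha_ji
        return torch.einsum('nji,ind->jnd', alpha, h)      # weighted sum over i

class TextRecognizer(nn.Module):
    """S31-S34: CNN feature sequence -> attention -> bidirectional RNN -> per-frame
    label distribution; a translation layer then decodes the most probable labels."""
    def __init__(self, num_classes, dim=256):
        super().__init__()
        self.cnn = nn.Sequential(                          # illustrative small backbone
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),               # collapse height into a sequence
        )
        self.attention = AttentionLayer(dim)
        self.rnn = nn.LSTM(dim, dim // 2, bidirectional=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, images):                             # images: (N, 1, H, W)
        f = self.cnn(images).squeeze(2)                    # (N, dim, T)
        h = f.permute(2, 0, 1)                             # (T, N, dim) feature vector sequence
        c = self.attention(h)                              # S32: new, attention-weighted sequence
        out, _ = self.rnn(c)                               # S33: per-frame label distribution
        return self.classifier(out)                        # logits for the translation layer (S34)
```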
In a further improvement, the process of predicting the text region position in step S4 is:
s41, presetting default boxes at each position of the feature maps input to the multi-feature prediction layers, and regressing a series of multi-angle text boxes, wherein the multi-angle text boxes take two forms: quadrilaterals represented by four points, and rotated rectangles represented by the top-left point, the top-right point and the height;
s42, expressing the text confidence and the coordinate offsets of the text region candidate boxes output by the feature fusion network relative to each associated default box, in either the quadrilateral or the rotated rectangle form;
s43, regressing a real bounding box in the form of a quadrilateral or rotated rectangle from the text candidate box and the horizontal rectangle circumscribing the candidate box, with the regression distance computed as:
d_i = |b_1 - q_i| + |b_2 - q_(i+1)| + |b_3 - q_(i+2)| + |b_4 - q_(i+3)|,  i = 1
d_i = |b_1 - q_i| + |b_2 - q_((i+1)%4)| + |b_3 - q_((i+2)%4)| + |b_4 - q_((i+3)%4)|,  i = 2, 3, 4
wherein b_i and q_i (i = 1, 2, 3, 4) are respectively the four vertices of the circumscribed horizontal rectangle and of the regressed quadrilateral or rotated rectangle, and % denotes the remainder (modulo) operation;
and S44, obtaining the best text area candidate box through the screening of the text candidate box.
It is understood that, in the embodiment of the present invention, a more specific process of predicting the position of the text region is:
(1) Each position of the feature maps in the multi-feature prediction layers has a series of preset default boxes, from which a series of multi-angle text boxes is regressed, denoted by the set {q} or {r}; the minimum circumscribed horizontal rectangle corresponding to each multi-angle text box is also output and denoted by the set {b}. A default box is written as b0 = (x0, y0, w0, h0), where (x0, y0) is the centre point of the default box and w0, h0 are its width and height, so the regressed multi-angle text box has two representation forms:
(a) a quadrilateral represented by its four points, q0 = (x_q01, y_q01, x_q02, y_q02, x_q03, y_q03, x_q04, y_q04);
(b) a rotated rectangle represented by its top-left point, top-right point and height, r0 = (x_r01, y_r01, x_r02, y_r02, h_r0).
The parameter relations between the multi-angle text box and the default box are:
x_q01 = x0 - w0/2, y_q01 = y0 - h0/2, x_q02 = x0 + w0/2, y_q02 = y0 - h0/2
x_q03 = x0 + w0/2, y_q03 = y0 + h0/2, x_q04 = x0 - w0/2, y_q04 = y0 + h0/2
x_r01 = x0 - w0/2, y_r01 = y0 - h0/2, x_r02 = x0 + w0/2, y_r02 = y0 - h0/2, h_r0 = h0
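The parameter relations above can be written as two small helper functions (plain Python; the function names are illustrative):

```python
def default_box_to_quad(x0, y0, w0, h0):
    """Quadrilateral form q0: four corner points listed clockwise from the top-left."""
    return (x0 - w0 / 2, y0 - h0 / 2,   # (x_q01, y_q01) top-left
            x0 + w0 / 2, y0 - h0 / 2,   # (x_q02, y_q02) top-right
            x0 + w0 / 2, y0 + h0 / 2,   # (x_q03, y_q03) bottom-right
            x0 - w0 / 2, y0 + h0 / 2)   # (x_q04, y_q04) bottom-left

def default_box_to_rot_rect(x0, y0, w0, h0):
    """Rotated-rectangle form r0: top-left point, top-right point and height."""
    return (x0 - w0 / 2, y0 - h0 / 2,   # (x_r01, y_r01)
            x0 + w0 / 2, y0 - h0 / 2,   # (x_r02, y_r02)
            h0)                         # h_r0
```

For example, a default box centred at (10, 20) with width 8 and height 4 yields q0 = (6, 18, 14, 18, 14, 22, 6, 22) and r0 = (6, 18, 14, 18, 4).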
(2) The text confidence and the text region position coordinate offsets output by the feature fusion network are expressed relative to each associated default box, for either the quadrilateral or the rotated rectangle form.
For a quadrilateral, the predicted values of the multi-feature prediction layer are:
(Δx, Δy, Δw, Δh, Δx1, Δy1, Δx2, Δy2, Δx3, Δy3, Δx4, Δy4, c)
and the horizontal rectangle and quadrilateral output under confidence c are computed as:
x = x0 + w0·Δx, y = y0 + h0·Δy, w = w0·e^Δw, h = h0·e^Δh
x_qn = x_q0n + w0·Δx_qn, y_qn = y_q0n + h0·Δy_qn, n = 1, 2, 3, 4
For the rotated rectangle, the predicted values of the multi-feature prediction layer are (Δx, Δy, Δw, Δh, Δx1, Δy1, Δx2, Δy2, Δh_r, c), and the rotated rectangle r = (x_r1, y_r1, x_r2, y_r2, h_r) output under confidence c is computed analogously as:
x_rn = x_r0n + w0·Δx_n, y_rn = y_r0n + h0·Δy_n, n = 1, 2, h_r = h_r0·e^(Δh_r)
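A minimal NumPy sketch of decoding a quadrilateral prediction under the formulas above follows; the decoding of the rotated rectangle would be analogous, and the function name and the packing order of the offsets are assumptions of this sketch:

```python
import numpy as np

def decode_quad(default_box, offsets):
    """default_box: (x0, y0, w0, h0); offsets: (dx, dy, dw, dh, dx1, dy1, ..., dx4, dy4).
    Returns the circumscribed horizontal rectangle (x, y, w, h) and the four
    quadrilateral vertices decoded from the predicted offsets."""
    x0, y0, w0, h0 = default_box
    dx, dy, dw, dh = offsets[:4]
    rect = (x0 + w0 * dx, y0 + h0 * dy, w0 * np.exp(dw), h0 * np.exp(dh))
    # vertices of the default box, clockwise from the top-left corner
    quad0 = [(x0 - w0 / 2, y0 - h0 / 2), (x0 + w0 / 2, y0 - h0 / 2),
             (x0 + w0 / 2, y0 + h0 / 2), (x0 - w0 / 2, y0 + h0 / 2)]
    quad = []
    for n, (qx, qy) in enumerate(quad0):
        dxn, dyn = offsets[4 + 2 * n], offsets[5 + 2 * n]
        quad.append((qx + w0 * dxn, qy + h0 * dyn))   # x_qn = x_q0n + w0*Δx_qn, etc.
    return rect, quad
```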
(3) A real bounding box in the form of a quadrilateral or rotated rectangle is regressed from the candidate box and the horizontal rectangle circumscribing the candidate box.
For a quadrilateral, the ground-truth horizontal box is the circumscribed horizontal rectangle of the candidate box, with its four points arranged clockwise and the first point at the top-left corner; the ground-truth text box is the candidate box itself, with its four vertices also arranged clockwise. Starting from the top-left corner of the horizontal box, the sum of the four point distances between the vertices of the text box and those of the horizontal box is computed clockwise (four sums in total, one for each starting vertex); the vertex corresponding to the top-left corner of the horizontal box is taken as the first point of the text box, and the second, third and fourth points are determined in the same way. The regression distance is computed as:
d_i = |b_1 - q_i| + |b_2 - q_(i+1)| + |b_3 - q_(i+2)| + |b_4 - q_(i+3)|,  i = 1
d_i = |b_1 - q_i| + |b_2 - q_((i+1)%4)| + |b_3 - q_((i+2)%4)| + |b_4 - q_((i+3)%4)|,  i = 2, 3, 4
wherein b_i and q_i (i = 1, 2, 3, 4) are respectively the four vertices of the circumscribed horizontal rectangle and of the regressed quadrilateral or rotated rectangle, and % denotes the remainder (modulo) operation; FIG. 4 shows the minimum-distance regression. The regression result from a matched default box to the real bounding box is shown in FIG. 5, in which the dotted box represents the default box matched to the real bounding box, one solid box represents the minimum circumscribed horizontal rectangle corresponding to the real bounding box, the other solid box represents the real bounding box, and the arrow indicates the regression direction.
For the rotated rectangle, the top-left and top-right points are determined in the same way as for the quadrilateral, and the height is the height of the rotated rectangle.
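The vertex ordering described in this step can be sketched as follows (NumPy; using the Euclidean point distance for |.| is an assumption of this sketch):

```python
import numpy as np

def order_quad_vertices(rect_pts, quad_pts):
    """rect_pts, quad_pts: (4, 2) arrays of vertices, both listed clockwise.
    Computes d_i for the four cyclic orderings of the quadrilateral's vertices and
    returns the ordering with minimum distance, so that the first returned vertex
    corresponds to the top-left corner of the circumscribed horizontal rectangle."""
    rect_pts = np.asarray(rect_pts, dtype=float)
    quad_pts = np.asarray(quad_pts, dtype=float)
    best_shift, best_d = 0, np.inf
    for i in range(4):
        shifted = np.roll(quad_pts, -i, axis=0)                 # start the ordering at vertex q_(i+1)
        d = np.linalg.norm(rect_pts - shifted, axis=1).sum()    # d_i = sum of vertex distances
        if d < best_d:
            best_shift, best_d = i, d
    return np.roll(quad_pts, -best_shift, axis=0)
```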
(4) And screening the text candidate boxes to obtain the best text region candidate box.
In the drawings, the positional relationships are described for illustrative purposes only and are not to be construed as limiting the present patent; it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.

Claims (6)

1. A scene text positioning and identifying method based on a full convolution network is characterized by comprising the following steps:
s1, acquiring a training set containing a plurality of training pictures marked with text positions;
s2, constructing a full convolution neural network Model based on text positioning, wherein the full convolution neural network Model comprises a feature extraction network, a feature fusion network and a text candidate box screening layer, inputting the training set into the full convolution neural network Model based on the text positioning for training, and iterating Model parameters to obtain a converged text positioning network Model 1;
s3, constructing a text recognition network Model, wherein the text recognition network comprises a convolutional neural network layer, an attention mechanism layer, a cyclic neural network layer and a translation layer, inputting the training set into the text recognition network Model for training, and iterating Model parameters to obtain a converged text recognition network Model 2;
s4, inputting the scene image to be subjected to text positioning and recognition into the text positioning network Model1 to obtain the text existence confidence and the text region position, and outputting the best text candidate box after screening;
s5, inputting the image containing the best text candidate box into the text recognition network Model2 to obtain a text recognition result.
2. The method for locating and identifying scene text based on the full convolution network as claimed in claim 1, wherein the feature extraction network consists of convolution layers and pooling layers and is used for extracting convolution feature maps of the input image; the feature fusion network uses multi-feature prediction layers to convolve feature layers from different stages and predict the text confidence and the text region position; and the text candidate box screening layer post-processes the candidate boxes of the different text regions to obtain the position of the best text candidate box.
3. The method for locating and identifying scene text based on the full convolution network as claimed in claim 1 or 2, wherein the construction process of the full convolution neural network model based on text positioning in step S2 is as follows:
s21, extracting multi-scale features through a feature extraction network;
s22, performing multi-scale feature fusion through a feature fusion network;
and S23, screening by the text candidate box screening layer and outputting the image containing the best text candidate boxes.
4. The method as claimed in claim 3, wherein in step S23 each text candidate box has a confidence score; the text candidate boxes are processed to remove non-best candidate boxes and finally the image containing the best text candidate boxes is filtered out, which specifically includes:
S231, sorting all the text candidate boxes by confidence score from high to low, taking the box with the highest score as the current best candidate box a, and taking each of the remaining candidate boxes in turn as a candidate best text box b;
S232, calculating the overlap between the candidate best text box b and the current best text box a, the overlap being the ratio of the overlapping area of the two candidate boxes to the area of their union, namely:
IoU(a, b) = area(a ∩ b) / area(a ∪ b)
S233, if the IoU of b and a is greater than the threshold, b and a overlap strongly and should belong to the same text region, but the confidence score of b is lower than that of a, so the candidate best text box b is suppressed, that is, removed from the remaining candidate boxes;
and S234, repeating the above three steps to screen the candidate best text boxes b one by one; when all the remaining candidate boxes have been screened, only candidate boxes whose overlap with text candidate box a is smaller than the threshold remain, i.e. the remaining candidate boxes all belong to other text regions.
5. The method for locating and identifying scene texts based on full convolutional network of claim 1, wherein the construction process of the text recognition network model in the step S3 is as follows:
s31, inputting the image output by the text positioning network Model1 into the convolutional neural network layer, and extracting a feature vector sequence of the image;
s32, calculating the association degree between all the feature vectors through an attention mechanism, converting the association degrees into probability weights, and then weighting the input sequence by these probabilities to obtain a new feature vector sequence;
s33, taking the new feature vector sequence as the input of a recurrent neural network layer, and predicting the label distribution of each frame sequence;
and S34, finally, translating the prediction of each frame sequence into a label sequence with the highest probability through a translation layer.
6. The method for locating and identifying scene text based on full convolutional network as claimed in claim 2, wherein the process of predicting the text region position in step S4 is as follows:
s41, presetting default boxes at each position of the feature maps input to the multi-feature prediction layers, and regressing a series of multi-angle text boxes, wherein the multi-angle text boxes take two forms: quadrilaterals represented by four points, and rotated rectangles represented by the top-left point, the top-right point and the height;
s42, expressing the text confidence and the coordinate offsets of the text region candidate boxes output by the feature fusion network relative to each associated default box, in either the quadrilateral or the rotated rectangle form;
s43, regressing a real bounding box in the form of a quadrilateral or rotated rectangle from the text candidate box and the horizontal rectangle circumscribing the candidate box, with the regression distance computed as:
d_i = |b_1 - q_i| + |b_2 - q_(i+1)| + |b_3 - q_(i+2)| + |b_4 - q_(i+3)|,  i = 1
d_i = |b_1 - q_i| + |b_2 - q_((i+1)%4)| + |b_3 - q_((i+2)%4)| + |b_4 - q_((i+3)%4)|,  i = 2, 3, 4
wherein b_i and q_i (i = 1, 2, 3, 4) are respectively the four vertices of the circumscribed horizontal rectangle and of the regressed quadrilateral or rotated rectangle, and % denotes the remainder (modulo) operation;
and S44, obtaining the best text area candidate box through the screening of the text candidate box.
CN202010340617.3A 2020-04-26 2020-04-26 Scene text positioning and identifying method based on full convolution network Active CN111553349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010340617.3A CN111553349B (en) 2020-04-26 2020-04-26 Scene text positioning and identifying method based on full convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010340617.3A CN111553349B (en) 2020-04-26 2020-04-26 Scene text positioning and identifying method based on full convolution network

Publications (2)

Publication Number Publication Date
CN111553349A true CN111553349A (en) 2020-08-18
CN111553349B CN111553349B (en) 2023-04-18

Family

ID=72003025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010340617.3A Active CN111553349B (en) 2020-04-26 2020-04-26 Scene text positioning and identifying method based on full convolution network

Country Status (1)

Country Link
CN (1) CN111553349B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560857A (en) * 2021-02-20 2021-03-26 鹏城实验室 Character area boundary detection method, equipment, storage medium and device
CN112990201A (en) * 2021-05-06 2021-06-18 北京世纪好未来教育科技有限公司 Text box detection method and device, electronic equipment and computer storage medium
CN113221885A (en) * 2021-05-13 2021-08-06 中国科学技术大学 Hierarchical modeling method and system based on whole words and radicals
CN113221884A (en) * 2021-05-13 2021-08-06 中国科学技术大学 Text recognition method and system based on low-frequency word storage memory
CN113537195A (en) * 2021-07-21 2021-10-22 北京数美时代科技有限公司 Image text recognition method and system and electronic equipment
CN116958981A (en) * 2023-05-31 2023-10-27 广东南方网络信息科技有限公司 Character recognition method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks
CN110569843A (en) * 2019-09-09 2019-12-13 中国矿业大学(北京) Intelligent detection and identification method for mine target
CN110837835A (en) * 2019-10-29 2020-02-25 华中科技大学 End-to-end scene text identification method based on boundary point detection

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks
CN110569843A (en) * 2019-09-09 2019-12-13 中国矿业大学(北京) Intelligent detection and identification method for mine target
CN110837835A (en) * 2019-10-29 2020-02-25 华中科技大学 End-to-end scene text identification method based on boundary point detection

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560857A (en) * 2021-02-20 2021-03-26 鹏城实验室 Character area boundary detection method, equipment, storage medium and device
CN112990201A (en) * 2021-05-06 2021-06-18 北京世纪好未来教育科技有限公司 Text box detection method and device, electronic equipment and computer storage medium
CN113221885A (en) * 2021-05-13 2021-08-06 中国科学技术大学 Hierarchical modeling method and system based on whole words and radicals
CN113221884A (en) * 2021-05-13 2021-08-06 中国科学技术大学 Text recognition method and system based on low-frequency word storage memory
CN113221884B (en) * 2021-05-13 2022-09-06 中国科学技术大学 Text recognition method and system based on low-frequency word storage memory
CN113221885B (en) * 2021-05-13 2022-09-06 中国科学技术大学 Hierarchical modeling method and system based on whole words and radicals
CN113537195A (en) * 2021-07-21 2021-10-22 北京数美时代科技有限公司 Image text recognition method and system and electronic equipment
CN116958981A (en) * 2023-05-31 2023-10-27 广东南方网络信息科技有限公司 Character recognition method and device
CN116958981B (en) * 2023-05-31 2024-04-30 广东南方网络信息科技有限公司 Character recognition method and device

Also Published As

Publication number Publication date
CN111553349B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN111553349B (en) Scene text positioning and identifying method based on full convolution network
CN109635883B (en) Chinese character library generation method based on structural information guidance of deep stack network
CN109948510B (en) Document image instance segmentation method and device
Ye et al. Text detection and recognition in imagery: A survey
US7480408B2 (en) Degraded dictionary generation method and apparatus
Nakamura et al. Scene text eraser
CN111414906A (en) Data synthesis and text recognition method for paper bill picture
CN111914698B (en) Human body segmentation method, segmentation system, electronic equipment and storage medium in image
CN113435240B (en) End-to-end form detection and structure identification method and system
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
RU2726185C1 (en) Detecting and identifying objects on images
JP2008530700A (en) Fast object detection method using statistical template matching
CN111553837A (en) Artistic text image generation method based on neural style migration
CN113158977B (en) Image character editing method for improving FANnet generation network
CN112949455B (en) Value-added tax invoice recognition system and method
CN111523537A (en) Character recognition method, storage medium and system
CN113033558A (en) Text detection method and device for natural scene and storage medium
US20240161304A1 (en) Systems and methods for processing images
CN113033559A (en) Text detection method and device based on target detection and storage medium
CN110570450B (en) Target tracking method based on cascade context-aware framework
CN117115824A (en) Visual text detection method based on stroke region segmentation strategy
Lee et al. Backbone alignment and cascade tiny object detecting techniques for dolphin detection and classification
CN114783042A (en) Face recognition method, device, equipment and storage medium based on multiple moving targets
Shiravale et al. Recent advancements in text detection methods from natural scene images
CN117095423B (en) Bank bill character recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant