CN113255669A - Method and system for detecting text of natural scene with any shape - Google Patents

Method and system for detecting text of natural scene with any shape

Info

Publication number
CN113255669A
CN113255669A
Authority
CN
China
Prior art keywords
mask
frame
candidate frame
candidate
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110715820.9A
Other languages
Chinese (zh)
Other versions
CN113255669B (en)
Inventor
许信顺
刘新宇
罗昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202110715820.9A priority Critical patent/CN113255669B/en
Publication of CN113255669A publication Critical patent/CN113255669A/en
Application granted granted Critical
Publication of CN113255669B publication Critical patent/CN113255669B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a system for detecting natural scene text of arbitrary shape. The method comprises: acquiring a text image to be detected; inputting the image to be detected into a trained detection model to obtain final detection boxes; and post-processing the obtained final detection boxes to form text regions. The detection model screens candidate detection boxes using both the classification score and the mask score to obtain the final detection boxes. The invention also designs a mask attention module that connects the mask generation process with the mask quality scoring process; the mask attention module has a positive effect on the prediction of the mask score.

Description

Method and system for detecting text of natural scene with any shape
Technical Field
The invention relates to the technical field of natural scene text detection, in particular to a method and a system for detecting a natural scene text with any shape.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Text appears in every corner of our daily lives as the most direct way of disseminating information. Because of its potential application value in automatic driving, navigation for the blind, information retrieval and other fields, the task of understanding text in natural scenes has received more and more attention. In general, the natural scene text understanding task involves two steps: text detection and text recognition. The first step locates text regions; the second step identifies the content in the text regions. As the precursor of the text understanding task, text detection occupies a crucial position.
Traditional text detection methods are mainly based on connected-region analysis and sliding windows. Both rely on hand-crafted features: they work in some simple scenes but fail in complex ones. In recent years, thanks to improvements in computer performance, deep learning has met an unprecedented development opportunity, and text detection technology based on deep learning has also improved rapidly. Deep-learning-based text detection can be divided into two main categories: regression-based methods and segmentation-based methods. Regression-based methods can detect horizontal or multi-oriented text, while segmentation-based methods can detect text of arbitrary shape, so segmentation-based methods currently dominate natural scene text detection.
One of the mainstream segmentation-based approaches is the instance-segmentation-based method. Such methods typically first use a horizontal candidate box (proposal) to locate a region; a classification score is then generated to determine whether the region enclosed by the candidate box belongs to text, and a segmentation mask is generated to delineate the text region. Such methods typically use the classification score as the only criterion for evaluating the quality of a predicted candidate box, which can lead to serious false positive problems. The false positive problems can be divided into three categories:
(1) False positives resulting from classification. As shown in fig. 2(a), some areas in a natural scene have features similar to text, such as graffiti on a wall, lines in a book, or cracks on a road surface; these areas may be mistakenly classified as text, resulting in false positive samples.
(2) False positives due to regression. As shown in fig. 2(b), for some long texts or texts with larger character spacing (such as chinese), a candidate box may only contain a partial text segment, and an incomplete text segment may cause ambiguity in the subsequent recognition module.
(3) False positives resulting from segmentation. As shown in fig. 2(c), for some irregular texts, the horizontal candidate box may contain a large amount of background noise, so that the final segmented mask may not perfectly represent the text region.
For systems with high precision requirements, false positive samples can cause immeasurable losses: many systems would rather miss a text instance than detect a wrong one, and false positive samples in the detection result can have a fatal influence on the recognition result. For example, in automatic driving, if only the second half of a 'no parking' sign is detected, the vehicle may park illegally; in information retrieval, if only the first half of the word 'football' is detected, the retrieved result may differ greatly from the ideal result.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method and a system for detecting a natural scene text with any shape;
in a first aspect, the invention provides a method for detecting a text in a natural scene in any shape;
the method for detecting the text of the natural scene with any shape comprises the following steps:
acquiring a to-be-detected text image;
inputting the image to be detected into the trained detection model to obtain a final detection frame; carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the classification score and the mask score to obtain the final detection frame.
In a second aspect, the invention provides a system for detecting text in a natural scene in any shape;
an arbitrarily shaped natural scene text detection system, comprising:
an acquisition module configured to: acquiring a to-be-detected text image;
a detection module configured to: inputting the image to be detected into the trained detection model to obtain a final detection frame; carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the classification score and the mask score to obtain the final detection frame.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention analyzes and summarizes the false positive problems existing in traditional text detection methods based on instance segmentation, and provides a mask quality scoring mechanism to suppress false positive samples.
2. The invention designs a new method for detecting natural scene text of arbitrary shape according to the proposed mask quality scoring mechanism; the proposed method can suppress all types of false positive samples in a simple and unified manner.
3. The invention designs a mask attention module for connecting the mask generation process and the mask quality scoring process, and the mask attention module has a positive effect on the prediction of the mask score.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention, not to limit it.
FIG. 1 is a flow chart of a method of the first embodiment;
FIG. 2(a) is a graph of false positives resulting from classification according to the first embodiment;
FIG. 2(b) is a false positive resulting from the regression of the first embodiment;
FIG. 2(c) is a graph of false positives resulting from the segmentation of the first embodiment;
FIG. 2(d) is a sample of the true positive of the first embodiment;
FIG. 3 is a detailed structure of the MAM of the first embodiment;
fig. 4 is a detailed structure of the Mask head of the first embodiment.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
All data in the embodiments were obtained and used legally, in compliance with laws and regulations and with user consent.
In order to solve the above false positive problems in a unified way, the invention designs an arbitrary-shape text detection method based on a mask quality scoring mechanism. Unlike previous methods, which evaluate the quality of a candidate box only by its classification score, the model designed by the invention determines whether a candidate box is kept according to both its classification score and its mask score. Based on this mechanism, the model can evaluate the quality of candidate boxes more reasonably, and false positive samples are more likely to be found and filtered out. The overall framework of the model is shown in fig. 1. The model designed by the invention consists of four parts: a backbone network (Backbone), a candidate region generation network (RPN), a bounding box module (Box head) and a mask module (Mask head). The Box head comprises two sequentially connected fully connected layers.
Example one
The embodiment provides a method for detecting a natural scene text in any shape;
as shown in fig. 1, the method for detecting a text in a natural scene with an arbitrary shape includes:
s1: acquiring a to-be-detected text image;
s2: inputting the image to be detected into the trained detection model to obtain a final detection frame; carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the classification score and the mask score to obtain the final detection frame.
Further, the S2: inputting the image to be detected into the trained detection model to obtain a final detection frame; the method specifically comprises the following steps:
s21: carrying out feature extraction on the image to be detected;
s22: constructing an initial candidate frame based on the extracted image features;
s23: generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and the position of the initial candidate frame to obtain an adjusted candidate frame;
s24: generating characteristics of the adjusted candidate frame for the adjusted candidate frame;
expanding the adjusted candidate frame to obtain an expanded candidate frame;
for the expansion candidate frame, generating an expansion candidate frame characteristic;
s25: generating a mask for the adjusted candidate frame based on the features of the adjusted candidate frame and the expanded candidate frame features; evaluating the mask quality to obtain a mask score;
s26: and screening the adjusted candidate frames through the classification score and the mask score to form a final detection frame.
Further, the S21: carrying out feature extraction on the image to be detected; the method specifically comprises the following steps:
A deep residual network (ResNet) is adopted as the backbone network Backbone to extract features from the image to be detected, and a Feature Pyramid Network (FPN) is used to enhance the feature representation.
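As a minimal illustrative sketch (not the patented implementation), a ResNet backbone with an FPN could be assembled with torchvision; the resnet_fpn_backbone helper, the ResNet-50 depth and the 800x800 input size are assumptions made here for demonstration only, assuming a recent torchvision release.

import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet-50 wrapped with a Feature Pyramid Network; each pyramid level has 256 channels.
backbone = resnet_fpn_backbone(backbone_name='resnet50', weights=None)

image = torch.randn(1, 3, 800, 800)      # dummy text image to be detected
features = backbone(image)               # OrderedDict of multi-scale feature maps
for level, feat in features.items():
    print(level, tuple(feat.shape))      # e.g. '0' (1, 256, 200, 200)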
Further, the S22: constructing an initial candidate frame based on the extracted image features; the method specifically comprises the following steps:
inputting the extracted features into a candidate Region generation Network (RPN) to obtain a constructed initial candidate frame.
Illustratively, the RPN network will output several horizontal candidate boxes, and one candidate box may be specifically expressed as follows:
b = (x, y, w, h)
where (x, y) is the coordinate of the upper-left corner of the candidate box b, and w and h are respectively the width and the height of b;
further, the S23: generating an initial candidate frame feature based on the initial candidate frame; the method specifically comprises the following steps:
and generating initial candidate frame features by adopting a candidate region alignment operation RoIAlign based on the initial candidate frame and the extracted image features.
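As a minimal sketch, the RoIAlign pooling of candidate box features could be performed with torchvision's roi_align operator; the 14x14 output size and the feature stride of 4 are assumed values used only for illustration.

import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 200, 200)        # backbone/FPN feature of one image
boxes = [torch.tensor([[40., 60., 360., 120.]])]   # one horizontal candidate box (x1, y1, x2, y2)
box_feats = roi_align(feature_map, boxes,
                      output_size=(14, 14),        # pooled r x r candidate box feature
                      spatial_scale=0.25,          # assumed feature stride of 4
                      sampling_ratio=2)
print(box_feats.shape)                             # torch.Size([1, 256, 14, 14])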
Further, the S23: predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and the position of the initial candidate frame to obtain an adjusted candidate frame; the method specifically comprises the following steps:
reducing the dimension of the initial candidate frame features through two full-connection layers, and simultaneously and respectively sending the dimension-reduced features to a classification branch and a regression branch;
the classification branch is a full-connection layer with two-dimensional vector output, and a classification score is obtained by calculation according to the output of the classification branch;
the regression branch is a full connection layer with four-dimensional vector output, and the initial candidate frame is subjected to frame regression according to the output of the regression branch.
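A minimal sketch of such a Box head is given below: two shared fully connected layers, followed by a classification branch with a two-dimensional output and a regression branch with a four-dimensional output. The hidden width of 1024, the 14x14x256 pooled input and the assumption that index 0 is the text class are illustrative choices, not values fixed by the invention.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BoxHead(nn.Module):
    """Two shared FC layers, then a classification branch and a box regression branch."""
    def __init__(self, in_channels=256, pool_size=14, hidden=1024):
        super().__init__()
        self.fc1 = nn.Linear(in_channels * pool_size * pool_size, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.cls_branch = nn.Linear(hidden, 2)    # outputs (o_text, o_bg)
        self.reg_branch = nn.Linear(hidden, 4)    # outputs (dx, dy, dw, dh)

    def forward(self, box_feats):                 # box_feats: [N, C, r, r]
        x = box_feats.flatten(1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        cls_score = F.softmax(self.cls_branch(x), dim=1)[:, 0]   # text-class probability (index 0 assumed)
        box_deltas = self.reg_branch(x)
        return cls_score, box_deltas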
Illustratively, the S23: generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and the position of the initial candidate frame to obtain an adjusted candidate frame; the method specifically comprises the following steps:
S231: the candidate box b is pooled by the candidate region alignment operation RoIAlign into an r x r candidate box feature F_b;
S232: the candidate box feature F_b is reduced in dimension through two fully connected layers;
S233: the reduced feature is passed through a fully connected layer that outputs a two-dimensional vector (o_text, o_bg), where o_text is the output for the text class and o_bg is the output for the background class. The classification score of a candidate box is used for determining whether the region enclosed by the candidate box belongs to text, and the classification score is obtained by normalizing the two outputs with a softmax:
s_cls = exp(o_text) / (exp(o_text) + exp(o_bg));
S234: the reduced feature is simultaneously passed through another fully connected layer that outputs a four-dimensional vector (dx, dy, dw, dh) for bounding box regression; the initial candidate box b is adjusted in position and size by the box regression branch to form the adjusted candidate box b' = (x', y', w', h').
Further, the S24: generating characteristics of the adjusted candidate frame for the adjusted candidate frame; the method specifically comprises the following steps:
Based on the adjusted candidate box and the image features extracted in S21, the features of the adjusted candidate box are generated using the candidate region alignment operation RoIAlign.
Further, the S24: expanding the adjusted candidate frame to obtain an expanded candidate frame; the method specifically comprises the following steps:
The extension operation is adopted to form an extended candidate box from the adjusted candidate box.
The extension operation keeps the center position of the candidate box unchanged and expands the width and the height of the candidate box by a factor α; in practice, α is typically set to 2.
Meanwhile, the extension operation also ensures that the candidate frame after expansion does not exceed the image boundary.
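A minimal sketch of this extension operation (center kept fixed, width and height scaled by α, result clipped to the image boundary); the (x, y, w, h) box format and the clipping strategy shown here are assumptions for illustration.

def expand_box(x, y, w, h, img_w, img_h, alpha=2.0):
    """Expand a candidate box (x, y, w, h) about its center by a factor alpha,
    then clip it so that it does not exceed the image boundary."""
    cx, cy = x + w / 2.0, y + h / 2.0            # the center stays unchanged
    new_w, new_h = alpha * w, alpha * h
    x_e = max(0.0, cx - new_w / 2.0)
    y_e = max(0.0, cy - new_h / 2.0)
    x2_e = min(float(img_w), cx + new_w / 2.0)
    y2_e = min(float(img_h), cy + new_h / 2.0)
    return x_e, y_e, x2_e - x_e, y2_e - y_e      # expanded box, also (x, y, w, h)

# Example: a 100x40 box centered at (400, 300) in an 800x600 image, expanded with alpha = 2
print(expand_box(350, 280, 100, 40, 800, 600))   # (300.0, 260.0, 200.0, 80.0)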
Further, the S24: generating an expansion candidate frame characteristic for the expansion candidate frame; the method specifically comprises the following steps:
and for the expansion candidate frame, generating an expansion candidate frame characteristic by adopting a candidate region alignment operation RoIAlign.
Further, the S25: generating a mask for the adjusted candidate frame based on the features of the adjusted candidate frame and the expanded candidate frame features; evaluating the mask quality to obtain a mask score; the method specifically comprises the following steps:
the adjusted features of the candidate box and the expanded features of the candidate box are input into a Mask module Mask head, and the Mask module Mask head comprises two workflows: mask generating stream and mask score stream; two workflows are connected through a Mask Attention Module (MAM).
The upper workflow in the Mask head of fig. 4 is the mask generation stream, which takes the adjusted candidate box feature as input and outputs a mask. The mask generation stream comprises a convolutional layer C1, a convolutional layer C2 and a deconvolution layer; the features output by the deconvolution layer are reduced in dimension by a convolutional layer C3 with a 1x1 convolution kernel, and the segmentation mask is output.
The lower workflow in the Mask head of fig. 4 is the mask score stream, which takes the adjusted candidate box feature and the extended candidate box feature as input and outputs a mask score. The mask score stream first fuses the input features with two convolutional layers and two Mask Attention Modules (MAM); it comprises a convolutional layer D1, a convolutional layer D2, a convolutional layer D3, a fully connected layer FC1, a fully connected layer FC2 and a fully connected layer FC3.
In fig. 4, the feature output after the Mask Attention Modules (MAM) and the mask output by the mask generation stream are stacked and passed through the convolutional layer D3, the fully connected layer FC1, the fully connected layer FC2 and the fully connected layer FC3, which output a predicted mask score.
Further, the S26: screening the adjusted candidate frame through the classification score and the mask score to obtain a final detection frame; the method specifically comprises the following steps:
s261: all adjusted candidate frames are first deduplicated by non-maximal suppression (NMS);
S262: the quality of a candidate box is measured by the candidate box score s, computed as the product of the classification score and the mask score:
s = s_cls × s_mask
where s_mask is the predicted mask score; candidate boxes with s smaller than 0.5 are filtered out, and the retained candidate boxes form the final detection boxes;
s263: and selecting the maximum connected region in the mask of the reserved detection frame as a final detection result.
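A minimal sketch of this screening and post-processing step (combined score s = s_cls × s_mask, a 0.5 threshold, and selection of the largest connected region of each retained mask); the use of OpenCV's connectedComponentsWithStats and NumPy here is purely illustrative.

import numpy as np
import cv2

def screen_and_postprocess(cls_scores, mask_scores, masks, thresh=0.5):
    """Keep candidate boxes whose combined score reaches the threshold and
    keep only the largest connected region of each retained binary mask."""
    results = []
    for s_cls, s_mask, mask in zip(cls_scores, mask_scores, masks):
        if s_cls * s_mask < thresh:                # combined candidate box score s
            continue
        binary = (mask >= 0.5).astype(np.uint8)    # binarize the predicted mask
        n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
        if n <= 1:                                 # only background present
            continue
        largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))   # skip label 0 (background)
        results.append((labels == largest).astype(np.uint8))
    return results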
Further, the model structure of the detection model includes:
a backbone network Backbone, wherein the backbone network Backbone is used for inputting the text image to be detected;
the output end of the backbone network Backbone is connected with the input end of the candidate region generation network (RPN);
the output end of the candidate region generation network (RPN) is connected with the input end of the RoIAlign layer; the output end of the RoIAlign layer is connected with the input end of the box module Box head; the box module Box head comprises two sequentially connected fully connected layers;
the output end of the RoIAlign layer is also connected with the input end of the Mask head module.
Further, the Mask head module comprises: two parallel working branches: a first branch and a second branch;
wherein the first branch comprises: a convolutional layer C1 and a convolutional layer C2 connected in sequence; the input end of convolutional layer C1 is used for inputting the features of the adjusted candidate box;
wherein the second branch comprises: a convolutional layer D1 and a convolutional layer D2 connected in sequence; the input end of convolutional layer D1 is used for inputting the concatenation of the adjusted candidate box features and the extended candidate box features;
the output terminal of convolutional layer C2 is connected to the first input terminal of the first Mask Attention Module (MAM);
the output end of the convolutional layer D2 is connected with the second input end of the first mask attention module;
the first output end of the first mask attention module is connected with the first input end of the second mask attention module;
the second output end of the first mask attention module is connected with the second input end of the second mask attention module;
a first output end of the second mask attention module is connected with an input end of the deconvolution layer, an output end of the deconvolution layer is connected with an input end of the convolution layer C3, and an output end of the convolution layer C3 generates a predicted mask;
the second output end of the second mask attention module is connected with the input end of the convolutional layer D3, the output end of the convolutional layer C3 is connected with the input end of the convolutional layer D3, the characteristics of the output end of the convolutional layer D3 are connected with three full-connection layers after being subjected to size adjustment, and the last full-connection layer outputs a mask score.
Illustratively, the Mask head module is divided into two workflows: the mask generates a stream and a masked score stream. The mask generation flow generates a corresponding mask for the candidate box feature by using the adjusted candidate box feature, and the mask scoring flow evaluates the mask quality by using the adjusted candidate box feature and the expanded candidate box feature, wherein the expanded candidate box comprises more peripheral information which is helpful for predicting the mask quality.
Illustratively, the Mask head module specifically works as follows:
Step (1): the adjusted candidate box feature F_b is passed through two convolutional layers to form the mask generation stream feature F_gen;
Step (2): the adjusted candidate box feature F_b and the extended candidate box feature F_e are concatenated and passed through two convolutional layers to form the mask score stream feature F_score;
Step (3): F_gen and F_score are fed into the first Mask Attention Module (MAM); the first mask attention module MAM makes the mask score stream focus on the regions contained in the mask, so that the mask quality can be predicted more accurately. The detailed structure of the MAM is shown in fig. 3;
Step (4): the features of the two workflows pass through the second mask attention module;
Step (5): the mask generation stream feature passes through a deconvolution layer and a convolutional layer to generate the predicted mask M;
Step (6): the mask score stream feature and M are stacked and passed through a convolutional layer and three fully connected layers to output the predicted mask score s_mask.
The specific process of the step (3) is as follows:
Step (3.1): the mask generation stream feature F_gen is passed through a convolutional layer to generate a staged mask that serves as an attention map:
A = sigmoid(Conv(F_gen))
where Conv denotes a convolutional layer and A is a single-channel attention map of the same spatial size as F_gen. Regions of A whose response values are higher than a set threshold represent regions attended to in the segmentation process; regions of A whose response values are lower than the set threshold represent regions not attended to in the segmentation process;
Step (3.2): the representation of the attended regions is enhanced on the mask score stream feature, with the following operation:
F' = expand(A) ⊙ F_score
where expand(·) is a function that expands the number of channels of a feature map; in practice it replicates A from a single channel to the same number of channels as F_score. The expanded attention map and the mask score stream feature F_score are multiplied element-wise (⊙) to obtain the enhanced feature F';
Step (3.3): the response values of A in the unattended regions are usually very low, so the responses of these regions are greatly suppressed in F'; in order to prevent the information of the whole region from being lost, the original feature is added back on the basis of F':
F'' = F' + F_score
Step (3.4): F_gen and F'' each pass through a convolutional layer to fuse features and obtain the respective output values.
Wherein the internal structure of the first mask attention module and the second MAM module are the same.
Further, the first masked attention module includes:
convolutional layer E1; an input of the convolutional layer E1 is for connection with a first mask attention module first input; the output end of the convolutional layer E1 is used for being connected with a first output end of a first mask attention module;
a convolutional layer F1; an input of the convolutional layer F1 is for connection with a first mask attention module first input; the output end of the convolutional layer F1 is used for being connected with the input end of the multiplier;
the input end of the multiplier is also connected with the second input end of the first mask attention module; the output end of the multiplier is connected with the input end of the adder, and the input end of the adder is also connected with the second input end of the first mask attention module; the output of the adder is further adapted to be coupled to a second output of the first mask attention module via convolutional layer G1.
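The two-stream Mask head and the mask attention module described above could be sketched as follows. This is a simplified, illustrative implementation only: the channel widths, kernel sizes, activations, the 14x14 pooled size and the down-sampling of the mask before it is stacked with the score stream are assumptions, not the exact configuration of the invention.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskAttentionModule(nn.Module):
    """Sketch of a MAM: an attention map predicted from the generation stream re-weights
    the score stream, and the original score-stream feature is added back as a residual."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv_gen = nn.Conv2d(channels, channels, 3, padding=1)    # E1: generation-stream output path
        self.conv_att = nn.Conv2d(channels, 1, 1)                      # F1: produces the attention map A
        self.conv_score = nn.Conv2d(channels, channels, 3, padding=1)  # G1: score-stream fusion path

    def forward(self, f_gen, f_score):
        att = torch.sigmoid(self.conv_att(f_gen))        # single-channel attention map A
        enhanced = att * f_score + f_score               # broadcast over channels, then add the residual
        return self.conv_gen(f_gen), self.conv_score(enhanced), att

class MaskHead(nn.Module):
    """Sketch of the two-stream Mask head: a mask generation stream and a mask score stream
    connected by two MaskAttentionModules."""
    def __init__(self, channels=256, pool=14):
        super().__init__()
        self.gen_convs = nn.Sequential(                  # C1, C2
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.score_convs = nn.Sequential(                # D1, D2 (input: concatenated box + extended features)
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.mam1 = MaskAttentionModule(channels)
        self.mam2 = MaskAttentionModule(channels)
        self.deconv = nn.ConvTranspose2d(channels, channels, 2, stride=2)
        self.mask_pred = nn.Conv2d(channels, 1, 1)       # C3: 1x1 conv producing the segmentation mask
        self.score_conv3 = nn.Conv2d(channels + 1, channels, 3, padding=1)   # D3: features stacked with the mask
        self.score_fcs = nn.Sequential(                  # FC1, FC2, FC3 -> scalar mask score
            nn.Flatten(),
            nn.Linear(channels * pool * pool, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1))

    def forward(self, f_box, f_ext):
        f_gen = self.gen_convs(f_box)                                  # step (1)
        f_score = self.score_convs(torch.cat([f_box, f_ext], dim=1))   # step (2)
        f_gen, f_score, att1 = self.mam1(f_gen, f_score)               # step (3)
        f_gen, f_score, att2 = self.mam2(f_gen, f_score)               # step (4)
        mask = torch.sigmoid(self.mask_pred(self.deconv(f_gen)))       # step (5): upsampled predicted mask M
        mask_small = F.avg_pool2d(mask, 2)                             # bring M back to the score-stream resolution
        fused = self.score_conv3(torch.cat([f_score, mask_small], dim=1))
        mask_score = torch.sigmoid(self.score_fcs(fused))              # step (6): predicted mask score
        return mask, mask_score, (att1, att2)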
Further, the training of the trained detection model comprises:
sa 1: constructing a training set, wherein the training set is an image of a known candidate frame label;
sa 2: inputting the training set into the detection model, training the detection model,
sa 3: carrying out feature extraction on the image of the known candidate frame tag;
sa 4: constructing an initial candidate frame based on the extracted features;
sa 5: generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, generating a four-dimensional regression bias vector for the initial candidate frame based on the characteristics of the initial candidate frame;
sa 6: generating characteristics of the initial candidate frame for the initial candidate frame; expanding the initial candidate frame to obtain an expanded candidate frame; generating an expansion candidate frame characteristic for the expansion candidate frame;
sa 7: generating a mask based on the initial candidate box feature and the expanded candidate box feature; evaluating the mask quality to obtain a mask score;
sa 8: and calculating a loss function according to the generated classification score, the regression bias vector, the mask score and the attention map generated in the step Sa7, and optimizing network parameters through back propagation to obtain a trained detection model.
Exemplarily, the Sa 6: for the initial candidate frame, generating characteristics of the initial candidate frame; expanding the initial candidate frame to obtain an expanded candidate frame; generating an expansion candidate frame characteristic for the expansion candidate frame; the method specifically comprises the following steps:
Sa61: the candidate box b is pooled by RoIAlign into an r x r candidate box feature F_b;
Sa62: b is expanded into the extended candidate box b_e = (x_e, y_e, w_e, h_e) through the extension operation, which keeps the center of the candidate box unchanged:
x_e = x - (α - 1) · w / 2
y_e = y - (α - 1) · h / 2
w_e = α · w
h_e = α · h
where (x_e, y_e) is the coordinate of the upper-left corner of b_e, w_e and h_e are respectively the width and the height of b_e, and α represents the expansion factor, i.e. the multiple by which the candidate box is expanded;
Sa63: b_e is also pooled by RoIAlign into an r x r extended candidate box feature F_e.
Further, in Sa8, the loss function of the model is calculated and the parameters of the whole model are optimized through back propagation. The specific form of the loss function is as follows:
L = L_rpn + λ1 · L_cls + λ2 · L_reg + λ3 · L_mask + λ4 · L_att + λ5 · L_ms
where L is the loss function of the model and consists of six parts, and λ1, λ2, λ3, λ4 and λ5 are weights used for balancing the importance of the loss terms. L_rpn, L_cls and L_reg, the losses of the RPN network and the Box head, have the same form as in prior instance-segmentation-based methods: L_rpn involves two parts, a binary log loss and a smooth-L1 bounding box regression loss; L_cls adopts the cross-entropy form; L_reg adopts the smooth-L1 form. They are not described further herein.
Figure 10506DEST_PATH_IMAGE059
Mask generation flow loss for Mask head, adopting cross entropy form:
Figure 45458DEST_PATH_IMAGE060
wherein
Figure 347127DEST_PATH_IMAGE061
Representation mask
Figure 176542DEST_PATH_IMAGE023
To middle
Figure 829241DEST_PATH_IMAGE062
The value of the individual pixels is then calculated,
Figure 385993DEST_PATH_IMAGE063
is shown as
Figure 11009DEST_PATH_IMAGE062
Mask labels for individual pixels (obtained by real labels of the training data),
Figure 73643DEST_PATH_IMAGE064
to represent
Figure 885741DEST_PATH_IMAGE023
Total number of middle pixels.
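A minimal sketch of this pixel-wise cross-entropy, assuming the predicted mask and the ground-truth mask label are tensors of the same shape with values in [0, 1]:

import torch

def mask_generation_loss(pred_mask, gt_mask, eps=1e-6):
    """Binary cross-entropy averaged over all N pixels of the mask
    (equivalent to torch.nn.functional.binary_cross_entropy)."""
    p = pred_mask.clamp(eps, 1.0 - eps)
    loss = -(gt_mask * torch.log(p) + (1.0 - gt_mask) * torch.log(1.0 - p))
    return loss.mean()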
L_att is the loss function for the mask attention maps in the MAMs, and has the following specific form:
L_att = L_att1 + L_att2
where L_att1 and L_att2 denote the losses of the two MAMs respectively; the two losses have the same form:
L_attk = -(1 / N_a) Σ_{i=1..N_a} [ a*_i · log(a_i) + (1 - a*_i) · log(1 - a_i) ]
where a_i denotes the value of the i-th pixel in the mask attention map A, a*_i denotes the mask label of that pixel, which is obtained by interpolating the mask label y to the resolution of the attention map, and N_a denotes the total number of pixels in A.
L_ms is the mask score loss of the Mask head and adopts the smooth-L1 form:
L_ms = smooth_L1(s_mask, s*_mask)
where s_mask is the mask score predicted by the model and s*_mask is the true mask score, defined as the intersection-over-union between the generated mask and the true mask. s*_mask can be obtained through the following process:
If a candidate box has an intersection-over-union greater than 0.2 with a true horizontal text box, the true mask score of the candidate box is calculated by the following formula:
s*_mask = |M_bin ∩ M*| / |M_bin ∪ M*|
where M* is the true mask and M_bin is obtained by first binarizing the predicted mask M (a pixel is set to 1 if its value is not smaller than a threshold, e.g. 0.5, and to 0 otherwise).
If a candidate box does not intersect any real text box, or its intersection-over-union with all real text boxes is less than 0.2, its s*_mask is directly set to 0.
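A minimal sketch of how the true mask score s*_mask could be computed; the 0.5 binarization threshold follows the description above as an assumed value, and the tensor-based masks are illustrative.

import torch

def true_mask_score(pred_mask, gt_mask, box_iou, iou_gate=0.2, bin_thresh=0.5):
    """IoU between the binarized predicted mask and the ground-truth mask;
    returns 0 when the candidate box matches no real text box closely enough."""
    if box_iou < iou_gate:                        # box does not overlap any real text box well
        return torch.tensor(0.0)
    pred_bin = (pred_mask >= bin_thresh).float()  # binarize the predicted mask M
    gt_bin = (gt_mask >= 0.5).float()
    inter = (pred_bin * gt_bin).sum()
    union = pred_bin.sum() + gt_bin.sum() - inter
    return inter / union.clamp(min=1.0)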
Fig. 2(a) to 2(d) show three types of false positive samples and one true positive sample. The rectangular boxes represent the final detection boxes predicted by the model; the shaded portions inside the boxes represent the segmentation masks of these detection boxes; cls-score is the classification score predicted by the model and ms-score is the mask score predicted by the model. The traditional method screens candidate boxes only by the classification score, and the three types of false positive samples generally have high classification scores, so they are retained. The method provided by the invention screens candidate boxes by both the classification score and the mask score: the mask scores of the three types of false positive samples are very low, so they can be filtered out, while both scores of the true positive sample are high, so it is retained in the final detection result.
Training process:
step (1): acquiring original pictures of a training set and original labels of text regions in each picture (generally, horizontal or multidirectional texts are labeled by quadrangles, and irregular texts are labeled by polygons), and generating mask labels and candidate frame labels of the text regions by using the original labels. Marking all pixels inside the quadrangle or the polygon as a text category (namely, marking the pixel value as 1), and marking all pixels outside the quadrangle or the polygon as a background category (namely, marking the pixel value as 0), and forming a text region mask; taking the minimum horizontal frame capable of surrounding the quadrangle or the polygon as a candidate frame label;
Step (2): sending the pictures into the Backbone to extract features and constructing initial candidate boxes through the RPN;
Step (3): the initial candidate box features and the expanded candidate box features are sent into the Box head and the Mask head to generate the classification score s_cls, the box offset (dx, dy, dw, dh), the segmentation mask M and the mask score s_mask. The MAM modules in the Mask head output the attention maps A;
And (4): calculating a loss function of the model, and optimizing the whole model through back propagation;
Step (5): after the whole training set has been trained for K epochs, the model is fixed and the network parameters are stored; K is a positive integer in the range of 30 to 40.
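A minimal sketch of how the weighted loss terms might be combined and back-propagated in step (4); the weight values shown are placeholders, not the values used by the invention.

# illustrative placeholder weights for the weighted loss terms
lambdas = dict(cls=1.0, reg=1.0, mask=1.0, att=0.5, ms=0.5)

def total_loss(l_rpn, l_cls, l_reg, l_mask, l_att, l_ms):
    return (l_rpn
            + lambdas['cls'] * l_cls
            + lambdas['reg'] * l_reg
            + lambdas['mask'] * l_mask
            + lambdas['att'] * l_att
            + lambdas['ms'] * l_ms)

# inside the training loop (sketch):
#   loss = total_loss(l_rpn, l_cls, l_reg, l_mask, l_att, l_ms)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()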
The testing process comprises the following steps:
step (1): acquiring a picture to be tested;
Step (2): sending the picture into the Backbone to extract features and constructing initial candidate boxes through the RPN;
Step (3): the initial candidate box features are fed into the Box head to generate the classification score s_cls and the box offset (dx, dy, dw, dh); the original candidate boxes are adjusted using the box offset;
Step (4): the adjusted candidate box features and the expanded adjusted candidate box features are sent into the Mask head to generate the segmentation mask M and the mask score s_mask;
Step (5): non-maximum suppression is used to filter out duplicate candidate boxes. The classification score s_cls and the mask score s_mask are then used to calculate the candidate box score s = s_cls × s_mask, and candidate boxes with s smaller than 0.5 are filtered out;
Step (6): the largest connected region in the mask of each retained candidate box is selected as the final detection result.
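A minimal sketch of the test-time screening in steps (5) and (6) above, using torchvision's NMS operator; the NMS IoU threshold of 0.5 is an assumed value (the largest-connected-region step is sketched earlier, after step S263 of the first embodiment).

import torch
from torchvision.ops import nms

def filter_detections(boxes, cls_scores, mask_scores, nms_iou=0.5, score_thresh=0.5):
    """boxes: [N, 4] in (x1, y1, x2, y2) format; cls_scores, mask_scores: [N].
    Removes duplicates with NMS, then keeps boxes whose combined score reaches the threshold."""
    keep = nms(boxes, cls_scores, nms_iou)        # de-duplicate by classification score
    boxes, cls_scores, mask_scores = boxes[keep], cls_scores[keep], mask_scores[keep]
    combined = cls_scores * mask_scores           # candidate box score s = s_cls * s_mask
    final = combined >= score_thresh
    return boxes[final], combined[final]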
Example two
The embodiment provides a natural scene text detection system with any shape;
an arbitrarily shaped natural scene text detection system, comprising:
an acquisition module configured to: acquiring a to-be-detected text image;
a detection module configured to: inputting the image to be detected into the trained detection model to obtain a final detection frame; carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the classification score and the mask score to obtain the final detection frame.
It should be noted that the above-mentioned acquiring module and detecting module correspond to steps S1 to S2 in the first embodiment, and the above-mentioned modules are the same as the examples and application scenarios realized by the corresponding steps, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The method for detecting the text of the natural scene in any shape is characterized by comprising the following steps:
acquiring a to-be-detected text image;
inputting the image to be detected into the trained detection model to obtain a final detection frame; carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the product of the classification score and the mask score to obtain the final detection frame.
2. The method for detecting the texts in the natural scenes with the arbitrary shapes as claimed in claim 1, wherein the images to be detected are input into the trained detection model to obtain a final detection frame; the method specifically comprises the following steps:
carrying out feature extraction on the image to be detected;
constructing an initial candidate frame based on the extracted image features;
generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and the position of the initial candidate frame to obtain an adjusted candidate frame;
generating characteristics of the adjusted candidate frame for the adjusted candidate frame; expanding the adjusted candidate frame to obtain an expanded candidate frame; for the expansion candidate frame, generating an expansion candidate frame characteristic;
generating a mask for the adjusted candidate frame based on the features of the adjusted candidate frame and the expanded candidate frame features; evaluating the mask quality to obtain a mask score;
and screening the adjusted candidate frame by the product of the classification score and the mask score to form a final detection frame.
3. The method for detecting the text in the natural scene with the arbitrary shape as claimed in claim 2, wherein the classification score of the candidate box is predicted based on the initial candidate box feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and the position of the initial candidate frame to obtain an adjusted candidate frame; the method specifically comprises the following steps:
reducing the dimension of the initial candidate frame features through two full-connection layers, and simultaneously and respectively sending the dimension-reduced features to a classification branch and a regression branch;
the classification branch is a full-connection layer with two-dimensional vector output, and a classification score is obtained by calculation according to the output of the classification branch;
the regression branch is a full connection layer with four-dimensional vector output, and the initial candidate frame is subjected to frame regression according to the output of the regression branch.
4. The method for detecting text in a natural scene with an arbitrary shape as set forth in claim 2,
generating a mask for the adjusted candidate frame based on the features of the adjusted candidate frame and the expanded candidate frame features; evaluating the mask quality to obtain a mask score; the method specifically comprises the following steps:
the adjusted features of the candidate box and the expanded features of the candidate box are input into a Mask module Mask head, and the Mask module Mask head comprises two workflows: mask generating stream and mask score stream;
a mask generation stream, which takes the adjusted candidate box characteristics as input and outputs a mask;
and the mask score stream takes the adjusted candidate box characteristics and the expanded candidate box characteristics as input and outputs a mask score.
5. The method for detecting the text in the natural scene with the arbitrary shape as claimed in claim 1, wherein the model structure of the detection model comprises:
a backbone network Backbone, wherein the backbone network Backbone is used for inputting the text image to be detected;
the output end of the Backbone network Backbone is connected with the input end of the candidate region generation network RPN;
the output end of the candidate region generation network RPN is connected with the input end of a RoIAlign layer; the output end of the RoIAlign layer is connected with the input end of the frame module Box head; the frame module Box head comprises two fully-connected layers which are connected in sequence;
the output end of the RoIAlign layer is also connected with the input end of the Mask head module.
6. The method for detecting texts in natural scenes with arbitrary shapes according to claim 5, wherein the Mask head module comprises: two parallel working branches: a first branch and a second branch;
wherein the first branch comprises: a convolutional layer C1 and a convolutional layer C2 connected in sequence; the input end of convolutional layer C1 is used for inputting the features of the adjusted candidate box;
wherein the second branch comprises: a convolutional layer D1 and a convolutional layer D2 connected in sequence; the input end of convolutional layer D1 is used for inputting the concatenation of the adjusted candidate box features and the extended candidate box features;
the output end of the convolutional layer C2 is connected to the first input end of the first mask attention module MAM;
the output end of the convolutional layer D2 is connected with the second input end of the first mask attention module;
the first output end of the first mask attention module is connected with the first input end of the second mask attention module;
the second output end of the first mask attention module is connected with the second input end of the second mask attention module;
a first output end of the second mask attention module is connected with an input end of the deconvolution layer, an output end of the deconvolution layer is connected with an input end of the convolution layer C3, and an output end of the convolution layer C3 generates a predicted mask;
the second output end of the second mask attention module is connected with the input end of the convolutional layer D3, the output end of the convolutional layer C3 is connected with the input end of the convolutional layer D3, the characteristics of the output end of the convolutional layer D3 are connected with three full-connection layers after being subjected to size adjustment, and the last full-connection layer outputs a mask score.
7. The method for detecting the text of the natural scene with the arbitrary shape as claimed in claim 5, wherein the Mask head module specifically works as follows:
the adjusted candidate box feature F_b is passed through two convolutional layers to form the mask generation stream feature F_gen;
the adjusted candidate box feature F_b and the extended candidate box feature F_e are concatenated and passed through two convolutional layers to form the mask score stream feature F_score;
F_gen and F_score are fed into the first mask attention module; the first mask attention module makes the mask score stream focus on the regions contained in the mask;
the features of the two workflows pass through the second mask attention module;
the mask generation stream feature passes through a deconvolution layer and a convolutional layer to generate the predicted mask M;
the mask score stream feature and M are stacked and passed through a convolutional layer and three fully connected layers to output the predicted mask score s_mask.
8. The method for detecting text in an arbitrarily-shaped natural scene as recited in claim 6, wherein the first masking attention module comprises:
convolutional layer E1; an input of the convolutional layer E1 is for connection with a first mask attention module first input; the output end of the convolutional layer E1 is used for being connected with a first output end of a first mask attention module;
a convolutional layer F1; an input of the convolutional layer F1 is for connection with a first mask attention module first input; the output end of the convolutional layer F1 is used for being connected with the input end of the multiplier;
the input end of the multiplier is also connected with the second input end of the first mask attention module; the output end of the multiplier is connected with the input end of the adder, and the input end of the adder is also connected with the second input end of the first mask attention module; the output of the adder is further adapted to be coupled to a second output of the first mask attention module via convolutional layer G1.
9. The method for detecting the text in the natural scene with the arbitrary shape as claimed in claim 1, wherein the training step of the trained detection model comprises:
constructing a training set, wherein the training set is an image of a known candidate frame label;
inputting the training set into the detection model, training the detection model,
carrying out feature extraction on the image of the known candidate frame tag;
constructing an initial candidate frame based on the extracted features;
generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, generating a four-dimensional regression bias vector for the initial candidate frame based on the characteristics of the initial candidate frame;
generating characteristics of the initial candidate frame for the initial candidate frame; expanding the initial candidate frame to obtain an expanded candidate frame; generating an expansion candidate frame characteristic for the expansion candidate frame;
generating a mask based on the initial candidate box feature and the expanded candidate box feature; evaluating the mask quality to obtain a mask score;
and calculating a loss function according to the generated classification score, the regression bias vector, the mask score and the generated attention map, and obtaining a trained candidate frame screening model by reversely propagating and optimizing network parameters.
10. The system for detecting texts in natural scenes of any shape, characterized by comprising:
an acquisition module configured to: acquiring a to-be-detected text image;
a detection module configured to: inputting the image to be detected into the trained detection model to obtain a final detection frame; carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the product of the classification score and the mask score to obtain the final detection frame.
CN202110715820.9A 2021-06-28 2021-06-28 Method and system for detecting text of natural scene with any shape Active CN113255669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110715820.9A CN113255669B (en) 2021-06-28 2021-06-28 Method and system for detecting text of natural scene with any shape

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110715820.9A CN113255669B (en) 2021-06-28 2021-06-28 Method and system for detecting text of natural scene with any shape

Publications (2)

Publication Number Publication Date
CN113255669A true CN113255669A (en) 2021-08-13
CN113255669B CN113255669B (en) 2021-10-01

Family

ID=77189947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110715820.9A Active CN113255669B (en) 2021-06-28 2021-06-28 Method and system for detecting text of natural scene with any shape

Country Status (1)

Country Link
CN (1) CN113255669B (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150170002A1 (en) * 2013-05-31 2015-06-18 Google Inc. Object detection using deep neural networks
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks
JP2020181255A (en) * 2019-04-23 2020-11-05 国立大学法人 東京大学 Image analysis device, image analysis method, and image analysis program
CN110287960A (en) * 2019-07-02 2019-09-27 中国科学院信息工程研究所 The detection recognition method of curve text in natural scene image
CN110895695A (en) * 2019-07-31 2020-03-20 上海海事大学 Deep learning network for character segmentation of text picture and segmentation method
CN112749704A (en) * 2019-10-31 2021-05-04 北京金山云网络技术有限公司 Text region detection method and device and server
CN110807422A (en) * 2019-10-31 2020-02-18 华南理工大学 Natural scene text detection method based on deep learning
CN111754531A (en) * 2020-07-08 2020-10-09 深延科技(北京)有限公司 Image instance segmentation method and device
CN111950545A (en) * 2020-07-23 2020-11-17 南京大学 Scene text detection method based on MSNDET and space division
CN112183322A (en) * 2020-09-27 2021-01-05 成都数之联科技有限公司 Text detection and correction method for any shape
CN112183545A (en) * 2020-09-29 2021-01-05 佛山市南海区广工大数控装备协同创新研究院 Method for recognizing natural scene text in any shape
CN112163634A (en) * 2020-10-14 2021-01-01 平安科技(深圳)有限公司 Example segmentation model sample screening method and device, computer equipment and medium
AU2020103585A4 (en) * 2020-11-20 2021-02-04 Sonia Ahsan CDN- Object Detection System: Object Detection System with Image Classification and Deep Neural Networks
CN112446356A (en) * 2020-12-15 2021-03-05 西北工业大学 Method for detecting text with any shape in natural scene based on multiple polar coordinates
CN112861855A (en) * 2021-02-02 2021-05-28 华南农业大学 Group-raising pig instance segmentation method based on confrontation network model
CN112989927A (en) * 2021-02-03 2021-06-18 杭州电子科技大学 Scene graph generation method based on self-supervision pre-training

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MINGHUI LIAO 等: "Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE》 *
ZHAOJIN HUANG 等: "Mask Scoring R-CNN", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 *
肖雅娟: "Research on Image Text Detection Technology Based on Deep Learning", 《China Master's Theses Full-text Database, Information Science and Technology》 *
许信顺 等: "A New Feature Selection Method for Text Classification", 《Journal of Shandong University (Engineering Science)》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067237A (en) * 2021-10-28 2022-02-18 清华大学 Video data processing method, device and equipment
CN114863431A (en) * 2022-04-14 2022-08-05 中国银行股份有限公司 Text detection method, device and equipment
CN116958981A (en) * 2023-05-31 2023-10-27 广东南方网络信息科技有限公司 Character recognition method and device
CN116958981B (en) * 2023-05-31 2024-04-30 广东南方网络信息科技有限公司 Character recognition method and device

Also Published As

Publication number Publication date
CN113255669B (en) 2021-10-01

Similar Documents

Publication Publication Date Title
CN113255669B (en) Method and system for detecting text of natural scene with any shape
CN101971190B (en) Real-time body segmentation system
CN111598030A (en) Method and system for detecting and segmenting vehicle in aerial image
CN113486726A (en) Rail transit obstacle detection method based on improved convolutional neural network
CN105574524B (en) Based on dialogue and divide the mirror cartoon image template recognition method and system that joint identifies
CN109492596B (en) Pedestrian detection method and system based on K-means clustering and regional recommendation network
CN111767927A (en) Lightweight license plate recognition method and system based on full convolution network
Aggarwal et al. A robust method to authenticate car license plates using segmentation and ROI based approach
Ji et al. Filtered selective search and evenly distributed convolutional neural networks for casting defects recognition
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
CN114648665A (en) Weak supervision target detection method and system
CN112990282B (en) Classification method and device for fine-granularity small sample images
CN112287941A (en) License plate recognition method based on automatic character region perception
CN112507876A (en) Wired table picture analysis method and device based on semantic segmentation
He et al. Aggregating local context for accurate scene text detection
CN113496480A (en) Method for detecting weld image defects
Qin et al. Traffic sign segmentation and recognition in scene images
Wang A survey on IQA
CN116152226A (en) Method for detecting defects of image on inner side of commutator based on fusible feature pyramid
CN114330234A (en) Layout structure analysis method and device, electronic equipment and storage medium
CN117372876A (en) Road damage evaluation method and system for multitasking remote sensing image
Li et al. An improved PCB defect detector based on feature pyramid networks
CN117522735A (en) Multi-scale-based dense-flow sensing rain-removing image enhancement method
CN110363198B (en) Neural network weight matrix splitting and combining method
CN111178275A (en) Fire detection method based on convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant