CN113963341A - Character detection system and method based on multi-layer perceptron mask decoder - Google Patents


Info

Publication number
CN113963341A
CN113963341A (application number CN202111034219.XA)
Authority
CN
China
Prior art keywords
network
features
mask
detection
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111034219.XA
Other languages
Chinese (zh)
Inventor
王伟平
秦绪功
周宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202111034219.XA priority Critical patent/CN113963341A/en
Publication of CN113963341A publication Critical patent/CN113963341A/en
Pending legal-status Critical Current

Classifications

    • G06F18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22 — Pattern recognition: matching criteria, e.g. proximity measures
    • G06F18/241 — Pattern recognition: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/253 — Pattern recognition: fusion techniques of extracted features
    (All under G — Physics; G06 — Computing, calculating or counting; G06F — Electric digital data processing.)


Abstract

The invention discloses a character detection system and method based on a multi-layer perceptron mask decoder, relating to the field of image text detection.

Description

Character detection system and method based on multi-layer perceptron mask decoder
Technical Field
The invention relates to the field of image text detection, in particular to a character detection system and method based on a multi-layer perceptron mask decoder.
Background
Text detection and recognition in scene images has been a research hotspot in recent years. Text detection is a key part of the pipeline: its task is to locate the text in an image so that a recognition module can transcribe it into a form a computer can edit. With the development of deep learning, text detection technology has advanced greatly. Detection of arbitrarily shaped scene text has become a popular research direction, and Mask R-CNN, owing to its excellent performance in object detection and instance segmentation, has gradually become one of the important baselines for arbitrary-shape scene text detection and end-to-end text recognition. Dense text, although common in real scenes, has received less attention in academia. At present, Mask R-CNN suffers an obvious performance drop when detecting dense text, the problem being that dense text instances cannot be distinguished effectively.
When a text detector based on the Mask R-CNN framework is applied to dense text, learning confusion during training prevents dense instances from being separated effectively. In addition, the RPN requires manually set anchor boxes to match text objects of different aspect ratios, so achieving the best results requires different settings on different datasets. Mask TextSpotter v3 instead generates candidate regions with a segmentation-based method; however, it needs a larger input scale to detect small-scale text, which slows inference, and on the other hand localization errors in the text core region accumulate in later stages.
Disclosure of Invention
The invention aims to provide a text detection system and method based on a multi-layer perceptron mask decoder that can effectively distinguish and extract dense, arbitrarily shaped text in images.
To achieve this aim, the invention adopts the following technical scheme:
a word detection system based on a multi-layer perceptron mask decoder comprises:
the characteristic extraction module is composed of a residual error network with 50 layers and is used for extracting visual characteristics on the image;
the feature fusion module is composed of a feature pyramid network FPN, and is used for enhancing the visual features by fusing the high-resolution features of the low-level features and the high-semantic information features of the high-level features in the visual features to obtain a feature map;
the candidate frame generation network module is used for setting an anchor frame on the characteristic diagram as an initial representation of a detection object and obtaining a candidate area by classifying and regressing the anchor frame;
a RoI-Align network module, configured to extract local area features from the candidate areas;
and the detection head network module comprises a detection branch network and a mask branch network, wherein the detection branch network is used for generating a corrected rectangular frame and a confidence score aiming at the characters according to the local region characteristics, and the mask branch network is used for generating character detection results in any shapes according to the corrected rectangular frame with the confidence threshold.
Further, the candidate region generation network is trained in advance, and an adaptive sample allocation strategy is used during training to assign the positive and negative samples of the training data.
Further, the adaptive sample allocation strategy comprises two steps, pre-allocation and allocation: in the pre-allocation step, each instance is allowed to match as many positive samples as possible; in the allocation step, for each instance, a loss function is used as the measure of match quality and the top k (k ∈ ℕ) samples are selected as positive samples according to that quality.
Further, the mask branch network comprises an encoder and a decoder: the encoder consists of four consecutive convolutional layers and further extracts visual features; the decoder is a multi-layer perceptron consisting of two fully connected layers, which maps the visual features to a binary mask prediction and outputs arbitrary-shape text detection results.
A text detection method based on a multi-layer perceptron mask decoder, based on the above system, comprises the following steps:
inputting the image to be detected into the feature extraction module and extracting its visual features;
inputting the visual features into the feature fusion module, fusing the high-resolution low-level features with the semantically rich high-level features, and enhancing the visual features to obtain a feature map;
using the candidate box generation network module to set anchor boxes on the feature map as initial representations of detection objects, and classifying and regressing the anchor boxes to obtain candidate regions;
extracting local region features from the candidate regions with the RoI-Align network module;
and using the detection branch network of the detection head network module to generate a refined rectangular box and a confidence score for the text from the local region features, and using the mask branch network of the detection head network module to generate arbitrary-shape text detection results from the refined rectangular boxes that pass a confidence threshold.
Further, the candidate region generation network is trained in advance, and an adaptive sample allocation strategy is used during training to assign the positive and negative samples of the training data.
Further, the adaptive sample allocation strategy comprises two steps, pre-allocation and allocation: in the pre-allocation step, each instance is allowed to match as many positive samples as possible; in the allocation step, for each instance, a loss function is used as the measure of match quality and the top k (k ∈ ℕ) samples are selected as positive samples.
Further, the encoder of the mask branch network further extracts visual features, and the decoder of the mask branch network maps the visual features to a binary mask prediction, outputting arbitrary-shape text detection results.
Compared with the prior art, the text detection method based on the multi-layer perceptron mask decoder provided by the invention uses the multi-layer perceptron mask decoder to distinguish different instances effectively. Because the decoder abandons weight sharing, the learning-confusion problem in the mask branch is reduced; at the same time, because global modeling and more context information are introduced, the predictions obtained are more compact and dense text can be distinguished effectively.
Drawings
FIG. 1 is a block diagram of the text detection system based on a multi-layer perceptron mask decoder according to the present invention.
FIG. 2 shows the structure of the detection head network, in which "conv" denotes convolution, "deconv" denotes deconvolution, and "FC" denotes a fully connected layer.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
This embodiment provides a text detection system based on a multi-layer perceptron mask decoder: a simple and effective network model for detecting dense, arbitrarily shaped scene text. The whole model consists of five parts, shown in FIG. 1: a feature extraction module, a feature fusion module, a candidate box generation network module, a RoI-Align network module, and a detection head network module.
The feature extraction module consists of a 50-layer residual network, which can extract rich visual features.
The feature fusion module consists of a feature pyramid network (FPN). It further enhances the expressive power of the features by fusing features from different levels, exploiting the high resolution of low-level features and the rich semantics of high-level features, to obtain a feature map.
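The top-down fusion an FPN performs can be sketched in a few lines of NumPy. This is a minimal illustration with made-up shapes and random weights, not the patented implementation: it reduces both levels to a common channel count with 1x1 convolutions, upsamples the high-level map, and adds the two element-wise.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_fuse(c_low, c_high, w_lateral, w_top):
    """Fuse a high-resolution low-level map with an upsampled high-level map.

    c_low:  (C_low, H, W)       low-level feature (high resolution)
    c_high: (C_high, H/2, W/2)  high-level feature (rich semantics)
    w_lateral, w_top: 1x1-conv weights mapping both to the same channel count.
    """
    lateral = np.einsum('oc,chw->ohw', w_lateral, c_low)   # 1x1 conv on low level
    top = np.einsum('oc,chw->ohw', w_top, c_high)          # 1x1 conv on high level
    return lateral + upsample2x(top)                        # element-wise fusion

rng = np.random.default_rng(0)
c2 = rng.standard_normal((64, 8, 8))    # hypothetical low-level feature
c3 = rng.standard_normal((128, 4, 4))   # hypothetical high-level feature
p2 = fpn_fuse(c2, c3, rng.standard_normal((32, 64)), rng.standard_normal((32, 128)))
print(p2.shape)  # (32, 8, 8)
```

The fused map keeps the low-level resolution while inheriting high-level semantics, which is exactly the property the feature fusion module exploits.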
the candidate frame generation network module is used for generating candidate regions, setting corresponding anchor frames (namely, candidate frames and rectangular frames) on the fused feature maps as initial representations of detection objects, and roughly recalling character regions by classifying and regressing the anchor frames to obtain the candidate regions (namely, regions of interest in fig. 1).
A RoI-Align network module, which is used to extract local area features from the candidate areas, and the network structure and the algorithm of the network are common knowledge, and those skilled in the art can understand;
the detection head network module consists of a detection branch network and a mask branch network. Wherein the detection branch outputs a confidence score and a modified rectangular box for the word, and the mask branch is used to generate an arbitrary shape as a word representation. The structure of the mask branches is further divided into encoder and decoder: four successive convolutions are first used to further extract visual features, which are divided into encoders, and the latter structure is used to generate a prediction from the features to a binary mask, which is divided into decoders. As shown in fig. 2, the decoder of the standard detection head network module (i.e. the mask head in fig. 2) is composed of a "deconvolution-convolution" structure, while the decoder proposed by the present invention is a multi-layered perceptron structure composed of a "full-connection-full-connection" structure.
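The proposed "fully connected-fully connected" decoder can be illustrated with a minimal NumPy sketch. The shapes and weights below are hypothetical; the point is only that every output pixel gets its own weights (no weight sharing, global receptive field), which is what distinguishes it from a shared deconvolution-convolution head:

```python
import numpy as np

def mlp_mask_decoder(feat, w1, b1, w2, b2, out_size=14):
    """Two fully connected layers mapping flattened RoI features to a mask.

    feat: (D,) flattened encoder output for one candidate region.
    Because each mask pixel has dedicated weights, the decoder sees the whole
    RoI at once, which helps separate nearby (dense) instances.
    """
    hidden = np.maximum(feat @ w1 + b1, 0.0)           # FC 1 + ReLU
    logits = hidden @ w2 + b2                          # FC 2: one logit per pixel
    probs = 1.0 / (1.0 + np.exp(-logits))              # sigmoid
    return (probs > 0.5).reshape(out_size, out_size)   # binarised mask

rng = np.random.default_rng(0)
d, h, m = 256, 128, 14                                 # hypothetical sizes
mask = mlp_mask_decoder(rng.standard_normal(d),
                        rng.standard_normal((d, h)), np.zeros(h),
                        rng.standard_normal((h, m * m)), np.zeros(m * m))
print(mask.shape)  # (14, 14)
```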
Based on this system, the embodiment provides a text detection method based on a multi-layer perceptron mask decoder. The whole process comprises the following steps:
1) input an image and extract visual features through the feature extraction module;
2) pass the extracted visual features through the feature fusion module to fuse features from different levels, i.e., fuse the high-resolution low-level features with the semantically rich high-level features, enhancing the visual features to obtain a feature map;
3) generate candidate regions from the feature map with the candidate region generation network, assigning positive and negative samples during training with the adaptive sample allocation strategy proposed by the invention. From the perspective of positive/negative sample allocation, the invention simplifies the anchor-box settings of the complex candidate region generation network while improving its performance, so text can be detected accurately at high speed. Specifically, manually set anchor boxes cannot adaptively match text with extreme aspect ratios: in essence, long text instances match no positive samples, or only a few. The adaptive sample allocation strategy solves this problem from the sample-allocation perspective in two steps: pre-allocation, then allocation. In the first step, each instance is allowed to match more positive samples, so that more positive samples can participate. In the second step, for each instance, a loss function is used as the measure of match quality and the top k samples are selected as positives according to that quality.
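Under the stated assumptions (a precomputed per-anchor matching loss where lower means a better match), the allocation step reduces to a per-instance top-k selection. A sketch, not the patented implementation:

```python
import numpy as np

def assign_positives(match_loss, k=5):
    """Adaptive sample allocation: per instance, keep the k best anchors.

    match_loss: (num_instances, num_anchors) loss of matching each anchor
                to each text instance (lower = better match).
    Returns a boolean (num_instances, num_anchors) positive-sample mask.
    """
    k = min(k, match_loss.shape[1])
    topk = np.argsort(match_loss, axis=1)[:, :k]   # k lowest losses per instance
    pos = np.zeros_like(match_loss, dtype=bool)
    np.put_along_axis(pos, topk, True, axis=1)
    return pos

loss = np.array([[0.2, 0.9, 0.1, 0.5],
                 [0.7, 0.3, 0.8, 0.4]])
pos = assign_positives(loss, k=2)
# instance 0 keeps anchors 0 and 2; instance 1 keeps anchors 1 and 3
```

Because the selection is relative per instance rather than a fixed IoU cutoff, even a long text line whose best anchors have mediocre overlap still receives k positives, which is why the anchor-box settings can be simplified.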
4) extract local region features from the candidate regions obtained in step 3) using RoI-Align;
5) input the local region features obtained in step 4) into the detection branch network to obtain refined rectangular boxes and detection confidence scores;
6) filter the detection results obtained in step 5) with a confidence threshold and input them into the mask branch network to obtain arbitrary-shape text detection results.
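The thresholding in step 6) amounts to a simple filter before the mask branch. A sketch; the threshold value 0.5 and the box format are illustrative assumptions:

```python
import numpy as np

def filter_by_confidence(boxes, scores, thresh=0.5):
    """Keep only refined boxes confident enough for the mask branch.

    boxes: (N, 4) refined rectangles as (x1, y1, x2, y2); scores: (N,).
    """
    keep = scores >= thresh
    return boxes[keep], scores[keep]

boxes = np.array([[0, 0, 10, 5], [3, 2, 8, 9], [1, 1, 4, 4]], dtype=float)
scores = np.array([0.9, 0.3, 0.7])
kept_boxes, kept_scores = filter_by_confidence(boxes, scores)
print(len(kept_boxes))  # 2
```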
To demonstrate the technical effect of the method (abbreviated "MAYOR"), the inventors carried out extensive experimental evaluation, training and testing MAYOR on five mainstream multi-oriented scene text datasets. DAST1500 is a dense arbitrary-shape text detection dataset of commodity images collected from the internet, mainly detailed descriptions on small commodity packages; it contains 1038 training images and 500 test images, with instances annotated at the text-line level as polygon boxes. The images of MSRA-TD500 contain large variations in angle and scale, with 300 training samples and 200 test samples. ICDAR2015 contains 1000 training images and 500 test images. CTW1500 contains 1000 training images and 500 test images, with 10751 text instances in total, 3530 of which are curved; each image has at least one curved text instance, and instances are annotated at the text-line level with 14-point polygons. Total-Text (TT) has 1255 training images and 300 test images containing curved and horizontal multi-oriented text; each text is annotated at the word level with a polygon box.
Table 1 compares decoders with different structures; the results demonstrate the effectiveness of the proposed multi-layer perceptron mask decoder. Table 2 compares the contributions of the modules of the invention; the results demonstrate the effect of the proposed multi-layer perceptron mask decoder and the adaptive sample allocation strategy. Tables 3, 4, and 5 show ablation experiments on the adaptive sample allocation strategy; the results show that the strategy is robust to hyper-parameter selection and can simplify the anchor-box settings.
Tables 2, 6, and 7 compare the invention with other mainstream methods; the invention achieves the best performance on multiple datasets, demonstrating its effectiveness.
TABLE 1 Performance comparison of decoders (%)
Decoder                          Recall  Precision  F-measure
Deconvolution-convolution        79.8    86.7       83.1
Deconvolution-local connection   84.0    87.9       85.9
Deconvolution-full connection    84.6    89.1       86.8
Full connection-full connection  85.5    87.8       86.6
TABLE 2 test results on DAST1500 (%)
[Table 2 was rendered as an image in the original document]
Note: in Table 2, "MRCNN", "ALA", and "MMD" denote Mask R-CNN, the adaptive sample allocation strategy, and the proposed multi-layer perceptron mask decoder, respectively.
TABLE 3 Results on DAST1500 with different values of k (%)
k          3     5     7     9     11    13    15
Recall     84.8  85.5  84.6  84.6  85.2  84.5  85.2
Precision  88.2  87.8  88.5  88.1  88.0  88.4  87.7
F-measure  86.5  86.6  86.5  86.3  86.6  86.4  86.4
TABLE 4 Performance on DAST1500 for different aspect ratios of the anchor frame (%)
[Table 4 was rendered as an image in the original document]
TABLE 5 Performance on DAST1500 using different loss functions as the match metric (%)
Localization loss  Classification loss  Recall  Precision  F-measure
                                        82.8    90.1       86.3
                                        82.9    85.9       84.4
                                        85.5    87.8       86.6
TABLE 6 Single Scale test results (%) (on ICDAR2015 and MSRA-TD 500)
[Table 6 was rendered as an image in the original document]
TABLE 7 Single Scale test results (%) -on CTW1500 and Total-Text
[Table 7 was rendered as an image in the original document]
Although the present invention has been described with reference to the above embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A text detection system based on a multi-layer perceptron mask decoder, comprising:
a feature extraction module, consisting of a 50-layer residual network, for extracting visual features from the image;
a feature fusion module, consisting of a feature pyramid network (FPN), for enhancing the visual features by fusing the high-resolution low-level features with the semantically rich high-level features to obtain a feature map;
a candidate box generation network module, for setting anchor boxes on the feature map as initial representations of detection objects and obtaining candidate regions by classifying and regressing the anchor boxes;
a RoI-Align network module, configured to extract local region features from the candidate regions;
and a detection head network module comprising a detection branch network and a mask branch network, wherein the detection branch network generates a refined rectangular box and a confidence score for the text from the local region features, and the mask branch network generates arbitrary-shape text detection results from the refined rectangular boxes that pass a confidence threshold.
2. The system of claim 1, wherein the candidate region generation network is trained in advance, and an adaptive sample allocation strategy is used during training to assign the positive and negative samples of the training data.
3. The system of claim 2, wherein the adaptive sample allocation strategy comprises two steps, pre-allocation and allocation: in the pre-allocation step, each instance is allowed to match as many positive samples as possible; in the allocation step, for each instance, a loss function is used as the measure of match quality and the top k samples are selected as positive samples according to that quality.
4. The system of claim 1, wherein the mask branch network comprises an encoder and a decoder, the encoder consisting of four consecutive convolutional layers for further extracting visual features, and the decoder being a multi-layer perceptron consisting of two fully connected layers for mapping the visual features to a binary mask prediction and outputting arbitrary-shape text detection results.
5. A text detection method based on a multi-layer perceptron mask decoder, based on the system according to any one of claims 1-4, characterized by comprising the following steps:
inputting the image to be detected into the feature extraction module and extracting its visual features;
inputting the visual features into the feature fusion module, fusing the high-resolution low-level features with the semantically rich high-level features, and enhancing the visual features to obtain a feature map;
using the candidate box generation network module to set anchor boxes on the feature map as initial representations of detection objects, and classifying and regressing the anchor boxes to obtain candidate regions;
extracting local region features from the candidate regions with the RoI-Align network module;
and using the detection branch network of the detection head network module to generate a refined rectangular box and a confidence score for the text from the local region features, and using the mask branch network of the detection head network module to generate arbitrary-shape text detection results from the refined rectangular boxes that pass a confidence threshold.
6. The method of claim 5, wherein the candidate region generation network is trained in advance, and an adaptive sample allocation strategy is used during training to assign the positive and negative samples of the training data.
7. The method of claim 6, wherein the adaptive sample allocation strategy comprises two steps, pre-allocation and allocation: in the pre-allocation step, each instance is allowed to match as many positive samples as possible; in the allocation step, for each instance, a loss function is used as the measure of match quality and the top k samples are selected as positive samples.
8. The method of claim 5, wherein the encoder of the mask branch network further extracts visual features, and the decoder of the mask branch network maps the visual features to a binary mask prediction, outputting arbitrary-shape text detection results.
CN202111034219.XA 2021-09-03 2021-09-03 Character detection system and method based on multi-layer perceptron mask decoder Pending CN113963341A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111034219.XA CN113963341A (en) 2021-09-03 2021-09-03 Character detection system and method based on multi-layer perceptron mask decoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111034219.XA CN113963341A (en) 2021-09-03 2021-09-03 Character detection system and method based on multi-layer perceptron mask decoder

Publications (1)

Publication Number Publication Date
CN113963341A true CN113963341A (en) 2022-01-21

Family

ID=79460835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111034219.XA Pending CN113963341A (en) 2021-09-03 2021-09-03 Character detection system and method based on multi-layer perceptron mask decoder

Country Status (1)

Country Link
CN (1) CN113963341A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070040A (en) * 2020-09-11 2020-12-11 上海海事大学 Text line detection method for video subtitles
WO2021017998A1 (en) * 2019-07-26 2021-02-04 第四范式(北京)技术有限公司 Method and system for positioning text position, and method and system for training model
CN113095319A (en) * 2021-03-03 2021-07-09 中国科学院信息工程研究所 Multidirectional scene character detection method and device based on full convolution angular point correction network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021017998A1 (en) * 2019-07-26 2021-02-04 第四范式(北京)技术有限公司 Method and system for positioning text position, and method and system for training model
CN112070040A (en) * 2020-09-11 2020-12-11 上海海事大学 Text line detection method for video subtitles
CN113095319A (en) * 2021-03-03 2021-07-09 中国科学院信息工程研究所 Multidirectional scene character detection method and device based on full convolution angular point correction network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
詹琦梁; 陈胜勇; 胡海根; 李小薪; 周乾伟: "An instance segmentation scheme combining multiple image segmentation algorithms", Journal of Chinese Computer Systems, no. 04, 9 April 2020 (2020-04-09) *
陶月锋; 姜维; 张重生: "Research on the missed-detection problem of scene text detection algorithms", Journal of Henan University (Natural Science Edition), no. 05, 16 September 2020 (2020-09-16) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination