CN113963341A - Character detection system and method based on multi-layer perceptron mask decoder - Google Patents


Info

Publication number
CN113963341A
CN113963341A (application number CN202111034219.XA)
Authority
CN
China
Prior art keywords
network
features
mask
detection
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111034219.XA
Other languages
Chinese (zh)
Inventor
王伟平
秦绪功
周宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202111034219.XA priority Critical patent/CN113963341A/en
Publication of CN113963341A publication Critical patent/CN113963341A/en
Pending legal-status Critical Current

Classifications

    • G06F18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22 — Pattern recognition: matching criteria, e.g. proximity measures
    • G06F18/241 — Pattern recognition: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/253 — Pattern recognition: fusion techniques of extracted features
    (All under G — Physics; G06 — Computing, calculating or counting; G06F — Electric digital data processing.)


Abstract

The invention discloses a character detection system and method based on a multi-layer perceptron mask decoder, relating to the field of image text detection.

Description

Character detection system and method based on multi-layer perceptron mask decoder
Technical Field
The invention relates to the field of image text detection, in particular to a character detection system and method based on a multi-layer perceptron mask decoder.
Background
Text detection and recognition in scene images has been a research hotspot in recent years. Text detection is a key part of the pipeline: its task is to locate the text in an image so that a recognition module can transcribe it into a form a computer can edit. With the development of deep learning, text detection technology has advanced greatly. Detection of arbitrarily shaped scene text has become a popular research direction, and Mask R-CNN, owing to its excellent performance in object detection and instance segmentation, has gradually become one of the important baselines for arbitrary-shape scene text detection and end-to-end text recognition. Dense text, although common in real scenes, has received less attention in academia. At present, Mask R-CNN suffers an obvious performance drop when detecting dense text, the problem being that dense text instances cannot be distinguished effectively.
When a text detector based on the Mask R-CNN framework is applied to dense text, learning confusion during training prevents dense instances from being separated effectively. In addition, the RPN requires manually set anchor boxes to match text objects of different aspect ratios, so achieving the best results requires different settings on different datasets. Mask TextSpotter v3 instead generates candidate regions with a segmentation-based method; however, it needs a larger input scale to detect small-scale text, which slows inference, and on the other hand localization errors in the text core region accumulate in later stages.
Disclosure of Invention
The invention aims to provide a text detection system and method based on a multi-layer perceptron mask decoder that can effectively distinguish and extract dense, arbitrarily shaped text in images.
To achieve this aim, the invention adopts the following technical scheme:
a word detection system based on a multi-layer perceptron mask decoder comprises:
the characteristic extraction module is composed of a residual error network with 50 layers and is used for extracting visual characteristics on the image;
the feature fusion module is composed of a feature pyramid network FPN, and is used for enhancing the visual features by fusing the high-resolution features of the low-level features and the high-semantic information features of the high-level features in the visual features to obtain a feature map;
the candidate frame generation network module is used for setting an anchor frame on the characteristic diagram as an initial representation of a detection object and obtaining a candidate area by classifying and regressing the anchor frame;
a RoI-Align network module, configured to extract local area features from the candidate areas;
and the detection head network module comprises a detection branch network and a mask branch network, wherein the detection branch network is used for generating a corrected rectangular frame and a confidence score aiming at the characters according to the local region characteristics, and the mask branch network is used for generating character detection results in any shapes according to the corrected rectangular frame with the confidence threshold.
Further, the candidate region generation network is trained in advance, and an adaptive sample allocation strategy is used during training to assign the positive and negative samples of the training data.
Further, the adaptive sample allocation strategy comprises two steps, pre-allocation and allocation: in the pre-allocation step, each instance is allowed to match as many positive samples as possible; in the allocation step, for each instance, a loss function is used as the measure of match quality and the top k (k ∈ ℕ) samples are selected as positive samples according to that quality.
Further, the mask branch network comprises an encoder and a decoder: the encoder consists of four consecutive convolutional layers and further extracts visual features; the decoder is a multi-layer perceptron consisting of two fully connected layers, which maps the visual features to a binary mask prediction and outputs arbitrary-shape text detection results.
A text detection method based on a multi-layer perceptron mask decoder, based on the above system, comprises the following steps:
inputting the image to be detected into the feature extraction module and extracting its visual features;
inputting the visual features into the feature fusion module, fusing the high-resolution low-level features with the semantically rich high-level features, and enhancing the visual features to obtain a feature map;
using the candidate box generation network module to set anchor boxes on the feature map as initial representations of detection objects, and classifying and regressing the anchor boxes to obtain candidate regions;
extracting local region features from the candidate regions with the RoI-Align network module;
and using the detection branch network of the detection head network module to generate a refined rectangular box and a confidence score for the text from the local region features, and using the mask branch network of the detection head network module to generate arbitrary-shape text detection results from the refined rectangular boxes that pass a confidence threshold.
Further, the candidate region generation network is trained in advance, and an adaptive sample allocation strategy is used during training to assign the positive and negative samples of the training data.
Further, the adaptive sample allocation strategy comprises two steps, pre-allocation and allocation: in the pre-allocation step, each instance is allowed to match as many positive samples as possible; in the allocation step, for each instance, a loss function is used as the measure of match quality and the top k (k ∈ ℕ) samples are selected as positive samples.
Further, the encoder of the mask branch network further extracts visual features, and the decoder of the mask branch network maps the visual features to a binary mask prediction, outputting arbitrary-shape text detection results.
Compared with the prior art, the text detection method based on the multi-layer perceptron mask decoder provided by the invention uses the multi-layer perceptron mask decoder to distinguish different instances effectively. Because the decoder abandons weight sharing, the learning-confusion problem in the mask branch is reduced; at the same time, because global modeling and more context information are introduced, the predictions obtained are more compact and dense text can be distinguished effectively.
Drawings
FIG. 1 is a block diagram of the text detection system based on a multi-layer perceptron mask decoder according to the present invention.
FIG. 2 shows the structure of the detection head network, in which "conv" denotes convolution, "deconv" denotes deconvolution, and "FC" denotes a fully connected layer.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
This embodiment provides a text detection system based on a multi-layer perceptron mask decoder: a simple and effective network model for detecting dense, arbitrarily shaped scene text. The whole model consists of five parts, shown in FIG. 1: a feature extraction module, a feature fusion module, a candidate box generation network module, a RoI-Align network module, and a detection head network module.
The feature extraction module consists of a 50-layer residual network, which can extract rich visual features.
The feature fusion module consists of a feature pyramid network (FPN). It further enhances the expressive power of the features by fusing features from different levels, exploiting the high resolution of low-level features and the rich semantics of high-level features, to obtain a feature map.
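The top-down fusion an FPN performs can be sketched in a few lines of NumPy. This is a minimal illustration with made-up shapes and random weights, not the patented implementation: it reduces both levels to a common channel count with 1x1 convolutions, upsamples the high-level map, and adds the two element-wise.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_fuse(c_low, c_high, w_lateral, w_top):
    """Fuse a high-resolution low-level map with an upsampled high-level map.

    c_low:  (C_low, H, W)       low-level feature (high resolution)
    c_high: (C_high, H/2, W/2)  high-level feature (rich semantics)
    w_lateral, w_top: 1x1-conv weights mapping both to the same channel count.
    """
    lateral = np.einsum('oc,chw->ohw', w_lateral, c_low)   # 1x1 conv on low level
    top = np.einsum('oc,chw->ohw', w_top, c_high)          # 1x1 conv on high level
    return lateral + upsample2x(top)                        # element-wise fusion

rng = np.random.default_rng(0)
c2 = rng.standard_normal((64, 8, 8))    # hypothetical low-level feature
c3 = rng.standard_normal((128, 4, 4))   # hypothetical high-level feature
p2 = fpn_fuse(c2, c3, rng.standard_normal((32, 64)), rng.standard_normal((32, 128)))
print(p2.shape)  # (32, 8, 8)
```

The fused map keeps the low-level resolution while inheriting high-level semantics, which is exactly the property the feature fusion module exploits.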
the candidate frame generation network module is used for generating candidate regions, setting corresponding anchor frames (namely, candidate frames and rectangular frames) on the fused feature maps as initial representations of detection objects, and roughly recalling character regions by classifying and regressing the anchor frames to obtain the candidate regions (namely, regions of interest in fig. 1).
A RoI-Align network module, which is used to extract local area features from the candidate areas, and the network structure and the algorithm of the network are common knowledge, and those skilled in the art can understand;
the detection head network module consists of a detection branch network and a mask branch network. Wherein the detection branch outputs a confidence score and a modified rectangular box for the word, and the mask branch is used to generate an arbitrary shape as a word representation. The structure of the mask branches is further divided into encoder and decoder: four successive convolutions are first used to further extract visual features, which are divided into encoders, and the latter structure is used to generate a prediction from the features to a binary mask, which is divided into decoders. As shown in fig. 2, the decoder of the standard detection head network module (i.e. the mask head in fig. 2) is composed of a "deconvolution-convolution" structure, while the decoder proposed by the present invention is a multi-layered perceptron structure composed of a "full-connection-full-connection" structure.
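The proposed "fully connected-fully connected" decoder can be illustrated with a minimal NumPy sketch. The shapes and weights below are hypothetical; the point is only that every output pixel gets its own weights (no weight sharing, global receptive field), which is what distinguishes it from a shared deconvolution-convolution head:

```python
import numpy as np

def mlp_mask_decoder(feat, w1, b1, w2, b2, out_size=14):
    """Two fully connected layers mapping flattened RoI features to a mask.

    feat: (D,) flattened encoder output for one candidate region.
    Because each mask pixel has dedicated weights, the decoder sees the whole
    RoI at once, which helps separate nearby (dense) instances.
    """
    hidden = np.maximum(feat @ w1 + b1, 0.0)           # FC 1 + ReLU
    logits = hidden @ w2 + b2                          # FC 2: one logit per pixel
    probs = 1.0 / (1.0 + np.exp(-logits))              # sigmoid
    return (probs > 0.5).reshape(out_size, out_size)   # binarised mask

rng = np.random.default_rng(0)
d, h, m = 256, 128, 14                                 # hypothetical sizes
mask = mlp_mask_decoder(rng.standard_normal(d),
                        rng.standard_normal((d, h)), np.zeros(h),
                        rng.standard_normal((h, m * m)), np.zeros(m * m))
print(mask.shape)  # (14, 14)
```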
Based on this system, the embodiment provides a text detection method based on a multi-layer perceptron mask decoder. The whole process comprises the following steps:
1) input an image and extract visual features through the feature extraction module;
2) pass the extracted visual features through the feature fusion module to fuse features from different levels, i.e., fuse the high-resolution low-level features with the semantically rich high-level features, enhancing the visual features to obtain a feature map;
3) generate candidate regions from the feature map with the candidate region generation network, assigning positive and negative samples during training with the adaptive sample allocation strategy proposed by the invention. From the perspective of positive/negative sample allocation, the invention simplifies the anchor-box settings of the complex candidate region generation network while improving its performance, so text can be detected accurately at high speed. Specifically, manually set anchor boxes cannot adaptively match text with extreme aspect ratios: in essence, long text instances match no positive samples, or only a few. The adaptive sample allocation strategy solves this problem from the sample-allocation perspective in two steps: pre-allocation, then allocation. In the first step, each instance is allowed to match more positive samples, so that more positive samples can participate. In the second step, for each instance, a loss function is used as the measure of match quality and the top k samples are selected as positives according to that quality.
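Under the stated assumptions (a precomputed per-anchor matching loss where lower means a better match), the allocation step reduces to a per-instance top-k selection. A sketch, not the patented implementation:

```python
import numpy as np

def assign_positives(match_loss, k=5):
    """Adaptive sample allocation: per instance, keep the k best anchors.

    match_loss: (num_instances, num_anchors) loss of matching each anchor
                to each text instance (lower = better match).
    Returns a boolean (num_instances, num_anchors) positive-sample mask.
    """
    k = min(k, match_loss.shape[1])
    topk = np.argsort(match_loss, axis=1)[:, :k]   # k lowest losses per instance
    pos = np.zeros_like(match_loss, dtype=bool)
    np.put_along_axis(pos, topk, True, axis=1)
    return pos

loss = np.array([[0.2, 0.9, 0.1, 0.5],
                 [0.7, 0.3, 0.8, 0.4]])
pos = assign_positives(loss, k=2)
# instance 0 keeps anchors 0 and 2; instance 1 keeps anchors 1 and 3
```

Because the selection is relative per instance rather than a fixed IoU cutoff, even a long text line whose best anchors have mediocre overlap still receives k positives, which is why the anchor-box settings can be simplified.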
4) extract local region features from the candidate regions obtained in step 3) using RoI-Align;
5) input the local region features obtained in step 4) into the detection branch network to obtain refined rectangular boxes and detection confidence scores;
6) filter the detection results obtained in step 5) with a confidence threshold and input them into the mask branch network to obtain arbitrary-shape text detection results.
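The thresholding in step 6) amounts to a simple filter before the mask branch. A sketch; the threshold value 0.5 and the box format are illustrative assumptions:

```python
import numpy as np

def filter_by_confidence(boxes, scores, thresh=0.5):
    """Keep only refined boxes confident enough for the mask branch.

    boxes: (N, 4) refined rectangles as (x1, y1, x2, y2); scores: (N,).
    """
    keep = scores >= thresh
    return boxes[keep], scores[keep]

boxes = np.array([[0, 0, 10, 5], [3, 2, 8, 9], [1, 1, 4, 4]], dtype=float)
scores = np.array([0.9, 0.3, 0.7])
kept_boxes, kept_scores = filter_by_confidence(boxes, scores)
print(len(kept_boxes))  # 2
```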
To demonstrate the technical effect of the method (abbreviated "MAYOR"), the inventors carried out extensive experimental evaluation, training and testing MAYOR on five mainstream multi-oriented scene text datasets. DAST1500 is a dense arbitrary-shape text detection dataset of commodity images collected from the internet, mainly detailed descriptions on small commodity packages; it contains 1038 training images and 500 test images, with instances annotated at the text-line level as polygon boxes. The images of MSRA-TD500 contain large variations in angle and scale, with 300 training samples and 200 test samples. ICDAR2015 contains 1000 training images and 500 test images. CTW1500 contains 1000 training images and 500 test images, with 10751 text instances in total, 3530 of which are curved; each image has at least one curved text instance, and instances are annotated at the text-line level with 14-point polygons. Total-Text (TT) has 1255 training images and 300 test images containing curved and horizontal multi-oriented text; each text is annotated at the word level with a polygon box.
Table 1 compares decoders with different structures; the results demonstrate the effectiveness of the proposed multi-layer perceptron mask decoder. Table 2 compares the contributions of the modules of the invention; the results demonstrate the effect of the proposed multi-layer perceptron mask decoder and the adaptive sample allocation strategy. Tables 3, 4, and 5 show ablation experiments on the adaptive sample allocation strategy; the results show that the strategy is robust to hyper-parameter selection and can simplify the anchor-box settings.
Tables 2, 6, and 7 compare the invention with other mainstream methods; the invention achieves the best performance on multiple datasets, demonstrating its effectiveness.
TABLE 1 Performance comparison of decoders (%)
Decoder                          Recall  Precision  F-measure
Deconvolution-convolution        79.8    86.7       83.1
Deconvolution-local connection   84.0    87.9       85.9
Deconvolution-full connection    84.6    89.1       86.8
Full connection-full connection  85.5    87.8       86.6
TABLE 2 test results on DAST1500 (%)
[Table 2 was rendered as an image in the original document]
Note: in Table 2, "MRCNN", "ALA", and "MMD" denote Mask R-CNN, the adaptive sample allocation strategy, and the proposed multi-layer perceptron mask decoder, respectively.
TABLE 3 Results on DAST1500 with different values of k (%)
k          3     5     7     9     11    13    15
Recall     84.8  85.5  84.6  84.6  85.2  84.5  85.2
Precision  88.2  87.8  88.5  88.1  88.0  88.4  87.7
F-measure  86.5  86.6  86.5  86.3  86.6  86.4  86.4
TABLE 4 Performance on DAST1500 for different aspect ratios of the anchor frame (%)
[Table 4 was rendered as an image in the original document]
TABLE 5 Performance on DAST1500 using different loss functions as the match metric (%)
Localization loss  Classification loss  Recall  Precision  F-measure
                                        82.8    90.1       86.3
                                        82.9    85.9       84.4
                                        85.5    87.8       86.6
TABLE 6 Single Scale test results (%) (on ICDAR2015 and MSRA-TD 500)
[Table 6 was rendered as an image in the original document]
TABLE 7 Single Scale test results (%) -on CTW1500 and Total-Text
[Table 7 was rendered as an image in the original document]
Although the present invention has been described with reference to the above embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A text detection system based on a multi-layer perceptron mask decoder, comprising:
a feature extraction module, consisting of a 50-layer residual network, for extracting visual features from the image;
a feature fusion module, consisting of a feature pyramid network (FPN), for enhancing the visual features by fusing the high-resolution low-level features with the semantically rich high-level features to obtain a feature map;
a candidate box generation network module, for setting anchor boxes on the feature map as initial representations of detection objects and obtaining candidate regions by classifying and regressing the anchor boxes;
a RoI-Align network module, configured to extract local region features from the candidate regions;
and a detection head network module comprising a detection branch network and a mask branch network, wherein the detection branch network generates a refined rectangular box and a confidence score for the text from the local region features, and the mask branch network generates arbitrary-shape text detection results from the refined rectangular boxes that pass a confidence threshold.
2. The system of claim 1, wherein the candidate region generation network is trained in advance, and an adaptive sample allocation strategy is used during training to assign the positive and negative samples of the training data.
3. The system of claim 2, wherein the adaptive sample allocation strategy comprises two steps, pre-allocation and allocation: in the pre-allocation step, each instance is allowed to match as many positive samples as possible; in the allocation step, for each instance, a loss function is used as the measure of match quality and the top k samples are selected as positive samples according to that quality.
4. The system of claim 1, wherein the mask branch network comprises an encoder and a decoder, the encoder consisting of four consecutive convolutional layers for further extracting visual features, and the decoder being a multi-layer perceptron consisting of two fully connected layers for mapping the visual features to a binary mask prediction and outputting arbitrary-shape text detection results.
5. A text detection method based on a multi-layer perceptron mask decoder, based on the system according to any one of claims 1-4, characterized by comprising the following steps:
inputting the image to be detected into the feature extraction module and extracting its visual features;
inputting the visual features into the feature fusion module, fusing the high-resolution low-level features with the semantically rich high-level features, and enhancing the visual features to obtain a feature map;
using the candidate box generation network module to set anchor boxes on the feature map as initial representations of detection objects, and classifying and regressing the anchor boxes to obtain candidate regions;
extracting local region features from the candidate regions with the RoI-Align network module;
and using the detection branch network of the detection head network module to generate a refined rectangular box and a confidence score for the text from the local region features, and using the mask branch network of the detection head network module to generate arbitrary-shape text detection results from the refined rectangular boxes that pass a confidence threshold.
6. The method of claim 5, wherein the candidate region generation network is trained in advance, and an adaptive sample allocation strategy is used during training to assign the positive and negative samples of the training data.
7. The method of claim 6, wherein the adaptive sample allocation strategy comprises two steps, pre-allocation and allocation: in the pre-allocation step, each instance is allowed to match as many positive samples as possible; in the allocation step, for each instance, a loss function is used as the measure of match quality and the top k samples are selected as positive samples.
8. The method of claim 5, wherein the encoder of the mask branch network further extracts visual features, and the decoder of the mask branch network maps the visual features to a binary mask prediction, outputting arbitrary-shape text detection results.
CN202111034219.XA 2021-09-03 2021-09-03 Character detection system and method based on multi-layer perceptron mask decoder Pending CN113963341A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111034219.XA CN113963341A (en) 2021-09-03 2021-09-03 Character detection system and method based on multi-layer perceptron mask decoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111034219.XA CN113963341A (en) 2021-09-03 2021-09-03 Character detection system and method based on multi-layer perceptron mask decoder

Publications (1)

Publication Number Publication Date
CN113963341A true CN113963341A (en) 2022-01-21

Family

ID=79460835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111034219.XA Pending CN113963341A (en) 2021-09-03 2021-09-03 Character detection system and method based on multi-layer perceptron mask decoder

Country Status (1)

Country Link
CN (1) CN113963341A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070040A (en) * 2020-09-11 2020-12-11 上海海事大学 Text line detection method for video subtitles
WO2021017998A1 (en) * 2019-07-26 2021-02-04 第四范式(北京)技术有限公司 Method and system for positioning text position, and method and system for training model
CN113095319A (en) * 2021-03-03 2021-07-09 中国科学院信息工程研究所 Multidirectional scene character detection method and device based on full convolution angular point correction network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021017998A1 (en) * 2019-07-26 2021-02-04 第四范式(北京)技术有限公司 Method and system for positioning text position, and method and system for training model
CN112070040A (en) * 2020-09-11 2020-12-11 上海海事大学 Text line detection method for video subtitles
CN113095319A (en) * 2021-03-03 2021-07-09 中国科学院信息工程研究所 Multidirectional scene character detection method and device based on full convolution angular point correction network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
詹琦梁; 陈胜勇; 胡海根; 李小薪; 周乾伟: "An instance segmentation scheme combining multiple image segmentation algorithms", Journal of Chinese Computer Systems, no. 04, 9 April 2020 (2020-04-09) *
陶月锋; 姜维; 张重生: "Research on the missed-detection problem of scene text detection algorithms", Journal of Henan University (Natural Science Edition), no. 05, 16 September 2020 (2020-09-16) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination