CN114926826A - Scene text detection system - Google Patents

Scene text detection system

Info

Publication number
CN114926826A
CN114926826A · Application CN202210451005.0A
Authority
CN
China
Prior art keywords
feature
branch
attention
channel
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210451005.0A
Other languages
Chinese (zh)
Inventor
玛依热·依布拉音
李媛
艾斯卡尔·艾木都拉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinjiang University
Original Assignee
Xinjiang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinjiang University filed Critical Xinjiang University
Priority to CN202210451005.0A priority Critical patent/CN114926826A/en
Publication of CN114926826A publication Critical patent/CN114926826A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of text detection and provides a scene text detection system comprising an image acquisition unit, a feature extraction unit, a feature fusion unit and a differentiable binarization module. The feature extraction unit extracts a feature map of the original image using ResNet, with a residual correction branch (RCB) embedded in the ResNet backbone network. After ResNet applies a conventional convolution to the original image to obtain the input features, the residual correction branch forms two branches: one branch converts the input features into a low-dimensional embedding through downsampling, and this embedding calibrates the convolution transformation of the convolution kernel in the other branch, finally yielding the feature map of the original image. The differentiable binarization module then determines the target text region in the image from the target feature map. The invention introduces the residual correction branch (RCB) to expand the receptive field and improve the ability to acquire context information, thereby capturing context over a larger receptive field.

Description

Scene text detection system
Technical Field
The invention belongs to the field of text detection, and particularly relates to a scene text detection system.
Background
Text is one of the indispensable means of conveying information in today's world, and textual information of all kinds appears throughout everyday social scenes. Natural scene text detection locates the text regions in an image with a detection network and represents them with polygonal bounding boxes. Accurate detection results enable a wide range of practical applications, such as instant translation, image retrieval, scene parsing, geo-localization and license plate recognition, and the task has attracted attention in the fields of computer vision and document analysis. In recent years, with the rapid development of convolutional neural networks (CNNs), scene text detection has made great progress. Existing CNN-based text detection algorithms can be roughly divided into two categories: regression-based methods and segmentation-based methods.
Regression-based scene text detection algorithms usually represent text in the form of a rectangular box or a quadrilateral box with a specific orientation. Although they are fast and avoid accumulating errors across multiple stages, most existing regression-based methods cannot solve the text detection problem accurately and effectively because of their limited text representations (axis-aligned rectangles, rotated rectangles or quadrilaterals). In particular, they perform poorly when detecting arbitrarily shaped text on datasets such as Total-Text, which is a serious disadvantage for the subsequent text recognition stage of a complete optical character recognition engine.
Segmentation-based scene text detection algorithms mainly locate text instances by classifying pixels. Recent approaches have brought significant improvements to scene text detection, and research has shifted from horizontal text to multi-oriented text and on to the more challenging arbitrarily shaped text (e.g., curved text). Nevertheless, detecting arbitrarily shaped scene text remains difficult: beyond the general properties of natural images, such as image blur and lighting conditions, scene text differs markedly from generic target objects through large variations in specific properties such as color, scale, orientation, aspect ratio and shape.
Text in natural scenes carries rich and explicit semantic information, and extracting text information from scene images quickly and accurately with computer technology is one of the popular research topics in computer vision and pattern recognition. Scene text detection is the foundation of text recognition and is widely applied in people's daily life and production. Compared with conventional OCR, text detection in natural scene images faces many difficulties and challenges, such as complicated backgrounds, diverse text scales and fonts, and uncertain image quality. In recent years, with the rapid development of deep learning, deep learning methods have shown clear benefits on the text detection task. Existing convolutional neural networks have good representation ability, but their receptive fields are insufficient and their localization ability is weak, so text is localized inaccurately and false or missed detections easily occur on long or large text. On the other hand, although the feature pyramid network can fuse features of different scales, the high-level semantic information of small-scale text is lost at the upper levels of the network, so the model's ability to detect multi-scale text is limited.
Text information in natural scenes is generally diverse and irregular, which makes detecting arbitrarily shaped text in natural scenes complex. Traditional natural scene text detection methods lack robustness because they rely on hand-crafted features, while existing deep-learning-based text detection methods lose important feature information as features are extracted layer by layer. Segmentation-based text detection has been one of the most popular approaches in recent years, since segmentation results describe scene text of various shapes more intuitively. The original DB (Differentiable Binarization) algorithm uses a differentiable binarization function to simplify post-processing and solves the non-differentiability problem during training, improving the efficiency of scene text detection; however, it does not make full use of the semantic and spatial information in the network, which limits the network's classification and localization abilities. Although segmentation-based algorithms have advantages in detecting arbitrarily shaped text, a lack of sufficient context information can still cause false positives or missed detections.
Disclosure of Invention
The invention aims to provide a scene text detection system addressing the problems of the prior art, in which the DBNet text detection network makes insufficient use of the semantic and spatial information in the network, its classification and localization abilities are limited, and a lack of sufficient context information causes false alarms or missed detections, so that the network can obtain deeper semantic information and well-defined key text features during feature extraction.
To solve this technical problem, the invention adopts the following technical scheme. The scene text detection system comprises an image acquisition unit, a feature extraction unit, a feature fusion unit and a differentiable binarization module, characterized in that:
the image acquisition unit is used for acquiring an original image;
the feature extraction unit is used for extracting a feature map of the original image using ResNet, with a residual correction branch embedded in the ResNet backbone network; the residual correction branch is used for forming two branches after ResNet applies a conventional convolution to the original image to obtain the input features; one branch converts the input features into a low-dimensional embedding through downsampling, and the low-dimensional embedding is used for calibrating the convolution transformation of the convolution kernel in the other branch, finally obtaining the feature map of the original image;
the feature fusion unit is used for performing feature fusion on the feature map by using the FPN to finally obtain a target feature map;
and the differentiable binarization module is used for determining the target text region in the image according to the target feature map.
In an embodiment of the present invention, the two branches of the residual correction branch are a first branch and a second branch, respectively;
the first branch is used for applying a conventional convolution to the input features to extract the first branch features;
the second branch is used for downsampling the input features by a factor of r with average pooling, then applying a convolution, upsampling, and finally a Sigmoid activation function to obtain the second branch features;
the residual correction branch is also used for taking the element-wise product of the first branch features and the second branch features to obtain the output features; after the output features are added to the original input (a residual connection), the feature map of the original image is obtained through a ReLU activation function.
In an embodiment of the invention, average pooling is used to downsample by a factor of r, with the following formula:

x'_2 = AvgPool_r(x_2)

where x_2 is the input feature of the second branch, x'_2 is the transformed feature of the second branch, and r is 4.
In an embodiment of the present invention, the second branch features obtained after the Sigmoid activation function are computed as follows:

y_2 = σ(Up(k_2 * x'_2))

where y_2 is the second branch feature; σ is the Sigmoid activation function; Up(·) is nearest-neighbor interpolated upsampling; x'_2 is the transformed feature of the second branch; and k_2 represents a convolution operation.
In an embodiment of the present invention, the first branch features are computed as follows:

y_1 = k_1 * x_1

where y_1 is the first branch feature; x_1 is the input feature of the first branch; and k_1 represents a convolution operation.
In an embodiment of the invention, a dual-branch attention feature fusion module is embedded in the FPN structure;
the dual-branch attention feature fusion module is used for enhancing the feature expression of multi-scale scene text, thereby improving detection accuracy.
In the embodiment of the invention, the dual-branch attention feature fusion module comprises a global feature channel and a local feature channel;
the FPN is used for carrying out initial fusion on any two feature maps of the original image to obtain initial fusion features;
the global feature channel is used for performing global average pooling on the initial fusion features and then performing convolution to extract the global feature channel attention;
the local feature channel is used for performing convolution on the initial fusion features to extract the attention of the local feature channel;
the dual-branch attention feature fusion module is also used for summing the global feature channel attention and the local feature channel attention, activating the sum, and multiplying it element-wise with the larger-size feature map among the feature maps of the original image, finally determining the target feature map.
In an embodiment of the present invention, the global feature channel attention is computed as follows:

g(X) = B(PWConv_2(δ(B(PWConv_1(Avg(X))))))

where g(X) represents the global feature channel attention; B represents a BatchNorm layer; PWConv represents point-wise convolution; δ represents the ReLU activation function; X represents the initial fusion features; and Avg denotes global average pooling.
In an embodiment of the present invention, the local feature channel attention is computed as follows:

L(X) = B(PWConv_2(δ(B(PWConv_1(X)))))

where L(X) represents the local feature channel attention; B represents a BatchNorm layer; PWConv represents point-wise convolution; δ denotes the ReLU activation function; and X denotes the initial fusion features.
In an embodiment of the invention, the global feature channel attention and the local feature channel attention are summed, the sum is activated, and the result is multiplied element-wise with the larger-size feature map among the feature maps of the original image. The target feature map is obtained as follows:

X' = σ(g(X) ⊕ L(X)) ⊗ P

where X' represents the target feature map; σ(g(X) ⊕ L(X)) represents the attention weight; P represents the larger-size feature map among the feature maps of the original image; σ represents the Sigmoid activation function; ⊕ denotes broadcast addition and ⊗ denotes element-wise multiplication; g(X) denotes the global feature channel attention; and L(X) denotes the local feature channel attention.
The invention has the following beneficial effects. The feature extraction network is improved on the basis of the DBNet algorithm: the improved lightweight ResNet feature extraction network and a better feature fusion method effectively fuse features of different depths to guide segmentation. The ResNet introduces a residual correction branch (RCB) to expand the receptive field and improve the ability to acquire context information, thereby capturing context over a larger receptive field. Meanwhile, to improve the efficiency with which features are used, a dual-branch attention feature fusion (TB-AFF) module is added to the FPN structure; it combines global and local attention mechanisms to locate text regions precisely and to accurately detect text positions in natural scenes. Finally, the differentiable binarization module incorporates the binarization step into model training, adaptively sets the binarization threshold, and converts the probability map produced by the segmentation method into text regions, yielding a better text detection result. The whole model not only guarantees the quality of feature extraction but, being a lightweight network, also strikes a good balance between speed and accuracy. Without sacrificing speed, the method enlarges the network's receptive field, learns more precise text position information, and localizes text regions more accurately.
Drawings
Fig. 1 is a structural diagram of a scene text detection system in embodiment 1 of the present invention.
Fig. 2 is a diagram of a residual error correction branch circuit in embodiment 1 of the present invention.
Fig. 3 is a structural diagram of a dual-branch attention feature fusion module in embodiment 1 of the present invention.
Fig. 4 is a structural diagram of the differentiable binarization in embodiment 1 of the present invention.
Fig. 5 shows visualization results on different types of text examples in embodiment 2 of the present invention.
Fig. 6 shows visualization results of the Baseline and of the present invention in embodiment 2.
Fig. 7 shows visualization results on examples of different types of text in embodiment 2 of the present invention.
Detailed Description
Example 1
The invention provides a scene text detection system addressing the problems of the prior art, in which the DBNet text detection network does not make full use of the semantic and spatial information in the network, its classification and localization abilities are limited, and a lack of sufficient context information causes false alarms or missed detections. The system comprises an image acquisition unit, a feature extraction unit, a feature fusion unit and a differentiable binarization module, where the image acquisition unit is used to acquire the original image; the workflow diagram is shown in fig. 1.
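For orientation, the following is a minimal PyTorch sketch of how the four units could be composed into one detector. The class and argument names are illustrative assumptions, not the patent's reference implementation; the backbone, FPN and DB head are treated as opaque modules that are detailed in the sections below.

```python
import torch.nn as nn

class SceneTextDetector(nn.Module):
    """Hypothetical composition of the four units described above."""
    def __init__(self, backbone: nn.Module, fpn: nn.Module, db_head: nn.Module):
        super().__init__()
        self.backbone = backbone  # ResNet with embedded RCB (feature extraction unit)
        self.fpn = fpn            # FPN with TB-AFF (feature fusion unit)
        self.db_head = db_head    # differentiable binarization module

    def forward(self, image):
        feats = self.backbone(image)   # multi-scale feature maps of the original image
        fused = self.fpn(feats)        # target feature map after feature fusion
        return self.db_head(fused)     # probability map and threshold map
```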
1. As for the feature extraction unit, the following is introduced:
the system extracts the feature map of the original image using the Resnet, which embeds a Residual Correction Branch (RCB) in the Resnet backbone network. When the system works, after Resnet performs conventional convolution on an original image to obtain input characteristics, residual errors correct two branches in the branch; one branch converts input features into low-dimensional embedding through downsampling, the low-dimensional embedding is used for calibrating convolution transformation of a convolution kernel in the other branch, and finally a feature map of an original image is obtained. In particular, the Residual Correction Branch (RCB) does not just perform a traditional convolution on the input in the original space, but first converts the input by downsampling into a low-dimensional embedding, from which the convolution transformation of the convolution kernel in the other branch is calibrated. By virtue of the communication between the convolution and the convolution kernel, each point in the space has the information of the area nearby the point and the interaction information on the channel, so that the interference of irrelevant areas in the whole global information is avoided. Meanwhile, the receptive field of each spatial position can be effectively expanded, so that more context information can be concerned.
The structure of the residual correction branch (RCB) is shown in fig. 2; it includes a first branch and a second branch;
the first branch is used for applying a conventional convolution to the input features to extract the first branch features;
the second branch is used for downsampling the input features by a factor of r with average pooling, then applying a convolution and upsampling, and finally obtaining the second branch features through a Sigmoid activation function;
the second branch firstly adopts average pooling downsampling r times, and the calculation formula is as follows:
x′ 2 =AvgPool r (x 2 )
wherein x is 2 Is an input characteristic of the second branch; x' 2 Converting the characteristics; r is 4.
The second branch features obtained after the Sigmoid activation function are computed as follows:

y_2 = σ(Up(k_2 * x'_2))

where y_2 is the second branch feature; k_2 represents a convolution operation; σ is the Sigmoid activation function; and Up(·) is nearest-neighbor interpolated upsampling, whose purpose is to map the intermediate result from the small-scale space back to the original feature space. The Sigmoid activation function increases the nonlinearity of the neural network model and thus its ability to fit nonlinear relationships in the samples. Compared with a standard convolution on the original branch, the residual correction branch can adaptively establish dependencies on the surroundings of each channel and spatial position, letting each position treat its surrounding information, viewed from the latent space, as a scalar response relative to the original scale space. This produces more discriminative features and extracts richer context information, so the residual correction branch effectively expands the receptive field of the network.
The first branch features are computed as follows:

y_1 = k_1 * x_1

where y_1 is the first branch feature; x_1 is the input feature of the first branch; and k_1 represents a convolution operation. The input features of the first branch are the same as those of the second branch: both are the input features obtained after ResNet applies a conventional convolution to the original image.
Then, the element-wise product of the first branch features and the second branch features is taken to obtain the output features; after the output features are added to the initial input of the module (a residual connection), the feature map of the original image is obtained through a ReLU activation function.
The residual correction branch (RCB) can generate a global receptive field and fully acquire the context information of the segmented image. Applied to a convolution layer, it greatly enlarges the field of view, achieving the goal of amplifying the convolutional receptive field, which helps capture the whole discriminative region. It lets each spatial position adaptively encode informative context from surrounding regions, increasing the feature extraction capability. At the same time, it strengthens the information exchange among channels, producing richer and more discriminative feature representations, which further enhances the diversity of the output features and improves the performance of the convolutional network.
On the other hand, the residual correction branch does not collect global context indiscriminately; it only considers the context around each channel and spatial position, which to some extent avoids contaminating information from irrelevant (non-text) regions. The target object can therefore be located accurately. Moreover, as can be seen from the figure, the residual correction branch module is highly general and convenient to use, and can easily be applied to a standard convolution layer. Furthermore, most attention-based or non-local methods generally require additional learnable parameters to build the corresponding modules, which are then inserted into the building blocks. In contrast, the residual correction branch does not rely on any additional learnable parameters, is suitable for a variety of tasks, and can easily be embedded into modern classification networks.
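The following PyTorch sketch puts the two branches together according to the formulas above. It is a minimal interpretation under stated assumptions: the 3×3 kernel sizes and the single-module layout are illustrative, and r = 4 follows the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualCorrectionBranch(nn.Module):
    """Sketch of the RCB: a conventional branch calibrated by a
    downsample-convolve-upsample-Sigmoid branch, with a residual connection."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.r = r
        self.k1 = nn.Conv2d(channels, channels, 3, padding=1)  # first branch: conventional conv
        self.k2 = nn.Conv2d(channels, channels, 3, padding=1)  # second branch: conv on the embedding

    def forward(self, x):
        y1 = self.k1(x)                           # y_1 = k_1 * x_1
        x2 = F.avg_pool2d(x, self.r)              # x'_2 = AvgPool_r(x_2)
        y2 = torch.sigmoid(F.interpolate(         # y_2 = sigmoid(Up(k_2 * x'_2))
            self.k2(x2), size=x.shape[2:], mode="nearest"))
        out = y1 * y2                             # element-wise calibration of the first branch
        return F.relu(out + x)                    # residual add with the module input, then ReLU
```

Because the calibration weights are computed on an r-times-downsampled view, each output position is modulated by context from a neighborhood roughly r times larger than the kernel itself, which is the source of the enlarged receptive field described above.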
2. With regard to the feature fusion unit, the following is introduced:
the feature fusion unit of this example is configured to perform feature fusion on the feature map using the FPN, and finally obtain a target feature map.
In the FPN structure, deeper features have more channels, but the features of each layer are propagated from top to bottom during fusion, so the top-level features undergo the largest channel reduction. Reducing feature channels inevitably loses context information, so the highest-level features tend to lose the most, yet the contextual semantic information of the image plays a crucial role in segmentation networks.
To retain more context information, this example targets the most common FPN usage scenario, long skip connections, and adds a dual-branch attention feature fusion (TB-AFF) module to the FPN. This makes full use of the features extracted at every layer of the network to handle the scale variation of text, retains more deep feature information, and improves the performance of the pyramid features. Specifically, adding TB-AFF to the FPN yields an attention network that we call the multi-scale attention fusion network (MSAFN). Its structure is shown in fig. 3: the dual-branch attention feature fusion (TB-AFF) module is embedded in the FPN structure, which enhances the feature expression of multi-scale scene text and improves detection accuracy.
The dual-branch attention feature fusion (TB-AFF) module handles the feature fusion arising from long skip connections. It combines the local and global features in the CNN, draws on the idea of spatial attention, and aggregates multi-scale feature context information inside an attention module. The generated fusion weights have the same size as the feature map, so the selection is dynamic and element-wise, which suits the most common scenarios.
The dual-branch attention feature fusion (TB-AFF) module comprises a global feature channel and a local feature channel. The global feature channel is based on SENet, but the fully connected layers are replaced by point-wise convolutions, i.e., ordinary convolutions with kernel size 1. The local feature channel also uses point-wise convolution to extract the channel attention of local features. SENet uses only global channel attention and is biased toward global-scope context, whereas the proposed TB-AFF additionally aggregates local channel context attention, which helps the network suppress background clutter and detect small targets. By adding a cross-layer connection in the dual-branch attention feature fusion (TB-AFF) module, multi-scale feature information can complement each other to obtain a final representation that reflects the context information.
The working process is as follows. The FPN performs an initial fusion of any two feature maps of the original image to obtain the initial fusion features. The global feature channel applies global average pooling to the initial fusion features and then convolution to extract the global feature channel attention. The local feature channel applies convolution to the initial fusion features to extract the local feature channel attention, preserving details. The global feature channel attention and the local feature channel attention are then summed, the sum is activated, and the result is multiplied element-wise to finally determine the target feature map. Through the TB-AFF module the global and local feature channel attentions are fused, the features at every text position on the feature map receive attention adjustment, the features are updated by a weighted sum of the aggregated features of all positions, and the focus is placed on text regions.
The initial fusion features are processed by global average pooling and then by convolution to extract the global channel context:

g(X) = B(PWConv_2(δ(B(PWConv_1(Avg(X))))))

where g(X) represents the global channel context; B represents a BatchNorm layer; PWConv denotes point-wise convolution; δ denotes the ReLU activation function; X denotes the initial fusion features; and Avg represents global average pooling. The channel attention mechanism uses point-wise convolution, changing the direction of the convolution by progressively compressing the channels, and assigns larger weights to channels that respond strongly in text regions. The difference from L(X) is that the input X first undergoes global average pooling (GAP) to obtain global attention information.
Likewise, a visual attention layer strengthens the extraction of local detail. The channel attention of the local features, L(X), is also extracted by point-wise convolution. Convolving the initial fusion features to extract the local channel context is computed as:

L(X) = B(PWConv_2(δ(B(PWConv_1(X)))))

where L(X) represents the local channel context; B represents a BatchNorm layer; PWConv denotes point-wise convolution; δ denotes the ReLU activation function; and X denotes the initial fusion features. L(X) has the same shape as the input features and can retain and highlight subtle details in the low-level features.
The global and local attentions are then combined to decide which features need attention. After the global channel context and the local channel context are added, the sum is activated and multiplied element-wise with the larger-size feature map among the feature maps of the original image. The target feature map is obtained as:

X' = T(X) ⊗ P, with T(X) = σ(g(X) ⊕ L(X))

where X' represents the target feature map; T(X) denotes the attention weight; and P represents the larger-size feature map among the feature maps of the original image. Considering that the learned feature vector highlighting key regions may have limitations, the feature vector is added element-wise to the original input feature vector to learn more comprehensive features. σ represents the Sigmoid activation function; activating with Sigmoid keeps every element of the attention channel in [0, 1], which lets the attention module enhance useful image information and suppress useless information. Here the global feature channel attention uses a global average pooling operation, so its feature map has height and width 1 × 1, while the local feature channel attention keeps the same height and width as the input features; the sum of the two (⊕) is therefore a broadcast addition. ⊗ denotes element-wise multiplication, i.e., multiplication of corresponding elements of two feature maps.
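A compact PyTorch sketch of TB-AFF consistent with the formulas above. The channel-reduction ratio inside the point-wise convolutions is an assumed hyperparameter (the text does not specify it), and the interface, taking the initial fusion features X and the larger-size map P, is an illustrative reading of the working process; X and P are assumed to share the same spatial size, the smaller map having been upsampled before the initial fusion.

```python
import torch
import torch.nn as nn

class TBAFF(nn.Module):
    """Sketch of dual-branch attention feature fusion: X' = sigmoid(g(X) + L(X)) * P."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction

        def pw_stack(use_gap: bool) -> nn.Sequential:
            layers = [nn.AdaptiveAvgPool2d(1)] if use_gap else []
            layers += [nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                       nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels)]
            return nn.Sequential(*layers)

        self.g = pw_stack(use_gap=True)    # g(X): GAP -> PWConv -> BN -> ReLU -> PWConv -> BN
        self.l = pw_stack(use_gap=False)   # L(X): PWConv -> BN -> ReLU -> PWConv -> BN

    def forward(self, x, p):
        # x: initial fusion features; p: larger-size feature map from the pyramid.
        # g(x) is 1x1 spatially, so the addition below broadcasts over H x W.
        w = torch.sigmoid(self.g(x) + self.l(x))
        return w * p
```

Because g(X) collapses to 1 × 1 while L(X) keeps the input resolution, the broadcast sum gives every spatial position a weight informed by both global statistics and local detail.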
In summary, the dual-branch attention feature fusion (TB-AFF) module combines local and global feature information from two input features and extracts attention weights using two feature maps of different scales. Its main contributions are as follows:
(1) It addresses the scale problem in channel attention: TB-AFF attends to channel scale through point-wise convolution rather than through convolution kernels of different sizes. Using point-wise convolution keeps TB-AFF as lightweight as possible.
(2) TB-AFF aggregates global and local feature context information not in the backbone network but in the attention module of the feature pyramid network (FPN).
3. With respect to the differentiable binarization module, the following is introduced:
the system uses a segmentation network (segmentation network) to segment the target feature map to generate a probability map (probability map P), wherein P belongs to R H×W Wherein H and W represent the height and width of the input image, respectively, and a binarization function is crucial to convert the probability map into a binary map, and the standard binarization function is as follows:
Figure BDA0003618624920000102
among them, a pixel having a value of 1 is considered as a valid text region. t is a set threshold value, and (i, j) represents a coordinate point in the graph. The standard binarization function is not trivial and therefore cannot be optimized with the segmentation of the network. In order to solve the problem that the binarization function is not differentiable, the following formula is used for binarization in this example:
B'_{i,j} = 1 / (1 + e^{-k(P_{i,j} - T_{i,j})})

where B' is the approximate binary map, T is the adaptive threshold map learned from the network, and k is an amplification factor whose role is to amplify the gradients propagated backward during training; this benefits most mispredicted regions and produces more distinct predictions. This example sets k to 50. The approximate binarization function behaves like the standard one but is differentiable, so it can be optimized together with the segmentation network during training. Through differentiable binarization the threshold T can be set adaptively, which distinguishes foreground from background well and separates tightly connected text instances.
Specifically, the probability map P and the threshold map T are predicted from the features F; the differentiable binarization module combines the probability map and the threshold map into a binary map, adaptively predicting the threshold at every position. Finally, a detection box for the text is formed from the approximate binary map via its bounding box. The structure of the differentiable binarization is shown in fig. 4: path 1 represents the standard binarization process, the dashed lines indicate inference only, and path 2 is the differentiable binarization used in this example, which adaptively predicts a threshold for every position of the image.
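The two binarization variants above are each a one-liner; the sketch below contrasts them. The inference threshold t = 0.3 is an assumed value, not taken from the text.

```python
import torch

def standard_binarization(prob_map: torch.Tensor, t: float = 0.3) -> torch.Tensor:
    """Hard threshold (path 1); non-differentiable, inference only."""
    return (prob_map >= t).float()

def differentiable_binarization(prob_map: torch.Tensor,
                                thresh_map: torch.Tensor,
                                k: float = 50.0) -> torch.Tensor:
    """Approximate binary map B' = 1 / (1 + exp(-k (P - T))) (path 2)."""
    return torch.sigmoid(k * (prob_map - thresh_map))
```

With k = 50 the sigmoid is nearly a step function, yet its gradient with respect to both P and T remains defined, which is what allows the threshold map to be learned end to end.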
The loss function plays a crucial role in deep neural networks. This example optimizes the network with an L1 loss function and a binary cross-entropy loss function. During training, the loss consists of three parts: the probability map loss L_s, the binary map loss L_b, and the adaptive threshold map loss L_t, expressed as:

L = L_s + α × L_b + β × L_t
where α and β are weight parameters, with α set to 1 and β set to 10. The probability map loss L_s and the binary map loss L_b both adopt the binary cross-entropy loss, with hard negative mining to overcome the imbalance between positive and negative samples:

L_s = L_b = -Σ_{i∈S_l} [y_i log x_i + (1 - y_i) log(1 - x_i)]

where S_l denotes the sampled set with a positive-to-negative sample ratio of 1:3. The adaptive threshold map loss L_t adopts the L1 loss function:
L_t = Σ_{i∈R_d} |y*_i - x*_i|

where R_d is the set of indexes of the pixels within the region, and y* is the label of the adaptive threshold map.
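A sketch of the combined loss under the weights above (α = 1, β = 10, positive:negative = 1:3). Tensor shapes, the threshold-region mask, and the argument names are assumptions made to keep the example self-contained.

```python
import torch
import torch.nn.functional as F

def db_loss(prob, binary, thresh, gt, thresh_gt, thresh_mask,
            alpha=1.0, beta=10.0, neg_ratio=3):
    """L = L_s + alpha * L_b + beta * L_t, with hard negative mining for the BCE terms."""
    def balanced_bce(pred, target):
        pos = target > 0.5
        neg = ~pos
        n_pos = int(pos.sum())
        n_neg = min(int(neg.sum()), n_pos * neg_ratio)      # keep positives:negatives = 1:3
        loss = F.binary_cross_entropy(pred, target, reduction="none")
        hard_neg, _ = loss[neg].flatten().topk(n_neg)       # hardest negatives only
        return (loss[pos].sum() + hard_neg.sum()) / (n_pos + n_neg + 1e-6)

    l_s = balanced_bce(prob, gt)                            # probability map loss
    l_b = balanced_bce(binary, gt)                          # approximate binary map loss
    l_t = (torch.abs(thresh - thresh_gt) * thresh_mask).sum() / (thresh_mask.sum() + 1e-6)  # L1 over R_d
    return l_s + alpha * l_b + beta * l_t
```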
In conclusion, the differentiable binarization module can effectively determine the target text area in the image according to the target feature map.
Example 2
To verify the effectiveness of the scene text detection system of the present invention, this example also ran experiments on three challenging public datasets: the multi-oriented text dataset ICDAR2015, the curved text dataset Total-Text, and the multilingual text dataset MSRA-TD500. The visualization of this example's method on different types of text examples is shown in fig. 5, including curved text ((e) and (f)), multi-oriented text ((a) and (b)) and multilingual text ((c) and (d)). For each cell in fig. 5, the probability map is in the second column, the threshold map in the third column, and the binarized map in the fourth column.
1. Training arrangement
The experiments in this example use Python 3.7 as the programming language, and the deep learning framework is PyTorch 1.5. The model is trained with the Adam optimizer, cosine learning-rate decay is used as the schedule, the initial learning rate is 0.001, and the training batch size is 16. The training data are augmented by random rotation within (-10°, 10°), random cropping and flipping, and all pictures are resized to 640 × 640. All experiments were performed on a TITAN RTX; for the experiments reported here the initial learning rate was set to 0.007. On these three datasets, all models were trained under the same strategy and tested under the same settings, which will not be elaborated further here.
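A minimal sketch of the optimizer and scheduler pairing described above. The epoch count is an assumption, and of the two initial learning rates given in the text (0.001 and 0.007) this sketch defaults to 0.007.

```python
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

def make_training_setup(model: nn.Module, epochs: int = 1200, lr: float = 0.007):
    """Adam + cosine learning-rate decay, as in the training settings above.
    The batch size of 16 and 640x640 crops are handled by the data loader, not here."""
    optimizer = Adam(model.parameters(), lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)  # step once per epoch
    return optimizer, scheduler
```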
2. Experiments and discussion
To better demonstrate the effect of the modules proposed in this example, detailed ablation studies were performed on the multi-oriented text dataset ICDAR2015, the curved text dataset Total-Text and the multilingual text dataset MSRA-TD500, considering three performance measures: precision, recall, and the comprehensive evaluation index (F-measure). These evaluate the detection performance of the model and demonstrate the influence of the residual correction branch (RCB) and the dual-branch attention feature fusion (TB-AFF) module proposed in this example. All experiments during network training were carried out in the same environment; a check mark (√) indicates that the corresponding method is used. The results are shown in Tables 1 to 3.
Table 1 Test results on the ICDAR2015 dataset (the tabular data appear only as an image in the original publication)
Table 2 Test results on the Total-Text dataset (the tabular data appear only as an image in the original publication)
Table 3 Test results on the MSRA-TD500 dataset (the tabular data appear only as an image in the original publication)
As can be seen from Tables 1, 2 and 3, on the ICDAR2015, Total-Text and MSRA-TD500 datasets the recall and the comprehensive evaluation index improve to different degrees after the RCB module and/or the TB-AFF module are added. Moreover, the detection performance of the network combining the advantages of both modules is superior to that of a network applying the RCB module or the TB-AFF module alone.
In the RCB module, self-calibration is achieved by introducing an average-pooling downsampling operation, which establishes connections between positions across the whole pooling window and thus captures context information better. The experimental results show that, with an 18-layer backbone network, the proposed residual correction branch greatly improves the baseline result. This indicates that networks with residual correction branches generate richer and more distinctive feature representations than ordinary convolutions on the original branch, helping to find more complete target objects despite their small size; when target objects are small, the network of this example is also better confined to semantic regions. Meanwhile, to overcome the semantic and scale inconsistency between input features, the dual-branch attention feature fusion (TB-AFF) module of this example adds local channel context to the global channel statistics. The experiments show that a TB-AFF-based network improves the performance of a state-of-the-art network under a small parameter budget. This suggests that feature fusion in deep neural networks deserves attention and that a well-designed feature fusion attention mechanism is likely to be more effective: attending to the quality of feature fusion matters more than blindly increasing the depth of the network. Compared with linear methods (i.e., addition and concatenation), the multi-scale attention fusion network (MSAFN) with the dual-branch attention feature fusion (TB-AFF) module consistently provides better performance.
Fig. 6 shows the visualization results of the baseline and of the method of the invention. For each cell in the figure, the probability map is in the second column, the threshold map in the third column, and the binarized map in the fourth column. The experimental results show that the residual correction branch (RCB) and the dual-branch attention feature fusion (TB-AFF) module play an important role in feature extraction during model training: they effectively strengthen the model's attention to text features and make effective use of the extracted text features, improving the detection accuracy of scene text to a certain extent.
Fig. 7 shows the visualization results of the present invention and of the original DBNet on different types of text examples; note that the images are randomly selected from the three datasets, which better demonstrates the robustness of this example's model. For fig. 7(a), comparing Baseline and Ours, Baseline misses part of the text in the image (i.e., "CA"), which the method of this example detects. For figs. 7(b) and 7(c), Baseline falsely detects non-text, reporting non-text regions as text regions, whereas the method of this example avoids these false detections better than Baseline. For fig. 7(d), Baseline misses part of the text in the image (i.e., "1"), which the method of this example detects. For fig. 7(e), Baseline misses the English word in the middle, while the method of this example detects it accurately. For fig. 7(f), Baseline detects "coffe" as two separate pieces of text, although "coffe" expresses semantic information as a whole and should be detected as a single text region.
The experimental results show that the detection ability of the method improves on the multi-oriented text dataset ICDAR2015, the curved text dataset Total-Text and the multilingual text dataset MSRA-TD500. The method performs well on natural scene text detection datasets, with good precision, recall and comprehensive evaluation index values. By adding the residual correction branch (RCB) and the dual-branch attention feature fusion (TB-AFF) module, the extraction of text and orientation feature information is strengthened, the field of view for text detection is enlarged, and the detection of multi-scale text is effectively improved. Detection efficiency is not sacrificed, the detection accuracy of the original algorithm is improved, and the method is to a certain extent better suited to text detection in natural scenes. In challenging scenes with uneven illumination, low resolution, complex backgrounds and the like, the model copes effectively with severe scale changes of the text and detects scene text accurately. On one hand, the residual correction branch includes an adaptive response calibration operation, which helps locate the exact position of the target object more accurately: the ResNet with residual correction branches can locate the target object (text region) more accurately and completely without including excessive background, even at low network depths. On the other hand, the dual-branch attention feature fusion (TB-AFF) method has excellent performance and good generality, allowing the neural network to extract features more efficiently, effectively improving existing models, focusing on the targets relevant to the labels, and showing strong localization ability. This also demonstrates that early feature fusion has some impact on attention feature fusion.
In summary, to compensate for the insufficient feature extraction ability and receptive field of the lightweight network, a residual correction branch (RCB) is embedded into the backbone network to strengthen its feature extraction ability, and a dual-branch attention feature fusion (TB-AFF) module is embedded into the FPN to enhance the feature expression of multi-scale scene text, thereby improving detection accuracy.

Claims (10)

1. A scene text detection system, comprising: an image acquisition unit, a feature extraction unit, a feature fusion unit and a differentiable binarization module, characterized in that:
the image acquisition unit is used for acquiring an original image;
the feature extraction unit is used for extracting a feature map of the original image using ResNet, with a residual correction branch embedded in the ResNet backbone network; the residual correction branch is used for forming two branches after ResNet applies a conventional convolution to the original image to obtain the input features; one branch converts the input features into a low-dimensional embedding through downsampling, and the low-dimensional embedding is used for calibrating the convolution transformation of the convolution kernel in the other branch, finally obtaining the feature map of the original image;
the feature fusion unit is used for performing feature fusion on the feature map by using the FPN to finally obtain a target feature map;
and the differentiable binarization module is used for determining a target text area in the image according to the target feature map.
2. The scene text detection system according to claim 1, wherein the two branches of the residual correction branch are a first branch and a second branch, respectively;
the first branch is used for applying a conventional convolution to the input features to extract the first branch features;
the second branch is used for downsampling the input features by a factor of r with average pooling, then applying a convolution, upsampling, and finally a Sigmoid activation function to obtain the second branch features;
the residual correction branch is also used for taking the element-wise product of the first branch features and the second branch features to obtain the output features; after the output features are added to the original input, the feature map of the original image is obtained through a ReLU activation function.
3. The scene text detection system according to claim 2, wherein average pooling is used to downsample by a factor of r, with the following formula:

x'_2 = AvgPool_r(x_2)

where x_2 is the input feature of the second branch, x'_2 is the transformed feature of the second branch, and r is 4.
4. The scene text detection system according to claim 3, wherein the second branch features obtained after the Sigmoid activation function are computed as follows:

y_2 = σ(Up(k_2 * x'_2))

where y_2 is the second branch feature; σ is the Sigmoid activation function; Up(·) is nearest-neighbor interpolated upsampling; x'_2 is the transformed feature of the second branch; and k_2 represents a convolution operation.
5. The scene text detection system according to claim 4, wherein the first branch features are computed as follows:

y_1 = k_1 * x_1

where y_1 is the first branch feature; x_1 is the input feature of the first branch; and k_1 represents a convolution operation.
6. The scene text detection system according to any one of claims 1 to 5, wherein a dual-branch attention feature fusion module is embedded in the FPN structure;
the dual-branch attention feature fusion module is used for enhancing the feature expression of multi-scale scene text, thereby improving detection accuracy.
7. The scene text detection system according to claim 6, wherein the dual-branch attention feature fusion module comprises a global feature channel and a local feature channel;
the FPN is used for carrying out initial fusion on any two feature maps of the original image to obtain initial fusion features;
the global feature channel is used for performing global average pooling on the initial fusion features and then performing convolution to extract the global feature channel attention;
the local feature channel is used for performing convolution on the initial fusion features to extract the attention of the local feature channel;
the dual-branch attention feature fusion module is also used for summing the global feature channel attention and the local feature channel attention, activating the sum, and multiplying it element-wise with the larger-size feature map among the feature maps of the original image, finally determining the target feature map.
8. The scene text detection system according to claim 7, wherein the global feature channel attention is computed as follows:

g(X) = B(PWConv_2(δ(B(PWConv_1(Avg(X))))))

where g(X) represents the global feature channel attention; B represents a BatchNorm layer; PWConv represents point-wise convolution; δ denotes the ReLU activation function; X denotes the initial fusion features; and Avg represents global average pooling.
9. The scene text detection system according to claim 8, wherein the local feature channel attention is computed as follows:

L(X) = B(PWConv_2(δ(B(PWConv_1(X)))))

where L(X) represents the local feature channel attention; B represents a BatchNorm layer; PWConv represents point-wise convolution; δ denotes the ReLU activation function; and X denotes the initial fusion features.
10. The scene text detection system according to claim 9, wherein the global feature channel attention and the local feature channel attention are summed, the sum is activated and then multiplied element-wise with the larger-size feature map among the feature maps of the original image, and the target feature map is obtained as follows:

X' = σ(g(X) ⊕ L(X)) ⊗ P

where X' represents the target feature map; σ(g(X) ⊕ L(X)) represents the attention weight; P represents the larger-size feature map among the feature maps of the original image; σ represents the Sigmoid activation function; ⊕ denotes broadcast addition and ⊗ denotes element-wise multiplication; g(X) represents the global feature channel attention; and L(X) represents the local feature channel attention.
CN202210451005.0A 2022-04-27 2022-04-27 Scene text detection system Pending CN114926826A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210451005.0A CN114926826A (en) 2022-04-27 2022-04-27 Scene text detection system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210451005.0A CN114926826A (en) 2022-04-27 2022-04-27 Scene text detection system

Publications (1)

Publication Number Publication Date
CN114926826A true CN114926826A (en) 2022-08-19

Family

ID=82805910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210451005.0A Pending CN114926826A (en) 2022-04-27 2022-04-27 Scene text detection system

Country Status (1)

Country Link
CN (1) CN114926826A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115424276A (en) * 2022-08-30 2022-12-02 青岛励图高科信息技术有限公司 Ship license plate number detection method based on deep learning technology
CN115424276B (en) * 2022-08-30 2023-09-22 青岛励图高科信息技术有限公司 Ship license plate number detection method based on deep learning technology
CN117576699A (en) * 2023-11-06 2024-02-20 华南理工大学 Locomotive work order information intelligent recognition method and system based on deep learning

Similar Documents

Publication Publication Date Title
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN110533084B (en) Multi-scale target detection method based on self-attention mechanism
CN109934200B (en) RGB color remote sensing image cloud detection method and system based on improved M-Net
CN111401384B (en) Transformer equipment defect image matching method
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN111783590A (en) Multi-class small target detection method based on metric learning
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
CN114926826A (en) Scene text detection system
CN112597815A (en) Synthetic aperture radar image ship detection method based on Group-G0 model
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN112489054A (en) Remote sensing image semantic segmentation method based on deep learning
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN116704357B (en) YOLOv 7-based intelligent identification and early warning method for landslide of dam slope
CN112163508A (en) Character recognition method and system based on real scene and OCR terminal
CN111507416B (en) Smoking behavior real-time detection method based on deep learning
CN115223017B (en) Multi-scale feature fusion bridge detection method based on depth separable convolution
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN112700476A (en) Infrared ship video tracking method based on convolutional neural network
CN115330703A (en) Remote sensing image cloud and cloud shadow detection method based on context information fusion
CN115620141A (en) Target detection method and device based on weighted deformable convolution
CN116168240A (en) Arbitrary-direction dense ship target detection method based on attention enhancement
CN112818818B (en) Novel ultra-high-definition remote sensing image change detection method based on AFFPN
CN113628180A (en) Semantic segmentation network-based remote sensing building detection method and system
CN112926552A (en) Remote sensing image vehicle target recognition model and method based on deep neural network
CN115641445B (en) Remote sensing image shadow detection method integrating asymmetric inner convolution and Transformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination