CN116229445A - Natural scene text detection method, system, storage medium and computing device - Google Patents


Info

Publication number
CN116229445A
CN116229445A (application CN202310185149.0A)
Authority
CN
China
Prior art keywords: module, feature, natural scene, features, convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310185149.0A
Other languages
Chinese (zh)
Inventor
杜振锋
周晓清
龚汝洪
曾凡智
周燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Etonedu Co ltd
Original Assignee
Guangdong Etonedu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Etonedu Co ltd
Priority: CN202310185149.0A
Publication: CN116229445A
Legal status: Pending

Classifications

    • G06V 20/63 — Scene text, e.g. street names (under G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images; G06V 20/60 Type of objects; G06V 20/00 Scenes)
    • G06N 3/08 — Learning methods (under G06N 3/02 Neural networks; G06N 3/00 Computing arrangements based on biological models)
    • G06V 10/7715 — Feature extraction, e.g. by transforming the feature space (under G06V 10/77 Processing image or video features in feature spaces)
    • G06V 10/806 — Fusion of extracted features (under G06V 10/80 Fusion at the sensor, preprocessing, feature extraction or classification level)
    • G06V 10/82 — Image or video recognition or understanding using neural networks


Abstract

The invention discloses a natural scene text detection method, system, storage medium and computing device. The method comprises the following steps: 1) apply Imgaug data augmentation to the original dataset; 2) input the processed images into an improved DBNet and obtain the feature information of the target image through a backbone network module with an added LAFE module, inside which the features pass sequentially through three layers of dilated convolution, channel attention and spatial attention to sharpen the distinction between foreground and background features of the image; feed the features output by the backbone into a feature pyramid module with an added MEFF module, which outputs feature maps of different scales supplemented with spatial-semantic information; finally, predict an approximate binary map from the probability map and the threshold map, and obtain the detection result through pixel-to-text-box aggregation post-processing. Built on deep learning, the network can be continuously optimized through training, improving the detection of natural scene text.

Description

Natural scene text detection method, system, storage medium and computing device
Technical Field
The invention relates to the technical field of deep-learning image processing, and in particular to a natural scene text detection method, system, storage medium and computing device based on multi-level feature enhancement and fusion.
Background
As information technology permeates daily life, text has become the carrier of a large amount of information, preserved in documents, images and video and strongly promoting communication between people. Natural scene text refers to text in everyday environments such as streets, supermarkets, product packaging and shop signs; it is rich in content and helps people quickly assess their surroundings and act accordingly. However, unlike conventional document images, whose writing is standard and neatly arranged, natural scene text varies in font style and shape, and natural scene images usually contain interference such as noise, occlusion, clutter and perspective distortion, all of which sharply increase the difficulty of detection. Searching for the desired text information by eye alone is costly, time-consuming and inefficient. It is therefore necessary to apply object detection and semantic segmentation techniques to natural scene text detection.
With the rapid development of two-dimensional object detection, researchers have applied mainstream detectors such as YOLO, SSD and Faster R-CNN to natural scene text detection with good results. However, because the preset anchor boxes and network candidate boxes are generally rectangular, they are ill-suited to curved or arbitrarily shaped text. In recent years, semantic segmentation, which works at the pixel level, has been widely adopted by scholars and research institutions at home and abroad and performs well in this field; since it needs no preset boxes, it can effectively detect text of various shapes. Nevertheless, existing natural scene text detection methods focus almost exclusively on detecting arbitrarily shaped text and remain weak on other kinds of natural scene text, such as unfocused small text, text on complex backgrounds and widely spaced curved text.
Disclosure of Invention
The first aim of the invention is to provide a natural scene text detection method based on multi-level feature enhancement and fusion that addresses the characteristics of the existing DBNet model and the problems of detecting unfocused small text, complex-background text and widely spaced curved text in natural scenes.
The second object of the invention is to provide a natural scene text detection system based on multi-level feature enhancement and fusion.
A third object of the present invention is to provide a storage medium.
It is a fourth object of the present invention to provide a computing device.
The first object of the invention is achieved by the following technical scheme: the method realizes accurate detection of natural scene text with an improved DBNet, which improves both the backbone network module and the feature pyramid module of the original DBNet. The improvement to the backbone network module is the addition of a LAFE module that effectively fuses three layers of dilated convolution, channel attention and spatial attention; the improvement to the feature pyramid module is the addition of a MEFF module that introduces a deformable convolution network into the fusion of multi-level features;
the specific implementation of the natural scene text detection method comprises the following steps:
1) Preprocess the data: apply Imgaug data augmentation to the original dataset, then resize each image in the dataset to 640×640 before it is input to the training network;
2) Input the processed images into the improved DBNet and obtain the feature information of the target image through the backbone network module with the added LAFE module; inside the LAFE module the features pass sequentially through three layers of dilated convolution, channel attention and spatial attention to sharpen the distinction between foreground and background features of the image. Feed the features output by the backbone into the feature pyramid module with the added MEFF module, which outputs feature maps of different scales supplemented with spatial-semantic information; finally, predict an approximate binary map from the probability map and the threshold map, and obtain the detection result through pixel-to-text-box aggregation post-processing.
Further, in step 1), the Imgaug data augmentation comprises: rotation within (−10°, 10°), scaling between 0.5× and 3×, and regularization, random cropping and flipping of the original images. This augmentation effectively improves the network performance of the improved DBNet and makes natural scene text detection more robust and effective.
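A minimal NumPy sketch of this preprocessing step, assuming a single-channel image array; `augment` and `resize_to` are illustrative helpers rather than the patent's code, and a real pipeline would let the Imgaug library apply the sampled rotation and scaling:

```python
import numpy as np

rng = np.random.default_rng(0)

def resize_to(img, h, w):
    # nearest-neighbour resize via index sampling (stand-in for bilinear)
    ys = np.arange(h) * img.shape[0] // h
    xs = np.arange(w) * img.shape[1] // w
    return img[np.ix_(ys, xs)]

def augment(img):
    # sample the geometric parameters named in the text; an image library
    # (e.g. Imgaug) would apply the actual rotation and scaling
    angle = rng.uniform(-10, 10)      # rotation in (-10 deg, 10 deg)
    scale = rng.uniform(0.5, 3.0)     # scale factor between 0.5x and 3x
    if rng.random() < 0.5:            # random horizontal flip
        img = img[:, ::-1]
    img = img.astype(np.float32) / 255.0   # regularize pixel values to [0, 1]
    return resize_to(img, 640, 640)        # fixed 640x640 network input size
```

The fixed 640×640 output matches the size every image is resized to before entering the training network.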
Further, in step 2), the backbone network module consists of a ResNet50+DCN network with LAFE modules added. The processed dataset is first passed through the backbone without the LAFE module, which reads the input natural scene image information and outputs original features of different levels, C_i = {C_2, C_3, C_4, C_5}, where C_2, C_3, C_4 and C_5 are the features of layers 2, 3, 4 and 5 output by the backbone without the LAFE module, with channel numbers 2048, 1024, 512 and 256 respectively;
the number of channels of the layer-2 to layer-5 features is reduced to 256 by 1×1 convolution, and the original features C_i of the different layers are then fed into the LAFE module in parallel;
in the LAFE module, three layers of dilated convolution are used to enlarge the receptive field of the network and strengthen its modeling of global context information; at each dilated convolution the feature map of the previous layer is padded with several pixels so that the output feature map keeps the same size as the original feature C_i. The calculation is shown in formula (1):
C_i^d = F_p(F_n(F_m(C_i)))   (1)
where i = {2, 3, 4, 5}; m, n and p correspond to the dilation rates r = {1, 2, 3} of the three dilated convolutions; F_m denotes a 3×3 convolution with dilation rate 1, F_n a 3×3 convolution with dilation rate 2, and F_p a 3×3 convolution with dilation rate 3; C_i^d denotes the fused feature after the three dilated convolutions.
After the three dilated convolutions, channel attention is applied to the fused feature C_i^d to supplement the channel-dimension information, yielding the feature C_i^ca, as shown in formula (2):
C_i^ca = σ(MLP(AvgPool(C_i^d)) + MLP(MaxPool(C_i^d))) ⊗ C_i^d   (2)
where AvgPool and MaxPool denote average pooling and maximum pooling, MLP denotes two fully connected layers that first compress and then expand the number of channels, σ denotes the Sigmoid function, and ⊗ denotes pixel-wise multiplication.
After channel attention, spatial attention is applied to C_i^ca to supplement the spatial-dimension information, yielding the final enhancement feature C_i^sa, as shown in formula (3), where F^(7×7) denotes extracting spatial information with a 7×7 convolution:
C_i^sa = σ(F^(7×7)([AvgPool(C_i^ca); MaxPool(C_i^ca)])) ⊗ C_i^ca   (3)
Finally, the enhanced feature L_i is the sum of the original feature C_i, the fused feature C_i^d and the enhancement feature C_i^sa, as shown in formula (4):
L_i = C_i + C_i^d + C_i^sa   (4)
where C_i^d is the fused feature output by the three dilated convolutions, C_i^ca the feature output by channel attention, C_i^sa the enhancement feature output by channel attention followed by spatial attention, and L_i the enhanced feature output by the LAFE module.
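A dependency-free NumPy sketch of the channel-attention and spatial-attention steps of formulas (2) and (3), assuming CBAM-style attention with a ReLU inside the shared MLP; the learned 7×7 convolution of formula (3) is stood in by a plain average of the two pooled maps, so this illustrates the data flow, not the trained operator:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f, w1, w2):
    # f: (C, H, W); w1: (C//r, C) and w2: (C, C//r) form the shared MLP
    avg = f.mean(axis=(1, 2))                      # AvgPool over space -> (C,)
    mx = f.max(axis=(1, 2))                        # MaxPool over space -> (C,)
    gate = sigmoid(w2 @ np.maximum(w1 @ avg, 0)    # formula (2): sigmoid of
                   + w2 @ np.maximum(w1 @ mx, 0))  # MLP(Avg) + MLP(Max)
    return f * gate[:, None, None]                 # broadcast gate over H, W

def spatial_attention(f):
    avg = f.mean(axis=0)                           # channel-wise AvgPool -> (H, W)
    mx = f.max(axis=0)                             # channel-wise MaxPool -> (H, W)
    gate = sigmoid((avg + mx) / 2.0)               # stand-in for the 7x7 conv
    return f * gate[None, :, :]
```

Chaining `spatial_attention(channel_attention(f, w1, w2))` mirrors the order the text prescribes: channel attention first, spatial attention second.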
Further, in step 2), the feature pyramid module takes the feature information output by the backbone network module as input and fuses feature maps of different scales. The MEFF module is added to the improved DBNet to effectively reduce the information loss caused by conventional repeated linear up-sampling, so that the fused features are better extracted and missed and false detections of natural scene text are reduced. Specifically:
the enhanced feature L_5 is up-sampled 2× and added to L_4, and L_4 is up-sampled 2× and added to L_3; these are sent, together with the enhanced feature L_2, to the MEFF module to strengthen the expression of spatial information, yielding the multi-level features M_i = {M_2, M_3, M_4, M_5}, where M_2, M_3, M_4 and M_5 are the layer-2 to layer-5 features output by the feature pyramid module;
the feature M_2 is obtained through the MEFF module: first, the distant feature M_4 is linearly up-sampled 4× and passed through a DCN; next, the feature M_3 is linearly up-sampled 2×; finally, combined with the enhanced feature L_2, the feature M_2 fusing multi-level feature information is obtained through a DCN, as shown in formula (5):
M_2 = DCN(L_2 + DCN(Up(M_4, 4)) + Up(M_3, 2))   (5)
where Up(M_3, 2) and Up(M_4, 4) denote 2× and 4× linear up-sampling of the features M_3 and M_4 respectively, DCN denotes introducing a deformable convolution network when extracting features with the convolution kernel, and L_2 is the layer-2 feature output by the LAFE module;
a 1×1 convolution reduces the number of channels of the multi-level features M_i from 256 to 64; the final multi-level features are obtained by linear up-sampling at different magnifications and are then concatenated to give a feature 1/4 the size of the original image.
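Formula (5) can be traced shape-by-shape in NumPy. Nearest-neighbour repetition stands in for linear up-sampling, and `dcn` is an identity placeholder, since the deformable convolution's offsets are learned parameters outside the scope of this sketch:

```python
import numpy as np

def up(f, k):
    # k-times up-sampling by nearest-neighbour repetition
    # (stand-in for the linear up-sampling Up(., k))
    return f.repeat(k, axis=-2).repeat(k, axis=-1)

def dcn(f):
    # identity placeholder for the deformable convolution network
    return f

def meff_m2(l2, m3, m4):
    # formula (5): M2 = DCN(L2 + DCN(Up(M4, 4)) + Up(M3, 2))
    return dcn(l2 + dcn(up(m4, 4)) + up(m3, 2))
```

With L_2 at stride 4, M_3 at stride 8 and M_4 at stride 16, the 2× and 4× factors bring all three terms to L_2's resolution before the element-wise addition.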
In step 2), the prediction module of the improved DBNet predicts on the final features output by the feature pyramid module, producing an image probability map and an image threshold map; a differentiable binarization post-processing module computes an approximate binary map from them, and the final text detection result is determined by pixel-to-text-box aggregation post-processing.
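The patent does not restate the binarization formula, but in the original DBNet paper the approximate binary map is the steep per-pixel sigmoid B = 1 / (1 + e^(−k(P − T))) with amplification factor k (50 in that paper); a one-line NumPy sketch under that assumption:

```python
import numpy as np

def approximate_binary_map(prob, thresh, k=50.0):
    # differentiable binarization: a steep sigmoid of (probability - threshold)
    # approximates the hard step function while remaining trainable
    return 1.0 / (1.0 + np.exp(-k * (prob - thresh)))
```

Pixels whose probability clearly exceeds the learned threshold saturate toward 1, the rest toward 0, which is what makes the subsequent pixel-to-text-box aggregation straightforward.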
The second object of the invention is achieved by the following technical scheme: a natural scene text detection system based on multi-level feature enhancement and fusion, used to realize the above natural scene text detection method based on multi-level feature enhancement and fusion, comprising:
a data preprocessing module for applying Imgaug data augmentation to the original dataset and resizing each image in the dataset to 640×640 before it is input to the training network;
an improved DBNet network module for accurate detection of natural scene text, where the improved DBNet improves both the backbone network module and the feature pyramid module of the original DBNet: the backbone is improved by adding a LAFE module that effectively fuses three layers of dilated convolution, channel attention and spatial attention, and the feature pyramid is improved by adding a MEFF module that introduces a deformable convolution network into the fusion of multi-level features;
a natural scene text detection module for inputting the processed images into the improved DBNet and obtaining the feature information of the target image through the backbone with the added LAFE module, inside which the features pass sequentially through three layers of dilated convolution, channel attention and spatial attention to sharpen the distinction between foreground and background features of the image; the features output by the backbone are fed into the feature pyramid with the added MEFF module, which outputs feature maps of different scales supplemented with spatial-semantic information; finally, an approximate binary map is predicted from the probability map and the threshold map, and the detection result is obtained through pixel-to-text-box aggregation post-processing.
The third object of the invention is achieved by the following technical scheme: a storage medium storing a program which, when executed by a processor, implements the above natural scene text detection method based on multi-level feature enhancement and fusion.
The fourth object of the invention is achieved by the following technical scheme: a computing device comprising a processor and a memory storing a program executable by the processor, the processor implementing the above natural scene text detection method based on multi-level feature enhancement and fusion when executing the program stored in the memory.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. Aiming at the difficulties of natural scene text detection, the invention improves on the DBNet network: first, a LAFE module is added to the backbone network module; second, a MEFF module is added to the feature pyramid module. The improved DBNet achieves higher precision and recall in natural scene text detection and generalizes better.
2. With these improvements to the DBNet network, unfocused small text, complex-background text and widely spaced curved text can be effectively detected; by enhancing and fusing multi-level features, the network distinguishes foreground from background pixels more effectively, reducing missed and false detections.
3. The invention has broad application prospects: the end-to-end training approach effectively reduces cost and improves the accuracy of natural scene text detection, and the method can also be applied to other fields of natural scene text detection, giving it a certain market and outlook.
Drawings
Fig. 1 is the overall architecture of the method of the present invention, in which the UpSample block is an up-sampling operation, the Concat block a concatenation operation, the Add block a pixel-wise addition, the Probability Map block the image probability map prediction, the Threshold Map block the image threshold map prediction, and the DB block the differentiable binarization post-processing module.
Fig. 2 is a schematic diagram of the LAFE module, in which the AvgPool block is an average pooling operation, the MaxPool block a maximum pooling operation, the Sigmoid block a Sigmoid function, and the Mul block a pixel-wise multiplication.
Fig. 3 is a schematic diagram of the MEFF module, where the DCN blocks are deformable convolution networks.
Fig. 4 is a diagram of the architecture of the system of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Example 1
As shown in figs. 1 to 3, this embodiment discloses a natural scene text detection method based on multi-level feature enhancement and fusion. The method realizes accurate detection of natural scene text with an improved DBNet, which improves both the backbone network module and the feature pyramid module of the original DBNet: the backbone is improved by adding a LAFE module that effectively fuses three layers of dilated convolution, channel attention and spatial attention, and the feature pyramid is improved by adding a MEFF module that introduces a deformable convolution network into the fusion of multi-level features. The method comprises the following steps:
1) Preprocess the data: apply Imgaug data augmentation to the original dataset, then resize each image in the dataset to 640×640 before it is input to the training network. The Imgaug augmentation comprises rotation within (−10°, 10°), scaling between 0.5× and 3×, and regularization, random cropping and flipping of the original images; this augmentation effectively improves the network performance of the improved DBNet and makes natural scene text detection more robust and effective.
2) Input the processed images into the improved DBNet and obtain the feature information of the target image through the backbone network module with the added LAFE module; inside the LAFE module the features pass sequentially through three layers of dilated convolution, channel attention and spatial attention to sharpen the distinction between foreground and background features of the image. Feed the features output by the backbone into the feature pyramid module with the added MEFF module, which outputs feature maps of different scales supplemented with spatial-semantic information; finally, predict an approximate binary map from the probability map and the threshold map, and obtain the detection result through pixel-to-text-box aggregation post-processing. The details are as follows:
the backbone network module is composed of ResNet50+DCN networks including LAFE modules, and inputs the processed data set into backbone network modules not including LAFE modules, which reads the input natural scene image information to output different levels of original features C i ={C 2 ,C 3 ,C 4 ,C 5 }, wherein C 2 ,C 3 ,C 4 ,C 5 The characteristics of the layer 2, the layer 3, the layer 4 and the layer 5 which are respectively output by the backbone network module without the LAFE module, and the channel numbers of the characteristics of the layer 2, the layer 3, the layer 4 and the layer 5 are 2048, 1024, 512 and 256 respectively;
the number of channels of the features of layers 2,3,4 and 5 is reduced to 256 by 1 x 1 convolution, and then the number of channels is reduced to 256 for the original features C of different layers i Input to the LAFE module in parallel;
in the LAFE module, a three-layer cavity convolution mode is adopted to enlarge a network receptive field so as to strengthen modeling capability of a network on global context information, and each time the cavity convolution is carried out, a feature map of the upper layer needs to fill a plurality of pixel points so as to ensure the size of an output feature map and an original feature C i The same calculation process is shown in the formula (1):
Figure BDA0004103461640000091
where i= {2,3,4,5}, m, n, and p represent the expansion coefficients r= {1,2,3}, F of the three-layer hole convolution m Representing the use of a 3 x 3 convolution with a coefficient of expansion of 1, F n Representing the use of a 3 x 3 convolution with a coefficient of expansion of 2, F p A 3 x 3 convolution with a coefficient of expansion of 3 is shown,
Figure BDA0004103461640000092
the fusion characteristics after three-layer cavity convolution are represented;
after three-layer cavity convolution, in order to supplement channel dimension information of the features, the features are fused
Figure BDA0004103461640000093
Add channel attention gain feature->
Figure BDA0004103461640000094
As shown in formula (2):
Figure BDA0004103461640000095
wherein, avgPool and MaxPool respectively represent average pooling and maximum pooling operation, MLP represents the number of channels compressed first and then expanded by using two full connection layers, and sigma represents a Sigmoid function;
after the channel attention is added, in order to supplement the space dimension information of the features, the features also need to be subjected to
Figure BDA0004103461640000096
Adding spatial attention to get final enhancement features +.>
Figure BDA0004103461640000097
The calculation process is shown as a formula (3), wherein F 7×7 Representing extracting spatial information with a 7 x 7 convolution; finally, enhance feature L i Then it is the original feature C i Fusion characteristics->
Figure BDA0004103461640000098
Enhancement feature->
Figure BDA0004103461640000099
The three types of features are added, as shown in formula (4):
Figure BDA00041034616400000910
Figure BDA00041034616400000911
in the method, in the process of the invention,
Figure BDA0004103461640000101
representing fusion characteristics output through three-layer cavity convolution, < + >>
Figure BDA0004103461640000102
Characteristic of attention output through the channel, +.>
Figure BDA0004103461640000103
Enhanced features representing attentiveness and spatial attentiveness output through a channel, L i Representing enhanced features of the LAFE module output.
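The padding step described for the dilated convolutions (filling the previous feature map with several pixels to preserve its size) is ordinary zero-padding sized to the dilation rate: a 3×3 convolution with dilation r keeps the spatial size when padded by r pixels per side. A minimal NumPy loop sketch, hypothetical rather than the patent's implementation:

```python
import numpy as np

def dilated_conv3x3(f, kernel, r):
    # 3x3 convolution with dilation rate r on a single-channel map f;
    # zero-padding of r pixels per side keeps the output the size of the input
    h, w = f.shape
    padded = np.pad(f, r)
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            # tap offsets are spaced r apart, which is what "dilation" means
            out += kernel[i, j] * padded[i * r:i * r + h, j * r:j * r + w]
    return out
```

Cascading three such convolutions with r = 1, 2, 3 reproduces the growing receptive field of formula (1) while every intermediate map stays the same size as C_i.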
The feature pyramid module takes the feature information output by the backbone network module as input and fuses feature maps of different scales. The MEFF module is added to the improved DBNet to effectively reduce the information loss caused by conventional repeated linear up-sampling, so that the fused features are better extracted and missed and false detections of natural scene text are reduced. Specifically:
the enhanced feature L_5 is up-sampled 2× and added to L_4, and L_4 is up-sampled 2× and added to L_3; these are sent, together with the enhanced feature L_2, to the MEFF module to strengthen the expression of spatial information, yielding the multi-level features M_i = {M_2, M_3, M_4, M_5}, where M_2, M_3, M_4 and M_5 are the layer-2 to layer-5 features output by the feature pyramid module;
the feature M_2 is obtained through the MEFF module: first, the distant feature M_4 is linearly up-sampled 4× and passed through a DCN; next, the feature M_3 is linearly up-sampled 2×; finally, combined with the enhanced feature L_2, the feature M_2 fusing multi-level feature information is obtained through a DCN, as shown in formula (5):
M_2 = DCN(L_2 + DCN(Up(M_4, 4)) + Up(M_3, 2))   (5)
where Up(M_3, 2) and Up(M_4, 4) denote 2× and 4× linear up-sampling of the features M_3 and M_4 respectively, DCN denotes introducing a deformable convolution network when extracting features with the convolution kernel, and L_2 is the layer-2 feature output by the LAFE module;
a 1×1 convolution reduces the number of channels of the multi-level features M_i from 256 to 64; the final multi-level features are obtained by linear up-sampling at different magnifications and are then concatenated to give a feature 1/4 the size of the original image.
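The channel arithmetic of this last step can be checked directly: after the 1×1 convolution each M_i carries 64 channels, and concatenating the four up-sampled maps restores 4 × 64 = 256 channels at 1/4 of the input resolution. A NumPy shape sketch, assuming (as in the original DBNet head) magnifications of 2×, 4× and 8× for M_3, M_4 and M_5 and stubbing out the 1×1 convolution, so inputs are taken as already reduced to 64 channels:

```python
import numpy as np

def up(f, k):
    # nearest-neighbour stand-in for linear up-sampling
    return f.repeat(k, axis=-2).repeat(k, axis=-1)

def head_features(m2, m3, m4, m5):
    # bring every level to M2's resolution, then concatenate along channels:
    # 4 levels x 64 channels = 256 channels at 1/4 of the original image size
    return np.concatenate([m2, up(m3, 2), up(m4, 4), up(m5, 8)], axis=0)
```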
The prediction module of the improved DBNet predicts on the final features output by the feature pyramid module, producing an image probability map and an image threshold map; a differentiable binarization post-processing module computes an approximate binary map from them, and the final text detection result is determined by pixel-to-text-box aggregation post-processing.
Example 2
The embodiment discloses a natural scene text detection system based on multi-level feature enhancement and fusion, which is used for realizing the natural scene text detection method based on multi-level feature enhancement and fusion described in embodiment 1, and as shown in fig. 4, the system comprises the following functional modules:
the data preprocessing module is used for performing Imgaug data enhancement on the original data set and then resizing each image in the data set to 640 × 640 before it is input into the training network;
the improved DBNet network module is used for realizing accurate detection of natural scene text; the improved DBNet modifies both the backbone network module and the feature pyramid module of the original DBNet, wherein the improvement of the backbone network module is: adding a LAFE module, which effectively fuses three-layer dilated convolution, channel attention and spatial attention; and the improvement of the feature pyramid module is: adding a MEFF module, which introduces a deformable convolution network into the fusion process of the multi-level features;
the natural scene text detection module is used for inputting the processed images of the data set into the improved DBNet and acquiring feature information of the target image through the backbone network module with the added LAFE module, where three-layer dilated convolution, channel attention and spatial attention are applied in sequence as the image passes through the LAFE module to enhance the distinction between foreground and background features of the image; the features output by the backbone network module are input into the feature pyramid module with the added MEFF module to output feature maps of different scales supplemented with spatial semantic information, the approximate binary map generated from the probability map and the threshold map is then predicted, and the detection result is obtained through pixel-to-text-box aggregation post-processing.
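Assuming the standard ResNet-50 stage strides (4, 8, 16, 32) and channel widths (256, 512, 1024, 2048) — reasonable defaults for a ResNet50+DCN backbone, though the exact configuration here is an assumption — the shape flow of the pipeline for a 640 × 640 input can be traced with placeholder arrays:

```python
import numpy as np

H = W = 640                                # preprocessed input size
strides  = {2: 4, 3: 8, 4: 16, 5: 32}      # assumed ResNet-50 stage strides
channels = {2: 256, 3: 512, 4: 1024, 5: 2048}  # assumed stage channel widths

# Backbone outputs C2..C5 (batch dimension omitted)
C_feats = {i: np.zeros((channels[i], H // s, W // s))
           for i, s in strides.items()}

# 1x1 convolutions bring every level to 256 channels before the pyramid
M_feats = {i: np.zeros((256, f.shape[1], f.shape[2]))
           for i, f in C_feats.items()}

# Head: 1x1 convs reduce 256 -> 64, each level is upsampled back to the
# 1/4 scale, and the four levels are concatenated into the fused feature
fused = np.concatenate(
    [np.zeros((64, H // 4, W // 4)) for _ in M_feats], axis=0)
print(fused.shape)  # (256, 160, 160)
```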
Example 3
The embodiment discloses a storage medium storing a program which, when executed by a processor, implements the natural scene text detection method based on multi-level feature enhancement and fusion described in embodiment 1.
The storage medium in this embodiment may be a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), a USB flash drive, a removable hard disk, or the like.
Example 4
The embodiment discloses a computing device comprising a processor and a memory for storing a program executable by the processor; when the processor executes the program stored in the memory, the natural scene text detection method based on multi-level feature enhancement and fusion described in embodiment 1 is realized.
The computing device described in this embodiment may be a desktop computer, a notebook computer, a smart phone, a PDA handheld terminal, a tablet computer, a Programmable Logic Controller (PLC), or another terminal device with processor functionality.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any other changes, modifications, substitutions, combinations and simplifications made without departing from the spirit and principle of the present invention are equivalent replacements and fall within the protection scope of the present invention.

Claims (8)

1. A natural scene text detection method based on multi-level feature enhancement and fusion, characterized in that the method realizes accurate detection of natural scene text based on an improved DBNet; the improved DBNet modifies both the backbone network module and the feature pyramid module of the original DBNet, wherein the improvement of the backbone network module is: adding a LAFE module, which effectively fuses three-layer dilated convolution, channel attention and spatial attention; and the improvement of the feature pyramid module is: adding a MEFF module, which introduces a deformable convolution network into the fusion process of the multi-level features;
the specific implementation of the natural scene text detection method comprises the following steps:
1) preprocessing data, including performing Imgaug data enhancement on the original data set and then resizing each image in the data set to 640 × 640 before it is input into the training network;
2) inputting the processed images of the data set into the improved DBNet, and acquiring feature information of the target image through the backbone network module with the added LAFE module, where three-layer dilated convolution, channel attention and spatial attention are applied in sequence as the image passes through the LAFE module to enhance the distinction between foreground and background features of the image; inputting the features output by the backbone network module into the feature pyramid module with the added MEFF module to output feature maps of different scales supplemented with spatial semantic information, then predicting the approximate binary map generated from the probability map and the threshold map, and obtaining the detection result through pixel-to-text-box aggregation post-processing.
2. The natural scene text detection method based on multi-level feature enhancement and fusion according to claim 1, wherein in step 1), the Imgaug data enhancement is as follows: rotation within the (-10°, 10°) range, scaling between 0.5× and 3×, and regularization, random cropping and flipping applied to the original data; this data enhancement effectively improves the network performance of the improved DBNet and makes natural scene text detection more robust and effective.
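A numpy-only sketch of the flip-and-crop portion of such an augmentation pipeline (rotation and scaling are usually delegated to a library such as Imgaug and are omitted here; the function name is illustrative):

```python
import numpy as np

def random_flip_crop(img, crop_size, rng):
    """Random horizontal flip followed by a random crop of an (H, W, 3) image.
    Rotation in (-10, 10) degrees and 0.5x-3x scaling, mentioned in the claim,
    are not implemented in this sketch."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                      # horizontal flip
    h, w, _ = img.shape
    top  = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return img[top:top + crop_size, left:left + crop_size, :]

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(720, 1280, 3), dtype=np.uint8)
patch = random_flip_crop(img, 640, rng)
print(patch.shape)  # (640, 640, 3)
```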
3. The natural scene text detection method based on multi-level feature enhancement and fusion according to claim 2, wherein in step 2), the backbone network module consists of a ResNet50+DCN network containing the LAFE module; the processed data set is input into the backbone network module without the LAFE module, which reads the input natural scene image information and outputs original features of different levels Ci = {C2, C3, C4, C5}, where C2, C3, C4 and C5 are the layer-2, layer-3, layer-4 and layer-5 features output by the backbone network module without the LAFE module, with channel numbers of 256, 512, 1024 and 2048 respectively;
the channel numbers of the layer-2, layer-3, layer-4 and layer-5 features are reduced to 256 by 1 × 1 convolution, and the original features Ci of the different levels are then input to the LAFE module in parallel;
in the LAFE module, three-layer dilated convolution is adopted to enlarge the network receptive field and strengthen the network's ability to model global context information; before each dilated convolution, the feature map of the previous layer is padded with several pixels so that the output feature map keeps the same size as the original feature Ci; the calculation process is shown in formula (1):
C̃i = Fp(Fn(Fm(Ci))) (1)
where i = {2, 3, 4, 5}; m, n and p correspond to the dilation coefficients r = {1, 2, 3} of the three-layer dilated convolution; Fm denotes a 3 × 3 convolution with dilation coefficient 1, Fn a 3 × 3 convolution with dilation coefficient 2, and Fp a 3 × 3 convolution with dilation coefficient 3; and C̃i denotes the fused feature after the three-layer dilated convolution;
after the three-layer dilated convolution, in order to supplement the channel-dimension information of the features, channel attention is added to the fused feature C̃i to obtain the feature Ĉi, as shown in formula (2):
Ĉi = σ(MLP(AvgPool(C̃i)) + MLP(MaxPool(C̃i))) ⊗ C̃i (2)
where AvgPool and MaxPool denote the average pooling and maximum pooling operations respectively, MLP denotes two fully connected layers that first compress and then expand the number of channels, and σ denotes the Sigmoid function;
after the channel attention is added, in order to supplement the spatial-dimension information of the features, spatial attention is further applied to Ĉi to obtain the final enhanced feature C̄i, as shown in formula (3), where F7×7 denotes extracting spatial information with a 7 × 7 convolution; finally, the enhanced feature Li is obtained by adding three kinds of features, namely the original feature Ci, the fused feature C̃i and the enhanced feature C̄i, as shown in formula (4):
C̄i = σ(F7×7([AvgPool(Ĉi); MaxPool(Ĉi)])) ⊗ Ĉi (3)
Li = Ci + C̃i + C̄i (4)
where C̃i denotes the fused feature output by the three-layer dilated convolution, Ĉi denotes the feature output by the channel attention, C̄i denotes the enhanced feature output by the channel attention followed by the spatial attention, and Li denotes the enhanced feature output by the LAFE module.
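Under the assumption that formulas (1)–(4) follow the standard cascaded-dilated-convolution and CBAM-style attention forms (the equation images in the published text are not reproduced here, so the exact expressions are assumptions), the LAFE computation can be sketched in numpy with toy shapes and random weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3x3_dilated(x, w, d):
    """'Same' 3x3 convolution with dilation d on a (C, H, W) map.
    Padding by d pixels keeps the output the same size as the input,
    matching the 'pad pixels before each dilated convolution' step."""
    c_out, c_in, _, _ = w.shape
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (d, d), (d, d)))
    out = np.zeros((c_out, h, wd))
    for i in range(3):
        for j in range(3):
            out += np.einsum('oc,chw->ohw', w[:, :, i, j],
                             xp[:, i * d:i * d + h, j * d:j * d + wd])
    return out

def channel_attention(x, w1, w2):
    """Formula (2): shared two-layer MLP over avg- and max-pooled vectors."""
    avg = x.mean(axis=(1, 2))
    mx  = x.max(axis=(1, 2))
    gate = sigmoid(w2 @ np.maximum(w1 @ avg, 0) +
                   w2 @ np.maximum(w1 @ mx, 0))
    return x * gate[:, None, None]

def spatial_attention(x, w7):
    """Formula (3): 7x7 conv over stacked channel-wise avg/max maps."""
    stats = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    h, wd = stats.shape[1:]
    sp = np.pad(stats, ((0, 0), (3, 3), (3, 3)))
    gate = np.zeros((h, wd))
    for i in range(7):
        for j in range(7):
            gate += np.einsum('c,chw->hw', w7[:, i, j],
                              sp[:, i:i + h, j:j + wd])
    return x * sigmoid(gate)[None]

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16                         # toy sizes (256 channels in the text)
Ci = rng.standard_normal((C, H, W))
ws = [rng.standard_normal((C, C, 3, 3)) * 0.05 for _ in range(3)]
w1 = rng.standard_normal((C // 2, C)) * 0.1  # MLP: compress channels ...
w2 = rng.standard_normal((C, C // 2)) * 0.1  # ... then expand them back
w7 = rng.standard_normal((2, 7, 7)) * 0.1

# Formula (1): cascaded dilated convolutions with dilations 1, 2, 3
Ct = conv3x3_dilated(conv3x3_dilated(conv3x3_dilated(Ci, ws[0], 1),
                                     ws[1], 2), ws[2], 3)
Ch = channel_attention(Ct, w1, w2)           # formula (2)
Cb = spatial_attention(Ch, w7)               # formula (3)
Li = Ci + Ct + Cb                            # formula (4): three-way sum
print(Li.shape)  # (8, 16, 16)
```

All three stages preserve the spatial size, so the final sum in formula (4) is a plain element-wise addition.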
4. The natural scene text detection method based on multi-level feature enhancement and fusion according to claim 3, wherein in step 2), the feature pyramid module takes the feature information output by the backbone network module as input and fuses feature maps of different scales; the MEFF module is added to the feature pyramid module of the improved DBNet in order to effectively reduce the information loss caused by the traditional continuous linear upsampling operation, so that fused features are better extracted and missed and false detections of natural scene text are reduced; the specific procedure is as follows:
the enhanced feature L5 is upsampled 2× and added to the enhanced feature L4, the enhanced feature L4 is upsampled 2× and added to the enhanced feature L3, and these are sent together with the enhanced feature L2 to the MEFF module to enhance the expression of spatial information and obtain the multi-level features Mi = {M2, M3, M4, M5}, where M2, M3, M4 and M5 are the layer-2, layer-3, layer-4 and layer-5 features output by the feature pyramid module;
the feature M2 is obtained through the MEFF module: first, the distant feature M4 is linearly upsampled 4× and then passed through a DCN; next, the feature M3 is linearly upsampled 2×; finally, these are combined with the enhanced feature L2 and passed through a DCN to obtain the feature M2, which fuses multi-level feature information, as shown in formula (5):
M2 = DCN(L2 + DCN(Up(M4, 4)) + Up(M3, 2)) (5)
where Up(M3, 2) and Up(M4, 4) denote linear upsampling of the features M3 and M4 by 2× and 4× respectively, DCN denotes that a deformable convolution network is introduced when the features are extracted by the convolution kernel, and L2 denotes the layer-2 feature output by the LAFE module;
a 1 × 1 convolution reduces the channel number of each multi-level feature Mi from 256 to 64, the final multi-level features are obtained through linear upsampling at different magnifications, and the upsampled features are then concatenated to obtain a feature map at 1/4 the size of the original image.
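The final head of the pyramid — 1 × 1 channel reduction from 256 to 64, per-level upsampling back to the 1/4 scale, and concatenation — can be sketched with numpy placeholder features (toy 1/4-scale size of 40 × 40; nearest-neighbour upsampling stands in for the linear upsampling, and real sizes depend on the input resolution):

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbour stand-in for the linear upsampling."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def conv1x1(x, w):
    """A 1x1 convolution is a per-pixel linear map over channels."""
    return np.einsum('oc,chw->ohw', w, x)

rng = np.random.default_rng(0)
base = 40                                    # toy 1/4-scale spatial size
# Pyramid features M2..M5 at strides 4, 8, 16, 32 with 256 channels each
M = {i: rng.standard_normal((256, base >> (i - 2), base >> (i - 2)))
     for i in range(2, 6)}
W = {i: rng.standard_normal((64, 256)) * 0.05 for i in M}

# Reduce each level to 64 channels, upsample to the 1/4 scale, concatenate
parts = [upsample(conv1x1(M[i], W[i]), 2 ** (i - 2)) for i in sorted(M)]
fused = np.concatenate(parts, axis=0)
print(fused.shape)  # (256, 40, 40)
```

The concatenated map has 4 × 64 = 256 channels at the 1/4 scale, matching the "reduce then splice" description in the claim.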
5. The natural scene text detection method based on multi-level feature enhancement and fusion according to claim 4, wherein in step 2), the prediction module of the improved DBNet predicts on the final features output by the feature pyramid module, predicting an image probability map and an image threshold map respectively; a differentiable binarization post-processing module computes an approximate binary map from them, and a pixel-to-text-box aggregation post-processing step then determines the final text detection result.
6. A natural scene text detection system based on multi-level feature enhancement and fusion, characterized by being used for realizing the natural scene text detection method based on multi-level feature enhancement and fusion according to any one of claims 1 to 5, the system comprising:
the data preprocessing module, used for performing Imgaug data enhancement on the original data set and then resizing each image in the data set to 640 × 640 before it is input into the training network;
the improved DBNet network module, used for realizing accurate detection of natural scene text; the improved DBNet modifies both the backbone network module and the feature pyramid module of the original DBNet, wherein the improvement of the backbone network module is: adding a LAFE module, which effectively fuses three-layer dilated convolution, channel attention and spatial attention; and the improvement of the feature pyramid module is: adding a MEFF module, which introduces a deformable convolution network into the fusion process of the multi-level features;
the natural scene text detection module, used for inputting the processed images of the data set into the improved DBNet and acquiring feature information of the target image through the backbone network module with the added LAFE module, where three-layer dilated convolution, channel attention and spatial attention are applied in sequence as the image passes through the LAFE module to enhance the distinction between foreground and background features of the image; the features output by the backbone network module are input into the feature pyramid module with the added MEFF module to output feature maps of different scales supplemented with spatial semantic information, the approximate binary map generated from the probability map and the threshold map is then predicted, and the detection result is obtained through pixel-to-text-box aggregation post-processing.
7. A storage medium storing a program, wherein the program, when executed by a processor, implements the natural scene text detection method based on multi-level feature enhancement and fusion of any one of claims 1 to 5.
8. A computing device comprising a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the natural scene text detection method based on multi-level feature enhancement and fusion of any one of claims 1 to 5.
CN202310185149.0A 2023-03-01 2023-03-01 Natural scene text detection method, system, storage medium and computing device Pending CN116229445A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310185149.0A CN116229445A (en) 2023-03-01 2023-03-01 Natural scene text detection method, system, storage medium and computing device

Publications (1)

Publication Number Publication Date
CN116229445A true CN116229445A (en) 2023-06-06

Family

ID=86588788




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination