CN116229445A - Natural scene text detection method, system, storage medium and computing device - Google Patents


Info

Publication number
CN116229445A
CN116229445A (application CN202310185149.0A)
Authority
CN
China
Prior art keywords: module, feature, natural scene, features, convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310185149.0A
Other languages
Chinese (zh)
Inventor
杜振锋
周晓清
龚汝洪
曾凡智
周燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Etonedu Co ltd
Original Assignee
Guangdong Etonedu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Etonedu Co ltd
Priority: CN202310185149.0A
Publication: CN116229445A
Legal status: Pending

Classifications

    • G06V 20/63 — Scene text, e.g. street names (under G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images; G06V 20/60 Type of objects; G06V 20/00 Scenes)
    • G06N 3/08 — Learning methods (under G06N 3/02 Neural networks; G06N 3/00 Computing arrangements based on biological models)
    • G06V 10/7715 — Feature extraction, e.g. by transforming the feature space (under G06V 10/77 Processing image or video features in feature spaces)
    • G06V 10/806 — Fusion of extracted features (under G06V 10/80 Fusion at the sensor, preprocessing, feature extraction or classification level)
    • G06V 10/82 — Image or video recognition or understanding using neural networks


Abstract

The invention discloses a natural scene text detection method, system, storage medium and computing device. The method comprises the following steps: 1) apply Imgaug data augmentation to the original dataset; 2) input the processed images into an improved DBNet and obtain the feature information of the target image through a backbone network module with an added LAFE module, inside which the features pass sequentially through three layers of dilated convolution, channel attention and spatial attention to sharpen the distinction between foreground and background features of the image; feed the features output by the backbone into a feature pyramid module with an added MEFF module, which outputs feature maps of different scales supplemented with spatial-semantic information; finally, predict an approximate binary map from the probability map and the threshold map, and obtain the detection result through pixel-to-text-box aggregation post-processing. Built on deep learning, the network can be continuously optimized through training, improving the detection of natural scene text.

Description

Natural scene text detection method, system, storage medium and computing device
Technical Field
The invention relates to the technical field of deep-learning image processing, and in particular to a natural scene text detection method, system, storage medium and computing device based on multi-level feature enhancement and fusion.
Background
As information technology permeates daily life, text has become the carrier of a large amount of information, preserved in documents, images and video and strongly promoting communication between people. Natural scene text refers to text in everyday environments such as streets, supermarkets, product packaging and shop signs; it is rich in content and helps people quickly assess their surroundings and act accordingly. However, unlike conventional document images, whose writing is standard and neatly arranged, natural scene text varies in font style and shape, and natural scene images usually contain interference such as noise, occlusion, clutter and perspective distortion, all of which sharply increase the difficulty of detection. Searching for the desired text information by eye alone is costly, time-consuming and inefficient. It is therefore necessary to apply object detection and semantic segmentation techniques to natural scene text detection.
With the rapid development of two-dimensional object detection, researchers have applied mainstream detectors such as YOLO, SSD and Faster R-CNN to natural scene text detection with good results. However, because the preset anchor boxes and network candidate boxes are generally rectangular, they are ill-suited to curved or arbitrarily shaped text. In recent years, semantic segmentation, which works at the pixel level, has been widely adopted by scholars and research institutions at home and abroad and performs well in this field; since it needs no preset boxes, it can effectively detect text of various shapes. Nevertheless, existing natural scene text detection methods focus almost exclusively on detecting arbitrarily shaped text and remain weak on other kinds of natural scene text, such as unfocused small text, text on complex backgrounds and widely spaced curved text.
Disclosure of Invention
The first aim of the invention is to provide a natural scene text detection method based on multi-level feature enhancement and fusion that addresses the characteristics of the existing DBNet model and the problems of detecting unfocused small text, complex-background text and widely spaced curved text in natural scenes.
The second object of the invention is to provide a natural scene text detection system based on multi-level feature enhancement and fusion.
A third object of the present invention is to provide a storage medium.
It is a fourth object of the present invention to provide a computing device.
The first object of the invention is achieved by the following technical scheme: the method realizes accurate detection of natural scene text with an improved DBNet, which improves both the backbone network module and the feature pyramid module of the original DBNet. The improvement to the backbone network module is the addition of a LAFE module that effectively fuses three layers of dilated convolution, channel attention and spatial attention; the improvement to the feature pyramid module is the addition of a MEFF module that introduces a deformable convolution network into the fusion of multi-level features;
the specific implementation of the natural scene text detection method comprises the following steps:
1) Preprocess the data: apply Imgaug data augmentation to the original dataset, then resize each image in the dataset to 640×640 before it is input to the training network;
2) Input the processed images into the improved DBNet and obtain the feature information of the target image through the backbone network module with the added LAFE module; inside the LAFE module the features pass sequentially through three layers of dilated convolution, channel attention and spatial attention to sharpen the distinction between foreground and background features of the image. Feed the features output by the backbone into the feature pyramid module with the added MEFF module, which outputs feature maps of different scales supplemented with spatial-semantic information; finally, predict an approximate binary map from the probability map and the threshold map, and obtain the detection result through pixel-to-text-box aggregation post-processing.
Further, in step 1), the Imgaug data augmentation comprises: rotation within (−10°, 10°), scaling between 0.5× and 3×, and regularization, random cropping and flipping of the original images. This augmentation effectively improves the network performance of the improved DBNet and makes natural scene text detection more robust and effective.
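A minimal NumPy sketch of this preprocessing step, assuming a single-channel image array; `augment` and `resize_to` are illustrative helpers rather than the patent's code, and a real pipeline would let the Imgaug library apply the sampled rotation and scaling:

```python
import numpy as np

rng = np.random.default_rng(0)

def resize_to(img, h, w):
    # nearest-neighbour resize via index sampling (stand-in for bilinear)
    ys = np.arange(h) * img.shape[0] // h
    xs = np.arange(w) * img.shape[1] // w
    return img[np.ix_(ys, xs)]

def augment(img):
    # sample the geometric parameters named in the text; an image library
    # (e.g. Imgaug) would apply the actual rotation and scaling
    angle = rng.uniform(-10, 10)      # rotation in (-10 deg, 10 deg)
    scale = rng.uniform(0.5, 3.0)     # scale factor between 0.5x and 3x
    if rng.random() < 0.5:            # random horizontal flip
        img = img[:, ::-1]
    img = img.astype(np.float32) / 255.0   # regularize pixel values to [0, 1]
    return resize_to(img, 640, 640)        # fixed 640x640 network input size
```

The fixed 640×640 output matches the size every image is resized to before entering the training network.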
Further, in step 2), the backbone network module consists of a ResNet50+DCN network with LAFE modules added. The processed dataset is first passed through the backbone without the LAFE module, which reads the input natural scene image information and outputs original features of different levels, C_i = {C_2, C_3, C_4, C_5}, where C_2, C_3, C_4 and C_5 are the features of layers 2, 3, 4 and 5 output by the backbone without the LAFE module, with channel numbers 2048, 1024, 512 and 256 respectively;
the number of channels of the layer-2 to layer-5 features is reduced to 256 by 1×1 convolution, and the original features C_i of the different layers are then fed into the LAFE module in parallel;
in the LAFE module, three layers of dilated convolution are used to enlarge the receptive field of the network and strengthen its modeling of global context information; at each dilated convolution the feature map of the previous layer is padded with several pixels so that the output feature map keeps the same size as the original feature C_i. The calculation is shown in formula (1):
C_i^d = F_p(F_n(F_m(C_i)))   (1)
where i = {2, 3, 4, 5}; m, n and p correspond to the dilation rates r = {1, 2, 3} of the three dilated convolutions; F_m denotes a 3×3 convolution with dilation rate 1, F_n a 3×3 convolution with dilation rate 2, and F_p a 3×3 convolution with dilation rate 3; C_i^d denotes the fused feature after the three dilated convolutions.
After the three dilated convolutions, channel attention is applied to the fused feature C_i^d to supplement the channel-dimension information, yielding the feature C_i^ca, as shown in formula (2):
C_i^ca = σ(MLP(AvgPool(C_i^d)) + MLP(MaxPool(C_i^d))) ⊗ C_i^d   (2)
where AvgPool and MaxPool denote average pooling and maximum pooling, MLP denotes two fully connected layers that first compress and then expand the number of channels, σ denotes the Sigmoid function, and ⊗ denotes pixel-wise multiplication.
After channel attention, spatial attention is applied to C_i^ca to supplement the spatial-dimension information, yielding the final enhancement feature C_i^sa, as shown in formula (3), where F^(7×7) denotes extracting spatial information with a 7×7 convolution:
C_i^sa = σ(F^(7×7)([AvgPool(C_i^ca); MaxPool(C_i^ca)])) ⊗ C_i^ca   (3)
Finally, the enhanced feature L_i is the sum of the original feature C_i, the fused feature C_i^d and the enhancement feature C_i^sa, as shown in formula (4):
L_i = C_i + C_i^d + C_i^sa   (4)
where C_i^d is the fused feature output by the three dilated convolutions, C_i^ca the feature output by channel attention, C_i^sa the enhancement feature output by channel attention followed by spatial attention, and L_i the enhanced feature output by the LAFE module.
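A dependency-free NumPy sketch of the channel-attention and spatial-attention steps of formulas (2) and (3), assuming CBAM-style attention with a ReLU inside the shared MLP; the learned 7×7 convolution of formula (3) is stood in by a plain average of the two pooled maps, so this illustrates the data flow, not the trained operator:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f, w1, w2):
    # f: (C, H, W); w1: (C//r, C) and w2: (C, C//r) form the shared MLP
    avg = f.mean(axis=(1, 2))                      # AvgPool over space -> (C,)
    mx = f.max(axis=(1, 2))                        # MaxPool over space -> (C,)
    gate = sigmoid(w2 @ np.maximum(w1 @ avg, 0)    # formula (2): sigmoid of
                   + w2 @ np.maximum(w1 @ mx, 0))  # MLP(Avg) + MLP(Max)
    return f * gate[:, None, None]                 # broadcast gate over H, W

def spatial_attention(f):
    avg = f.mean(axis=0)                           # channel-wise AvgPool -> (H, W)
    mx = f.max(axis=0)                             # channel-wise MaxPool -> (H, W)
    gate = sigmoid((avg + mx) / 2.0)               # stand-in for the 7x7 conv
    return f * gate[None, :, :]
```

Chaining `spatial_attention(channel_attention(f, w1, w2))` mirrors the order the text prescribes: channel attention first, spatial attention second.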
Further, in step 2), the feature pyramid module takes the feature information output by the backbone network module as input and fuses feature maps of different scales. The MEFF module is added to the improved DBNet to effectively reduce the information loss caused by conventional repeated linear up-sampling, so that the fused features are better extracted and missed and false detections of natural scene text are reduced. Specifically:
the enhanced feature L_5 is up-sampled 2× and added to L_4, and L_4 is up-sampled 2× and added to L_3; these are sent, together with the enhanced feature L_2, to the MEFF module to strengthen the expression of spatial information, yielding the multi-level features M_i = {M_2, M_3, M_4, M_5}, where M_2, M_3, M_4 and M_5 are the layer-2 to layer-5 features output by the feature pyramid module;
the feature M_2 is obtained through the MEFF module: first, the distant feature M_4 is linearly up-sampled 4× and passed through a DCN; next, the feature M_3 is linearly up-sampled 2×; finally, combined with the enhanced feature L_2, the feature M_2 fusing multi-level feature information is obtained through a DCN, as shown in formula (5):
M_2 = DCN(L_2 + DCN(Up(M_4, 4)) + Up(M_3, 2))   (5)
where Up(M_3, 2) and Up(M_4, 4) denote 2× and 4× linear up-sampling of the features M_3 and M_4 respectively, DCN denotes introducing a deformable convolution network when extracting features with the convolution kernel, and L_2 is the layer-2 feature output by the LAFE module;
a 1×1 convolution reduces the number of channels of the multi-level features M_i from 256 to 64; the final multi-level features are obtained by linear up-sampling at different magnifications and are then concatenated to give a feature 1/4 the size of the original image.
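Formula (5) can be traced shape-by-shape in NumPy. Nearest-neighbour repetition stands in for linear up-sampling, and `dcn` is an identity placeholder, since the deformable convolution's offsets are learned parameters outside the scope of this sketch:

```python
import numpy as np

def up(f, k):
    # k-times up-sampling by nearest-neighbour repetition
    # (stand-in for the linear up-sampling Up(., k))
    return f.repeat(k, axis=-2).repeat(k, axis=-1)

def dcn(f):
    # identity placeholder for the deformable convolution network
    return f

def meff_m2(l2, m3, m4):
    # formula (5): M2 = DCN(L2 + DCN(Up(M4, 4)) + Up(M3, 2))
    return dcn(l2 + dcn(up(m4, 4)) + up(m3, 2))
```

With L_2 at stride 4, M_3 at stride 8 and M_4 at stride 16, the 2× and 4× factors bring all three terms to L_2's resolution before the element-wise addition.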
In step 2), the prediction module of the improved DBNet predicts on the final features output by the feature pyramid module, producing an image probability map and an image threshold map; a differentiable binarization post-processing module computes an approximate binary map from them, and the final text detection result is determined by pixel-to-text-box aggregation post-processing.
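The patent does not restate the binarization formula, but in the original DBNet paper the approximate binary map is the steep per-pixel sigmoid B = 1 / (1 + e^(−k(P − T))) with amplification factor k (50 in that paper); a one-line NumPy sketch under that assumption:

```python
import numpy as np

def approximate_binary_map(prob, thresh, k=50.0):
    # differentiable binarization: a steep sigmoid of (probability - threshold)
    # approximates the hard step function while remaining trainable
    return 1.0 / (1.0 + np.exp(-k * (prob - thresh)))
```

Pixels whose probability clearly exceeds the learned threshold saturate toward 1, the rest toward 0, which is what makes the subsequent pixel-to-text-box aggregation straightforward.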
The second object of the invention is achieved by the following technical scheme: a natural scene text detection system based on multi-level feature enhancement and fusion, used to realize the above natural scene text detection method based on multi-level feature enhancement and fusion, comprising:
a data preprocessing module for applying Imgaug data augmentation to the original dataset and resizing each image in the dataset to 640×640 before it is input to the training network;
an improved DBNet network module for accurate detection of natural scene text, where the improved DBNet improves both the backbone network module and the feature pyramid module of the original DBNet: the backbone is improved by adding a LAFE module that effectively fuses three layers of dilated convolution, channel attention and spatial attention, and the feature pyramid is improved by adding a MEFF module that introduces a deformable convolution network into the fusion of multi-level features;
a natural scene text detection module for inputting the processed images into the improved DBNet and obtaining the feature information of the target image through the backbone with the added LAFE module, inside which the features pass sequentially through three layers of dilated convolution, channel attention and spatial attention to sharpen the distinction between foreground and background features of the image; the features output by the backbone are fed into the feature pyramid with the added MEFF module, which outputs feature maps of different scales supplemented with spatial-semantic information; finally, an approximate binary map is predicted from the probability map and the threshold map, and the detection result is obtained through pixel-to-text-box aggregation post-processing.
The third object of the invention is achieved by the following technical scheme: a storage medium storing a program which, when executed by a processor, implements the above natural scene text detection method based on multi-level feature enhancement and fusion.
The fourth object of the invention is achieved by the following technical scheme: a computing device comprising a processor and a memory storing a program executable by the processor, the processor implementing the above natural scene text detection method based on multi-level feature enhancement and fusion when executing the program stored in the memory.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. Aiming at the difficulties of natural scene text detection, the invention improves on the DBNet network: first, a LAFE module is added to the backbone network module; second, a MEFF module is added to the feature pyramid module. The improved DBNet achieves higher precision and recall in natural scene text detection and generalizes better.
2. With these improvements to the DBNet network, unfocused small text, complex-background text and widely spaced curved text can be effectively detected; by enhancing and fusing multi-level features, the network distinguishes foreground from background pixels more effectively, reducing missed and false detections.
3. The invention has broad application prospects: the end-to-end training approach effectively reduces cost and improves the accuracy of natural scene text detection, and the method can also be applied to other fields of natural scene text detection, giving it a certain market and outlook.
Drawings
Fig. 1 is the overall architecture of the method of the present invention, in which the UpSample block is an up-sampling operation, the Concat block a concatenation operation, the Add block a pixel-wise addition, the Probability Map block the image probability map prediction, the Threshold Map block the image threshold map prediction, and the DB block the differentiable binarization post-processing module.
Fig. 2 is a schematic diagram of the LAFE module, in which the AvgPool block is an average pooling operation, the MaxPool block a maximum pooling operation, the Sigmoid block a Sigmoid function, and the Mul block a pixel-wise multiplication.
Fig. 3 is a schematic diagram of the MEFF module, where the DCN blocks are deformable convolution networks.
Fig. 4 is a diagram of the architecture of the system of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Example 1
As shown in figs. 1 to 3, this embodiment discloses a natural scene text detection method based on multi-level feature enhancement and fusion. The method realizes accurate detection of natural scene text with an improved DBNet, which improves both the backbone network module and the feature pyramid module of the original DBNet: the backbone is improved by adding a LAFE module that effectively fuses three layers of dilated convolution, channel attention and spatial attention, and the feature pyramid is improved by adding a MEFF module that introduces a deformable convolution network into the fusion of multi-level features. The method comprises the following steps:
1) Preprocess the data: apply Imgaug data augmentation to the original dataset, then resize each image in the dataset to 640×640 before it is input to the training network. The Imgaug augmentation comprises rotation within (−10°, 10°), scaling between 0.5× and 3×, and regularization, random cropping and flipping of the original images; this augmentation effectively improves the network performance of the improved DBNet and makes natural scene text detection more robust and effective.
2) Input the processed images into the improved DBNet and obtain the feature information of the target image through the backbone network module with the added LAFE module; inside the LAFE module the features pass sequentially through three layers of dilated convolution, channel attention and spatial attention to sharpen the distinction between foreground and background features of the image. Feed the features output by the backbone into the feature pyramid module with the added MEFF module, which outputs feature maps of different scales supplemented with spatial-semantic information; finally, predict an approximate binary map from the probability map and the threshold map, and obtain the detection result through pixel-to-text-box aggregation post-processing. The details are as follows:
the backbone network module is composed of ResNet50+DCN networks including LAFE modules, and inputs the processed data set into backbone network modules not including LAFE modules, which reads the input natural scene image information to output different levels of original features C i ={C 2 ,C 3 ,C 4 ,C 5 }, wherein C 2 ,C 3 ,C 4 ,C 5 The characteristics of the layer 2, the layer 3, the layer 4 and the layer 5 which are respectively output by the backbone network module without the LAFE module, and the channel numbers of the characteristics of the layer 2, the layer 3, the layer 4 and the layer 5 are 2048, 1024, 512 and 256 respectively;
the number of channels of the features of layers 2,3,4 and 5 is reduced to 256 by 1 x 1 convolution, and then the number of channels is reduced to 256 for the original features C of different layers i Input to the LAFE module in parallel;
in the LAFE module, a three-layer cavity convolution mode is adopted to enlarge a network receptive field so as to strengthen modeling capability of a network on global context information, and each time the cavity convolution is carried out, a feature map of the upper layer needs to fill a plurality of pixel points so as to ensure the size of an output feature map and an original feature C i The same calculation process is shown in the formula (1):
Figure BDA0004103461640000091
where i= {2,3,4,5}, m, n, and p represent the expansion coefficients r= {1,2,3}, F of the three-layer hole convolution m Representing the use of a 3 x 3 convolution with a coefficient of expansion of 1, F n Representing the use of a 3 x 3 convolution with a coefficient of expansion of 2, F p A 3 x 3 convolution with a coefficient of expansion of 3 is shown,
Figure BDA0004103461640000092
the fusion characteristics after three-layer cavity convolution are represented;
after three-layer cavity convolution, in order to supplement channel dimension information of the features, the features are fused
Figure BDA0004103461640000093
Add channel attention gain feature->
Figure BDA0004103461640000094
As shown in formula (2):
Figure BDA0004103461640000095
wherein, avgPool and MaxPool respectively represent average pooling and maximum pooling operation, MLP represents the number of channels compressed first and then expanded by using two full connection layers, and sigma represents a Sigmoid function;
after the channel attention is added, in order to supplement the space dimension information of the features, the features also need to be subjected to
Figure BDA0004103461640000096
Adding spatial attention to get final enhancement features +.>
Figure BDA0004103461640000097
The calculation process is shown as a formula (3), wherein F 7×7 Representing extracting spatial information with a 7 x 7 convolution; finally, enhance feature L i Then it is the original feature C i Fusion characteristics->
Figure BDA0004103461640000098
Enhancement feature->
Figure BDA0004103461640000099
The three types of features are added, as shown in formula (4):
Figure BDA00041034616400000910
Figure BDA00041034616400000911
in the method, in the process of the invention,
Figure BDA0004103461640000101
representing fusion characteristics output through three-layer cavity convolution, < + >>
Figure BDA0004103461640000102
Characteristic of attention output through the channel, +.>
Figure BDA0004103461640000103
Enhanced features representing attentiveness and spatial attentiveness output through a channel, L i Representing enhanced features of the LAFE module output.
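The padding step described for the dilated convolutions (filling the previous feature map with several pixels to preserve its size) is ordinary zero-padding sized to the dilation rate: a 3×3 convolution with dilation r keeps the spatial size when padded by r pixels per side. A minimal NumPy loop sketch, hypothetical rather than the patent's implementation:

```python
import numpy as np

def dilated_conv3x3(f, kernel, r):
    # 3x3 convolution with dilation rate r on a single-channel map f;
    # zero-padding of r pixels per side keeps the output the size of the input
    h, w = f.shape
    padded = np.pad(f, r)
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            # tap offsets are spaced r apart, which is what "dilation" means
            out += kernel[i, j] * padded[i * r:i * r + h, j * r:j * r + w]
    return out
```

Cascading three such convolutions with r = 1, 2, 3 reproduces the growing receptive field of formula (1) while every intermediate map stays the same size as C_i.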
The feature pyramid module takes the feature information output by the backbone network module as input and fuses feature maps of different scales. The MEFF module is added to the improved DBNet to effectively reduce the information loss caused by conventional repeated linear up-sampling, so that the fused features are better extracted and missed and false detections of natural scene text are reduced. Specifically:
the enhanced feature L_5 is up-sampled 2× and added to L_4, and L_4 is up-sampled 2× and added to L_3; these are sent, together with the enhanced feature L_2, to the MEFF module to strengthen the expression of spatial information, yielding the multi-level features M_i = {M_2, M_3, M_4, M_5}, where M_2, M_3, M_4 and M_5 are the layer-2 to layer-5 features output by the feature pyramid module;
the feature M_2 is obtained through the MEFF module: first, the distant feature M_4 is linearly up-sampled 4× and passed through a DCN; next, the feature M_3 is linearly up-sampled 2×; finally, combined with the enhanced feature L_2, the feature M_2 fusing multi-level feature information is obtained through a DCN, as shown in formula (5):
M_2 = DCN(L_2 + DCN(Up(M_4, 4)) + Up(M_3, 2))   (5)
where Up(M_3, 2) and Up(M_4, 4) denote 2× and 4× linear up-sampling of the features M_3 and M_4 respectively, DCN denotes introducing a deformable convolution network when extracting features with the convolution kernel, and L_2 is the layer-2 feature output by the LAFE module;
a 1×1 convolution reduces the number of channels of the multi-level features M_i from 256 to 64; the final multi-level features are obtained by linear up-sampling at different magnifications and are then concatenated to give a feature 1/4 the size of the original image.
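The channel arithmetic of this last step can be checked directly: after the 1×1 convolution each M_i carries 64 channels, and concatenating the four up-sampled maps restores 4 × 64 = 256 channels at 1/4 of the input resolution. A NumPy shape sketch, assuming (as in the original DBNet head) magnifications of 2×, 4× and 8× for M_3, M_4 and M_5 and stubbing out the 1×1 convolution, so inputs are taken as already reduced to 64 channels:

```python
import numpy as np

def up(f, k):
    # nearest-neighbour stand-in for linear up-sampling
    return f.repeat(k, axis=-2).repeat(k, axis=-1)

def head_features(m2, m3, m4, m5):
    # bring every level to M2's resolution, then concatenate along channels:
    # 4 levels x 64 channels = 256 channels at 1/4 of the original image size
    return np.concatenate([m2, up(m3, 2), up(m4, 4), up(m5, 8)], axis=0)
```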
The prediction module of the improved DBNet predicts on the final features output by the feature pyramid module, producing an image probability map and an image threshold map; a differentiable binarization post-processing module computes an approximate binary map from them, and the final text detection result is determined by pixel-to-text-box aggregation post-processing.
Example 2
The embodiment discloses a natural scene text detection system based on multi-level feature enhancement and fusion, which is used for realizing the natural scene text detection method based on multi-level feature enhancement and fusion described in embodiment 1, and as shown in fig. 4, the system comprises the following functional modules:
the data preprocessing module is used for performing Imgaug data enhancement on the original data set and then resizing each image in the data set to 640 × 640 before it is input into the training network;
the improved DBNet network module is used for realizing accurate detection of natural scene text; the improved DBNet modifies both the backbone network module and the feature pyramid module of the original DBNet, wherein the improvement of the backbone network module is: adding a LAFE module, which effectively fuses three-layer dilated convolution, channel attention and spatial attention; and the improvement of the feature pyramid module is: adding a MEFF module, which introduces a deformable convolution network into the fusion process of the multi-level features;
the natural scene text detection module is used for inputting the processed images of the data set into the improved DBNet and acquiring feature information of the target image through the backbone network module with the added LAFE module, where three-layer dilated convolution, channel attention and spatial attention are applied in sequence as the image passes through the LAFE module to enhance the distinction between foreground and background features of the image; the features output by the backbone network module are input into the feature pyramid module with the added MEFF module to output feature maps of different scales supplemented with spatial semantic information, the approximate binary map generated from the probability map and the threshold map is then predicted, and the detection result is obtained through pixel-to-text-box aggregation post-processing.
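Assuming the standard ResNet-50 stage strides (4, 8, 16, 32) and channel widths (256, 512, 1024, 2048) — reasonable defaults for a ResNet50+DCN backbone, though the exact configuration here is an assumption — the shape flow of the pipeline for a 640 × 640 input can be traced with placeholder arrays:

```python
import numpy as np

H = W = 640                                # preprocessed input size
strides  = {2: 4, 3: 8, 4: 16, 5: 32}      # assumed ResNet-50 stage strides
channels = {2: 256, 3: 512, 4: 1024, 5: 2048}  # assumed stage channel widths

# Backbone outputs C2..C5 (batch dimension omitted)
C_feats = {i: np.zeros((channels[i], H // s, W // s))
           for i, s in strides.items()}

# 1x1 convolutions bring every level to 256 channels before the pyramid
M_feats = {i: np.zeros((256, f.shape[1], f.shape[2]))
           for i, f in C_feats.items()}

# Head: 1x1 convs reduce 256 -> 64, each level is upsampled back to the
# 1/4 scale, and the four levels are concatenated into the fused feature
fused = np.concatenate(
    [np.zeros((64, H // 4, W // 4)) for _ in M_feats], axis=0)
print(fused.shape)  # (256, 160, 160)
```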
Example 3
The embodiment discloses a storage medium storing a program which, when executed by a processor, implements the natural scene text detection method based on multi-level feature enhancement and fusion described in embodiment 1.
The storage medium in this embodiment may be a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), a USB flash drive, a removable hard disk, or the like.
Example 4
The embodiment discloses a computing device comprising a processor and a memory for storing a program executable by the processor; when the processor executes the program stored in the memory, the natural scene text detection method based on multi-level feature enhancement and fusion described in embodiment 1 is realized.
The computing device described in this embodiment may be a desktop computer, a notebook computer, a smart phone, a PDA handheld terminal, a tablet computer, a Programmable Logic Controller (PLC), or another terminal device with processor functionality.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any other changes, modifications, substitutions, combinations and simplifications made without departing from the spirit and principle of the present invention are equivalent replacements and fall within the protection scope of the present invention.

Claims (8)

1. A natural scene text detection method based on multi-level feature enhancement and fusion, characterized in that the method realizes accurate detection of natural scene text based on an improved DBNet; the improved DBNet modifies both the backbone network module and the feature pyramid module of the original DBNet, wherein the improvement of the backbone network module is: adding a LAFE module, which effectively fuses three-layer dilated convolution, channel attention and spatial attention; and the improvement of the feature pyramid module is: adding a MEFF module, which introduces a deformable convolution network into the fusion process of the multi-level features;
the specific implementation of the natural scene text detection method comprises the following steps:
1) preprocessing data, including performing Imgaug data enhancement on the original data set and then resizing each image in the data set to 640 × 640 before it is input into the training network;
2) inputting the processed images of the data set into the improved DBNet, and acquiring feature information of the target image through the backbone network module with the added LAFE module, where three-layer dilated convolution, channel attention and spatial attention are applied in sequence as the image passes through the LAFE module to enhance the distinction between foreground and background features of the image; inputting the features output by the backbone network module into the feature pyramid module with the added MEFF module to output feature maps of different scales supplemented with spatial semantic information, then predicting the approximate binary map generated from the probability map and the threshold map, and obtaining the detection result through pixel-to-text-box aggregation post-processing.
2. The natural scene text detection method based on multi-level feature enhancement and fusion according to claim 1, wherein in step 1), the Imgaug data enhancement is as follows: rotation within the (-10°, 10°) range, scaling between 0.5× and 3×, and regularization, random cropping and flipping applied to the original data; this data enhancement effectively improves the network performance of the improved DBNet and makes natural scene text detection more robust and effective.
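A numpy-only sketch of the flip-and-crop portion of such an augmentation pipeline (rotation and scaling are usually delegated to a library such as Imgaug and are omitted here; the function name is illustrative):

```python
import numpy as np

def random_flip_crop(img, crop_size, rng):
    """Random horizontal flip followed by a random crop of an (H, W, 3) image.
    Rotation in (-10, 10) degrees and 0.5x-3x scaling, mentioned in the claim,
    are not implemented in this sketch."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                      # horizontal flip
    h, w, _ = img.shape
    top  = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return img[top:top + crop_size, left:left + crop_size, :]

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(720, 1280, 3), dtype=np.uint8)
patch = random_flip_crop(img, 640, rng)
print(patch.shape)  # (640, 640, 3)
```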
3. The natural scene text detection method based on multi-level feature enhancement and fusion according to claim 2, wherein in step 2), the backbone network module consists of a ResNet50+DCN network containing the LAFE module; the processed data set is input into the backbone network module without the LAFE module, which reads the input natural scene image information and outputs original features of different levels Ci = {C2, C3, C4, C5}, where C2, C3, C4 and C5 are the layer-2, layer-3, layer-4 and layer-5 features output by the backbone network module without the LAFE module, with channel numbers of 256, 512, 1024 and 2048 respectively;
the channel numbers of the layer-2, layer-3, layer-4 and layer-5 features are reduced to 256 by 1 × 1 convolution, and the original features Ci of the different levels are then input to the LAFE module in parallel;
in the LAFE module, three-layer dilated convolution is adopted to enlarge the network receptive field and strengthen the network's ability to model global context information; before each dilated convolution, the feature map of the previous layer is padded with several pixels so that the output feature map keeps the same size as the original feature Ci; the calculation process is shown in formula (1):
C̃i = Fp(Fn(Fm(Ci))) (1)
where i = {2, 3, 4, 5}; m, n and p correspond to the dilation coefficients r = {1, 2, 3} of the three-layer dilated convolution; Fm denotes a 3 × 3 convolution with dilation coefficient 1, Fn a 3 × 3 convolution with dilation coefficient 2, and Fp a 3 × 3 convolution with dilation coefficient 3; and C̃i denotes the fused feature after the three-layer dilated convolution;
after the three-layer dilated convolution, in order to supplement the channel-dimension information of the features, channel attention is added to the fused feature C̃i to obtain the feature Ĉi, as shown in formula (2):
Ĉi = σ(MLP(AvgPool(C̃i)) + MLP(MaxPool(C̃i))) ⊗ C̃i (2)
where AvgPool and MaxPool denote the average pooling and maximum pooling operations respectively, MLP denotes two fully connected layers that first compress and then expand the number of channels, and σ denotes the Sigmoid function;
after the channel attention is added, in order to supplement the spatial-dimension information of the features, spatial attention is further applied to Ĉi to obtain the final enhanced feature C̄i, as shown in formula (3), where F7×7 denotes extracting spatial information with a 7 × 7 convolution; finally, the enhanced feature Li is obtained by adding three kinds of features, namely the original feature Ci, the fused feature C̃i and the enhanced feature C̄i, as shown in formula (4):
C̄i = σ(F7×7([AvgPool(Ĉi); MaxPool(Ĉi)])) ⊗ Ĉi (3)
Li = Ci + C̃i + C̄i (4)
where C̃i denotes the fused feature output by the three-layer dilated convolution, Ĉi denotes the feature output by the channel attention, C̄i denotes the enhanced feature output by the channel attention followed by the spatial attention, and Li denotes the enhanced feature output by the LAFE module.
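Under the assumption that formulas (1)–(4) follow the standard cascaded-dilated-convolution and CBAM-style attention forms (the equation images in the published text are not reproduced here, so the exact expressions are assumptions), the LAFE computation can be sketched in numpy with toy shapes and random weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3x3_dilated(x, w, d):
    """'Same' 3x3 convolution with dilation d on a (C, H, W) map.
    Padding by d pixels keeps the output the same size as the input,
    matching the 'pad pixels before each dilated convolution' step."""
    c_out, c_in, _, _ = w.shape
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (d, d), (d, d)))
    out = np.zeros((c_out, h, wd))
    for i in range(3):
        for j in range(3):
            out += np.einsum('oc,chw->ohw', w[:, :, i, j],
                             xp[:, i * d:i * d + h, j * d:j * d + wd])
    return out

def channel_attention(x, w1, w2):
    """Formula (2): shared two-layer MLP over avg- and max-pooled vectors."""
    avg = x.mean(axis=(1, 2))
    mx  = x.max(axis=(1, 2))
    gate = sigmoid(w2 @ np.maximum(w1 @ avg, 0) +
                   w2 @ np.maximum(w1 @ mx, 0))
    return x * gate[:, None, None]

def spatial_attention(x, w7):
    """Formula (3): 7x7 conv over stacked channel-wise avg/max maps."""
    stats = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    h, wd = stats.shape[1:]
    sp = np.pad(stats, ((0, 0), (3, 3), (3, 3)))
    gate = np.zeros((h, wd))
    for i in range(7):
        for j in range(7):
            gate += np.einsum('c,chw->hw', w7[:, i, j],
                              sp[:, i:i + h, j:j + wd])
    return x * sigmoid(gate)[None]

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16                         # toy sizes (256 channels in the text)
Ci = rng.standard_normal((C, H, W))
ws = [rng.standard_normal((C, C, 3, 3)) * 0.05 for _ in range(3)]
w1 = rng.standard_normal((C // 2, C)) * 0.1  # MLP: compress channels ...
w2 = rng.standard_normal((C, C // 2)) * 0.1  # ... then expand them back
w7 = rng.standard_normal((2, 7, 7)) * 0.1

# Formula (1): cascaded dilated convolutions with dilations 1, 2, 3
Ct = conv3x3_dilated(conv3x3_dilated(conv3x3_dilated(Ci, ws[0], 1),
                                     ws[1], 2), ws[2], 3)
Ch = channel_attention(Ct, w1, w2)           # formula (2)
Cb = spatial_attention(Ch, w7)               # formula (3)
Li = Ci + Ct + Cb                            # formula (4): three-way sum
print(Li.shape)  # (8, 16, 16)
```

All three stages preserve the spatial size, so the final sum in formula (4) is a plain element-wise addition.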
4. The natural scene text detection method based on multi-level feature enhancement and fusion according to claim 3, wherein in step 2), the feature pyramid module takes the feature information output by the backbone network module as input and fuses feature maps of different scales; the MEFF module is added to the feature pyramid module of the improved DBNet in order to effectively reduce the information loss caused by the traditional continuous linear upsampling operation, so that fused features are better extracted and missed and false detections of natural scene text are reduced; the specific procedure is as follows:
the enhanced feature L5 is upsampled 2× and added to the enhanced feature L4, the enhanced feature L4 is upsampled 2× and added to the enhanced feature L3, and these are sent together with the enhanced feature L2 to the MEFF module to enhance the expression of spatial information and obtain the multi-level features Mi = {M2, M3, M4, M5}, where M2, M3, M4 and M5 are the layer-2, layer-3, layer-4 and layer-5 features output by the feature pyramid module;
the feature M2 is obtained through the MEFF module: first, the distant feature M4 is linearly upsampled 4× and then passed through a DCN; next, the feature M3 is linearly upsampled 2×; finally, these are combined with the enhanced feature L2 and passed through a DCN to obtain the feature M2, which fuses multi-level feature information, as shown in formula (5):
M2 = DCN(L2 + DCN(Up(M4, 4)) + Up(M3, 2)) (5)
where Up(M3, 2) and Up(M4, 4) denote linear upsampling of the features M3 and M4 by 2× and 4× respectively, DCN denotes that a deformable convolution network is introduced when the features are extracted by the convolution kernel, and L2 denotes the layer-2 feature output by the LAFE module;
a 1 × 1 convolution reduces the channel number of each multi-level feature Mi from 256 to 64, the final multi-level features are obtained through linear upsampling at different magnifications, and the upsampled features are then concatenated to obtain a feature map at 1/4 the size of the original image.
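The final head of the pyramid — 1 × 1 channel reduction from 256 to 64, per-level upsampling back to the 1/4 scale, and concatenation — can be sketched with numpy placeholder features (toy 1/4-scale size of 40 × 40; nearest-neighbour upsampling stands in for the linear upsampling, and real sizes depend on the input resolution):

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbour stand-in for the linear upsampling."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def conv1x1(x, w):
    """A 1x1 convolution is a per-pixel linear map over channels."""
    return np.einsum('oc,chw->ohw', w, x)

rng = np.random.default_rng(0)
base = 40                                    # toy 1/4-scale spatial size
# Pyramid features M2..M5 at strides 4, 8, 16, 32 with 256 channels each
M = {i: rng.standard_normal((256, base >> (i - 2), base >> (i - 2)))
     for i in range(2, 6)}
W = {i: rng.standard_normal((64, 256)) * 0.05 for i in M}

# Reduce each level to 64 channels, upsample to the 1/4 scale, concatenate
parts = [upsample(conv1x1(M[i], W[i]), 2 ** (i - 2)) for i in sorted(M)]
fused = np.concatenate(parts, axis=0)
print(fused.shape)  # (256, 40, 40)
```

The concatenated map has 4 × 64 = 256 channels at the 1/4 scale, matching the "reduce then splice" description in the claim.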
5. The natural scene text detection method based on multi-level feature enhancement and fusion according to claim 4, wherein in step 2), the prediction module of the improved DBNet predicts on the final features output by the feature pyramid module, predicting an image probability map and an image threshold map respectively; a differentiable binarization post-processing module computes an approximate binary map from them, and a pixel-to-text-box aggregation post-processing step then determines the final text detection result.
6. A natural scene text detection system based on multi-level feature enhancement and fusion, characterized by being used for realizing the natural scene text detection method based on multi-level feature enhancement and fusion according to any one of claims 1 to 5, the system comprising:
the data preprocessing module, used for performing Imgaug data enhancement on the original data set and then resizing each image in the data set to 640 × 640 before it is input into the training network;
the improved DBNet network module, used for realizing accurate detection of natural scene text; the improved DBNet modifies both the backbone network module and the feature pyramid module of the original DBNet, wherein the improvement of the backbone network module is: adding a LAFE module, which effectively fuses three-layer dilated convolution, channel attention and spatial attention; and the improvement of the feature pyramid module is: adding a MEFF module, which introduces a deformable convolution network into the fusion process of the multi-level features;
the natural scene text detection module, used for inputting the processed images of the data set into the improved DBNet and acquiring feature information of the target image through the backbone network module with the added LAFE module, where three-layer dilated convolution, channel attention and spatial attention are applied in sequence as the image passes through the LAFE module to enhance the distinction between foreground and background features of the image; the features output by the backbone network module are input into the feature pyramid module with the added MEFF module to output feature maps of different scales supplemented with spatial semantic information, the approximate binary map generated from the probability map and the threshold map is then predicted, and the detection result is obtained through pixel-to-text-box aggregation post-processing.
7. A storage medium storing a program, wherein the program, when executed by a processor, implements the natural scene text detection method based on multi-level feature enhancement and fusion of any one of claims 1 to 5.
8. A computing device comprising a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the natural scene text detection method based on multi-level feature enhancement and fusion of any one of claims 1 to 5.
CN202310185149.0A 2023-03-01 2023-03-01 Natural scene text detection method, system, storage medium and computing device Pending CN116229445A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310185149.0A CN116229445A (en) 2023-03-01 2023-03-01 Natural scene text detection method, system, storage medium and computing device

Publications (1)

Publication Number Publication Date
CN116229445A true CN116229445A (en) 2023-06-06

Family

ID=86588788




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination