CN114494728B - Small target detection method based on deep learning - Google Patents


Info

Publication number
CN114494728B
CN114494728B (application CN202210123900.XA / CN202210123900A; publication CN114494728A / CN114494728B)
Authority
CN
China
Prior art keywords
feature map
feature
channel
targets
pooling
Prior art date
Legal status
Active
Application number
CN202210123900.XA
Other languages
Chinese (zh)
Other versions
CN114494728A (en)
Inventor
杜金莲
李攀
张潇
苏航
赵青
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202210123900.XA priority Critical patent/CN114494728B/en
Publication of CN114494728A publication Critical patent/CN114494728A/en
Application granted granted Critical
Publication of CN114494728B publication Critical patent/CN114494728B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods


Abstract

The invention discloses a small target detection method based on deep learning, comprising the following steps: data enhancement is applied to the small targets in the training data; features are extracted from the processed images by a feature extraction network, and the resulting feature maps are fused by concatenation; the fused feature map is weighted by a channel attention module and then by a spatial attention module to obtain the final feature map; the extracted potential targets are divided into regular targets and small targets according to their area; RoI Align region pooling is applied to the small target regions, and category judgment and position regression are performed on the pooled results to obtain the final detection result. The mixed attention module improves the region extraction capability of the RPN, the extracted regions are divided into small targets and other targets by area, and RoI Align pooling makes full use of the feature information of the small target regions, so that the network's detection of small targets is improved with little increase in computation.

Description

Small target detection method based on deep learning
Technical Field
The invention relates to the technical field of target detection in computer vision, in particular to a small target detection method based on deep learning.
Background
New-generation technologies represented by artificial intelligence have become one of the important driving forces for promoting the development of social productivity and realizing industrial digitization and modernization. As an important research direction in the field of artificial intelligence, object detection has developed rapidly in recent years. Object detection is one of the four fundamental tasks of computer vision; its goal is to identify the objects in an image and mark their exact positions with rectangular boxes. Within an image, however, the detection performance on small targets has long lagged behind that on large targets. The reason is that small targets occupy few pixels, so it is difficult for a detection method to obtain rich information from such limited pixel data. Yet small target detection is of great significance for practical applications, such as detecting small traffic signs in autonomous driving, identifying small lesions in medical imaging, and searching for people at sea in remote-sensing imagery. In these tasks, improving the detection of small targets can effectively promote the development of social productivity and safeguard people's lives and property.
At present, research on object detection methods falls into two main directions. The first is the R-CNN family of two-stage detection algorithms, in which the first stage extracts candidate regions and the second stage performs position regression and classification on the extracted regions; the representative algorithm is Faster R-CNN. The second is one-stage detection, which discards the region extraction step and performs position regression and classification directly on the feature map; the representative algorithms are SSD and the YOLO series. Research on small target detection usually modifies these methods to improve their performance. The SSD algorithm proposed by Liu et al. detects targets on six feature maps of different sizes, which improves the model's ability to detect small targets, but the network ignores the connections between the feature layers; in particular, the low-level feature maps used to detect small targets lack the rich semantic information of the high-level layers, so the improvement for small targets is limited. The DSSD algorithm proposed by Fu et al. addresses the lack of rich semantic information in the feature maps SSD uses to detect small targets, which causes inaccurate classification and hence false detections of small targets.
The reasons for poor small target detection can be summarized in three points: (1) the feature maps used to detect small targets lack sufficient semantic and detail information; (2) the number of small target instances in datasets is relatively scarce; (3) small targets lose part of their information after repeated downsampling. Existing detection methods enhance target features on top of a general detection model to improve small target detection, but they do not treat targets of different sizes differently: a single unified method detects all sizes, so small targets, whose pixel information is already scarce, lose part of their precious feature information. To solve these problems, the invention provides a small target detection method based on deep learning.
Summary of the Invention
The invention provides a small target detection method based on deep learning that remedies the poor detection of small targets in general object detection methods. The method divides the potential targets extracted by the network model into regular targets and small targets according to their region area, and gives the potential small targets dedicated processing to improve their detection.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
The specific flow of the method is shown in FIG. 1. The small target detection method based on deep learning comprises the following steps:
Step one: perform data enhancement on the small targets in the training data; the enhancement methods used include scaling, flipping, color-gamut distortion, small-target instance copying, and Mosaic augmentation of the pictures in the dataset.
Step two: extract features from the images processed in step one through a feature extraction network that uses ResNet as the backbone; the feature maps of the third and fourth stages are fused by concatenation to form the feature map.
Step three: apply a mixed attention mechanism to the feature map generated in step two to improve the network's ability to distinguish foreground from background; the mixed attention mechanism is serial, weighting the feature map first with a channel attention module and then with a spatial attention module to obtain the final feature map.
Step four: extract the regions of potential targets from the feature map obtained in step three with the RPN region extraction network, and divide the extracted potential targets into regular targets and small targets according to region area. Allowing for later adjustment of target positions, a region with area no larger than 64×64 is taken as a small target region, and a region larger than that as a regular target region.
Step five: perform RoI Align region pooling on the small target regions obtained in step four and RoI Pooling on the other target regions, then perform category judgment and position regression on the pooled results to obtain the final detection result.
Compared with the prior art, the invention has the following advantages:
(1) To address the lack of sufficient semantic and detail information in the feature maps used to detect small targets, the invention uses a ResNet feature extraction network and fuses low-level and high-level feature maps, so that the feature map carries both rich semantic information and spatial detail.
(2) To address the relative scarcity of small target instances in datasets, the invention improves the robustness of the network model through data enhancement; in particular, small-target instance copying and Mosaic augmentation increase the probability that the network captures small targets, thereby improving its detection capability.
(3) To address the loss of feature information that occurs when a single unified method detects targets of all sizes, the invention uses a mixed attention module to improve RPN region extraction, divides the extracted regions into small targets and other targets by area, and applies RoI Align pooling to the small target regions, making full use of their feature information and improving the network's detection of small targets with little increase in computation.
Drawings
Fig. 1 is a schematic view of the overall structure of the present invention.
Fig. 2 is a network diagram of feature extraction of the present invention.
Fig. 3 is a network diagram of an attention module of the present invention.
Fig. 4 is a region extraction network diagram of the present invention.
FIG. 5a is a schematic diagram of RoI Pooling area pooling in accordance with the present invention.
FIG. 5b is a schematic representation of the pooling of RoI Align regions of the present invention.
Detailed Description
The invention provides a small target detection method based on deep learning, described in further detail below with reference to the accompanying drawings.
The overall structure of the invention is schematically shown in fig. 1, and comprises:
First, data enhancement is applied to the image data in the training dataset; the enhancement methods comprise five means: scaling, flipping, color-gamut distortion, small-target instance copying, and Mosaic augmentation. Second, features are extracted from the enhanced images with a ResNet feature extraction network that uses feature fusion in the extraction stage; the feature extraction network is shown in fig. 2. A mixed attention mechanism, applied in serial form, then improves the network's ability to distinguish foreground from background in the feature map; the attention module network is shown in fig. 3. Next, an RPN region extraction network extracts the regions of potential targets from the attention-processed feature map, and the extracted potential target regions are divided into regular targets and small targets by area, with 64×64 as the threshold separating the two. Finally, RoI Pooling is applied to the regular target regions and RoI Align to the small target regions; RoI Align pooling avoids the feature loss caused by the two quantization-and-rounding steps in the usual processing flow, thereby improving small target detection. Category judgment and position regression are then performed on the pooled results to obtain the final detection result.
The feature extraction network diagram of the present invention is shown in fig. 2, and includes:
ResNet50 overall structure: the ResNet network has five stages in total. The first stage comprises one convolution layer with a 7×7 kernel and one max-pooling layer with a 3×3 kernel and stride 2. The second to fifth stages are similar in structure, each comprising one convolution block and several identity blocks; the numbers of identity blocks are 2, 3, 5, and 2 respectively. The input and output dimensions of a convolution block differ, so it is used to change the dimensionality of the network; the input and output dimensions of an identity block are the same, so identity blocks can be connected in series to deepen the network.
Feature fusion stage: feature fusion is shown by the broken line in fig. 2. In the prior art, the stage-four feature map is generally used for the subsequent region extraction; however, while that map carries the rich semantic information of a high-level feature map, it lacks the spatial detail of a low-level one. The invention therefore fuses the feature maps generated by stages three and four, enhancing both the detail and the semantic information of the result. Feature fusion generally uses either pixel-wise addition or channel concatenation; considering the negative effects of direct pixel-wise addition, channel concatenation is adopted here. Taking a three-channel picture of size 600×600 as an example: stage three yields a feature map C1 with width and height 75 and 512 channels, and stage four yields a feature map C2 with width and height 38 and 1024 channels. C1 is resized to the size of C2 by max pooling to obtain C3; C2 and C3 are then concatenated along the channel dimension; finally, the channel count of the concatenated map is adjusted by a 1×1 convolution to obtain the final fused feature map C.
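The fusion walk-through above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the patent's implementation: the max pooling uses adaptive bins to reach C2's exact spatial size, and the 1×1-convolution weights are random placeholders.

```python
import numpy as np

def adaptive_max_pool(x, out_hw):
    # x: (C, H, W) -> (C, out_h, out_w); bins tile the input as in adaptive pooling
    C, H, W = x.shape
    oh, ow = out_hw
    out = np.empty((C, oh, ow), dtype=x.dtype)
    for i in range(oh):
        h0, h1 = (i * H) // oh, -((-(i + 1) * H) // oh)      # ceil division for bin end
        for j in range(ow):
            w0, w1 = (j * W) // ow, -((-(j + 1) * W) // ow)
            out[:, i, j] = x[:, h0:h1, w0:w1].max(axis=(1, 2))
    return out

def fuse(c1, c2, w_1x1):
    # c1: stage-three map (512, 75, 75); c2: stage-four map (1024, 38, 38)
    c3 = adaptive_max_pool(c1, c2.shape[1:])    # resize C1 to C2's spatial size
    cat = np.concatenate([c2, c3], axis=0)      # channel concatenation -> (1536, 38, 38)
    C_in, H, W = cat.shape
    # a 1x1 convolution is a per-pixel linear map over channels
    return (w_1x1 @ cat.reshape(C_in, H * W)).reshape(-1, H, W)

rng = np.random.default_rng(0)
c1 = rng.random((512, 75, 75), dtype=np.float32)
c2 = rng.random((1024, 38, 38), dtype=np.float32)
w = rng.random((1024, 1536), dtype=np.float32) / 1536.0     # placeholder 1x1 weights
fused = fuse(c1, c2, w)                                     # -> (1024, 38, 38)
```

The shapes follow the 600×600 example: 512 + 1024 = 1536 concatenated channels are projected back to 1024 by the 1×1 convolution.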
The attention module of the present invention is shown in fig. 3, and includes:
Because each channel and position in the feature map carries different information, using the feature map directly for region extraction would ignore certain important channel and position information and reduce the extraction capability of the region extraction network. The invention therefore applies an attention mechanism before region extraction so that the network automatically judges which channels and positions in the feature map are important. The serial structure of the classic attention module CBAM is adopted: first a channel attention module judges the importance of the feature map's channels, then a spatial attention module judges the importance of its spatial positions. The specific steps are as follows:
Step one: the input feature map F is processed by the maximum pooling block and the average pooling block of the channel attention module, respectively. Taking the maximum pooling block as an example, the feature map F firstly obtains compressed channel information through global maximum pooling, the shape is changed from 1024×38×38 to 1024×1×1, then one-dimensional convolution with a convolution kernel size of k is used for aggregating the compressed channel information, and due to the fact that the convolution has the property of parameter sharing, the number of parameters of a module can be effectively reduced by using the one-dimensional convolution compared with the use of a full-link layer in the conventional channel attention, wherein the convolution kernel size k is determined by the following formula, and C is the number of channels of the feature map:
Step two: and (3) carrying out pixel-by-pixel addition operation on the channel aggregation information processed by the full-maximum pooling block and the average pooling block obtained in the step (I), obtaining the weight of each channel importance through sigmoid nonlinear activation, and then carrying out pixel-by-pixel multiplication operation on the channel importance weight and the original feature map F to obtain the feature map F 1 reinforced by the channel attention module.
Step three: and D, performing spatial attention module processing on the feature map F 1 obtained in the second step. The feature map F 1 is subjected to global maximum pooling and global average pooling respectively in the channel dimension to obtain compressed space information, the shape is changed from 1024×38×38 to 1×38×38, then cascade operation is carried out on the pooling result in the channel dimension to obtain compressed information with the shape of 2×38×38, and then two-dimensional convolution with the convolution kernel size of 7 is used for polymerizing the compressed space information to obtain aggregation information with the shape of 1×38×38.
Step four: and (3) carrying out pixel-by-pixel addition operation on the spatial aggregation information obtained in the step (III), then carrying out nonlinear activation through sigmoid to obtain a weight of spatial importance, and then carrying out pixel-by-pixel multiplication operation on the weight and the original feature map F 1 to obtain a feature map F 2 reinforced by a spatial attention module.
The area extraction network diagram of the present invention is shown in fig. 4, and includes the following steps:
step one: and carrying out feature integration on the feature map processed by the attention module to enhance robustness, wherein the feature map is operated to be processed by a convolution layer with a convolution kernel size of 3.
Step two: m prior frames are laid on the feature map, the set of the prior frame sizes is {8,16, 32}, and the set of the aspect ratios is {0.5,1,2}, that is, each feature point corresponds to 9 different prior frames.
Step three: judging whether the prior frame is a foreground or a background, namely whether the prior frame contains targets or not, wherein a 1×1 convolution is used for obtaining an information matrix with the channel number of 2×9, the information matrix is used for predicting whether 9 prior frames on each feature point on the feature map contain targets or not, and then the foreground prior frame is obtained through softmax classification.
Step four: and carrying out coordinate adjustment on the prior frames, and obtaining an information matrix with 4 multiplied by 9 channels by using 1 multiplied by 1 convolution for predicting the change of the position coordinates of 9 prior frames on each feature point on the feature map.
Step five: and (3) screening the areas obtained in the step (III) and the step (IV), preventing the extracted areas from being too small or exceeding the boundary, sorting according to the softmax score, taking out corresponding suggestion frames, and using non-maximum suppression for the suggestion frames to obtain a final target area.
The region extraction network extracts the regions of potential targets from the feature map. Each region is described by four values corresponding to its upper-left and lower-right corners, from which the area of the potential target region can be computed. To enhance the detection of small targets, the potential target regions are divided into small target regions and other target regions with a threshold of 64×64: regions with area no larger than this value are small target regions, and the rest are other target regions.
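The area-based split described above amounts to a few lines; this sketch assumes corner-format regions and applies the 64×64 = 4096 threshold inclusively on the small-target side, matching the description.

```python
def split_by_area(regions, thr=64 * 64):
    # regions: (x1, y1, x2, y2) boxes; area <= 64*64 -> small target region
    small, regular = [], []
    for x1, y1, x2, y2 in regions:
        (small if (x2 - x1) * (y2 - y1) <= thr else regular).append((x1, y1, x2, y2))
    return small, regular
```

Small regions then go to RoI Align and regular regions to RoI Pooling.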
The RoI Pooling region pooling schematic diagram of the present invention is shown in fig. 5a, and includes:
RoI Pooling is used to pool regular targets, and the pooled results are used for the final position and classification predictions. Take a picture of size 256×256 and an extracted region of size 72×72 as an example. First, the picture passes through the convolutional neural network to produce a 16×16 feature map; the region maps to 72/16 = 4.5 on that feature map, and the first quantization of RoI Pooling rounds this down to a 4×4 mapped region. Then, taking an output size of 3×3 as an example, the 4×4 mapped region is divided into 9 bins of size 1.33×1.33 each, and the second quantization of RoI Pooling rounds each bin down to 1×1. Finally, the maximum value of each bin is taken, giving a final result of size 3×3.
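The two quantization steps in this example reduce to simple floor arithmetic; the helper below just reproduces the 256 → 16 → 4 → 1 numbers of the walk-through.

```python
import math

def roi_pool_mapping(roi_size, img_size=256, feat_size=16, out=3):
    # First quantization: project the RoI onto the feature map and floor.
    mapped = math.floor(roi_size * feat_size / img_size)   # 72 * 16/256 = 4.5 -> 4
    # Second quantization: floor the per-bin size of the out x out grid.
    bin_size = math.floor(mapped / out)                    # 4 / 3 = 1.33 -> 1
    return mapped, bin_size
```

Each floor discards fractional coverage, which is exactly the feature loss the next paragraph attributes to RoI Pooling.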
RoI Pooling thus quantizes the proposed region twice, losing part of the feature information in the process. The lost information has little effect on the detection of large targets, but it is hard to afford for small targets, which carry little information to begin with; the invention therefore adopts RoI Align region pooling for small target detection.
The schematic of the pooling of the RoI Align region of the present invention is shown in FIG. 5b, comprising:
RoI Align is used to pool small targets, and the pooled results are used for the final position and classification predictions. Take a picture of size 256×256 and an extracted region of size 72×72 as an example. First, the picture passes through the convolutional neural network to produce a 16×16 feature map; the region maps to 72/16 = 4.5 on that feature map, and RoI Align keeps the fractional part, giving a 4.5×4.5 mapped region. Then, taking an output size of 3×3 as an example, the 4.5×4.5 mapped region is divided into 9 bins of size 1.5×1.5 each, again keeping the fractional part. Finally, each bin is divided evenly into four parts; the value at the center of each part is obtained by bilinear interpolation, and the four sampled values are pooled to give a final result of size 3×3.
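The sampling scheme above can be sketched as follows. The 2×2 sampling grid per bin follows the four-center-point description, and max-pooling the samples is an assumption (averaging is equally common); everything operates on a toy feature map.

```python
import numpy as np

def bilinear(feat, y, x):
    # Sample feat (H, W) at a continuous location (y, x).
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def roi_align(feat, roi, out=3, samples=2):
    # roi = (y1, x1, y2, x2) in feature-map coordinates, fractions kept.
    # Each output bin samples samples*samples interior points and pools them.
    y1, x1, y2, x2 = roi
    bh, bw = (y2 - y1) / out, (x2 - x1) / out
    result = np.empty((out, out))
    for i in range(out):
        for j in range(out):
            vals = [bilinear(feat,
                             y1 + bh * (i + (si + 0.5) / samples),
                             x1 + bw * (j + (sj + 0.5) / samples))
                    for si in range(samples) for sj in range(samples)]
            result[i, j] = max(vals)
    return result

feat = np.arange(256, dtype=float).reshape(16, 16)   # toy 16x16 feature map
pooled = roi_align(feat, (0.0, 0.0, 4.5, 4.5))       # the 4.5x4.5 mapped region
```

No coordinate is ever rounded, so every feature point under the region contributes to the 3×3 output.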
RoI Align pooling makes effective use of every feature point in the region and benefits the subsequent detection of small targets. After the potential target regions are pooled by RoI Pooling and RoI Align, the features are further extracted by the fifth stage of ResNet, followed by average pooling and a flatten operation; finally, fully connected layers produce the final position and classification predictions.
The detailed description above covers only specific practical embodiments of the invention; it is not intended to limit the scope of the invention, and all equivalent embodiments or modifications that do not depart from the spirit of the invention fall within its scope.

Claims (4)

1. A small target detection method based on deep learning, characterized by comprising the following steps:
step one: the method comprises the steps of carrying out data enhancement on small targets in training data, wherein the used enhancement method comprises the steps of zooming, overturning, color gamut distortion, replication of small target examples and mosaics enhancement on pictures in a data set;
step two: extracting the characteristics of the image processed in the first step through a characteristic extraction network, wherein the characteristic extraction network uses ResNet as a backbone network, and the characteristic images of the third stage and the fourth stage are fused through cascading to be used as characteristic images;
step three: the mixed attention mechanism is used for improving the distinguishing capability of the network to the foreground and the background in the feature map, the mixed attention mechanism is in a serial form, and the feature map is weighted by the channel attention module and then is subjected to the space attention module to obtain a final feature map;
Step four: extracting the areas of potential targets in the feature map obtained in the third step by using an RPN area extraction network, and dividing the extracted potential targets into conventional targets and small targets according to the area size; taking an area with the area being more than or equal to 64 multiplied by 64 as a small target area, and taking a target area with the area being more than 64 multiplied by 64 as a large target area;
step five: performing RoI alignment region pooling operation on the small target region obtained in the step four, performing RoI Pooling region pooling operation on other target regions, and performing category judgment and position regression on the pooled result to obtain a final detection result;
The feature extraction network includes:
ResNet50 overall structure: the ResNet network has five stages in total; the first stage comprises one convolution layer with a 7×7 kernel and one max-pooling layer with a 3×3 kernel and stride 2; the second to fifth stages are similar in structure, each comprising one convolution block and several identity blocks, the numbers of identity blocks being 2, 3, 5, and 2 respectively; the input and output dimensions of a convolution block differ, so it is used to change the dimensionality of the network, while the input and output dimensions of an identity block are the same, so identity blocks are connected in series to deepen the network; the attention module includes:
using an attention mechanism so that the network automatically judges the important channels and positions in the feature map; the serial structure of the attention module CBAM is adopted: first a channel attention module judges the importance of the feature map's channels, then a spatial attention module judges the importance of its spatial positions; the specific steps are as follows:
Step one: the input feature map F is processed by a maximum pooling block and an average pooling block of the channel attention module respectively; taking the maximum pooling block as an example, the feature map F firstly obtains compressed channel information through global maximum pooling, the shape is changed from 1024×38×38 to 1024×1×1, then one-dimensional convolution with a convolution kernel size of k is used for aggregating the compressed channel information, and due to the fact that the convolution has the property of parameter sharing, the number of parameters of a module can be effectively reduced by using the one-dimensional convolution compared with the use of a full-link layer in the conventional channel attention, wherein the convolution kernel size k is determined by the following formula, and C is the number of channels of the feature map:
step two: carrying out pixel-by-pixel addition operation on the channel aggregation information processed by the full-maximum pooling block and the average pooling block obtained in the step one, obtaining the weight of each channel importance through sigmoid nonlinear activation, and then carrying out pixel-by-pixel multiplication operation on the channel aggregation information and the original feature map F to obtain a feature map F 1 reinforced by a channel attention module;
Step three: carrying out space attention module processing on the feature map F 1 obtained in the second step; the feature map F 1 is subjected to global maximum pooling and global average pooling respectively in the channel dimension to obtain compressed space information, the shape is changed from 1024×38×38 to 1×38×38, then cascade operation is carried out on the pooling result in the channel dimension to obtain compressed information with the shape of 2×38×38, and then two-dimensional convolution with the convolution kernel size of 7 is used for polymerizing the compressed space information to obtain aggregation information with the shape of 1×38×38;
Step four: and (3) carrying out pixel-by-pixel addition operation on the spatial aggregation information obtained in the step (III), then carrying out nonlinear activation through sigmoid to obtain a weight of spatial importance, and then carrying out pixel-by-pixel multiplication operation on the weight and the original feature map F 1 to obtain a feature map F 2 reinforced by a spatial attention module.
2. The deep learning-based small target detection method according to claim 1, wherein: first, data enhancement is applied to the image data in the training data set using five means: scaling, flipping, color-gamut distortion, copying of small-target instances, and Mosaic enhancement. Second, features are extracted from the enhanced images with the ResNet feature extraction network, and a feature fusion technique is used during the feature extraction stage. A mixed attention mechanism, applied in serial form, improves the network's ability to distinguish foreground from background in the feature map. The RPN region extraction network then extracts regions of potential targets from the attention-processed feature map; the extracted regions are divided into conventional targets and small targets by area, with 64×64 as the threshold distinguishing small targets from conventional targets. Finally, RoI Pooling is applied to the conventional target regions and RoI Align to the small target regions; RoI Align avoids the feature loss caused by the two quantization roundings in the network processing flow and thereby improves small target detection. Category judgment and position regression are performed on the pooled results to obtain the final detection result.
3. The deep learning-based small target detection method according to claim 1, wherein the feature fusion stage: the feature maps generated in stages three and four are fused to strengthen the detail information and semantic information of the feature map. Feature fusion may use pixel-wise addition or channel concatenation; considering the negative effects of direct pixel addition, channel concatenation is adopted. Taking a three-channel picture of size 600×600 as an example: after stage three the picture yields a feature map C1 with width and height 75 and 512 channels, and after stage four a feature map C2 with width and height 38 and 1024 channels. C1 is resized to the size of C2 by maximum pooling to obtain a feature map C3; C2 and C3 are then concatenated along the channel direction; finally, a 1×1 convolution adjusts the channel count of the concatenated map to obtain the final fused feature map C.
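A minimal sketch of the cascade fusion described above, assuming the stated shapes (C1: 512×75×75, C2: 1024×38×38). Adaptive max pooling stands in for the "maximum pooling" resize from 75×75 to 38×38, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StageFusion(nn.Module):
    """Channel-cascade fusion of the stage-three and stage-four maps:
    C1 is max-pooled to C2's spatial size to become C3, concatenated with
    C2 along the channel axis, and a 1x1 convolution restores the channel
    count to that of C2."""

    def __init__(self, c1_channels: int = 512, c2_channels: int = 1024):
        super().__init__()
        self.reduce = nn.Conv2d(c1_channels + c2_channels, c2_channels, kernel_size=1)

    def forward(self, c1, c2):
        # Resize C1 to C2's spatial size via adaptive max pooling (75 -> 38)
        c3 = F.adaptive_max_pool2d(c1, c2.shape[-2:])
        fused = torch.cat([c2, c3], dim=1)   # 1024 + 512 = 1536 channels
        return self.reduce(fused)            # 1x1 conv back to 1024 channels
```

With the shapes above, `StageFusion()(c1, c2)` yields the fused map C of shape 1024×38×38, carrying both the detail of stage three and the semantics of stage four.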
4. The deep learning-based small target detection method according to claim 1, wherein the region extraction network comprises the following steps:
step one: feature integration is carried out on the feature graphs processed by the attention module so as to enhance robustness, and the feature graphs are operated to be processed by a convolution layer with the convolution kernel size of 3;
Step two: paving m prior frames on the feature map, wherein the size sets of the prior frames are {8,16, 32}, and the aspect ratio sets are {0.5,1,2}, namely each feature point corresponds to 9 different prior frames;
Step three: judging whether the prior frame is a foreground or a background, namely whether the prior frame contains targets or not, wherein 1X 1 convolution is used for obtaining an information matrix with the channel number of 2X 9, the information matrix is used for predicting whether 9 prior frames on each feature point on the feature map contain targets or not, and then the foreground prior frame is obtained through softmax classification;
Step four: carrying out coordinate adjustment on the prior frames, and obtaining an information matrix with 4 multiplied by 9 channels by using 1 multiplied by 1 convolution, wherein the information matrix is used for predicting the change of the position coordinates of 9 prior frames on each feature point on the feature map;
Step five: and (3) screening the areas obtained in the step (III) and the step (IV), preventing the extracted areas from being too small or exceeding the boundary, sorting according to the softmax score, taking out corresponding suggestion frames, and using non-maximum suppression for the suggestion frames to obtain a final target area.
CN202210123900.XA 2022-02-10 2022-02-10 Small target detection method based on deep learning Active CN114494728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210123900.XA CN114494728B (en) 2022-02-10 2022-02-10 Small target detection method based on deep learning


Publications (2)

Publication Number Publication Date
CN114494728A CN114494728A (en) 2022-05-13
CN114494728B true CN114494728B (en) 2024-06-07

Family

ID=81477644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210123900.XA Active CN114494728B (en) 2022-02-10 2022-02-10 Small target detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN114494728B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116766213B (en) * 2023-08-24 2023-11-03 烟台大学 Bionic hand control method, system and equipment based on image processing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method
CN112801158A (en) * 2021-01-21 2021-05-14 中国人民解放军国防科技大学 Deep learning small target detection method and device based on cascade fusion and attention mechanism
CN113052185A (en) * 2021-03-12 2021-06-29 电子科技大学 Small sample target detection method based on fast R-CNN
CN113673618A (en) * 2021-08-26 2021-11-19 河南中烟工业有限责任公司 Tobacco insect target detection method fused with attention model




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant