CN114494728A - Small target detection method based on deep learning - Google Patents


Info

Publication number
CN114494728A
CN114494728A
Authority
CN
China
Prior art keywords
feature map
feature
pooling
channel
region
Prior art date
Legal status
Granted
Application number
CN202210123900.XA
Other languages
Chinese (zh)
Other versions
CN114494728B (en)
Inventor
杜金莲
李攀
张潇
苏航
赵青
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202210123900.XA priority Critical patent/CN114494728B/en
Publication of CN114494728A publication Critical patent/CN114494728A/en
Application granted granted Critical
Publication of CN114494728B publication Critical patent/CN114494728B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/047Probabilistic or stochastic networks
    • G06N3/048Activation functions
    • G06N3/08Learning methods


Abstract

The invention discloses a small target detection method based on deep learning, which comprises the following steps. Data enhancement is first applied to the small targets in the training data. Features are then extracted from the processed images by a feature extraction network, and the feature maps are fused by channel cascade to obtain the working feature map. The feature map is weighted by a channel attention module and then passed through a spatial attention module to obtain the final feature map. The potential targets extracted from this map are divided by area into regular targets and small targets. RoI Align region pooling is applied to the small-target regions, and category judgment and position regression on the pooling results give the final detection result. The mixed attention module improves the region extraction capability of the RPN, the extracted regions are divided by area into small targets and other targets, and RoI Align pooling on the small-target regions makes full use of their feature information, improving the network's small-target detection capability while limiting the increase in computation.

Description

Small target detection method based on deep learning
Technical Field
The invention relates to the technical field of target detection in computer vision, in particular to a small target detection method based on deep learning.
Background
A new generation of technology represented by artificial intelligence is becoming one of the important driving forces promoting the development of social productivity and the digitization and modernization of industry. Object detection, an important research direction in the field of artificial intelligence, has developed rapidly in recent years. It is one of the four major tasks of computer vision, aiming to identify the objects in a picture and mark their locations with rectangular boxes. The detection of small objects, however, lags far behind that of large objects in the same picture: small targets occupy few pixels, so detection methods struggle to obtain rich information from such limited pixel data. Yet small object detection matters greatly to human production practice, for example the detection of small traffic signs in automatic driving, the identification of small lesions in medical images, and the search and rescue of people at sea in remote sensing images; improving small-target detection in these tasks can effectively promote social productivity and safeguard people's lives and property.
Current research on target detection methods falls into two major directions. The first is the two-stage RCNN family: the first stage extracts candidate regions and the second stage performs position regression and classification on the extracted regions; the representative algorithm is Faster RCNN. The second is the one-stage family, which abandons the region extraction step and performs position regression and classification directly on the feature map; the representative algorithms are SSD and the YOLO series. Research on small target detection generally modifies these methods to improve their detection effect. The SSD algorithm proposed by Liu et al. detects targets on six feature maps of different sizes, improving the model's ability to detect small targets; but because the network ignores the relations among feature layers, and in particular the low-level feature maps used to detect small targets lack the rich semantic information of the higher layers, its improvement on small target detection is limited. The DSSD algorithm proposed by Fu et al. observes that the feature maps SSD uses to detect small targets lack rich semantic information, causing the model to classify targets inaccurately and produce false detections of small targets.
The reasons for the poor detection of small targets can be summarized in three points: (1) the feature maps used to detect small targets lack sufficient semantic and detail information; (2) the number of small-target instances in the data sets is deficient; (3) small targets lose part of their information after multiple rounds of downsampling. Existing methods improve small-target detection by enhancing target features on top of a general detection model, but they do not treat targets of different sizes differently; detecting all sizes with a uniform method causes small targets, which carry little pixel information, to lose part of their valuable feature information. To solve these problems, the invention proposes a small target detection method based on deep learning.
Summary of the invention:
The invention provides a small target detection method based on deep learning that improves the poor small-target detection performance of general target detection methods. The method divides the potential target regions extracted by the network model into regular targets and small targets according to area, and gives the potential small targets dedicated processing to improve their detection effect.
In order to achieve the purpose, the invention adopts the following technical scheme:
the specific flow of the method is shown in the attached figure 1, and the small target detection method based on deep learning comprises the following steps:
the method comprises the following steps: and carrying out data enhancement on the small target in the training data, wherein the used enhancement method comprises the steps of zooming, turning, color gamut distortion, copying of the small target instance and Mosaic enhancement on the picture in the data set.
Step two: and (3) performing feature extraction on the image processed in the step one through a feature extraction network, wherein the feature extraction network uses ResNet50 as a backbone network, and feature graphs of the third stage and the fourth stage are fused through cascade connection to serve as feature graphs.
Step three: and (3) using a mixed attention mechanism to the feature map generated in the step two to improve the distinguishing capability of the foreground and the background in the feature map by the network, wherein the mixed attention mechanism uses a serial form, and firstly weighting the feature map by a channel attention module and then obtaining a final feature map by a space attention module.
Step four: and (4) extracting the region of the potential target in the feature map obtained in the network extraction step three by using the RPN region, and dividing the extracted potential target into a conventional target and a small target according to the area size. In consideration of the adjustment of the target position thereafter, a region having a region size of 64 × 64 or less is set as a small target region, and a target region larger than this value is set as a large target region.
Step five: carrying out RoI Align regional Pooling operation on the small target region obtained in the fourth step, carrying out RoI Pooling regional Pooling operation on other target regions, carrying out category judgment and position regression on Pooling results to obtain final detection results
Compared with the prior art, the invention has the following advantages:
(1) To address the lack of semantic and detail information in the feature maps used for small target detection, the invention uses a ResNet50 feature extraction network and fuses the low-level and high-level feature maps, so that the feature map carries both rich semantic information and spatial detail.
(2) To address the deficient number of small-target instances in the data set, data enhancement improves the robustness of the network model; in particular, small-target instance copying and Mosaic enhancement raise the probability that the network captures small targets, improving its detection capability.
(3) To address the loss of small-target feature information when a uniform method detects targets of all sizes, the mixed attention module improves the region extraction capability of the RPN, the extracted regions are divided by area into small targets and other targets, and RoI Align pooling on the small-target regions makes full use of their feature information, improving the network's small-target detection capability while limiting the increase in computation.
Drawings
Fig. 1 is a schematic view of the overall structure of the present invention.
Fig. 2 is a diagram of a feature extraction network of the present invention.
FIG. 3 is a network diagram of an attention module of the present invention.
Fig. 4 is a diagram of an area extraction network of the present invention.
FIG. 5a is a schematic illustration of the RoI Pooling region pooling of the present invention.
FIG. 5b is a schematic illustration of the RoI Align region pooling of the present invention.
Detailed Description
The small target detection method based on deep learning proposed by the present invention is further described in detail below with reference to the drawings of the specification.
The overall structure of the invention is schematically shown in fig. 1, and comprises:
First, data enhancement is applied to the image data in the training set using five means: scaling, flipping, color-gamut distortion, small-target instance copying, and Mosaic enhancement. Second, features are extracted from the enhanced images with the ResNet50 feature extraction network; it is particularly pointed out that feature fusion is used in this stage, and the feature extraction network is shown in FIG. 2. Next, a mixed attention mechanism, applied in serial form, improves the network's ability to distinguish foreground from background in the feature map; the attention module network is shown in FIG. 3. Then an RPN region extraction network extracts potential target regions from the attention-processed feature map, and the extracted regions are divided into regular targets and small targets by area, with 64 × 64 as the threshold separating them. Finally, RoI Pooling is applied to the regular target regions and RoI Align to the small target regions; RoI Align effectively avoids the feature loss caused by the two quantization roundings in the standard processing flow, improving small-target detection. Category judgment and position regression on the pooling results then give the final detection result.
The feature extraction network diagram of the present invention is shown in fig. 2, and includes:
ResNet50 overall structure: the ResNet50 network has five stages. The first stage comprises a convolution layer with a 7 × 7 kernel and a max pooling layer with a 3 × 3 kernel and stride 2. Stages two to five share a similar structure, each consisting of one convolution block followed by several identity blocks, numbering 2, 3, 5 and 2 respectively. The input and output dimensions of a convolution block differ, so it changes the dimensions of the network; the input and output dimensions of an identity block are the same, so identity blocks can be chained to deepen the network.
Feature fusion stage: feature fusion is shown by the dotted lines in FIG. 2. The prior art generally uses the stage-four feature map for subsequent region extraction; however, although that map has the abundant semantic information of a high-level feature map, it lacks the spatial detail of the low-level maps. The invention therefore fuses the feature maps generated in stages three and four, enhancing both the detail and the semantic information of the feature map. Feature fusion generally uses pixel addition or channel cascade; considering the negative effects of direct pixel addition, the invention fuses by channel cascade. Take a three-channel picture of size 600 × 600 as an example: after stage three the picture yields a feature map C1 with width and height 75 and 512 channels, and after stage four a feature map C2 with width and height 38 and 1024 channels. C1 is reduced by max pooling to the spatial dimensions of C2, giving feature map C3; C2 and C3 are then concatenated along the channel direction, and finally a 1 × 1 convolution adjusts the channel count of the concatenated map to give the final fused feature map C.
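The shape arithmetic of the cascade fusion described above can be checked with a small sketch (the output channel count of the 1 × 1 convolution is an assumption; shapes are (channels, height, width)):

```python
import math

def fuse_shapes(c1=(512, 75, 75), c2=(1024, 38, 38), out_channels=1024):
    """Track shapes through the cascade fusion: C1 is max-pooled with
    stride 2 (ceil mode) to match C2's spatial size, the maps are
    concatenated on the channel axis, and a 1x1 conv adjusts channels."""
    ch1, h1, w1 = c1
    ch2, h2, w2 = c2
    # stride-2 max pooling, ceil mode: 75 -> ceil(75 / 2) = 38
    c3 = (ch1, math.ceil(h1 / 2), math.ceil(w1 / 2))
    assert c3[1:] == (h2, w2), "spatial sizes must match before concatenation"
    concat = (ch2 + c3[0], h2, w2)      # 1536 x 38 x 38 after cascade
    fused = (out_channels, h2, w2)      # after the 1x1 convolution
    return c3, concat, fused
```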
The attention module of the present invention is shown in fig. 3 and includes:
Each channel and position in the feature map carries different information; if the feature map were used directly for region extraction, some important channel and position information could be ignored, reducing the capability of the region extraction network. Before region extraction the invention therefore applies an attention mechanism so that the network automatically judges the important channels and positions in the feature map. The invention adopts the serial structure of the classical attention module CBAM: a channel attention module first judges the importance of the feature map's channels, and a spatial attention module then judges the importance of its spatial positions. The specific steps are as follows:
the method comprises the following steps: the input feature map F is processed by the maximum pooling block and the average pooling block of the channel attention module, respectively. Taking the maximum pooling block as an example, the feature map F is first subjected to global maximum pooling to obtain compressed channel information, the shape of the compressed channel information is changed from 1024 × 38 × 38 to 1024 × 1 × 1, and then the compressed channel information is aggregated by using one-dimensional convolution with a convolution kernel size of k, because the convolution has the property of parameter sharing, compared with using a fully-connected layer in the conventional channel attention, the parameter number of modules can be effectively reduced by using the one-dimensional convolution, wherein the convolution kernel size k is determined by the following formula, where C is the channel number of the feature map:
k = ψ(C) = | log2(C)/γ + b/γ |_odd, where | · |_odd denotes taking the nearest odd number and the constants γ and b are set to 2 and 1.
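The adaptive kernel-size rule can be sketched as a short function (an ECA-style rule; the defaults γ = 2 and b = 1 are the usual convention and an assumption here):

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """1-D convolution kernel size from channel count:
    k = |log2(C)/gamma + b/gamma|, rounded to the nearest odd number."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 else t + 1
```

For the 1024-channel feature map in step one this gives k = 5.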
step two:performing pixel-by-pixel addition operation on the channel aggregation information processed by the full-maximum pooling block and the average pooling block obtained in the step one, then performing sigmoid nonlinear activation to obtain the weight of the importance of each channel, and then performing pixel-by-pixel multiplication operation on the weight and the original feature map F to obtain a feature map F reinforced by a channel attention module1
Step three: obtaining a characteristic diagram F in the second step1Spatial attention module processing is performed. Feature map F1And performing global maximum pooling and global mean pooling on channel dimensions to obtain compressed spatial information, wherein the shape of the compressed spatial information is changed from original 1024 × 38 × 38 to 1 × 38 × 38, performing cascade operation on pooling results on the channel dimensions to obtain compressed information with the shape of 2 × 38 × 38, and aggregating the compressed spatial information by using two-dimensional convolution with a convolution kernel size of 7 to obtain aggregated information with the shape of 1 × 38 × 38.
Step four: carrying out pixel-by-pixel addition operation on the space aggregation information obtained in the step three, then obtaining a weight value of space importance through sigmoid nonlinear activation, and then combining the weight value with the original feature map F1Carrying out pixel-by-pixel multiplication operation to obtain a feature map F strengthened by a space attention module2
The area extraction network diagram of the present invention is shown in fig. 4, and includes the following steps:
the method comprises the following steps: and performing feature integration on the feature map processed by the attention module to enhance the robustness, wherein the operation is to perform convolution layer processing with a convolution kernel size of 3.
Step two: and (3) laying m prior frames on the feature map, wherein the set of the sizes of the prior frames is {8,16, 32}, and the set of the aspect ratios is {0.5,1,2}, namely each feature point corresponds to 9 different prior frames.
Step three: judging whether the prior frame is a foreground or a background, namely whether the prior frame contains the target or not, wherein an information matrix with the channel number of 2 x 9 is obtained by using 1 x 1 convolution and is used for predicting whether 9 prior frames on each feature point on the feature map contain the target or not, and then the foreground prior frame is obtained through softmax classification.
Step four: and (3) carrying out coordinate adjustment on the prior frame, and obtaining an information matrix with the channel number of 4 multiplied by 9 by using 1 multiplied by 1 convolution for predicting the change of the position coordinates of the 9 prior frames on each feature point on the feature map.
Step five: and screening the areas obtained in the third step and the fourth step to prevent the extracted areas from being too small or exceeding the boundary, sorting the extracted areas according to the softmax score to take out corresponding suggestion frames, and inhibiting the suggestion frames by using a non-maximum value for de-duplication to obtain a final target area.
The region extraction network extracts the regions of potential targets from the feature map. Each region is described by four values, the coordinates of its upper-left and lower-right corners, from which the area of the potential target region can be calculated. To enhance the detection of small target regions, the potential target regions are divided into small targets and other targets with a division threshold of 64 × 64: regions with area at or below this value are small target regions, and larger regions are other target regions.
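The area-based split described above reduces to a few lines (a sketch; region corners are (x1, y1, x2, y2) as in the text):

```python
def split_regions(regions, threshold=64 * 64):
    """Route proposals by area: at or below 64*64 -> small-target branch
    (RoI Align), above it -> regular branch (RoI Pooling)."""
    small, regular = [], []
    for x1, y1, x2, y2 in regions:
        area = (x2 - x1) * (y2 - y1)
        (small if area <= threshold else regular).append((x1, y1, x2, y2))
    return small, regular
```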
Fig. 5a shows a schematic view of the Pooling of the RoI Pooling area of the present invention, which includes:
the RoI Pooling area Pooling is used for Pooling conventional targets, and the Pooling result is used for final position prediction and classification prediction of the targets. The following description will be made by taking an example in which the picture size is 256 × 256 and the size of a certain extraction area is 72 × 72: firstly, a picture generates a characteristic diagram with the size of 16 multiplied by 16 after passing through a convolutional neural network, the size of an area is 4.5 multiplied by 4.5 after passing through the convolutional neural network, at the moment, the first quantization is carried out according to the operation of RoI Pooling, and the result is rounded to obtain a characteristic diagram mapping area with the size of 4 multiplied by 4; then, taking the example that the size of RoIPooling is 3 × 3, dividing a 4 × 4 feature map mapping area into 9 small areas, wherein the size of each area is 1.33 × 1.33, performing second quantization according to the operation of RoI Pooling, and rounding the result to obtain a feature map mapping area with the size of 1 × 1; finally, the maximum value for each region is taken to get the final result of 3 × 3 size.
RoI Pooling thus quantizes the proposed region twice, losing part of its feature information in the process. The lost information has little effect on the detection of large targets, but small targets carry little information to begin with and cannot spare it; the RoI Align pooling method is therefore adopted for small target detection.
A schematic representation of the pooling of the RoI Align region of the present invention is shown in FIG. 5b, comprising:
and the RoI Align region pooling is used for performing pooling operation on the small targets, and the pooling result is used for final position prediction and classification prediction of the targets. The following description will be made by taking an example in which the picture size is 256 × 256 and the size of a certain extraction area is 72 × 72: firstly, a picture generates a feature map with the size of 16 multiplied by 16 after passing through a convolutional neural network, the size of an area after passing through the convolutional neural network is 4.5 multiplied by 4.5, and at the moment, a decimal part of a result is reserved according to the operation of RoI Align to obtain a feature map mapping area with the size of 4.5 multiplied by 4.5; then, taking the size of the RoI Align as 3 × 3 as an example, dividing the feature map mapping area of 4.5 × 4.5 into small areas of 9 sizes, where the size of each area is 1.5 × 1.5, and then reserving the fractional part of the result according to the RoI Align operation to obtain a feature map mapping area of 1.5 × 1.5; finally, equally dividing each region into four parts, taking the position of the central point of each part, obtaining the value of the point by a bilinear interpolation method, and taking the maximum value of the four points to obtain the final result of 3 multiplied by 3.
RoI Align thereby makes effective use of the information at every feature point in the region, which benefits the subsequent detection of small targets. After the potential target regions pass through RoI Pooling and RoI Align, the pooled features undergo the fifth-stage feature extraction of ResNet50, followed by average pooling and a Flatten operation; finally a fully connected layer produces the final position and classification prediction results.
The detailed descriptions listed above are merely specific illustrations of possible embodiments of the invention and do not limit its scope; all equivalent embodiments or modifications that do not depart from the technical spirit of the invention fall within its scope.

Claims (6)

1. A small target detection method based on deep learning, characterized in that the method comprises the following steps:
Step one: perform data enhancement on the small targets in the training data; the enhancement methods used are scaling, flipping, color-gamut distortion, small-target instance copying, and Mosaic enhancement of the pictures in the data set;
Step two: extract features from the image processed in step one with a feature extraction network that uses ResNet50 as its backbone; the feature maps of the third and fourth stages are fused by channel cascade to form the working feature map;
Step three: apply a mixed attention mechanism to the feature map generated in step two to improve the network's ability to distinguish foreground from background; the mixed attention mechanism is serial, first weighting the feature map with a channel attention module and then obtaining the final feature map with a spatial attention module;
Step four: use an RPN region extraction network to extract the regions of potential targets from the feature map obtained in step three, and divide the extracted potential targets into regular targets and small targets by area; a region of size 64 × 64 or less is a small target region and a larger region is a large target region;
Step five: perform RoI Align region pooling on the small target regions obtained in step four and RoI Pooling on the other target regions, then perform category judgment and position regression on the pooling results to obtain the final detection result.
2. The small target detection method based on deep learning of claim 1, wherein: first, data enhancement is applied to the image data in the training set using five means: scaling, flipping, color-gamut distortion, small-target instance copying, and Mosaic enhancement; second, features are extracted from the enhanced images with the ResNet50 feature extraction network, with feature fusion used in the feature extraction stage; the feature map then uses a serial mixed attention mechanism to improve the network's ability to distinguish foreground from background; then an RPN region extraction network extracts potential target regions from the attention-processed feature map, and the extracted regions are divided into regular targets and small targets by area, with 64 × 64 as the threshold separating them; finally, RoI Pooling is applied to the regular target regions and RoI Align to the small target regions, where RoI Align avoids the feature loss caused by the two quantization roundings in the network processing flow and thus improves small-target detection; category judgment and position regression on the pooling results give the final detection result.
3. The small target detection method based on deep learning of claim 1, wherein: the feature extraction network includes:
ResNet50 overall structure: the ResNet50 network has five stages in total; the first stage comprises a convolution layer with a 7 × 7 kernel and a max pooling layer with a 3 × 3 kernel and stride 2; the second to fifth stages share a similar structure, each consisting of one convolution block and several identity blocks, the numbers of identity blocks being 2, 3, 5 and 2 respectively; the convolution block has differing input and output dimensions and is used to change the dimensionality of the network, while the identity block has matching input and output dimensions and is connected in series to deepen the network.
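The block counts in claim 3 are consistent with the standard ResNet50 layer count, which the following sketch verifies; the dictionary layout and variable names are illustrative assumptions, not the claim's notation.

```python
# Stages two to five each contain one convolution block plus 2, 3, 5 and 2
# identity blocks, per the claim.
RESNET50_STAGES = {
    "stage2": {"conv_blocks": 1, "identity_blocks": 2},
    "stage3": {"conv_blocks": 1, "identity_blocks": 3},
    "stage4": {"conv_blocks": 1, "identity_blocks": 5},
    "stage5": {"conv_blocks": 1, "identity_blocks": 2},
}

total_blocks = sum(
    s["conv_blocks"] + s["identity_blocks"] for s in RESNET50_STAGES.values()
)
# 16 bottleneck blocks with 3 convolutions each, plus the stage-one 7x7
# convolution and the final fully connected layer, give the 50 weight
# layers that name ResNet50.
total_layers = 3 * total_blocks + 2
```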
4. The small target detection method based on deep learning of claim 1, wherein the feature fusion stage comprises: performing feature fusion on the feature maps generated in the third and fourth stages so as to enhance the detail information and semantic information of the feature maps; feature fusion can use pixel-wise addition or channel concatenation, and, considering the negative effects of direct pixel addition, channel concatenation is adopted here; taking a three-channel picture of size 600 × 600 as an example: after the third stage, a feature map C1 with width and height 75 and 512 channels is obtained; after the fourth stage, a feature map C2 with width and height 38 and 1024 channels is obtained; feature map C1 is adjusted by max pooling to the dimensions of C2, giving feature map C3; feature maps C2 and C3 are then concatenated along the channel dimension, and finally a 1 × 1 convolution adjusts the channel count of the concatenated map to obtain the final fused feature map C.
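The fusion path of claim 4 can be sketched shape-for-shape in NumPy; the random projection stands in for the learned 1 × 1 convolution weights, and the 1024 output channels are an assumption, so only the data flow and shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
c1 = rng.standard_normal((512, 75, 75))    # stage-three output C1
c2 = rng.standard_normal((1024, 38, 38))   # stage-four output C2

# Max pooling (kernel 2, stride 2, ceil mode) shrinks C1's 75x75 grid to
# C2's 38x38 grid; padding with -inf keeps the maximum correct.
padded = np.pad(c1, ((0, 0), (0, 1), (0, 1)), constant_values=-np.inf)
c3 = padded.reshape(512, 38, 2, 38, 2).max(axis=(2, 4))  # C3: 512x38x38

# Channel concatenation, then a 1x1 convolution realised as a channel
# projection, produces the fused feature map C.
concat = np.concatenate([c2, c3], axis=0)                # 1536x38x38
w = rng.standard_normal((1024, concat.shape[0])) * 0.01  # 1x1 conv weights
fused = np.einsum("oc,chw->ohw", w, concat)              # C: 1024x38x38
```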
5. The small target detection method based on deep learning of claim 1, wherein: the attention module includes:
an attention mechanism is used so that the network automatically judges the important channels and positions in the feature map; the method adopts the serial structure of the attention module CBAM: first the channel attention module judges the importance of the feature map channels, then the spatial attention module judges the importance of the spatial positions in the feature map, as follows:
step one: the input feature map F is processed separately by the max pooling branch and the average pooling branch of the channel attention module; taking the max pooling branch as an example, F first undergoes global max pooling to obtain compressed channel information, whose shape changes from 1024 × 38 × 38 to 1024 × 1 × 1; this compressed channel information is then aggregated by a one-dimensional convolution with kernel size k; because convolution shares parameters, the one-dimensional convolution effectively reduces the module's parameter count compared with the fully connected layer used in conventional channel attention; the kernel size k is determined by the following formula, where C is the number of channels of the feature map:
k = |log2(C)/γ + b/γ|_odd, where |t|_odd denotes the odd integer nearest to t, and γ and b are preset constants.
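The kernel-size rule for the one-dimensional convolution in step one can be sketched as follows; it follows the ECA-Net style formulation, with γ = 2 and b = 1 assumed as defaults since the claim does not state them.

```python
import math

def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Adaptive 1-D convolution kernel size from the channel count C.

    Computes k = |log2(C)/gamma + b/gamma|_odd, i.e. the nearest odd
    integer; gamma = 2 and b = 1 are assumed defaults.
    """
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1
```

For the 1024-channel feature map used in the claim this yields k = 5.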
step two: the channel aggregation information produced by the max pooling branch and the average pooling branch in step one is added pixel by pixel and then passed through a sigmoid nonlinear activation to obtain a weight for the importance of each channel; this weight is multiplied pixel by pixel with the original feature map F to obtain the feature map F1 enhanced by the channel attention module;
step three: the feature map F1 obtained in step two is processed by the spatial attention module; F1 undergoes global max pooling and global mean pooling along the channel dimension to obtain compressed spatial information, whose shape changes from the original 1024 × 38 × 38 to 1 × 38 × 38; the two pooling results are then concatenated along the channel dimension into compressed information of shape 2 × 38 × 38, which a two-dimensional convolution with kernel size 7 aggregates into aggregated information of shape 1 × 38 × 38;
step four: the spatial aggregation information obtained in step three is passed through a sigmoid nonlinear activation to obtain spatial importance weights, which are multiplied pixel by pixel with the feature map F1 to obtain the feature map F2 enhanced by the spatial attention module.
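The serial CBAM processing of steps one to four can be sketched shape-for-shape in NumPy; the learned 1-D and 7 × 7 convolutions are replaced by stand-ins (a random shared kernel and a fixed mean), so this illustrates the data flow only, not trained behavior.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
f = rng.standard_normal((1024, 38, 38))  # input feature map F

# Channel attention (steps one and two): compress spatial extent, aggregate
# each pooled vector with a shared 1-D convolution, add, and gate.
max_pool = f.max(axis=(1, 2))            # shape (1024,)
avg_pool = f.mean(axis=(1, 2))           # shape (1024,)
kernel = rng.standard_normal(5) * 0.1    # stand-in for the learned k-sized kernel
agg = lambda v: np.convolve(v, kernel, mode="same")
w_channel = sigmoid(agg(max_pool) + agg(avg_pool))
f1 = f * w_channel[:, None, None]        # F1, channel-attention enhanced

# Spatial attention (steps three and four): compress channels to 2x38x38,
# aggregate to 1x38x38 (a fixed mean stands in for the 7x7 conv), gate.
sp = np.stack([f1.max(axis=0), f1.mean(axis=0)])  # 2x38x38
w_spatial = sigmoid(sp.mean(axis=0))              # 38x38
f2 = f1 * w_spatial[None, :, :]                   # F2, spatial-attention enhanced
```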
6. The small target detection method based on deep learning of claim 1, wherein the region extraction network comprises the following steps:
step one: perform feature integration on the attention-processed feature map to enhance robustness, the feature map being processed by a convolution layer with kernel size 3;
step two: lay prior boxes on the feature map, with size set {8, 16, 32} and aspect ratio set {0.5, 1, 2}, i.e. each feature point corresponds to 9 different prior boxes;
step three: judge whether each prior box is foreground or background, i.e. whether it contains a target; a 1 × 1 convolution produces an information matrix with 2 × 9 channels that predicts, for each feature point on the feature map, whether each of its 9 prior boxes contains a target, and softmax classification then yields the foreground prior boxes;
step four: adjust the coordinates of the prior boxes; a 1 × 1 convolution produces an information matrix with 4 × 9 channels that predicts the position coordinate offsets of the 9 prior boxes at each feature point on the feature map;
step five: screen the regions obtained in steps three and four to prevent the extracted regions from being too small or crossing the image boundary; sort the extracted regions by softmax score to take out the corresponding proposal boxes, then apply non-maximum suppression (NMS) to deduplicate the proposal boxes and obtain the final target regions.
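The two 1 × 1 prediction heads of steps three and four can be sketched in NumPy; the random projections stand in for learned RPN weights, so only the output shapes (2 × 9 and 4 × 9 channels per feature point) are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.standard_normal((1024, 38, 38))  # attention-processed feature map
A = 9  # anchors per feature point: 3 sizes {8, 16, 32} x 3 ratios {0.5, 1, 2}

def conv1x1(x, out_channels, rng):
    """1x1 convolution realised as a channel projection; random weights
    stand in for the learned RPN head parameters."""
    w = rng.standard_normal((out_channels, x.shape[0])) * 0.01
    return np.einsum("oc,chw->ohw", w, x)

cls_logits = conv1x1(feat, 2 * A, rng)   # step three: foreground/background scores
bbox_deltas = conv1x1(feat, 4 * A, rng)  # step four: coordinate adjustments
```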
CN202210123900.XA 2022-02-10 2022-02-10 Small target detection method based on deep learning Active CN114494728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210123900.XA CN114494728B (en) 2022-02-10 2022-02-10 Small target detection method based on deep learning


Publications (2)

Publication Number Publication Date
CN114494728A true CN114494728A (en) 2022-05-13
CN114494728B CN114494728B (en) 2024-06-07

Family

ID=81477644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210123900.XA Active CN114494728B (en) 2022-02-10 2022-02-10 Small target detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN114494728B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method
CN112801158A (en) * 2021-01-21 2021-05-14 中国人民解放军国防科技大学 Deep learning small target detection method and device based on cascade fusion and attention mechanism
CN113052185A (en) * 2021-03-12 2021-06-29 电子科技大学 Small sample target detection method based on fast R-CNN
CN113673618A (en) * 2021-08-26 2021-11-19 河南中烟工业有限责任公司 Tobacco insect target detection method fused with attention model


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116766213A (en) * 2023-08-24 2023-09-19 烟台大学 Bionic hand control method, system and equipment based on image processing
CN116766213B (en) * 2023-08-24 2023-11-03 烟台大学 Bionic hand control method, system and equipment based on image processing

Also Published As

Publication number Publication date
CN114494728B (en) 2024-06-07

Similar Documents

Publication Publication Date Title
CN110503112B (en) Small target detection and identification method for enhancing feature learning
CN109815886B (en) Pedestrian and vehicle detection method and system based on improved YOLOv3
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN108334881B (en) License plate recognition method based on deep learning
CN103886308B (en) A kind of pedestrian detection method of use converging channels feature and soft cascade grader
WO2022105143A1 (en) Lightweight fire-det flame detection method and system
CN112241679B (en) Automatic garbage classification method
CN112861635B (en) Fire disaster and smoke real-time detection method based on deep learning
CN107038416B (en) Pedestrian detection method based on binary image improved HOG characteristics
CN109886128B (en) Face detection method under low resolution
CN109801297B (en) Image panorama segmentation prediction optimization method based on convolution
CN109886159B (en) Face detection method under non-limited condition
CN109948457B (en) Real-time target recognition method based on convolutional neural network and CUDA acceleration
CN112801027A (en) Vehicle target detection method based on event camera
CN108038486A (en) A kind of character detecting method
CN111008994A (en) Moving target real-time detection and tracking system and method based on MPSoC
CN111753682A (en) Hoisting area dynamic monitoring method based on target detection algorithm
CN111401380A (en) RGB-D image semantic segmentation method based on depth feature enhancement and edge optimization
CN111507391A (en) Intelligent identification method for nonferrous metal broken materials based on machine vision
CN116645328A (en) Intelligent detection method for surface defects of high-precision bearing ring
CN114037938A (en) NFL-Net-based low-illumination target detection method
CN115376082A (en) Lane line detection method integrating traditional feature extraction and deep neural network
CN114494728B (en) Small target detection method based on deep learning
CN115410039A (en) Coal foreign matter detection system and method based on improved YOLOv5 algorithm
CN108345835A (en) A kind of target identification method based on the perception of imitative compound eye

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant