CN111914917A - Target detection improved algorithm based on feature pyramid network and attention mechanism - Google Patents
- Publication number
- CN111914917A (application number CN202010710684.XA)
- Authority
- CN
- China
- Prior art keywords
- feature
- algorithm
- fusion
- network
- detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/253 — Pattern recognition; fusion techniques of extracted features
- G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/045 — Neural network architectures; combinations of networks
- G06V2201/07 — Image or video recognition; target detection
Abstract
The invention discloses an improved target detection algorithm based on a feature pyramid network and an attention mechanism. Following the principle of the feature pyramid network, the algorithm fuses the 6 multi-scale feature maps extracted by the base network of the original SSD algorithm; each fused feature map then contains rich context information, which improves the detection capability. An attention model is further added to the fused feature maps, so that the feature information of small targets is extracted effectively. Missed detections are reduced and the robustness of the algorithm is improved, while the detection speed still meets the real-time requirement.
Description
Technical Field
The invention belongs to the field of digital image processing, relates to target detection, and particularly relates to an improved target detection algorithm based on a feature pyramid network and an attention mechanism.
Background
The task of target detection is to find the targets of interest in an image and determine their categories and positions. It is one of the core problems in computer vision and is widely applied in infrared detection, intelligent video surveillance, remote-sensing target detection, medical diagnosis, and fire and smoke detection in intelligent buildings. Target detection algorithms can be divided into traditional algorithms and deep-learning-based algorithms. Traditional algorithms are represented by the SIFT and V-J (Viola-Jones) detectors, but they have high time complexity and poor robustness. Deep-learning-based algorithms include R-CNN, Fast R-CNN, Faster R-CNN, YOLO, SSD, and others. Although many excellent target detection algorithms exist, their detection performance is still insufficient, leading to problems such as missed detections and false detections.
Disclosure of Invention
In view of the above-mentioned drawbacks and disadvantages of the prior art, an object of the present invention is to provide an improved algorithm for object detection based on a feature pyramid network and an attention mechanism.
In order to realize the task, the invention adopts the following technical solution:
an improved target detection algorithm based on a feature pyramid network and an attention mechanism is characterized by comprising the following steps:
step 1) combining the principle of the feature pyramid network, extracting 6 multi-scale feature maps of the input image from the base network VGG-16 of the original SSD algorithm, and performing feature fusion in order of feature-map size from small to large, obtaining feature maps that fuse different layers; the fused feature maps simultaneously contain rich semantic information and detail information;
in the original SSD algorithm, the scales of the feature maps extracted from the input image by the base network VGG-16 decrease progressively: the bottom-layer feature maps have high resolution and contain more detail information, while the high-layer feature maps have low resolution and contain more abstract semantic information; the original SSD algorithm therefore uses the bottom-layer feature maps to detect small targets and the high-layer feature maps to detect medium and large targets;
step 2) introducing a channel attention mechanism by adding an attention model to the two fused feature maps that have the richest detail and semantic information and are the most sensitive to small-target detection; that is, a mask is applied to a feature map to realize the attention mechanism and identify the features of the region of interest. Through continuous training, the network learns which regions of each image need attention and suppresses the influence of other, interfering regions, thereby enhancing the detection capability of the algorithm for small targets.
According to the invention, the size of the input image in step 1) is 300 × 300, and the feature maps used for detection after the base network VGG-16 have sizes 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1. According to the principle of the feature pyramid network, feature fusion is carried out on the feature maps in order of size from small to large, yielding 6 fused feature maps whose sizes are still 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1.
Further, in step 2), the attention model is added to the feature maps fused according to the feature pyramid principle in step 1). Because the fusion proceeds in order of feature-map size from small to large, the fused (38, 38) and (19, 19) feature maps contain the most abundant information; compared with the other feature maps they have richer detail and semantic information and are more sensitive to small-target detection. To maintain the detection speed and reduce the computational cost of the algorithm, the attention model is added only to the fused (38, 38) and (19, 19) feature maps. The detection algorithm proceeds as follows:
a) Target detection is based on a single-stage network model: using the idea of regression, the category and bounding box of the target are regressed directly on the input image by a convolutional neural network. First, following the principle of the feature pyramid network, the multi-scale feature maps extracted by the original SSD algorithm are fused in order of size from small to large. In the original SSD algorithm, the multi-scale feature maps extracted by the base network VGG-16 have sizes 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1; fusion according to the feature pyramid principle, in order of size from small to large, yields 6 feature maps of the same sizes, all of which contain rich semantic information and detail information.
b) Channel attention is introduced following the principle of the attention mechanism, and an attention model is added to the feature maps fused in a). The fused 38 × 38 and 19 × 19 feature maps contain the most abundant information; to preserve the real-time performance of the algorithm, the attention model is added only to these two feature maps.
c) For the 6 multi-scale feature maps obtained in steps a) and b), candidate boxes of different scales and aspect ratios are set at each cell. The scale of the candidate boxes is computed by formula (1):

s_k = s_min + ((s_max − s_min) / (m − 1)) (k − 1), k ∈ [1, m]  (1)

where m is the number of feature layers, s_k is the ratio of the candidate box to the picture, and s_max and s_min are the maximum and minimum of this ratio, set to 0.9 and 0.2 respectively; formula (1) gives the scale of each candidate box.

The aspect ratio generally takes the values a_r ∈ {1, 2, 3, 1/2, 1/3}, and the width and height of the candidate box are computed by formula (2):

w_k^a = s_k √(a_r), h_k^a = s_k / √(a_r)  (2)

For the candidate box with aspect ratio 1, an additional candidate box with scale s'_k = √(s_k s_(k+1)) is added. The center coordinates of the candidate boxes are ((i + 0.5)/|f_k|, (j + 0.5)/|f_k|), where |f_k| is the size of the k-th feature layer.
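The scale and shape computation of formulas (1) and (2) can be sketched in Python; this is a minimal illustration in which the function names and the fallback s_(m+1) = 1.0 for the last layer are assumptions, not taken from the patent.

```python
import math

S_MIN, S_MAX = 0.2, 0.9   # minimum and maximum scale ratios from the text
M = 6                     # number of feature layers

def box_scale(k, m=M, s_min=S_MIN, s_max=S_MAX):
    """Formula (1): s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1)."""
    return s_min + (s_max - s_min) * (k - 1) / (m - 1)

def box_shapes(k, aspect_ratios=(1, 2, 3, 1/2, 1/3)):
    """Formula (2): width s_k*sqrt(ar), height s_k/sqrt(ar), plus the extra
    box of scale sqrt(s_k * s_{k+1}) for aspect ratio 1."""
    s_k = box_scale(k)
    shapes = [(s_k * math.sqrt(ar), s_k / math.sqrt(ar)) for ar in aspect_ratios]
    # extra box for aspect ratio 1; s_{m+1} = 1.0 is an assumed convention
    s_next = box_scale(k + 1) if k < M else 1.0
    s_extra = math.sqrt(s_k * s_next)
    shapes.append((s_extra, s_extra))
    return shapes

scales = [round(box_scale(k), 2) for k in range(1, M + 1)]
print(scales)  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

With s_min = 0.2 and s_max = 0.9 the six layer scales step evenly by 0.14, matching the values implied by formula (1).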
d) A 3 × 3 convolution kernel is applied to the multi-scale feature maps to predict the category and confidence, and the target detection algorithm is trained. The loss function for model training is defined as the weighted sum of the localization loss (loc) and the confidence loss (conf):

L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g))

where N is the number of matched candidate boxes; x ∈ {1, 0} indicates whether a candidate box matches a ground-truth box (x = 1 if matched, otherwise x = 0); c is the predicted class confidence; g is the location parameter of the ground-truth box; l is the predicted location of the prediction box; and the weight coefficient α is set to 1.

For the localization loss in SSD, the offsets of the candidate-box center (cx, cy), width (w) and height (h) are regressed with the Smooth L1 loss:

L_loc(x, l, g) = Σ_(i∈Pos) Σ_(m∈{cx,cy,w,h}) x_ij^k smooth_L1(l_i^m − ĝ_j^m)

For the confidence loss in SSD, the typical softmax loss is used:

L_conf(x, c) = − Σ_(i∈Pos) x_ij^p log(ĉ_i^p) − Σ_(i∈Neg) log(ĉ_i^0), with ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p)
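A toy numerical sketch of this loss, assuming Smooth L1 for localization and softmax cross-entropy for confidence as stated above; the function names and inputs are illustrative, not from the patent.

```python
import math

def smooth_l1(x):
    """Smooth L1: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def softmax_loss(logits, target):
    """Softmax cross-entropy for one candidate box: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]

def total_loss(loc_offsets, conf_pairs, alpha=1.0):
    """loc_offsets: regression residuals (l - g_hat) of matched boxes;
    conf_pairs: (logits, target_class) per matched box; N = number of matches."""
    n = len(conf_pairs)
    l_loc = sum(smooth_l1(d) for d in loc_offsets)
    l_conf = sum(softmax_loss(lg, t) for lg, t in conf_pairs)
    return (l_conf + alpha * l_loc) / n

print(round(smooth_l1(0.5), 3))  # 0.125
print(round(smooth_l1(2.0), 3))  # 1.5
```

Smooth L1 behaves quadratically near zero and linearly for large residuals, which is why it is less sensitive to outlier boxes than a pure L2 loss.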
the invention relates to a target detection improved algorithm based on a feature pyramid network and an attention mechanism, which is based on a single-stage target detection algorithm (SSD) algorithm, takes the influence of the resolution of a feature map on the target detection performance into consideration, improves the original algorithm, combines the thought of the feature pyramid network, fuses multi-scale feature maps extracted by the original SSD algorithm, and fuses to form a feature map with abundant semantic information and detailed information; and combining the principle of an attention mechanism, adding an attention model for the two feature maps with the fused sizes of 38 multiplied by 38 and 19 multiplied by 19 so as to enhance the recognition effect on the small target object.
Drawings
FIG. 1 is a schematic diagram of a network architecture for an object detection algorithm that combines a feature pyramid network and an attention mechanism;
FIG. 2 compares the detection results of the original SSD algorithm and the improved target detection algorithm: the left-hand pictures a1, a2, a3, a4 and a5 are detections of the original SSD algorithm; the right-hand pictures b1, b2, b3, b4 and b5 are detections of the improved target detection algorithm.
The invention is described in further detail below with reference to the figures and examples.
Detailed Description
The invention discloses an improved target detection algorithm based on a feature pyramid network and an attention mechanism. Following the principle of the feature pyramid network, the 6 feature maps extracted by the original SSD algorithm are fused into new feature maps that contain both rich semantic information and rich detail information. An attention model is then added to the fused feature maps; to preserve the real-time performance of the algorithm, it is added only to the 38 × 38 and 19 × 19 feature maps, which contain the most abundant information and are the most sensitive to small-target detection. These improvements increase the detection capability of the algorithm and alleviate problems such as missed detection.
The embodiment provides an improved target detection algorithm based on a feature pyramid network and an attention mechanism, which comprises the following steps:
step 1) combining the principle of the feature pyramid network, extracting 6 multi-scale feature maps of the input image from the base network VGG-16 of the original SSD algorithm, and performing feature fusion in order of feature-map size from small to large, obtaining feature maps that fuse different layers; the fused feature maps simultaneously contain rich semantic information and detail information;
in the original SSD algorithm, the scales of the feature maps extracted from the input image by the base network VGG-16 decrease progressively: the bottom-layer feature maps have high resolution and contain more detail information, while the high-layer feature maps have low resolution and contain more abstract semantic information; the original SSD algorithm therefore uses the bottom-layer feature maps to detect small targets and the high-layer feature maps to detect medium and large targets;
step 2) introducing a channel attention mechanism by adding an attention model to the two fused feature maps that have the richest detail and semantic information and are the most sensitive to small-target detection; that is, a mask is applied to a feature map to realize the attention mechanism and identify the features of the region of interest. Through continuous training, the network learns which regions of each image need attention and suppresses the influence of other, interfering regions, thereby enhancing the detection capability of the algorithm for small targets.
In step 1), the input image has size 300 × 300 and the feature maps extracted by the base network VGG-16 have sizes 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1. Following the idea of the feature pyramid network, the 6 extracted feature maps are fused pairwise from small to large: 1 × 1 with 3 × 3, 3 × 3 with 5 × 5, 5 × 5 with 10 × 10, 10 × 10 with 19 × 19, and 19 × 19 with 38 × 38. The fused feature maps still have sizes 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1.
In step 2), an attention model is added to the fused feature maps according to the principle of the attention mechanism. The fused 38 × 38 and 19 × 19 feature maps contain the richest information; to preserve the real-time performance of the detection algorithm and reduce the computational cost, the attention model is added only to these two feature maps. Adding the attention model strengthens the extraction of small-target features.
The detection process of the improved target detection algorithm is as follows:
a) Target detection is based on a single-stage network model: using the idea of regression, the category and bounding box of the target are regressed directly on the input image by a convolutional neural network. First, following the principle of the feature pyramid network, the multi-scale feature maps extracted by the original SSD algorithm are fused in order of size from small to large. In the original SSD algorithm, the multi-scale feature maps extracted by the base network VGG-16 have sizes 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1. Taking the (1, 1) and (3, 3) feature maps as an example:
First, the (1, 1) feature map is up-sampled: on the basis of the original pixels, new elements are inserted between pixels with a suitable interpolation algorithm, enlarging the feature map to the size of the (3, 3) feature map. Then a 1 × 1 convolution is applied to the (3, 3) feature map to change its number of channels so that it matches the up-sampled feature map. Finally the two are fused, and a 3 × 3 convolution is applied to the fused feature map to eliminate the aliasing effect of up-sampling. Fusion between the other adjacent feature maps follows the same method. The fusion yields 6 feature maps of sizes 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1, all containing rich semantic information and detail information.
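The fusion step above can be sketched with NumPy. This is a simplified illustration: nearest-neighbour interpolation stands in for the interpolation method, the 1 × 1 and 3 × 3 convolutions are omitted, and the channel counts are assumed to match already.

```python
import numpy as np

def upsample_nearest(fmap, out_h, out_w):
    """Enlarge a (C, H, W) feature map by nearest-neighbour interpolation."""
    c, h, w = fmap.shape
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return fmap[:, rows][:, :, cols]

def fuse(high, low):
    """Fuse a smaller high-level map into a larger low-level map:
    up-sample 'high' to the size of 'low', then add element-wise."""
    _, h, w = low.shape
    return low + upsample_nearest(high, h, w)

high = np.ones((256, 1, 1))   # stand-in for the 1 x 1 feature map
low = np.zeros((256, 3, 3))   # stand-in for the 3 x 3 feature map
fused = fuse(high, low)
print(fused.shape)  # (256, 3, 3)
```

In the actual network the element-wise addition would be followed by the 3 × 3 convolution described above to suppress up-sampling aliasing.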
b) Channel attention is introduced following the principle of the attention mechanism, and an attention model is added to the feature maps fused in step a). The fused 38 × 38 and 19 × 19 feature maps contain the most abundant information; to preserve the real-time performance of the algorithm, the attention model is added only to these two feature maps. Adding the attention model proceeds in three steps: squeeze, excitation, and attention.
The squeeze operation is given by formula (1):

y_c = (1 / (H × W)) Σ_(i=1)^(H) Σ_(j=1)^(W) u_c(i, j)  (1)

where H and W are the height and width of the input, U is the input, Y is the output, and C is the number of input channels. Formula (1) converts the H × W × C input into a 1 × 1 × C output, which corresponds to a global average pooling operation.
The excitation operation is given by formula (2):

S = h-Swish(W_2 × ReLU6(W_1 Y))  (2)

where Y is the output of the squeeze operation and S is the output of the excitation operation; W_1 has dimension C/r × C and W_2 has dimension C × C/r, where r is a scaling parameter set to 4. Multiplying W_1 by Y corresponds to a fully connected layer, followed by the ReLU6 activation function; multiplying by W_2 is again a fully connected layer, followed by the hard-Swish activation function, which completes the excitation operation. The ReLU6 and hard-Swish activation functions are given in formula (3):

ReLU6(x) = min(max(0, x), 6), h-Swish(x) = x · ReLU6(x + 3) / 6  (3)
The attention operation is given by formula (4):

X = S × U  (4)

where X is the feature map after the attention mechanism is applied, U is the original input, and S is the output of the excitation operation; the weight of each channel is multiplied by the features of the corresponding feature map.
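The squeeze, excitation, and attention steps (formulas (1) to (4)) can be sketched with NumPy; the weight matrices W1 and W2 below are random placeholders for the learned fully connected layers, and the channel count is a toy value.

```python
import numpy as np

def relu6(x):
    """Formula (3): ReLU6(x) = min(max(0, x), 6)."""
    return np.minimum(np.maximum(x, 0.0), 6.0)

def h_swish(x):
    """Formula (3): h-Swish(x) = x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0

def se_attention(u, w1, w2):
    """u: (C, H, W) input; w1: (C//r, C); w2: (C, C//r).
    Returns the reweighted feature map X of formula (4)."""
    y = u.mean(axis=(1, 2))            # squeeze: global average pooling -> (C,)
    s = h_swish(w2 @ relu6(w1 @ y))    # excitation: two FC layers -> (C,)
    return s[:, None, None] * u        # attention: channel-wise reweighting

rng = np.random.default_rng(0)
c, r = 8, 4                            # r = 4 as stated in the text
u = rng.standard_normal((c, 5, 5))
w1 = rng.standard_normal((c // r, c))
w2 = rng.standard_normal((c, c // r))
x = se_attention(u, w1, w2)
print(x.shape)  # (8, 5, 5)
```

The output keeps the input's shape; only the per-channel weighting changes, which is what lets the network emphasize channels carrying small-target features.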
c) For the 6 multi-scale feature maps obtained in steps a) and b), candidate boxes of different scales and aspect ratios are set at each cell. The scale of the candidate boxes is computed by formula (5):

s_k = s_min + ((s_max − s_min) / (m − 1)) (k − 1), k ∈ [1, m]  (5)

where m is the number of feature layers, s_k is the ratio of the candidate box to the picture, and s_max and s_min are the maximum and minimum of this ratio, set to 0.9 and 0.2 respectively; formula (5) gives the scale of each candidate box.

The aspect ratio generally takes the values a_r ∈ {1, 2, 3, 1/2, 1/3}, and the width and height of the candidate box are computed by formula (6):

w_k^a = s_k √(a_r), h_k^a = s_k / √(a_r)  (6)

For the candidate box with aspect ratio 1, an additional candidate box with scale s'_k = √(s_k s_(k+1)) is added. The center coordinates of the candidate boxes are ((i + 0.5)/|f_k|, (j + 0.5)/|f_k|), where |f_k| is the size of the k-th feature layer.
d) A 3 × 3 convolution kernel is applied to the multi-scale feature maps to predict the category and confidence, and the target detection algorithm is trained. The loss function for model training is defined as the weighted sum of the localization loss (loc) and the confidence loss (conf):

L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g))

where N is the number of matched candidate boxes; x ∈ {1, 0} indicates whether a candidate box matches a ground-truth box (x = 1 if matched, otherwise x = 0); c is the predicted class confidence; g is the location parameter of the ground-truth box; l is the predicted location of the prediction box; and the weight coefficient α is set to 1.

For the localization loss in SSD, the offsets of the candidate-box center (cx, cy), width (w) and height (h) are regressed with the Smooth L1 loss:

L_loc(x, l, g) = Σ_(i∈Pos) Σ_(m∈{cx,cy,w,h}) x_ij^k smooth_L1(l_i^m − ĝ_j^m)

For the confidence loss in SSD, the typical softmax loss is used:

L_conf(x, c) = − Σ_(i∈Pos) x_ij^p log(ĉ_i^p) − Σ_(i∈Neg) log(ĉ_i^0), with ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p)
and then training the improved target detection algorithm model.
In this embodiment, the PASCAL VOC2007 and PASCAL VOC2012 data sets are used as the training set for model training, and data augmentation is applied to expand the training images through operations such as horizontal flipping and random cropping.
Data used for the experiment: the PASCAL VOC data set is a standardized data set for image recognition and classification. It contains 20 categories: person, bird, cat, cow, dog, horse, sheep, airplane, bicycle, boat, bus, car, motorcycle, train, bottle, chair, dining table, potted plant, sofa and television.
This embodiment is trained on the VOC2007 and VOC2012 data sets described above and tested on the VOC2007 data set. Training uses stochastic gradient descent (SGD) with a batch size of 32, an initial learning rate of 0.001 and a momentum of 0.9; the learning rate is reduced by 90% at iterations 100000 and 150000, and training runs for 200000 iterations.
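The learning-rate schedule described above can be sketched as follows; the function name is illustrative, and "reduced by 90%" is read as multiplication by 0.1.

```python
def learning_rate(iteration, base_lr=0.001, steps=(100000, 150000), gamma=0.1):
    """Step schedule: multiply the base learning rate by gamma at each
    milestone iteration that has been reached."""
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= gamma
    return lr

print(learning_rate(50000))   # 0.001
print(learning_rate(120000))  # ~1e-4 (after the first decay)
print(learning_rate(180000))  # ~1e-5 (after both decays)
```

This mirrors the common "multi-step" schedule used for SSD-style training, where the decay milestones fall late in training so most iterations run at the full rate.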
To verify the detection effect of the improved single-stage target detection algorithm of this embodiment, the applicant selects the test set of the PASCAL VOC2007 data set for detection and uses the mAP (mean average precision) as the evaluation index. Each detected category yields a precision-recall (P-R) curve; the area under the curve is the AP value, and averaging the AP values over all categories gives the mAP. The detection effect is compared with other mainstream target detection models both subjectively and objectively (see Tables 1 and 2).
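The AP and mAP computation described above can be sketched as follows. This is a simplified illustration: the rectangle-rule integration over sorted recall points is an assumption, since PASCAL VOC actually uses an interpolated variant of the AP.

```python
def average_precision(recalls, precisions):
    """Area under a P-R curve, given monotonically increasing recall values
    and the precision attained at each recall level."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p   # rectangle rule over each recall increment
        prev_r = r
    return ap

def mean_average_precision(per_class_curves):
    """mAP: mean of the per-class AP values."""
    aps = [average_precision(r, p) for r, p in per_class_curves]
    return sum(aps) / len(aps)

# a perfect detector for one class: precision 1.0 at every recall level
print(average_precision([0.5, 1.0], [1.0, 1.0]))  # 1.0
```

A detector that reaches full recall at precision 1.0 scores AP = 1.0; lower precision at any recall level shrinks the area and therefore the AP.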
TABLE 1
TABLE 2
For the subjective evaluation of the detection effect, the results of the original SSD algorithm and the improved detection algorithm are compared (as shown in FIG. 2, pictures a1, a2, a3, a4 and a5 are detections of the original SSD algorithm, and pictures b1, b2, b3, b4 and b5 are detections of the improved target detection algorithm). As the figure shows, compared with the original SSD algorithm the improved algorithm markedly reduces missed detections, handles densely distributed small targets better, and detects more targets; the detection effect is clearly improved over the original SSD algorithm.
Claims (3)
1. An improved target detection algorithm based on a feature pyramid network and an attention mechanism is characterized by comprising the following steps:
step 1) combining the principle of the feature pyramid network, performing feature fusion, in order of feature-map size from small to large, on the 6 multi-scale feature maps extracted from the input image by the base network VGG-16 of the original SSD algorithm, obtaining feature maps that fuse different layers; the fused feature maps simultaneously contain rich semantic information and detail information;
in the original SSD algorithm, the scales of the feature maps extracted from the input image by the base network VGG-16 decrease progressively: the bottom-layer feature maps have high resolution and contain more detail information, while the high-layer feature maps have low resolution and contain more abstract semantic information; the original SSD algorithm therefore uses the bottom-layer feature maps to detect small targets and the high-layer feature maps to detect medium and large targets;
step 2) introducing a channel attention mechanism by adding an attention model to the two fused feature maps that have the richest detail and semantic information and are the most sensitive to small-target detection; that is, a mask is applied to a feature map to realize the attention mechanism and identify the features of the region of interest; through continuous training, the network learns which regions of each image need attention and suppresses the influence of other, interfering regions, thereby enhancing the detection capability of the algorithm for small targets.
2. The algorithm of claim 1, wherein the size of the input image in step 1) is 300 × 300 and the feature maps for detection obtained through the base network VGG-16 have sizes 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1; according to the principle of the feature pyramid network, feature fusion is carried out on the feature maps for detection in order of size from small to large, yielding 6 feature maps whose sizes are still 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1.
3. The algorithm according to claim 1, wherein in step 2) an attention model is added to the feature maps fused according to the feature pyramid principle in step 1); because fusion proceeds in order of feature-map size from small to large, the fused (38, 38) and (19, 19) feature maps contain the most abundant information and, compared with the other feature maps, have richer detail and semantic information and are more sensitive to small-target detection; to maintain the detection speed and reduce the computational cost, the attention model is added only to the fused (38, 38) and (19, 19) feature maps, and the detection process of the target detection algorithm is as follows:
a) target detection is based on a single-stage network model: using the idea of regression, the category and bounding box of the target are regressed directly on the input image by a convolutional neural network; first, following the principle of the feature pyramid network, the multi-scale feature maps extracted by the original SSD algorithm are fused in order of size from small to large; in the original SSD algorithm, the multi-scale feature maps of the input image extracted by the base network VGG-16 have sizes 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1, and fusion according to the feature pyramid principle, in order of size from small to large, yields 6 feature maps of the same sizes, all containing rich semantic information and detail information;
b) channel attention is introduced following the principle of the attention mechanism, and an attention model is added to the feature maps fused in step a); the fused 38 × 38 and 19 × 19 feature maps contain the most abundant information, and to preserve the real-time performance of the algorithm the attention model is added only to these two feature maps;
c) According to the 6 multi-scale feature maps obtained in steps a) and b), candidate boxes of different sizes and aspect ratios are set at each cell, and the scale of the candidate boxes is calculated by formula (1):

s_k = s_min + ((s_max − s_min) / (m − 1)) · (k − 1),  k ∈ [1, m]    (1)

where m is the number of feature layers; s_k is the ratio of the candidate box to the picture; s_max and s_min are the maximum and minimum values of this ratio, set to 0.9 and 0.2 respectively. The scale of each candidate box is obtained from formula (1).
For the aspect ratio, the values are generally a_r ∈ {1, 2, 3, 1/2, 1/3}, and the width and height of the candidate box are calculated by formula (2):

w_k^a = s_k · √a_r,  h_k^a = s_k / √a_r    (2)

For the candidate box with aspect ratio 1, an additional candidate box with scale s'_k = √(s_k · s_{k+1}) is also added. The center coordinates of the candidate boxes are ((i + 0.5)/|f_k|, (j + 0.5)/|f_k|), i, j ∈ [0, |f_k|), where |f_k| denotes the size of the k-th feature layer.
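Formulas (1) and (2) are easy to check numerically; the short script below evaluates them with the values given in the claim (s_min = 0.2, s_max = 0.9, m = 6 feature layers, and the usual SSD aspect-ratio set).

```python
import math

def candidate_scales(m=6, s_min=0.2, s_max=0.9):
    """Formula (1): linearly spaced scales s_k for k = 1..m."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def box_dims(s_k, aspect_ratios=(1, 2, 3, 1/2, 1/3)):
    """Formula (2): width s_k*sqrt(a_r), height s_k/sqrt(a_r) per aspect ratio."""
    return [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]

scales = candidate_scales()
print([round(s, 2) for s in scales])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]

# extra box for aspect ratio 1 uses the geometric mean s'_k = sqrt(s_k * s_{k+1})
extra = math.sqrt(scales[0] * scales[1])
print(round(extra, 3))
```

Note that every box with a_r ≠ 1 keeps area s_k², since width × height = s_k²; only its shape changes with the aspect ratio.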
d) The category and confidence of the multi-scale feature maps are detected by convolution with a 3 × 3 kernel, and the target detection algorithm is trained. The loss function during model training is defined as the weighted sum of the localization loss (loc) and the confidence loss (conf):

L(x, c, l, g) = (1/N) · (L_conf(x, c) + α · L_loc(x, l, g))    (3)

where N is the number of matched candidate boxes; x ∈ {1, 0} indicates whether a candidate box is matched to a ground-truth box (x = 1 if matched, otherwise x = 0); c is the predicted class confidence; g are the position parameters of the ground-truth box; l are the predicted positions of the predicted box; α is a weight coefficient, set to 1.
For the localization loss in SSD, the offsets of the candidate box center (cx, cy), width (w) and height (h) are regressed with the Smooth L1 loss:

L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx, cy, w, h}} x_ij^k · smooth_L1(l_i^m − ĝ_j^m),

where smooth_L1(x) = 0.5 x² if |x| < 1, and |x| − 0.5 otherwise.
For the confidence loss in SSD, the standard softmax loss is used:

L_conf(x, c) = − Σ_{i∈Pos} x_ij^p · log(ĉ_i^p) − Σ_{i∈Neg} log(ĉ_i^0),  where ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p).
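The loss in formula (3) and its two terms can be sketched numerically. The NumPy code below assumes the standard SSD formulation (Smooth L1 over matched-box offsets, softmax cross-entropy over class scores) and is illustrative rather than the patented training code.

```python
import numpy as np

def smooth_l1(d):
    """Smooth L1 loss: 0.5*d^2 where |d| < 1, |d| - 0.5 elsewhere."""
    d = np.asarray(d, dtype=float)
    return np.where(np.abs(d) < 1.0, 0.5 * d * d, np.abs(d) - 0.5)

def softmax_loss(logits, label):
    """Softmax cross-entropy for one box: -log(softmax(logits)[label])."""
    z = logits - np.max(logits)              # subtract max for stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def total_loss(loc_offsets, conf_logits, labels, alpha=1.0):
    """Formula (3): (1/N) * (L_conf + alpha * L_loc) over N matched boxes."""
    n = len(labels)
    l_loc = smooth_l1(loc_offsets).sum()
    l_conf = sum(softmax_loss(c, y) for c, y in zip(conf_logits, labels))
    return (l_conf + alpha * l_loc) / n

print(smooth_l1([0.5, 2.0]))   # 0.125 and 1.5
```

The Smooth L1 branch point at |d| = 1 is what makes the localization term robust: small offsets are penalized quadratically, large ones only linearly, so outlier boxes do not dominate the gradient.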
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010710684.XA CN111914917A (en) | 2020-07-22 | 2020-07-22 | Target detection improved algorithm based on feature pyramid network and attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111914917A true CN111914917A (en) | 2020-11-10 |
Family
ID=73280105
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010710684.XA Pending CN111914917A (en) | 2020-07-22 | 2020-07-22 | Target detection improved algorithm based on feature pyramid network and attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111914917A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180182109A1 (en) * | 2016-12-22 | 2018-06-28 | TCL Research America Inc. | System and method for enhancing target tracking via detector and tracker fusion for unmanned aerial vehicles |
US20190341025A1 (en) * | 2018-04-18 | 2019-11-07 | Sony Interactive Entertainment Inc. | Integrated understanding of user characteristics by multimodal processing |
CN110533084A (en) * | 2019-08-12 | 2019-12-03 | 长安大学 | A kind of multiscale target detection method based on from attention mechanism |
CN110674866A (en) * | 2019-09-23 | 2020-01-10 | 兰州理工大学 | Method for detecting X-ray breast lesion images by using transfer learning characteristic pyramid network |
CN110705457A (en) * | 2019-09-29 | 2020-01-17 | 核工业北京地质研究院 | Remote sensing image building change detection method |
CN111179217A (en) * | 2019-12-04 | 2020-05-19 | 天津大学 | Attention mechanism-based remote sensing image multi-scale target detection method |
CN111401201A (en) * | 2020-03-10 | 2020-07-10 | 南京信息工程大学 | Aerial image multi-scale target detection method based on spatial pyramid attention drive |
Non-Patent Citations (3)
Title |
---|
XU Chengqi; HONG Xuehai: "Feature pyramid object detection network based on function preservation", Pattern Recognition and Artificial Intelligence, no. 06, 15 June 2020 (2020-06-15) *
SHEN Wenxiang; QIN Pinle; ZENG Jianchao: "Indoor crowd detection network based on multi-level features and hybrid attention mechanism", Journal of Computer Applications, no. 12 *
GAO Jianling; SUN Jian; WANG Ziniu; HAN Yulu; FENG Jiaojiao: "SSD object detection algorithm based on attention mechanism and feature fusion", Software, no. 02, 15 February 2020 (2020-02-15) *
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112418345A (en) * | 2020-12-07 | 2021-02-26 | 苏州小阳软件科技有限公司 | Method and device for quickly identifying fine-grained small target |
CN112418345B (en) * | 2020-12-07 | 2024-02-23 | 深圳小阳软件有限公司 | Method and device for quickly identifying small targets with fine granularity |
CN112465057A (en) * | 2020-12-08 | 2021-03-09 | 中国人民解放军空军工程大学 | Target detection and identification method based on deep convolutional neural network |
CN112819737A (en) * | 2021-01-13 | 2021-05-18 | 西北大学 | Remote sensing image fusion method of multi-scale attention depth convolution network based on 3D convolution |
CN112837747A (en) * | 2021-01-13 | 2021-05-25 | 上海交通大学 | Protein binding site prediction method based on attention twin network |
CN112837747B (en) * | 2021-01-13 | 2022-07-12 | 上海交通大学 | Protein binding site prediction method based on attention twin network |
CN112819737B (en) * | 2021-01-13 | 2023-04-07 | 西北大学 | Remote sensing image fusion method of multi-scale attention depth convolution network based on 3D convolution |
CN113158738A (en) * | 2021-01-28 | 2021-07-23 | 中南大学 | Port environment target detection method, system, terminal and readable storage medium based on attention mechanism |
CN113158738B (en) * | 2021-01-28 | 2022-09-20 | 中南大学 | Port environment target detection method, system, terminal and readable storage medium based on attention mechanism |
CN113177579A (en) * | 2021-04-08 | 2021-07-27 | 北京科技大学 | Feature fusion method based on attention mechanism |
CN113255443A (en) * | 2021-04-16 | 2021-08-13 | 杭州电子科技大学 | Pyramid structure-based method for positioning time sequence actions of graph attention network |
CN113255443B (en) * | 2021-04-16 | 2024-02-09 | 杭州电子科技大学 | Graph annotation meaning network time sequence action positioning method based on pyramid structure |
CN113409249A (en) * | 2021-05-17 | 2021-09-17 | 上海电力大学 | Insulator defect detection method based on end-to-end algorithm |
CN114387202A (en) * | 2021-06-25 | 2022-04-22 | 南京交通职业技术学院 | 3D target detection method based on vehicle end point cloud and image fusion |
CN113408549A (en) * | 2021-07-14 | 2021-09-17 | 西安电子科技大学 | Few-sample weak and small target detection method based on template matching and attention mechanism |
CN113408549B (en) * | 2021-07-14 | 2023-01-24 | 西安电子科技大学 | Few-sample weak and small target detection method based on template matching and attention mechanism |
CN113807291B (en) * | 2021-09-24 | 2024-04-26 | 南京莱斯电子设备有限公司 | Airport runway foreign matter detection and identification method based on feature fusion attention network |
CN113807291A (en) * | 2021-09-24 | 2021-12-17 | 南京莱斯电子设备有限公司 | Airport runway foreign matter detection and identification method based on feature fusion attention network |
CN113920468A (en) * | 2021-12-13 | 2022-01-11 | 松立控股集团股份有限公司 | Multi-branch pedestrian detection method based on cross-scale feature enhancement |
CN114220015A (en) * | 2021-12-21 | 2022-03-22 | 一拓通信集团股份有限公司 | Improved YOLOv 5-based satellite image small target detection method |
CN114972860A (en) * | 2022-05-23 | 2022-08-30 | 郑州轻工业大学 | Target detection method based on attention-enhanced bidirectional feature pyramid network |
CN115019169A (en) * | 2022-05-31 | 2022-09-06 | 海南大学 | Single-stage water surface small target detection method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111914917A (en) | Target detection improved algorithm based on feature pyramid network and attention mechanism | |
CN113065558B (en) | Lightweight small target detection method combined with attention mechanism | |
CN110348376B (en) | Pedestrian real-time detection method based on neural network | |
CN111079584A (en) | Rapid vehicle detection method based on improved YOLOv3 | |
CN110532970B (en) | Age and gender attribute analysis method, system, equipment and medium for 2D images of human faces | |
JP7097641B2 (en) | Loop detection method based on convolution perception hash algorithm | |
CN109034092A (en) | Accident detection method for monitoring system | |
CN110287777B (en) | Golden monkey body segmentation algorithm in natural scene | |
CN113255589B (en) | Target detection method and system based on multi-convolution fusion network | |
Lyu et al. | Small object recognition algorithm of grain pests based on SSD feature fusion | |
CN113313031B (en) | Deep learning-based lane line detection and vehicle transverse positioning method | |
CN113610046B (en) | Behavior recognition method based on depth video linkage characteristics | |
CN111860587A (en) | Method for detecting small target of picture | |
CN108133235A (en) | A kind of pedestrian detection method based on neural network Analysis On Multi-scale Features figure | |
CN113139489A (en) | Crowd counting method and system based on background extraction and multi-scale fusion network | |
CN116129291A (en) | Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device | |
CN116052271A (en) | Real-time smoking detection method and device based on CenterNet | |
CN116452966A (en) | Target detection method, device and equipment for underwater image and storage medium | |
CN113887649B (en) | Target detection method based on fusion of deep layer features and shallow layer features | |
WO2022205329A1 (en) | Object detection method, object detection apparatus, and object detection system | |
CN112070181B (en) | Image stream-based cooperative detection method and device and storage medium | |
CN110136098B (en) | Cable sequence detection method based on deep learning | |
CN110309786B (en) | Lactating sow posture conversion identification method based on depth video | |
CN117409244A (en) | SCKConv multi-scale feature fusion enhanced low-illumination small target detection method | |
CN117079125A (en) | Kiwi fruit pollination flower identification method based on improved YOLOv5 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||