CN113657285A - Real-time target detection method based on small-scale target

Real-time target detection method based on small-scale target

Info

Publication number
CN113657285A
CN113657285A (Application CN202110949953.2A)
Authority
CN
China
Prior art keywords
new
fusion
detection
feature
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110949953.2A
Other languages
Chinese (zh)
Other versions
CN113657285B (en)
Inventor
毕建权
杨朝红
王伟男
张国辉
赵萌
田相轩
金丽亚
张威
邢萌
陈波
王璇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Academy of Armored Forces of PLA
Original Assignee
Academy of Armored Forces of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Academy of Armored Forces of PLA filed Critical Academy of Armored Forces of PLA
Priority to CN202110949953.2A priority Critical patent/CN113657285B/en
Publication of CN113657285A publication Critical patent/CN113657285A/en
Application granted granted Critical
Publication of CN113657285B publication Critical patent/CN113657285B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/23213 Pattern recognition — non-hierarchical clustering techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06F18/253 Pattern recognition — fusion techniques of extracted features
    • G06N20/00 Machine learning
    • G06N3/045 Neural networks — combinations of networks
    • G06N3/084 Neural network learning methods — backpropagation, e.g. using gradient descent


Abstract

The invention discloses a real-time target detection method based on small-scale targets, which comprises the following steps: S1, adjusting the network detection layers; S2, reconstructing the network default boxes; S3, extracting and enhancing deep network information; and S4, fusing and enhancing shallow network information. The invention constructs a multi-path joint feature extraction enhancement module in the deep network, builds a secondary recursive feature pyramid for the shallow network to create a complementary fusion path between semantic and localization information, adds an attention mechanism to strengthen the fusion effect, and appends a repulsion loss function to the localization loss design to reduce the mutual interference of overlapping targets during detection. Through these improvements, the detection model is lighter than the original detection network, and the detection performance on small-scale targets is greatly improved.

Description

Real-time target detection method based on small-scale target
Technical Field
The invention relates to the field of unmanned aerial vehicle target detection, in particular to a real-time target detection method based on a small-scale target.
Background
In recent years, with the rapid development of unmanned aerial vehicle (UAV) technology, more and more countries treat UAVs as a focus of military research. Along with the rapid development of target detection and computer vision technology, UAVs have also developed rapidly in the field of battlefield information analysis, which is the basis of future reconnaissance and strike operations and an important component of unmanned combat in future intelligent warfare.
At present, deep-learning-based UAV target recognition technology is developing rapidly and is widely applied in civilian fields such as security surveillance and license plate recognition. Commonly used algorithms include RCNN, SSD, YOLO and RetinaNet. In the field of swarm UAVs, however, the approach is limited by the UAV's hardware conditions: the real-time performance of the target recognition algorithm is low, the accuracy of small-scale target detection is low, and the method is easily affected by weather and similar factors. Nevertheless, small UAVs carrying deep learning algorithms are gradually becoming the trend of future unmanned swarm tactics, and research in this field is urgently needed.
Detection speed has always been the key factor limiting the rapid application of SSD-based target detection. To improve the real-time performance of target detection and meet the application requirements of UAV target detection, the computationally heavy base network of the SSD must be made lightweight. The current MobileNet series is an ideal lightweight base network; MobileNetv1 uses depthwise separable convolutions to generate feature layers, reducing convolution parameters so that model computation time drops sharply and the detection rate rises rapidly. This invention adjusts MobileNetv1-SSD, which is built on the MobileNetv1 base network: it sets new detection layers and feature layers, achieves acceleration through pruning and convolution-kernel optimization, and carries out a targeted design for small-scale target detection to improve the detection effect.
Disclosure of Invention
Aiming at the problems that the detection accuracy of the original MobileNetv1-SSD is not high enough, and especially that its design for small-scale target detection is insufficient, the invention provides a real-time target detection method based on small-scale targets, which constructs an embedded real-time detection model for small targets by comprehensively considering model size, small-scale target detection accuracy and detection speed; based on the original SSD framework and the deployment requirements of a UAV AI processing platform, the network framework is made lightweight while the detection accuracy for small-scale targets is improved.
The content of the invention comprises:
A real-time target detection method based on small-scale targets comprises the following steps:
S1, adjusting the network detection layers: adjusting the feature layers used by MobileNetv1-SSD for detection, and constructing a network structure MobileNetv1-SSDimpr better suited to small-scale target detection; namely, adding a Conv5 layer for classification detection and setting its number of default box scales to 4; setting the number of default box scales on Conv11 to 4; removing the Conv17 layer; and constructing new detection feature layers such as Conv5 and Conv11;
S2, reconstructing the network default boxes: clustering the data with the K-means algorithm according to the distribution characteristics of the specific task data, and reconstructing the default boxes of the constructed MobileNetv1-SSDimpr according to the clustering scales;
S3, deep network information extraction and enhancement: for the MobileNetv1-SSDimpr with reconstructed default boxes, constructing a multi-path joint feature extraction enhancement mechanism at the three deep positions Conv14, Conv15 and Conv16 through grouped convolution and point convolution to obtain three feature extraction enhancement modules MFEE14, MFEE15 and MFEE16, and replacing the original standard convolution layers Conv14_1 and Conv14_2 with MFEE14, Conv15_1 and Conv15_2 with MFEE15, and Conv16_1 and Conv16_2 with MFEE16;
S4, shallow network information fusion enhancement: constructing a secondary recursive feature pyramid RFPN on the three shallow feature layers Conv5, Conv11 and Conv13 of the MobileNetv1 framework, generating three detection layers P1, P2 and P3, and using them to replace Conv5, Conv11 and Conv13 of the MobileNetv1-SSDimpr with reconstructed default boxes respectively; the three detection layers P1, P2 and P3 are used for small target detection.
Further, the method further comprises the step: S5, adaptive feature-layer fusion: an Attention mechanism is added to the MobileNetv1-based RFPN, and three new detection layers P1, P2 and P3 are generated through adaptive fusion.
Further, in step S1, the size of Conv5 is 38 × 38 × 256, the size of Conv11 is 19 × 19 × 512, the size of Conv13 is 10 × 10 × 1024, the size of Conv14_2 is 5 × 5 × 512, the size of Conv15_2 is 3 × 3 × 256, and the size of Conv16_2 is 2 × 2 × 256.
Further, the specific steps of step S2 are: S2-1, determining the number of default box scales as the number of cluster centers; S2-2, selecting different scales as initial cluster centers according to the target scale distribution, to accelerate clustering convergence; S2-3, computing the intersection-over-union (IOU) between each piece of target scale data and each cluster center, and assigning the data to the cluster with the largest IOU value; S2-4, computing the mean height and width of all targets in each cluster, and taking the mean scale as the new cluster center; S2-5, repeating step S2-3 and step S2-4 until the cluster centers no longer change; S2-6, after the clustering scales are obtained, changing the original default box settings to the clustering scales.
Further, in step S4, the constructed secondary recursive feature pyramid RFPN comprises Section1, Section2, Section3 and Section4, and performs information fusion enhancement on the shallow network of the MobileNetv1-SSDimpr with reconstructed default boxes.
Further, in steps S4-2 to S4-4, the original mode of stepwise fusion plus residual fusion is changed so that the three feature layers are fused with each other directly through adaptively assigned weights, and three new detection layers are then generated: new P3, new P2 and new P1.
Further, the fusion process is implemented through grouped pooling.
Further, after the scales and channels of the new Conv5, the new Conv11 and the new Conv13 are adjusted to be consistent, adaptive fusion is carried out to generate a new P1.
Further, a repulsion loss function L_Rep is appended after the original localization loss function L_loc(x, l, g) of MobileNetv1-SSDimpr to form a new localization loss function.
The invention has the beneficial effects that:
according to the invention, MobileNetv1-SSD is selected as a basic framework of algorithm research according to task requirements; and then aiming at the difference of MobileNetv1-SSD relative task requirements, developing improved detection model construction algorithm research, and gradually and progressively researching and improving the light processing, detection layer adjustment, default frame reconstruction, deep network information extraction enhancement, shallow network information fusion enhancement, feature layer fusion self-adaptation, positioning loss function optimization and the like in sequence. In the improvement, the following steps are proposed: establishing a multi-path combined feature extraction enhancement module in a deep network, establishing a secondary recursive feature pyramid for a shallow network to generate a complementary fusion path of semantics and positioning information, adding an attention mechanism to enhance the fusion effect, and adding a rejection loss function to the positioning loss function design to reduce mutual interference of mutually overlapped targets on detection. Through the improvement, the detection model can be lighter than the original detection network, and the detection performance of the small-scale target is greatly improved.
Drawings
FIG. 1 is a structural diagram of MobileNetv1-SSDimpr after the default boxes are reconstructed;
FIG. 2 is a structural diagram of the MFEE module;
FIG. 3 is a structural diagram of MobileNetv1-SSDimpr + MFEE + RFPN;
FIG. 4 is a structural diagram of the MobileNetv1-based RFPN;
FIG. 5 is a structural diagram of the MobileNetv1-based RFPN with Attention added.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The invention provides a real-time target detection method based on small-scale targets, which comprises the following steps:
step one, network detection layer adjustment: and adjusting a feature layer used by the MobileNet v1-SSD for detection, and constructing a network structure MobileNet v1-SSDimpr more suitable for small-scale target detection.
Since the shallow feature layers carry fine-grained information that is more beneficial for detecting and identifying small-scale targets, and since the Conv5 layer has stronger semantic expression than the Conv4 layer even though the two have feature maps of the same scale and channel number, a Conv5 layer is added for classification detection and its number of default box scales is set to 4. The number of default box scales on Conv11 is changed to 4 to increase the density of predictions for small-scale target locations. Although the Conv17 layer, as the deepest layer, has the richest semantic information, its loss of fine-grained features and localization information is large, so it is deleted; meanwhile, new detection layers are built, the feature map size of the last detection layer is changed to 2 × 2, the channel counts of the adjusted detection layers are increased, and the loss of feature information is reduced.
The relative position of the shallowest detection layer of MobileNetv1-SSDimpr in the whole network is similar to that of the original SSD, but its channel output on the 38 × 38 detection layer is 256, lower than that of the original SSD. Considering that this layer sits at the shallowest position of the network, too many output feature channels would produce redundant, semantically weak information, so keeping the number of feature channels of this layer small saves detection time without greatly affecting detection accuracy. From the 19 × 19 detection layer to the deepest detection layer, the layer positions of MobileNetv1-SSDimpr are all further forward, so more fine-grained information is available, providing the basic conditions for fine feature extraction and accurate localization of small-scale targets. Meanwhile, the scales of the feature layers from Conv6 to Conv11 remain unchanged, but starting from Conv6, 1 × 1 convolutions continuously screen the features of the previous layer, so the semantic expression keeps deepening and higher-value feature channels are continuously output. In this way, when detection is performed at the Conv11 layer, although the number of feature channels is no larger than that of the corresponding detection layer of the original SSD, key semantics are extracted from the simplified channels while the detection overhead is reduced. Although the feature map scales of MobileNetv1-SSDimpr at the 10 × 10 and 5 × 5 detection layers are the same as those of the corresponding detection layers of the original SSD, the number of channels is doubled; because these layers sit at deeper feature layers where the loss of fine-grained information is severe, expanding the channels compensates for the information loss caused by image compression and helps enhance feature extraction. Conv15_1 greatly reduces the feature channels while keeping the scale of Conv14_2 unchanged, then convolves a 3 × 3 feature layer and expands the channels, so that this layer has stronger semantic expression and higher-value channel information even though its channel count is consistent with the corresponding detection layer of the original SSD. On the 2 × 2 detection layer, both the feature map scale and the channel count are larger than those of the corresponding detection layer of the original SSD, and the layer sits further forward than in the original SSD, so detection on this layer is more advantageous. Compared with the 1 × 1 final detection layer of the original SSD, the improved network uses the same 3 × 3 detection convolution kernel on a larger output map, which increases the output map scale and reduces convolution loss.
In addition, because MobileNetv1-SSDimpr removes the Conv17 feature layer, its parameter count is smaller; and because the base network still consists of depthwise separable convolutions, whose filtering computation for detection is much lower than that of the 3 × 3 standard convolutions of the original SSD, the detection rate of the improved MobileNetv1-SSDimpr is improved.
In conclusion, in terms of small-scale target detection accuracy, MobileNetv1-SSDimpr is higher than MobileNetv1-SSD; in terms of detection rate, MobileNetv1-SSDimpr is slightly lower than MobileNetv1-SSD but much higher than the original SSD.
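For reference, the adjusted detection head described above can be summarized as a small configuration table. The following Python snippet is an illustrative sketch only (the language, the table structure, and the default-box counts on the deeper layers are assumptions drawn from the text and from claim 5, not a normative part of the method):

DETECTION_LAYERS = [
    # (layer name, feature map size H = W, channels, default boxes per location)
    ("Conv5",    38,  256, 4),   # added shallow detection layer, 4 default box scales
    ("Conv11",   19,  512, 4),   # number of default box scales changed to 4
    ("Conv13",   10, 1024, 5),   # 5 scales after default-box reconstruction (see claim 5)
    ("Conv14_2",  5,  512, 6),
    ("Conv15_2",  3,  256, 6),
    ("Conv16_2",  2,  256, 6),   # Conv17 removed; the last detection map is 2 x 2
]

def total_default_boxes(layers=DETECTION_LAYERS):
    # Total number of default boxes predicted over all detection layers.
    return sum(size * size * boxes for _, size, _, boxes in layers)

print(total_default_boxes())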
Step two, network default box reconstruction.
Compared with the original SSD, MobileNetv1-SSDimpr does not change the computation of the default boxes; it does not fully consider the characteristics of small-scale targets, cannot be flexibly adjusted according to the characteristics of the data distribution, and its adaptability is weak. The invention therefore improves the original default-box construction method and reconstructs the default boxes according to the characteristics of the detected target data: the target scales are clustered with the K-means algorithm, and the default box scales are selected adaptively according to the clustering scales. The specific steps are: (1) determine the number of default box scales as the number of cluster centers k; (2) select different scales as initial cluster centers according to the target scale distribution, to accelerate clustering convergence; (3) compute the intersection-over-union (IOU) between each piece of target scale data and each cluster center, and assign the data to the cluster with the largest IOU value; (4) compute the mean height and width of all targets in each cluster, and take the mean scale as the new cluster center; (5) repeat step 3 and step 4 until the cluster centers no longer change; (6) after the clustering scales are obtained, change the original default box settings to the clustering scales.
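The default-box reconstruction of steps (1)-(6) can be sketched as follows. This is a minimal illustration only (NumPy is assumed, target boxes are assumed to be given as width-height pairs, and the initialization heuristic is a simplification of step (2)):

import numpy as np

def iou_wh(box, centers):
    # IOU between one (w, h) box and each cluster center, both anchored at the origin.
    inter = np.minimum(box[0], centers[:, 0]) * np.minimum(box[1], centers[:, 1])
    union = box[0] * box[1] + centers[:, 0] * centers[:, 1] - inter
    return inter / union

def kmeans_default_boxes(boxes, k, max_iter=1000):
    boxes = np.asarray(boxes, dtype=float)
    # Step (2): pick k spread-out samples (by area) as the initial cluster centers.
    order = np.argsort(boxes.prod(axis=1))
    centers = boxes[order][np.linspace(0, len(boxes) - 1, k).astype(int)]
    for _ in range(max_iter):
        # Step (3): assign every target scale to the cluster with the largest IOU.
        assign = np.array([np.argmax(iou_wh(b, centers)) for b in boxes])
        # Step (4): the mean width/height of each cluster becomes the new center.
        new_centers = np.array([boxes[assign == c].mean(axis=0) if np.any(assign == c)
                                else centers[c] for c in range(k)])
        # Step (5): stop when the cluster centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Step (6): these clustering scales replace the original default box settings.
    return centers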
From the viewpoint of detection computation, reducing the number of default box scales on Conv13 by 1 compared with the original setting slightly reduces the detection computation. Although resetting the default boxes with the clustering scales can improve the detection accuracy of MobileNetv1-SSDimpr on small-scale targets, the clustering scales are still not the most suitable small scales; each clustering scale needs to be fine-tuned up or down through model training to obtain stronger robustness and better generalization during box regression.
Step three, deep network information extraction and enhancement.
Although reconstructing the default boxes can improve the detection accuracy of small-scale targets, deficiencies remain: in particular, small-scale target information is lost, and feature information between detection layers lacks fusion, so the distribution of localization and semantic information is unbalanced, which is unfavorable for classification and regression processing. The invention first addresses these problems and further optimizes MobileNetv1-SSDimpr.
Using the idea of the dual-path network DPN (Dual Path Network) algorithm, the invention builds a multi-path joint feature extraction enhancement mechanism at the Conv14, Conv15 and Conv16 positions through grouped convolution and point convolution, replaces the original standard convolution layers Conv14_1, Conv14_2, Conv15_1, Conv15_2, Conv16_1 and Conv16_2 with three feature extraction enhancement modules, and changes the feature extraction mode of the original deep network, which depended on single-path standard convolution. After the improvement, MobileNetv1-SSDimpr performs target detection on the output feature maps of the new modules, and the default box scale settings on these feature maps are unchanged.
Because the parameter count of a grouped convolution is the reciprocal of the number of groups times that of a standard convolution, the feature extraction enhancement module greatly reduces the computation compared with the standard convolutions used on the original feature layers and can effectively compress the detection time of the model. Grouped convolution divides the channels of the input image into groups and applies a different convolution kernel to each group, which effectively avoids the excessive redundant information produced when a standard convolution kernel filters all channels of the input image and improves the extraction rate of key features. Meanwhile, compared with single-path extraction, multi-path joint feature extraction reduces the loss of original features, increases the amount of extracted information and the diversity of extracted feature scales, and helps improve detection accuracy.
For convenience, the invention names a module using the multi-path joint feature extraction enhancement mechanism MFEE (Multi-path joint Feature Extraction Enhancement), and replaces the original Conv14_1 and Conv14_2 with MFEE14, Conv15_1 and Conv15_2 with MFEE15, and Conv16_1 and Conv16_2 with MFEE16.
FIG. 2 is a structural diagram of the MFEE. In FIG. 2, the input map is the output map of the previous convolution layer, and its scale and channel count are the same as those of that output. The input map is convolved along three branches for feature extraction, and finally the feature channels are merged to form the output map. The Batch Norm layers perform batch normalization on the preceding data so that the network always has a standard, uniform data distribution. The left branch first reduces the channels of the input map through a point convolution, filtering out redundant channels, and simultaneously extracts feature information from each retained channel with a stride of 2 (s = 2 in FIG. 2); after convolution it yields an output map whose scale is reduced relative to the input map and whose channels are compressed to X. The middle branch first reduces the channel dimension of the input map to N through a point convolution with stride 1 while keeping the original scale; it then extracts features with a k × k grouped convolution with stride 2 (k = 3 in MFEE14 and MFEE15, k = 2 in MFEE16), obtaining an output map whose scale is reduced relative to the input map but whose channel count is expanded to L (= 2N); the channels are then screened by a point convolution with stride 1 to obtain an output map with the redundant channels filtered out. This output map splits its channels into two parts, Y and Z: the Z part is merged channel-wise with the left-branch output map, while the Y part keeps the same channel count as the right-branch output map and is fused with it. The right branch extracts features from the input map with a point convolution of stride 2 and reduces the channel count, yielding its output map. Finally, the two fused output maps undergo a Concat operation to complete the channel merging and obtain the final output map, which has richer feature information than the output map at the same position in the network before the improvement.
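A minimal sketch of an MFEE-style block is given below. PyTorch is assumed, and the channel splits X, N, Y and Z are illustrative values chosen here (the text specifies the three-branch structure and the merge/fusion rules but not every hyper-parameter), so this should be read as an illustration of the mechanism rather than the exact module:

import torch
import torch.nn as nn

class MFEE(nn.Module):
    def __init__(self, in_ch, x_ch=128, n_ch=128, y_ch=128, k=3, groups=4):
        super().__init__()
        l_ch = 2 * n_ch              # middle branch expands channels to L = 2N
        z_ch = l_ch - y_ch           # middle output is split into a Y part and a Z part
        # Left branch: point convolution reduces channels; stride 2 reduces the scale.
        self.left = nn.Sequential(
            nn.Conv2d(in_ch, x_ch, 1, stride=2, bias=False),
            nn.BatchNorm2d(x_ch), nn.ReLU(inplace=True))
        # Middle branch: 1x1 (stride 1) -> grouped kxk (stride 2) -> 1x1 (stride 1).
        self.mid = nn.Sequential(
            nn.Conv2d(in_ch, n_ch, 1, bias=False),
            nn.BatchNorm2d(n_ch), nn.ReLU(inplace=True),
            nn.Conv2d(n_ch, l_ch, k, stride=2, padding=k // 2, groups=groups, bias=False),
            nn.BatchNorm2d(l_ch), nn.ReLU(inplace=True),
            nn.Conv2d(l_ch, l_ch, 1, bias=False),
            nn.BatchNorm2d(l_ch), nn.ReLU(inplace=True))
        # Right branch: point convolution with stride 2.
        self.right = nn.Sequential(
            nn.Conv2d(in_ch, y_ch, 1, stride=2, bias=False),
            nn.BatchNorm2d(y_ch), nn.ReLU(inplace=True))
        self.y_ch, self.z_ch = y_ch, z_ch

    def forward(self, x):
        left = self.left(x)
        mid = self.mid(x)
        y, z = torch.split(mid, [self.y_ch, self.z_ch], dim=1)
        right = self.right(x)
        # Z merges with the left output by channel concatenation;
        # Y fuses with the right output by element-wise addition;
        # the two fused maps are then concatenated into the final output map.
        return torch.cat([torch.cat([left, z], dim=1), y + right], dim=1)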
After the improvement, MobileNetv1-SSDimpr places the three deep detection layers on MFEE14, MFEE15 and MFEE16 (see FIG. 3). Replacing the original convolution layers with the MFEE modules, although slightly more complex in structure than before, reduces both the number of convolution parameters and the amount of computation, since the grouped convolution provides important parameter compression. Meanwhile, because the grouping filters out unimportant redundant channel information while enriching the feature extraction, and because the features are extracted jointly through multiple paths, more small-scale target information can be extracted at the deep layers of the network, improving the detection recall of the deep detection layers.
Step four, shallow network information fusion enhancement.
Although the improved network enhances feature extraction, the semantic expression of the shallow network features is still not strong, yet these high-resolution feature maps should be fully exploited for small-scale target feature extraction. The invention therefore constructs a secondary recursive feature pyramid RFPN on the three feature layers Conv5, Conv11 and Conv13, generates new detection layers P1, P2 and P3, and replaces Conv5, Conv11 and Conv13 with them respectively, realizing bidirectional transfer and complementation between the semantic information of the deeper feature layers and the localization information of the shallower feature layers, thereby achieving deep fusion. FIG. 4 shows the recursive feature pyramid RFPN structure of the invention, which is divided into Section1, Section2, Section3 and Section4.
S4-1, Section1 feature pyramid fusion.
In the feature pyramid structure of Section1, the input image information flows from Conv4 to Conv13 through the network's convolution operations. As the convolution layers rise, the feature maps of Conv5, Conv11 and Conv13 have less fine-grained information and more semantic information. Among the three layers, Conv5 has the most fine-grained information and the most sufficient localization information for small-scale targets, while Conv13 has the most semantic information but, after the multiple convolution compressions of the preceding layers, the largest loss of fine-grained and localization information. To realize complementary fusion of lower-layer fine-grained information and higher-layer semantic information on these three layers, the invention adds an information feedback path on the right side. That is, the original Conv13 feature map keeps its scale and dimensions unchanged (named the C3 layer); its channels are reduced by a point convolution with stride 1, compressing the channel count to 512, and the feature map is then upsampled by linear interpolation to enlarge its scale by a factor of 2, yielding a feature map of scale 19 × 19 with 512 channels. A lateral connection is established between this feature map and Conv11, and pixel-by-pixel Add fusion of the corresponding channels is completed. Because interpolation upsampling introduces a certain amount of ghosting distortion and the Add operation introduces image aliasing, the fused map is further passed through a 3 × 3 depthwise separable convolution with stride 1 to filter out the aliasing effect (named the C2 layer); it then passes through a point convolution to reduce channels, 2× interpolation upsampling, a lateral connection with the original Conv5 to complete the feature-map fusion, and a 3 × 3 convolution to filter out aliasing, yielding the fused map (named the C1 layer).
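One top-down fusion step of Section1 (point convolution to reduce channels, interpolation upsampling, pixel-wise Add fusion, then a 3 × 3 depthwise separable convolution to filter aliasing) can be sketched as follows. PyTorch is assumed, and the sketch interpolates to the exact target size rather than using a strict factor of 2, since 10 × 10 to 19 × 19 is not an exact doubling:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFuse(nn.Module):
    def __init__(self, deep_ch, shallow_ch):
        super().__init__()
        self.reduce = nn.Conv2d(deep_ch, shallow_ch, kernel_size=1)           # point convolution
        self.dw = nn.Conv2d(shallow_ch, shallow_ch, 3, padding=1, groups=shallow_ch)
        self.pw = nn.Conv2d(shallow_ch, shallow_ch, 1)                        # depthwise separable pair

    def forward(self, deep, shallow):
        up = F.interpolate(self.reduce(deep), size=shallow.shape[-2:],
                           mode="bilinear", align_corners=False)              # ~2x interpolation upsampling
        fused = up + shallow                                                  # pixel-by-pixel Add fusion
        return self.pw(self.dw(fused))                                        # filter the aliasing effect

# Usage as in Section1: C3 = Conv13; C2 = fuse(C3, Conv11); C1 = fuse(C2, Conv5).
conv5 = torch.randn(1, 256, 38, 38)
conv11 = torch.randn(1, 512, 19, 19)
conv13 = torch.randn(1, 1024, 10, 10)
C3 = conv13                                  # kept unchanged at the top
C2 = TopDownFuse(1024, 512)(C3, conv11)      # 19 x 19 x 512
C1 = TopDownFuse(512, 256)(C2, conv5)        # 38 x 38 x 256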
S4-2, Section2 feature pyramid fusion.
The fusion layers C1, C2 and C3 obtained through the Section1 feature pyramid network are not used directly for target detection, but are further processed through the Section2 pyramid network. At this point the MobileNetv1-SSDimpr network generates a bottom-up feature pyramid through a second forward pass, but when Conv5, Conv11 and Conv13 are regenerated, they are fused with the laterally transmitted C1, C2 and C3 feature maps respectively before the subsequent feature layers are generated. The specific process is: after Conv4 is convolved to generate Conv5, pixel-by-pixel Add fusion with the corresponding channels of C1 is first completed, the aliasing effect is filtered out by a 3 × 3 depthwise separable convolution with stride 1, and a new fused map is generated; this new Conv5 then continues to generate the subsequent feature layers. Similarly, when the information is passed up to Conv10 and Conv10 is convolved to generate Conv11, Add fusion with C2 is first completed, the aliasing effect is filtered out by a 3 × 3 depthwise separable convolution with stride 1, and a new fused map is generated; as the new Conv11 it continues to generate Conv12. After Conv12 is convolved to generate Conv13, Add fusion with C3 is completed, the aliasing effect is filtered out by a 3 × 3 depthwise separable convolution with stride 1, and a new fused map is generated; the new Conv13 continues to generate the subsequent feature layers.
S4-3, Section3 feature pyramid fusion.
At this point the new Conv5, new Conv11 and new Conv13 are still not used directly for target detection; the network must further process the feature information fused and enhanced bottom-up in Section2 within Section3. Section3 recurses the algorithm flow of Section1; the difference is that the bottom-up Conv5, Conv11 and Conv13 have now been generated by fusing C1, C2 and C3, and the same algorithm as Section1 then generates new C3, new C2 and new C1 in turn. By the algorithm principle, new C3, new C2 and new C1 have, in turn, richer semantic and localization information.
S4-4, Section4 feature pyramid fusion.
At this point the new C1, new C2 and new C3 have richer feature information than the feature layers at the corresponding positions in Section2, but in order to further increase the amount of fused information and compensate for the information loss caused by the multiple convolutions in Section3, the invention uses the algorithm structure of a residual network to further fuse C1 with the new C1 and C2 with the new C2, and uses the fusion results together with the new C3 for the final detection. Section4 in FIG. 4 illustrates this process. C3 is not fused with the new C3 because, in Section2, the Conv13 generated by convolving Conv12 was fused with C3 to form the new Conv13, and the new Conv13 was fused into the new C3 without any further convolution operation, so the complete information of C3 is already contained in the new C3. The new C3 is therefore used directly for detection and is now named the P3 detection layer. C2 and the new C2 are fused and then passed through a 3 × 3 depthwise separable convolution to filter out the aliasing effect, forming the P2 detection layer. C1 and the new C1 undergo Add fusion and are then passed through a 3 × 3 depthwise separable convolution to filter out the aliasing effect, forming the P1 detection layer. The three detection layers P1, P2 and P3 replace Conv5, Conv11 and Conv13 of the original MobileNetv1-SSDimpr for detecting small-scale targets.
As can be seen from the above discussion, the detection layers P1, P2 and P3 are generated through two forward passes and two local feedbacks of MobileNetv1-SSDimpr. The first forward pass reaches Conv13, and the second forward pass reaches MFEE16. The first local feedback generates C1, C2 and C3, and the second local feedback updates C1, C2 and C3, thereby generating P1, P2 and P3. In terms of newly added detection computation, if the computation added by the feedback paths is ignored, the network with the recursive feature pyramid is equivalent to the original MobileNetv1-SSDimpr undergoing one complete forward pass followed by a second forward pass of MobileNetv1 up to Conv13; since MobileNetv1 uses depthwise separable convolutions from Conv1 to Conv13, which save nearly 8/9 of the convolution computation compared with 3 × 3 standard convolutions, the computation newly added from the input layer to Conv13 is not large. However, adding the 4 point convolutions of the two feedback paths and the 9 aliasing-filtering 3 × 3 depthwise separable convolutions does increase the overall computational complexity noticeably. Even so, after adding the RFPN, MobileNetv1-SSDimpr as a whole remains much lighter than the original SSD. In terms of improving small-scale target detection performance, the information content of the 3 feature layers after 2 top-down fusions, 1 bottom-up fusion and 1 residual fusion is greatly increased compared with the original; and as the scale increases, the 3 fusion layers gain not only localization information but also fused semantic information, so the detection accuracy of small-scale targets on the shallow feature layers can be improved to a greater extent and false detections and missed detections are reduced. Meanwhile, during the second forward pass of the network, the new Conv13 fuses more localization and semantic information than the original Conv13; when this information continues to be passed upward to MFEE14, MFEE15 and MFEE16, the localization accuracy and feature semantic expression of the MFEE modules for small-scale targets can be further improved.
Step five, adaptive feature-layer fusion: an Attention mechanism is added to the MobileNetv1-based RFPN, and three new detection layers P1, P2 and P3 are generated through adaptive fusion.
The P1, P2 and P3 detection layers all contain feature-map information from the new Conv5, new Conv11 and new Conv13, but when the new Conv5, new Conv11 and new Conv13 are further fused, details such as appropriate fusion weights are not considered, and the design of the fusion paths relies rather heavily on prior knowledge and engineering experience, so it is not optimal. The invention therefore adds an Attention mechanism to the MobileNetv1-based RFPN, introduces adaptive spatial feature fusion (ASFF) weight parameters into the mutual fusion of the new Conv5, new Conv11 and new Conv13, and further optimizes the fusion paths.
FIG. 5 is a structural diagram of the MobileNetv1-based RFPN with Attention added. Comparing FIG. 5 with FIG. 4, it can be seen that after the Attention is added, the top-down information fusion path from new C3 to new C1 and the two residual-structure fusion branches from C2 and C1 in the original structure are deleted, and are replaced by 3 mutually fused paths formed after the new Conv13, new Conv11 and new Conv5 are given the Attention mechanism; that is, the original mode of stepwise fusion plus residual fusion is changed so that the three feature layers are fused with each other directly through adaptively assigned weights, and three new detection layers are then generated: new P3, new P2 and new P1. Compared with the original fusion mode, this reduces the dependence of the fusion paths on prior knowledge and engineering experience and improves the feature fusion effect. In addition, the extra cost of the weight computation introduced by referring to ASFF is small, so the improved network is lighter than the original network.
First, the scales and channel counts of the layers must still be made consistent before the fusion weights are redistributed. For the change of scale and channels from a higher layer to a lower layer, the invention still simplifies the feature-map channels with point convolution and enlarges the scale with interpolation upsampling. The change of scale and channels from a lower layer to a higher layer could be realized with 3 × 3 convolutions of stride 2; however, because the new Conv11 and new Conv13 are obtained by bottom-up convolution and have already fused the lower-layer convolution information, which would be redundant, the invention instead uses grouped pooling. The pooling operations introduce no extra parameters while still expanding the channel dimension. The specific steps are as follows:
(1) New P1 generation process: the new Conv13 first has its channels reduced to 256 by a point convolution with stride 1, and its scale is then enlarged to 38 × 38 by 4× interpolation upsampling; the new Conv11 first has its channels reduced to 256 by a point convolution with stride 1, and 2× upsampling then enlarges its feature-map scale to 38 × 38; the new Conv5 keeps its original scale and channel count unprocessed; the three are fused with weights assigned adaptively by the network, and because suitable fusion weights can better avoid the aliasing effect, a new P1 (scale 38 × 38, 256 channels) is obtained without a 3 × 3 depthwise separable convolution with stride 1.
(2) New P2 generation process: the new Conv13 has its channels reduced to 512 by a point convolution, and its scale is then enlarged to 19 × 19 by 2× upsampling; the new Conv11 keeps its scale and channels unchanged; the new Conv5 undergoes 2 groups of pooling operations, the 1st group a 2 × 2 max pooling and the 2nd group a 2 × 2 average pooling, each yielding a feature map of scale 19 × 19 with 256 channels, and the 2 feature maps are merged by Concat into a feature map with 512 channels; the three are fused with weights assigned adaptively by the network to become the new P2 (scale 19 × 19, 512 channels).
(3) New P3 generation process: the new Conv13 keeps its scale and channels unchanged; the new Conv11 undergoes 2 groups of pooling operations, the 1st group a 2 × 2 max pooling and the 2nd group a 2 × 2 average pooling, each yielding a feature map of scale 10 × 10 with 512 channels, and the 2 feature maps are merged by Concat into a feature map with 1024 channels; the new Conv5 undergoes 4 groups of pooling operations, the 1st group a 4 × 4 max pooling, the 2nd group a 4 × 4 average pooling, the 3rd group a 2 × 2 max pooling followed by a 2 × 2 average pooling, and the 4th group a 2 × 2 average pooling followed by a 2 × 2 max pooling; all 4 groups yield feature maps of scale 10 × 10 with 256 channels, and the four feature maps are merged by Concat into a feature map with 1024 channels; the three are fused with weights assigned adaptively by the network to become the new P3.
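The grouped pooling used to bring the new Conv5 down to the new P3 resolution while expanding its channels to 1024 can be sketched as follows (PyTorch assumed; because 38 is not an exact multiple of 4, this sketch resizes the pooled maps to 10 × 10, whereas a real implementation might instead pad or use adaptive pooling):

import torch
import torch.nn.functional as F

def conv5_to_p3_scale(x):
    # Four pooling groups, each producing a 256-channel map at roughly 1/4 resolution.
    g1 = F.max_pool2d(x, 4)                                    # 4 x 4 max pooling
    g2 = F.avg_pool2d(x, 4)                                    # 4 x 4 average pooling
    g3 = F.avg_pool2d(F.max_pool2d(x, 2), 2)                   # 2x2 max then 2x2 average
    g4 = F.max_pool2d(F.avg_pool2d(x, 2), 2)                   # 2x2 average then 2x2 max
    out = torch.cat([g1, g2, g3, g4], dim=1)                   # Concat: 4 x 256 = 1024 channels
    # 38 // 4 gives 9, so the pooled maps are resized to exactly 10 x 10
    # before adaptive fusion with the new Conv13.
    return F.interpolate(out, size=(10, 10), mode="nearest")

p3_branch = conv5_to_p3_scale(torch.randn(1, 256, 38, 38))     # -> (1, 1024, 10, 10)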
αi, βi and γi denote the weight matrices used when fusing Pi, with i = 1, 2, 3. For obtaining the weight matrices, the invention follows the ASFF algorithm, so that they can be generated at only a small additional computational cost. The application of the algorithm is discussed below.
Take as an example the generation of P1 by adaptive fusion after the new Conv5, new Conv11 and new Conv13 have been adjusted to a consistent scale and channel count. Since the 3 feature maps are fused pixel by pixel, every pixel position on each feature map corresponds to a fusion weight parameter. Suppose a pixel value on the adjusted new Conv5 feature map is $x^{1\to1}_{ij}$ with fusion weight parameter $\alpha^{1}_{ij}$; the pixel value at the corresponding position on the adjusted new Conv11 feature map is $x^{2\to1}_{ij}$ with fusion weight parameter $\beta^{1}_{ij}$; the pixel value at the corresponding position on the adjusted new Conv13 feature map is $x^{3\to1}_{ij}$ with fusion weight parameter $\gamma^{1}_{ij}$; and the pixel value at the corresponding position on the fused map is $y^{1}_{ij}$ (where i, j denote the position of the pixel in the map). Then the fusion relationship (3-1) holds between them:
$$y^{1}_{ij}=\alpha^{1}_{ij}\,x^{1\to1}_{ij}+\beta^{1}_{ij}\,x^{2\to1}_{ij}+\gamma^{1}_{ij}\,x^{3\to1}_{ij}\qquad(3\text{-}1)$$
where $\alpha^{1}_{ij}$, $\beta^{1}_{ij}$ and $\gamma^{1}_{ij}$ satisfy the constraint:
$$\alpha^{1}_{ij}+\beta^{1}_{ij}+\gamma^{1}_{ij}=1,\qquad \alpha^{1}_{ij},\beta^{1}_{ij},\gamma^{1}_{ij}\in[0,1]\qquad(3\text{-}2)$$
and $\alpha^{1}_{ij}$, $\beta^{1}_{ij}$ and $\gamma^{1}_{ij}$ are obtained through the Softmax function:
$$\alpha^{1}_{ij}=\frac{e^{\lambda^{1}_{\alpha,ij}}}{e^{\lambda^{1}_{\alpha,ij}}+e^{\lambda^{1}_{\beta,ij}}+e^{\lambda^{1}_{\gamma,ij}}}\qquad(3\text{-}3)$$
$$\beta^{1}_{ij}=\frac{e^{\lambda^{1}_{\beta,ij}}}{e^{\lambda^{1}_{\alpha,ij}}+e^{\lambda^{1}_{\beta,ij}}+e^{\lambda^{1}_{\gamma,ij}}}\qquad(3\text{-}4)$$
$$\gamma^{1}_{ij}=\frac{e^{\lambda^{1}_{\gamma,ij}}}{e^{\lambda^{1}_{\alpha,ij}}+e^{\lambda^{1}_{\beta,ij}}+e^{\lambda^{1}_{\gamma,ij}}}\qquad(3\text{-}5)$$
where the control parameters $\lambda^{1}_{\alpha,ij}$, $\lambda^{1}_{\beta,ij}$ and $\lambda^{1}_{\gamma,ij}$ are obtained in the network detection stage by convolving the three adjusted feature maps with three point convolutions, each having 1 output channel. After the values of $\lambda^{1}_{\alpha,ij}$, $\lambda^{1}_{\beta,ij}$ and $\lambda^{1}_{\gamma,ij}$ are obtained, the fusion weight parameters $\alpha^{1}_{ij}$, $\beta^{1}_{ij}$ and $\gamma^{1}_{ij}$ are computed by formulas (3-3) to (3-5); the fusion weight parameters corresponding to all pixels on the three feature maps form the three matrices α1, β1 and γ1 that represent the weights for fusing into P1; finally, the value of every pixel on the fused map is computed by formula (3-1) to generate the new P1.
From the above, adaptive weight acquisition is realized based on the ASFF algorithm; the computation introduced in the network detection stage is concentrated mainly in the point convolution operations (whose parameters are trained by the network beforehand), and each point convolution has only 1 output channel, so the total added computation is small. Yet the fusion proportion of positive-sample feature information can be greatly increased, and the interference of negative-sample information on the fusion is effectively suppressed, which promotes the network's improvement in small-scale target detection performance.
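The adaptive fusion of formulas (3-1) to (3-5) can be sketched as follows. PyTorch is assumed; three point convolutions with a single output channel produce the λ maps, and a Softmax over the three λ values at each pixel gives the weights α, β and γ:

import torch
import torch.nn as nn

class ASFFFuse(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # One point convolution with a single output channel per input branch.
        self.lam = nn.ModuleList([nn.Conv2d(channels, 1, kernel_size=1) for _ in range(3)])

    def forward(self, x1, x2, x3):
        # x1, x2, x3: the three feature maps already adjusted to the same scale and channels.
        lam = torch.cat([m(x) for m, x in zip(self.lam, (x1, x2, x3))], dim=1)   # (B, 3, H, W)
        w = torch.softmax(lam, dim=1)                                            # alpha, beta, gamma
        return w[:, 0:1] * x1 + w[:, 1:2] * x2 + w[:, 2:3] * x3                  # formula (3-1)

# Example: generate the new P1 from the adjusted new Conv5 / new Conv11 / new Conv13 maps.
fuse_p1 = ASFFFuse(256)
a, b, c = (torch.randn(1, 256, 38, 38) for _ in range(3))
new_p1 = fuse_p1(a, b, c)    # scale 38 x 38, 256 channels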
Step six, localization loss function optimization.
When the UAV shoots forward from high altitude, the distance to the targets is large, so targets of the same shape sometimes appear in the image with consistent postures, similar scales and large overlap; this phenomenon occurs especially easily when targets are dense, causing the prediction box of one target to drift toward an adjacent, overlapping target. Because MobileNetv1-SSDimpr follows the non-maximum suppression technique used in original SSD detection, it faces the same problem. To address the inaccurate localization and missed detections that easily occur when overlapping targets are detected, the invention, based on the repulsion loss (RepLoss) idea and from the perspective of optimizing the regression offset variance of machine-learned prediction boxes, appends an independently designed repulsion loss function to the localization loss function of MobileNetv1-SSDimpr, so that the network trains the offset variance required for prediction-box regression with the optimized localization loss function; thus, when overlapping targets are detected, the current target's prediction box can repel the suppressive interference caused by the prediction boxes of the other nearest targets.
The original MobileNetv1-SSDimpr follows the localization loss function of the original SSD, and the expression is:
$$L_{loc}(x,l,g)=\sum_{i\in Pos}^{N}\sum_{m\in\{cx,cy,w,h\}}x_{ij}^{k}\,\mathrm{smooth}_{L1}\!\left(l_{i}^{m}-\hat{g}_{j}^{m}\right)\qquad(3\text{-}6)$$
$$\hat{g}_{j}^{cx}=\frac{g_{j}^{cx}-d_{i}^{cx}}{d_{i}^{w}},\quad \hat{g}_{j}^{cy}=\frac{g_{j}^{cy}-d_{i}^{cy}}{d_{i}^{h}},\quad \hat{g}_{j}^{w}=\log\frac{g_{j}^{w}}{d_{i}^{w}},\quad \hat{g}_{j}^{h}=\log\frac{g_{j}^{h}}{d_{i}^{h}}\qquad(3\text{-}7)$$
In formula (3-6), N is the number of real boxes matched to default boxes, k is the label of the current prediction category, cx and cy are the horizontal and vertical coordinates of the center point of the matched default box, w and h are the width and height of the default box, $x_{ij}^{k}$ indicates whether the ith default box matches the jth real box of category k (1 for a match, 0 otherwise), $l_{i}$ is the position correction used to correct the offset of the ith prediction box from the true position, and $\hat{g}_{j}$ is the position offset of the jth real box relative to the ith default box, encoded as in formula (3-7).
As shown in formulas (3-6) and (3-7), the smaller the variance of $(l_{i}^{m}-\hat{g}_{j}^{m})$, the smaller the value of the localization loss function. In the network training stage, the localization loss function serves to find the optimal offset variance for correcting the offset between the prediction box and the actual position; this optimal offset variance is then used in the network detection stage to correct the regressed position of the prediction box. However, the localization loss function cannot distinguish which target a prediction box belongs to, so the obtained prediction offsets cannot prevent the prediction boxes regressed for different targets from being confused with one another in the detection stage. For this purpose, the invention appends an independently designed repulsion loss function $L_{Rep}$ after the localization loss function $L_{loc}(x,l,g)$, so that the prediction boxes regressed for different targets in the detection stage repel each other, reducing their mutual interference on detection and reducing inaccurate localization and missed detections.
Let the lower limit of the IOU for matching a default box to a real box be 0.5, let the set of all real boxes G be $G^{+}$, and let the set of all default boxes D matched to real boxes be $D^{+}$. Then, for a default box $D\in D^{+}$, the real box with which it attains the maximum IOU is obtained by formula (3-8):
$$G_{Mam}^{D}=\mathop{\arg\max}\limits_{G\in G^{+}}\ \mathrm{IOU}(G,D)\qquad(3\text{-}8)$$
In formula (3-8), argmax denotes taking the real box G for which the following IOU attains its maximum value (i.e., the box $G_{Mam}^{D}$), and Mam is an abbreviation of Maximum match.
The real box that has the next-largest IOU with the default box, excluding the largest-matching real box, is given by formula (3-9):
$$G_{Rep}^{D}=\mathop{\arg\max}\limits_{G\in G^{+}\setminus\{G_{Mam}^{D}\}}\ \mathrm{IOU}(G,D)\qquad(3\text{-}9)$$
In formula (3-9), $G^{+}\setminus\{G_{Mam}^{D}\}$ denotes the set $G^{+}$ with $G_{Mam}^{D}$ excluded, and Rep is an abbreviation of Repulsion.
If P is the prediction box regressed from the default box D, then the ratio IOG between P and $G_{Rep}^{D}$ is obtained from formula (3-10):
$$\mathrm{IOG}\!\left(P,G_{Rep}^{D}\right)=\frac{\mathrm{area}\!\left(P\cap G_{Rep}^{D}\right)}{\mathrm{area}\!\left(G_{Rep}^{D}\right)}\qquad(3\text{-}10)$$
Formula (3-10) computes the proportion of $G_{Rep}^{D}$, the next-largest matching real box whose interference the prediction box P seeks to repel, that is occupied by its intersection with P. The smaller this ratio, the greater the repulsion of the prediction box P against the next-largest matching real box $G_{Rep}^{D}$, i.e., the higher the sensitivity to repelling the other nearest neighbouring target. If $\mathrm{IOG}(P,G_{Rep}^{D})$ is to be made smaller during network training, this must be realized through a repulsion loss function. From formula (3-10), the range of $\mathrm{IOG}(P,G_{Rep}^{D})$ is (0, 1), so the smoothing function inside the repulsion loss function is defined on (0, 1). Formulas (3-11) to (3-12) give the repulsion loss function $L_{Rep}$ designed by the invention:
$$L_{Rep}=\frac{\sum_{P\in D^{+}}\mathrm{Smooth}_{ln}\!\left(\mathrm{IOG}\!\left(P,G_{Rep}^{D}\right)\right)}{\left|D^{+}\right|}\qquad(3\text{-}11)$$
$$\mathrm{Smooth}_{ln}(x)=-\ln(1-x),\qquad x\in(0,1)\qquad(3\text{-}12)$$
In formula (3-11), $D^{+}$ is the set of all default boxes D matched to real boxes, P denotes the prediction box regressed from each specific default box, $G_{Rep}^{D}$ is the real box of the nearest neighbouring target that the prediction box P repels, $\mathrm{IOG}(P,G_{Rep}^{D})$ is the ratio of the two, and $|D^{+}|$ is the norm of $D^{+}$, representing the total number of default boxes in the set.
In formula (3-11), the smaller $\mathrm{IOG}(P,G_{Rep}^{D})$ is, the smaller the value of the repulsion loss function $L_{Rep}$. During network training, $L_{Rep}$ then cooperates with the localization loss function of formula (3-6): by continuously reducing the values of $L_{loc}(x,l,g)$ and $L_{Rep}$ through backpropagated gradient updates, the network simultaneously learns an offset-correction variance for the prediction box P that both approaches the difference from the target's true position and makes $\mathrm{IOG}(P,G_{Rep}^{D})$ smaller. The corrected variance is used to finely correct the position of the prediction box during network detection, so that the prediction box can repel the other target nearest to the current target, thereby effectively reducing the inaccurate localization and missed detections caused by mutual interference of the prediction boxes of overlapping targets.
The optimized localization loss function of MobileNetv1-SSDimpr is given by formula (3-13):
$$L_{loc\_new}=L_{loc}(x,l,g)+\delta L_{Rep}\qquad(3\text{-}13)$$
In formula (3-13), δ is a hyperparameter that adjusts the participation weight of the repulsion loss function, and its empirical value is set to 0.5. Since the repulsion loss function consumes computing resources only during network training, no extra computation is added during model detection, and the detection rate is not affected.
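The repulsion term of formulas (3-10) to (3-13) can be sketched as follows. PyTorch is assumed, boxes are assumed to be in (x1, y1, x2, y2) form, the repelled real box G_Rep^D for each prediction is assumed to have been selected beforehand by formula (3-9), and the smoothing function of formula (3-12) is taken as -ln(1 - x):

import torch

def iog(pred, rep_gt):
    # IOG of formula (3-10): intersection area over the area of the repelled real box.
    lt = torch.max(pred[:, :2], rep_gt[:, :2])
    rb = torch.min(pred[:, 2:], rep_gt[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_g = (rep_gt[:, 2:] - rep_gt[:, :2]).clamp(min=0).prod(dim=1)
    return inter / area_g.clamp(min=1e-6)

def repulsion_loss(pred_boxes, rep_gt_boxes):
    # L_Rep of formula (3-11), averaged over the matched default boxes D+.
    x = iog(pred_boxes, rep_gt_boxes).clamp(max=1 - 1e-6)
    return (-torch.log(1.0 - x)).mean()          # Smooth_ln of formula (3-12)

def localization_loss_new(l_loc, pred_boxes, rep_gt_boxes, delta=0.5):
    # L_loc_new = L_loc(x, l, g) + delta * L_Rep, formula (3-13).
    return l_loc + delta * repulsion_loss(pred_boxes, rep_gt_boxes)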

Claims (10)

1. A real-time target detection method based on small-scale targets is characterized by comprising the following steps:
S1, adjusting the network detection layers: adjusting the feature layers used by MobileNetv1-SSD for detection, and constructing a network structure MobileNetv1-SSDimpr better suited to small-scale target detection;
the construction process of the MobileNetv1-SSDimpr comprises: adding a Conv5 layer for classification detection and setting its number of default box scales to 4; changing the number of default box scales on Conv11 to 4, with the default box scales of each detection layer otherwise calculated by the original SSD's default method; removing the Conv17 layer; and constructing the new detection feature layers of MobileNetv1-SSDimpr: Conv5, Conv11, Conv13, Conv14_2, Conv15_2 and Conv16_2;
S2, reconstructing the network default boxes: clustering the data with the K-means algorithm according to the distribution characteristics of the specific task data, and reconstructing the default boxes of the constructed MobileNetv1-SSDimpr according to the clustering scales;
S3, deep network information extraction and enhancement: for the MobileNetv1-SSDimpr with reconstructed default boxes, constructing a multi-path joint feature extraction enhancement mechanism at the three deep positions Conv14, Conv15 and Conv16 through grouped convolution and point convolution to obtain three feature extraction enhancement modules MFEE14, MFEE15 and MFEE16, and replacing the original corresponding standard convolution layers;
S4, shallow network information fusion enhancement: constructing a secondary recursive feature pyramid RFPN on the three shallow feature layers Conv5, Conv11 and Conv13 of the MobileNetv1 framework, generating three detection layers P1, P2 and P3, and using them to replace Conv5, Conv11 and Conv13 of the MobileNetv1-SSDimpr with reconstructed default boxes respectively; the three detection layers P1, P2 and P3 are used for small target detection.
2. The method for real-time target detection based on small-scale targets according to claim 1, characterized in that the method further comprises the step: S5, adaptive feature-layer fusion: an Attention mechanism is added to the MobileNetv1-based RFPN, and three new detection layers P1, P2 and P3 are generated through adaptive fusion.
3. The method according to claim 1, wherein in step S1, the size of Conv5 is 38 × 38 × 256, the size of Conv11 is 19 × 19 × 512, the size of Conv13 is 10 × 10 × 1024, the size of Conv14_2 is 5 × 5 × 512, the size of Conv15_2 is 3 × 3 × 256, and the size of Conv16_2 is 2 × 2 × 256.
4. The method for real-time object detection based on small-scale objects as claimed in claim 1, wherein the step S2 comprises the following steps:
s2-1, determining the default frame scale number as the clustering center number;
s2-2, selecting different scales as initial clustering centers according to target scale distribution to accelerate clustering convergence;
s2-3, performing intersection comparison IOU calculation on each target scale data and each cluster center respectively, and classifying the data into the cluster with the largest IOU value;
s2-4, calculating the height-width mean value of all targets in each cluster, and taking the mean value scale as a new cluster center;
s2-5, repeating the step S2-3 and the step S2-4 until the cluster center is not changed any more;
s2-6, after the clustering scale is obtained, the original setting of the default box is changed into the clustering scale.
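A minimal sketch of the clustering in steps S2-1 to S2-6 is given below. It assumes the target scales are (width, height) pairs and measures similarity with the width-height IoU commonly used for anchor clustering; the patent describes the procedure only verbally, so the exact distance formulation is an assumption.

```python
import numpy as np

def wh_iou(boxes: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """IoU between (N, 2) width-height pairs and (K, 2) cluster centers,
    treating every box as if anchored at the same corner."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centers[:, 0] * centers[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def kmeans_scales(boxes: np.ndarray, init_centers: np.ndarray,
                  max_iter: int = 100) -> np.ndarray:
    """Steps S2-1 to S2-5: assign each scale to the center with the largest
    IoU, recompute centers as per-cluster width/height means, repeat until
    the centers no longer change."""
    centers = init_centers.astype(float).copy()
    for _ in range(max_iter):
        assign = wh_iou(boxes, centers).argmax(axis=1)                 # S2-3
        new_centers = np.array([boxes[assign == k].mean(axis=0)
                                if np.any(assign == k) else centers[k]
                                for k in range(len(centers))])          # S2-4
        if np.allclose(new_centers, centers):                           # S2-5
            break
        centers = new_centers
    return centers  # S2-6: these cluster scales replace the original default-box settings
```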
5. The method according to claim 3, wherein different scales are selected as initial cluster centers; 4 default frame scales are set on the Conv5 and Conv11 detection layers of MobileNetv1-SSDimpr, and 5 default frame scales are set on the Conv13 detection layer, wherein the default frame scales comprise the merged 2nd and 3rd cluster scales and the remaining cluster scales; 6 default frame scales are set on the Conv14_2, Conv15_2 and Conv16_2 detection layers.
6. The method as claimed in claim 1, wherein in step S4, the constructed secondary recursive feature pyramid RFPN includes Section1, Section2, Section3 and Section4, and the shallow-network information fusion enhancement performed on the MobileNetv1-SSDimpr after the default frame is reconstructed specifically includes:
s4-1, Section1 feature pyramid fusion: in the first forward pass of the MobileNetv1-SSDimpr network, the input image information flows successively through the convolution operations from Conv4 to Conv13, and a top-down information feedback path is added on the right side in addition to the bottom-up information transmission path on the left side; in the top-down path, the original Conv13 feature map keeps its scale and dimension, remains at the top level, and is renamed the C3 level; the feature map of the C3 level is passed through a convolution that simplifies its channels, up-sampled to twice its size, fused pixel by pixel with the corresponding channels of Conv11 in the left-side path, and filtered for aliasing effects by a depthwise separable convolution, and the aliasing-filtered fusion map is named the C2 level; the C2 level then undergoes point convolution channel simplification, 2-times up-sampling, feature map fusion with the original Conv5 and convolution filtering of aliasing to obtain a fusion map, which is named the C1 level;
s4-2, Section2 feature pyramid fusion: the MobileNetv1-SSDimpr network generates a bottom-up feature pyramid in the second forward pass, but when Conv5, Conv11 and Conv13 are regenerated they are fused with the transversely transmitted C1, C2 and C3 feature maps respectively, and the subsequent feature layers are then generated; the specific process is as follows: when Conv4 is convolved to generate a new Conv5, the new Conv5 is fused pixel by pixel with the corresponding channels of C1, and a depthwise separable convolution is applied to filter the aliasing effect and produce a fusion map; the new Conv11 and the new Conv13 follow the same principle;
s4-3, Section3 feature pyramid fusion: at this time, new Conv5, Conv11 and Conv13 from bottom to top are generated by fusing C1, C2 and C3, and then new C3, new C2 and new C1 are sequentially generated by the same algorithm as Section 1;
s4-4, Section4 feature pyramid fusion: C1 is fused with the new C1, C2 is fused with the new C2, and the fusion results are used together with the new C3 for the final detection.
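The top-down half of Section1 (step S4-1) maps naturally onto a small module; the PyTorch-style sketch below is one possible reading, with channel counts taken from claim 3 and the exact convolution settings (kernel sizes, interpolation mode) chosen by assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparable(nn.Module):
    """3x3 depthwise + 1x1 pointwise convolution, used here to filter aliasing."""
    def __init__(self, channels: int):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pw(self.dw(x))

class Section1TopDown(nn.Module):
    """Sketch of S4-1: keep Conv13 as C3, then build C2 and C1 top-down."""
    def __init__(self, c5: int = 256, c11: int = 512, c13: int = 1024):
        super().__init__()
        self.reduce_c3 = nn.Conv2d(c13, c11, 1)   # point conv to match Conv11 channels
        self.reduce_c2 = nn.Conv2d(c11, c5, 1)    # point conv to match Conv5 channels
        self.smooth_c2 = DepthwiseSeparable(c11)
        self.smooth_c1 = DepthwiseSeparable(c5)

    def forward(self, conv5, conv11, conv13):
        c3 = conv13                                                        # stays at the top
        up3 = F.interpolate(self.reduce_c3(c3), size=conv11.shape[-2:], mode="nearest")
        c2 = self.smooth_c2(conv11 + up3)                                  # pixel-wise fusion
        up2 = F.interpolate(self.reduce_c2(c2), size=conv5.shape[-2:], mode="nearest")
        c1 = self.smooth_c1(conv5 + up2)
        return c1, c2, c3
```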
7. The method according to claim 6, wherein in steps S4-2 to S4-4 the original progressive-fusion-plus-residual-fusion mode is replaced by having the three feature layers fuse with one another directly through adaptive weight distribution, so as to generate three new detection layers: new P3, new P2 and new P1.
8. The method for real-time target detection based on small-scale targets of claim 7, wherein the fusion process is implemented in a grouping pooling manner.
9. The method for real-time object detection based on small-scale objects as claimed in claim 8, wherein, after the scales and channels of the new Conv5, the new Conv11 and the new Conv13 are adjusted to be consistent, the process of generating the new P1 by adaptive fusion is as follows: a pixel value on the adjusted feature map of the new Conv5 is $x^{1}_{ij}$ and its fusion weight parameter is $\alpha_{ij}$; the pixel value at the corresponding position on the adjusted feature map of the new Conv11 is $x^{2}_{ij}$ and its fusion weight parameter is $\beta_{ij}$; the pixel value at the corresponding position on the adjusted feature map of the new Conv13 is $x^{3}_{ij}$ and its fusion weight parameter is $\gamma_{ij}$; the pixel value at the corresponding position on the fusion map is $y_{ij}$, where $i, j$ denote the position of the pixel point in the image; the fusion relationship among them is:

$$y_{ij} = \alpha_{ij}\,x^{1}_{ij} + \beta_{ij}\,x^{2}_{ij} + \gamma_{ij}\,x^{3}_{ij} \tag{3-1}$$

wherein $\alpha_{ij}$, $\beta_{ij}$ and $\gamma_{ij}$ satisfy the constraints $\alpha_{ij} + \beta_{ij} + \gamma_{ij} = 1$ and $\alpha_{ij}, \beta_{ij}, \gamma_{ij} \in [0, 1]$, and their values are obtained through the Softmax function:

$$\alpha_{ij} = \frac{e^{\lambda^{\alpha}_{ij}}}{e^{\lambda^{\alpha}_{ij}} + e^{\lambda^{\beta}_{ij}} + e^{\lambda^{\gamma}_{ij}}} \tag{3-3}$$

$$\beta_{ij} = \frac{e^{\lambda^{\beta}_{ij}}}{e^{\lambda^{\alpha}_{ij}} + e^{\lambda^{\beta}_{ij}} + e^{\lambda^{\gamma}_{ij}}} \tag{3-4}$$

$$\gamma_{ij} = \frac{e^{\lambda^{\gamma}_{ij}}}{e^{\lambda^{\alpha}_{ij}} + e^{\lambda^{\beta}_{ij}} + e^{\lambda^{\gamma}_{ij}}} \tag{3-5}$$

wherein, in the network detection stage, the parameters $\lambda^{\alpha}_{ij}$, $\lambda^{\beta}_{ij}$ and $\lambda^{\gamma}_{ij}$ are obtained as specific numerical values by convolving the three adjusted feature maps with three point convolutions whose output channel number is 1; after the values of $\lambda^{\alpha}_{ij}$, $\lambda^{\beta}_{ij}$ and $\lambda^{\gamma}_{ij}$ are obtained, the values of the fusion weight parameters $\alpha_{ij}$, $\beta_{ij}$ and $\gamma_{ij}$ are calculated by formulas (3-3) to (3-5); the fusion weight parameters corresponding to all pixels on the three feature maps form three matrices, namely the weight matrices $\alpha 1$, $\beta 1$ and $\gamma 1$ representing the fusion of the new Conv5, the new Conv11 and the new Conv13 into P1; finally, the value of each pixel point on the fusion map is calculated by formula (3-1) to generate the new P1.
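Under this reading, the adaptive fusion of claim 9 is a per-pixel softmax over three learned scalar maps. The sketch below follows that reading; PyTorch is assumed, and the prior alignment of the three feature maps to a common scale and channel count is taken as already done.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Fuse three scale/channel-aligned feature maps into one detection layer
    using per-pixel softmax weights (equations 3-1 and 3-3 to 3-5)."""
    def __init__(self, channels: int):
        super().__init__()
        # one point convolution with a single output channel per input map -> lambda maps
        self.lam = nn.ModuleList([nn.Conv2d(channels, 1, kernel_size=1) for _ in range(3)])

    def forward(self, x1, x2, x3):
        lambdas = torch.cat([conv(x) for conv, x in zip(self.lam, (x1, x2, x3))], dim=1)
        weights = torch.softmax(lambdas, dim=1)      # alpha, beta, gamma, summing to 1 per pixel
        a, b, g = weights[:, 0:1], weights[:, 1:2], weights[:, 2:3]
        return a * x1 + b * x2 + g * x3              # equation (3-1), applied per pixel

# Hypothetical usage, assuming the three maps were already resized to 38x38x256:
# fuse = AdaptiveFusion(channels=256)
# new_p1 = fuse(new_conv5, new_conv11_aligned, new_conv13_aligned)
```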
10. The method for real-time object detection based on small-scale objects as claimed in claim 1, wherein a repulsion loss function $L_{Rep}$ is appended to the original localization loss function $L_{loc}(x, l, g)$ of MobileNetv1-SSDimpr, forming a new localization loss function:

$$L_{locnew} = L_{loc}(x, l, g) + \delta L_{Rep} \tag{3-13}$$

in formula (3-13), $\delta$ is a hyperparameter used to adjust the weight with which the repulsion loss $L_{Rep}$ participates;

the added repulsion loss function $L_{Rep}$ is:

$$L_{Rep} = \frac{\sum_{P \in D_{+}} Smooth_{Rep}\left(IoG\left(P, G^{P}_{Rep}\right)\right)}{\left|D_{+}\right|} \tag{3-11}$$

$$IoG\left(P, G^{P}_{Rep}\right) = \frac{area\left(P \cap G^{P}_{Rep}\right)}{area\left(G^{P}_{Rep}\right)} \tag{3-12}$$

in formula (3-11), $D_{+}$ is the set of all default boxes matching a real box, $P$ denotes the prediction box regressed from each specific default box, $G^{P}_{Rep}$ is the real box of the neighbouring target from which the prediction box $P$ is to be repelled, $IoG(P, G^{P}_{Rep})$ is the ratio of the intersection area of the two to the area of $G^{P}_{Rep}$, $Smooth_{Rep}(x)$ is a smoothing function, and $\left|D_{+}\right|$ is the cardinality of $D_{+}$, indicating that the sum is taken over the number of matched default boxes.
CN202110949953.2A 2021-08-18 2021-08-18 Real-time target detection method based on small-scale target Active CN113657285B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110949953.2A CN113657285B (en) 2021-08-18 2021-08-18 Real-time target detection method based on small-scale target

Publications (2)

Publication Number Publication Date
CN113657285A true CN113657285A (en) 2021-11-16
CN113657285B CN113657285B (en) 2022-10-11

Family

ID=78492259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110949953.2A Active CN113657285B (en) 2021-08-18 2021-08-18 Real-time target detection method based on small-scale target

Country Status (1)

Country Link
CN (1) CN113657285B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN112418330A (en) * 2020-11-26 2021-02-26 河北工程大学 Improved SSD (solid State drive) -based high-precision detection method for small target object

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘洁瑜 (Liu Jieyu) et al.: "Ship Target Detection in SAR Images Based on RetinaNet", Journal of Hunan University (Natural Sciences) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581790A (en) * 2022-03-01 2022-06-03 哈尔滨理工大学 Small target detection method based on image enhancement and multi-feature fusion
CN114612732A (en) * 2022-05-11 2022-06-10 成都数之联科技股份有限公司 Sample data enhancement method, system and device, medium and target classification method
CN117593516A (en) * 2024-01-18 2024-02-23 苏州元脑智能科技有限公司 Target detection method, device, equipment and storage medium
CN117593516B (en) * 2024-01-18 2024-03-22 苏州元脑智能科技有限公司 Target detection method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113657285B (en) 2022-10-11

Similar Documents

Publication Publication Date Title
CN113657285B (en) Real-time target detection method based on small-scale target
CN112132023B (en) Crowd counting method based on multi-scale context enhancement network
CN110188685B (en) Target counting method and system based on double-attention multi-scale cascade network
CN108985316B (en) Capsule network image classification and identification method for improving reconstruction network
CN108510485B (en) Non-reference image quality evaluation method based on convolutional neural network
CN111523521B (en) Remote sensing image classification method for double-branch fusion multi-scale attention neural network
CN110210608B (en) Low-illumination image enhancement method based on attention mechanism and multi-level feature fusion
CN112150521B (en) Image stereo matching method based on PSMNet optimization
CN110991362A (en) Pedestrian detection model based on attention mechanism
CN105657402A (en) Depth map recovery method
CN105631415A (en) Video pedestrian recognition method based on convolution neural network
CN112435191B (en) Low-illumination image enhancement method based on fusion of multiple neural network structures
CN111260591B (en) Image self-adaptive denoising method based on attention mechanism
CN109784358B (en) No-reference image quality evaluation method integrating artificial features and depth features
CN113592026A (en) Binocular vision stereo matching method based on void volume and cascade cost volume
CN112084934A (en) Behavior identification method based on two-channel depth separable convolution of skeletal data
CN112767283A (en) Non-uniform image defogging method based on multi-image block division
Liu et al. RB-Net: Training highly accurate and efficient binary neural networks with reshaped point-wise convolution and balanced activation
CN112634238A (en) Image quality evaluation method based on attention module
CN115331104A (en) Crop planting information extraction method based on convolutional neural network
CN115116054A (en) Insect pest identification method based on multi-scale lightweight network
CN112580473A (en) Motion feature fused video super-resolution reconstruction method
CN114757862B (en) Image enhancement progressive fusion method for infrared light field device
CN111369449A (en) Infrared blind pixel compensation method based on generating type countermeasure network
CN111582437A (en) Construction method of parallax regression deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant