CN111767944B - Single-stage detector design method suitable for multi-scale target detection based on deep learning - Google Patents


Info

Publication number
CN111767944B
CN111767944B (application CN202010462591.XA)
Authority
CN
China
Prior art keywords
anchor, loss, feature, convolution, feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010462591.XA
Other languages
Chinese (zh)
Other versions
CN111767944A (en
Inventor
赵敏
孙棣华
陈宇浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202010462591.XA priority Critical patent/CN111767944B/en
Publication of CN111767944A publication Critical patent/CN111767944A/en
Application granted granted Critical
Publication of CN111767944B publication Critical patent/CN111767944B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F18/2415 — PHYSICS; computing; electric digital data processing; pattern recognition; classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045 — computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N3/048 — neural networks; architecture; activation functions
    • G06N3/08 — neural networks; learning methods
    • G06V10/464 — image or video recognition or understanding; extraction of image or video features; salient features, e.g. scale invariant feature transforms [SIFT], using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • Y02T10/40 — climate change mitigation technologies related to transportation; internal combustion engine [ICE] based vehicles; engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep-learning-based design method for a single-stage detector suited to multi-scale target detection. On one hand, a feature pyramid whose levels carry balanced and sufficient feature information is constructed from three angles: semantic information, detail information, and receptive field. On the other hand, to improve the detector's recall on targets of extreme size, the invention abandons manually set Anchor scale and aspect-ratio parameters and lets the network learn the scales and distribution the Anchors require. By redesigning the single-stage detector from the perspectives of feature-pyramid construction and Anchor setting, the invention improves the detection precision of the single-stage detector on multi-scale targets while preserving detection speed.

Description

Single-stage detector design method suitable for multi-scale target detection based on deep learning
Technical Field
The invention discloses a design method for a single-stage detector suited to multi-scale target detection, which can be effectively applied to detection scenes with large scale variation, such as vehicle detection.
Background
Object detection, one of the most important basic tasks in computer vision, feeds many downstream tasks such as object tracking, re-identification, and instance segmentation. In recent years, with the rise of deep learning, target detection algorithms based on convolutional neural networks have quickly topped the major detection benchmarks in academia thanks to their speed, precision, and robustness. However, owing to the structural characteristics of convolution, a convolutional neural network is not scale-invariant, so scale variation between object instances remains one of the difficulties of target detection. From the standpoint of region-proposal generation, convolutional detectors divide into single-stage and multi-stage detectors: single-stage detectors discard the region-proposal step and directly classify and regress Anchors, achieving real-time inference speed and therefore wide application in real scenes. Their drawback is limited detection accuracy, especially in scenes densely populated with multi-scale objects. Improving the precision of single-stage detectors while preserving their efficient inference speed has therefore long been a research hotspot in the target detection field.
Current methods for improving multi-scale target detection mainly start from three aspects: multi-scale training, feature-pyramid construction, and the feature receptive field. Multi-scale training randomly changes the input resolution of training images every fixed number of iterations, forcing the network to learn target features at various scales. SNIP, when training at each fixed scale, back-propagates gradients only for targets of the matching scale and ignores targets that are too large or too small; at test time it runs detection at every scale but keeps only the results of the matching scale. Multi-scale training demands more video memory and longer training time, and multi-scale testing severely slows inference. Constructing feature pyramids is currently the most widely used approach. SSD builds a feature pyramid from backbone features of different resolutions to detect targets of corresponding scales; FPN and TDM additionally build a top-down path that supplements the semantically unbalanced pyramid in the backbone, improving multi-scale detection. The top-down branch, however, while supplementing the semantic information of shallow features, ignores the detail information that top-level features lack. Works such as STDN and PFPNet step outside this convention and, aiming at a pyramid with balanced information, obtain multi-scale feature pyramids from structures such as SPP or DenseNet. Feature-pyramid methods improve multi-scale detection markedly, but note that a complex feature pyramid introduces excessive parameters and computation, reducing model inference speed.
The receptive-field line of work, as the name implies, improves small-target detection by enlarging the receptive field of shallow features. Inspired by receptive-field structures in the human visual system, RFBNet adds atrous convolution to the Inception structure, designs a novel RFB module, and embeds it in the SSD algorithm; however, RFBNet attends only to shallow features, neglects information supplementation for high-level features, and thus limits the detector's multi-scale performance.

China has been building expressways since the 1990s, and their inherent characteristics and advantages give them a very important role in modern transportation. As more and more vehicles travel on expressways, various problems have emerged, foremost among them traffic congestion. Abnormal events such as traffic accidents and road maintenance make it difficult to fully utilize already limited expressway resources, causing serious congestion and vehicle-queuing problems. Unlike urban roads, vehicles on expressways generally travel faster, so once congestion occurs the consequences are often severe; the influence of the congestion typically lasts a long time and may cause serious economic loss.
Current queue-length prediction methods are mainly refinements of queuing theory or of traffic-wave models. Based on queuing theory, patent CN106887141A deploys continuous flow-collection nodes and, on the assumption that vehicle arrival rates obey a given distribution, derives the queue length of a road section from the queue lengths between nodes. Patent CN106571030A proposes a traffic-wave-based queue-length prediction method for the specific scene of a road intersection using multi-source data collected by floating cars; although its demands on detection-equipment layout are low, it requires a certain proportion of floating cars on the road, which is clearly hard to satisfy on expressways in most cases. Moreover, current queue-length prediction mainly targets simpler, closed road environments such as intersections, whereas expressways contain non-closed road scenes such as ramp tollgates, for which related research is lacking.
Therefore, using the multi-source data obtainable on expressways to effectively analyze and grasp the influence range of abnormal events and the evolution of queue length can guide traffic managers in formulating reasonable control strategies and thereby raise the management and service level of expressways; this is an urgent need of current intelligent transportation systems and an important and difficult research problem.
Disclosure of Invention
Accordingly, it is an object of the present invention to provide a deep-learning-based design method for a single-stage detector suited to multi-scale target detection. In view of the shortcomings of the prior art analyzed above, the single-stage detector is redesigned from the perspectives of feature-pyramid construction and Anchor setting, improving its detection precision on multi-scale targets while preserving detection speed. Specifically, the invention constructs a feature pyramid whose levels carry balanced and sufficient feature information from three angles: semantic information, detail information, and receptive field. In addition, to improve recall on targets of extreme size, the invention abandons manually set Anchor scale and aspect-ratio parameters and lets the network learn the scales and distribution the Anchors require.
The aim of the invention is achieved by the following technical scheme.
a single-stage detector design method suitable for multi-scale target detection based on deep learning, comprising the following steps:
step one: carrying out data enhancement on the image;
step two: the method comprises the steps of obtaining the characteristic f with high semantics, high detail and large receptive field, wherein the characteristic f comprises the following four parts:
1) The input picture is used for obtaining a feature image f with sufficient semantic information, which is sampled 32 times through a backbone network c
2) Parallel to backbone network, inputThe picture is subjected to 16 times pooling downsampling, and a feature map f for coding abundant detail information is obtained through a shallow layer network of a plurality of convolution modules d
3) For f c And f d Fusion is carried out: will f c Up-sampling to obtainAt the same time, 1X 1 convolution is used to ensure f d And->Feature dimensions are completely consistent for->post-Sigmoid operation sum f d Multiplying to obtain f cd
4) Will f cd Inputting the characteristic diagram f into a multi-branch cavity convolution module ASPP;
step three: based on the feature map f obtained in the second step, classifying features in the feature pyramid into two types according to the resolution ratio higher than the feature map f and lower than the feature map f, and adopting different processing methods for the two types of features to construct the feature pyramid suitable for multi-scale target detection;
step four: the automatic generation of the Anchor, namely the Guide Anchor, comprises the following three parts:
1) In the classification branch feature map f cls A single-channel 1 multiplied by 1 convolution is connected, and the probability of whether an Anchor is placed at a certain position is obtained through Sigmoid operation;
2) In the regression branch feature map f reg A double-channel 1 multiplied by 1 convolution is connected to calculate two parameters of the width and the height of an Anchor placed at a certain position;
3) Calculating convolution sampling point deviation by using an Anchor width and height feature map generated by regression branches, and respectively comparing f cls And f reg And carrying out deformable convolution to obtain characteristics for classifying and regressing the Anchor.
Step five: design of loss function
The loss function of the whole network is expressed as
Loss = L_cls + L_reg + λ(L_loc + L_shape)
where L_loc denotes the Anchor position loss, L_shape the loss of the Anchor shape branch, L_cls the classification loss of the Anchor prediction part, L_reg the regression loss of the Anchor prediction part, and λ a weighting coefficient.
Further, the specific process of the first step is as follows:
1) Randomly clipping a region from the training image, but ensuring that the clipped region has a target;
2) Randomly expanding the cut area by using zero pixels;
3) The extended picture is scaled to the input resolution size.
Further, the number of convolution modules in part 2) of step two is 3.
Further, the specific process of the third step is as follows:
1) Features in the feature pyramid whose resolution is higher than that of the feature f obtained in step two are enlarged by nearest-neighbor upsampling and then refined with a 1×1 convolution;
2) Features in the feature pyramid whose resolution is less than or equal to that of f are obtained directly from f by 3×3 convolutions with the appropriate stride.
Further, each loss in step five is calculated as follows:
1) In the Anchor generation part, considering that positive and negative samples are extremely unbalanced, Focal Loss is used to calculate the Anchor position loss L_loc;
2) In the Anchor generation part, considering only width and height, the loss L_shape of the Anchor shape branch is calculated with the GIoU Loss;
3) Anchor prediction comprises classification and regression; the classification loss L_cls uses a Softmax-based cross-entropy loss function, and the regression loss L_reg uses a Smooth L1 loss function.
Due to the adoption of the technical scheme, the invention has the following beneficial effects:
the present invention takes into account the sparse distribution of road detection devices,
on one hand, a feature pyramid whose levels carry balanced and sufficient feature information is constructed from three angles: semantic information, detail information, and receptive field; on the other hand, to improve recall on targets of extreme size, the invention abandons manually set Anchor scale and aspect-ratio parameters and lets the network learn the scales and distribution the Anchors require. By redesigning the single-stage detector from the perspectives of feature-pyramid construction and Anchor setting, the invention improves detection precision on multi-scale targets while preserving detection speed.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
The drawings of the present invention are described below.
FIG. 1 is a schematic flow diagram of a single-stage detector suitable for multi-scale target detection.
Fig. 2 is a schematic diagram of a single-stage detector suitable for multi-scale target detection.
Detailed Description
The invention is further described below with reference to the drawings and examples.
Example 1
As shown in fig. 1-2, the method for designing a single-stage detector suitable for multi-scale target detection based on deep learning according to the present embodiment includes the following steps:
step one: the data enhancement of the image mainly comprises the following three parts:
1) First, the width and height of the clipping region are generated at random, then its upper-left corner is generated at random, giving the clipping region. The intersection-over-union IoU of the clipping region and each target frame is calculated as
IoU(A, B) = area(A ∩ B) / area(A ∪ B)
where area denotes the area of a box, ∩ the intersection of two boxes, and ∪ their union. If the minimum IoU is not smaller than the specified threshold (e.g. 0.5), the overlap of the clipping region with every target exceeds the threshold and the crop is accepted. If the condition is not met, cropping is repeated up to 50 times; if it is still not met, the original picture is output directly;
2) Randomly generating two parameters of width and height of the extended picture, randomly generating a left upper corner for placing a clipping region, filling the clipping region into the extended picture, and filling zero pixels in other parts;
3) One of the following five scaling modes is selected at random: nearest-neighbor interpolation, bilinear interpolation, pixel-area relation, bicubic interpolation over a 4×4 pixel neighborhood, and Lanczos interpolation over an 8×8 pixel neighborhood; the expanded picture is scaled to the specified input resolution. Random cropping followed by expansion and scaling generates more multi-scale targets and thus strengthens the model's multi-scale performance. In the test stage, step one performs only this scaling part.
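The crop-and-retry criterion of part 1) can be sketched in pure Python. This is a hedged illustration under stated assumptions: the helper names, the box format (x1, y1, x2, y2), and the crop-size sampling range are choices made here for the example, not details fixed by the patent.

```python
import random

def iou(a, b):
    # IoU of two boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def try_random_crop(img_w, img_h, targets, min_iou=0.5, max_tries=50, rng=random):
    # Repeat random cropping up to 50 times; accept when the minimum IoU
    # with all target frames exceeds the threshold, else output the original.
    for _ in range(max_tries):
        w = rng.randint(img_w // 2, img_w)   # assumed sampling range
        h = rng.randint(img_h // 2, img_h)
        x = rng.randint(0, img_w - w)
        y = rng.randint(0, img_h - h)
        crop = (x, y, x + w, y + h)
        if min(iou(crop, t) for t in targets) > min_iou:
            return crop
    return (0, 0, img_w, img_h)  # fall back to the original picture
```

When no crop can satisfy the threshold (e.g. a tiny target in a large image), the function deterministically falls back to the full image, matching the text's "directly output the original picture".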
Step two: the method for acquiring the characteristic f with high semantics, high detail and large receptive field mainly comprises the following four parts:
1) Taking ResNet-50 as an example, the final global average pooling and fully connected layers are removed and the remainder is used as the backbone network; after feature extraction by the backbone, the input picture yields a 32×-downsampled feature map f_c with sufficient semantic information;
2) In parallel with the backbone, a shallow convolutional neural network is designed to supplement detail information. Specifically, a convolution, a batch-normalization layer (BN), and an activation function (ReLU) are combined into one convolution module, and several (e.g. 3) such modules are stacked; the input picture is 16×-downsampled by a pooling layer and then fed through these modules, producing a feature map f_d that encodes rich detail information;
3) Because the spatial resolution of f_c is smaller than that of f_d, f_c must be upsampled 2× before fusion. Taking the channels in groups of 4, the pixel values at the same position are rearranged into 2×2 blocks (sub-pixel upsampling), which doubles the resolution of f_c and reduces its channel count to a quarter, trading channel information for spatial resolution. A 1×1 convolution then makes the channel counts of f_c^up and f_d equal. Finally a Sigmoid is applied to f_c^up and the result multiplied with f_d, giving the feature f_cd with sufficient semantic and detail information. The overall process is
f_cd = Sigmoid(W · f_c^up) ⊙ f_d
where W denotes the weight of the 1×1 convolutional layer and ⊙ elementwise multiplication;
4) f_cd is fed into the multi-branch atrous convolution module ASPP to further enlarge the receptive field, yielding a high-quality feature f with sufficient semantic information, rich detail information, and an adequate receptive field;
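The sub-pixel (pixel-shuffle) upsampling and Sigmoid-gated fusion of part 3) can be sketched as follows. This is a minimal illustration on nested lists, assuming one common channel-to-block ordering; the 1×1 channel-matching convolution and ASPP are omitted.

```python
import math

def pixel_shuffle_x2(fm):
    # fm: [C][H][W] nested lists, C divisible by 4. Groups of 4 channels are
    # rearranged into 2x2 spatial blocks: resolution doubles, channels drop to 1/4.
    C, H, W = len(fm), len(fm[0]), len(fm[0][0])
    out = [[[0.0] * (2 * W) for _ in range(2 * H)] for _ in range(C // 4)]
    for c in range(C // 4):
        for h in range(H):
            for w in range(W):
                for dy in range(2):
                    for dx in range(2):
                        # assumed ordering: channel 4c+2*dy+dx -> block cell (dy, dx)
                        out[c][2 * h + dy][2 * w + dx] = fm[4 * c + 2 * dy + dx][h][w]
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def fuse(fc_up, fd):
    # f_cd = Sigmoid(f_c^up) * f_d elementwise (channel counts already matched
    # by the 1x1 convolution, which is omitted in this sketch)
    return [[[sigmoid(g) * d for g, d in zip(gr, dr)]
             for gr, dr in zip(gc, dc)] for gc, dc in zip(fc_up, fd)]
```

A real implementation would use a framework primitive (e.g. a pixel-shuffle layer) instead of explicit loops; the sketch only shows the rearrangement and gating semantics.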
step three: the method for constructing the feature pyramid suitable for multi-scale target detection mainly comprises the following two parts:
1) To obtain the 8×-downsampled feature map of the pyramid, f must be upsampled 2×: the features of every 4 adjacent channels are rearranged into a feature of twice the spatial resolution, and a 1×1 convolution unifies the channel count of the output feature maps to 256;
2) The pyramid feature map whose resolution equals that of f is obtained from f with a 256-channel 3×3 convolution; feature maps of lower resolution are obtained by cascaded stride-2 3×3 convolutions — for example, the 64×-downsampled map requires a further 4× downsampling of f and is therefore obtained with two stride-2 3×3 convolutions in cascade, and so on. The feature pyramid has 5 levels in total; the largest scale is the 8×-downsampled map and the smallest the 128×-downsampled map.
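The pyramid layout described above can be summarized in a small sketch: for each of the five strides, it reports the spatial size at a given input resolution and how the level is derived from the base feature f (16× downsampled). The operation strings are illustrative labels chosen here, not wording from the patent.

```python
import math

def pyramid_plan(input_size=512, f_stride=16):
    """Five pyramid levels from 8x to 128x downsampling, each derived from f."""
    plan = []
    for stride in (8, 16, 32, 64, 128):
        if stride < f_stride:
            # higher resolution than f: sub-pixel upsample then 1x1 conv
            op = "pixel-shuffle x2 upsample + 1x1 conv"
        elif stride == f_stride:
            op = "3x3 conv, stride 1"
        else:
            # cascade of stride-2 3x3 convs: one per factor of 2 below f
            n = int(math.log2(stride // f_stride))
            op = f"{n} stride-2 3x3 convs"
        plan.append((stride, input_size // stride, op))
    return plan
```

For a 512×512 input this gives maps of 64, 32, 16, 8, and 4 cells per side, with the 64× level built from two stride-2 convolutions, as in the text.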
Step four: the automatic generation of Anchor mainly comprises the following three parts:
1) The single-channel 1×1 convolution attached to the classification-branch feature map f_cls converts each pixel value into the interval (0, 1) by a Sigmoid operation, representing the probability of placing an Anchor at that position. In the training stage, a label map of Anchor positions must therefore be derived from the target-frame positions; the basic principle is that Anchors should be placed in the central region of a target frame, while pixels far from any target frame receive no Anchor. This involves two steps. First, targets of different scales are assigned to feature levels according to
k = ⌊log₂(√(w·h) / 32) + 0.5⌋
where w and h denote the width and height of the target and k the level index counted from the highest-resolution level; the base Anchor size at each point of a feature map is taken as 8 times its downsampling stride, i.e. 32 for the highest-resolution level used here. Taking the logarithm, adding 0.5, and rounding down implements rounding to the nearest level. Thus targets of area below 32² are detected by the highest-resolution feature map; targets of area between 32² and 64² by the next-highest resolution, and so on. Next, the hyper-parameters central-region area ratio ε₁ = 0.2 and ignored-region area ratio ε₂ = 0.5 are set, and a target frame is denoted (x, y, w, h), with (x, y) the center point and (w, h) the width and height. The central region CR, the ignored region IR, and the outside region OR are expressed as follows:
CR = (x, y, ε₁w, ε₁h)
IR = (x, y, ε₂w, ε₂h) \ CR
OR = R \ (x, y, ε₂w, ε₂h)
where R denotes the whole feature-map space and A\B denotes removing region B from region A. In the CR region of the feature map the Anchor-position label is 1, i.e. an Anchor is placed; in the IR region Anchor placement is not considered, i.e. no gradient is returned; in the OR region the label is 0, i.e. no Anchor is placed. When the CR region of one target overlaps other regions of different targets, the CR region prevails; when IR and OR overlap, IR prevails — in short, by priority, CR > IR > OR. In addition, to ease gradient contradictions between pyramid levels, CR regions on adjacent-level feature maps are also treated as IR regions of the current feature map. In the test stage, Anchors are placed only at positions whose Anchor-position score exceeds a specified threshold (e.g. 0.01);
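The level assignment and central/ignored-region construction above can be sketched as follows. This is a hedged illustration: the clamping of the level index to the pyramid range and the (cx, cy, w, h) tuple convention are assumptions of the sketch, and the CR/IR priority resolution across overlapping targets is left out.

```python
import math

def target_level(w, h, base_anchor=32, num_levels=5):
    # k = floor(log2(sqrt(w*h) / base_anchor) + 0.5), i.e. rounding to the
    # nearest level, clamped to the pyramid range (clamping assumed here)
    k = math.floor(math.log2(math.sqrt(w * h) / base_anchor) + 0.5)
    return min(max(k, 0), num_levels - 1)

def regions(box, eps1=0.2, eps2=0.5):
    # box = (center_x, center_y, width, height)
    x, y, w, h = box
    cr = (x, y, eps1 * w, eps1 * h)   # central region: Anchor label 1
    ir = (x, y, eps2 * w, eps2 * h)   # ignored region (minus CR): no gradient
    return cr, ir                     # everything outside IR is OR: label 0
```

A 32×32 target lands on the highest-resolution level (k = 0) and a 64×64 target on the next level (k = 1), matching the area thresholds in the text.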
2) The two parameters of Anchor width and height are obtained for the regression-branch feature map f_reg by a two-channel 1×1 convolution and computed as
w = σ · e^dw, h = σ · e^dh
where dw and dh denote the two predicted parameters and σ is a scale variable that can either be learned or set manually; for simplicity, σ = 8s here, with s the feature-map stride. The exponential form guarantees that Anchor width and height are non-negative. As in the previous step, during training the target matched to the Anchor at each feature point must be found so that this branch can be optimized by the loss function. For an Anchor variable a_wh = (x, y, w, h) of unknown width and height and a target frame gt = (x_g, y_g, w_g, h_g), the vIoU is defined as the maximum IoU between the Anchor and the target frame:
vIoU(a_wh, gt) = max_{w>0, h>0} IoU((x, y, w, h), gt)
Clearly, w and h range over infinitely many real values, so commonly used values of w and h are enumerated (e.g. the 9 Anchor shapes set in RetinaNet), giving the target frame and IoU corresponding to each Anchor. A positive-sample Anchor is then obtained by two routes: first, a positive-sample threshold pos = 0.5 is set, and an Anchor whose IoU exceeds it is treated as positive; second, each target is considered individually, and the Anchor with the maximum IoU for that target is treated as positive if that IoU exceeds 0.4. Finally, 128 positive samples are randomly drawn, and the difference between the Anchor shape and the target-frame shape is computed and optimized with the loss function;
3) To accommodate the variation of Anchor shape across a single feature map, the Anchor-shape feature map from the previous step is convolved and used as sampling-point offsets, and deformable convolution is applied to the classification and regression features respectively. Taking a 3×3 deformable convolution as an example, the computation is
y(p₀) = Σ_n W(p_n) · x(p₀ + p_n + Δp_n)
where Δp_n denotes the convolution sampling-point offset generated from the Anchor-shape feature map, W the deformable-convolution parameters, x the original feature, and y the new feature produced by the deformable convolution.
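The sampling rule of the deformable convolution can be sketched for a single output point. This is a deliberately simplified illustration: it assumes integer offsets with zero padding, a single input channel, and no bias, whereas real deformable convolution uses fractional offsets with bilinear interpolation.

```python
def deform_conv3x3_point(x, weights, offsets, p0):
    # y(p0) = sum_n W(p_n) * x(p0 + p_n + dp_n) over the 3x3 grid
    # x: [H][W] nested lists; weights: 9 values; offsets: 9 (dy, dx) pairs
    H, W = len(x), len(x[0])
    grid = [(-1, -1), (-1, 0), (-1, 1),
            (0, -1),  (0, 0),  (0, 1),
            (1, -1),  (1, 0),  (1, 1)]
    y = 0.0
    for n, (dy, dx) in enumerate(grid):
        ody, odx = offsets[n]
        py, px = p0[0] + dy + ody, p0[1] + dx + odx
        if 0 <= py < H and 0 <= px < W:     # zero padding outside the map
            y += weights[n] * x[py][px]
    return y
```

With all offsets zero and a one-hot center weight, the output reproduces the center pixel; shifting the center offset moves the sampling point, which is the mechanism the Anchor-shape map drives.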
Step five: the loss function of the whole network is calculated, and the loss function mainly comprises the following four parts:
1) The loss of the Anchor-position part is first written as a cross-entropy loss:
L_loc = -(y·lg p + (1-y)·lg(1-p))
where y and p denote the corresponding values in the Anchor-position label map and prediction map. As described in step four, the numbers of positive and negative samples in the Anchor-position label map differ greatly, so the losses they generate must be weighted, lest the gradient direction be dominated by the negatives. Define
p_t = p if y = 1, and p_t = 1-p otherwise,
so that L_loc = -lg p_t. The Focal Loss further balances both the number and the difficulty of positive and negative samples:
L_loc = -α_t (1-p_t)^γ lg p_t
where α_t balances the imbalance in positive/negative counts and (1-p_t)^γ balances the imbalance in hard-sample counts, making the network focus on categories that are scarce and hard to learn;
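The focal loss for a single Anchor-position cell can be sketched as follows. A natural logarithm is used here, as in common implementations; the patent's "lg" notation is kept symbolic in the text above.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    # Binary focal loss on the Anchor-location map: p is the predicted
    # probability, y the 0/1 label. Well-classified samples are down-weighted
    # by (1 - p_t)^gamma; alpha balances positives against negatives.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 0 and alpha = 0.5 this reduces to half the plain cross-entropy, showing that focal loss is a weighted generalization of the first formula above.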
2) Anchor shape branch loss function: following the analysis in step four, 128 positive-sample Anchors are randomly selected, and their predicted widths and heights are combined with the coordinates of the current feature points and mapped back to the original image, giving the concrete position and shape of each generated Anchor. This part of the loss is computed with an IoU-based method:

L_shape = 1 - IoU(B, B_gt) + R(B, B_gt)

where R(B, B_gt) is a penalty term on the generated Anchor box B and the target box B_gt. The invention adopts the DIoU loss for this regression term, whose penalty is:

R(B, B_gt) = ρ²(b, b_gt) / c²

where ρ²(b, b_gt) is the squared Euclidean distance between the center points of B and B_gt, and c is the diagonal length of the smallest rectangle enclosing both. Optimizing the Anchor-shape branch with DIoU yields Anchors that better match the target distribution;
3) The classification part of the Anchor uses a Softmax-based cross-entropy loss:

L_cls = -lg( e^{x_j} / (Σ_{c=1}^{C} e^{x_c} + ε) )

where x_j is the sample's predicted score for its true class, C is the total number of classes, and ε is a small constant (e.g. 10^-5) that prevents the denominator from underflowing to zero when it falls below machine precision. In addition, to avoid positive/negative imbalance, the algorithm sorts all negative samples by loss and back-propagates gradients only through the hardest negatives, capped at three times the number of positive samples. The regression part of the Anchor uses the Smooth L1 loss:

Smooth_L1(x) = 0.5·x², if |x| < 1;  |x| - 0.5, otherwise

where x is the difference between the target box's encoding relative to the Anchor and the predicted offsets; the encoding is the same as in most detection algorithms (e.g., Faster R-CNN, SSD);
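The Softmax cross entropy with the stabilising ε and the Smooth L1 loss described above can be sketched as follows (illustrative NumPy only; subtracting the maximum logit before exponentiating is a standard numerical convention, not part of the patent text):

```python
import numpy as np

def softmax_ce(logits, true_class, eps=1e-5):
    """Softmax cross entropy with epsilon added to the denominator."""
    z = logits - logits.max()             # stabilise the exponentials
    exp_z = np.exp(z)
    p = exp_z[true_class] / (exp_z.sum() + eps)
    return -np.log(p)

def smooth_l1(x):
    """Elementwise Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1.0, 0.5 * x * x, np.abs(x) - 0.5)
```

Smooth L1 behaves quadratically near zero (stable gradients for small regression errors) and linearly for large errors (robust to outlier boxes).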
4) Combining the three preceding parts, the loss function of the whole network is the weighted sum of the loss from the Anchor generation part and the loss from the network prediction part, as shown below:

Loss = L_cls + L_reg + λ(L_loc + L_shape)

where λ is the weighting coefficient between the two parts and is generally set to 1.
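Putting the four parts together, the joint objective is simply a weighted sum (a trivial sketch; the individual loss values would come from the four computations described above):

```python
def total_loss(l_cls, l_reg, l_loc, l_shape, lam=1.0):
    """Weighted sum of the prediction losses and the Anchor-generation
    losses; lam = 1 by default, as suggested in the text."""
    return l_cls + l_reg + lam * (l_loc + l_shape)
```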
Finally, it is noted that the above embodiments only illustrate, and do not limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution may be modified or equivalently substituted without departing from its spirit and scope, and such modifications are intended to fall within the scope of the present invention.

Claims (5)

1. A method for designing a single-stage detector suitable for multi-scale target detection based on deep learning, comprising the steps of:
step one: carrying out data enhancement on the image;
step two: the method comprises the steps of obtaining the characteristic f with high semantics, high detail and large receptive field, wherein the characteristic f comprises the following four parts:
1) The input picture is downsampled 32× through a backbone network to obtain a feature map f_c with sufficient semantic information;
2) In parallel with the backbone network, the input picture is downsampled 16× by pooling and passed through a shallow network of several convolution modules to obtain a feature map f_d encoding rich detail information;
3) f_c and f_d are fused: f_c is upsampled, a 1×1 convolution is applied so that the feature dimensions of f_d and the upsampled f_c are completely consistent, and the upsampled f_c is passed through a Sigmoid and multiplied with f_d to obtain f_cd;
4) f_cd is input into the multi-branch dilated-convolution module ASPP to obtain the feature map f;
step three: based on the feature map f obtained in the second step, classifying features in the feature pyramid into two types according to the resolution ratio higher than the feature map f and lower than the feature map f, and adopting different processing methods for the two types of features to construct the feature pyramid suitable for multi-scale target detection;
step four: the automatic generation of the Anchor, namely the Guide Anchor, comprises the following three parts:
1) A single-channel 1×1 convolution followed by a Sigmoid is applied to the classification branch feature map f_cls to obtain the probability that an Anchor should be placed at each position;
2) A two-channel 1×1 convolution is applied to the regression branch feature map f_reg to predict the width and height of the Anchor to be placed at each position;
3) The Anchor width-height feature map generated by the regression branch is used to compute convolution sampling-point offsets, and deformable convolutions are applied to f_cls and f_reg respectively to obtain the features used for Anchor classification and regression;
step five: design of loss function
The loss function of the whole network is expressed as

Loss = L_cls + L_reg + λ(L_loc + L_shape)

where L_loc is the Anchor position loss, L_shape the loss of the Anchor-shape branch, L_cls the classification loss of the Anchor prediction part, L_reg the regression loss of the Anchor prediction part, and λ a weighting coefficient.
2. The design method according to claim 1, wherein the specific process of the first step is as follows:
1) Randomly clipping a region from the training image, but ensuring that the clipped region has a target;
2) Randomly expanding the cut area by using zero pixels;
3) The extended picture is scaled to the input resolution size.
3. The design method according to claim 1 or 2, wherein the number of convolution modules in step 2) in the second step is 3.
4. The design method according to claim 1, wherein the specific process of the third step is as follows:
1) Features of the feature pyramid whose resolution is higher than that of the feature map f obtained in step two are enlarged by nearest-neighbor upsampling and then refined with a 1×1 convolution;
2) Features whose resolution is less than or equal to that of f are obtained directly from f by 3×3 convolutions with the specified stride.
5. The design method according to claim 1, wherein the calculation method of each loss in the fifth step is as follows:
1) In the Anchor generation part, considering that positive and negative samples are extremely unbalanced, the Focal Loss is used to compute the Anchor position loss L_loc;
2) In the Anchor generation part, only width and height are considered, and the Anchor-shape branch loss L_shape is computed using the GIoU Loss;
3) The prediction of the Anchor includes classification and regression; the classification loss L_cls uses a Softmax-based cross-entropy loss function, and the regression loss L_reg uses a Smooth L1 loss function.
CN202010462591.XA 2020-05-27 2020-05-27 Single-stage detector design method suitable for multi-scale target detection based on deep learning Active CN111767944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010462591.XA CN111767944B (en) 2020-05-27 2020-05-27 Single-stage detector design method suitable for multi-scale target detection based on deep learning

Publications (2)

Publication Number Publication Date
CN111767944A CN111767944A (en) 2020-10-13
CN111767944B true CN111767944B (en) 2023-08-15

Family

ID=72719742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010462591.XA Active CN111767944B (en) 2020-05-27 2020-05-27 Single-stage detector design method suitable for multi-scale target detection based on deep learning

Country Status (1)

Country Link
CN (1) CN111767944B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417990B (en) * 2020-10-30 2023-05-09 四川天翼网络股份有限公司 Examination student illegal behavior identification method and system
CN112529005B (en) * 2020-12-11 2022-12-06 西安电子科技大学 Target detection method based on semantic feature consistency supervision pyramid network
CN112633162B (en) * 2020-12-22 2024-03-22 重庆大学 Pedestrian rapid detection and tracking method suitable for expressway external field shielding condition
CN112733929B (en) * 2021-01-07 2024-07-19 南京工程学院 Improved Yolo underwater image small target and shielding target detection method
CN113052170B (en) * 2021-03-22 2023-12-26 江苏东大金智信息***有限公司 Small target license plate recognition method under unconstrained scene
CN113221754A (en) * 2021-05-14 2021-08-06 深圳前海百递网络有限公司 Express waybill image detection method and device, computer equipment and storage medium
CN113780358A (en) * 2021-08-16 2021-12-10 华北电力大学(保定) Real-time hardware fitting detection method based on anchor-free network
CN114189876B (en) * 2021-11-17 2023-11-21 北京航空航天大学 Flow prediction method and device and electronic equipment

Citations (4)

Publication number Priority date Publication date Assignee Title
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN109584227A (en) * 2018-11-27 2019-04-05 山东大学 A kind of quality of welding spot detection method and its realization system based on deep learning algorithm of target detection
CN110807384A (en) * 2019-10-24 2020-02-18 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Small target detection method and system under low visibility
CN111008619A (en) * 2020-01-19 2020-04-14 南京智莲森信息技术有限公司 High-speed rail contact net support number plate detection and identification method based on deep semantic extraction

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US10262237B2 (en) * 2016-12-08 2019-04-16 Intel Corporation Technologies for improved object detection accuracy with multi-scale representation and training


Non-Patent Citations (1)

Title
FFBNet: Lightweight Backbone for Object Detection Based Feature Fusion Block; Binqi Fan et al.; 2019 IEEE International Conference on Image Processing (ICIP); 3920-3924 *


Similar Documents

Publication Publication Date Title
CN111767944B (en) Single-stage detector design method suitable for multi-scale target detection based on deep learning
CN110147763B (en) Video semantic segmentation method based on convolutional neural network
CN110188705B (en) Remote traffic sign detection and identification method suitable for vehicle-mounted system
CN112418236B (en) Automobile drivable area planning method based on multitask neural network
CN114445430B (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN111967480A (en) Multi-scale self-attention target detection method based on weight sharing
CN113486764B (en) Pothole detection method based on improved YOLOv3
CN113255589B (en) Target detection method and system based on multi-convolution fusion network
CN113313706B (en) Power equipment defect image detection method based on detection reference point offset analysis
CN112257793A (en) Remote traffic sign detection method based on improved YOLO v3 algorithm
CN111832453A (en) Unmanned scene real-time semantic segmentation method based on double-path deep neural network
CN112990065A (en) Optimized YOLOv5 model-based vehicle classification detection method
CN115082672A (en) Infrared image target detection method based on bounding box regression
CN114120280A (en) Traffic sign detection method based on small target feature enhancement
CN114202803A (en) Multi-stage human body abnormal action detection method based on residual error network
CN113052108A (en) Multi-scale cascade aerial photography target detection method and system based on deep neural network
CN112949635B (en) Target detection method based on feature enhancement and IoU perception
CN115527096A (en) Small target detection method based on improved YOLOv5
CN116630932A (en) Road shielding target detection method based on improved YOLOV5
CN113361528B (en) Multi-scale target detection method and system
Bogdoll et al. Multimodal detection of unknown objects on roads for autonomous driving
CN117746264A (en) Multitasking implementation method for unmanned aerial vehicle detection and road segmentation
Leroux et al. Automated training of location-specific edge models for traffic counting
CN116681976A (en) Progressive feature fusion method for infrared small target detection
CN113269171B (en) Lane line detection method, electronic device and vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant