CN111767944B - Single-stage detector design method suitable for multi-scale target detection based on deep learning - Google Patents


Info

Publication number
CN111767944B
CN111767944B (application CN202010462591.XA)
Authority
CN
China
Prior art keywords
anchor, loss, feature, convolution, feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010462591.XA
Other languages
Chinese (zh)
Other versions
CN111767944A (en
Inventor
赵敏
孙棣华
陈宇浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202010462591.XA priority Critical patent/CN111767944B/en
Publication of CN111767944A publication Critical patent/CN111767944A/en
Application granted granted Critical
Publication of CN111767944B publication Critical patent/CN111767944B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F18/2415 — PHYSICS; computing; electric digital data processing; pattern recognition; classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045 — computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N3/048 — neural networks; architecture; activation functions
    • G06N3/08 — neural networks; learning methods
    • G06V10/464 — image or video recognition or understanding; extraction of image or video features; salient features, e.g. scale invariant feature transforms [SIFT], using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • Y02T10/40 — climate change mitigation technologies related to transportation; internal combustion engine [ICE] based vehicles; engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep-learning-based design method for a single-stage detector suited to multi-scale target detection. On one hand, a feature pyramid whose levels carry balanced and sufficient feature information is constructed from three angles: semantic information, detail information, and receptive field. On the other hand, to improve the detector's recall on targets of extreme size, the invention abandons manually set Anchor scale and aspect-ratio parameters and lets the network learn the scales and distribution the Anchors require. By redesigning the single-stage detector from the perspectives of feature-pyramid construction and Anchor setting, the invention improves the detection precision of the single-stage detector on multi-scale targets while preserving detection speed.

Description

Single-stage detector design method suitable for multi-scale target detection based on deep learning
Technical Field
The invention discloses a design method for a single-stage detector suited to multi-scale target detection, which can be effectively applied to detection scenes with large scale variation, such as vehicle detection.
Background
Object detection, one of the most important basic tasks in computer vision, feeds many downstream tasks such as object tracking, re-identification, and instance segmentation. In recent years, with the rise of deep learning, target detection algorithms based on convolutional neural networks have quickly topped the major detection benchmarks in academia thanks to their speed, precision, and robustness. However, owing to the structural characteristics of convolution, a convolutional neural network is not scale-invariant, so scale variation between object instances remains one of the difficulties of target detection. From the standpoint of region-proposal generation, convolutional detectors divide into single-stage and multi-stage detectors: single-stage detectors discard the region-proposal step and directly classify and regress Anchors, achieving real-time inference speed and therefore wide application in real scenes. Their drawback is limited detection accuracy, especially in scenes densely populated with multi-scale objects. Improving the precision of single-stage detectors while preserving their efficient inference speed has therefore long been a research hotspot in the target detection field.
Current methods for improving multi-scale target detection mainly start from three aspects: multi-scale training, feature-pyramid construction, and the feature receptive field. Multi-scale training randomly changes the input resolution of training images every fixed number of iterations, forcing the network to learn target features at various scales. SNIP, when training at each fixed scale, back-propagates gradients only for targets of the matching scale and ignores targets that are too large or too small; at test time it runs detection at every scale but keeps only the results of the matching scale. Multi-scale training demands more video memory and longer training time, and multi-scale testing severely slows inference. Constructing feature pyramids is currently the most widely used approach. SSD builds a feature pyramid from backbone features of different resolutions to detect targets of corresponding scales; FPN and TDM additionally build a top-down path that supplements the semantically unbalanced pyramid in the backbone, improving multi-scale detection. The top-down branch, however, while supplementing the semantic information of shallow features, ignores the detail information that top-level features lack. Works such as STDN and PFPNet step outside this convention and, aiming at a pyramid with balanced information, obtain multi-scale feature pyramids from structures such as SPP or DenseNet. Feature-pyramid methods improve multi-scale detection markedly, but note that a complex feature pyramid introduces excessive parameters and computation, reducing model inference speed.
The receptive-field line of work, as the name implies, improves small-target detection by enlarging the receptive field of shallow features. Inspired by receptive-field structures in the human visual system, RFBNet adds atrous convolution to the Inception structure, designs a novel RFB module, and embeds it in the SSD algorithm; however, RFBNet attends only to shallow features, neglects information supplementation for high-level features, and thus limits the detector's multi-scale performance.

China has been building expressways since the 1990s, and their inherent characteristics and advantages give them a very important role in modern transportation. As more and more vehicles travel on expressways, various problems have emerged, foremost among them traffic congestion. Abnormal events such as traffic accidents and road maintenance make it difficult to fully utilize already limited expressway resources, causing serious congestion and vehicle-queuing problems. Unlike urban roads, vehicles on expressways generally travel faster, so once congestion occurs the consequences are often severe; the influence of the congestion typically lasts a long time and may cause serious economic loss.
Current queue-length prediction methods are mainly refinements of queuing theory or of traffic-wave models. Based on queuing theory, patent CN106887141A deploys continuous flow-collection nodes and, on the assumption that vehicle arrival rates obey a given distribution, derives the queue length of a road section from the queue lengths between nodes. Patent CN106571030A proposes a traffic-wave-based queue-length prediction method for the specific scene of a road intersection using multi-source data collected by floating cars; although its demands on detection-equipment layout are low, it requires a certain proportion of floating cars on the road, which is clearly hard to satisfy on expressways in most cases. Moreover, current queue-length prediction mainly targets simpler, closed road environments such as intersections, whereas expressways contain non-closed road scenes such as ramp tollgates, for which related research is lacking.
Therefore, using the multi-source data obtainable on expressways to effectively analyze and grasp the influence range of abnormal events and the evolution of queue length can guide traffic managers in formulating reasonable control strategies and thereby raise the management and service level of expressways; this is an urgent need of current intelligent transportation systems and an important and difficult research problem.
Disclosure of Invention
Accordingly, it is an object of the present invention to provide a deep-learning-based design method for a single-stage detector suited to multi-scale target detection. In view of the shortcomings of the prior art analyzed above, the single-stage detector is redesigned from the perspectives of feature-pyramid construction and Anchor setting, improving its detection precision on multi-scale targets while preserving detection speed. Specifically, the invention constructs a feature pyramid whose levels carry balanced and sufficient feature information from three angles: semantic information, detail information, and receptive field. In addition, to improve recall on targets of extreme size, the invention abandons manually set Anchor scale and aspect-ratio parameters and lets the network learn the scales and distribution the Anchors require.
The aim of the invention is achieved by the following technical scheme.
a single-stage detector design method suitable for multi-scale target detection based on deep learning, comprising the following steps:
step one: carrying out data enhancement on the image;
step two: the method comprises the steps of obtaining the characteristic f with high semantics, high detail and large receptive field, wherein the characteristic f comprises the following four parts:
1) The input picture is used for obtaining a feature image f with sufficient semantic information, which is sampled 32 times through a backbone network c
2) Parallel to backbone network, inputThe picture is subjected to 16 times pooling downsampling, and a feature map f for coding abundant detail information is obtained through a shallow layer network of a plurality of convolution modules d
3) For f c And f d Fusion is carried out: will f c Up-sampling to obtainAt the same time, 1X 1 convolution is used to ensure f d And->Feature dimensions are completely consistent for->post-Sigmoid operation sum f d Multiplying to obtain f cd
4) Will f cd Inputting the characteristic diagram f into a multi-branch cavity convolution module ASPP;
step three: based on the feature map f obtained in the second step, classifying features in the feature pyramid into two types according to the resolution ratio higher than the feature map f and lower than the feature map f, and adopting different processing methods for the two types of features to construct the feature pyramid suitable for multi-scale target detection;
step four: the automatic generation of the Anchor, namely the Guide Anchor, comprises the following three parts:
1) In the classification branch feature map f cls A single-channel 1 multiplied by 1 convolution is connected, and the probability of whether an Anchor is placed at a certain position is obtained through Sigmoid operation;
2) In the regression branch feature map f reg A double-channel 1 multiplied by 1 convolution is connected to calculate two parameters of the width and the height of an Anchor placed at a certain position;
3) Calculating convolution sampling point deviation by using an Anchor width and height feature map generated by regression branches, and respectively comparing f cls And f reg And carrying out deformable convolution to obtain characteristics for classifying and regressing the Anchor.
Step five: design of loss function
The loss function of the whole network is expressed as
Loss = L_cls + L_reg + λ(L_loc + L_shape)
where L_loc denotes the Anchor position loss, L_shape the loss of the Anchor shape branch, L_cls the classification loss of the Anchor prediction part, L_reg the regression loss of the Anchor prediction part, and λ a weighting coefficient.
Further, the specific process of the first step is as follows:
1) Randomly clipping a region from the training image, but ensuring that the clipped region has a target;
2) Randomly expanding the cut area by using zero pixels;
3) The extended picture is scaled to the input resolution size.
Further, the number of convolution modules in part 2) of step two is 3.
Further, the specific process of the third step is as follows:
1) Features in the feature pyramid whose resolution is higher than that of the feature f obtained in step two are enlarged by nearest-neighbor upsampling and then refined with a 1×1 convolution;
2) Features in the feature pyramid whose resolution is less than or equal to that of f are obtained directly from f by 3×3 convolutions with the appropriate stride.
Further, each loss in step five is calculated as follows:
1) In the Anchor generation part, considering that positive and negative samples are extremely unbalanced, Focal Loss is used to calculate the Anchor position loss L_loc;
2) In the Anchor generation part, considering only width and height, the loss L_shape of the Anchor shape branch is calculated with the GIoU Loss;
3) Anchor prediction comprises classification and regression; the classification loss L_cls uses a Softmax-based cross-entropy loss function, and the regression loss L_reg uses a Smooth L1 loss function.
Due to the adoption of the technical scheme, the invention has the following beneficial effects:
the present invention takes into account the sparse distribution of road detection devices,
on one hand, a feature pyramid whose levels carry balanced and sufficient feature information is constructed from three angles: semantic information, detail information, and receptive field; on the other hand, to improve recall on targets of extreme size, the invention abandons manually set Anchor scale and aspect-ratio parameters and lets the network learn the scales and distribution the Anchors require. By redesigning the single-stage detector from the perspectives of feature-pyramid construction and Anchor setting, the invention improves detection precision on multi-scale targets while preserving detection speed.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
The drawings of the present invention are described below.
FIG. 1 is a schematic flow diagram of a single-stage detector suitable for multi-scale target detection.
Fig. 2 is a schematic diagram of a single-stage detector suitable for multi-scale target detection.
Detailed Description
The invention is further described below with reference to the drawings and examples.
Example 1
As shown in fig. 1-2, the method for designing a single-stage detector suitable for multi-scale target detection based on deep learning according to the present embodiment includes the following steps:
step one: the data enhancement of the image mainly comprises the following three parts:
1) First, the width and height of the clipping region are generated at random, then its upper-left corner is generated at random, giving the clipping region. The intersection-over-union IoU of the clipping region and each target frame is calculated as
IoU(A, B) = area(A ∩ B) / area(A ∪ B)
where area denotes the area of a box, ∩ the intersection of two boxes, and ∪ their union. If the minimum IoU is not smaller than the specified threshold (e.g. 0.5), the overlap of the clipping region with every target exceeds the threshold and the crop is accepted. If the condition is not met, cropping is repeated up to 50 times; if it is still not met, the original picture is output directly;
2) Randomly generating two parameters of width and height of the extended picture, randomly generating a left upper corner for placing a clipping region, filling the clipping region into the extended picture, and filling zero pixels in other parts;
3) One of the following five scaling modes is selected at random: nearest-neighbor interpolation, bilinear interpolation, pixel-area relation, bicubic interpolation over a 4×4 pixel neighborhood, and Lanczos interpolation over an 8×8 pixel neighborhood; the expanded picture is scaled to the specified input resolution. Random cropping followed by expansion and scaling generates more multi-scale targets and thus strengthens the model's multi-scale performance. In the test stage, step one performs only this scaling part.
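The crop-and-retry criterion of part 1) can be sketched in pure Python. This is a hedged illustration under stated assumptions: the helper names, the box format (x1, y1, x2, y2), and the crop-size sampling range are choices made here for the example, not details fixed by the patent.

```python
import random

def iou(a, b):
    # IoU of two boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def try_random_crop(img_w, img_h, targets, min_iou=0.5, max_tries=50, rng=random):
    # Repeat random cropping up to 50 times; accept when the minimum IoU
    # with all target frames exceeds the threshold, else output the original.
    for _ in range(max_tries):
        w = rng.randint(img_w // 2, img_w)   # assumed sampling range
        h = rng.randint(img_h // 2, img_h)
        x = rng.randint(0, img_w - w)
        y = rng.randint(0, img_h - h)
        crop = (x, y, x + w, y + h)
        if min(iou(crop, t) for t in targets) > min_iou:
            return crop
    return (0, 0, img_w, img_h)  # fall back to the original picture
```

When no crop can satisfy the threshold (e.g. a tiny target in a large image), the function deterministically falls back to the full image, matching the text's "directly output the original picture".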
Step two: the method for acquiring the characteristic f with high semantics, high detail and large receptive field mainly comprises the following four parts:
1) Taking ResNet-50 as an example, the final global average pooling and fully connected layers are removed and the remainder is used as the backbone network; after feature extraction by the backbone, the input picture yields a 32×-downsampled feature map f_c with sufficient semantic information;
2) In parallel with the backbone, a shallow convolutional neural network is designed to supplement detail information. Specifically, a convolution, a batch-normalization layer (BN), and an activation function (ReLU) are combined into one convolution module, and several (e.g. 3) such modules are stacked; the input picture is 16×-downsampled by a pooling layer and then fed through these modules, producing a feature map f_d that encodes rich detail information;
3) Because the spatial resolution of f_c is smaller than that of f_d, f_c must be upsampled 2× before fusion. Taking the channels in groups of 4, the pixel values at the same position are rearranged into 2×2 blocks (sub-pixel upsampling), which doubles the resolution of f_c and reduces its channel count to a quarter, trading channel information for spatial resolution. A 1×1 convolution then makes the channel counts of f_c^up and f_d equal. Finally a Sigmoid is applied to f_c^up and the result multiplied with f_d, giving the feature f_cd with sufficient semantic and detail information. The overall process is
f_cd = Sigmoid(W · f_c^up) ⊙ f_d
where W denotes the weight of the 1×1 convolutional layer and ⊙ elementwise multiplication;
4) f_cd is fed into the multi-branch atrous convolution module ASPP to further enlarge the receptive field, yielding a high-quality feature f with sufficient semantic information, rich detail information, and an adequate receptive field;
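The sub-pixel (pixel-shuffle) upsampling and Sigmoid-gated fusion of part 3) can be sketched as follows. This is a minimal illustration on nested lists, assuming one common channel-to-block ordering; the 1×1 channel-matching convolution and ASPP are omitted.

```python
import math

def pixel_shuffle_x2(fm):
    # fm: [C][H][W] nested lists, C divisible by 4. Groups of 4 channels are
    # rearranged into 2x2 spatial blocks: resolution doubles, channels drop to 1/4.
    C, H, W = len(fm), len(fm[0]), len(fm[0][0])
    out = [[[0.0] * (2 * W) for _ in range(2 * H)] for _ in range(C // 4)]
    for c in range(C // 4):
        for h in range(H):
            for w in range(W):
                for dy in range(2):
                    for dx in range(2):
                        # assumed ordering: channel 4c+2*dy+dx -> block cell (dy, dx)
                        out[c][2 * h + dy][2 * w + dx] = fm[4 * c + 2 * dy + dx][h][w]
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def fuse(fc_up, fd):
    # f_cd = Sigmoid(f_c^up) * f_d elementwise (channel counts already matched
    # by the 1x1 convolution, which is omitted in this sketch)
    return [[[sigmoid(g) * d for g, d in zip(gr, dr)]
             for gr, dr in zip(gc, dc)] for gc, dc in zip(fc_up, fd)]
```

A real implementation would use a framework primitive (e.g. a pixel-shuffle layer) instead of explicit loops; the sketch only shows the rearrangement and gating semantics.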
step three: the method for constructing the feature pyramid suitable for multi-scale target detection mainly comprises the following two parts:
1) To obtain the 8×-downsampled feature map of the pyramid, f must be upsampled 2×: the features of every 4 adjacent channels are rearranged into a feature of twice the spatial resolution, and a 1×1 convolution unifies the channel count of the output feature maps to 256;
2) The pyramid feature map whose resolution equals that of f is obtained from f with a 256-channel 3×3 convolution; feature maps of lower resolution are obtained by cascaded stride-2 3×3 convolutions — for example, the 64×-downsampled map requires a further 4× downsampling of f and is therefore obtained with two stride-2 3×3 convolutions in cascade, and so on. The feature pyramid has 5 levels in total; the largest scale is the 8×-downsampled map and the smallest the 128×-downsampled map.
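The pyramid layout described above can be summarized in a small sketch: for each of the five strides, it reports the spatial size at a given input resolution and how the level is derived from the base feature f (16× downsampled). The operation strings are illustrative labels chosen here, not wording from the patent.

```python
import math

def pyramid_plan(input_size=512, f_stride=16):
    """Five pyramid levels from 8x to 128x downsampling, each derived from f."""
    plan = []
    for stride in (8, 16, 32, 64, 128):
        if stride < f_stride:
            # higher resolution than f: sub-pixel upsample then 1x1 conv
            op = "pixel-shuffle x2 upsample + 1x1 conv"
        elif stride == f_stride:
            op = "3x3 conv, stride 1"
        else:
            # cascade of stride-2 3x3 convs: one per factor of 2 below f
            n = int(math.log2(stride // f_stride))
            op = f"{n} stride-2 3x3 convs"
        plan.append((stride, input_size // stride, op))
    return plan
```

For a 512×512 input this gives maps of 64, 32, 16, 8, and 4 cells per side, with the 64× level built from two stride-2 convolutions, as in the text.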
Step four: the automatic generation of Anchor mainly comprises the following three parts:
1) The single-channel 1×1 convolution attached to the classification-branch feature map f_cls converts each pixel value into the interval (0, 1) by a Sigmoid operation, representing the probability of placing an Anchor at that position. In the training stage, a label map of Anchor positions must therefore be derived from the target-frame positions; the basic principle is that Anchors should be placed in the central region of a target frame, while pixels far from any target frame receive no Anchor. This involves two steps. First, targets of different scales are assigned to feature levels according to
k = ⌊log₂(√(w·h) / 32) + 0.5⌋
where w and h denote the width and height of the target and k the level index counted from the highest-resolution level; the base Anchor size at each point of a feature map is taken as 8 times its downsampling stride, i.e. 32 for the highest-resolution level used here. Taking the logarithm, adding 0.5, and rounding down implements rounding to the nearest level. Thus targets of area below 32² are detected by the highest-resolution feature map; targets of area between 32² and 64² by the next-highest resolution, and so on. Next, the hyper-parameters central-region area ratio ε₁ = 0.2 and ignored-region area ratio ε₂ = 0.5 are set, and a target frame is denoted (x, y, w, h), with (x, y) the center point and (w, h) the width and height. The central region CR, the ignored region IR, and the outside region OR are expressed as follows:
CR = (x, y, ε₁w, ε₁h)
IR = (x, y, ε₂w, ε₂h) \ CR
OR = R \ (x, y, ε₂w, ε₂h)
where R denotes the whole feature-map space and A\B denotes removing region B from region A. In the CR region of the feature map the Anchor-position label is 1, i.e. an Anchor is placed; in the IR region Anchor placement is not considered, i.e. no gradient is returned; in the OR region the label is 0, i.e. no Anchor is placed. When the CR region of one target overlaps other regions of different targets, the CR region prevails; when IR and OR overlap, IR prevails — in short, by priority, CR > IR > OR. In addition, to ease gradient contradictions between pyramid levels, CR regions on adjacent-level feature maps are also treated as IR regions of the current feature map. In the test stage, Anchors are placed only at positions whose Anchor-position score exceeds a specified threshold (e.g. 0.01);
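The level assignment and central/ignored-region construction above can be sketched as follows. This is a hedged illustration: the clamping of the level index to the pyramid range and the (cx, cy, w, h) tuple convention are assumptions of the sketch, and the CR/IR priority resolution across overlapping targets is left out.

```python
import math

def target_level(w, h, base_anchor=32, num_levels=5):
    # k = floor(log2(sqrt(w*h) / base_anchor) + 0.5), i.e. rounding to the
    # nearest level, clamped to the pyramid range (clamping assumed here)
    k = math.floor(math.log2(math.sqrt(w * h) / base_anchor) + 0.5)
    return min(max(k, 0), num_levels - 1)

def regions(box, eps1=0.2, eps2=0.5):
    # box = (center_x, center_y, width, height)
    x, y, w, h = box
    cr = (x, y, eps1 * w, eps1 * h)   # central region: Anchor label 1
    ir = (x, y, eps2 * w, eps2 * h)   # ignored region (minus CR): no gradient
    return cr, ir                     # everything outside IR is OR: label 0
```

A 32×32 target lands on the highest-resolution level (k = 0) and a 64×64 target on the next level (k = 1), matching the area thresholds in the text.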
2) The two parameters of Anchor width and height are obtained for the regression-branch feature map f_reg by a two-channel 1×1 convolution and computed as
w = σ · e^dw, h = σ · e^dh
where dw and dh denote the two predicted parameters and σ is a scale variable that can either be learned or set manually; for simplicity, σ = 8s here, with s the feature-map stride. The exponential form guarantees that Anchor width and height are non-negative. As in the previous step, during training the target matched to the Anchor at each feature point must be found so that this branch can be optimized by the loss function. For an Anchor variable a_wh = (x, y, w, h) of unknown width and height and a target frame gt = (x_g, y_g, w_g, h_g), the vIoU is defined as the maximum IoU between the Anchor and the target frame:
vIoU(a_wh, gt) = max_{w>0, h>0} IoU((x, y, w, h), gt)
Clearly, w and h range over infinitely many real values, so commonly used values of w and h are enumerated (e.g. the 9 Anchor shapes set in RetinaNet), giving the target frame and IoU corresponding to each Anchor. A positive-sample Anchor is then obtained by two routes: first, a positive-sample threshold pos = 0.5 is set, and an Anchor whose IoU exceeds it is treated as positive; second, each target is considered individually, and the Anchor with the maximum IoU for that target is treated as positive if that IoU exceeds 0.4. Finally, 128 positive samples are randomly drawn, and the difference between the Anchor shape and the target-frame shape is computed and optimized with the loss function;
3) To accommodate the variation of Anchor shape across a single feature map, the Anchor-shape feature map from the previous step is convolved and used as sampling-point offsets, and deformable convolution is applied to the classification and regression features respectively. Taking a 3×3 deformable convolution as an example, the computation is
y(p₀) = Σ_n W(p_n) · x(p₀ + p_n + Δp_n)
where Δp_n denotes the convolution sampling-point offset generated from the Anchor-shape feature map, W the deformable-convolution parameters, x the original feature, and y the new feature produced by the deformable convolution.
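The sampling rule of the deformable convolution can be sketched for a single output point. This is a deliberately simplified illustration: it assumes integer offsets with zero padding, a single input channel, and no bias, whereas real deformable convolution uses fractional offsets with bilinear interpolation.

```python
def deform_conv3x3_point(x, weights, offsets, p0):
    # y(p0) = sum_n W(p_n) * x(p0 + p_n + dp_n) over the 3x3 grid
    # x: [H][W] nested lists; weights: 9 values; offsets: 9 (dy, dx) pairs
    H, W = len(x), len(x[0])
    grid = [(-1, -1), (-1, 0), (-1, 1),
            (0, -1),  (0, 0),  (0, 1),
            (1, -1),  (1, 0),  (1, 1)]
    y = 0.0
    for n, (dy, dx) in enumerate(grid):
        ody, odx = offsets[n]
        py, px = p0[0] + dy + ody, p0[1] + dx + odx
        if 0 <= py < H and 0 <= px < W:     # zero padding outside the map
            y += weights[n] * x[py][px]
    return y
```

With all offsets zero and a one-hot center weight, the output reproduces the center pixel; shifting the center offset moves the sampling point, which is the mechanism the Anchor-shape map drives.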
Step five: the loss function of the whole network is calculated, and the loss function mainly comprises the following four parts:
1) The loss of the Anchor-position part is first written as a cross-entropy loss:
L_loc = -(y·lg p + (1-y)·lg(1-p))
where y and p denote the corresponding values in the Anchor-position label map and prediction map. As described in step four, the numbers of positive and negative samples in the Anchor-position label map differ greatly, so the losses they generate must be weighted, lest the gradient direction be dominated by the negatives. Define
p_t = p if y = 1, and p_t = 1-p otherwise,
so that L_loc = -lg p_t. The Focal Loss further balances both the number and the difficulty of positive and negative samples:
L_loc = -α_t (1-p_t)^γ lg p_t
where α_t balances the imbalance in positive/negative counts and (1-p_t)^γ balances the imbalance in hard-sample counts, making the network focus on categories that are scarce and hard to learn;
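The focal loss for a single Anchor-position cell can be sketched as follows. A natural logarithm is used here, as in common implementations; the patent's "lg" notation is kept symbolic in the text above.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    # Binary focal loss on the Anchor-location map: p is the predicted
    # probability, y the 0/1 label. Well-classified samples are down-weighted
    # by (1 - p_t)^gamma; alpha balances positives against negatives.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 0 and alpha = 0.5 this reduces to half the plain cross-entropy, showing that focal loss is a weighted generalization of the first formula above.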
2) Anchor shape branch loss function: following the analysis in step four, 128 positive-sample Anchors are randomly selected, and their predicted widths and heights are combined with the coordinates of the current feature points and mapped back to the original image, giving the concrete position and shape of each generated Anchor. This part of the loss is computed with an IoU-based method:

L_shape = 1 - IoU(B, B_gt) + R(B, B_gt)

where R(B, B_gt) is a penalty term on the generated Anchor box B and the target box B_gt. The invention adopts the DIoU loss for this regression term, whose penalty is:

R(B, B_gt) = ρ²(b, b_gt) / c²

where ρ²(b, b_gt) is the squared Euclidean distance between the center points of B and B_gt, and c is the diagonal length of the smallest rectangle enclosing both. Optimizing the Anchor-shape branch with DIoU yields Anchors that better match the target distribution;
3) The classification part of the Anchor uses a Softmax-based cross-entropy loss:

L_cls = -lg( e^{x_j} / (Σ_{c=1}^{C} e^{x_c} + ε) )

where x_j is the sample's predicted score for its true class, C is the total number of classes, and ε is a small constant (e.g. 10^-5) that prevents the denominator from underflowing to zero when it falls below machine precision. In addition, to avoid positive/negative imbalance, the algorithm sorts all negative samples by loss and back-propagates gradients only through the hardest negatives, capped at three times the number of positive samples. The regression part of the Anchor uses the Smooth L1 loss:

Smooth_L1(x) = 0.5·x², if |x| < 1;  |x| - 0.5, otherwise

where x is the difference between the target box's encoding relative to the Anchor and the predicted offsets; the encoding is the same as in most detection algorithms (e.g., Faster R-CNN, SSD);
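The Softmax cross entropy with the stabilising ε and the Smooth L1 loss described above can be sketched as follows (illustrative NumPy only; subtracting the maximum logit before exponentiating is a standard numerical convention, not part of the patent text):

```python
import numpy as np

def softmax_ce(logits, true_class, eps=1e-5):
    """Softmax cross entropy with epsilon added to the denominator."""
    z = logits - logits.max()             # stabilise the exponentials
    exp_z = np.exp(z)
    p = exp_z[true_class] / (exp_z.sum() + eps)
    return -np.log(p)

def smooth_l1(x):
    """Elementwise Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1.0, 0.5 * x * x, np.abs(x) - 0.5)
```

Smooth L1 behaves quadratically near zero (stable gradients for small regression errors) and linearly for large errors (robust to outlier boxes).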
4) Combining the three preceding parts, the loss function of the whole network is the weighted sum of the loss from the Anchor generation part and the loss from the network prediction part, as shown below:

Loss = L_cls + L_reg + λ(L_loc + L_shape)

where λ is the weighting coefficient between the two parts and is generally set to 1.
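Putting the four parts together, the joint objective is simply a weighted sum (a trivial sketch; the individual loss values would come from the four computations described above):

```python
def total_loss(l_cls, l_reg, l_loc, l_shape, lam=1.0):
    """Weighted sum of the prediction losses and the Anchor-generation
    losses; lam = 1 by default, as suggested in the text."""
    return l_cls + l_reg + lam * (l_loc + l_shape)
```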
Finally, it is noted that the above embodiments only illustrate, and do not limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution may be modified or equivalently substituted without departing from its spirit and scope, and such modifications are intended to fall within the scope of the present invention.

Claims (5)

1. A method for designing a single-stage detector suitable for multi-scale target detection based on deep learning, comprising the steps of:
step one: carrying out data enhancement on the image;
step two: the method comprises the steps of obtaining the characteristic f with high semantics, high detail and large receptive field, wherein the characteristic f comprises the following four parts:
1) The input picture is downsampled 32× through a backbone network to obtain a feature map f_c with sufficient semantic information;
2) In parallel with the backbone network, the input picture is downsampled 16× by pooling and passed through a shallow network of several convolution modules to obtain a feature map f_d encoding rich detail information;
3) f_c and f_d are fused: f_c is upsampled, a 1×1 convolution is applied so that the feature dimensions of f_d and the upsampled f_c are completely consistent, and the upsampled f_c is passed through a Sigmoid and multiplied with f_d to obtain f_cd;
4) f_cd is input into the multi-branch dilated-convolution module ASPP to obtain the feature map f;
step three: based on the feature map f obtained in the second step, classifying features in the feature pyramid into two types according to the resolution ratio higher than the feature map f and lower than the feature map f, and adopting different processing methods for the two types of features to construct the feature pyramid suitable for multi-scale target detection;
step four: the automatic generation of the Anchor, namely the Guide Anchor, comprises the following three parts:
1) A single-channel 1×1 convolution followed by a Sigmoid is applied to the classification branch feature map f_cls to obtain the probability that an Anchor should be placed at each position;
2) A two-channel 1×1 convolution is applied to the regression branch feature map f_reg to predict the width and height of the Anchor to be placed at each position;
3) The Anchor width-height feature map generated by the regression branch is used to compute convolution sampling-point offsets, and deformable convolutions are applied to f_cls and f_reg respectively to obtain the features used for Anchor classification and regression;
step five: design of loss function
The loss function of the whole network is expressed as

Loss = L_cls + L_reg + λ(L_loc + L_shape)

where L_loc is the Anchor position loss, L_shape the loss of the Anchor-shape branch, L_cls the classification loss of the Anchor prediction part, L_reg the regression loss of the Anchor prediction part, and λ a weighting coefficient.
2. The design method according to claim 1, wherein the specific process of the first step is as follows:
1) Randomly clipping a region from the training image, but ensuring that the clipped region has a target;
2) Randomly expanding the cut area by using zero pixels;
3) The extended picture is scaled to the input resolution size.
3. The design method according to claim 1 or 2, wherein the number of convolution modules in step 2) in the second step is 3.
4. The design method according to claim 1, wherein the specific process of the third step is as follows:
1) Features of the feature pyramid whose resolution is higher than that of the feature map f obtained in step two are enlarged by nearest-neighbor upsampling and then refined with a 1×1 convolution;
2) Features whose resolution is less than or equal to that of f are obtained directly from f by 3×3 convolutions with the specified stride.
5. The design method according to claim 1, wherein the calculation method of each loss in the fifth step is as follows:
1) In the Anchor generation part, considering that positive and negative samples are extremely unbalanced, the Focal Loss is used to compute the Anchor position loss L_loc;
2) In the Anchor generation part, only width and height are considered, and the Anchor-shape branch loss L_shape is computed using the GIoU Loss;
3) The prediction of the Anchor includes classification and regression; the classification loss L_cls uses a Softmax-based cross-entropy loss function, and the regression loss L_reg uses a Smooth L1 loss function.
CN202010462591.XA 2020-05-27 2020-05-27 Single-stage detector design method suitable for multi-scale target detection based on deep learning Active CN111767944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010462591.XA CN111767944B (en) 2020-05-27 2020-05-27 Single-stage detector design method suitable for multi-scale target detection based on deep learning

Publications (2)

Publication Number Publication Date
CN111767944A CN111767944A (en) 2020-10-13
CN111767944B true CN111767944B (en) 2023-08-15

Family

ID=72719742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010462591.XA Active CN111767944B (en) 2020-05-27 2020-05-27 Single-stage detector design method suitable for multi-scale target detection based on deep learning

Country Status (1)

Country Link
CN (1) CN111767944B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417990B (en) * 2020-10-30 2023-05-09 四川天翼网络股份有限公司 Examination student illegal behavior identification method and system
CN112529005B (en) * 2020-12-11 2022-12-06 西安电子科技大学 Target detection method based on semantic feature consistency supervision pyramid network
CN112633162B (en) * 2020-12-22 2024-03-22 重庆大学 Pedestrian rapid detection and tracking method suitable for expressway external field shielding condition
CN112733929B (en) * 2021-01-07 2024-07-19 南京工程学院 Improved Yolo underwater image small target and shielding target detection method
CN113052170B (en) * 2021-03-22 2023-12-26 江苏东大金智信息***有限公司 Small target license plate recognition method under unconstrained scene
CN113221754A (en) * 2021-05-14 2021-08-06 深圳前海百递网络有限公司 Express waybill image detection method and device, computer equipment and storage medium
CN113780358A (en) * 2021-08-16 2021-12-10 华北电力大学(保定) Real-time hardware fitting detection method based on anchor-free network
CN114189876B (en) * 2021-11-17 2023-11-21 北京航空航天大学 Flow prediction method and device and electronic equipment

Citations (4)

Publication number Priority date Publication date Assignee Title
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN109584227A (en) * 2018-11-27 2019-04-05 山东大学 A kind of quality of welding spot detection method and its realization system based on deep learning algorithm of target detection
CN110807384A (en) * 2019-10-24 2020-02-18 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Small target detection method and system under low visibility
CN111008619A (en) * 2020-01-19 2020-04-14 南京智莲森信息技术有限公司 High-speed rail contact net support number plate detection and identification method based on deep semantic extraction

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US10262237B2 (en) * 2016-12-08 2019-04-16 Intel Corporation Technologies for improved object detection accuracy with multi-scale representation and training


Non-Patent Citations (1)

Title
FFBNet: Lightweight Backbone for Object Detection Based Feature Fusion Block; Binqi Fan et al.; 2019 IEEE International Conference on Image Processing (ICIP); 3920-3924 *


Similar Documents

Publication Publication Date Title
CN111767944B (en) Single-stage detector design method suitable for multi-scale target detection based on deep learning
CN110147763B (en) Video semantic segmentation method based on convolutional neural network
CN110188705B (en) Remote traffic sign detection and identification method suitable for vehicle-mounted system
CN112418236B (en) Automobile drivable area planning method based on multitask neural network
CN114445430B (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN111967480A (en) Multi-scale self-attention target detection method based on weight sharing
CN113486764B (en) Pothole detection method based on improved YOLOv3
CN113255589B (en) Target detection method and system based on multi-convolution fusion network
CN113313706B (en) Power equipment defect image detection method based on detection reference point offset analysis
CN112257793A (en) Remote traffic sign detection method based on improved YOLO v3 algorithm
CN111832453A (en) Unmanned scene real-time semantic segmentation method based on double-path deep neural network
CN112990065A (en) Optimized YOLOv5 model-based vehicle classification detection method
CN115082672A (en) Infrared image target detection method based on bounding box regression
CN114120280A (en) Traffic sign detection method based on small target feature enhancement
CN114202803A (en) Multi-stage human body abnormal action detection method based on residual error network
CN113052108A (en) Multi-scale cascade aerial photography target detection method and system based on deep neural network
CN112949635B (en) Target detection method based on feature enhancement and IoU perception
CN115527096A (en) Small target detection method based on improved YOLOv5
CN116630932A (en) Road shielding target detection method based on improved YOLOV5
CN113361528B (en) Multi-scale target detection method and system
Bogdoll et al. Multimodal detection of unknown objects on roads for autonomous driving
CN117746264A (en) Multitasking implementation method for unmanned aerial vehicle detection and road segmentation
Leroux et al. Automated training of location-specific edge models for traffic counting
CN116681976A (en) Progressive feature fusion method for infrared small target detection
CN113269171B (en) Lane line detection method, electronic device and vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant