CN111680619A - Pedestrian detection method based on convolutional neural network and dual-attention mechanism - Google Patents
- Publication number
- CN111680619A CN111680619A CN202010506077.1A CN202010506077A CN111680619A CN 111680619 A CN111680619 A CN 111680619A CN 202010506077 A CN202010506077 A CN 202010506077A CN 111680619 A CN111680619 A CN 111680619A
- Authority
- CN
- China
- Prior art keywords
- attention
- channel
- map
- feature map
- double
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides a pedestrian detection method based on a convolutional neural network and a dual-attention mechanism, which comprises the following steps: inputting images from the Caltech dataset and the CityPersons dataset; extracting image features with a convolutional neural network based on the dual-attention mechanism as the backbone network, while the detection head classifies and regresses the features; and marking out each pedestrian with a bounding box. The invention provides a lightweight dual-attention modeling method which can not only model the relationship between feature channels but also improve the expressive power of the feature map at the pixel level. The invention constructs a single-stage pedestrian detector, CSANet, based on the dual-attention mechanism, and further analyzes the factors influencing its performance through experiments. While maintaining computational efficiency, CSANet achieves state-of-the-art performance on the Caltech benchmark and competitive performance on the CityPersons benchmark.
Description
Technical Field
The invention relates to the technical field of pedestrian detection in computer vision, and in particular to a pedestrian detection method based on a convolutional neural network and a dual-attention mechanism.
Background
Pedestrian detection plays a crucial role in computer vision tasks such as autonomous driving, robotics and surveillance. With the rise of deep learning, pedestrian detectors have made great progress in recent years. However, even the current state-of-the-art pedestrian detectors are still far from matching human speed and accuracy. Current mainstream pedestrian detectors tend to benefit directly from convolutional neural networks (CNNs) designed for image classification. Although CNNs apply large downsampling factors to generate high-level semantic features, they cannot adaptively focus on the useful channels and regions of the feature map, which limits further improvement of pedestrian detection performance.
Notably, pedestrians in traffic scenes have characteristics different from general objects, such as diverse backgrounds and widely varying scales. Typically, researchers employ deep models to abstract the high-level semantics of object instances, which helps identify pedestrians in traffic scenes. Unfortunately, this approach filters out the location information of many small-scale as well as large-scale pedestrians. Due to the inherent nature of CNNs, critical channels cannot be highlighted and critical spatial locations cannot be emphasized. Convolution is a local operation that obtains local information of an image by applying a kernel to a local patch; this local operation prevents the CNN from capturing the image from a global view. Designing an effective backbone for pedestrian detection therefore remains a difficult task.
Disclosure of Invention
In view of the technical problem set forth above, a pedestrian detection method based on a convolutional neural network and a dual-attention mechanism is provided. The invention mainly relates to a pedestrian detection method based on a convolutional neural network and a dual-attention mechanism, characterized by comprising the following steps:
step S1: input is from Caltech[1]Data sets and CityPersons[2]An image of the data set;
step S2: a convolutional neural network based on a double-attention machine system is used as a main network to extract image features, and a detection part classifies and regresses the features;
step S3: framing the pedestrian in a frame mode;
further, the mathematical modeling process of the convolutional neural network based on the dual-attention mechanism mainly comprises the following steps:
step S21: given the output of the residual blockDefining as an original characteristic graph; the CAM and SAM are sequentially deduced to obtain a 1D channel attention diagram and a 2D space attention diagram; the raw feature map is sequentially re-labeled by two attention figures as:
wherein the content of the first and second substances,representing addition per pixel, MCRepresenting a channel attention map, MCRepresenting a spatial attention map; fCRepresents passing through MCCalibrated characteristic map, FSRepresents passing through MSA calibrated characteristic diagram;
step S22: compressing a 2D feature map into real numbers along a spatial axis direction of the original feature map; aggregating primitives by using global pooling operationsCharacteristic diagramTo obtain a channel attention map
Wherein GAP represents a global average pooling operation, fcRepresenting the original feature map F, ucA c-th real number representing the channel profile U;
step S23: will be provided withInputting the data into two full-connection layers, and obtaining a final channel attention feature map through a sigmoid activation function
MC=σ(W2(W1U))(12);
Wherein, the role of the activation function ReLU is expressed, the sigma is expressed as sigmoid activation, and W is expressed as a scaling parameter in the full connection layer, comprisingAndr is the compression ratio set to 16;
step S24: using MCFor input characteristic diagramCarrying out feature recalibration; firstly, M isCThe attention map, which is broadcast to the same dimension as F, is shown asThen M 'is processed by pixel addition operation'CBroadcasting the characteristic diagram F to obtain a calibrated characteristic diagram FC:
f′c=Fadd(m′c,fc)=m′c+fc.(13);
Wherein, FaddDenotes addition by channel, m'cIs a characteristic map M'CThe c channel of (f)cIs the c-th channel of the original feature map F; f'cIs represented by FCOf the c channel, m'cRepresents M'CGlobal information of the c-th channel;
step S25: compressing the 3D feature map into 2D feature channels along a channel axis direction of the feature map in order to compute an effective channel attention map; characteristic diagram after calibration of given channel attention diagramFeature map acquisition using average pooling operationsComprises the following steps:
where AP denotes average pooling operation, fijIs f'cThe pixel value of the middle ij point, C represents the number of characteristic channels, vijA pixel value representing an ij point in the feature map V;
step S26: convolve the feature map V with a convolutional layer of kernel size 7×7 and stride 1, then apply the sigmoid activation function to obtain the spatial attention map M_S:
M_S = σ(f^(7×7)(V)) (15);
where σ denotes the sigmoid activation function and f^(7×7) denotes a convolution operation with a 7×7 kernel;
step S27: using MSRecalibration profile FCFirst step MSIs broadcast as sum FCA feature map with the same dimension is expressed as
f′s=Fadd(m′s,fs)=m′s+fs. (16);
FaddRepresents addition by space, m'sIs M'SThe s channel of (a), (b), f)sIs FCOf s channel, f'sIs FSThe s channel of (1).
Further, the dual attention modules are organized serially (CAM followed by SAM), additive operations are used when broadcasting the attention maps, and the dual-attention mechanism is embedded in the convolutional layers of ResNet-50.
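The serial CAM→SAM pipeline with additive broadcasting described above can be sketched end to end in plain NumPy. This is a minimal illustration only, not the patent's trained implementation: the weights W1, W2 and the 7×7 kernel are random placeholders, and the shapes are toy-sized.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """CAM: global-average-pool each channel, two FC layers, sigmoid -> (C,1,1)."""
    C = F.shape[0]
    U = F.mean(axis=(1, 2))                          # u_c = GAP(f_c)
    M_C = sigmoid(W2 @ np.maximum(W1 @ U, 0.0))      # sigma(W2 . relu(W1 . U))
    return M_C.reshape(C, 1, 1)

def spatial_attention(F_C, kernel):
    """SAM: average-pool along channels, 7x7 conv (stride 1), sigmoid -> (1,H,W)."""
    V = F_C.mean(axis=0)                             # v_ij = (1/C) sum_c f'_c(i,j)
    k = kernel.shape[0]
    Vp = np.pad(V, k // 2)                           # zero 'same' padding (assumed)
    H, W = V.shape
    conv = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            conv[i, j] = np.sum(Vp[i:i + k, j:j + k] * kernel)
    return sigmoid(conv)[None, :, :]

def dual_attention(F, W1, W2, kernel):
    """F_C = M_C (+) F, then F_S = M_S (+) F_C: serial order, additive broadcast."""
    F_C = channel_attention(F, W1, W2) + F
    F_S = spatial_attention(F_C, kernel) + F_C
    return F_S

rng = np.random.default_rng(0)
C, H, W, r = 8, 6, 6, 4                              # toy sizes; the patent uses r = 16
F = rng.standard_normal((C, H, W))
out = dual_attention(F, rng.standard_normal((C // r, C)),
                     rng.standard_normal((C, C // r)),
                     rng.standard_normal((7, 7)))
```

The recalibrated output keeps the input's shape, so the module can drop into a residual block without changing the surrounding architecture.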
Compared with the prior art, the invention has the following advantages:
1. The dual-attention modeling method is lightweight; it can not only model the relationship between feature channels, but also improve the expressive power of the feature map at the pixel level.
2. A single-stage pedestrian detector, CSANet, based on the dual-attention mechanism is constructed, and the factors influencing its performance are further analyzed through experiments.
3. While maintaining computational efficiency, CSANet achieves state-of-the-art performance on the Caltech benchmark and competitive performance on the CityPersons benchmark.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 shows the overall architecture of CSANet according to the present invention. It mainly comprises two parts: a backbone network module and a detection head. An example of the dual attention module embedded in ResNet-50 is shown in the dashed box; it combines a Channel Attention Module (CAM) and a Spatial Attention Module (SAM) in sequence.
FIG. 2 shows the network structures of the channel attention module and the spatial attention module of the present invention. Feature maps are annotated with their dimensions, e.g. H×W×C for height H, width W and C channels; ⊕ denotes a broadcast operation that adds element by element.
FIG. 3 shows the comparison of the CSANet of the present invention with other advanced pedestrian detector models at the threshold IoU = 0.5, shown as FPPI curves.
FIG. 4 shows the comparison of the CSANet of the present invention with other advanced pedestrian detector models at the threshold IoU = 0.75, shown as FPPI curves.
FIG. 5 shows a model visualization of the present invention, comparing the visualization with the dual attention model against the visualization without it.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in the figures, the invention provides a pedestrian detection method based on a convolutional neural network and a dual-attention mechanism, characterized by comprising the following steps:
S1: input images from the Caltech [1] and CityPersons [2] datasets;
S2: extract image features with the convolutional neural network based on the dual-attention mechanism as the backbone network, and classify and regress the features in the detection head;
S3: mark out each pedestrian with a bounding box.
The overall framework of CSANet is shown in Fig. 1; the backbone network is a ResNet-50 with embedded dual attention modules. The detection head module mainly comprises three convolutional layers which predict the center position, the scale and the offset of each pedestrian, respectively. ResNet-50 is divided into 5 stages, and the output feature maps of stages 2 to 5 are defined as p2, p3, p4 and p5, downsampled by factors of 4, 8, 16 and 16, respectively. The low-level feature maps provide more accurate location information, while the deeper feature maps contain more semantic information. The multi-scale feature maps of the stages are simply concatenated to obtain a fused feature map. Before fusion, the resolutions of the stage outputs are unified using a deconvolution operation. In general, shallow features are generic, while the semantic information expressed by each channel of the deep features is category-specific.
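The multi-scale fusion step can be sketched as follows. This is an illustrative NumPy fragment only: nearest-neighbour upsampling stands in for the learned deconvolution, and the channel counts are made up for the example.

```python
import numpy as np

def upsample_nn(x, factor):
    """Nearest-neighbour upsampling as a stand-in for the deconvolution layer."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_stages(p3, p4, p5):
    """Unify stage resolutions (p4, p5 have stride 16 vs. p3's stride 8),
    then concatenate the feature maps along the channel axis."""
    return np.concatenate([p3, upsample_nn(p4, 2), upsample_nn(p5, 2)], axis=0)

p3 = np.zeros((4, 8, 8))          # (channels, height, width) at stride 8
p4 = np.zeros((8, 4, 4))          # stride 16
p5 = np.zeros((8, 4, 4))          # stride 16
fused = fuse_stages(p3, p4, p5)   # -> (20, 8, 8)
```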
Taking the third residual block of stage 5 as an example, the process by which the dual attention network broadcasts the attention maps is as follows:
given the output of the residual blockIt is defined as the original feature map. The CAM and SAM are derived in turn to derive a 1D channel attention map and a 2D spatial attention map. The original features are sequentially re-scaled by two attention figures. The calculation process of the two calibration feature maps can be summarized as follows:
here, theRepresenting addition per pixel, MCIs a channel attention map, MCIs a spatial attention map. FCIs passing through MCCalibrated characteristic map, FSIs passing through MSAnd (5) calibrating the characteristic diagram.
Fig. 2(a) shows the network structure of the channel attention module. To compute an effective channel attention map, each 2D feature map is compressed into a real number along the spatial axes of the feature map. The original feature map F is first aggregated using a global pooling operation to obtain the channel feature map U ∈ R^(C×1×1). The whole calculation is:
u_c = GAP(f_c) = (1/(H×W))·Σ_{i=1..H} Σ_{j=1..W} f_c(i,j), (19)
Here GAP denotes the global average pooling operation, f_c denotes the c-th channel of the original feature map F, and u_c denotes the c-th real number of the channel feature map U.
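The squeeze step above is simply a spatial mean taken per channel; a small NumPy check (illustrative values only):

```python
import numpy as np

def global_average_pool(F):
    """u_c = (1/(H*W)) * sum_ij f_c(i, j): one real number per channel."""
    C, H, W = F.shape
    return F.reshape(C, H * W).mean(axis=1)

F = np.arange(24, dtype=float).reshape(2, 3, 4)   # 2 channels of size 3x4
U = global_average_pool(F)                        # U[0] = mean(0..11), U[1] = mean(12..23)
```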
U is then input into two fully connected layers, and the final channel attention map M_C is obtained through a sigmoid activation function. The whole calculation is expressed as:
M_C = σ(W_2·δ(W_1·U)), (20)
Here the two fully connected layers serve to better fit the complex correlation between channels; δ represents the ReLU activation function, σ represents the sigmoid activation, and W represents the scaling parameters in the fully connected layers, including W_1 ∈ R^((C/r)×C) and W_2 ∈ R^(C×(C/r)); r is the compression ratio, set to 16.
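The excitation step maps the C-dimensional descriptor through a bottleneck of width C/r and back; a sketch with random placeholder weights (the real W1 and W2 are learned during training):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_excitation(U, W1, W2):
    """M_C = sigmoid(W2 . relu(W1 . U)); W1 is (C/r, C), W2 is (C, C/r)."""
    return sigmoid(W2 @ np.maximum(W1 @ U, 0.0))

C, r = 16, 16                              # compression ratio r = 16 as in the text
rng = np.random.default_rng(1)
W1 = rng.standard_normal((C // r, C))      # squeeze to C/r dimensions
W2 = rng.standard_normal((C, C // r))      # expand back to C dimensions
M_C = channel_excitation(rng.standard_normal(C), W1, W2)
```

The sigmoid keeps every attention weight strictly between 0 and 1, one per channel.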
Finally, M_C is used to recalibrate the input feature map F. First, M_C is broadcast to an attention map M′_C of the same dimension as F. Then M′_C is added to the feature map F by a pixel-wise addition operation to obtain the calibrated feature map F_C. The whole calculation is:
f′_c = F_add(m′_c, f_c) = m′_c + f_c. (21)
Here F_add denotes channel-wise addition, m′_c is the c-th channel of the attention map M′_C, and f_c is the c-th channel of the original feature map F. f′_c is the c-th channel of F_C, and m′_c carries the global information of the c-th channel of M′_C.
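The broadcast-and-add recalibration amounts to one NumPy broadcast (illustrative values only):

```python
import numpy as np

def recalibrate_channels(F, M_C):
    """f'_c = m'_c + f_c: the (C,) attention vector is broadcast over HxW, then added."""
    return M_C.reshape(-1, 1, 1) + F

F = np.zeros((2, 2, 2))              # toy original feature map
M_C = np.array([0.25, 0.75])         # per-channel attention weights
F_C = recalibrate_channels(F, M_C)   # channel 0 shifted by 0.25, channel 1 by 0.75
```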
Fig. 2(b) shows the network structure of the spatial attention module. To compute an effective spatial attention map, the 3D feature map is compressed into a single 2D channel along the channel axis. Given the feature map F_C calibrated by the channel attention map, the feature map V is obtained using an average pooling operation:
v_ij = AP(f_ij) = (1/C)·Σ_{c=1..C} f′_c(i,j), (22)
where AP denotes the average pooling operation, f′_c(i,j) is the pixel value of f′_c at point (i,j), C denotes the number of feature channels, and v_ij denotes the pixel value at point (i,j) of the feature map V.
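The channel-axis squeeze is a mean over C at each spatial location; a small check (illustrative values only):

```python
import numpy as np

def channel_average_pool(F_C):
    """v_ij = (1/C) * sum_c f'_c(i, j): collapse (C, H, W) down to (H, W)."""
    return F_C.mean(axis=0)

F_C = np.stack([np.full((2, 2), 1.0),   # two constant channels
                np.full((2, 2), 3.0)])
V = channel_average_pool(F_C)           # every entry is the mean of 1.0 and 3.0
```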
Then a convolutional layer with kernel size 7×7 and stride 1 convolves the feature map V, and the sigmoid activation function is applied to obtain the spatial attention map M_S:
M_S = σ(f^(7×7)(V)), (23)
where σ denotes the sigmoid activation function and f^(7×7) denotes a convolution operation with a 7×7 kernel.
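A naive direct-loop 7×7 convolution followed by sigmoid reproduces this step. Note the zero 'same' padding and the kernel values are assumptions for the sketch; the patent does not state the padding scheme, and the real kernel is learned:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention_map(V, kernel):
    """M_S = sigmoid(f_7x7(V)): 7x7 convolution with stride 1, then sigmoid."""
    k = kernel.shape[0]
    Vp = np.pad(V, k // 2)                 # zero 'same' padding (assumed)
    H, W = V.shape
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(Vp[i:i + k, j:j + k] * kernel)
    return sigmoid(out)

# With an all-zero kernel the pre-activation is 0 everywhere, so M_S is sigmoid(0).
M_S = spatial_attention_map(np.random.default_rng(2).standard_normal((8, 8)),
                            np.zeros((7, 7)))
```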
Finally, M_S is used to recalibrate the feature map F_C. First, M_S is broadcast to an attention map M′_S of the same dimension as F_C, then added pixel-wise. The whole calculation is:
f′_s = F_add(m′_s, f_s) = m′_s + f_s. (24)
Here F_add denotes spatial addition, m′_s is the s-th channel of M′_S, and f_s is the s-th channel of F_C. f′_s is the s-th channel of F_S.
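Symmetrically to the channel case, the spatial map is broadcast across all channels and added (illustrative values only):

```python
import numpy as np

def recalibrate_spatial(F_C, M_S):
    """f'_s = m'_s + f_s: the (H, W) attention map is broadcast over channels, then added."""
    return M_S[None, :, :] + F_C

F_C = np.zeros((3, 2, 2))
M_S = np.array([[0.1, 0.2],
                [0.3, 0.4]])
F_S = recalibrate_spatial(F_C, M_S)   # every channel receives the same spatial shift
```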
The spatial attention map integrates the global context information of the feature map: it focuses on the global information of each channel, increases the effective receptive field, and allows the CNN to capture image information from a global perspective.
The channel attention module and the spatial attention module may be embedded in ResNet-50 in a parallel or serial fashion. The channel attention module focuses on important channels, while the spatial attention module focuses on important regions of the feature map. The right combination of the two modules maximizes the effectiveness of the attention mechanism.
Table 1 discusses the generation of the fused feature map in the ablation experiments. The Extraction methods column gives the extraction method of each feature map; e.g. stage2-5 indicates that the dual attention network is embedded in stages 2, 3, 4 and 5 of ResNet-50 [3]. The Channel column gives the number of channels of the backbone output feature map in the corresponding model. Parameters gives the number of parameters of the corresponding CSANet model. Test time is the time required to test one picture. Bold font indicates the best result in each column.
TABLE 1
Table 2 discusses the feature map fusion method in the ablation experiments. In the Feature maps column, p2, p3, p4 and p5 respectively denote the output feature maps of stages 2, 3, 4 and 5 of ResNet-50. The Channel column gives the number of channels of the backbone output feature map in the corresponding model. Parameters gives the number of parameters of the corresponding CSANet model. Test time is the time required to test one picture. Bold font indicates the best result in each column.
TABLE 2
Table 3 discusses additive versus multiplicative broadcasting in the ablation experiments. The Description column identifies the corresponding model. For example, p3p4p5+add denotes fusing the feature maps p3, p4 and p5 and broadcasting the attention maps additively; p3p4p5+multiply denotes fusing p3, p4 and p5 and broadcasting the attention maps multiplicatively. The Channel column gives the number of channels of the backbone output feature map in the corresponding model. Parameters gives the number of parameters of the corresponding CSANet model. Test time is the time required to test one picture. Bold font indicates the best result in each column.
TABLE 3
Table 4 discusses the connection of the channel and spatial attention modules in the ablation experiments. The Description column identifies the corresponding model. CAM+SAM means the Channel Attention Module (CAM) and the Spatial Attention Module (SAM) are connected in series, CAM first. CAM//SAM means CAM and SAM are connected in parallel. SAM+CAM means SAM and CAM are connected in series, SAM first. The Channel column gives the number of channels of the backbone output feature map in the corresponding model. Parameters gives the number of parameters of the corresponding CSANet model. Test time is the time required to test one picture. Bold font indicates the best result in each column.
TABLE 4
Table 5 compares the most advanced detectors on CityPersons at IoU = 0.5. The Hardware column gives the GPU device used for network training, and the Scale column gives the number of GPUs. Bold numbers indicate the best results.
TABLE 5
Example:
Experimental results and analysis
(1) Ablation experiment
As shown in Table 1, the stage3-5 embedding achieves the best performance of 3.88% MR^-2 at IoU = 0.5. At IoU = 0.75, the stage2-5 embedding improves performance by about 36% compared with the stage5 embedding. Notably, at IoU = 0.5, stage2-4 and stage5 perform comparably, with 4.28% MR^-2 and 4.27% MR^-2, respectively. However, at the IoU = 0.75 threshold, the difference between them grows large, 4.77% MR^-2. This comparison shows that the dual attention network is more favorable for regressing high-quality bounding boxes.
As can be seen from Table 2, models fusing only the low-level features are less accurate, but they have fewer parameters and run faster. As more feature maps are fused, the detection accuracy of the model improves. At IoU = 0.5, the model fusing only the shallower feature maps trails the fully fused model by a significant margin of about 47% and is the least accurate. At IoU = 0.75, fusing the deeper feature maps gives the best results. In general, deeper features help pedestrian detection, but they occupy more operating memory.
As can be seen from Table 3, models using additive broadcasting outperform models using multiplicative broadcasting. The MR^-2 difference between p3p4p5+add and p3p4p5+multiply is about 37%; in the second and third sets of experiments, the difference is about 17%. Furthermore, the broadcasting method of the attention map hardly affects the test time; in practice, the runtime is mainly determined by the model parameters.
In fact, multiplication is computationally more complex than addition. Although multiplication enhances the useful information in the feature map, it also over-amplifies the effect of noise, and the multiplicative weighting operation unduly suppresses some contextual details, which is detrimental to locating pedestrians. Besides accuracy, the pedestrian detection task must also consider real-time constraints, and multiplication increases the running time of the network to some extent.
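The noise-suppression argument can be seen on a toy feature map. The numbers below are made up purely to illustrate the qualitative effect discussed above, not the measured MR^-2 gap:

```python
import numpy as np

F = np.array([[0.1, 2.0],        # one weak and one strong activation per row
              [0.1, 2.0]])
M = np.array([[0.9, 0.9],        # high attention weight on row 0
              [0.1, 0.1]])       # low attention weight on row 1

additive = M + F                 # shifts activations: the weak response survives (0.2)
multiplicative = M * F           # rescales: 0.1 * 0.1 = 0.01, nearly erased
```

Under multiplicative broadcasting, a weak response under low attention is crushed toward zero, whereas additive broadcasting merely shifts it, preserving contextual detail.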
As shown in Table 4, CAM+SAM denotes that the channel attention module and the spatial attention module are connected sequentially. The analysis shows that the sequential arrangements outperform the parallel arrangement. The CAM-first mode gives the best result, 3.88% MR^-2. CAM//SAM, in which the two modules are arranged in parallel, performs 0.27% MR^-2 worse than CAM+SAM. In the third mode, SAM+CAM, SAM comes first in the dual attention network, and this model performs worst, at 4.57% MR^-2.
(2) Visualization experiment
As shown in Fig. 5, the experimental results are visualized for the models. The Grad-CAM [4] visualization algorithm is used to qualitatively interpret the CSANet model, improving the interpretability of the CNN to a certain level. The algorithm derives a class activation map that can locate class-specific regions in the image; Grad-CAM mainly uses the gradients flowing into the last convolutional layer of the network to generate a heatmap that highlights the important pixels in the input image.
As can be seen from Fig. 5, the heatmap of the model with the dual attention network covers the pedestrian area better than that of the model without it. In other words, the dual attention network focuses better on the pixel information of the target area. The visualization results qualitatively show that the improved feature map strengthens the pixel-level expression of the target area.
(3) Comparative experiments with the state of the art
Fig. 3 (IoU = 0.5) and Fig. 4 (IoU = 0.75) compare the performance of advanced pedestrian detection algorithms on the Caltech dataset. The algorithms include DeepParts [5], MS-CNN [6], SA-FasterRCNN [7], RPN+BF [8], FasterRCNN+ATT [9], SDS-RCNN [10], AdaptFasterRCNN [2], CSP+City [11], CSANet (ours) and CSANet+City (ours).
As shown in Fig. 3, the model initialized from the CityPersons dataset performs best: compared with the current high-performance method CSP+City, CSANet+City achieves the best performance of 3.55% MR^-2. The CSANet model initialized with the ImageNet dataset reaches 3.88% MR^-2, a significant improvement over the baseline model. As shown in Fig. 4, the CSANet model also achieves a lower miss rate under the stricter threshold setting, which means the dual attention network also helps improve the quality of the bounding boxes.
Table 5 compares advanced pedestrian detectors on the CityPersons dataset. This set of experiments uses IoU = 0.5, and only a single NVIDIA GTX 1080Ti GPU with mini-batch = 2 was used when training the network. Table 5 shows that the CSANet detector achieves state-of-the-art performance of 7.25% on the Bare subset of CityPersons. On the Reasonable subset, CSANet is second only to the CSP [11] model trained with mini-batch = 8. The CSANet detector is comparable to ALFNet [12], and improves by about 2.6% compared with RepLoss [13]. In fact, within reasonable limits, a larger batch size makes the gradient descent direction more accurate.
Reference to the literature
[1] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," IEEE Trans. Pattern Anal. Mach. Intell. (PAMI), vol. 34, no. 4, pp. 743-761, Apr. 2012.
[2] S. Zhang, R. Benenson, and B. Schiele, "Citypersons: A diverse dataset for pedestrian detection," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3213-3221.
[3] K. He, X. Zhang, S. Ren, et al., "Deep residual learning for image recognition," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778.
[4] R. R. Selvaraju, M. Cogswell, et al., "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 618-626.
[5] Y. Tian, P. Luo, X. Wang, et al., "Deep learning strong parts for pedestrian detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1904-1912.
[6] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, "A unified multi-scale deep convolutional neural network for fast object detection," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 354-370.
[7] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan, "Scale-aware fast R-CNN for pedestrian detection," IEEE Trans. Multimedia, vol. 20, no. 4, pp. 985-996, Apr. 2017.
[8] L. Zhang, L. Lin, X. Liang, and K. He, "Is faster R-CNN doing well for pedestrian detection?," in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2016, pp. 443-457.
[9] S. Zhang, J. Yang, and B. Schiele, "Occluded pedestrian detection through guided attention in CNNs," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 6995-7003.
[10] G. Brazil, X. Yin, X. Liu, "Illuminating pedestrians via simultaneous detection & segmentation," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 4950-4959.
[11] W. Liu, S. Liao, W. Ren, W. Hu, and Y. Yu, "High-level semantic feature detection: A new perspective for pedestrian detection," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5187-5196.
[12] W. Liu, S. Liao, W. Hu, X. Liang, and X. Chen, "Learning efficient single-stage pedestrian detectors by asymptotic localization fitting," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 618-634.
[13] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen, "Repulsion loss: Detecting pedestrians in a crowd," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7774-7783.
[14] T. Song, L. Sun, D. Xie, H. Sun, and S. Pu, "Small-scale pedestrian detection based on topological line localization and temporal feature aggregation," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 536-551.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units may be a logical division; in actual implementation there may be another division: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, units, or modules, and may be electrical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (3)
1. A pedestrian detection method based on a convolutional neural network and a dual-attention mechanism, characterized by comprising the following steps:
S1: inputting images from the Caltech dataset and the CityPersons dataset;
S2: using a convolutional neural network based on the dual-attention mechanism as the backbone network to extract image features, and classifying and regressing the features in the detection head;
S3: marking the detected pedestrians with bounding boxes.
2. The convolutional neural network and dual-attention mechanism-based pedestrian detection method of claim 1, wherein:
the mathematical modeling process of the convolutional neural network based on the dual-attention mechanism mainly comprises the following steps:
s21: given the output of the residual blockDefining as an original characteristic graph; the CAM and SAM are sequentially deduced to obtain a 1D channel attention diagram and a 2D space attention diagram; the raw feature map is sequentially re-labeled by two attention figures as:
wherein the content of the first and second substances,representing addition per pixel, MCRepresenting a channel attention map, MCRepresenting a spatial attention map; fCRepresents passing through MCCalibrated characteristic map, FSRepresents passing through MSA calibrated characteristic diagram;
s22: compressing a 2D feature map into real numbers along a spatial axis direction of the original feature map; aggregating raw feature maps by using global pooling operationsTo obtain a channel attention map
Wherein GAP represents a global average pooling operation, fcRepresenting the original feature map F, ucA c-th real number representing the channel profile U;
s23: will be provided withInputting the data into two full-connection layers, and obtaining a final channel attention feature map through a sigmoid activation function
MC=σ(W2(W1U)) (4);
Wherein, the role of the activation function ReLU is expressed, the sigma is expressed as sigmoid activation, and W is expressed as a scaling parameter in the full connection layer, comprisingAndr is the compression ratio set to 16;
s24: using MCFor input characteristic diagramCarrying out feature recalibration; firstly, M isCThe attention map, which is broadcast to the same dimension as F, is shown asThen M 'is processed by pixel addition operation'CBroadcasting the characteristic diagram F to obtain a calibrated characteristic diagram FC:
fc′=Fadd(m′c,fc)=m′c+fc. (5);
Wherein, FaddDenotes addition by channel, m'cIs a characteristic map M'CThe c channel of (f)cIs the c-th channel of the original feature map F; f. ofc' means FCOf the c channel, m'cRepresents M'CGlobal information of the c-th channel;
S25: to compute an effective spatial attention map, the 3D feature map FC ∈ R^(C×H×W) is compressed into a 2D feature map along the channel axis; given the feature map FC calibrated by the channel attention map, an average pooling operation yields the feature map V ∈ R^(H×W):
vij = AP(fij) = (1/C) Σc f′c(i, j) (6)
wherein AP represents the average pooling operation, fij represents the pixel value at point (i, j) of f′c, C represents the number of feature channels, and vij represents the pixel value at point (i, j) of the feature map V;
S26: convolving the feature map V with a convolution layer of kernel size 7×7 and stride 1, and then applying the sigmoid activation function to obtain the spatial attention map MS ∈ R^(H×W):
MS=σ(f7×7(V)) (7);
wherein σ represents the sigmoid activation function, and f7×7 represents a convolution operation with a 7×7 kernel;
s27: using MSRecalibration profile FCFirst step MSIs broadcast as sum FCA feature map with the same dimension is expressed as
fs′=Fadd(m′s,fs)=m′s+fs. (8);
FaddRepresents addition by space, m'sIs M'SThe s channel of (a), (b), f)sIs FCThe s channel of (a), (b), f)s' is FSThe s channel of (1).
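The pipeline of steps S21-S27 can be sketched end to end in NumPy. This is an illustrative reconstruction from equations (1)-(8), not the patentee's implementation: the toy sizes (C=8, H=W=8), the reduction ratio r=2 (the claim sets r=16), and the random weights W1, W2 and 7×7 kernel are all assumptions standing in for learned parameters. Note the additive broadcast (⊕), as claimed, rather than the multiplicative gating more common in attention modules.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    # F: (C, H, W). Eq. (3): global average pooling per channel -> U in R^C
    U = F.mean(axis=(1, 2))
    # Eq. (4): FC -> ReLU -> FC -> sigmoid gives the 1D channel map MC
    M_c = sigmoid(W2 @ np.maximum(W1 @ U, 0.0))
    # Eqs. (1)/(5): broadcast MC to (C, H, W) and add per pixel
    return F + M_c[:, None, None]

def spatial_attention(F_c, kernel):
    # Eq. (6): average pooling along the channel axis -> V in R^(H x W)
    V = F_c.mean(axis=0)
    # Eq. (7): 7x7 convolution with same padding, then sigmoid -> MS
    k = kernel.shape[0]
    Vp = np.pad(V, k // 2)
    conv = np.empty_like(V)
    for i in range(V.shape[0]):
        for j in range(V.shape[1]):
            conv[i, j] = np.sum(Vp[i:i + k, j:j + k] * kernel)
    M_s = sigmoid(conv)
    # Eqs. (2)/(8): broadcast MS to (C, H, W) and add per pixel
    return F_c + M_s[None, :, :]

rng = np.random.default_rng(0)
C, H, W, r = 8, 8, 8, 2                      # r = 16 in the claim; r = 2 fits the toy C
F = rng.standard_normal((C, H, W))           # "original feature map" from a residual block
W1 = rng.standard_normal((C // r, C)) * 0.1  # compression FC layer
W2 = rng.standard_normal((C, C // r)) * 0.1  # expansion FC layer
kernel = rng.standard_normal((7, 7)) * 0.01  # stand-in for the learned 7x7 conv

F_s = spatial_attention(channel_attention(F, W1, W2), kernel)
print(F_s.shape)   # the recalibrated map FS keeps the input dimensions
```

Because the recalibration is additive, each channel of FC differs from F by the same sigmoid-bounded constant, which is the "global information" m′c described in step S24.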
3. The convolutional neural network and dual-attention mechanism-based pedestrian detection method of claim 1, further characterized by:
the dual attention modules are organized in a serial order; additive operations are used in the attention-map broadcasting process; and the dual-attention mechanism is embedded into multiple convolutional layers of the ResNet-50.
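Claim 3's serial organization can be sketched as follows. The `backbone` is a hypothetical stand-in for ResNet-50, and the attention maps here are deliberately simplified (pooling plus sigmoid, with the learned FC and 7×7 conv layers of claim 2 omitted); only the claimed structure is illustrated: channel attention first, then spatial attention, both broadcast additively, one module appended after each stage.

```python
import numpy as np

def dual_attention(F):
    # serial order per claim 3: CAM first, then SAM, both broadcast with
    # addition; the learned weights of claim 2 are omitted in this toy
    M_c = 1.0 / (1.0 + np.exp(-F.mean(axis=(1, 2))))   # simplified channel map
    F_c = F + M_c[:, None, None]
    M_s = 1.0 / (1.0 + np.exp(-F_c.mean(axis=0)))      # simplified spatial map
    return F_c + M_s[None, :, :]

def backbone(x, stages):
    # hypothetical ResNet-50 stand-in: a dual-attention module is embedded
    # after every convolutional stage, as the claim describes
    for stage in stages:
        x = dual_attention(stage(x))
    return x

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 16, 16))
stages = [lambda t: t * 0.5 + 0.1] * 3   # placeholder "conv" stages
y = backbone(x, stages)
print(y.shape)
```

Swapping `dual_attention` in after each stage leaves the tensor shape unchanged, which is what lets the module be dropped into multiple layers of an existing backbone without altering the detection head.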
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010506077.1A CN111680619A (en) | 2020-06-05 | 2020-06-05 | Pedestrian detection method based on convolutional neural network and double-attention machine mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111680619A true CN111680619A (en) | 2020-09-18 |
Family
ID=72434993
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010506077.1A Pending CN111680619A (en) | 2020-06-05 | 2020-06-05 | Pedestrian detection method based on convolutional neural network and double-attention machine mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111680619A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110135243A (en) * | 2019-04-02 | 2019-08-16 | 上海交通大学 | A kind of pedestrian detection method and system based on two-stage attention mechanism |
CN110675406A (en) * | 2019-09-16 | 2020-01-10 | 南京信息工程大学 | CT image kidney segmentation algorithm based on residual double-attention depth network |
CN110991362A (en) * | 2019-12-06 | 2020-04-10 | 西安电子科技大学 | Pedestrian detection model based on attention mechanism |
CN111160628A (en) * | 2019-12-13 | 2020-05-15 | 重庆邮电大学 | Air pollutant concentration prediction method based on CNN and double-attention seq2seq |
Non-Patent Citations (1)
Title |
---|
YUNBO ZHANG et al.: "CSANet: Channel and Spatial Mixed Attention CNN for Pedestrian Detection" *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112560720A (en) * | 2020-12-21 | 2021-03-26 | 奥比中光科技集团股份有限公司 | Pedestrian identification method and system |
CN112800964A (en) * | 2021-01-27 | 2021-05-14 | 中国人民解放军战略支援部队信息工程大学 | Remote sensing image target detection method and system based on multi-module fusion |
CN112800964B (en) * | 2021-01-27 | 2021-10-22 | 中国人民解放军战略支援部队信息工程大学 | Remote sensing image target detection method and system based on multi-module fusion |
CN113450366A (en) * | 2021-07-16 | 2021-09-28 | 桂林电子科技大学 | AdaptGAN-based low-illumination semantic segmentation method |
CN113450366B (en) * | 2021-07-16 | 2022-08-30 | 桂林电子科技大学 | AdaptGAN-based low-illumination semantic segmentation method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200918 |