CN118015539A - Improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP - Google Patents

Improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP

Info

Publication number
CN118015539A
CN118015539A (application CN202410137920.1A)
Authority
CN
China
Prior art keywords
module
gsconv
yolov
detection
gscsp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410137920.1A
Other languages
Chinese (zh)
Inventor
曾岳
张千龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinling Institute of Technology
Original Assignee
Jinling Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinling Institute of Technology
Priority to CN202410137920.1A
Publication of CN118015539A
Legal status: Pending

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP, which comprises the following steps: first, the public WiderPerson dataset is used and divided into a training set, a test set, and a validation set; next, the original YOLOv8 algorithm is improved and optimized into a GSConv+VOV-GSCSP detection method; an improved YOLOv8 network model is then built and the target detection model is trained on the dataset to obtain a dense pedestrian detection model; finally, a real-time video stream with dense pedestrian flow is fed into the model for detection, and the network model is optimized according to the analysis and processing of the detection results by introducing an attention mechanism and dropout regularization, reducing overfitting and network complexity and improving the accuracy and robustness of pedestrian detection.

Description

Improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP
Technical Field
The invention relates to the technical field of computer vision, in particular to an improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP.
Background
Pedestrian detection, which detects objects using visual models, is one of the most important research directions in the field of computer vision (Computer Vision, CV). With the rapid development of computer vision technology and the wide application of deep learning algorithms, pedestrian detection has received extensive attention and research as an important computer vision task. It plays an important role in fields such as intelligent transportation, video monitoring, and automatic driving, providing real-time pedestrian position information and basic data for applications such as traffic safety, behavior analysis, and environment sensing.
Prior to the advent of deep learning, conventional pedestrian detection methods were based primarily on image processing and machine learning techniques in computer vision. They include the following: 1. Methods based on feature extraction and a classifier: features of the pedestrian, such as color, texture, and shape, are first extracted from the image; a classifier such as a Support Vector Machine (SVM) or AdaBoost then classifies the extracted features to judge whether they belong to a pedestrian.
2. Histogram of Oriented Gradients (HOG): HOG is a feature descriptor based on local gradient-direction statistics. It characterizes a pedestrian by computing a histogram of gradient directions over local image regions, and is typically combined with a Support Vector Machine (SVM) classifier for pedestrian detection.
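For illustration, a minimal sketch of this classical HOG + SVM pipeline using OpenCV's built-in pedestrian detector is given below; the image path and the window-stride and scale parameters are assumptions for the example, not part of the invention.

import cv2

# Load a test image (hypothetical path).
img = cv2.imread("street.jpg")

# HOG descriptor paired with OpenCV's pre-trained linear SVM for pedestrians.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

# Slide the detection window over an image pyramid; each window is scored by
# the SVM, returning pedestrian bounding boxes and their confidence weights.
boxes, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)
for (x, y, w, h) in boxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)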
3. Scale-Invariant Feature Transform (SIFT): SIFT is an algorithm based on scale-invariant features that detects key points in an image and extracts local feature descriptors describing those key points. In pedestrian detection, SIFT can be used for keypoint detection and feature matching.
4. Methods based on statistical shape models: detection is typically implemented by building a statistical model of pedestrian shape. These models may be based on Principal Component Analysis (PCA) or other shape-description methods; by comparing how well a shape in the image matches the model, the presence of a pedestrian can be judged. 5. Methods based on probabilistic graphical models: these use contextual information and prior knowledge in the image for inference and detection. Common probabilistic graphical models include the Markov Random Field (MRF) and the Conditional Random Field (CRF); they model the pixels in the image and infer the pedestrian's location and bounding box.
However, these conventional statistical methods have certain limitations in complex scenes and under occlusion and posture changes. They typically require manual feature extraction and handle complex scenes and occlusions poorly. With the rise of deep learning, learning-based methods can automatically extract high-level semantic information from images by learning feature representations and patterns from large amounts of data, improving the accuracy and robustness of pedestrian detection. Deep learning methods such as Convolutional Neural Networks (CNN) and target detection networks (e.g., SSD, YOLO) have become the dominant approaches in the pedestrian detection field; they learn image features end to end and classify pedestrians, offering better performance and adaptability.
Target detection methods can be divided into two general categories: single-stage (One-Stage) and two-stage (Two-Stage) methods. A single-stage method treats target detection as a regression problem and predicts the category and position of targets directly from the input image. YOLO (You Only Look Once) is a representative method: it divides the image into a grid and predicts the category and location of objects in each grid cell. This approach detects faster but may be less accurate for small or densely packed targets. A two-stage method divides the detection task into two independent stages, candidate-region generation followed by target classification and precise localization; representative methods include RCNN (Region-based Convolutional Neural Networks) and Faster R-CNN. These methods first generate candidate regions and then classify and precisely localize them; two-stage methods generally achieve higher detection accuracy but are somewhat slower than single-stage methods. With the development of deep learning and the wide application of convolutional neural networks, many pedestrian detection techniques and models have emerged, such as Faster R-CNN, YOLO, and SSD. Each has its own advantages in accuracy, speed, and adaptability, and researchers can select a suitable method for a pedestrian detection task according to the specific application scenario and requirements.
Disclosure of Invention
In order to solve the above technical problems, the invention provides an improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP, which can effectively improve the speed, accuracy, and adaptability of detection.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
The improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP is characterized in that it comprises the following steps:
S1: using the public WiderPerson dataset, which is divided into a training set, a test set, and a validation set;
S2: YOLOv8 comprises a backbone network, C2f modules, a detection head, and a neck module; the improvement to YOLOv8 is specifically: in the Backbone layer, two standard convolutions of YOLOv8 are replaced with GSConv, and in the Neck layer all C2f modules are replaced with VoV-GSCSP;
S3: a CBAM attention mechanism is added to the SPPF module of the Backbone layer;
s4: dropout regularization is introduced, and the mathematical formula of Dropout regularization is as follows:
r = Bernoulli(p, N) (1)
h = r ⊙ a (2)
where r is a binary random vector of size N, sampled element-wise from a Bernoulli distribution with probability p, and a is the original input vector, i.e., the activation values passed from the previous layer;
S5: constructing an improved YOLOv detection method network model, training a target detection model by utilizing a data set, obtaining a dense pedestrian detection model, selecting a real-time video stream access model with dense pedestrian flows for detection, analyzing and processing results according to the detection results, and optimizing the network model by referring to a CBAM attention mechanism and dropout regularization;
S6: and obtaining index data and performing experimental comparison.
As a preferable technical scheme of the invention: in step S1, the WiderPerson dataset comprises 13,382 images: 8,000 for training, 1,000 for validation, and 4,382 for testing.
As a preferable technical scheme of the invention: in step S2,
Backbone network: YOLOv8 uses Darknet-53 as the backbone network; Darknet-53 is a network comprising 53 convolutional layers used to extract image features;
Detection head: the YOLOv8 detection head uses a combined loss function, with binary cross-entropy for object-class classification and CIoU for bounding-box regression;
Neck module: YOLOv8 uses PAN-FPN as the neck module for feature fusion, exploiting feature-layer information at different scales; it includes multiple C2f modules and a final decoupled head structure.
As a preferable technical scheme of the invention: in step S2,
First, the lightweight convolution GSConv is used in place of the standard Conv; then, the GSBottleneck is introduced on the basis of GSConv; next, a one-shot aggregation method is used to design the cross-stage partial network module VoV-GSCSP; finally, VoV-GSCSP replaces all C2f modules of the Neck layer.
As a preferable technical scheme of the invention: in step S3, the CBAM attention mechanism consists of two sub-modules, namely a channel attention module and a spatial attention module;
The channel attention module adaptively adjusts the importance of each channel by learning an attention mechanism over the channel dimension of the feature map, and comprises two key steps:
S31: global average pooling: a global average pooling operation is performed on each channel, compressing the spatial dimensions of the feature map into a vector of channel-dimension size;
S32: fully connected layer: the weights of the channels are learned by a fully connected layer, which maps the compressed channel vector to an activation value encoding the importance of each channel in the attention mechanism;
The spatial attention module adaptively adjusts the importance of each spatial position by learning an attention mechanism over the spatial dimension of the feature map, and comprises two key steps:
S311: max pooling and average pooling: max pooling and average pooling are performed on the feature map along the channel axis, yielding two attention maps that capture the most salient features and the evenly distributed features, respectively;
S312: two-dimensional convolution and sigmoid activation: a two-dimensional convolution is applied to the two attention maps, and the result is mapped into the range [0,1] with a sigmoid activation function, giving a spatial attention weight map that represents the importance of each spatial location in the attention mechanism;
Finally, the CBAM attention mechanism applies the attention weights to the original feature map by multiplying the outputs of the channel attention module and the spatial attention module with it element by element, thereby obtaining an enhanced feature representation.
As a preferable technical scheme of the invention: in step S6, comparison experiments are performed on the WiderPerson dataset. First, complex scene pictures from different scenes are selected, and the detection effects of the proposed algorithm and the YOLOv8 algorithm in real scenes are compared; the proposed algorithm begins to converge after about 120 iterations.
Parameter settings: batch size = 8 and epochs = 200.
The evaluation indices comprise mAP, average precision AP, precision P, and recall R.
The formulas for precision P and recall R are shown in (3) and (4): P = TP/(TP+FP) (3); R = TP/(TP+FN) (4).
Here TP is the number of correctly predicted bounding boxes, FP is the number of samples erroneously judged positive, and FN is the number of undetected targets; the average precision AP is the model's average precision, mAP is the mean of the per-class APs, and K is the number of categories. The formulas for AP and mAP are shown in (5) and (6): AP = ∫0^1 P(R) dR (5); mAP = (1/K) Σ_{k=1..K} AP_k (6).
Compared with the prior art, the invention has the beneficial effects that:
The invention provides an improved lightweight detection method based on YOLOv8. The algorithm adds a CBAM attention module on the basis of the YOLOv8 model, adopts the GSConv lightweight convolution to replace the original convolution module, adopts the VOV-GSCSP structure to replace the C2f structure in the neck layer, and adds a decoupled structure in the detection layer; the lightweight network structure improves detection speed. Experimental results show that the accuracy of the improved method is greatly improved, with the recall reaching 0.8212 and the mAP reaching 0.87344. The research results can greatly improve the detection accuracy for occluded and blurred pedestrians, and the lightweight network structure of the improved algorithm also increases detection speed, further enhancing the practicality and efficiency of the algorithm.
Drawings
FIG. 1 is a block diagram of GSConv;
FIG. 2 is a block diagram of GSbottleneck modules;
FIG. 3 is a block diagram of VoV-GSCSP;
FIG. 4 is a block diagram of a channel attention module;
FIG. 5 is a block diagram of a spatial attention module;
FIG. 6 is a graph comparing the average accuracy of the two algorithms;
FIG. 7 is a table of the best mAP and model size for the two types of models;
FIG. 8 is a table of the test mAP and inference speed for the two types of models;
FIG. 9 is a picture of scene 1;
FIG. 10 is a picture of scene 2;
FIG. 11 is a comparison of face detection in scene 1;
FIG. 12 is a comparison of face detection in scene 2.
Detailed Description
The invention is described in further detail below with reference to the attached drawings and detailed description:
As shown in FIGS. 1-12, the invention provides an improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP, comprising the following steps:
S1: using the public WiderPerson dataset, which is divided into a training set, a test set, and a validation set;
S2: YOLOv8 comprises a backbone network, C2f modules, a detection head, and a neck module; the improvement to YOLOv8 is specifically: in the Backbone layer, two standard convolutions of YOLOv8 are replaced with GSConv, and in the Neck layer all C2f modules are replaced with VoV-GSCSP;
S3: a CBAM attention mechanism is added to the SPPF module of the Backbone layer;
s4: dropout regularization is introduced, and the mathematical formula of Dropout regularization is as follows:
r = Bernoulli(p, N) (1)
h = r ⊙ a (2)
where r is a binary random vector of size N, sampled element-wise from a Bernoulli distribution with probability p, and a is the original input vector, i.e., the activation values passed from the previous layer;
S5: constructing an improved YOLOv detection method network model, training a target detection model by utilizing a data set, obtaining a dense pedestrian detection model, selecting a real-time video stream access model with dense pedestrian flows for detection, analyzing and processing results according to the detection results, and optimizing the network model by referring to a CBAM attention mechanism and dropout regularization;
S6: and obtaining index data and performing experimental comparison.
In step S1, the WiderPerson dataset comprises 13,382 images: 8,000 for training, 1,000 for validation, and 4,382 for testing.
In step S2,
Backbone network: YOLOv8 uses Darknet-53 as the backbone network; Darknet-53 is a network comprising 53 convolutional layers used to extract image features;
Detection head: the YOLOv8 detection head uses a combined loss function, with binary cross-entropy for object-class classification and CIoU for bounding-box regression;
Neck module: YOLOv8 uses PAN-FPN as the neck module for feature fusion, exploiting feature-layer information at different scales; it includes multiple C2f modules and a final decoupled head structure.
In step S2,
First, the lightweight convolution GSConv is used in place of the standard Conv; then, the GSBottleneck is introduced on the basis of GSConv; next, a one-shot aggregation method is used to design the cross-stage partial network module VoV-GSCSP; finally, VoV-GSCSP replaces all C2f modules of the Neck layer.
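As a reading aid, the following PyTorch sketch shows a GSConv block in the spirit of the slim-neck design: a standard convolution produces half of the output channels, a depthwise convolution processes that half, and a channel shuffle mixes the two halves. The kernel sizes and channel split are illustrative assumptions, not the patent's exact configuration (c2 is assumed even).

import torch
import torch.nn as nn

class GSConv(nn.Module):
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        c_ = c2 // 2
        # dense branch: standard convolution producing c2/2 channels
        self.cv1 = nn.Sequential(
            nn.Conv2d(c1, c_, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())
        # cheap branch: 5x5 depthwise convolution over the dense output
        self.cv2 = nn.Sequential(
            nn.Conv2d(c_, c_, 5, 1, 2, groups=c_, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())

    def forward(self, x):
        x1 = self.cv1(x)
        x2 = torch.cat((x1, self.cv2(x1)), dim=1)
        # channel shuffle: interleave the dense and depthwise halves
        b, n, h, w = x2.shape
        return x2.view(b, 2, n // 2, h, w).transpose(1, 2).reshape(b, n, h, w)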
In step S3, the CBAM attention mechanism consists of two sub-modules, namely a channel attention module and a spatial attention module;
The channel attention module adaptively adjusts the importance of each channel by learning an attention mechanism over the channel dimension of the feature map, and comprises two key steps:
S31: global average pooling: a global average pooling operation is performed on each channel, compressing the spatial dimensions of the feature map into a vector of channel-dimension size;
S32: fully connected layer: the weights of the channels are learned by a fully connected layer, which maps the compressed channel vector to an activation value encoding the importance of each channel in the attention mechanism;
The spatial attention module adaptively adjusts the importance of each spatial position by learning an attention mechanism over the spatial dimension of the feature map, and comprises two key steps:
S311: max pooling and average pooling: max pooling and average pooling are performed on the feature map along the channel axis, yielding two attention maps that capture the most salient features and the evenly distributed features, respectively;
S312: two-dimensional convolution and sigmoid activation: a two-dimensional convolution is applied to the two attention maps, and the result is mapped into the range [0,1] with a sigmoid activation function, giving a spatial attention weight map that represents the importance of each spatial location in the attention mechanism;
Finally, the CBAM attention mechanism applies the attention weights to the original feature map by multiplying the outputs of the channel attention module and the spatial attention module with it element by element, thereby obtaining an enhanced feature representation.
In step S6, comparison experiments are performed on the WiderPerson dataset. First, complex scene pictures from different scenes are selected, and the detection effects of the proposed algorithm and the YOLOv8 algorithm in real scenes are compared; the proposed algorithm begins to converge after about 120 iterations.
Parameter settings: batch size = 8 and epochs = 200.
The evaluation indices comprise mAP, average precision AP, precision P, and recall R.
The formulas for precision P and recall R are shown in (3) and (4): P = TP/(TP+FP) (3); R = TP/(TP+FN) (4).
Here TP is the number of correctly predicted bounding boxes, FP is the number of samples erroneously judged positive, and FN is the number of undetected targets; the average precision AP is the model's average precision, mAP is the mean of the per-class APs, and K is the number of categories. The formulas for AP and mAP are shown in (5) and (6): AP = ∫0^1 P(R) dR (5); mAP = (1/K) Σ_{k=1..K} AP_k (6).
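The following sketch shows how the indices of formulas (3)-(6) can be computed from detection counts; the rectangular integration of the P-R curve is a simplification of standard mAP evaluation, used here only to make the formulas concrete.

def precision(tp, fp):                 # formula (3): P = TP / (TP + FP)
    return tp / (tp + fp)

def recall(tp, fn):                    # formula (4): R = TP / (TP + FN)
    return tp / (tp + fn)

def average_precision(points):
    # formula (5): AP as the area under the P-R curve, approximated by
    # rectangles over increasing recall; points is a list of (P, R) pairs
    ap, prev_r = 0.0, 0.0
    for p, r in sorted(points, key=lambda t: t[1]):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

def mean_average_precision(aps):       # formula (6): mAP = (1/K) * sum AP_k
    return sum(aps) / len(aps)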
The model structure of YOLOv8 mainly comprises a backbone network, a neck module (Neck), and a Head. The backbone network and Neck draw on the ELAN design idea of YOLOv7 and use the C2f structure to improve model performance. The Head adopts a decoupled-head structure that separates the classification and detection heads, and performs target detection in an Anchor-Free rather than Anchor-Based manner. In the Backbone of the YOLOv8 model, GSConv first replaces the standard convolution; the computational cost of GSConv is about 60%-70% of that of standard convolution (SC), while its contribution to the model's learning ability is no less than the latter's. In the Neck layer, the original structure is replaced with VOV-GSCSP, which uses GSConv as its basic convolution unit, making the Neck lightweight; Binary Cross Entropy Loss is used as the classification loss function. Box regression loss: the box regression loss of YOLOv8 consists of DFL Loss and CIoU Loss. First, the target bounding box is converted into distances to the top-left and bottom-right corners relative to the feature-map size, ensuring the distances do not exceed a given range. The positive-sample portion is then screened from the boxes predicted by the network and converted into the distribution form of the network output.
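The box-to-distance conversion described above can be sketched as follows, in the style of anchor-free YOLO heads; the function name, the grid-point tensor, and the reg_max clamp are illustrative assumptions.

import torch

def bbox_to_dist(anchor_points, target_boxes, reg_max=16):
    # target_boxes holds (x1, y1, x2, y2) in feature-map units;
    # anchor_points holds the (x, y) grid points assigned to each box
    x1y1, x2y2 = target_boxes.chunk(2, dim=-1)
    lt = anchor_points - x1y1          # distances to the top-left corner
    rb = x2y2 - anchor_points          # distances to the bottom-right corner
    # clamp so the distances stay within the range covered by DFL's bins
    return torch.cat((lt, rb), dim=-1).clamp_(0, reg_max - 0.01)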
The improved YOLOv8 first replaces the generic standard convolution module (Standard Conv) in the Backbone with GSConv, whose purpose is to keep the output of the convolution computation as close as possible to that of the standard convolution while reducing computational cost. GSConv combines the advantages of standard convolution and depthwise separable convolution, and using GSConv in the Neck of the model yields a higher return on computational cost and improves model performance. In the Neck layer, the original C2f structure is replaced by a VOV-GSCSP structure composed of GSConv: the VoV-GSCSP module consists of several GSBottlenecks, each containing two GSConv operations and one shortcut connection. The GSConv operations in a GSBottleneck increase the expressive power of the feature map, while the shortcut preserves the original features. By stacking multiple GSBottlenecks, the VoV-GSCSP module extracts richer feature information while keeping computational cost low; replacing C2f with VoV-GSCSP improves model performance and reduces the number of parameters, achieving both light weight and high performance.
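Building on the GSConv sketch above, the GSBottleneck and VoV-GSCSP structure just described can be drafted schematically as follows; the channel splits and the 1x1 shortcut projection are assumptions rather than the patent's exact layout.

import torch
import torch.nn as nn

class GSBottleneck(nn.Module):
    # two GSConv operations plus a shortcut that preserves the original features
    def __init__(self, c1, c2):
        super().__init__()
        self.conv = nn.Sequential(GSConv(c1, c2 // 2), GSConv(c2 // 2, c2))
        self.shortcut = nn.Conv2d(c1, c2, 1, bias=False)

    def forward(self, x):
        return self.conv(x) + self.shortcut(x)

class VoVGSCSP(nn.Module):
    # cross-stage partial block: one branch stacks n GSBottlenecks, the other
    # is a plain 1x1 convolution; their outputs are aggregated once at the end
    def __init__(self, c1, c2, n=1):
        super().__init__()
        c_ = c2 // 2
        self.cv1 = nn.Conv2d(c1, c_, 1, bias=False)
        self.cv2 = nn.Conv2d(c1, c_, 1, bias=False)
        self.m = nn.Sequential(*(GSBottleneck(c_, c_) for _ in range(n)))
        self.cv3 = nn.Conv2d(2 * c_, c2, 1, bias=False)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))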
To reduce overfitting of the model without losing accuracy, Dropout regularization is introduced.
To better understand the mathematical principle of Dropout, consider a simple linear neural network. Suppose the network has n neuron nodes, the output value of each node is O_i, its weight is w_i, and its input value is I_i. The output of the network can be expressed as:
O = Σ_i w_i I_i
Without Dropout, the error of the network can be expressed as:
E_N = 1/2 (t - Σ_{i=1..n} w_i' I_i)^2
where t is the target value and w' is the scaled weight, w' = p · w. When Dropout is introduced, the error of the network becomes:
E_D = 1/2 (t - Σ_{i=1..n} δ_i w_i I_i)^2
where δ_i is a random variable obeying the Bernoulli distribution, taking the value 1 with probability p and 0 with probability 1-p. Differentiating E_D with respect to w_i and taking the expectation shows that the expected gradient of the Dropout network is equivalent to the gradient of an ordinary network with a regularization term; Dropout thus acts as a regularizer of the form w_i^2 p_i (1 - p_i) I_i^2.
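A minimal numerical sketch of formulas (1) and (2) follows; here p denotes the keep probability, matching Bernoulli(p) in the text (PyTorch's own nn.Dropout takes the drop probability instead, and modern implementations rescale by 1/p during training rather than by p at test time).

import torch

def dropout_forward(a, p=0.5, training=True):
    if not training:
        return a * p                              # expected value of r ⊙ a
    r = torch.bernoulli(torch.full_like(a, p))    # formula (1): r = Bernoulli(p, N)
    return r * a                                  # formula (2): h = r ⊙ a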
The real-time video stream from areas of dense pedestrian flow is fed into the pedestrian detection model for detection; whether pedestrians are present in the monitored area is judged, the detection results are processed accordingly, and the results are analyzed.
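A hedged sketch of this real-time detection loop is given below, assuming the Ultralytics YOLOv8 inference interface; the weight file name and the stream address are placeholders.

import cv2
from ultralytics import YOLO

model = YOLO("gsconv_vovgscsp_yolov8.pt")        # hypothetical trained weights
cap = cv2.VideoCapture("rtsp://camera/stream")   # hypothetical camera stream

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame, conf=0.5)             # detect pedestrians in frame
    n = len(results[0].boxes)                    # number of detected persons
    if n > 0:
        print(f"{n} pedestrian(s) in the monitored area")  # downstream handling
cap.release()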
The YOLOv8 pedestrian detection model and the GSConv+VOVGSCSP_YOLOv8 model are comparatively evaluated, analyzing their performance differences in accuracy, speed, and robustness.
The data comes from two sources: ① video streams obtained from actual scene monitoring; ② the publicly available WiderPerson dataset. The training and test data are pre-processed. The hardware environment and related configuration are as follows:
operating system: ubuntu 18.04.6LTS;
Memory: 32GB;
GPU: NVIDIA Tesla V100-PCIE-32GB;
A frame: pytorch.
Analysis of the detection results:
The base model selected for the comparative study is YOLOv8. The invention mainly uses YOLOv8 as the basic network structure and improves it by modifying the network algorithm model: introducing an attention mechanism, replacing the loss function, and introducing dropout regularization. The attention mechanism introduced in the experiments is the CBAM attention mechanism, and the replacement loss function is WIoU. The experiments train two types of models on the same dataset, a YOLOv8 dense pedestrian target detection model and a GSConv+VOVGSCSP_YOLOv8 dense pedestrian target detection model, and analyze the experimental results in terms of model accuracy, inference speed, and robustness.
CBAM (Convolutional Block Attention Module) is an attention module for enhancing the expressive power of convolutional neural networks (CNNs) in feature representation. It adaptively adjusts the weights of the feature map by learning channel attention and spatial attention, improving the performance and generalization ability of the model.
CBAM modules consist of two sub-modules: a channel attention module (Channel Attention Module) and a spatial attention module (Spatial Attention Module).
Channel attention module (Channel Attention Module):
The channel attention module adaptively adjusts the importance of each channel by learning the attention mechanism for the channel dimension of the feature map, as shown in fig. 4. The method comprises the following two key steps:
(1) Global average pooling (Global Average Pooling): a global average pooling operation is performed on each channel, compressing the spatial dimensions of the feature map into a vector of channel-dimension size;
(2) Full tie layer (Fully Connected Layer): the compressed channel vector is mapped to an activation value (attention weight) by a full connection layer learning the weights of the channels. This activation value contains the importance of each channel in the attention mechanism.
Spatial attention module (Spatial Attention Module):
The spatial attention module adaptively adjusts the importance of each spatial location by learning the attention mechanism for the spatial dimension of the feature map, as shown in fig. 5. The method comprises the following two key steps:
(1) Max pooling and average pooling (Max Pooling and Average Pooling): max pooling and average pooling are performed on the feature map along the channel axis, yielding two attention maps that capture the most salient features and the evenly distributed features, respectively;
(2) Two-dimensional convolution and sigmoid activation: the spatial attention weight map is obtained by applying a two-dimensional convolution to the two attention maps and mapping the result into the range [0,1] with a sigmoid activation function. This weight map shows the importance of each spatial location in the attention mechanism.
Finally, the CBAM module applies the attention weights to the original feature map by multiplying the outputs of the channel attention module and the spatial attention module with it element by element, thereby obtaining an enhanced feature representation. This mechanism enables the model to adaptively attend to important channels and spatial positions, improving its ability to perceive and distinguish different targets.
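A schematic PyTorch rendering of CBAM as described above is sketched below; to stay close to the text, the channel branch uses only global average pooling (the original CBAM paper adds a parallel max-pooled branch), and the reduction ratio and kernel size are illustrative assumptions.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # channel attention: global average pooling + fully connected layers
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))           # pooled vector -> channel weights
        return torch.sigmoid(w).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    # spatial attention: channel-wise avg/max maps -> 2D conv -> sigmoid
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat((avg, mx), dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)        # element-wise channel re-weighting
        return x * self.sa(x)     # element-wise spatial re-weighting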
The CBAM module can be flexibly embedded in different layers of the CNN to enhance feature representation capabilities. The method has the advantages that remarkable performance improvement is achieved in tasks such as image classification, target detection and image segmentation, and model performance can be further improved by combining with other attention mechanisms and modules.
FIG. 6 shows the training results of the two models on the same dataset, where Epoch is the training round and mAP the training mean average precision. As FIG. 6 shows, the mAP of both models rises gradually and levels off toward a limiting value as the epochs increase, and the training mAP of the GSConv+VOVGSCSP_YOLOv8 pedestrian target detection model is better.
The best training mAP of the two models at IoU 0.5 and the trained model sizes are shown in FIG. 7. In terms of best training mAP, the GSConv+VOVGSCSP_YOLOv8 pedestrian target detection model is highest at 90.12%, followed by the YOLOv8 model at 88.64%. In terms of best training precision, the GSConv+VOVGSCSP_YOLOv8 model is highest at 87.34%, followed by the YOLOv8 model at 86.01%. In terms of best training recall, the GSConv+VOVGSCSP_YOLOv8 model is highest at 82.12%, followed by the YOLOv8 model at 80.25%.
FIG. 8 shows the test mAP and inference speed of the two types of models. As FIG. 8 shows, the detection accuracy of the GSConv+VOVGSCSP_YOLOv8 pedestrian target detection model is highest at 88.65%, followed by the YOLOv8 model; the inference speed of GSConv+VOVGSCSP_YOLOv8 is clearly faster than that of the YOLOv8 model, and its model size is also smaller than that of the ordinary YOLOv8 model.
FIGS. 9-12 show detection results on part of the test set. As FIGS. 9-10 show, the GSConv+VOVGSCSP_YOLOv8 target pedestrian detection model has the best recognition and detection performance and can correctly detect faces and dense pedestrians. The GSConv+VOVGSCSP_YOLOv8 model has good detection robustness and generalization ability with a small number of parameters; its detection effect is good, and it can identify small, distant targets.
The application uses pictures of two scenes to qualitatively evaluate the detection effect of YOLOv8 and GSConv+VOV-GSCSP_YOLOv8. The input picture size in the experiments is 640 × 640, and the confidence threshold is 0.5.
In scene 1, YOLOv8 misses some detections, while GSConv+VOV-GSCSP_YOLOv8 detects more correct targets, demonstrating that GSConv+VOV-GSCSP_YOLOv8 can extract richer features from the input picture.
In scene 2, the faces are denser and the number of detections missed by YOLOv8 increases, while GSConv+VOV-GSCSP_YOLOv8 misses none. Overall, the target detection performance of GSConv+VOV-GSCSP_YOLOv8 is generally better than that of YOLOv8, demonstrating that the network extracts richer semantic information and performs better.
In summary, the two types of pedestrian detection models trained on these scenes perform well, but certain differences and challenges remain across scenarios. Future research can continue to explore more efficient and more robust pedestrian detection methods and perform customized optimization for the requirements of specific application scenarios, providing more reliable and efficient solutions for practical applications.
The improved lightweight detection method based on YOLOv8 provided by the invention adds a CBAM attention mechanism on the basis of the YOLOv8 model, adopts the GSConv lightweight convolution to replace the original convolution module, adopts the VOV-GSCSP structure to replace the C2f structure in the neck layer, and adds a decoupled structure in the detection layer; the lightweight network structure improves detection speed. Experimental results show that the accuracy of the improved method is greatly improved, with the recall reaching 0.8212 and the mAP reaching 0.87344. The research results can greatly improve the detection accuracy for occluded and blurred pedestrians, and the lightweight network structure of the improved algorithm also increases detection speed, further enhancing the practicality and efficiency of the algorithm.
The above description is only the preferred embodiment of the present invention and does not limit the present invention in any other form; any modification or equivalent variation made according to the technical spirit of the present invention falls within the scope of protection claimed by the present invention.

Claims (6)

1. An improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP, characterized in that the method comprises the following steps: S1: using the public WiderPerson dataset, which is divided into a training set, a test set, and a validation set;
S2: YOLOv8 comprises a backbone network, C2f modules, a detection head, and a neck module; the improvement to YOLOv8 is specifically: in the Backbone layer, two standard convolutions of YOLOv8 are replaced with GSConv, and in the Neck layer all C2f modules are replaced with VoV-GSCSP;
S3: a CBAM attention mechanism is added to the SPPF module of the Backbone layer;
s4: dropout regularization is introduced, and the mathematical formula of Dropout regularization is as follows:
r = Bernoulli(p, N) (1)
h = r ⊙ a (2)
where r is a binary random vector of size N, sampled element-wise from a Bernoulli distribution with probability p, and a is the original input vector, i.e., the activation values passed from the previous layer;
S5: constructing an improved YOLOv detection method network model, training a target detection model by utilizing a data set, obtaining a dense pedestrian detection model, selecting a real-time video stream access model with dense pedestrian flows for detection, analyzing and processing results according to the detection results, and optimizing the network model by referring to a CBAM attention mechanism and dropout regularization;
S6: and obtaining index data and performing experimental comparison.
2. The improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP of claim 1, wherein: in step S1, the WiderPerson dataset comprises 13,382 images: 8,000 for training, 1,000 for validation, and 4,382 for testing.
3. The improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP of claim 1, wherein in step S2:
Backbone network: YOLOv8 uses Darknet-53 as the backbone network; Darknet-53 is a network comprising 53 convolutional layers used to extract image features;
Detection head: the YOLOv8 detection head uses a combined loss function, with binary cross-entropy for object-class classification and CIoU for bounding-box regression;
Neck module: YOLOv8 uses PAN-FPN as the neck module for feature fusion, exploiting feature-layer information at different scales; it includes multiple C2f modules and a final decoupled head structure.
4. The improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP of claim 1, wherein in step S2:
First, the lightweight convolution GSConv is used in place of the standard Conv; then, the GSBottleneck is introduced on the basis of GSConv; next, a one-shot aggregation method is used to design the cross-stage partial network module VoV-GSCSP; finally, VoV-GSCSP replaces all C2f modules of the Neck layer.
5. The improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP of claim 1, wherein: in step S3, the CBAM attention mechanism consists of two sub-modules, namely a channel attention module and a spatial attention module;
The channel attention module adaptively adjusts the importance of each channel by learning an attention mechanism over the channel dimension of the feature map, and comprises two key steps:
S31: global average pooling: a global average pooling operation is performed on each channel, compressing the spatial dimensions of the feature map into a vector of channel-dimension size;
S32: fully connected layer: the weights of the channels are learned by a fully connected layer, which maps the compressed channel vector to an activation value encoding the importance of each channel in the attention mechanism;
The spatial attention module adaptively adjusts the importance of each spatial position by learning an attention mechanism over the spatial dimension of the feature map, and comprises two key steps:
S311: max pooling and average pooling: max pooling and average pooling are performed on the feature map along the channel axis, yielding two attention maps that capture the most salient features and the evenly distributed features, respectively;
S312: two-dimensional convolution and sigmoid activation: a two-dimensional convolution is applied to the two attention maps, and the result is mapped into the range [0,1] with a sigmoid activation function, giving a spatial attention weight map that represents the importance of each spatial location in the attention mechanism;
Finally, the CBAM attention mechanism applies the attention weights to the original feature map by multiplying the outputs of the channel attention module and the spatial attention module with it element by element, thereby obtaining an enhanced feature representation.
6. The improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP of claim 1, wherein: in step S6, comparison experiments are performed on the WiderPerson dataset; complex scene pictures from different scenes are first selected, and the detection effects of the proposed algorithm and the YOLOv8 algorithm in real scenes are compared, the proposed algorithm beginning to converge after about 120 iterations;
Parameter settings: batch size = 8 and epochs = 200;
The evaluation indices comprise mAP, average precision AP, precision P, and recall R;
The formulas for precision P and recall R are shown in (3) and (4): P = TP/(TP+FP) (3); R = TP/(TP+FN) (4);
where TP is the number of correctly predicted bounding boxes, FP is the number of samples erroneously judged positive, and FN is the number of undetected targets; the average precision AP is the model's average precision, mAP is the mean of the per-class APs, and K is the number of categories; the formulas for AP and mAP are shown in (5) and (6): AP = ∫0^1 P(R) dR (5); mAP = (1/K) Σ_{k=1..K} AP_k (6).
CN202410137920.1A 2024-02-01 2024-02-01 Improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP Pending CN118015539A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410137920.1A CN118015539A (en) 2024-02-01 2024-02-01 Improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410137920.1A CN118015539A (en) 2024-02-01 2024-02-01 Improved YOLOv8 dense pedestrian detection method based on GSConv+VOV-GSCSP

Publications (1)

Publication Number Publication Date
CN118015539A true CN118015539A (en) 2024-05-10

Family

ID=90942219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410137920.1A Pending CN118015539A (en) 2024-02-01 2024-02-01 Improved YOLOv intensive pedestrian detection method based on GSConv +VOV-GSCSP

Country Status (1)

Country Link
CN (1) CN118015539A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118071751A (en) * 2024-04-22 2024-05-24 成都中科卓尔智能科技集团有限公司 YOLOv 8-based defect detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination