CN116863227A - Hazardous chemical vehicle detection method based on improved YOLOv5 - Google Patents

Hazardous chemical vehicle detection method based on improved YOLOv5

Info

Publication number
CN116863227A
CN116863227A
Authority
CN
China
Prior art keywords
representing
feature
channel
network
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310834688.2A
Other languages
Chinese (zh)
Inventor
陈伯伦
朱鹏程
刘步实
许雪
戚梓凡
赵月
于翠莹
于永涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN202310834688.2A priority Critical patent/CN116863227A/en
Publication of CN116863227A publication Critical patent/CN116863227A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The patent discloses a hazardous chemical vehicle detection method based on improved YOLOv5, which comprises the following specific steps: firstly, acquiring hazardous chemical vehicle images through road monitoring, annotating the images with Labelme, and converting the label format into the YOLO format; then preprocessing the image dataset; improving the YOLOv5 model by adding an attention mechanism to the backbone network and neck structure of the model, upgrading the neck into a cross-layer path aggregation network, and replacing the spatial pyramid pooling layer with an S-ASPP structure; designing a context feature enhancement network to strengthen the detection of small targets, and using feature refinement to filter noise features; adopting the SIoU loss in place of the original CIoU loss; and inputting the training set into the improved YOLOv5 model to train it and test the results. Compared with the prior art, the method can better identify and locate hazardous chemical vehicles and achieves a considerable improvement in precision.

Description

Hazardous chemical vehicle detection method based on improved YOLOv5
Technical Field
The application relates to the technical field of target detection, in particular to a dangerous chemical vehicle detection method based on an improved YOLOv5 algorithm.
Background
With the continuous development of industrialization, the demand for chemical raw materials keeps increasing. However, some chemical raw materials are explosive, corrosive, flammable, toxic, and so on; once a leakage occurs, they can harm human health and the surrounding environment. Ensuring the safety of hazardous chemicals is therefore important. Hazardous chemicals are mostly transported by road, and as the transport volume grows, the number of accidents involving hazardous chemical vehicles in transit has also risen, making transportation the highest risk in hazardous chemical safety. Consequently, to reduce casualties and property loss, in-transit supervision of hazardous chemical vehicles is indispensable.
In recent years, deep learning and machine vision techniques such as object detection and image segmentation have developed rapidly, providing a technological basis for hazardous chemical vehicle identification. With vehicle detection technology deployed on important roads, once a hazardous chemical vehicle is detected, the road management department can dynamically monitor it in real time, so that traffic accidents can be avoided or reduced, or timely and effective emergency rescue can be provided when a hazardous chemical transport accident occurs, secondary accidents are avoided to the greatest extent, and casualties and property loss are reduced.
In recent years, vehicle detection has attracted the attention of many researchers, and research on it continues to advance. Djenouri et al. propose an improved region-based convolutional neural network that first uses a SIFT extractor to remove noise (a set of outlier images) and then builds the improved network to detect vehicles of different scales. Li et al. propose a new condition-sensitive approach based on Fast R-CNN and Domain Adaptation (DA), making maximum use of labeled daytime images (source domain) to aid vehicle detection in unlabeled nighttime images (target domain), improving nighttime detection capability.
Aiming at the problem of rapid vehicle detection in traffic scenes, Chen et al. propose an improved Single Shot MultiBox Detector (SSD) algorithm: MobileNetv2 is selected as the backbone feature extraction network of the SSD to improve real-time performance, a channel attention mechanism is used for feature weighting, and a deconvolution module is used to construct a bottom-up feature fusion structure to improve detection precision. Kang et al. propose a fast detection method for moving vehicles in remote sensing satellite video with automatic region-of-interest constraint: first, the region of interest of the moving vehicle target is obtained quickly and automatically; second, under the region-of-interest constraint, fast detection of moving vehicles within the region is realized based on an improved Gaussian background difference method.
Among existing deep learning vehicle detection methods, two-stage algorithms achieve higher precision but run more slowly, while single-stage algorithms are faster but less precise. How to design a detection algorithm that is both accurate and fast is therefore very important.
Disclosure of Invention
The application aims to: aiming at the low precision of existing algorithms when detecting hazardous chemical vehicle targets, the application provides a hazardous chemical vehicle detection method based on an improved YOLOv5 algorithm that improves detection precision while maintaining detection speed.
The technical scheme is as follows: the application provides a hazardous chemical vehicle target detection method based on improved YOLOv5, characterized by comprising the following specific steps:
s1, acquiring dangerous chemical vehicle images through road monitoring to form a data set, and dividing the data set into a training set, a verification set and a test set;
s2, preprocessing pictures in the dangerous chemical vehicle data set;
s3, improving the YOLOv5 model, adding an attention mechanism to the backbone network and neck structure of the model, upgrading the neck structure into a cross-layer path aggregation network, and replacing the spatial pyramid pooling layer with an S-ASPP structure;
s4, designing a context feature enhancement network CAH and introducing it at the top of the backbone network to strengthen the detection of small targets, and using a feature refinement module FRB in the neck structure to filter noise features;
s5, modifying the loss function in YOLOv5, replacing the original CIoU loss with the SIoU loss;
s6, inputting the training set into the improved YOLOv5 model to train the model and test the result.
Further, the specific steps of the step S1 are as follows:
s1.1, acquiring a dangerous chemical vehicle image through road monitoring, and marking the image by using a Labelme marking tool;
s1.2, converting the annotated JSON label format into the YOLO format suitable for network input, generating id, x, y, w, h values and normalizing them;
s1.3, dividing the annotated vehicle dataset into training, validation, and test sets at a ratio of 8:1:1.
Further, the specific steps of the preprocessing in the step S2 are as follows:
s2.1, using a Tone mapping contrast enhancement method: calculating the average brightness of the current scene, selecting a suitable brightness domain according to the average brightness, and finally mapping the whole scene into that brightness domain to obtain the required result;
s2.2, using Mixup data enhancement to expand the data volume and strengthen the generalization capability of the network model;
s2.3, using Mosaic data enhancement, splicing 9 pictures together to strengthen the robustness of the model.
Further, the specific steps of adding the attention mechanism to the backbone network and neck structure of the model in the step S3 are as follows:
s3.1, introducing an attention mechanism into the backbone network and neck structure of the YOLOv5 model, suppressing useless information by weighting or hard-selecting parts of the feature maps; the attention module generates attention feature-map information in the channel and spatial dimensions in parallel, and the two kinds of feature-map information are then multiplied with the original input feature map for adaptive feature correction, producing the final attention feature map;
s3.2, the channel attention mechanism compresses the feature map in the spatial dimension: global average pooling and max pooling are applied to the input feature map over the width and height dimensions respectively, and cross-channel interaction is then performed through a one-dimensional convolution of size k to strengthen the connections among channels; the convolution kernel size k is defined as:

k = ψ(C) = |log₂(C)/γ + b/γ|_odd

wherein C is the number of channels of the feature map, γ and b take the values 2 and 1 respectively, and |·|_odd denotes the nearest odd number; the channel attention mechanism is specifically defined as:

M_c(F) = σ(Conv(AvgPool(F)) + Conv(MaxPool(F))) = σ(W_k(F_avg) + W_k(F_max))

wherein F denotes the input feature map, σ denotes the sigmoid activation function, F_avg and F_max denote the feature maps after global average pooling and global max pooling respectively, and W_k denotes the network weights;
s3.3, the spatial attention mechanism compresses the feature map in the channel dimension: channel-based global average pooling and max pooling are applied to the input feature map, the two resulting tensors are spliced along the channel direction, a convolution operation reduces the channels to 1, and a sigmoid activation generates the spatial attention feature, specifically defined as:

M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)])) = σ(f^{7×7}([F_avg; F_max]))

wherein F denotes the input feature map, σ denotes the sigmoid activation function, and f^{7×7} denotes a convolution operation with a 7×7 kernel;
s3.4, using a cross-layer path aggregation network as the neck structure;
s3.5, replacing the SPPF layer with a newly designed spatial pyramid pooling layer, the S-ASPP layer.
Further, the S-ASPP structure in the step S3 uses different dilation rates to extract hazardous chemical vehicle information of different scales in the image, specifically:
the S-ASPP structure is divided into two parallel parts: the first part performs serial convolution through four 3×3 convolution layers with dilation rates {18, 12, 6, 1}, ordered from large to small, and splices the resulting feature information of four different receptive fields in the channel dimension; the second part sequentially performs pooling, convolution, and upsampling on the input original features; the resulting feature map is spliced with the first part's feature maps in the channel dimension, and a 1×1 convolution over the channel dimension yields the final S-ASPP output.
Further, the specific steps of the step S4 are as follows:
s4.1, designing a new context feature enhancement module CAH and introducing it at the top of the backbone network; the context feature enhancement module CAH consists of four context reasoning branches: the first and second branches use dilated convolution kernels with dilation rates 1 and 2 to access the local context; the third branch uses two dilated convolution kernels with dilation rates {2, 4}, and the fourth branch uses two convolution kernels with dilation rates {3, 6}, to perceive larger context information; finally, the outputs of the four parallel branches are spliced and fused in the channel dimension, and the final output is obtained through a 1×1 convolution with dilation rate 1;
s4.2, designing a feature refinement module FRB to filter noise features in the cross-layer path aggregation network, integrated into the neck structure; the feature refinement module FRB first unifies the features output by the prediction heads into feature maps of the same size, then splices and fuses them; the fused features then enter a spatial purification module and a channel purification module respectively, and the outputs of the two purification modules are finally added and fused to produce the final result; the channel purification module aggregates the input features over the spatial dimensions, combining global average pooling and global max pooling, and is specifically defined as:

wherein X_m denotes the input of the m-th layer (m = 1, 2, 3) of the feature refinement module, X_{n→m}(x, y) denotes the result of the n-th to m-th layers at position (x, y), Y_m(x, y) denotes the output of the m-th layer at position (x, y), and a_m, b_m, c_m denote the channel-adaptive weights, of size 1×1;
the spatial purification module uses a softmax function to generate weights under each channel, with the branches specifically defined as:

wherein Y(x, y) denotes the final output feature at position (x, y), k denotes the number of input feature-map channels, c denotes the channel, X^k_{n→m}(x, y) denotes the value of the k-th channel feature map from the n-th to m-th layers at position (x, y), and w^k_{n→m}(x, y) denotes the spatially adaptive weights.
Further, the specific steps of the step S5 are as follows:
s5.1, calculating the target confidence loss from the samples obtained by positive-sample matching, which comprises two parts: the target confidence score p_o in the prediction box, and the IoU value between the prediction box and its corresponding target box; the binary cross entropy of the two gives the final target confidence loss, specifically defined as:

wherein l_obj denotes the confidence loss, C_i denotes the true value, Ĉ_i denotes the predicted value, λ_noobj denotes the weight for non-object targets, S² denotes the number of grid cells, and B denotes the number of anchors in each cell;
s5.2, calculating the category loss from the category score of the prediction box and the one-hot encoding of the target box category, specifically defined as:

wherein l_cls denotes the classification loss, c denotes the category, and P(c) denotes the prediction box's category score for that category;
s5.3, replacing the CIoU in the target box loss with the SIoU; the SIoU consists of four parts: angle, distance, shape, and IoU, specifically defined as:

Loss_SIoU = 1 − IoU + (Δ + Ω)/2

the shape cost is defined as:

Ω = Σ_{t∈{w,h}} (1 − e^{−ω_t})^θ, ω_w = |w − w^gt| / max(w, w^gt), ω_h = |h − h^gt| / max(h, h^gt)

wherein Ω denotes the shape cost, w, h and w^gt, h^gt are the width and height of the prediction box and the ground-truth box respectively, and θ denotes the importance of the shape loss;

the angle cost is defined as:

Λ = 1 − 2·sin²(arcsin(c_h/σ) − π/4)

wherein σ denotes the distance between the center points of the ground-truth box and the prediction box, c_h denotes the height difference between the two center points, α denotes the angle between the line connecting the two points and the horizontal, (b^gt_cx, b^gt_cy) denotes the center of the ground-truth box, and (b_cx, b_cy) denotes the center of the prediction box;

the distance cost, redefined in terms of the angle cost, is:

Δ = Σ_{t∈{x,y}} (1 − e^{−(2−Λ)ρ_t}), ρ_x = ((b^gt_cx − b_cx)/c_w)², ρ_y = ((b^gt_cy − b_cy)/c_h)²

wherein (b^gt_cx, b^gt_cy) denotes the center point of the ground-truth box, (b_cx, b_cy) denotes the center point of the prediction box, and c_w, c_h are the width and height of the minimum enclosing rectangle of the ground-truth and prediction boxes.
Further, the specific steps of the step S6 are as follows:
s6.1, inputting the training set into the improved YOLOv5 algorithm model to train the model and test the result on the test set;
and S6.2, deploying the trained model on a server, and monitoring the dangerous chemical vehicle in real time.
Compared with the prior art, the hazardous chemical vehicle detection method based on improved YOLOv5 of the present application improves detection precision while maintaining speed, and is characterized in that:
(1) The Tone mapping contrast enhancement method is used to improve the distinguishability of the vehicle from the image background, and Mixup and Mosaic data enhancement are then used to expand the dataset.
(2) The YOLOv5 network is improved: an attention mechanism is added to the backbone network and neck structure of the model, the neck structure is upgraded into a cross-layer path aggregation network, and the spatial pyramid pooling layer is replaced with an S-ASPP structure. A context feature enhancement network (CAH) is introduced at the top of the backbone network to strengthen the detection of small targets, and a feature refinement module (FRB) is used in the neck structure to filter noise features.
(3) The loss function in YOLOv5 is modified, and the SIoU loss is used for replacing the original CIoU loss.
Aiming at the problem of hazardous chemical vehicle detection precision, the application first uses the Tone mapping contrast enhancement method to strengthen target contrast; second, Mixup and Mosaic data enhancement increase the volume of the training set and improve the robustness of the model; an attention mechanism is added to the network model so that it attends more to vehicle information, the path aggregation network is improved so that vehicle feature information is not easily lost, and the spatial pyramid pooling layer is replaced; context feature enhancement is then introduced so that the model can capture fine features, and feature refinement is used to filter out ineffective features; finally, the SIoU loss is used as the loss function for the target box. Compared with methods of the same type, this method effectively improves the model's vehicle detection precision.
Drawings
FIG. 1 is an overall flow diagram of the algorithm;
FIGS. 2-4 are block diagrams of added attention mechanisms;
FIG. 5 is a diagram of a replaced spatial pyramid S-ASPP layer;
FIG. 6 is a diagram of a context feature enhanced network (CAH) architecture;
FIG. 7 is a feature refinement module (FRB) block diagram;
FIG. 8 is a diagram of the improved YOLOv5 model.
Detailed Description
The application is further elucidated below in connection with the drawings and specific embodiments. It is to be understood that these examples are intended only to illustrate the application and not to limit its scope; various equivalent modifications made by those skilled in the art after reading the application will fall within the scope defined by the appended claims.
As shown in fig. 1, the specific steps of the application are as follows:
s1, acquiring dangerous chemical vehicle images through road monitoring to serve as training data for training an improved YOLOv5 network, wherein the method comprises the following specific steps of:
s1.1, acquiring dangerous chemical vehicle images through road monitoring, and marking the images by using a Labelme marking tool.
S1.2, converting the annotated JSON label format into the YOLO format suitable for network input, generating id, x, y, w, h values and normalizing them (a conversion sketch follows step S1.3).
S1.3, dividing the annotated vehicle dataset into training, validation, and test sets at a ratio of 8:1:1.
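As an illustration of step S1.2, the following is a minimal conversion sketch. It assumes Labelme rectangle annotations; the class map, paths, and function name are hypothetical.

```python
import json
from pathlib import Path

CLASS_IDS = {"hazmat_vehicle": 0}  # assumed label map

def labelme_to_yolo(json_path: str, out_dir: str) -> None:
    """Convert one Labelme rectangle-annotation file to a YOLO txt file."""
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    img_w, img_h = data["imageWidth"], data["imageHeight"]
    lines = []
    for shape in data["shapes"]:
        (x1, y1), (x2, y2) = shape["points"][:2]
        # Normalize center x/y and width/height to [0, 1], as YOLO expects.
        cx = (x1 + x2) / 2 / img_w
        cy = (y1 + y2) / 2 / img_h
        w = abs(x2 - x1) / img_w
        h = abs(y2 - y1) / img_h
        lines.append(f"{CLASS_IDS[shape['label']]} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    out_path = Path(out_dir) / (Path(json_path).stem + ".txt")
    out_path.write_text("\n".join(lines), encoding="utf-8")
```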
S2, preprocessing pictures in the dangerous chemical vehicle data set, wherein the specific steps are as follows:
S2.1, using a Tone mapping contrast enhancement method to improve the distinguishability of the vehicle from the image background: the average brightness of the current scene is calculated, a suitable brightness domain is selected according to it, and the whole scene is finally mapped into that brightness domain to obtain the required result. The log-average scene luminance is specifically defined as:

L̄_w = exp( (1/N) Σ_{x,y} log(δ + L_w(x, y)) )

wherein L_w(x, y) denotes the pixel brightness at position (x, y), N denotes the number of pixels in the scene, and δ is a very small number used to handle pure-black pixels.
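A minimal sketch of this preprocessing step, assuming the log-average formulation above and a simple global mapping operator (the key value 0.18 and the function names are illustrative assumptions):

```python
import numpy as np

def log_average_luminance(lum: np.ndarray, delta: float = 1e-6) -> float:
    """Log-average luminance of an HxW map; delta guards log(0) on black pixels."""
    return float(np.exp(np.mean(np.log(delta + lum))))

def tone_map(lum: np.ndarray, key: float = 0.18, delta: float = 1e-6) -> np.ndarray:
    """Scale scene luminance by the log-average, then compress into [0, 1)."""
    scaled = key * lum / log_average_luminance(lum, delta)
    return scaled / (1.0 + scaled)
```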
S2.2, using Mixup data enhancement to expand the data volume and strengthen the generalization capability of the network model.
S2.3, using Mosaic data enhancement, splicing 9 pictures together to strengthen the robustness of the model.
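For illustration, a minimal Mixup sketch for detection data; the beta parameter and the convention of keeping both label sets for the mixed image are assumptions (Mosaic splicing is longer and omitted here):

```python
import numpy as np

def mixup(img_a, labels_a, img_b, labels_b, alpha: float = 8.0):
    """Blend two images; the mixed image keeps the boxes of both sources."""
    lam = np.random.beta(alpha, alpha)
    mixed = (lam * img_a.astype(np.float32)
             + (1.0 - lam) * img_b.astype(np.float32)).astype(img_a.dtype)
    return mixed, np.concatenate([labels_a, labels_b], axis=0)
```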
S3, improving the YOLOv5 model, adding an attention mechanism to the backbone network and neck structure of the model, and replacing the spatial pyramid pooling layer with the S-ASPP structure, with the following specific steps:
S3.1, introducing an attention mechanism into the backbone network and neck structure of the YOLOv5 model, suppressing useless information by weighting or hard-selecting parts of the feature maps. The attention module generates attention feature-map information in the channel and spatial dimensions in parallel, and the two kinds of feature-map information are then multiplied with the original input feature map for adaptive feature correction, producing the final attention feature map.
S3.2, the channel attention mechanism compresses the feature map in the spatial dimension: global average pooling and max pooling are applied to the input feature map over the width and height dimensions respectively, and cross-channel interaction is then performed through a one-dimensional convolution of size k to strengthen the connections among channels. The convolution kernel size k is defined as:

k = ψ(C) = |log₂(C)/γ + b/γ|_odd

wherein C is the number of channels of the feature map, γ and b take the values 2 and 1 respectively, and |·|_odd denotes the nearest odd number. The channel attention mechanism is specifically defined as:

M_c(F) = σ(Conv(AvgPool(F)) + Conv(MaxPool(F))) = σ(W_k(F_avg) + W_k(F_max))

wherein F denotes the input feature map, σ denotes the sigmoid activation function, F_avg and F_max denote the feature maps after global average pooling and global max pooling respectively, and W_k denotes the network weights.
s3.3, the spatial attention mechanism is to compress the feature map in the channel dimension, make the input feature map global average pooling and maximum pooling based on the channel, splice the two obtained tensors in the channel direction, reduce the channel to 1 by convolution operation, and activate to generate the spatial attention feature through sigmoid. Specifically defined as:
Ms(F)=σ(f 7x7 ([Avgpool(F);Maxpool(F)]))
=σ(f 7x7 ([F avg ;F max ]))
wherein F represents an input feature map, sigma represents a sigmoid activation function, F 7x7 Representing a convolution operation with a convolution kernel size of 7 x 7.
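A PyTorch sketch of the two attention branches described in S3.2 and S3.3; the adaptive kernel-size rule follows the formula above, while the class names and placement in the network are assumptions.

```python
import math
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Avg- and max-pooled channel descriptors through a shared k-sized 1-D conv."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1  # round to the nearest odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):  # x: (B, C, H, W) -> weights: (B, C, 1, 1)
        avg = self.conv(x.mean(dim=(2, 3)).unsqueeze(1))
        mx = self.conv(x.amax(dim=(2, 3)).unsqueeze(1))
        return torch.sigmoid(avg + mx).transpose(1, 2).unsqueeze(-1)

class SpatialAttention(nn.Module):
    """Channel-wise avg/max maps concatenated, 7x7 conv, sigmoid."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):  # x: (B, C, H, W) -> weights: (B, 1, H, W)
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(s))
```

An input feature map would be multiplied by the channel weights and then by the spatial weights, matching the adaptive correction described in S3.1.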
S3.4, using a cross-layer path aggregation network as the neck structure.
S3.5, replacing the SPPF layer with a newly designed spatial pyramid pooling layer, the S-ASPP layer. The S-ASPP structure uses different dilation rates to extract hazardous chemical vehicle information of different scales in the image. It is divided into two parallel parts: the first part performs serial convolution through four 3×3 convolution layers with dilation rates {18, 12, 6, 1}, ordered from large to small, and splices the resulting feature information of four different receptive fields in the channel dimension; the second part sequentially performs pooling, convolution, and upsampling on the input original features, the resulting feature map is spliced with the first part's feature maps in the channel dimension, and a 1×1 convolution over the channel dimension yields the final S-ASPP output.
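A minimal PyTorch sketch of the S-ASPP layer as described; the channel widths, the absence of normalization, and the global pooling granularity are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SASPP(nn.Module):
    """Serial 3x3 dilated convs (rates 18, 12, 6, 1) whose intermediate outputs
    are all kept, plus a pool-conv-upsample branch, fused by a 1x1 conv."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Conv2d(c_in, c_in, 3, padding=r, dilation=r) for r in (18, 12, 6, 1))
        self.pool_conv = nn.Conv2d(c_in, c_in, 1)
        self.fuse = nn.Conv2d(5 * c_in, c_out, 1)

    def forward(self, x):
        feats, y = [], x
        for stage in self.stages:  # serial chain; each stage adds a receptive field
            y = stage(y)
            feats.append(y)
        p = F.adaptive_avg_pool2d(x, 1)          # pooling
        p = self.pool_conv(p)                    # convolution
        p = F.interpolate(p, size=x.shape[2:])   # upsampling back to input size
        feats.append(p)
        return self.fuse(torch.cat(feats, dim=1))  # 1x1 fusion over channels
```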
S4, introducing a new module to enable the model to pay more attention to fine features, wherein the method comprises the following specific steps of:
S4.1, designing a new context feature enhancement module (CAH) and introducing it at the top of the backbone network. The context feature enhancement module consists of four context reasoning branches: the first and second branches use dilated convolution kernels with dilation rates 1 and 2 to access the local context; the third branch uses two dilated convolution kernels with dilation rates {2, 4}, and the fourth branch uses two convolution kernels with dilation rates {3, 6}, to perceive larger context information. Finally, the outputs of the four parallel branches are spliced and fused in the channel dimension, and the final output is obtained through a 1×1 convolution with dilation rate 1.
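A sketch of the four-branch module under these dilation rates; keeping the channel count constant in each branch is an assumption.

```python
import torch
import torch.nn as nn

class CAH(nn.Module):
    """Four dilated-conv context branches (rates 1; 2; 2->4; 3->6),
    concatenated along channels and fused by a 1x1 convolution."""
    def __init__(self, c: int):
        super().__init__()
        def dconv(r):  # 3x3 conv with padding equal to its dilation rate
            return nn.Conv2d(c, c, 3, padding=r, dilation=r)
        self.b1, self.b2 = dconv(1), dconv(2)
        self.b3 = nn.Sequential(dconv(2), dconv(4))
        self.b4 = nn.Sequential(dconv(3), dconv(6))
        self.fuse = nn.Conv2d(4 * c, c, 1)

    def forward(self, x):
        y = torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
        return self.fuse(y)
```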
S4.2, designing a feature refinement module to filter noise features in the cross-layer path aggregation network, integrated into the neck structure. The feature refinement module first unifies the features output by the prediction heads into feature maps of the same size, then splices and fuses them. The fused features then enter the spatial purification module and the channel purification module respectively, and the outputs of the two purification modules are finally added and fused to produce the final result. The channel purification module aggregates the input features over the spatial dimensions, combining global average pooling and global max pooling. Specifically defined as:

wherein X_m denotes the input of the m-th layer (m = 1, 2, 3) of the feature refinement module, X_{n→m}(x, y) denotes the result of the n-th to m-th layers at position (x, y), Y_m(x, y) denotes the output of the m-th layer at position (x, y), and a_m, b_m, c_m denote the channel-adaptive weights, of size 1×1;
the spatial purification module uses a softmax function to generate weights under each channel, with the branches specifically defined as:

wherein Y(x, y) denotes the final output feature at position (x, y), k denotes the number of input feature-map channels, c denotes the channel, X^k_{n→m}(x, y) denotes the value of the k-th channel feature map from the n-th to m-th layers at position (x, y), and w^k_{n→m}(x, y) denotes the spatially adaptive weights.
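The purification equations are not reproduced above, so the following is only a simplified sketch of the described flow — resize and concatenate the pyramid levels, a channel branch built from global avg/max pooling, a spatial branch built from a channel-wise softmax — with all module and parameter names assumed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FRB(nn.Module):
    """Simplified feature refinement: fuse three levels, then refine with a
    channel-purification branch and a spatial-purification branch, summed."""
    def __init__(self, channels: int):
        super().__init__()
        self.squeeze = nn.Conv2d(3 * channels, channels, 1)
        self.channel_fc = nn.Conv2d(channels, channels, 1)

    def forward(self, feats):  # feats: list of three (B, C, Hi, Wi) maps
        size = feats[0].shape[2:]
        x = torch.cat([F.interpolate(f, size=size) for f in feats], dim=1)
        x = self.squeeze(x)
        # Channel purification: per-channel weights from avg+max descriptors.
        c_w = torch.sigmoid(self.channel_fc(
            F.adaptive_avg_pool2d(x, 1) + F.adaptive_max_pool2d(x, 1)))
        # Spatial purification: softmax across channels gives positional weights.
        s_w = torch.softmax(x, dim=1)
        return x * c_w + x * s_w
```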
S5, modifying a loss function in the Yolov5, and replacing the original CIoU loss by adopting the SIoU loss, wherein the specific steps are as follows:
S5.1, calculating the target confidence loss from the samples obtained by positive-sample matching, which comprises two parts: the target confidence score p_o in the prediction box, and the IoU value between the prediction box and its corresponding target box; the binary cross entropy of the two gives the final target confidence loss. Specifically defined as:

wherein l_obj denotes the confidence loss, C_i denotes the true value, Ĉ_i denotes the predicted value, λ_noobj denotes the weight for non-object targets, S² denotes the number of grid cells, and B denotes the number of anchors in each cell.
S5.2, calculating the category loss from the category score of the prediction box and the one-hot encoding of the target box category. Specifically defined as:

wherein l_cls denotes the classification loss, c denotes the category, and P(c) denotes the prediction box's category score for that category.
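The confidence- and category-loss equations are not reproduced above; as an illustration of the pattern they describe (binary cross entropy against IoU-valued objectness targets and one-hot class targets, with down-weighted no-object cells), here is a hedged sketch — the tensor layouts and the no-object weight are assumptions:

```python
import torch
import torch.nn.functional as F

def objectness_and_class_loss(pred_obj, target_iou, pred_cls, target_onehot,
                              lambda_noobj: float = 0.5):
    """BCE objectness loss with IoU-valued targets, plus BCE class loss.
    pred_* are raw logits; shapes are (N,) and (N, num_classes)."""
    obj_mask = target_iou > 0  # cells matched to a ground-truth box
    bce = F.binary_cross_entropy_with_logits(pred_obj, target_iou, reduction="none")
    l_obj = bce[obj_mask].sum() + lambda_noobj * bce[~obj_mask].sum()
    l_cls = F.binary_cross_entropy_with_logits(
        pred_cls[obj_mask], target_onehot[obj_mask], reduction="sum")
    return l_obj, l_cls
```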
S5.3, replacing the CIoU in the target box loss with the SIoU. The SIoU consists of four parts: angle, distance, shape, and IoU. Specifically defined as:

Loss_SIoU = 1 − IoU + (Δ + Ω)/2

wherein the shape cost is defined as:

Ω = Σ_{t∈{w,h}} (1 − e^{−ω_t})^θ, ω_w = |w − w^gt| / max(w, w^gt), ω_h = |h − h^gt| / max(h, h^gt)

wherein Ω denotes the shape cost, w, h and w^gt, h^gt are the width and height of the prediction box and the ground-truth box respectively, and θ denotes the importance of the shape loss.

The angle cost is defined as:

Λ = 1 − 2·sin²(arcsin(c_h/σ) − π/4)

wherein σ denotes the distance between the center points of the ground-truth box and the prediction box, c_h denotes the height difference between the two center points, α denotes the angle between the line connecting the two points and the horizontal, (b^gt_cx, b^gt_cy) denotes the center of the ground-truth box, and (b_cx, b_cy) denotes the center of the prediction box.

The distance cost, redefined in terms of the angle cost, is:

Δ = Σ_{t∈{x,y}} (1 − e^{−(2−Λ)ρ_t}), ρ_x = ((b^gt_cx − b_cx)/c_w)², ρ_y = ((b^gt_cy − b_cy)/c_h)²

wherein (b^gt_cx, b^gt_cy) denotes the center point of the ground-truth box, (b_cx, b_cy) denotes the center point of the prediction box, and c_w, c_h are the width and height of the minimum enclosing rectangle of the ground-truth and prediction boxes.
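A self-contained sketch of the SIoU box loss following the standard formulation given above; corner-format (x1, y1, x2, y2) boxes and θ = 4 are assumptions.

```python
import math
import torch

def siou_loss(pred, target, theta: float = 4.0, eps: float = 1e-7):
    """SIoU = 1 - IoU + (distance + shape)/2; boxes are (N, 4) x1y1x2y2 tensors."""
    # IoU term
    iw = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    ih = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = iw * ih
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Centers and the minimum enclosing rectangle
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tcx, tcy = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])

    # Angle cost: based on the elevation of the center-to-center line
    sigma = torch.sqrt((tcx - pcx) ** 2 + (tcy - pcy) ** 2) + eps
    sin_alpha = (torch.abs(tcy - pcy) / sigma).clamp(-1, 1)
    angle = 1 - 2 * torch.sin(torch.arcsin(sin_alpha) - math.pi / 4) ** 2

    # Distance cost, re-weighted by the angle cost
    gamma = 2 - angle
    rho_x = ((tcx - pcx) / (cw + eps)) ** 2
    rho_y = ((tcy - pcy) / (ch + eps)) ** 2
    dist = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))

    # Shape cost
    pw, ph = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    tw, th = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    omega_w = torch.abs(pw - tw) / torch.max(pw, tw).clamp(min=eps)
    omega_h = torch.abs(ph - th) / torch.max(ph, th).clamp(min=eps)
    shape = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    return 1 - iou + (dist + shape) / 2
```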
S6, inputting the training set into the improved YOLOv5 algorithm model to train the model and test the results, with the following specific steps:
s6.1, inputting the training set into the improved YOLOv5 algorithm model to train the model and test the result on the test set;
s6.2, deploying the trained model on a server, and detecting the dangerous chemical vehicle in real time.
The application can be deployed at a mobile terminal to realize real-time detection of dangerous chemical vehicles.
The foregoing embodiments are merely illustrative of the technical concept and features of the present application, and are intended to enable those skilled in the art to understand the present application and to implement the same, not to limit the scope of the present application. All equivalent changes or modifications made according to the spirit of the present application should be included in the scope of the present application.

Claims (8)

1. A hazardous chemical vehicle target detection method based on improved YOLOv5, characterized by comprising the following specific steps:
s1, acquiring dangerous chemical vehicle images through road monitoring to form a data set, and dividing the data set into a training set, a verification set and a test set;
s2, preprocessing pictures in the dangerous chemical vehicle data set;
s3, improving the YOLOv5 model, adding an attention mechanism to the backbone network and neck structure of the model, upgrading the neck structure into a cross-layer path aggregation network, and replacing the spatial pyramid pooling layer with an S-ASPP structure;
s4, designing a context feature enhancement network CAH and introducing it at the top of the backbone network to strengthen the detection of small targets, and using a feature refinement module FRB in the neck structure to filter noise features;
s5, modifying the loss function in YOLOv5, replacing the original CIoU loss with the SIoU loss;
s6, inputting the training set into the improved YOLOv5 model to train the model and test the result.
2. The hazardous chemical vehicle target detection method based on improved YOLOv5 of claim 1, wherein the specific steps of the step S1 are as follows:
s1.1, acquiring a dangerous chemical vehicle image through road monitoring, and marking the image by using a Labelme marking tool;
s1.2, converting the annotated JSON label format into the YOLO format suitable for network input, generating id, x, y, w, h values and normalizing them;
s1.3, dividing the annotated vehicle dataset into training, validation, and test sets at a ratio of 8:1:1.
3. The hazardous chemical vehicle target detection method based on improved YOLOv5 of claim 1, wherein the specific steps of the preprocessing in the step S2 are as follows:
s2.1, using a Tone mapping contrast enhancement method: calculating the average brightness of the current scene, selecting a suitable brightness domain according to the average brightness, and finally mapping the whole scene into that brightness domain to obtain the required result;
s2.2, using Mixup data enhancement to expand the data volume and strengthen the generalization capability of the network model;
s2.3, using Mosaic data enhancement, splicing 9 pictures together to strengthen the robustness of the model.
4. The hazardous chemical vehicle target detection method based on improved YOLOv5 of claim 1, wherein the specific steps of adding the attention mechanism to the backbone network and neck structure of the model in the step S3 are as follows:
s3.1, introducing an attention mechanism into the backbone network and neck structure of the YOLOv5 model, suppressing useless information by weighting or hard-selecting parts of the feature maps; the attention module generates attention feature-map information in the channel and spatial dimensions in parallel, and the two kinds of feature-map information are then multiplied with the original input feature map for adaptive feature correction, producing the final attention feature map;
s3.2, the channel attention mechanism compresses the feature map in the spatial dimension: global average pooling and max pooling are applied to the input feature map over the width and height dimensions respectively, and cross-channel interaction is then performed through a one-dimensional convolution of size k to strengthen the connections among channels; the convolution kernel size k is defined as:

k = ψ(C) = |log₂(C)/γ + b/γ|_odd

wherein C is the number of channels of the feature map, γ and b take the values 2 and 1 respectively, and |·|_odd denotes the nearest odd number; the channel attention mechanism is specifically defined as:

M_c(F) = σ(Conv(AvgPool(F)) + Conv(MaxPool(F))) = σ(W_k(F_avg) + W_k(F_max))

wherein F denotes the input feature map, σ denotes the sigmoid activation function, F_avg and F_max denote the feature maps after global average pooling and global max pooling respectively, and W_k denotes the network weights;
s3.3, the spatial attention mechanism compresses the feature map in the channel dimension: channel-based global average pooling and max pooling are applied to the input feature map, the two resulting tensors are spliced along the channel direction, a convolution operation reduces the channels to 1, and a sigmoid activation generates the spatial attention feature, specifically defined as:

M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)])) = σ(f^{7×7}([F_avg; F_max]))

wherein F denotes the input feature map, σ denotes the sigmoid activation function, and f^{7×7} denotes a convolution operation with a 7×7 kernel;
s3.4, using a cross-layer path aggregation network as the neck structure;
s3.5, replacing the SPPF layer with a newly designed spatial pyramid pooling layer, the S-ASPP layer.
5. The hazardous chemical vehicle target detection method based on improved YOLOv5 of claim 1 or 4, wherein the S-ASPP structure in the step S3 uses different dilation rates to extract hazardous chemical vehicle information of different scales in the image, specifically:
the S-ASPP structure is divided into two parallel parts: the first part performs serial convolution through four 3×3 convolution layers with dilation rates {18, 12, 6, 1}, ordered from large to small, and splices the resulting feature information of four different receptive fields in the channel dimension; the second part sequentially performs pooling, convolution, and upsampling on the input original features; the resulting feature map is spliced with the first part's feature maps in the channel dimension, and a 1×1 convolution over the channel dimension yields the final S-ASPP output.
6. The hazardous chemical vehicle target detection method based on improved YOLOv5 of claim 1, wherein the specific steps of the step S4 are as follows:
s4.1, designing a new context feature enhancement module CAH and introducing it at the top of the backbone network; the context feature enhancement module CAH consists of four context reasoning branches: the first and second branches use dilated convolution kernels with dilation rates 1 and 2 to access the local context; the third branch uses two dilated convolution kernels with dilation rates {2, 4}, and the fourth branch uses two convolution kernels with dilation rates {3, 6}, to perceive larger context information; finally, the outputs of the four parallel branches are spliced and fused in the channel dimension, and the final output is obtained through a 1×1 convolution with dilation rate 1;
s4.2, designing a feature refinement module FRB to filter noise features in the cross-layer path aggregation network, integrated into the neck structure; the feature refinement module FRB first unifies the features output by the prediction heads into feature maps of the same size, then splices and fuses them; the fused features then enter a spatial purification module and a channel purification module respectively, and the outputs of the two purification modules are finally added and fused to produce the final result; the channel purification module aggregates the input features over the spatial dimensions, combining global average pooling and global max pooling, and is specifically defined as:

wherein X_m denotes the input of the m-th layer (m = 1, 2, 3) of the feature refinement module, X_{n→m}(x, y) denotes the result of the n-th to m-th layers at position (x, y), Y_m(x, y) denotes the output of the m-th layer at position (x, y), and a_m, b_m, c_m denote the channel-adaptive weights, of size 1×1;
the spatial purification module uses a softmax function to generate weights under each channel, with the branches specifically defined as:

wherein Y(x, y) denotes the final output feature at position (x, y), k denotes the number of input feature-map channels, c denotes the channel, X^k_{n→m}(x, y) denotes the value of the k-th channel feature map from the n-th to m-th layers at position (x, y), and w^k_{n→m}(x, y) denotes the spatially adaptive weights.
7. The hazardous chemical vehicle target detection method based on improved YOLOv5 of claim 1, wherein the specific steps of the step S5 are as follows:
s5.1, calculating the target confidence loss from the samples obtained by positive-sample matching, which comprises two parts: the target confidence score p_o in the prediction box, and the IoU value between the prediction box and its corresponding target box; the binary cross entropy of the two gives the final target confidence loss, specifically defined as:

wherein l_obj denotes the confidence loss, C_i denotes the true value, Ĉ_i denotes the predicted value, λ_noobj denotes the weight for non-object targets, S² denotes the number of grid cells, and B denotes the number of anchors in each cell;
s5.2, calculating the category loss from the category score of the prediction box and the one-hot encoding of the target box category, specifically defined as:

wherein l_cls denotes the classification loss, c denotes the category, and P(c) denotes the prediction box's category score for that category;
s5.3, replacing the CIoU in the target box loss with the SIoU; the SIoU consists of four parts: angle, distance, shape, and IoU, specifically defined as:

Loss_SIoU = 1 − IoU + (Δ + Ω)/2

the shape cost is defined as:

Ω = Σ_{t∈{w,h}} (1 − e^{−ω_t})^θ, ω_w = |w − w^gt| / max(w, w^gt), ω_h = |h − h^gt| / max(h, h^gt)

wherein Ω denotes the shape cost, w, h and w^gt, h^gt are the width and height of the prediction box and the ground-truth box respectively, and θ denotes the importance of the shape loss;

the angle cost is defined as:

Λ = 1 − 2·sin²(arcsin(c_h/σ) − π/4)

wherein σ denotes the distance between the center points of the ground-truth box and the prediction box, c_h denotes the height difference between the two center points, α denotes the angle between the line connecting the two points and the horizontal, (b^gt_cx, b^gt_cy) denotes the center of the ground-truth box, and (b_cx, b_cy) denotes the center of the prediction box;

the distance cost, redefined in terms of the angle cost, is:

Δ = Σ_{t∈{x,y}} (1 − e^{−(2−Λ)ρ_t}), ρ_x = ((b^gt_cx − b_cx)/c_w)², ρ_y = ((b^gt_cy − b_cy)/c_h)²

wherein (b^gt_cx, b^gt_cy) denotes the center point of the ground-truth box, (b_cx, b_cy) denotes the center point of the prediction box, and c_w, c_h are the width and height of the minimum enclosing rectangle of the ground-truth and prediction boxes.
8. The hazardous chemical vehicle target detection method based on improved YOLOv5 of claim 1, wherein the specific steps of the step S6 are as follows:
s6.1, inputting the training set into the improved YOLOv5 algorithm model to train the model and test the result on the test set;
and S6.2, deploying the trained model on a server, and monitoring the dangerous chemical vehicle in real time.
CN202310834688.2A 2023-07-07 2023-07-07 Hazardous chemical vehicle detection method based on improved YOLOv5 Pending CN116863227A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310834688.2A CN116863227A (en) 2023-07-07 2023-07-07 Hazardous chemical vehicle detection method based on improved YOLOv5

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310834688.2A CN116863227A (en) 2023-07-07 2023-07-07 Hazardous chemical vehicle detection method based on improved YOLOv5

Publications (1)

Publication Number Publication Date
CN116863227A true CN116863227A (en) 2023-10-10

Family

ID=88228133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310834688.2A Pending CN116863227A (en) 2023-07-07 2023-07-07 Hazardous chemical vehicle detection method based on improved YOLOv5

Country Status (1)

Country Link
CN (1) CN116863227A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117935186A (en) * 2024-03-25 2024-04-26 福建省高速公路科技创新研究院有限公司 Method for identifying dangerous goods vehicles in tunnel under strong light inhibition



Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination