CN112464765B - Safety helmet detection method based on single-pixel characteristic amplification and application thereof - Google Patents

Safety helmet detection method based on single-pixel characteristic amplification and application thereof

Info

Publication number
CN112464765B
CN112464765B (application CN202011282208.9A)
Authority
CN
China
Prior art keywords
feature
characteristic
pixel
safety helmet
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011282208.9A
Other languages
Chinese (zh)
Other versions
CN112464765A (en)
Inventor
姜丽芬
周雍恒
孙华志
马春梅
梁妍
马建扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Normal University
Original Assignee
Tianjin Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Normal University
Publication of CN112464765A
Application granted
Publication of CN112464765B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/20 Scenes; Scene-specific elements in augmented reality scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a safety helmet detection method based on single-pixel feature amplification, and an application thereof. The detection algorithm comprises the following steps: preprocessing and augmenting the safety helmet data set; extracting a feature representation of the target with an Efficientnet-b0 network; filtering the backbone network features with a single-pixel feature scaling module to enhance the foreground elements in the features; performing multi-scale feature fusion on the enhanced features with a BiFPN feature fusion module; and feeding the fused features into a target prediction network that classifies and locates the targets. The SPZ-Det helmet-wearing detection algorithm mainly uses the SPZ module to scale the features, so that small-target features are not lost in the network, improving the algorithm's performance on small targets.

Description

Safety helmet detection method based on single-pixel characteristic amplification and application thereof
Technical Field
The invention relates to the technical field of deep learning and target detection, in particular to a safety helmet detection method based on single-pixel feature amplification and application thereof.
Background
A safety helmet is a mandatory safety precaution on construction sites. Research reports show that hundreds of construction workers in China are injured in construction accidents every year, mostly because on-site safety supervision is inadequate. Wearing a helmet is the most basic protective measure on a construction site, but workers with weak safety and self-protection awareness often remove their helmets for convenience while working, so that their lives are threatened once an accident occurs.
At present, helmet-wearing detection relies mainly on video monitoring and manual patrol, which cannot warn workers who are not wearing helmets in time and consume considerable human resources, so automatic helmet detection technology is very important. Helmet detection is a practical application of target detection. Early helmet detection compared the color distributions of the helmet and the human face to determine their relative positions, and decided from this position information whether a worker was wearing a helmet. Such color-distribution-based detection algorithms depend heavily on the color difference of the helmet and can hardly cope with environments containing many helmet types.
With the development of deep learning, deep neural networks can automatically capture finer-grained feature information, and these adaptively captured features help subsequent detection tasks predict target positions. Helmet detection methods in the deep learning era avoid dependence on a single feature, since the network adaptively acquires more precise feature information for predicting the target. General target detection algorithms fall into two categories: regression-based one-stage detectors represented by YOLO, SSD and RetinaNet; and region-based two-stage detectors such as Faster R-CNN. Some algorithms adopt the two-stage Faster R-CNN to pursue detection accuracy, but its complex computation makes detection very slow and hard to apply in real life.
In addition, helmet-wearing detection faces great challenges: the construction-site background varies widely and the scene is complex; individuals far from the camera are small and hard to distinguish from a cluttered background; and sites are crowded, so several people in one scene often occlude each other. These challenges greatly limit the performance of helmet-wearing detection algorithms.
Disclosure of Invention
The invention aims to provide a safety helmet detection method based on single-pixel feature amplification, addressing the complex detection steps, low detection speed and high recognition difficulty of helmet-wearing detection in the prior art.
In another aspect of the invention, the application of the safety helmet detection method based on single-pixel feature amplification in construction site monitoring is provided.
The technical scheme adopted for realizing the purpose of the invention is as follows:
a safety helmet detection method based on single-pixel feature amplification comprises the following steps:
step 1, preprocessing and enhancing a safety helmet data set to obtain preprocessed and enhanced sample data;
step 2, extracting a characteristic representation form of a target from the preprocessed and enhanced sample data obtained in the step 1 through an Efficientnet-b0 network to obtain a backbone network characteristic;
step 3, performing feature filtering on the backbone network features obtained in the step 2 by using a single-pixel feature scaling module, and enhancing foreground elements in the features to obtain new feature values;
step 4, performing multi-scale feature fusion operation on the new feature value obtained in the step 3 through a BiFPN feature fusion module to obtain a fused feature;
and 5, inputting the fused features obtained in the step 4 into a target prediction network, and classifying and positioning the targets.
In the above technical solution, the preprocessing in step 1 comprises the following steps:
step 1.1, expanding the safety helmet data set by horizontal flipping, so that each sample in the data set exists in both its original and mirrored form;
and 1.2, randomly inserting noise into the sample data to increase sample complexity and improve the robustness of the algorithm at the data level.
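The flip-and-noise augmentation of steps 1.1 and 1.2 can be sketched as follows; this is a minimal illustration, with the image layout, the (x1, y1, x2, y2) box format and the Gaussian noise level chosen as assumptions rather than taken from the patent:

```python
import numpy as np

def augment(image, boxes, noise_std=0.02, rng=None):
    """Horizontal flip (step 1.1) plus random noise (step 1.2).

    image: HxWxC float array in [0, 1]; boxes: Nx4 array of (x1, y1, x2, y2).
    Returns the mirrored, noised image and the mirrored boxes.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    flipped = image[:, ::-1, :]                # mirror the image left-right
    fb = boxes.astype(float).copy()
    fb[:, [0, 2]] = w - boxes[:, [2, 0]]       # mirror and swap x1, x2
    noisy = np.clip(flipped + rng.normal(0.0, noise_std, flipped.shape), 0.0, 1.0)
    return noisy, fb
```

Applying `augment` to every sample doubles the dataset while keeping each bounding box aligned with its mirrored target.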
In the above technical solution, when selecting the backbone feature layers of Efficientnet-b0 in step 2, the top three feature layers, one down-sampling layer feature and one lower-level layer feature are selected;
the backbone network features are extracted as follows:
step 2.1, among the feature maps extracted by Efficientnet-b0, for adjacent feature layers with different resolutions, two feature layers are kept, namely the low-level feature X1 and the high-level feature X2;
and 2.2, for feature layers with the same resolution, the algorithm selects only the high-level feature X2 as the feature representation for subsequent computation.
In the above technical solution, the new feature values in step 3 are obtained as follows:
a spatial attention enhancement is first applied to the backbone network features to obtain the main regions of the foreground elements, i.e. the attention-enhanced feature; within this feature the contribution of each pixel to the overall feature is computed, a feature contribution map is then derived from the contribution values, and different pixels are scaled accordingly, giving the scaled feature.
In the above technical solution, the new feature values in step 3 are obtained specifically through the following steps:
step 3.1, a simple spatial attention is applied once to the backbone feature to obtain the attention-enhanced feature F, as shown in formula (1):

F = S(v(R([max(f_i); mean(f_i)]))) ⊙ f_i   (1)

where max is max pooling, mean is average pooling, v is a 7 × 7 convolution, R and S stand for the ReLU and Sigmoid operations, ⊙ is element-wise multiplication, and f_i is the initial feature;
step 3.2, after obtaining the attention-enhanced feature F, a pixel-level feature amplification is performed on F: first the contribution value of each pixel point to the feature map is computed, then a feature contribution map is obtained from the contribution values, and the primary and secondary feature elements are scaled, specifically:

Feature_i(h, w) = n_i, if S(C_i)(h, w) > 1/(H × W)
Feature_i(h, w) = 1 - n_i, otherwise   (2)

where C_i is the feature of the i-th channel, n_i is the scaling value, f_i is the initial feature, H and W are the height and width of the feature map, and S is the Softmax function. First the Softmax score of each single-channel feature is obtained through S; the score represents the contribution of each pixel position to the channel's overall feature. The score is then compared with the channel's average single-pixel contribution 1/(H × W): if it is larger, the scaling value is set to n_i, and if it is smaller, to (1 - n_i), finally yielding the feature contribution map.
Step 3.3, the feature contribution map is multiplied element-wise with the input feature C_i to obtain the scaled feature; a residual structure is then introduced by adding the initial feature f_i, giving the new feature values used for feature fusion in the multi-scale module.
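Steps 3.1-3.3 can be sketched roughly in NumPy. Everything here is an illustrative assumption: the 7 × 7 convolution v is replaced by a parameter-free channel-pool sum, the ReLU/Sigmoid composition is simplified to a plain Sigmoid, and n = 0.7 is an arbitrary scaling value, not one taken from the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spz(feature, n=0.7):
    """Rough sketch of the single-pixel feature zoom on a (C, H, W) map."""
    C, H, W = feature.shape
    # step 3.1: spatial attention from max- and mean-pooled channel maps
    pooled = np.stack([feature.max(axis=0), feature.mean(axis=0)])  # (2, H, W)
    att = sigmoid(pooled.sum(axis=0))      # stand-in for the 7x7 convolution v
    F = feature * att                      # attention-enhanced feature
    # step 3.2: per-channel Softmax gives each pixel's contribution value
    flat = F.reshape(C, -1)
    score = np.exp(flat - flat.max(axis=1, keepdims=True))
    score = score / score.sum(axis=1, keepdims=True)
    avg = 1.0 / (H * W)                    # average single-pixel contribution
    zoom = np.where(score > avg, n, 1.0 - n).reshape(C, H, W)  # contribution map
    # step 3.3: scale the enhanced feature, then add the residual initial feature
    return zoom * F + feature
```

Pixels contributing more than the uniform average are kept at weight n, the rest are suppressed to 1 - n, and the residual addition preserves the original signal.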
In the above technical solution, the multi-scale feature fusion in step 4 proceeds as follows: the new feature values enhanced by the single-pixel scaling module are passed to the BiFPN feature fusion module, which fuses features of different sizes from different levels to compensate for the information lost through downsampling.
In the BiFPN feature fusion module, a three-layer cross-link operation maintains the transfer of the original backbone features, and a control factor adjusts the proportions of the different features.
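The control factors can be read as the fast normalized fusion weights used in BiFPN; below is a hedged sketch of fusing same-resolution features with non-negative, normalized weights (the epsilon value and the plain NumPy formulation are assumptions, not the patent's implementation):

```python
import numpy as np

def weighted_fuse(features, weights, eps=1e-4):
    """Fuse a list of same-shape feature maps with learnable control factors."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # keep factors non-negative
    w = w / (w.sum() + eps)                                # normalize the proportions
    return sum(wi * fi for wi, fi in zip(w, features))
```

With equal weights this reduces to a near-average of the inputs; training would instead learn how much each level contributes.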
In the above technical solution, in step 5, the two detection networks, each a three-layer CNN, classify and locate the targets.
In the above technical solution, the method for classifying and locating the target in step 5 comprises the following steps:
step 5.1, in the classification network, a Focal Loss calculation strategy limits the large number of background elements and keeps positive and negative samples balanced;
and 5.2, in the localization regression network, the smooth L1 function is used as the loss calculation strategy, as shown in formula (3), computing the loss between the predicted position offset and the offset of the sample's real position; the real-position offset is calculated by formula (4):

smooth_L1(x) = 0.5x², if |x| < 1
smooth_L1(x) = |x| - 0.5, otherwise, with x = reg - gt   (3)

dx = (tx - ax)/aw, dy = (ty - ay)/ah, dw = log(tw/aw), dh = log(th/ah)   (4)

where gt is the converted regression offset; reg is the predicted offset of the regression sub-network; and (dx, dy, dw, dh) is the regression label, i.e. the relative position offset between the true annotation box (tx, ty, tw, th) and the anchor box (ax, ay, aw, ah).
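Formulas (3) and (4) can be sketched as follows, assuming the usual (center-x, center-y, width, height) convention for both the annotation box and the anchor box:

```python
import numpy as np

def encode_offsets(gt_box, anchor):
    """Formula (4): encode a ground-truth box (tx, ty, tw, th) relative to an
    anchor (ax, ay, aw, ah) as the regression label (dx, dy, dw, dh)."""
    tx, ty, tw, th = gt_box
    ax, ay, aw, ah = anchor
    return np.array([(tx - ax) / aw, (ty - ay) / ah,
                     np.log(tw / aw), np.log(th / ah)])

def smooth_l1(reg, gt):
    """Formula (3): smooth L1 loss between predicted and target offsets."""
    d = np.abs(reg - gt)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()  # quadratic near 0, linear far out
```

The quadratic region keeps gradients small for near-correct predictions, while the linear region limits the influence of outliers.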
In another aspect, in the application of the safety helmet detection method based on single-pixel feature amplification to construction site monitoring, a foreground terminal monitors the construction site with a camera, the camera transmits real-time footage to a background processing terminal, and the background terminal runs the safety helmet detection method based on single-pixel feature amplification for detection and analysis, returning the result to the foreground terminal to remind workers in real time.
Compared with the prior art, the invention has the beneficial effects that:
1. During preprocessing, horizontal flipping is applied to the open-source helmet detection dataset, doubling the dataset and supplementing its content; noise is also inserted randomly to increase sample complexity and give the model stronger robustness at the data level.
2. The invention adopts Efficientnet-b0 as the backbone for feature extraction. To counter the severe loss of small-target feature information, a low-level feature layer is introduced into the backbone feature layers, increasing the share of small-target features in the network. Specifically, Efficientnet originally selects the top three feature layers plus two further down-sampled layers, five in total; here a lower-level layer is added to the backbone and one down-sampling layer is dropped, still five feature layers in total, meaning one more backbone feature layer is kept and one extra down-sampled layer is removed compared with the original Efficientnet network. To ensure that small-target features still reach the fusion and detection networks, the single-pixel enhancement module controls the features so that foreground elements are not lost.
3. The features extracted by the backbone are scaled at pixel level; the enhanced features are passed to the multi-scale BiFPN fusion module for interactive fusion of upper- and lower-level features, and the detection head network (target prediction network) then performs classification and localization of the targets.
4. For problems such as complex occlusion and small-target detection, the invention proposes the SPZ-Det detection model based on context attention and single-pixel feature scaling. The model introduces detail-rich low-level features into the network, ensuring effective small-target detection and addressing mutual occlusion of personnel, small targets, and the difficulty of extracting accurate features. The single-pixel feature zoom (SPZ) module strengthens the principal information in the features, ensures it is not ignored or replaced by noise features during inference, and alleviates feature loss during propagation.
5. The selection of feature layers and the introduction of the single-pixel feature scaling module resolve the loss of feature information during network propagation; comparative experiments verify the effectiveness of the SPZ module, and the model improves detection accuracy while maintaining detection speed, reaching 94% AP for helmet-wearing detection.
Drawings
FIG. 1 is a diagram of the SPZ-Det network model;
FIG. 2 is a block diagram of the SPZ module, where M is max pooling; A is average pooling; C is concatenation; S is the Sigmoid calculation; Ghost represents the Ghost Module; Feature_i represents formula (2).
Detailed Description
The present invention will be described in further detail with reference to specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
A safety helmet detection method based on single-pixel feature amplification comprises the following steps:
step 1, preprocessing and enhancing a safety helmet data set to obtain preprocessed and enhanced sample data:
step 1.1, expanding the safety helmet data set by horizontal flipping, so that each sample in the data set exists in both its original and mirrored form;
and 1.2, randomly inserting noise into the sample data to increase sample complexity and improve the robustness of the algorithm at the data level.
Step 2, extracting a characteristic representation form of a target through an Efficientnet-b0 network from the preprocessed and enhanced sample data obtained in the step 1 to obtain the characteristics of a backbone network:
step 2.1, among the feature maps extracted by Efficientnet-b0, for adjacent feature layers with different resolutions, two feature layers are kept, namely the low-level feature X1 and the high-level feature X2;
and 2.2, for feature layers with the same resolution, the algorithm selects only the high-level feature X2 as the feature representation for subsequent computation.
And 3, performing feature filtering on the backbone network features obtained in the step 2 by using a single-pixel feature scaling module, enhancing foreground elements in the features, and obtaining a new feature value:
step 3.1, a simple spatial attention is applied once to the backbone feature to obtain the attention-enhanced feature F, as shown in formula (1):

F = S(v(R([max(f_i); mean(f_i)]))) ⊙ f_i   (1)

where max is max pooling, mean is average pooling, v is a 7 × 7 convolution, R and S stand for the ReLU and Sigmoid operations, ⊙ is element-wise multiplication, and f_i is the initial feature;
step 3.2, after obtaining the attention-enhanced feature F, a pixel-level feature amplification is performed on F: first the contribution value of each pixel point to the feature map is computed, then a feature contribution map is obtained from the contribution values, and the primary and secondary feature elements are scaled, specifically:

Feature_i(h, w) = n_i, if S(C_i)(h, w) > 1/(H × W)
Feature_i(h, w) = 1 - n_i, otherwise   (2)

where C_i is the feature of the i-th channel, n_i is the scaling value, f_i is the initial feature, H and W are the height and width of the feature map, and S is the Softmax function. First the Softmax score of each single-channel feature is obtained through S; the score represents the contribution of each pixel position to the channel's overall feature. The score is then compared with the channel's average single-pixel contribution 1/(H × W): if it is larger, the scaling value is set to n_i, and if it is smaller, to (1 - n_i), finally yielding the feature contribution map.
Step 3.3, the feature contribution map is multiplied element-wise with the input feature C_i to obtain the scaled feature; a residual structure is then introduced by adding the initial feature f_i, giving the new feature values used for feature fusion in the multi-scale module. For the residual structure, see: He K, Zhang X, Ren S, et al. Deep residual learning for image recognition // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778, which is not described in detail here.
Step 4, performing multi-scale feature fusion operation on the new feature value obtained in the step 3 through a BiFPN feature fusion module to obtain a fused feature;
and the new characteristic value enhanced by the single-pixel scaling module is transmitted to a BiFPN characteristic fusion module, and the BiFPN characteristic fusion module performs characteristic fusion on the characteristics with different sizes of different levels to compensate the information lost due to downsampling.
In the BiFPN feature fusion module, three-layer cross-link operation is used for maintaining the original features in the backbone network to be transferred, and the proportion between different features is controlled by a control factor. BiFPN feature fusion module feature fusion can refer to Tan M, Pang R, Le Q V.Efficientdet: scalable and effective object detection [ C ]// Proceedings of the IEEE/CVFConreference on Computer Vision and Pattern recognition.2020: 10781-10790.
And 5, inputting the fused features obtained in step 4 into the target prediction network, whose two detection sub-networks, each a three-layer CNN, classify and locate the targets:
step 5.1, in the classification network, a Focal Loss calculation strategy limits the large number of background elements and keeps positive and negative samples balanced;
and 5.2, in the localization regression network, the smooth L1 function is used as the loss calculation strategy, as shown in formula (3), computing the loss between the predicted position offset and the offset of the sample's real position; the real-position offset is calculated by formula (4):

smooth_L1(x) = 0.5x², if |x| < 1
smooth_L1(x) = |x| - 0.5, otherwise, with x = reg - gt   (3)

dx = (tx - ax)/aw, dy = (ty - ay)/ah, dw = log(tw/aw), dh = log(th/ah)   (4)

where gt is the converted regression offset; reg is the predicted offset of the regression sub-network; and (dx, dy, dw, dh) is the regression label, i.e. the relative position offset between the true annotation box (tx, ty, tw, th) and the anchor box (ax, ay, aw, ah).
In another aspect, in the application of the safety helmet detection method based on single-pixel feature amplification to construction site monitoring, a foreground terminal monitors the construction site with a camera, the camera transmits real-time footage to a background processing terminal, and the background terminal runs the safety helmet detection method based on single-pixel feature amplification for detection and analysis, returning the result to the foreground terminal to remind workers in real time.
Example 2
This embodiment adopts the public Safety-Helmet-Wearing-Dataset provided by wensihaihui, comprising 7582 images with 9044 bounding boxes of people wearing helmets (positive class) and 111514 bounding boxes of people not wearing helmets (negative class); most of the negative samples come from the SCUT-HEAD dataset. The annotations contain many small heads and occluded, unclear targets; the data are cluttered and complex, and some annotations do not belong to the detection categories. During data reading, targets with wrong classes or that are too difficult to detect are removed first, yielding the dataset that finally participates in training. The 7582 images are split 8:2 into a training set and a test set, preserving the distribution of the original data.
The intersection-over-union IoU is commonly used in target detection to evaluate whether a prediction locates the position of a real target, as shown in formula (5):

IoU = area(DR ∩ GT) / area(DR ∪ GT)   (5)

where DR is the detection result box predicted by the network and GT is the GroundTruth box of the real sample; the larger the IoU, the better the model's prediction matches the real box. IoU serves as the criterion for accepting a prediction as final; in the experiments the threshold is set to 0.5, above which the predicted location box is considered valid.
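Formula (5) can be sketched with corner-format boxes (x1, y1, x2, y2), an assumed convention here:

```python
def iou(dr, gt):
    """Intersection over union of a detection box DR and a ground-truth box GT,
    both given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(dr[0], gt[0]), max(dr[1], gt[1])
    ix2, iy2 = min(dr[2], gt[2]), min(dr[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # overlap area, 0 if disjoint
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(dr) + area(gt) - inter
    return inter / union if union > 0 else 0.0
```

With the 0.5 threshold used in the experiments, a prediction would be kept when `iou(dr, gt) >= 0.5`.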
Detection performance is usually evaluated with the mAP value, the mean of the AP values over all predicted categories. To obtain the AP of a single category, precision and recall must first be obtained, as shown in formulas (6) and (7):

Precision = TP / (TP + FP)   (6)
Recall = TP / (TP + FN)   (7)

where TP, FP and FN are defined as shown in Table 1.

TABLE 1 TP, FP, FN definitions
TP: a predicted box that matches a real target box (IoU above the threshold)
FP: a predicted box that matches no real target box
FN: a real target box missed by every prediction

A PR curve is constructed from the precision and recall values, and the AP value is then calculated as shown in formula (8):

AP = ∫₀¹ P(R) dR   (8)

The AP value thus equals the area under the PR curve. The AP of each category is calculated by formula (8) and averaged to obtain the final mAP; the larger the mAP value, the better the network's detection performance.
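Formulas (6)-(8) can be sketched as a step-wise area under the PR curve; the inputs (per-detection confidence scores and TP flags) are assumptions about how detections were already matched to ground truth at IoU ≥ 0.5:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP as the area under the precision-recall curve.

    scores: confidence of each detection; is_tp: 1 if the detection matched a
    ground-truth box, else 0; num_gt: total number of ground-truth boxes.
    """
    order = np.argsort(scores)[::-1]                     # rank by confidence
    flags = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(flags)                                # running TP count
    fp = np.cumsum(1.0 - flags)                          # running FP count
    recall = tp / num_gt                                 # formula (7)
    precision = tp / (tp + fp)                           # formula (6)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):                  # step integral of P(R)
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

The mAP is then the mean of `average_precision` over all categories.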
Selecting the convolution module: to achieve the best effect with minimal computation, standard convolution, depthwise-separable convolution and Ghost convolution were compared, and Ghost convolution, with good effect and low computation cost, was finally chosen as the convolution scheme of the single-pixel scaling module.
To verify the validity of the proposed model, several baseline models were used for comparative experiments:
(1) Efficientdet-d0: the original Efficientdet model performs poorly on the small-target Person category; its final mAP is only 52.3%.
(2) Efficientdet-change: the improved Efficientdet model adds low-level feature information; its mAP reaches 77.9%.
(3) YOLOv3: tested on the same dataset, YOLOv3 reaches an mAP of 71.4%.
(4) YOLOv3+SPZ: introducing the proposed single-pixel scaling module into YOLOv3 raises the mAP to 73.5%.
(5) SPZ-Det: the final proposed model, built on the Efficientdet structure combined with low-level features and the SPZ single-pixel scaling module; its mAP reaches 80.2% and its AP for helmet-wearing detection reaches 94.6%.
Comparative analysis shows that attending to low-level features improves the representational capability of network features, and that the single-pixel feature scaling module can be embedded in other detection models to improve their performance; the experimental results are shown in Table 2.
TABLE 2 results of the experiment
Detection method AP(hat) AP(person) mAP
Efficientdet-d0 79.6% 24.9% 52.3%
Efficientdet-change 93.9% 61.9% 77.9%
YOLOv3 86.3% 56.4% 71.4%
YOLOv3+SPZ 91.5% 55.5% 73.5%
SPZ-Det 94.6% 65.8% 80.2%
Feature extraction uses the Efficientnet-b0 backbone; important feature layers are reselected to strengthen the representation of the extracted features, and the single-pixel feature scaling module introduced into the model alleviates the disappearance of small-target features during computation. The model finally reaches a helmet-wearing detection accuracy of 94.6% (AP) and an overall mAP of 80.2%.
Example 3
A monitoring system is built on the safety helmet detection method based on single-pixel feature amplification: a foreground terminal monitors the construction site with a camera, the camera transmits real-time footage to a background processing terminal, the background terminal runs the detection method for analysis, and the result is returned to the foreground terminal to remind workers in real time.
The foregoing is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, various modifications and refinements can be made without departing from the principle of the present invention, and these modifications and refinements shall also fall within the protection scope of the present invention.

Claims (9)

1. A safety helmet detection method based on single-pixel characteristic amplification is characterized by comprising the following steps:
step 1, preprocessing and enhancing a safety helmet data set to obtain preprocessed and enhanced sample data;
step 2, extracting a characteristic representation form of a target from the preprocessed and enhanced sample data obtained in the step 1 through an Efficientnet-b0 network to obtain a backbone network characteristic;
step 3, performing feature filtering on the backbone network features obtained in the step 2 with a single-pixel feature scaling module, and enhancing the foreground elements in the features to obtain new feature values; the new feature values are obtained as follows: a spatial attention enhancement is first applied to the backbone network features to obtain the main region of the foreground elements, i.e., the attention-enhanced features; within the attention-enhanced features, the contribution of each pixel to the overall feature is computed, a feature contribution map is then obtained from these contribution values, different pixels are scaled accordingly, and the scaled features are obtained;
step 4, performing multi-scale feature fusion operation on the new feature value obtained in the step 3 through a BiFPN feature fusion module to obtain a fused feature;
and 5, inputting the fused features obtained in the step 4 into a target prediction network, and classifying and positioning the targets.
2. The safety helmet detection method based on single-pixel feature amplification as claimed in claim 1, wherein the preprocessing in the step 1 comprises the following steps:
step 1.1, expanding a safety helmet data set by using a horizontal turning method, so that each sample in the safety helmet data set has sample data in positive and negative forms;
and step 1.2, randomly inserting noise into the sample data to increase sample complexity and improve the robustness of the method at the data level.
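Steps 1.1 and 1.2 can be sketched as a small NumPy routine; the noise scale `noise_std` and the Gaussian noise model are illustrative assumptions, as the claim does not fix either:

```python
import numpy as np

def augment(image, noise_std=5.0, rng=None):
    """Sketch of the preprocessing of claim 2: the flip (step 1.1) gives each
    sample a mirrored counterpart, and the noise insertion (step 1.2) raises
    sample complexity. noise_std is an illustrative choice."""
    rng = np.random.default_rng(0) if rng is None else rng
    flipped = image[:, ::-1].copy()                   # step 1.1: horizontal flip (H, W[, C])
    noise = rng.normal(0.0, noise_std, image.shape)   # step 1.2: random noise
    noisy = np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    return flipped, noisy
```

In practice each training sample would yield both augmented copies, doubling the positive/negative-form coverage described in step 1.1.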
3. The method for detecting the safety helmet based on the single-pixel feature amplification of claim 1, wherein in the step 2, when selecting the main feature layer of Efficientnet-b0, the topmost feature, the downsampling feature and the next lower feature are selected;
the backbone network characteristics are extracted by the following method:
step 2.1, in the feature maps extracted by Efficientnet-b0, for feature layers whose resolution differs from that of the layer above, two feature layers are kept: a low-level feature X1 and a high-level feature X2;
and 2.2, for the feature layer with the same resolution, only selecting the high-layer feature X2 as the feature representation of the subsequent calculation.
4. The method for detecting the safety helmet based on the single-pixel feature amplification as claimed in claim 1, wherein the new feature value in the step 3 is obtained by the following steps:
step 3.1, applying a spatial attention computation to the backbone network feature f to obtain the attention-enhanced feature F, as shown in formula (1):

F = S( v( [max(f); mean(f)] ) ) ⊗ f    (1)

wherein max is maximum pooling and mean is average pooling along the channel dimension, [ ; ] denotes concatenation of the two pooled maps, v is a 7 × 7 convolution, ⊗ denotes element-wise multiplication, S stands for the ReLU and Sigmoid operations, f is the initial backbone network feature, and F is the attention-enhanced feature;
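A minimal NumPy sketch of the spatial attention of formula (1) follows; the random 7×7 kernel stands in for the learned convolution v, and applying Sigmoid after ReLU follows the claim's description of S (both are assumptions about the exact layer order):

```python
import numpy as np

def spatial_attention(f, v=None):
    """Sketch of formula (1), assuming f has shape (C, H, W)."""
    C, H, W = f.shape
    mx = f.max(axis=0)     # max pooling over channels
    mn = f.mean(axis=0)    # average pooling over channels
    stacked = np.stack([mx, mn])                      # [max(f); mean(f)] -> (2, H, W)
    if v is None:  # a random kernel stands in for the learned 7x7 convolution
        v = np.random.default_rng(0).normal(0.0, 0.1, (2, 7, 7))
    pad = np.pad(stacked, ((0, 0), (3, 3), (3, 3)))   # same-size 7x7 convolution
    conv = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            conv[i, j] = np.sum(pad[:, i:i+7, j:j+7] * v)
    att = 1.0 / (1.0 + np.exp(-np.maximum(conv, 0)))  # S: ReLU then Sigmoid
    return att * f                                    # element-wise scaling of f
```

Because the attention map lies in (0, 1], the output never exceeds the input feature in magnitude; foreground positions are simply suppressed less than background ones.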
step 3.2, after obtaining the attention-enhanced feature F, performing a pixel-level feature amplification operation on it: the contribution value of each pixel to the feature map is computed first, a feature contribution map is then obtained from the contribution values, and the primary and secondary feature elements are scaled, specifically as formula (2):

m_i(h, w) = n_i        if S(C_i)(h, w) > 1/(H × W)
m_i(h, w) = 1 − n_i    otherwise    (2)

wherein C_i is the feature of the i-th channel, m_i is its feature contribution map, n_i is the scaling value, f_i is the initial feature, H and W are the height and width of the feature map, and S is the Softmax function; the Softmax score of each single-channel feature is first obtained through S, the score representing the contribution of each pixel position to the overall feature of that channel; the contribution is then compared with the channel's average single-pixel contribution 1/(H × W): if it is larger, the scaling value is set to n_i, and if it is smaller, to (1 − n_i), finally yielding the feature contribution map;
step 3.3, performing a dot-product (element-wise) operation between the feature contribution map and the input feature C_i to obtain the scaled feature; finally, a residual structure is introduced and the initial feature f_i is added, giving the new feature values used for multi-scale feature fusion.
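Steps 3.2 and 3.3 can be sketched in NumPy as below; a single scaling value `n` shared by all channels stands in for the patent's per-channel n_i, which is an assumption of this sketch:

```python
import numpy as np

def single_pixel_scaling(f, n=0.7):
    """Sketch of the single-pixel feature scaling of claim 4, steps 3.2-3.3,
    assuming f has shape (C, H, W)."""
    C, H, W = f.shape
    flat = f.reshape(C, H * W)
    # Softmax over the spatial positions of each channel -> contribution scores
    e = np.exp(flat - flat.max(axis=1, keepdims=True))
    score = e / e.sum(axis=1, keepdims=True)
    # positions above the mean contribution 1/(H*W) get n, the rest 1 - n
    contrib = np.where(score > 1.0 / (H * W), n, 1.0 - n).reshape(C, H, W)
    # scale the features, then add the residual (the initial feature f)
    return contrib * f + f
```

With n > 0.5, pixels that contribute more than average are amplified relative to the rest, while the residual term keeps every initial feature value present for the subsequent BiFPN fusion.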
5. The safety helmet detection method based on single-pixel feature amplification as claimed in claim 1, wherein the multi-scale feature fusion operation method in the step 4 is as follows: and the new characteristic value enhanced by the single-pixel scaling module is transmitted to a BiFPN characteristic fusion module, and the BiFPN characteristic fusion module performs characteristic fusion on the characteristics with different sizes of different levels to compensate the information lost due to downsampling.
6. The method as claimed in claim 5, wherein in the BiFPN feature fusion module, a three-layer cross-link operation is used so that the original features in the backbone network continue to be propagated, and a control factor controls the proportion between different features.
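The control-factor idea of claim 6 matches BiFPN-style weighted fusion; a minimal sketch follows, with uniform stand-in weights where the real model would learn them (the `eps` stabilizer follows the usual BiFPN normalization, an assumption here):

```python
import numpy as np

def weighted_fusion(feats, w=None, eps=1e-4):
    """Sketch of weighted fusion with one control factor per input feature,
    assuming all feats share the same shape."""
    feats = [np.asarray(f, dtype=np.float64) for f in feats]
    w = np.ones(len(feats)) if w is None else np.asarray(w, dtype=np.float64)
    w = np.maximum(w, 0.0)        # keep the control factors non-negative
    w = w / (w.sum() + eps)       # normalize so the factors sum to ~1
    return sum(wi * fi for wi, fi in zip(w, feats))
```

Raising one factor lets a cross-linked backbone feature dominate the fused output, which is how the ratio between original and processed features is controlled.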
7. The helmet detection method based on single-pixel feature amplification of claim 1, wherein in the step 5, the target is classified and located by detection heads that are each a three-layer CNN.
8. The safety helmet detection method based on single-pixel feature amplification of claim 1, wherein the method for classifying and positioning the target in the step 5 comprises the following steps:
step 5.1, in the classification network, limiting the large number of background elements with a Focal Loss calculation strategy to keep the positive and negative samples balanced;
step 5.2, in the positioning regression network, using the smooth L1 function of formula (3) as the loss calculation strategy to compute the loss between the predicted position offset and the offset of the sample's true position, where the true-position offset is calculated by formula (4):

smooth_L1(x) = 0.5 x²        if |x| < 1
smooth_L1(x) = |x| − 0.5     otherwise    (3)

gt = ( (tx − ax)/aw, (ty − ay)/ah, log(tw/aw), log(th/ah) )    (4)

wherein gt is the converted regression offset; reg denotes the predicted offset of the regression sub-network, and the loss is smooth_L1(reg − gt); (dx, dy, dw, dh) is the regression label, replaced by the relative position offset between the true annotation box (tx, ty, tw, th) and the anchor box (ax, ay, aw, ah).
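Formulas (3) and (4) can be sketched directly in NumPy; the box layout (center x, center y, width, height) is the standard anchor-offset convention and is an assumption here, as the claim only names the components:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss of formula (3), applied elementwise to reg - gt."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x * x, x - 0.5)

def encode_offsets(t, a):
    """Regression targets of formula (4): offsets of the true annotation box
    t = (tx, ty, tw, th) relative to the anchor box a = (ax, ay, aw, ah)."""
    tx, ty, tw, th = t
    ax, ay, aw, ah = a
    return np.array([(tx - ax) / aw, (ty - ay) / ah,
                     np.log(tw / aw), np.log(th / ah)])
```

An anchor that coincides with its ground-truth box encodes to a zero offset vector, so a perfect prediction incurs zero smooth-L1 loss.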
9. Application of the safety helmet detection method based on single-pixel feature amplification as claimed in any one of claims 1 to 8 to construction site monitoring, wherein a front-end terminal monitors the construction site with a camera, the camera transmits real-time footage to a back-end processing terminal, the back-end terminal performs detection and analysis by executing the safety helmet detection method based on single-pixel feature amplification, and the result is returned to the front-end terminal to remind workers in real time.
CN202011282208.9A 2020-09-10 2020-11-16 Safety helmet detection method based on single-pixel characteristic amplification and application thereof Active CN112464765B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010949870 2020-09-10
CN2020109498709 2020-09-10

Publications (2)

Publication Number Publication Date
CN112464765A CN112464765A (en) 2021-03-09
CN112464765B true CN112464765B (en) 2022-09-23

Family

ID=74837081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011282208.9A Active CN112464765B (en) 2020-09-10 2020-11-16 Safety helmet detection method based on single-pixel characteristic amplification and application thereof

Country Status (1)

Country Link
CN (1) CN112464765B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011365A (en) * 2021-03-31 2021-06-22 中国科学院光电技术研究所 Target detection method combined with lightweight network
CN114462555B (en) 2022-04-13 2022-08-16 国网江西省电力有限公司电力科学研究院 Multi-scale feature fusion power distribution network equipment identification method based on raspberry group

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN110796009A (en) * 2019-09-29 2020-02-14 航天恒星科技有限公司 Method and system for detecting marine vessel based on multi-scale convolution neural network model


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CBAM: Convolutional Block Attention Module;Sanghyun Woo et al.;《arXiv》;20180718;第1-17页 *
EfficientDet: Scalable and Efficient Object Detection;Mingxing Tan et al.;《arXiv》;20200727;第1-10页 *

Also Published As

Publication number Publication date
CN112464765A (en) 2021-03-09

Similar Documents

Publication Publication Date Title
CN108053427B (en) Improved multi-target tracking method, system and device based on KCF and Kalman
CN108009473B (en) Video structuralization processing method, system and storage device based on target behavior attribute
CN109670441B (en) Method, system, terminal and computer readable storage medium for realizing wearing recognition of safety helmet
CN108052859B (en) Abnormal behavior detection method, system and device based on clustering optical flow characteristics
CN103824070B (en) A kind of rapid pedestrian detection method based on computer vision
CN109918971B (en) Method and device for detecting number of people in monitoring video
CN104978567B (en) Vehicle checking method based on scene classification
CN109935080B (en) Monitoring system and method for real-time calculation of traffic flow on traffic line
CN112464765B (en) Safety helmet detection method based on single-pixel characteristic amplification and application thereof
CN102521565A (en) Garment identification method and system for low-resolution video
Ahmad et al. Overhead view person detection using YOLO
CN111401310B (en) Kitchen sanitation safety supervision and management method based on artificial intelligence
CN111753651A (en) Subway group abnormal behavior detection method based on station two-dimensional crowd density analysis
CN112163477B (en) Escalator pedestrian pose target detection method and system based on Faster R-CNN
CN114842397A (en) Real-time old man falling detection method based on anomaly detection
CN112163572A (en) Method and device for identifying object
KR101030257B1 (en) Method and System for Vision-Based People Counting in CCTV
CN113989858B (en) Work clothes identification method and system
CN112184773A (en) Helmet wearing detection method and system based on deep learning
CN114092877A (en) Garbage can unattended system design method based on machine vision
CN114885119A (en) Intelligent monitoring alarm system and method based on computer vision
CN109325426B (en) Black smoke vehicle detection method based on three orthogonal planes time-space characteristics
CN106372566A (en) Digital signage-based emergency evacuation system and method
CN109064444B (en) Track slab disease detection method based on significance analysis
CN114422720A (en) Video concentration method, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant