CN112949572A - Slim-YOLOv3-based mask wearing condition detection method - Google Patents

Slim-YOLOv3-based mask wearing condition detection method

Info

Publication number
CN112949572A
CN112949572A CN202110330611.2A CN202110330611A
Authority
CN
China
Prior art keywords
mask
convolution
network
slim
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110330611.2A
Other languages
Chinese (zh)
Other versions
CN112949572B (en)
Inventor
姜小明
向富贵
张中华
吕明鸿
王添
赖春红
王伟
李章勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110330611.2A priority Critical patent/CN112949572B/en
Publication of CN112949572A publication Critical patent/CN112949572A/en
Application granted granted Critical
Publication of CN112949572B publication Critical patent/CN112949572B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/165Detection; Localisation; Normalisation using facial parts and geometric relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical fields of deep learning object detection and computer vision, and in particular relates to a mask wearing condition detection method based on Slim-YOLOv3, which comprises the following steps: acquiring face video data in real time and preprocessing it; inputting the preprocessed face images into a trained Slim-YOLOv3 model and judging whether the user is wearing the mask correctly. The method uses a Slim-YOLOv3-based video detection approach for mask wearing conditions, and an improved unsupervised self-classification method is used to divide the data of irregularly worn masks into subclasses, so that the task of detecting mask wearing conditions in video can be carried out more accurately and more quickly. The proposed network is also more compact, which further reduces the application cost.

Description

Slim-YOLOv3-based mask wearing condition detection method
Technical Field
The invention belongs to the technical field of deep learning object detection and computer vision, and in particular relates to a mask wearing condition detection method based on Slim-YOLOv3.
Background
Harmful gases, odors, sprays, viruses and the like can enter the human body through the air, and wearing a mask in the standard way can effectively prevent such substances from entering the body. Wearing a mask properly not only prevents the virus from spreading from asymptomatic carriers to other people and reduces the probability of secondary transmission, thereby protecting others, but also protects the wearer by reducing the amount of virus the wearer is exposed to, so that the risk of infection is lower.
In recent years, deep learning has made great progress in object detection, image classification, semantic segmentation and other fields. Algorithms built on convolutional neural networks have improved greatly in both accuracy and speed. The video detection task for mask wearing conditions is an object detection problem, and object detection is a multi-task deep learning problem that combines object classification with object localization.
At present, according to the requirements of the actual detection task, two key issues must be addressed for video detection:
(1) real-time performance: only real-time on-site video detection can effectively capture the mask wearing condition of the current subject;
(2) high accuracy: only when the mask wearing condition of the current subject is obtained accurately can the detection provide effective assistance.
At present, although many mask wearing condition video detection devices are already in practical use, devices with high detection accuracy often consume expensive computing resources, while cheap detectors cannot achieve high accuracy.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a mask wearing condition detection method based on Slim-YOLOv3, which comprises the following steps: acquiring face video data in real time and preprocessing it; inputting the preprocessed face images into a trained improved Slim-YOLOv3 model and judging whether the user wears the mask correctly; the improved Slim-YOLOv3 model comprises a backbone network Darknet-53, a feature enhancement and prediction network, and a decoding network;
the process of training the improved Slim-Yolov3 model includes:
s1: acquiring an original data set, and preprocessing the original data set to obtain a training sample set and a test sample set;
s2: classifying and re-labeling the data in the training sample set and the test sample set;
s3: inputting the classified training sample set into a backbone network Darknet-53 for multi-scale transformation, and extracting a plurality of scale features;
s4: inputting a plurality of scale features into a feature enhancement and prediction network to obtain a classification prediction result;
s5: inputting the classification prediction result into a decoding network for decoding;
s6: calculating a loss function of the model according to the decoding result;
s7: inputting the data in the test set into the model for prediction, optimizing the loss function of the model according to the prediction results, and finishing the training of the model when the change of the loss function is small or the maximum number of iterations is reached.
Preferably, preprocessing the raw data set comprises: compressing and flipping the data in the original data set and changing the image brightness to obtain enhanced image data; and splitting the enhanced image data to obtain a training sample set and a test sample set.
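For illustration, the preprocessing described above (compression, flipping, brightness change, and the split into training and test sets) could be sketched as follows. This is a minimal sketch only; the 416 × 416 target size matches the input size used later in the description, while the brightness range, the flip probability and the 8:2 split ratio are assumptions made for the example.

```python
import random
from PIL import Image, ImageEnhance

def augment_image(path, target_size=(416, 416)):
    """Compress (resize), randomly flip and randomly change the brightness of one image.

    The brightness range and the flip probability are illustrative assumptions.
    """
    img = Image.open(path).convert("RGB")
    img = img.resize(target_size)                      # compression / resizing
    if random.random() < 0.5:                          # random horizontal flip
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    factor = random.uniform(0.7, 1.3)                  # brightness change
    return ImageEnhance.Brightness(img).enhance(factor)

def split_dataset(samples, train_ratio=0.8):
    """Split the enhanced data into a training sample set and a test sample set."""
    random.shuffle(samples)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]
```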
Preferably, the process of classifying the data in the training sample set and the test sample set includes: dividing the face mask wearing conditions into three categories according to the images in the original data set, namely standard mask wearing images, non-standard mask wearing images and no-mask images; and reclassifying the non-standard mask wearing images with an improved unsupervised image self-classification method (SCAN) to obtain a plurality of subclasses.
Further, the process of reclassifying the non-standard mask wearing images with the improved unsupervised image self-classification method SCAN comprises the following steps:
Step 1: extracting the face regions of non-standard mask wearing samples in the mask data set as a training set;
Step 2: performing classification training on the face region data of the mask wearing condition data set with an ECAResnet50 network to obtain pre-training weights;
Step 3: importing the pre-training weights into an adversarial network built on ECAResnet50, and extracting high-level semantic features of the images;
Step 4: calculating the cosine similarity between the high-level semantic features, and assigning the images whose semantic features have higher similarity as neighbors;
Step 5: performing clustering learning with the nearest neighbors as a prior;
Step 6: fine-tuning the clustered images through self-labelling to obtain pseudo labels for four subclasses.
Further, the cosine similarity of the high-level semantic features is calculated as:

$$\cos\theta=\frac{\sum_{i=1}^{n}x_{i}y_{i}}{\sqrt{\sum_{i=1}^{n}x_{i}^{2}}\sqrt{\sum_{i=1}^{n}y_{i}^{2}}}$$

where x_i and y_i denote the i-th dimension of the two semantic feature vectors and n denotes the total number of dimensions of the vectors.
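As an illustration of Step 4 above, the cosine similarity matrix and the neighbor assignment could be computed as in the following sketch; the use of PyTorch and the choice of k = 20 neighbors are assumptions for the example, not values fixed by the patent.

```python
import torch
import torch.nn.functional as F

def mine_neighbors(features: torch.Tensor, k: int = 20) -> torch.Tensor:
    """Return, for every sample, the indices of its k most similar samples.

    features: (N, d) matrix of high-level semantic feature vectors.
    """
    normed = F.normalize(features, dim=1)         # divide each vector by its L2 norm
    similarity = normed @ normed.t()              # (N, N) cosine similarity matrix
    similarity.fill_diagonal_(-1.0)               # exclude the sample itself
    _, neighbor_idx = similarity.topk(k, dim=1)   # k nearest neighbors per sample
    return neighbor_idx
```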
preferably, the process of extracting the multi-scale features of the classified images in the training sample set by using the backbone network ECADarknet-53 includes: inputting the image into a data enhancement module, and adjusting the image to 416 × 3; inputting the adjusted image into an ECADarknet53 network, and performing primary convolution dimensionality increase on the image by adopting a convolution block to obtain an image with the size of batch _ size 416 32; extracting the characteristics of the graph after dimension increase of the convolution by adopting five residual volume blocks of an attention mechanism ECANet module, wherein the extracted characteristic scale is increased after each residual volume block passes through, and finally outputting two characteristic layers obtained by a fourth residual volume block and a fifth residual volume block; where batch _ size represents the number of images input to the network at a time.
Further, the process by which the attention mechanism ECANet module processes the features comprises the following steps: performing channel-wise global average pooling on the feature layer without dimensionality reduction; for each channel, selecting the data of its k neighboring channels, applying a 1 × 1 convolution to the globally average-pooled data, and passing it through a sigmoid activation function; and expanding the activated data to the size of the input features and multiplying it with the input features to obtain enhanced features containing information from multiple channels. The ECANet module is added to each of the five residual convolution blocks in ECADarknet-53; each residual unit is obtained by adding the output of an ECA module, placed after two convolutions, to the unit input.
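A minimal PyTorch sketch of an ECA-style attention block following the description above (channel-wise global average pooling without dimensionality reduction, a one-dimensional convolution over k neighboring channels, a sigmoid activation, and rescaling of the input); the kernel size k = 3 and the class name are assumptions made for the example.

```python
import torch
import torch.nn as nn

class ECAModule(nn.Module):
    """Efficient channel attention, roughly as described in the text."""

    def __init__(self, k: int = 3):
        super().__init__()
        # one-dimensional convolution over k neighboring channels, no dimension reduction
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, C, H, W) -> channel-wise global average pooling
        y = x.mean(dim=(2, 3)).unsqueeze(1)   # (batch, 1, C)
        y = self.conv(y)                      # exchange information among k neighbor channels
        y = self.sigmoid(y).squeeze(1)        # (batch, C) channel weights
        # expand to the input size and multiply with the input features
        return x * y.unsqueeze(-1).unsqueeze(-1)
```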
Preferably, the processing of the multiple scale features by the feature enhancement and prediction network includes the following steps (a structural sketch is given after this list):
Step 1: performing five convolution operations on the features obtained from the fifth residual convolution block of the ECADarknet53 network;
Step 2: applying one 3 × 3 convolution and then one 1 × 1 convolution to the convolved features, and taking the result as the prediction output of the scale feature layer corresponding to the fifth residual convolution block;
Step 3: performing an UpSampling2D (deconvolution) operation on the features after the five convolutions, stacking them with the feature layer obtained from the fourth residual convolution block, and fusing and enhancing the information of the two scales;
Step 4: performing five convolution operations on the fused feature map, then one 3 × 3 convolution and one 1 × 1 convolution, to obtain the prediction output of the scale feature layer corresponding to the fourth residual convolution block;
Step 5: outputting the prediction results of the two scale feature layers, where the prediction result of each scale comprises, for each grid point, the prediction boxes corresponding to the two prior boxes and their classes, i.e. the positions, confidences and classes of the prior boxes at each grid point after the two feature layers have been divided into grids of different sizes corresponding to the picture.
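The following is a structural sketch of the two-scale feature enhancement and prediction head described in the list above, written in PyTorch; the internal channel counts, the use of two anchors per grid point and the helper names are assumptions for illustration, chosen to match the K × K × [2 × (4 + 1 + C)] output mentioned later in the description.

```python
import torch
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k):
    """Convolution + BatchNorm + LeakyReLU, the basic block assumed here."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

def five_convs(c_in, c_mid):
    """The 'five convolution operations' block, alternating 1x1 and 3x3 kernels."""
    return nn.Sequential(
        conv_bn_leaky(c_in, c_mid, 1), conv_bn_leaky(c_mid, c_mid * 2, 3),
        conv_bn_leaky(c_mid * 2, c_mid, 1), conv_bn_leaky(c_mid, c_mid * 2, 3),
        conv_bn_leaky(c_mid * 2, c_mid, 1),
    )

class TwoScaleHead(nn.Module):
    """Prediction over two scales (13x13 and 26x26), 2 anchors per grid point."""

    def __init__(self, num_classes: int, num_anchors: int = 2):
        super().__init__()
        out_ch = num_anchors * (4 + 1 + num_classes)
        self.convs13 = five_convs(1024, 512)
        self.head13 = nn.Sequential(conv_bn_leaky(512, 1024, 3),
                                    nn.Conv2d(1024, out_ch, 1))
        self.reduce = conv_bn_leaky(512, 256, 1)
        self.up = nn.Upsample(scale_factor=2)
        self.convs26 = five_convs(512 + 256, 256)
        self.head26 = nn.Sequential(conv_bn_leaky(256, 512, 3),
                                    nn.Conv2d(512, out_ch, 1))

    def forward(self, feat26, feat13):
        x13 = self.convs13(feat13)                          # five convolutions
        p13 = self.head13(x13)                              # 3x3 then 1x1 prediction
        x = self.up(self.reduce(x13))                       # upsample to 26x26
        x26 = self.convs26(torch.cat([x, feat26], dim=1))   # fuse the two scales
        p26 = self.head26(x26)
        return p13, p26
```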
Preferably, the process of inputting the classification prediction result into the decoding network for decoding includes the following steps (a decoding sketch is given after this list):
Step 1: adding the corresponding x_offset and y_offset to each grid point to obtain the center of the prediction box;
Step 2: combining the prior box with h and w to calculate the length and width of the prediction box;
Step 3: calculating the localization loss from the position information and the actual annotation information, and calculating the classification loss from the predicted class information and the annotated class information;
Step 5: determining the position of the ground-truth box in the picture and which grid point it belongs to for detection;
Step 6: calculating the overlap between the ground-truth box and the prior boxes, and selecting the prior box with the highest overlap for verification;
Step 7: obtaining the prediction the network should produce and comparing it with the actual prediction.
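The decoding of Steps 1 and 2 could look like the following sketch; standard YOLO-style decoding is assumed here (sigmoid-activated offsets added to the grid coordinates, and exponentially scaled prior-box sizes), and the variable names are illustrative.

```python
import torch

def decode_predictions(raw, anchors, grid_size):
    """Decode raw network outputs into boxes in normalized image coordinates.

    raw:     (batch, num_anchors, grid_size, grid_size, 4) with (tx, ty, tw, th)
    anchors: (num_anchors, 2) prior-box widths/heights, normalized to [0, 1]
    """
    ys, xs = torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size),
                            indexing="ij")
    # center = grid point + predicted offset (x_offset, y_offset)
    cx = (torch.sigmoid(raw[..., 0]) + xs) / grid_size
    cy = (torch.sigmoid(raw[..., 1]) + ys) / grid_size
    # width/height = prior box scaled by the predicted w, h
    pw = anchors[:, 0].view(1, -1, 1, 1)
    ph = anchors[:, 1].view(1, -1, 1, 1)
    w = pw * torch.exp(raw[..., 2])
    h = ph * torch.exp(raw[..., 3])
    return torch.stack([cx, cy, w, h], dim=-1)
```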
Preferably, the loss function of the model is expressed as:

$$\begin{aligned}
Loss ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\Big[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}\Big]
+\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\Big[\big(\sqrt{\omega_{i}}-\sqrt{\hat{\omega}_{i}}\big)^{2}+\big(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\big)^{2}\Big] \\
&+\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\big(C_{i}-\hat{C}_{i}\big)^{2}
+\lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{noobj}\big(C_{i}-\hat{C}_{i}\big)^{2}
+\sum_{i=0}^{S^{2}} I_{i}^{obj}\sum_{c\in classes}\big(P_{i}(c)-\hat{P}_{i}(c)\big)^{2}
\end{aligned}$$

where the symbols are defined in the detailed description below.
preferably, the model introduces pre-training weights when classifying the data: the backbone network parameters are first frozen and the subsequent network layers are trained for 50 iterations, then all parameters are unfrozen and trained for another 100 iterations, and the weights with the lowest classification loss and total loss are taken as the final training result.
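A sketch of the freeze-then-unfreeze training schedule described above, assuming PyTorch; `model.backbone`, `train_one_epoch`, the learning rates and the optimizer choice are assumptions made for the example, and only the 50/100 iteration counts come from the text.

```python
import torch

def freeze_then_finetune(model, train_one_epoch, freeze_epochs=50, unfreeze_epochs=100):
    """Two-stage training: freeze the backbone first, then unfreeze everything."""
    # Stage 1: freeze the (pre-trained) backbone, train the remaining layers.
    for p in model.backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
    for _ in range(freeze_epochs):
        train_one_epoch(model, opt)

    # Stage 2: unfreeze all parameters and continue training at a lower rate.
    for p in model.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    best = None
    for _ in range(unfreeze_epochs):
        loss = train_one_epoch(model, opt)
        if best is None or loss < best:          # keep the weights with the lowest loss
            best = loss
            torch.save(model.state_dict(), "best_weights.pth")
```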
The invention has the beneficial effects that:
according to the invention, the mask wearing condition video detection method based on YOLOv3 is adopted, so that the mask wearing video detection task can be realized more accurately and rapidly. And the proposed network is more concise, so that the application cost is further reduced. By further subclassing the data set and adding an ECANet attention mechanism module in a backbone network, the detection precision of the network is improved; by deleting the network feature layer of the minimum scale in the YOLOv3, the network is more focused on the targets of the large and medium scales, and the network detection speed is further improved.
Drawings
FIG. 1 is a diagram illustrating the division of the data set into three major classes in the present invention;
FIG. 2 is a diagram illustrating the subdivision of the non-standard mask wearing class into subclasses;
FIG. 3 is a diagram of the primary network structure of the original YOLOv3 in the present invention;
FIG. 4 is a network architecture diagram of the ECANet in the present invention;
FIG. 5 is a diagram of the main network structure of the mask wearing video detection task proposed by the present invention;
fig. 6 is a diagram of a video transmission and display device of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A mask wearing condition detection method based on Slim-YOLOv3 comprises the following steps: acquiring face video data in real time, and preprocessing the face video data; and inputting the preprocessed face image into a trained improved Slim-YOLOv3 model, and judging whether the user wears the mask correctly. The improved Slim-YOLOv3 model includes a backbone network Darknet-53, a feature enhancement and prediction network, and a decoding network.
The process of training the improved Slim-Yolov3 model includes:
s1: acquiring an original data set, and preprocessing the original data set to obtain a training sample set and a test sample set;
s2: carrying out initial classification on data in a training sample set;
s3: inputting the classified images in the training sample set into a backbone network Darknet-53 for multi-scale transformation, and extracting a plurality of scale features;
s4: inputting a plurality of scale features into a feature enhancement and prediction network to obtain a classification prediction result;
s5: inputting the classification prediction result into a decoding network for decoding;
s6: calculating a loss function of the model according to the decoding result;
s7: inputting the data in the test set into the model for prediction, optimizing the loss function of the model according to the prediction results, and finishing the training of the model when the loss function reaches its minimum.
An embodiment of training an improved Slim-Yolov3 model, comprising:
s1: acquiring an original data set and classifying it, the classification results comprising: standard mask wearing images, non-standard mask wearing images, and no-mask images;
s2: dividing the classified data set to obtain a training sample set and a testing sample set; carrying out data enhancement processing on the training sample set;
s3: inputting the images in the enhanced training sample set into a YOLOv3 network model of a backbone network Darknet-53 for multi-scale transformation, and extracting multi-scale classification features and positioning features;
s4: two feature layers need to be output; they are located at different depths of the backbone darknet53, namely the middle-lower layer and the bottom layer, with sizes (26, 26, 512) and (13, 13, 1024) respectively, and 5 convolution operations are then applied to each of the two feature layers;
s5: after this processing, part of the 13 × 13 feature layer is used to output the prediction result corresponding to that layer, and part of it is combined, after an UpSampling2D (deconvolution) operation, with the 26 × 26 feature layer; a 3 × 3 convolution and a 1 × 1 convolution are then applied once to the feature maps of the two scales;
s6: for a picture initially divided into K × K grids, K represents the number of grids along each side of the picture, and the larger K is, the smaller each grid becomes; two scales are used to predict C classes, and the resulting tensor for each scale is K × K × [2 × (4 + 1 + C)], where 4 denotes the x, y coordinate offsets x_offset and y_offset of the predicted position together with the width h and height w of the predicted target, 1 denotes the confidence, and C denotes the prediction classes of the target. The network uses multiple independent logistic regression classifiers for classification; each classifier only judges whether the object appearing in the target box belongs to the current label, i.e. a simple binary classification, thereby realizing multi-label classification.
S7: decoding. The corresponding x_offset and y_offset are added to each grid point, and the result is the center of the prediction box; the prior box is then combined with h and w to calculate the length and width of the prediction box; the localization loss is then calculated from the position information and the actual annotation information, and the classification loss from the predicted class information and the annotated class information. The process is as follows:
1. determine the position of the ground-truth box in the picture and which grid point it belongs to for detection;
2. determine which prior box has the highest overlap with the ground-truth box;
3. calculate what prediction that grid point should produce in order to obtain the ground-truth box;
4. process all ground-truth boxes as above;
5. obtain the predictions the network should produce and compare them with the actual predictions.
S8: according to the classification loss and the localization loss, the training of the model is finished when the loss converges to a certain degree and no longer decreases, or when a certain number of iterations is reached.
As shown in fig. 1, the acquired data are divided into three major categories according to the actual wearing condition of the mask, namely no mask, non-standard mask wearing, and standard mask wearing, with the labels Nomask, Wrmask and Swmask respectively.
Further, because masks can be worn irregularly in many different ways, the non-standard class shows large intra-class differences, which affects the detection accuracy of this class and therefore the overall detection accuracy. The non-standard mask wearing data set is therefore subdivided into four subclasses labeled Notnorm1, Notnorm2, Notnorm3, and Notnorm4.
The process of classifying the non-standard mask wearing images with the improved unsupervised image self-classification method SCAN comprises the following steps:
Step 1: extracting the face regions of non-standard mask wearing samples in the mask data set as a training set;
Step 2: extracting high-level semantic features of the images through a self-supervised method, removing the low-level features relied on in current end-to-end learning methods;
Step 3: performing classification training on the face region data of the mask wearing condition data set with an ECAResnet50 network to obtain pre-training weights;
Step 4: importing the pre-training weights into an adversarial network composed of ECAResnet50 and a multilayer perceptron, calculating the cosine similarity between the high-level semantic features extracted by the ECAResnet50 network, and, according to this similarity, assigning the images corresponding to the most similar semantic features as neighbors;
Step 5: performing clustering learning with the nearest neighbors as a prior. A clustering function Φ_η is learned, where Φ_η(X) represents the predicted class assignment of the target and η denotes the neural network weight parameters. For a sample X in the data set D, pseudo labels are assigned jointly with its neighbor set N_X over the C classes, and the probability that sample X is assigned to class c is written Φ_η^c(X).
The weight parameters of the function Φ_η are learned through the objective function Λ, which is as follows:

$$\Lambda=-\frac{1}{|D|}\sum_{X\in D}\sum_{k\in N_{X}}\log\left\langle \Phi_{\eta}(X),\Phi_{\eta}(k)\right\rangle+\lambda\sum_{c\in C}\Phi_{\eta}^{\prime c}\log\Phi_{\eta}^{\prime c}$$

$$\Phi_{\eta}^{\prime c}=\frac{1}{|D|}\sum_{X\in D}\Phi_{\eta}^{c}(X)$$
where D is the data set, X denotes a sample, Φ_η(X) denotes the clustering function, η denotes the neural network weight parameters, ⟨·⟩ is the dot product, λ is the weight of the penalty term, Φ'_η^c denotes the probability, averaged over all samples, of belonging to a class, and Φ_η^c(X) represents the confidence of the corresponding class c for the target. To make a sample X_i and its neighbors N_{X_i} produce consistent prediction results, the dot product is maximal only when the predictions belong to the same class and are one-hot. To avoid classifying all samples into the same class, a penalty term is added so that the prediction results are distributed uniformly over all classes. In a concrete implementation, regarding the value of K in the K nearest neighbors: when K = 0, only the samples and their augmented images are used; when K > 1, intra-class sample differences are taken into account, but the penalty term is still needed because not all neighbors belong to the same class.
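For illustration, the clustering objective described above (a dot-product consistency term between a sample and its neighbors plus a penalty term that spreads the average prediction over all classes) could be written as the following sketch; it follows the published SCAN formulation, and the entropy weight of 5.0 is an assumption taken from that work rather than from the patent.

```python
import torch

def scan_loss(probs, neighbor_probs, entropy_weight=5.0):
    """probs:          (N, C) softmax predictions for the samples
    neighbor_probs: (N, C) softmax predictions for one sampled neighbor of each sample
    """
    eps = 1e-8
    # Consistency term: maximize the dot product between a sample and its neighbor.
    dot = (probs * neighbor_probs).sum(dim=1)
    consistency = -torch.log(dot + eps).mean()
    # Penalty term: the average class distribution should stay close to uniform,
    # which prevents all samples from collapsing into a single class.
    mean_probs = probs.mean(dim=0)
    entropy = (mean_probs * torch.log(mean_probs + eps)).sum()
    return consistency + entropy_weight * entropy
```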
Step 6: fine-tuning through self-labelling. Highly confident predictions (p_max ≈ 1) are selected: a threshold is defined and samples whose confidence is above the threshold are chosen, which yields pseudo labels; the cross-entropy loss can then be calculated to update the parameters. To avoid overfitting, strongly augmented samples are used as input, and samples whose confidence rises above the threshold are continually added to the set of high-confidence samples; the iteration ends after a finite number of rounds, and pseudo labels for four subclasses are obtained from the classification results.
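One self-labelling update could be sketched as below: pseudo labels are kept only for samples whose maximum predicted probability exceeds a threshold, and strongly augmented versions of those samples are used for the cross-entropy update; the threshold of 0.99 and the function interface are assumptions.

```python
import torch
import torch.nn.functional as F

def self_label_step(model, images_weak, images_strong, threshold=0.99):
    """One fine-tuning step with self-labelling (pseudo labels from confident samples)."""
    with torch.no_grad():
        probs = torch.softmax(model(images_weak), dim=1)
        conf, pseudo_labels = probs.max(dim=1)
        mask = conf > threshold                  # keep only high-confidence samples
    if mask.sum() == 0:
        return None
    # Strongly augmented samples are used as input to avoid overfitting.
    logits = model(images_strong[mask])
    return F.cross_entropy(logits, pseudo_labels[mask])
```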
The improved YOLOv3 is trained on full images at multiple scales using the Darknet-53 backbone network. The backbone network is combined with the feature pyramid idea to extract multi-scale features: three feature layers of different scales are extracted to predict boxes for objects of different sizes; the smaller-scale feature layer is upsampled, converted to the same size as the previous feature layer through deconvolution, and then concatenated, so that information is shared among the three feature layers of different scales.
As shown in fig. 2, the structure of the original YOLOv3 model specifically includes:
1. The backbone network Darknet-53. The images in the enhanced training sample set are input into the YOLOv3 network model with the backbone network Darknet-53 for multi-scale transformation, and features of multiple scales are extracted through the backbone network Darknet-53;
2. The feature enhancement and prediction network. The three feature layers output by Darknet-53 are required as input; they are located at different depths of the backbone darknet53, namely the middle layer, the middle-lower layer and the bottom layer, with sizes (52, 52, 256), (26, 26, 512) and (13, 13, 1024) respectively, and 5 convolution operations are then applied to each of the three feature layers. After this processing, part of each feature layer is used to output the prediction result corresponding to that layer, and part of it is combined with the previous layer after an UpSampling2D (deconvolution) operation; a 3 × 3 convolution and a 1 × 1 convolution are then applied to the features of each layer. For a picture initially divided into K × K grids, K represents the number of grids along each side of the picture, and the larger K is, the smaller each grid becomes; three scales are used to predict C classes, and the resulting tensor for each scale is K × K × [3 × (4 + 1 + C)], where 4 denotes the x, y coordinate offsets x_offset and y_offset of the predicted position together with the width h and height w of the predicted target, 1 denotes the confidence, and C denotes the prediction classes of the target. The network uses multiple independent logistic regression classifiers for classification; each classifier only judges whether the object appearing in the target box belongs to the current label, i.e. a simple binary classification, thereby realizing multi-label classification.
3. The decoding part. The corresponding x_offset and y_offset are added to each grid point, and the result is the center of the prediction box; the prior box is then combined with h and w to calculate the length and width of the prediction box; the localization loss is then calculated from the position information and the actual annotation information, and the classification loss from the predicted class information and the annotated class information. The process is as follows:
1. determine the position of the ground-truth box in the picture and which grid point it belongs to for detection;
2. determine which prior box has the highest overlap with the ground-truth box;
3. calculate what prediction that grid point should produce in order to obtain the ground-truth box;
4. process all ground-truth boxes as above;
5. obtain the predictions the network should produce and compare them with the actual predictions.
According to the classification loss and the localization loss, the training of the model is finished when the loss converges to a certain degree and no longer decreases, or when a certain number of iterations is reached.
The backbone network Darknet-53 performs multi-scale training on the full image. Darknet-53 contains five large residual convolution blocks, which contain 1, 2, 8, 8 and 4 small residual convolution units respectively. Darknet-53 uses the residual network structure: the residual convolution in Darknet53 first performs a 3 × 3 convolution with stride 2, stores that convolution layer, then performs a 1 × 1 convolution and another 3 × 3 convolution, and adds the result to the stored layer as the final output. Residual networks are easy to optimize and can improve accuracy by adding considerable depth; the residual units inside use skip connections, which alleviate the vanishing-gradient problem caused by increasing the depth of a deep neural network. Each convolution part of darknet53 uses a specific DarknetConv2D structure: L2 regularization is applied during each convolution, and batch normalization and LeakyReLU follow each convolution. The ordinary ReLU sets all negative values to zero, while LeakyReLU assigns a non-zero slope to negative values. Mathematically, it can be expressed as:

$$y_{i}=\begin{cases}x_{i}, & x_{i}\ge 0\\[4pt] \dfrac{x_{i}}{a_{i}}, & x_{i}<0\end{cases}$$

where x_i denotes the standardized input data, a_i is a custom scaling value that can bring negative values close to 0, and y_i denotes the output of the activation function.
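A PyTorch sketch of the convolution block and residual unit described above; the L2 regularization mentioned in the text would typically be realized through the optimizer's weight decay rather than inside the layer, and the channel arrangement shown here is an assumption based on the standard Darknet-53 design.

```python
import torch
import torch.nn as nn

def darknet_conv(c_in, c_out, k, stride=1):
    """Convolution + BatchNorm + LeakyReLU; L2 regularization is assumed to be
    handled by weight decay in the optimizer."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

class ResidualUnit(nn.Module):
    """1x1 then 3x3 convolution, added back to the input (skip connection)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = darknet_conv(channels, channels // 2, 1)
        self.conv2 = darknet_conv(channels // 2, channels, 3)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))
```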
The backbone network Darknet-53 is combined with the feature pyramid idea to extract multi-scale features; three feature layers of different scales are extracted to predict boxes and detect objects of different sizes; the smaller-scale feature layer is upsampled, converted to the same size as the previous feature layer through deconvolution, and then concatenated. In this way, information can be shared among the three feature layers of different scales.
As shown in fig. 3, the ECANet module aggregates the convolution features with global average pooling without dimensionality reduction, adaptively determines the size k of the convolution kernel, performs a one-dimensional convolution, and then learns channel attention through the sigmoid function. Because the dependency relationships among all channels cannot be captured efficiently from the visual channel features, ECANet only considers the information exchange between the current channel and its k neighboring channels; each channel therefore has k parameters instead of C, for k × C parameters in total. The weight of each channel is computed as:

$$\omega_{i}=\sigma\Big(\sum_{j=1}^{k}w^{j}y_{i}^{j}\Big),\qquad y_{i}^{j}\in\Omega_{i}^{k}$$

where ω_i denotes the weight of the i-th channel, Ω_i^k denotes the set of k neighboring channels of y_i, σ denotes the activation function, w^j denotes the weight parameter of the j-th neighboring channel, y_i^j denotes the j-th neighbor of the i-th channel feature, and k denotes the number of neighboring channels.
This strategy can be implemented simply and quickly with a one-dimensional convolution of kernel size k, as follows:

ω = σ(C1D_k(y)),

where C1D denotes a one-dimensional convolution; with this formula, the ECANet module only introduces k parameters.
A lightweight attention mechanism ECANet module is added at the end of each residual convolution block of Darknet-53 to obtain the ECA_Darknet-53 backbone, which is used to extract fine-grained features at two scales. Because only large- and medium-scale faces need to be detected in the actual application scenario, the features of the last two scales of ECA_Darknet-53 are extracted for further feature processing, and larger targets can be processed further to accomplish new tasks.
As shown in fig. 3, in the ECANet attention mechanism module, given the aggregated features of a W × H × C feature map obtained by global average pooling (GAP), the channel weights are generated by performing a fast one-dimensional convolution of size k, where k is determined adaptively as a function of the channel dimension C.
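The adaptive mapping from the channel dimension C to the one-dimensional kernel size k could follow the rule used in the ECA-Net paper, sketched below; the constants γ = 2 and b = 1 are assumptions taken from that paper, not from the patent.

```python
import math

def adaptive_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Map the channel dimension C to an odd one-dimensional kernel size k."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1   # force k to be odd
```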
As shown in fig. 4, the main network structure of the mask wearing video detection task proposed by the invention includes the following parts. A data enhancement network is used to enhance the data of the training data set. ECADarknet-53 is obtained by adding an ECANet attention mechanism module at the end of each small residual convolution unit of Darknet-53; the backbone network ECADarknet-53 with the ECANet attention mechanism module is better able to extract features that are more relevant to the task. The smallest of the three feature layers of different scales extracted by the subsequent original YOLOv3 network is removed, so that the network concentrates more on large- and medium-scale targets, is more compact, and detects faster. As shown by the two feature layers in fig. 4, after extraction by the backbone network only the features of the last two large residual blocks are output; the shapes of the two feature layers are (26, 26, 512) and (13, 13, 1024) respectively, and they contain the larger-scale features. After 5 convolution operations on the last feature layer, part of it is used to output the prediction result corresponding to that layer, and part of it is combined, after an UpSampling2D (deconvolution) operation, with the previous feature layer, which is then put through 5 convolution operations to output its corresponding prediction result.
In one embodiment, the labels of the non-standard mask wearing data set are re-labeled by the improved SCAN unsupervised image self-classification method and subdivided into subclasses to obtain the final training and test data sets; based on the improved YOLOv3 video detection method, the final detection model is obtained by training with the data enhancement built into YOLOv3 combined with transfer learning. The hardware uses a Hikvision dual-spectrum body temperature measurement camera (DS-2TD2637B-10) as the image acquisition device, simply mounted on a tripod for deployment, together with a desktop computer equipped with a GeForce GTX 1060Ti graphics card. As shown in fig. 5, the video acquisition is equipped with a hard disk recorder and connected through a switch to realize data transmission.
The expression of the loss function of the model is:

$$\begin{aligned}
Loss ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\Big[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}\Big]
+\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\Big[\big(\sqrt{\omega_{i}}-\sqrt{\hat{\omega}_{i}}\big)^{2}+\big(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\big)^{2}\Big] \\
&+\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\big(C_{i}-\hat{C}_{i}\big)^{2}
+\lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{noobj}\big(C_{i}-\hat{C}_{i}\big)^{2}
+\sum_{i=0}^{S^{2}} I_{i}^{obj}\sum_{c\in classes}\big(P_{i}(c)-\hat{P}_{i}(c)\big)^{2}
\end{aligned}$$

where λ_coord and λ_noobj are the weights of the corresponding terms; S² represents the number of grids and B the number of candidate boxes generated per grid; I_ij^obj indicates whether the j-th anchor box of the i-th grid is responsible for predicting this object, and I_ij^noobj indicates that it is not; x_i and y_i denote the abscissa and ordinate of the actual center point of the i-th grid, and the corresponding hatted values denote the center-point coordinates after prediction and decoding by the j-th anchor box; ω and h denote the width and height of the target, and their hatted counterparts the width and height of the decoded target; C represents the confidence that the target prediction box contains the target object, and its hatted counterpart the confidence after decoding; classes represents all classes of the data set; P represents the probability that the target belongs to class c, and its hatted counterpart the decoded probability that the target belongs to class c.
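A highly simplified sketch of the loss structure above, assuming that the object and no-object masks and the decoded targets have already been built; squared-error terms are used throughout for brevity, and the default weights λ_coord = 5 and λ_noobj = 0.5 are assumptions taken from the original YOLO formulation, not from the patent.

```python
import torch

def detection_loss(pred_xy, pred_wh, pred_conf, pred_cls,
                   true_xy, true_wh, true_conf, true_cls,
                   obj_mask, noobj_mask,
                   lambda_coord=5.0, lambda_noobj=0.5):
    """Sum of localization, confidence and classification terms.

    obj_mask / noobj_mask: (batch, S*S, B) indicators of which anchor boxes
    are (or are not) responsible for a ground-truth object; widths and heights
    are assumed to be decoded, non-negative values.
    """
    loc_xy = (obj_mask * ((pred_xy - true_xy) ** 2).sum(-1)).sum()
    loc_wh = (obj_mask * ((pred_wh.sqrt() - true_wh.sqrt()) ** 2).sum(-1)).sum()
    conf_obj = (obj_mask * (pred_conf - true_conf) ** 2).sum()
    conf_noobj = (noobj_mask * (pred_conf - true_conf) ** 2).sum()
    cls = (obj_mask * ((pred_cls - true_cls) ** 2).sum(-1)).sum()
    return (lambda_coord * (loc_xy + loc_wh)
            + conf_obj + lambda_noobj * conf_noobj + cls)
```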
The final embodiment can realize fast and accurate mask wearing condition recognition and body temperature monitoring for large- and medium-scale faces.
The above embodiments further illustrate the objects, technical solutions and advantages of the present invention in detail. It should be understood that they are only preferred embodiments of the present invention and are not intended to limit it; any modifications, equivalent substitutions, improvements and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A mask wearing condition detection method based on Slim-YOLOv3 is characterized by comprising the following steps: acquiring face video data in real time, and preprocessing the face video data; inputting the preprocessed face image into a trained improved Slim-YOLOv3 model, and judging whether the user wears the mask correctly; the improved Slim-YOLOv3 model comprises a backbone network ECADarknet-53, a feature enhancement and prediction network and a decoding network;
the process of training the improved Slim-Yolov3 model includes:
s1: acquiring an original data set, and preprocessing the original data set to obtain a training sample set and a test sample set;
s2: classifying and re-labeling the data in the training sample set and the test sample set;
s3: inputting the classified training sample set into a backbone network Darknet-53 for multi-scale transformation, and extracting a plurality of scale features;
s4: inputting a plurality of scale features into a feature enhancement and prediction network to obtain a classification prediction result;
s5: inputting the classification prediction result into a decoding network for decoding;
s6: calculating a loss function of the model according to the decoding result;
s7: inputting the data in the test set into the model for prediction, optimizing the loss function of the model according to the prediction results, and finishing the training of the model when the change of the loss function is small or the maximum number of iterations is reached.
2. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 1, wherein preprocessing the raw data set comprises: compressing and flipping the data in the original data set and changing the image brightness to obtain enhanced image data; and splitting the enhanced image data to obtain a training sample set and a test sample set.
3. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 1, wherein the process of classifying the data in the training sample set and the test sample set comprises: dividing the face mask wearing conditions into three categories according to the images in the original data set, namely standard mask wearing images, non-standard mask wearing images and no-mask images; and reclassifying the non-standard mask wearing images with an improved unsupervised image self-classification method SCAN to obtain a plurality of subclasses.
4. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 3, wherein the process of classifying the non-standard mask wearing images with the improved unsupervised image self-classification method SCAN comprises:
Step 1: extracting the face regions of non-standard mask wearing samples in the mask data set as a training set;
Step 2: performing classification training on the face region data of the mask wearing condition data set with an ECAResnet50 network to obtain pre-training weights;
Step 3: importing the pre-training weights into an adversarial network built on ECAResnet50, and extracting high-level semantic features of the images;
Step 4: calculating the cosine similarity between the high-level semantic features, and assigning the images whose semantic features have higher similarity as neighbors;
Step 5: performing clustering learning with the nearest neighbors as a prior;
Step 6: fine-tuning the clustered images through self-labelling to obtain pseudo labels for four subclasses.
5. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 4, wherein the cosine similarity of the high-level semantic features is calculated as:

$$\cos\theta=\frac{\sum_{i=1}^{n}x_{i}y_{i}}{\sqrt{\sum_{i=1}^{n}x_{i}^{2}}\sqrt{\sum_{i=1}^{n}y_{i}^{2}}}$$

where x_i and y_i respectively denote the i-th dimension of the two semantic feature vectors, and n denotes the total number of dimensions of the vectors.
6. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 5, wherein the process of extracting the multi-scale features of the classified images in the training sample set with the backbone network ECADarknet-53 comprises: inputting the image into a data enhancement module and resizing it to 416 × 416 × 3; inputting the adjusted image into the ECADarknet53 network and applying one convolution block to raise the dimension, obtaining a feature map of size batch_size × 416 × 416 × 32; extracting features from the up-dimensioned map with five residual convolution blocks equipped with the attention mechanism ECANet module, the scale of the extracted features changing after each residual convolution block, and finally outputting the two feature layers obtained from the fourth and fifth residual convolution blocks; where batch_size represents the number of images input to the network at a time.
7. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 6, wherein the process by which the attention mechanism ECANet module processes the features comprises: performing channel-wise global average pooling on the feature layer without dimensionality reduction; for each channel, selecting the data of its k neighboring channels, applying a 1 × 1 convolution to the globally average-pooled data, and passing it through a sigmoid activation function; and expanding the activated data to the size of the input features and multiplying it with the input features to obtain enhanced features containing information from multiple channels.
8. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 1, wherein the process of processing the multiple scale features with the feature enhancement and prediction network comprises:
Step 1: performing five convolution operations on the features obtained from the fifth residual convolution block of the ECADarknet53 network;
Step 2: applying one 3 × 3 convolution and then one 1 × 1 convolution to the convolved features, and taking the result as the prediction output of the scale feature layer corresponding to the fifth residual convolution block;
Step 3: performing an UpSampling2D (deconvolution) operation on the features after the five convolutions, stacking them with the feature layer obtained from the fourth residual convolution block, and fusing and enhancing the information of the two scales;
Step 4: performing five convolution operations on the fused feature map, then one 3 × 3 convolution and one 1 × 1 convolution, to obtain the prediction output of the scale feature layer corresponding to the fourth residual convolution block;
Step 5: outputting the prediction results of the two scale feature layers, where the prediction result of each scale comprises, for each grid point, the prediction boxes corresponding to the two prior boxes and their classes, i.e. the positions, confidences and classes of the prior boxes at each grid point after the two feature layers have been divided into grids of different sizes corresponding to the picture.
9. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 1, wherein the process of inputting the classification prediction result into the decoding network for decoding comprises:
Step 1: adding the corresponding x_offset and y_offset to each grid point to obtain the center of the prediction box, where x_offset and y_offset respectively represent the offsets of the actual predicted point in the x and y directions relative to the upper-left coordinate (x, y) of the grid cell;
Step 2: combining the prior box with h and w to calculate the length and width of the prediction box, where h and w respectively represent the scaling values of the prediction box;
Step 3: calculating the localization loss from the position information and the actual annotation information, and calculating the classification loss from the predicted class information and the annotated class information;
Step 5: determining the position of the ground-truth box in the picture and which grid point it belongs to for detection;
Step 6: calculating the overlap between the ground-truth box and the prior boxes, and selecting the prior box with the highest overlap for verification;
Step 7: obtaining the prediction the network should produce and comparing it with the actual annotated result.
10. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 1, wherein the loss function of the model is expressed as:

$$\begin{aligned}
Loss ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\Big[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}\Big]
+\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\Big[\big(\sqrt{\omega_{i}}-\sqrt{\hat{\omega}_{i}}\big)^{2}+\big(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\big)^{2}\Big] \\
&+\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\big(C_{i}-\hat{C}_{i}\big)^{2}
+\lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{noobj}\big(C_{i}-\hat{C}_{i}\big)^{2}
+\sum_{i=0}^{S^{2}} I_{i}^{obj}\sum_{c\in classes}\big(P_{i}(c)-\hat{P}_{i}(c)\big)^{2}
\end{aligned}$$

where λ_coord and λ_noobj are the weights of the corresponding terms; S² represents the number of grids and B the number of candidate boxes generated per grid; I_ij^obj indicates whether the j-th anchor box of the i-th grid is responsible for predicting this object, and I_ij^noobj indicates that it is not; x_i and y_i denote the abscissa and ordinate of the actual center point of the i-th grid, and the corresponding hatted values denote the center-point coordinates after prediction and decoding by the j-th anchor box; ω and h denote the width and height of the target, and their hatted counterparts the width and height of the decoded target; C represents the confidence that the target prediction box contains the target object, and its hatted counterpart the confidence after decoding; classes represents all classes of the data set; P represents the probability that the target belongs to class c, and its hatted counterpart the decoded probability that the target belongs to class c.
CN202110330611.2A 2021-03-26 2021-03-26 Slim-YOLOv3-based mask wearing condition detection method Active CN112949572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110330611.2A CN112949572B (en) 2021-03-26 2021-03-26 Slim-YOLOv3-based mask wearing condition detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110330611.2A CN112949572B (en) 2021-03-26 2021-03-26 Slim-YOLOv3-based mask wearing condition detection method

Publications (2)

Publication Number Publication Date
CN112949572A true CN112949572A (en) 2021-06-11
CN112949572B CN112949572B (en) 2022-11-25

Family

ID=76227145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110330611.2A Active CN112949572B (en) Slim-YOLOv3-based mask wearing condition detection method

Country Status (1)

Country Link
CN (1) CN112949572B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516194A (en) * 2021-07-20 2021-10-19 海南长光卫星信息技术有限公司 Hyperspectral remote sensing image semi-supervised classification method, device, equipment and storage medium
CN113553984A (en) * 2021-08-02 2021-10-26 中再云图技术有限公司 Video mask detection method based on context assistance
CN113553936A (en) * 2021-07-19 2021-10-26 河北工程大学 Mask wearing detection method based on improved YOLOv3
CN113762201A (en) * 2021-09-16 2021-12-07 深圳大学 Mask detection method based on yolov4
CN113989708A (en) * 2021-10-27 2022-01-28 福州大学 Campus library epidemic prevention and control method based on YOLO v4
CN114092998A (en) * 2021-11-09 2022-02-25 杭州电子科技大学信息工程学院 Face recognition detection method for wearing mask based on convolutional neural network
CN114155453A (en) * 2022-02-10 2022-03-08 深圳爱莫科技有限公司 Training method for ice chest commodity image recognition, model and occupancy calculation method
CN114283462A (en) * 2021-11-08 2022-04-05 上海应用技术大学 Mask wearing detection method and system
CN114821702A (en) * 2022-03-15 2022-07-29 电子科技大学 Thermal infrared face recognition method based on face shielding
CN116311104A (en) * 2023-05-15 2023-06-23 合肥市正茂科技有限公司 Training method, device, equipment and medium for vehicle refitting recognition model
CN117975376A (en) * 2024-04-02 2024-05-03 湖南大学 Mine operation safety detection method based on depth grading fusion residual error network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2020101210A4 (en) * 2020-06-30 2020-08-06 Anguraj, Dinesh Kumar Dr Automated screening system of covid-19 infected persons by measurement of respiratory data through deep facial recognition
CN111626330A (en) * 2020-04-23 2020-09-04 南京邮电大学 Target detection method and system based on multi-scale characteristic diagram reconstruction and knowledge distillation
CN111862408A (en) * 2020-06-16 2020-10-30 北京华电天仁电力控制技术有限公司 Intelligent access control method
CN111881775A (en) * 2020-07-07 2020-11-03 烽火通信科技股份有限公司 Real-time face recognition method and device
CN112183471A (en) * 2020-10-28 2021-01-05 西安交通大学 Automatic detection method and system for standard wearing of epidemic prevention mask of field personnel

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626330A (en) * 2020-04-23 2020-09-04 南京邮电大学 Target detection method and system based on multi-scale characteristic diagram reconstruction and knowledge distillation
CN111862408A (en) * 2020-06-16 2020-10-30 北京华电天仁电力控制技术有限公司 Intelligent access control method
AU2020101210A4 (en) * 2020-06-30 2020-08-06 Anguraj, Dinesh Kumar Dr Automated screening system of covid-19 infected persons by measurement of respiratory data through deep facial recognition
CN111881775A (en) * 2020-07-07 2020-11-03 烽火通信科技股份有限公司 Real-time face recognition method and device
CN112183471A (en) * 2020-10-28 2021-01-05 西安交通大学 Automatic detection method and system for standard wearing of epidemic prevention mask of field personnel

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIANG XIAOMING ET AL: "YOLOv3_slim for face mask recognition", 《JOURNAL OF PHYSICS: CONFERENCE SERIES》 *
肖俊杰: "Face mask detection and standard wearing recognition based on YOLOv3 and YCrCb", 《软件》 (Software) *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553936A (en) * 2021-07-19 2021-10-26 河北工程大学 Mask wearing detection method based on improved YOLOv3
CN113516194A (en) * 2021-07-20 2021-10-19 海南长光卫星信息技术有限公司 Hyperspectral remote sensing image semi-supervised classification method, device, equipment and storage medium
CN113516194B (en) * 2021-07-20 2023-08-08 海南长光卫星信息技术有限公司 Semi-supervised classification method, device, equipment and storage medium for hyperspectral remote sensing images
CN113553984A (en) * 2021-08-02 2021-10-26 中再云图技术有限公司 Video mask detection method based on context assistance
CN113553984B (en) * 2021-08-02 2023-10-13 中再云图技术有限公司 Video mask detection method based on context assistance
CN113762201B (en) * 2021-09-16 2023-05-09 深圳大学 Mask detection method based on yolov4
CN113762201A (en) * 2021-09-16 2021-12-07 深圳大学 Mask detection method based on yolov4
CN113989708A (en) * 2021-10-27 2022-01-28 福州大学 Campus library epidemic prevention and control method based on YOLO v4
CN113989708B (en) * 2021-10-27 2024-06-04 福州大学 Campus library epidemic prevention and control method based on YOLO v4
CN114283462A (en) * 2021-11-08 2022-04-05 上海应用技术大学 Mask wearing detection method and system
CN114283462B (en) * 2021-11-08 2024-04-09 上海应用技术大学 Mask wearing detection method and system
CN114092998A (en) * 2021-11-09 2022-02-25 杭州电子科技大学信息工程学院 Face recognition detection method for wearing mask based on convolutional neural network
CN114155453A (en) * 2022-02-10 2022-03-08 深圳爱莫科技有限公司 Training method for ice chest commodity image recognition, model and occupancy calculation method
CN114821702A (en) * 2022-03-15 2022-07-29 电子科技大学 Thermal infrared face recognition method based on face shielding
CN116311104B (en) * 2023-05-15 2023-08-22 合肥市正茂科技有限公司 Training method, device, equipment and medium for vehicle refitting recognition model
CN116311104A (en) * 2023-05-15 2023-06-23 合肥市正茂科技有限公司 Training method, device, equipment and medium for vehicle refitting recognition model
CN117975376A (en) * 2024-04-02 2024-05-03 湖南大学 Mine operation safety detection method based on depth grading fusion residual error network
CN117975376B (en) * 2024-04-02 2024-06-07 湖南大学 Mine operation safety detection method based on depth grading fusion residual error network

Also Published As

Publication number Publication date
CN112949572B (en) 2022-11-25

Similar Documents

Publication Publication Date Title
CN112949572B (en) Slim-YOLOv3-based mask wearing condition detection method
CN111639692B (en) Shadow detection method based on attention mechanism
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN112801018B (en) Cross-scene target automatic identification and tracking method and application
US20190236411A1 (en) Method and system for multi-scale cell image segmentation using multiple parallel convolutional neural networks
CN100423020C (en) Human face identifying method based on structural principal element analysis
CN113361495B (en) Method, device, equipment and storage medium for calculating similarity of face images
CN107633226B (en) Human body motion tracking feature processing method
CN105160355B (en) A kind of method for detecting change of remote sensing image based on region correlation and vision word
CN113052185A (en) Small sample target detection method based on fast R-CNN
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN106599864A (en) Deep face recognition method based on extreme value theory
US8094971B2 (en) Method and system for automatically determining the orientation of a digital image
Teimouri et al. A real-time ball detection approach using convolutional neural networks
CN111274964A (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
Deeksha et al. Classification of Brain Tumor and its types using Convolutional Neural Network
Putro et al. Fast face-CPU: a real-time fast face detector on CPU using deep learning
CN111738194A (en) Evaluation method and device for similarity of face images
CN116824330A (en) Small sample cross-domain target detection method based on deep learning
CN116434010A (en) Multi-view pedestrian attribute identification method
Gowda Age estimation by LS-SVM regression on facial images
CN102156879A (en) Human target matching method based on weighted terrestrial motion distance
Nanthini et al. A novel Deep CNN based LDnet model with the combination of 2D and 3D CNN for Face Liveness Detection
Nasiri et al. Masked face detection using artificial intelligent techniques
CN112287929A (en) Remote sensing image significance analysis method based on feature integration deep learning network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant