CN112949572A - Slim-YOLOv3-based mask wearing condition detection method - Google Patents

Slim-YOLOv3-based mask wearing condition detection method

Info

Publication number
CN112949572A
CN112949572A CN202110330611.2A CN202110330611A
Authority
CN
China
Prior art keywords
mask
convolution
network
slim
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110330611.2A
Other languages
Chinese (zh)
Other versions
CN112949572B (en)
Inventor
姜小明
向富贵
张中华
吕明鸿
王添
赖春红
王伟
李章勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110330611.2A priority Critical patent/CN112949572B/en
Publication of CN112949572A publication Critical patent/CN112949572A/en
Application granted granted Critical
Publication of CN112949572B publication Critical patent/CN112949572B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/165Detection; Localisation; Normalisation using facial parts and geometric relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical fields of deep learning object detection and computer vision, and in particular relates to a mask wearing condition detection method based on Slim-YOLOv3, which comprises the following steps: acquiring face video data in real time and preprocessing it; inputting the preprocessed face images into a trained Slim-YOLOv3 model and judging whether the user is wearing the mask correctly. The method uses a Slim-YOLOv3-based video detection approach for mask wearing conditions, and an improved unsupervised self-classification method is used to divide the data of irregularly worn masks into subclasses, so that the task of detecting mask wearing conditions in video can be carried out more accurately and more quickly. The proposed network is also more compact, which further reduces the application cost.

Description

Slim-YOLOv3-based mask wearing condition detection method
Technical Field
The invention belongs to the technical field of deep learning object detection and computer vision, and in particular relates to a mask wearing condition detection method based on Slim-YOLOv3.
Background
Harmful gases, odors, sprays, viruses and the like can enter the human body through the air, and wearing a mask in the standard way can effectively prevent such substances from entering the body. Wearing a mask properly not only prevents the virus from spreading from asymptomatic carriers to other people and reduces the probability of secondary transmission, thereby protecting others, but also protects the wearer by reducing the amount of virus the wearer is exposed to, so that the risk of infection is lower.
In recent years, deep learning has made great progress in object detection, image classification, semantic segmentation and other fields. Algorithms built on convolutional neural networks have improved greatly in both accuracy and speed. The video detection task for mask wearing conditions is an object detection problem, and object detection is a multi-task deep learning problem that combines object classification with object localization.
At present, according to the requirements of the actual detection task, two key issues must be addressed for video detection:
(1) real-time performance: only real-time on-site video detection can effectively capture the mask wearing condition of the current subject;
(2) high accuracy: only when the mask wearing condition of the current subject is obtained accurately can the detection provide effective assistance.
At present, although many mask wearing condition video detection devices are already in practical use, devices with high detection accuracy often consume expensive computing resources, while cheap detectors cannot achieve high accuracy.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a mask wearing condition detection method based on Slim-YOLOv3, which comprises the following steps: acquiring face video data in real time and preprocessing it; inputting the preprocessed face images into a trained improved Slim-YOLOv3 model and judging whether the user wears the mask correctly; the improved Slim-YOLOv3 model comprises a backbone network Darknet-53, a feature enhancement and prediction network, and a decoding network;
the process of training the improved Slim-Yolov3 model includes:
s1: acquiring an original data set, and preprocessing the original data set to obtain a training sample set and a test sample set;
s2: classifying and re-labeling the data in the training sample set and the test sample set;
s3: inputting the classified training sample set into a backbone network Darknet-53 for multi-scale transformation, and extracting a plurality of scale features;
s4: inputting a plurality of scale features into a feature enhancement and prediction network to obtain a classification prediction result;
s5: inputting the classification prediction result into a decoding network for decoding;
s6: calculating a loss function of the model according to the decoding result;
s7: inputting the data in the test set into the model for prediction, optimizing the loss function of the model according to the prediction results, and finishing the training of the model when the change of the loss function is small or the maximum number of iterations is reached.
Preferably, preprocessing the raw data set comprises: compressing and flipping the data in the original data set and changing the image brightness to obtain enhanced image data; and splitting the enhanced image data to obtain a training sample set and a test sample set.
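For illustration, the preprocessing described above (compression, flipping, brightness change, and the split into training and test sets) could be sketched as follows. This is a minimal sketch only; the 416 × 416 target size matches the input size used later in the description, while the brightness range, the flip probability and the 8:2 split ratio are assumptions made for the example.

```python
import random
from PIL import Image, ImageEnhance

def augment_image(path, target_size=(416, 416)):
    """Compress (resize), randomly flip and randomly change the brightness of one image.

    The brightness range and the flip probability are illustrative assumptions.
    """
    img = Image.open(path).convert("RGB")
    img = img.resize(target_size)                      # compression / resizing
    if random.random() < 0.5:                          # random horizontal flip
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    factor = random.uniform(0.7, 1.3)                  # brightness change
    return ImageEnhance.Brightness(img).enhance(factor)

def split_dataset(samples, train_ratio=0.8):
    """Split the enhanced data into a training sample set and a test sample set."""
    random.shuffle(samples)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]
```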
Preferably, the process of classifying the data in the training sample set and the test sample set includes: dividing the face mask wearing conditions into three categories according to the images in the original data set, namely standard mask wearing images, non-standard mask wearing images and no-mask images; and reclassifying the non-standard mask wearing images with an improved unsupervised image self-classification method (SCAN) to obtain a plurality of subclasses.
Further, the process of reclassifying the non-standard mask wearing images with the improved unsupervised image self-classification method SCAN comprises the following steps:
Step 1: extracting the face regions of non-standard mask wearing samples in the mask data set as a training set;
Step 2: performing classification training on the face region data of the mask wearing condition data set with an ECAResnet50 network to obtain pre-training weights;
Step 3: importing the pre-training weights into an adversarial network built on ECAResnet50, and extracting high-level semantic features of the images;
Step 4: calculating the cosine similarity between the high-level semantic features, and assigning the images whose semantic features have higher similarity as neighbors;
Step 5: performing clustering learning with the nearest neighbors as a prior;
Step 6: fine-tuning the clustered images through self-labelling to obtain pseudo labels for four subclasses.
Further, the cosine similarity of the high-level semantic features is calculated as:

$$\cos\theta=\frac{\sum_{i=1}^{n}x_{i}y_{i}}{\sqrt{\sum_{i=1}^{n}x_{i}^{2}}\sqrt{\sum_{i=1}^{n}y_{i}^{2}}}$$

where x_i and y_i denote the i-th dimension of the two semantic feature vectors and n denotes the total number of dimensions of the vectors.
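As an illustration of Step 4 above, the cosine similarity matrix and the neighbor assignment could be computed as in the following sketch; the use of PyTorch and the choice of k = 20 neighbors are assumptions for the example, not values fixed by the patent.

```python
import torch
import torch.nn.functional as F

def mine_neighbors(features: torch.Tensor, k: int = 20) -> torch.Tensor:
    """Return, for every sample, the indices of its k most similar samples.

    features: (N, d) matrix of high-level semantic feature vectors.
    """
    normed = F.normalize(features, dim=1)         # divide each vector by its L2 norm
    similarity = normed @ normed.t()              # (N, N) cosine similarity matrix
    similarity.fill_diagonal_(-1.0)               # exclude the sample itself
    _, neighbor_idx = similarity.topk(k, dim=1)   # k nearest neighbors per sample
    return neighbor_idx
```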
preferably, the process of extracting the multi-scale features of the classified images in the training sample set by using the backbone network ECADarknet-53 includes: inputting the image into a data enhancement module, and adjusting the image to 416 × 3; inputting the adjusted image into an ECADarknet53 network, and performing primary convolution dimensionality increase on the image by adopting a convolution block to obtain an image with the size of batch _ size 416 32; extracting the characteristics of the graph after dimension increase of the convolution by adopting five residual volume blocks of an attention mechanism ECANet module, wherein the extracted characteristic scale is increased after each residual volume block passes through, and finally outputting two characteristic layers obtained by a fourth residual volume block and a fifth residual volume block; where batch _ size represents the number of images input to the network at a time.
Further, the process by which the attention mechanism ECANet module processes the features comprises the following steps: performing channel-wise global average pooling on the feature layer without dimensionality reduction; for each channel, selecting the data of its k neighboring channels, applying a 1 × 1 convolution to the globally average-pooled data, and passing it through a sigmoid activation function; and expanding the activated data to the size of the input features and multiplying it with the input features to obtain enhanced features containing information from multiple channels. The ECANet module is added to each of the five residual convolution blocks in ECADarknet-53; each residual unit is obtained by adding the output of an ECA module, placed after two convolutions, to the unit input.
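A minimal PyTorch sketch of an ECA-style attention block following the description above (channel-wise global average pooling without dimensionality reduction, a one-dimensional convolution over k neighboring channels, a sigmoid activation, and rescaling of the input); the kernel size k = 3 and the class name are assumptions made for the example.

```python
import torch
import torch.nn as nn

class ECAModule(nn.Module):
    """Efficient channel attention, roughly as described in the text."""

    def __init__(self, k: int = 3):
        super().__init__()
        # one-dimensional convolution over k neighboring channels, no dimension reduction
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, C, H, W) -> channel-wise global average pooling
        y = x.mean(dim=(2, 3)).unsqueeze(1)   # (batch, 1, C)
        y = self.conv(y)                      # exchange information among k neighbor channels
        y = self.sigmoid(y).squeeze(1)        # (batch, C) channel weights
        # expand to the input size and multiply with the input features
        return x * y.unsqueeze(-1).unsqueeze(-1)
```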
Preferably, the processing of the multiple scale features by the feature enhancement and prediction network includes the following steps (a structural sketch is given after this list):
Step 1: performing five convolution operations on the features obtained from the fifth residual convolution block of the ECADarknet53 network;
Step 2: applying one 3 × 3 convolution and then one 1 × 1 convolution to the convolved features, and taking the result as the prediction output of the scale feature layer corresponding to the fifth residual convolution block;
Step 3: performing an UpSampling2D (deconvolution) operation on the features after the five convolutions, stacking them with the feature layer obtained from the fourth residual convolution block, and fusing and enhancing the information of the two scales;
Step 4: performing five convolution operations on the fused feature map, then one 3 × 3 convolution and one 1 × 1 convolution, to obtain the prediction output of the scale feature layer corresponding to the fourth residual convolution block;
Step 5: outputting the prediction results of the two scale feature layers, where the prediction result of each scale comprises, for each grid point, the prediction boxes corresponding to the two prior boxes and their classes, i.e. the positions, confidences and classes of the prior boxes at each grid point after the two feature layers have been divided into grids of different sizes corresponding to the picture.
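The following is a structural sketch of the two-scale feature enhancement and prediction head described in the list above, written in PyTorch; the internal channel counts, the use of two anchors per grid point and the helper names are assumptions for illustration, chosen to match the K × K × [2 × (4 + 1 + C)] output mentioned later in the description.

```python
import torch
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k):
    """Convolution + BatchNorm + LeakyReLU, the basic block assumed here."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

def five_convs(c_in, c_mid):
    """The 'five convolution operations' block, alternating 1x1 and 3x3 kernels."""
    return nn.Sequential(
        conv_bn_leaky(c_in, c_mid, 1), conv_bn_leaky(c_mid, c_mid * 2, 3),
        conv_bn_leaky(c_mid * 2, c_mid, 1), conv_bn_leaky(c_mid, c_mid * 2, 3),
        conv_bn_leaky(c_mid * 2, c_mid, 1),
    )

class TwoScaleHead(nn.Module):
    """Prediction over two scales (13x13 and 26x26), 2 anchors per grid point."""

    def __init__(self, num_classes: int, num_anchors: int = 2):
        super().__init__()
        out_ch = num_anchors * (4 + 1 + num_classes)
        self.convs13 = five_convs(1024, 512)
        self.head13 = nn.Sequential(conv_bn_leaky(512, 1024, 3),
                                    nn.Conv2d(1024, out_ch, 1))
        self.reduce = conv_bn_leaky(512, 256, 1)
        self.up = nn.Upsample(scale_factor=2)
        self.convs26 = five_convs(512 + 256, 256)
        self.head26 = nn.Sequential(conv_bn_leaky(256, 512, 3),
                                    nn.Conv2d(512, out_ch, 1))

    def forward(self, feat26, feat13):
        x13 = self.convs13(feat13)                          # five convolutions
        p13 = self.head13(x13)                              # 3x3 then 1x1 prediction
        x = self.up(self.reduce(x13))                       # upsample to 26x26
        x26 = self.convs26(torch.cat([x, feat26], dim=1))   # fuse the two scales
        p26 = self.head26(x26)
        return p13, p26
```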
Preferably, the process of inputting the classification prediction result into the decoding network for decoding includes the following steps (a decoding sketch is given after this list):
Step 1: adding the corresponding x_offset and y_offset to each grid point to obtain the center of the prediction box;
Step 2: combining the prior box with h and w to calculate the length and width of the prediction box;
Step 3: calculating the localization loss from the position information and the actual annotation information, and calculating the classification loss from the predicted class information and the annotated class information;
Step 5: determining the position of the ground-truth box in the picture and which grid point it belongs to for detection;
Step 6: calculating the overlap between the ground-truth box and the prior boxes, and selecting the prior box with the highest overlap for verification;
Step 7: obtaining the prediction the network should produce and comparing it with the actual prediction.
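The decoding of Steps 1 and 2 could look like the following sketch; standard YOLO-style decoding is assumed here (sigmoid-activated offsets added to the grid coordinates, and exponentially scaled prior-box sizes), and the variable names are illustrative.

```python
import torch

def decode_predictions(raw, anchors, grid_size):
    """Decode raw network outputs into boxes in normalized image coordinates.

    raw:     (batch, num_anchors, grid_size, grid_size, 4) with (tx, ty, tw, th)
    anchors: (num_anchors, 2) prior-box widths/heights, normalized to [0, 1]
    """
    ys, xs = torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size),
                            indexing="ij")
    # center = grid point + predicted offset (x_offset, y_offset)
    cx = (torch.sigmoid(raw[..., 0]) + xs) / grid_size
    cy = (torch.sigmoid(raw[..., 1]) + ys) / grid_size
    # width/height = prior box scaled by the predicted w, h
    pw = anchors[:, 0].view(1, -1, 1, 1)
    ph = anchors[:, 1].view(1, -1, 1, 1)
    w = pw * torch.exp(raw[..., 2])
    h = ph * torch.exp(raw[..., 3])
    return torch.stack([cx, cy, w, h], dim=-1)
```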
Preferably, the loss function of the model is expressed as:

$$\begin{aligned}
Loss ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\Big[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}\Big]
+\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\Big[\big(\sqrt{\omega_{i}}-\sqrt{\hat{\omega}_{i}}\big)^{2}+\big(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\big)^{2}\Big] \\
&+\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\big(C_{i}-\hat{C}_{i}\big)^{2}
+\lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{noobj}\big(C_{i}-\hat{C}_{i}\big)^{2}
+\sum_{i=0}^{S^{2}} I_{i}^{obj}\sum_{c\in classes}\big(P_{i}(c)-\hat{P}_{i}(c)\big)^{2}
\end{aligned}$$

where the symbols are defined in the detailed description below.
preferably, the model introduces pre-training weights when classifying the data: the backbone network parameters are first frozen and the subsequent network layers are trained for 50 iterations, then all parameters are unfrozen and trained for another 100 iterations, and the weights with the lowest classification loss and total loss are taken as the final training result.
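A sketch of the freeze-then-unfreeze training schedule described above, assuming PyTorch; `model.backbone`, `train_one_epoch`, the learning rates and the optimizer choice are assumptions made for the example, and only the 50/100 iteration counts come from the text.

```python
import torch

def freeze_then_finetune(model, train_one_epoch, freeze_epochs=50, unfreeze_epochs=100):
    """Two-stage training: freeze the backbone first, then unfreeze everything."""
    # Stage 1: freeze the (pre-trained) backbone, train the remaining layers.
    for p in model.backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
    for _ in range(freeze_epochs):
        train_one_epoch(model, opt)

    # Stage 2: unfreeze all parameters and continue training at a lower rate.
    for p in model.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    best = None
    for _ in range(unfreeze_epochs):
        loss = train_one_epoch(model, opt)
        if best is None or loss < best:          # keep the weights with the lowest loss
            best = loss
            torch.save(model.state_dict(), "best_weights.pth")
```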
The invention has the beneficial effects that:
according to the invention, the mask wearing condition video detection method based on YOLOv3 is adopted, so that the mask wearing video detection task can be realized more accurately and rapidly. And the proposed network is more concise, so that the application cost is further reduced. By further subclassing the data set and adding an ECANet attention mechanism module in a backbone network, the detection precision of the network is improved; by deleting the network feature layer of the minimum scale in the YOLOv3, the network is more focused on the targets of the large and medium scales, and the network detection speed is further improved.
Drawings
FIG. 1 is a diagram illustrating the division of the data set into three major classes in the present invention;
FIG. 2 is a diagram illustrating the subdivision of the non-standard mask wearing class into subclasses;
FIG. 3 is a diagram of the primary network structure of the original YOLOv3 in the present invention;
FIG. 4 is a network architecture diagram of the ECANet in the present invention;
FIG. 5 is a diagram of the main network structure of the mask wearing video detection task proposed by the present invention;
fig. 6 is a diagram of a video transmission and display device of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A mask wearing condition detection method based on Slim-YOLOv3 comprises the following steps: acquiring face video data in real time, and preprocessing the face video data; and inputting the preprocessed face image into a trained improved Slim-YOLOv3 model, and judging whether the user wears the mask correctly. The improved Slim-YOLOv3 model includes a backbone network Darknet-53, a feature enhancement and prediction network, and a decoding network.
The process of training the improved Slim-Yolov3 model includes:
s1: acquiring an original data set, and preprocessing the original data set to obtain a training sample set and a test sample set;
s2: carrying out initial classification on data in a training sample set;
s3: inputting the classified images in the training sample set into a backbone network Darknet-53 for multi-scale transformation, and extracting a plurality of scale features;
s4: inputting a plurality of scale features into a feature enhancement and prediction network to obtain a classification prediction result;
s5: inputting the classification prediction result into a decoding network for decoding;
s6: calculating a loss function of the model according to the decoding result;
s7: inputting the data in the test set into the model for prediction, optimizing the loss function of the model according to the prediction results, and finishing the training of the model when the loss function reaches its minimum.
An embodiment of training an improved Slim-Yolov3 model, comprising:
s1: acquiring an original data set and classifying it, the classification results comprising: standard mask wearing images, non-standard mask wearing images, and no-mask images;
s2: dividing the classified data set to obtain a training sample set and a testing sample set; carrying out data enhancement processing on the training sample set;
s3: inputting the images in the enhanced training sample set into a YOLOv3 network model of a backbone network Darknet-53 for multi-scale transformation, and extracting multi-scale classification features and positioning features;
s4: two feature layers need to be output; they are located at different depths of the backbone darknet53, namely the middle-lower layer and the bottom layer, with sizes (26, 26, 512) and (13, 13, 1024) respectively, and 5 convolution operations are then applied to each of the two feature layers;
s5: after this processing, part of the 13 × 13 feature layer is used to output the prediction result corresponding to that layer, and part of it is combined, after an UpSampling2D (deconvolution) operation, with the 26 × 26 feature layer; a 3 × 3 convolution and a 1 × 1 convolution are then applied once to the feature maps of the two scales;
s6: for a picture initially divided into K × K grids, K represents the number of grids along each side of the picture, and the larger K is, the smaller each grid becomes; two scales are used to predict C classes, and the resulting tensor for each scale is K × K × [2 × (4 + 1 + C)], where 4 denotes the x, y coordinate offsets x_offset and y_offset of the predicted position together with the width h and height w of the predicted target, 1 denotes the confidence, and C denotes the prediction classes of the target. The network uses multiple independent logistic regression classifiers for classification; each classifier only judges whether the object appearing in the target box belongs to the current label, i.e. a simple binary classification, thereby realizing multi-label classification.
S7: decoding. The corresponding x_offset and y_offset are added to each grid point, and the result is the center of the prediction box; the prior box is then combined with h and w to calculate the length and width of the prediction box; the localization loss is then calculated from the position information and the actual annotation information, and the classification loss from the predicted class information and the annotated class information. The process is as follows:
1. determine the position of the ground-truth box in the picture and which grid point it belongs to for detection;
2. determine which prior box has the highest overlap with the ground-truth box;
3. calculate what prediction that grid point should produce in order to obtain the ground-truth box;
4. process all ground-truth boxes as above;
5. obtain the predictions the network should produce and compare them with the actual predictions.
S8: according to the classification loss and the localization loss, the training of the model is finished when the loss converges to a certain degree and no longer decreases, or when a certain number of iterations is reached.
As shown in fig. 1, the acquired data are divided into three major categories according to the actual wearing condition of the mask, namely no mask, non-standard mask wearing, and standard mask wearing, with the labels Nomask, Wrmask and Swmask respectively.
Further, because masks can be worn irregularly in many different ways, the non-standard class shows large intra-class differences, which affects the detection accuracy of this class and therefore the overall detection accuracy. The non-standard mask wearing data set is therefore subdivided into four subclasses labeled Notnorm1, Notnorm2, Notnorm3, and Notnorm4.
The process of classifying the non-standard mask wearing images with the improved unsupervised image self-classification method SCAN comprises the following steps:
Step 1: extracting the face regions of non-standard mask wearing samples in the mask data set as a training set;
Step 2: extracting high-level semantic features of the images through a self-supervised method, removing the low-level features relied on in current end-to-end learning methods;
Step 3: performing classification training on the face region data of the mask wearing condition data set with an ECAResnet50 network to obtain pre-training weights;
Step 4: importing the pre-training weights into an adversarial network composed of ECAResnet50 and a multilayer perceptron, calculating the cosine similarity between the high-level semantic features extracted by the ECAResnet50 network, and, according to this similarity, assigning the images corresponding to the most similar semantic features as neighbors;
Step 5: performing clustering learning with the nearest neighbors as a prior. A clustering function Φ_η is learned, where Φ_η(X) represents the predicted class assignment of the target and η denotes the neural network weight parameters. For a sample X in the data set D, pseudo labels are assigned jointly with its neighbor set N_X over the C classes, and the probability that sample X is assigned to class c is written Φ_η^c(X).
The weight parameters of the function Φ_η are learned through the objective function Λ, which is as follows:

$$\Lambda=-\frac{1}{|D|}\sum_{X\in D}\sum_{k\in N_{X}}\log\left\langle \Phi_{\eta}(X),\Phi_{\eta}(k)\right\rangle+\lambda\sum_{c\in C}\Phi_{\eta}^{\prime c}\log\Phi_{\eta}^{\prime c}$$

$$\Phi_{\eta}^{\prime c}=\frac{1}{|D|}\sum_{X\in D}\Phi_{\eta}^{c}(X)$$
where D is the data set, X denotes a sample, Φ_η(X) denotes the clustering function, η denotes the neural network weight parameters, ⟨·⟩ is the dot product, λ is the weight of the penalty term, Φ'_η^c denotes the probability, averaged over all samples, of belonging to a class, and Φ_η^c(X) represents the confidence of the corresponding class c for the target. To make a sample X_i and its neighbors N_{X_i} produce consistent prediction results, the dot product is maximal only when the predictions belong to the same class and are one-hot. To avoid classifying all samples into the same class, a penalty term is added so that the prediction results are distributed uniformly over all classes. In a concrete implementation, regarding the value of K in the K nearest neighbors: when K = 0, only the samples and their augmented images are used; when K > 1, intra-class sample differences are taken into account, but the penalty term is still needed because not all neighbors belong to the same class.
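For illustration, the clustering objective described above (a dot-product consistency term between a sample and its neighbors plus a penalty term that spreads the average prediction over all classes) could be written as the following sketch; it follows the published SCAN formulation, and the entropy weight of 5.0 is an assumption taken from that work rather than from the patent.

```python
import torch

def scan_loss(probs, neighbor_probs, entropy_weight=5.0):
    """probs:          (N, C) softmax predictions for the samples
    neighbor_probs: (N, C) softmax predictions for one sampled neighbor of each sample
    """
    eps = 1e-8
    # Consistency term: maximize the dot product between a sample and its neighbor.
    dot = (probs * neighbor_probs).sum(dim=1)
    consistency = -torch.log(dot + eps).mean()
    # Penalty term: the average class distribution should stay close to uniform,
    # which prevents all samples from collapsing into a single class.
    mean_probs = probs.mean(dim=0)
    entropy = (mean_probs * torch.log(mean_probs + eps)).sum()
    return consistency + entropy_weight * entropy
```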
Step 6: fine-tuning through self-labelling. Highly confident predictions (p_max ≈ 1) are selected: a threshold is defined and samples whose confidence is above the threshold are chosen, which yields pseudo labels; the cross-entropy loss can then be calculated to update the parameters. To avoid overfitting, strongly augmented samples are used as input, and samples whose confidence rises above the threshold are continually added to the set of high-confidence samples; the iteration ends after a finite number of rounds, and pseudo labels for four subclasses are obtained from the classification results.
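One self-labelling update could be sketched as below: pseudo labels are kept only for samples whose maximum predicted probability exceeds a threshold, and strongly augmented versions of those samples are used for the cross-entropy update; the threshold of 0.99 and the function interface are assumptions.

```python
import torch
import torch.nn.functional as F

def self_label_step(model, images_weak, images_strong, threshold=0.99):
    """One fine-tuning step with self-labelling (pseudo labels from confident samples)."""
    with torch.no_grad():
        probs = torch.softmax(model(images_weak), dim=1)
        conf, pseudo_labels = probs.max(dim=1)
        mask = conf > threshold                  # keep only high-confidence samples
    if mask.sum() == 0:
        return None
    # Strongly augmented samples are used as input to avoid overfitting.
    logits = model(images_strong[mask])
    return F.cross_entropy(logits, pseudo_labels[mask])
```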
The improved YOLOv3 is trained on full images at multiple scales using the Darknet-53 backbone network. The backbone network is combined with the feature pyramid idea to extract multi-scale features: three feature layers of different scales are extracted to predict boxes for objects of different sizes; the smaller-scale feature layer is upsampled, converted to the same size as the previous feature layer through deconvolution, and then concatenated, so that information is shared among the three feature layers of different scales.
As shown in fig. 2, the structure of the original YOLOv3 model specifically includes:
1. The backbone network Darknet-53. The images in the enhanced training sample set are input into the YOLOv3 network model with the backbone network Darknet-53 for multi-scale transformation, and features of multiple scales are extracted through the backbone network Darknet-53;
2. The feature enhancement and prediction network. The three feature layers output by Darknet-53 are required as input; they are located at different depths of the backbone darknet53, namely the middle layer, the middle-lower layer and the bottom layer, with sizes (52, 52, 256), (26, 26, 512) and (13, 13, 1024) respectively, and 5 convolution operations are then applied to each of the three feature layers. After this processing, part of each feature layer is used to output the prediction result corresponding to that layer, and part of it is combined with the previous layer after an UpSampling2D (deconvolution) operation; a 3 × 3 convolution and a 1 × 1 convolution are then applied to the features of each layer. For a picture initially divided into K × K grids, K represents the number of grids along each side of the picture, and the larger K is, the smaller each grid becomes; three scales are used to predict C classes, and the resulting tensor for each scale is K × K × [3 × (4 + 1 + C)], where 4 denotes the x, y coordinate offsets x_offset and y_offset of the predicted position together with the width h and height w of the predicted target, 1 denotes the confidence, and C denotes the prediction classes of the target. The network uses multiple independent logistic regression classifiers for classification; each classifier only judges whether the object appearing in the target box belongs to the current label, i.e. a simple binary classification, thereby realizing multi-label classification.
3. The decoding part. The corresponding x_offset and y_offset are added to each grid point, and the result is the center of the prediction box; the prior box is then combined with h and w to calculate the length and width of the prediction box; the localization loss is then calculated from the position information and the actual annotation information, and the classification loss from the predicted class information and the annotated class information. The process is as follows:
1. determine the position of the ground-truth box in the picture and which grid point it belongs to for detection;
2. determine which prior box has the highest overlap with the ground-truth box;
3. calculate what prediction that grid point should produce in order to obtain the ground-truth box;
4. process all ground-truth boxes as above;
5. obtain the predictions the network should produce and compare them with the actual predictions.
According to the classification loss and the localization loss, the training of the model is finished when the loss converges to a certain degree and no longer decreases, or when a certain number of iterations is reached.
The backbone network Darknet-53 performs multi-scale training on the full image. Darknet-53 contains five large residual convolution blocks, which contain 1, 2, 8, 8 and 4 small residual convolution units respectively. Darknet-53 uses the residual network structure: the residual convolution in Darknet53 first performs a 3 × 3 convolution with stride 2, stores that convolution layer, then performs a 1 × 1 convolution and another 3 × 3 convolution, and adds the result to the stored layer as the final output. Residual networks are easy to optimize and can improve accuracy by adding considerable depth; the residual units inside use skip connections, which alleviate the vanishing-gradient problem caused by increasing the depth of a deep neural network. Each convolution part of darknet53 uses a specific DarknetConv2D structure: L2 regularization is applied during each convolution, and batch normalization and LeakyReLU follow each convolution. The ordinary ReLU sets all negative values to zero, while LeakyReLU assigns a non-zero slope to negative values. Mathematically, it can be expressed as:

$$y_{i}=\begin{cases}x_{i}, & x_{i}\ge 0\\[4pt] \dfrac{x_{i}}{a_{i}}, & x_{i}<0\end{cases}$$

where x_i denotes the standardized input data, a_i is a custom scaling value that can bring negative values close to 0, and y_i denotes the output of the activation function.
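A PyTorch sketch of the convolution block and residual unit described above; the L2 regularization mentioned in the text would typically be realized through the optimizer's weight decay rather than inside the layer, and the channel arrangement shown here is an assumption based on the standard Darknet-53 design.

```python
import torch
import torch.nn as nn

def darknet_conv(c_in, c_out, k, stride=1):
    """Convolution + BatchNorm + LeakyReLU; L2 regularization is assumed to be
    handled by weight decay in the optimizer."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

class ResidualUnit(nn.Module):
    """1x1 then 3x3 convolution, added back to the input (skip connection)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = darknet_conv(channels, channels // 2, 1)
        self.conv2 = darknet_conv(channels // 2, channels, 3)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))
```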
The backbone network Darknet-53 is combined with the feature pyramid idea to extract multi-scale features; three feature layers of different scales are extracted to predict boxes and detect objects of different sizes; the smaller-scale feature layer is upsampled, converted to the same size as the previous feature layer through deconvolution, and then concatenated. In this way, information can be shared among the three feature layers of different scales.
As shown in fig. 3, the ECANet module aggregates the convolution features with global average pooling without dimensionality reduction, adaptively determines the size k of the convolution kernel, performs a one-dimensional convolution, and then learns channel attention through the sigmoid function. Because the dependency relationships among all channels cannot be captured efficiently from the visual channel features, ECANet only considers the information exchange between the current channel and its k neighboring channels; each channel therefore has k parameters instead of C, for k × C parameters in total. The weight of each channel is computed as:

$$\omega_{i}=\sigma\Big(\sum_{j=1}^{k}w^{j}y_{i}^{j}\Big),\qquad y_{i}^{j}\in\Omega_{i}^{k}$$

where ω_i denotes the weight of the i-th channel, Ω_i^k denotes the set of k neighboring channels of y_i, σ denotes the activation function, w^j denotes the weight parameter of the j-th neighboring channel, y_i^j denotes the j-th neighbor of the i-th channel feature, and k denotes the number of neighboring channels.
This strategy can be implemented simply and quickly with a one-dimensional convolution of kernel size k, as follows:

ω = σ(C1D_k(y)),

where C1D denotes a one-dimensional convolution; with this formula, the ECANet module only introduces k parameters.
A lightweight attention mechanism ECANet module is added at the end of each residual convolution block of Darknet-53 to obtain the ECA_Darknet-53 backbone, which is used to extract fine-grained features at two scales. Because only large- and medium-scale faces need to be detected in the actual application scenario, the features of the last two scales of ECA_Darknet-53 are extracted for further feature processing, and larger targets can be processed further to accomplish new tasks.
As shown in fig. 3, in the ECANet attention mechanism module, given the aggregated features of a W × H × C feature map obtained by global average pooling (GAP), the channel weights are generated by performing a fast one-dimensional convolution of size k, where k is determined adaptively as a function of the channel dimension C.
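The adaptive mapping from the channel dimension C to the one-dimensional kernel size k could follow the rule used in the ECA-Net paper, sketched below; the constants γ = 2 and b = 1 are assumptions taken from that paper, not from the patent.

```python
import math

def adaptive_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Map the channel dimension C to an odd one-dimensional kernel size k."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1   # force k to be odd
```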
As shown in fig. 4, the main network structure of the mask wearing video detection task proposed by the invention includes the following parts. A data enhancement network is used to enhance the data of the training data set. ECADarknet-53 is obtained by adding an ECANet attention mechanism module at the end of each small residual convolution unit of Darknet-53; the backbone network ECADarknet-53 with the ECANet attention mechanism module is better able to extract features that are more relevant to the task. The smallest of the three feature layers of different scales extracted by the subsequent original YOLOv3 network is removed, so that the network concentrates more on large- and medium-scale targets, is more compact, and detects faster. As shown by the two feature layers in fig. 4, after extraction by the backbone network only the features of the last two large residual blocks are output; the shapes of the two feature layers are (26, 26, 512) and (13, 13, 1024) respectively, and they contain the larger-scale features. After 5 convolution operations on the last feature layer, part of it is used to output the prediction result corresponding to that layer, and part of it is combined, after an UpSampling2D (deconvolution) operation, with the previous feature layer, which is then put through 5 convolution operations to output its corresponding prediction result.
In one embodiment, the labels of the non-standard mask wearing data set are re-labeled by the improved SCAN unsupervised image self-classification method and subdivided into subclasses to obtain the final training and test data sets; based on the improved YOLOv3 video detection method, the final detection model is obtained by training with the data enhancement built into YOLOv3 combined with transfer learning. The hardware uses a Hikvision dual-spectrum body temperature measurement camera (DS-2TD2637B-10) as the image acquisition device, simply mounted on a tripod for deployment, together with a desktop computer equipped with a GeForce GTX 1060Ti graphics card. As shown in fig. 5, the video acquisition is equipped with a hard disk recorder and connected through a switch to realize data transmission.
The expression of the loss function of the model is:

$$\begin{aligned}
Loss ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\Big[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}\Big]
+\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\Big[\big(\sqrt{\omega_{i}}-\sqrt{\hat{\omega}_{i}}\big)^{2}+\big(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\big)^{2}\Big] \\
&+\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\big(C_{i}-\hat{C}_{i}\big)^{2}
+\lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{noobj}\big(C_{i}-\hat{C}_{i}\big)^{2}
+\sum_{i=0}^{S^{2}} I_{i}^{obj}\sum_{c\in classes}\big(P_{i}(c)-\hat{P}_{i}(c)\big)^{2}
\end{aligned}$$

where λ_coord and λ_noobj are the weights of the corresponding terms; S² represents the number of grids and B the number of candidate boxes generated per grid; I_ij^obj indicates whether the j-th anchor box of the i-th grid is responsible for predicting this object, and I_ij^noobj indicates that it is not; x_i and y_i denote the abscissa and ordinate of the actual center point of the i-th grid, and the corresponding hatted values denote the center-point coordinates after prediction and decoding by the j-th anchor box; ω and h denote the width and height of the target, and their hatted counterparts the width and height of the decoded target; C represents the confidence that the target prediction box contains the target object, and its hatted counterpart the confidence after decoding; classes represents all classes of the data set; P represents the probability that the target belongs to class c, and its hatted counterpart the decoded probability that the target belongs to class c.
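A highly simplified sketch of the loss structure above, assuming that the object and no-object masks and the decoded targets have already been built; squared-error terms are used throughout for brevity, and the default weights λ_coord = 5 and λ_noobj = 0.5 are assumptions taken from the original YOLO formulation, not from the patent.

```python
import torch

def detection_loss(pred_xy, pred_wh, pred_conf, pred_cls,
                   true_xy, true_wh, true_conf, true_cls,
                   obj_mask, noobj_mask,
                   lambda_coord=5.0, lambda_noobj=0.5):
    """Sum of localization, confidence and classification terms.

    obj_mask / noobj_mask: (batch, S*S, B) indicators of which anchor boxes
    are (or are not) responsible for a ground-truth object; widths and heights
    are assumed to be decoded, non-negative values.
    """
    loc_xy = (obj_mask * ((pred_xy - true_xy) ** 2).sum(-1)).sum()
    loc_wh = (obj_mask * ((pred_wh.sqrt() - true_wh.sqrt()) ** 2).sum(-1)).sum()
    conf_obj = (obj_mask * (pred_conf - true_conf) ** 2).sum()
    conf_noobj = (noobj_mask * (pred_conf - true_conf) ** 2).sum()
    cls = (obj_mask * ((pred_cls - true_cls) ** 2).sum(-1)).sum()
    return (lambda_coord * (loc_xy + loc_wh)
            + conf_obj + lambda_noobj * conf_noobj + cls)
```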
The final embodiment can realize fast and accurate mask wearing condition recognition and body temperature monitoring for large- and medium-scale faces.
The above embodiments further illustrate the objects, technical solutions and advantages of the present invention in detail. It should be understood that they are only preferred embodiments of the present invention and are not intended to limit it; any modifications, equivalent substitutions, improvements and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A mask wearing condition detection method based on Slim-YOLOv3 is characterized by comprising the following steps: acquiring face video data in real time, and preprocessing the face video data; inputting the preprocessed face image into a trained improved Slim-YOLOv3 model, and judging whether the user wears the mask correctly; the improved Slim-YOLOv3 model comprises a backbone network ECADarknet-53, a feature enhancement and prediction network and a decoding network;
the process of training the improved Slim-Yolov3 model includes:
s1: acquiring an original data set, and preprocessing the original data set to obtain a training sample set and a test sample set;
s2: classifying and re-labeling the data in the training sample set and the test sample set;
s3: inputting the classified training sample set into a backbone network Darknet-53 for multi-scale transformation, and extracting a plurality of scale features;
s4: inputting a plurality of scale features into a feature enhancement and prediction network to obtain a classification prediction result;
s5: inputting the classification prediction result into a decoding network for decoding;
s6: calculating a loss function of the model according to the decoding result;
s7: inputting the data in the test set into the model for prediction, optimizing the loss function of the model according to the prediction results, and finishing the training of the model when the change of the loss function is small or the maximum number of iterations is reached.
2. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 1, wherein preprocessing the raw data set comprises: compressing and flipping the data in the original data set and changing the image brightness to obtain enhanced image data; and splitting the enhanced image data to obtain a training sample set and a test sample set.
3. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 1, wherein the process of classifying the data in the training sample set and the test sample set comprises: dividing the face mask wearing conditions into three categories according to the images in the original data set, namely standard mask wearing images, non-standard mask wearing images and no-mask images; and reclassifying the non-standard mask wearing images with an improved unsupervised image self-classification method SCAN to obtain a plurality of subclasses.
4. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 3, wherein the process of classifying the non-standard mask wearing images with the improved unsupervised image self-classification method SCAN comprises:
Step 1: extracting the face regions of non-standard mask wearing samples in the mask data set as a training set;
Step 2: performing classification training on the face region data of the mask wearing condition data set with an ECAResnet50 network to obtain pre-training weights;
Step 3: importing the pre-training weights into an adversarial network built on ECAResnet50, and extracting high-level semantic features of the images;
Step 4: calculating the cosine similarity between the high-level semantic features, and assigning the images whose semantic features have higher similarity as neighbors;
Step 5: performing clustering learning with the nearest neighbors as a prior;
Step 6: fine-tuning the clustered images through self-labelling to obtain pseudo labels for four subclasses.
5. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 4, wherein the cosine similarity of the high-level semantic features is calculated as:

$$\cos\theta=\frac{\sum_{i=1}^{n}x_{i}y_{i}}{\sqrt{\sum_{i=1}^{n}x_{i}^{2}}\sqrt{\sum_{i=1}^{n}y_{i}^{2}}}$$

where x_i and y_i respectively denote the i-th dimension of the two semantic feature vectors, and n denotes the total number of dimensions of the vectors.
6. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 5, wherein the process of extracting the multi-scale features of the classified images in the training sample set with the backbone network ECADarknet-53 comprises: inputting the image into a data enhancement module and resizing it to 416 × 416 × 3; inputting the adjusted image into the ECADarknet53 network and applying one convolution block to raise the dimension, obtaining a feature map of size batch_size × 416 × 416 × 32; extracting features from the up-dimensioned map with five residual convolution blocks equipped with the attention mechanism ECANet module, the scale of the extracted features changing after each residual convolution block, and finally outputting the two feature layers obtained from the fourth and fifth residual convolution blocks; where batch_size represents the number of images input to the network at a time.
7. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 6, wherein the process by which the attention mechanism ECANet module processes the features comprises: performing channel-wise global average pooling on the feature layer without dimensionality reduction; for each channel, selecting the data of its k neighboring channels, applying a 1 × 1 convolution to the globally average-pooled data, and passing it through a sigmoid activation function; and expanding the activated data to the size of the input features and multiplying it with the input features to obtain enhanced features containing information from multiple channels.
8. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 1, wherein the process of processing the multiple scale features with the feature enhancement and prediction network comprises:
Step 1: performing five convolution operations on the features obtained from the fifth residual convolution block of the ECADarknet53 network;
Step 2: applying one 3 × 3 convolution and then one 1 × 1 convolution to the convolved features, and taking the result as the prediction output of the scale feature layer corresponding to the fifth residual convolution block;
Step 3: performing an UpSampling2D (deconvolution) operation on the features after the five convolutions, stacking them with the feature layer obtained from the fourth residual convolution block, and fusing and enhancing the information of the two scales;
Step 4: performing five convolution operations on the fused feature map, then one 3 × 3 convolution and one 1 × 1 convolution, to obtain the prediction output of the scale feature layer corresponding to the fourth residual convolution block;
Step 5: outputting the prediction results of the two scale feature layers, where the prediction result of each scale comprises, for each grid point, the prediction boxes corresponding to the two prior boxes and their classes, i.e. the positions, confidences and classes of the prior boxes at each grid point after the two feature layers have been divided into grids of different sizes corresponding to the picture.
9. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 1, wherein the process of inputting the classification prediction result into the decoding network for decoding comprises:
Step 1: adding the corresponding x_offset and y_offset to each grid point to obtain the center of the prediction box, where x_offset and y_offset respectively represent the offsets of the actual predicted point in the x and y directions relative to the upper-left coordinate (x, y) of the grid cell;
Step 2: combining the prior box with h and w to calculate the length and width of the prediction box, where h and w respectively represent the scaling values of the prediction box;
Step 3: calculating the localization loss from the position information and the actual annotation information, and calculating the classification loss from the predicted class information and the annotated class information;
Step 5: determining the position of the ground-truth box in the picture and which grid point it belongs to for detection;
Step 6: calculating the overlap between the ground-truth box and the prior boxes, and selecting the prior box with the highest overlap for verification;
Step 7: obtaining the prediction the network should produce and comparing it with the actual annotated result.
10. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 1, wherein the loss function of the model is expressed as:

$$\begin{aligned}
Loss ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\Big[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}\Big]
+\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\Big[\big(\sqrt{\omega_{i}}-\sqrt{\hat{\omega}_{i}}\big)^{2}+\big(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\big)^{2}\Big] \\
&+\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{obj}\big(C_{i}-\hat{C}_{i}\big)^{2}
+\lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} I_{ij}^{noobj}\big(C_{i}-\hat{C}_{i}\big)^{2}
+\sum_{i=0}^{S^{2}} I_{i}^{obj}\sum_{c\in classes}\big(P_{i}(c)-\hat{P}_{i}(c)\big)^{2}
\end{aligned}$$

where λ_coord and λ_noobj are the weights of the corresponding terms; S² represents the number of grids and B the number of candidate boxes generated per grid; I_ij^obj indicates whether the j-th anchor box of the i-th grid is responsible for predicting this object, and I_ij^noobj indicates that it is not; x_i and y_i denote the abscissa and ordinate of the actual center point of the i-th grid, and the corresponding hatted values denote the center-point coordinates after prediction and decoding by the j-th anchor box; ω and h denote the width and height of the target, and their hatted counterparts the width and height of the decoded target; C represents the confidence that the target prediction box contains the target object, and its hatted counterpart the confidence after decoding; classes represents all classes of the data set; P represents the probability that the target belongs to class c, and its hatted counterpart the decoded probability that the target belongs to class c.
CN202110330611.2A 2021-03-26 2021-03-26 Slim-YOLOv3-based mask wearing condition detection method Active CN112949572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110330611.2A CN112949572B (en) 2021-03-26 2021-03-26 Slim-YOLOv3-based mask wearing condition detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110330611.2A CN112949572B (en) 2021-03-26 2021-03-26 Slim-YOLOv3-based mask wearing condition detection method

Publications (2)

Publication Number Publication Date
CN112949572A true CN112949572A (en) 2021-06-11
CN112949572B CN112949572B (en) 2022-11-25

Family

ID=76227145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110330611.2A Active CN112949572B (en) Slim-YOLOv3-based mask wearing condition detection method

Country Status (1)

Country Link
CN (1) CN112949572B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516194A (en) * 2021-07-20 2021-10-19 海南长光卫星信息技术有限公司 Hyperspectral remote sensing image semi-supervised classification method, device, equipment and storage medium
CN113553984A (en) * 2021-08-02 2021-10-26 中再云图技术有限公司 Video mask detection method based on context assistance
CN113553936A (en) * 2021-07-19 2021-10-26 河北工程大学 Mask wearing detection method based on improved YOLOv3
CN113762201A (en) * 2021-09-16 2021-12-07 深圳大学 Mask detection method based on yolov4
CN113989708A (en) * 2021-10-27 2022-01-28 福州大学 Campus library epidemic prevention and control method based on YOLO v4
CN114092998A (en) * 2021-11-09 2022-02-25 杭州电子科技大学信息工程学院 Face recognition detection method for wearing mask based on convolutional neural network
CN114155453A (en) * 2022-02-10 2022-03-08 深圳爱莫科技有限公司 Training method for ice chest commodity image recognition, model and occupancy calculation method
CN114283462A (en) * 2021-11-08 2022-04-05 上海应用技术大学 Mask wearing detection method and system
CN114821702A (en) * 2022-03-15 2022-07-29 电子科技大学 Thermal infrared face recognition method based on face shielding
CN116311104A (en) * 2023-05-15 2023-06-23 合肥市正茂科技有限公司 Training method, device, equipment and medium for vehicle refitting recognition model
CN117975376A (en) * 2024-04-02 2024-05-03 湖南大学 Mine operation safety detection method based on depth grading fusion residual error network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2020101210A4 (en) * 2020-06-30 2020-08-06 Anguraj, Dinesh Kumar Dr Automated screening system of covid-19 infected persons by measurement of respiratory data through deep facial recognition
CN111626330A (en) * 2020-04-23 2020-09-04 南京邮电大学 Target detection method and system based on multi-scale characteristic diagram reconstruction and knowledge distillation
CN111862408A (en) * 2020-06-16 2020-10-30 北京华电天仁电力控制技术有限公司 Intelligent access control method
CN111881775A (en) * 2020-07-07 2020-11-03 烽火通信科技股份有限公司 Real-time face recognition method and device
CN112183471A (en) * 2020-10-28 2021-01-05 西安交通大学 Automatic detection method and system for standard wearing of epidemic prevention mask of field personnel

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626330A (en) * 2020-04-23 2020-09-04 南京邮电大学 Target detection method and system based on multi-scale characteristic diagram reconstruction and knowledge distillation
CN111862408A (en) * 2020-06-16 2020-10-30 北京华电天仁电力控制技术有限公司 Intelligent access control method
AU2020101210A4 (en) * 2020-06-30 2020-08-06 Anguraj, Dinesh Kumar Dr Automated screening system of covid-19 infected persons by measurement of respiratory data through deep facial recognition
CN111881775A (en) * 2020-07-07 2020-11-03 烽火通信科技股份有限公司 Real-time face recognition method and device
CN112183471A (en) * 2020-10-28 2021-01-05 西安交通大学 Automatic detection method and system for standard wearing of epidemic prevention mask of field personnel

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIANG XIAOMING ET AL: "YOLOv3_slim for face mask recognition", 《JOURNAL OF PHYSICS: CONFERENCE SERIES》 *
肖俊杰: "Face mask detection and standard wearing recognition based on YOLOv3 and YCrCb", 《软件》 (Software) *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553936A (en) * 2021-07-19 2021-10-26 河北工程大学 Mask wearing detection method based on improved YOLOv3
CN113516194A (en) * 2021-07-20 2021-10-19 海南长光卫星信息技术有限公司 Hyperspectral remote sensing image semi-supervised classification method, device, equipment and storage medium
CN113516194B (en) * 2021-07-20 2023-08-08 海南长光卫星信息技术有限公司 Semi-supervised classification method, device, equipment and storage medium for hyperspectral remote sensing images
CN113553984A (en) * 2021-08-02 2021-10-26 中再云图技术有限公司 Video mask detection method based on context assistance
CN113553984B (en) * 2021-08-02 2023-10-13 中再云图技术有限公司 Video mask detection method based on context assistance
CN113762201B (en) * 2021-09-16 2023-05-09 深圳大学 Mask detection method based on yolov4
CN113762201A (en) * 2021-09-16 2021-12-07 深圳大学 Mask detection method based on yolov4
CN113989708A (en) * 2021-10-27 2022-01-28 福州大学 Campus library epidemic prevention and control method based on YOLO v4
CN113989708B (en) * 2021-10-27 2024-06-04 福州大学 Campus library epidemic prevention and control method based on YOLO v4
CN114283462A (en) * 2021-11-08 2022-04-05 上海应用技术大学 Mask wearing detection method and system
CN114283462B (en) * 2021-11-08 2024-04-09 上海应用技术大学 Mask wearing detection method and system
CN114092998A (en) * 2021-11-09 2022-02-25 杭州电子科技大学信息工程学院 Face recognition detection method for wearing mask based on convolutional neural network
CN114155453A (en) * 2022-02-10 2022-03-08 深圳爱莫科技有限公司 Training method for ice chest commodity image recognition, model and occupancy calculation method
CN114821702A (en) * 2022-03-15 2022-07-29 电子科技大学 Thermal infrared face recognition method based on face shielding
CN116311104B (en) * 2023-05-15 2023-08-22 合肥市正茂科技有限公司 Training method, device, equipment and medium for vehicle refitting recognition model
CN116311104A (en) * 2023-05-15 2023-06-23 合肥市正茂科技有限公司 Training method, device, equipment and medium for vehicle refitting recognition model
CN117975376A (en) * 2024-04-02 2024-05-03 湖南大学 Mine operation safety detection method based on depth grading fusion residual error network
CN117975376B (en) * 2024-04-02 2024-06-07 湖南大学 Mine operation safety detection method based on depth grading fusion residual error network

Also Published As

Publication number Publication date
CN112949572B (en) 2022-11-25

Similar Documents

Publication Publication Date Title
CN112949572B (en) Slim-YOLOv3-based mask wearing condition detection method
CN111639692B (en) Shadow detection method based on attention mechanism
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN112801018B (en) Cross-scene target automatic identification and tracking method and application
US20190236411A1 (en) Method and system for multi-scale cell image segmentation using multiple parallel convolutional neural networks
CN100423020C (en) Human face identifying method based on structural principal element analysis
CN113361495B (en) Method, device, equipment and storage medium for calculating similarity of face images
CN107633226B (en) Human body motion tracking feature processing method
CN105160355B (en) A kind of method for detecting change of remote sensing image based on region correlation and vision word
CN113052185A (en) Small sample target detection method based on fast R-CNN
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN106599864A (en) Deep face recognition method based on extreme value theory
US8094971B2 (en) Method and system for automatically determining the orientation of a digital image
Teimouri et al. A real-time ball detection approach using convolutional neural networks
CN111274964A (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
Deeksha et al. Classification of Brain Tumor and its types using Convolutional Neural Network
Putro et al. Fast face-CPU: a real-time fast face detector on CPU using deep learning
CN111738194A (en) Evaluation method and device for similarity of face images
CN116824330A (en) Small sample cross-domain target detection method based on deep learning
CN116434010A (en) Multi-view pedestrian attribute identification method
Gowda Age estimation by LS-SVM regression on facial images
CN102156879A (en) Human target matching method based on weighted terrestrial motion distance
Nanthini et al. A novel Deep CNN based LDnet model with the combination of 2D and 3D CNN for Face Liveness Detection
Nasiri et al. Masked face detection using artificial intelligent techniques
CN112287929A (en) Remote sensing image significance analysis method based on feature integration deep learning network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant