CN111428604B - Facial mask recognition method, device, equipment and storage medium

Facial mask recognition method, device, equipment and storage medium

Info

Publication number
CN111428604B
Authority
CN
China
Prior art keywords
mask
feature map
sample image
target feature
facial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010194398.2A
Other languages
Chinese (zh)
Other versions
CN111428604A (en)
Inventor
李斯
赵齐辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongpu Software Co Ltd
Original Assignee
Dongpu Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongpu Software Co Ltd filed Critical Dongpu Software Co Ltd
Priority to CN202010194398.2A priority Critical patent/CN111428604B/en
Publication of CN111428604A publication Critical patent/CN111428604A/en
Application granted granted Critical
Publication of CN111428604B publication Critical patent/CN111428604B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems


Abstract

The invention relates to the field of biometric recognition, and discloses a facial mask recognition method, a facial mask recognition device, facial mask recognition equipment and a storage medium. The facial mask recognition method comprises the following steps: obtaining sample images and labeling each sample image to obtain label information of each sample image, wherein the sample images comprise facial images with a mask worn and facial images without a mask worn; inputting the sample images and the corresponding label information into a preset MASK R-CNN model, extracting an eye feature map and an overall facial feature map of each sample image through the MASK R-CNN model, and training to obtain a recognition model for identifying whether a mask is worn on a face; and acquiring an image to be detected, inputting the image to be detected into the recognition model for recognition, and outputting a recognition result. The embodiment can improve the accuracy with which a face recognition model detects whether a mask is worn on a face.

Description

Facial mask recognition method, device, equipment and storage medium
Technical Field
The invention relates to the field of biometric recognition, and in particular to a facial mask recognition method, a facial mask recognition device, facial mask recognition equipment and a storage medium.
Background
Face recognition is a technique for identifying a person based on that person's facial features. With the development of deep learning, face recognition accuracy has become higher and higher, the technology is used more and more widely in public places, and it has replaced manual work in certain fields. In the traffic field, for example, a face recognition device can detect whether the occupants of a vehicle are wearing seat belts, and on a construction site it can recognize whether the workers are wearing safety helmets.
However, existing face recognition models judge a face on the basis of a complete face. Because a mask blocks part of the facial information, the recognition rate of conventional face recognition models in judging whether a mask is worn on a face is currently only about 30%, so whether a person poses an infection risk still has to be judged manually.
Disclosure of Invention
The invention mainly aims to solve the technical problem that the accuracy with which face recognition models detect whether a mask is worn is low.
The first aspect of the present invention provides a facial mask recognition method, comprising:
obtaining sample images, and labeling each sample image to obtain label information of each sample image, wherein the sample images comprise facial images with a mask worn and facial images without a mask worn;
inputting the sample image and the corresponding label information into a preset MASK R-CNN model, extracting an eye feature map and an overall facial feature map of the sample image through the MASK R-CNN model, and training to obtain a recognition model for identifying whether a mask is worn on the face;
and acquiring an image to be detected, inputting the image to be detected into the recognition model for recognition, and outputting a recognition result.
Optionally, in a first implementation manner of the first aspect of the present invention, the MASK R-CNN model sequentially includes: a target feature extraction network, an RPN network, an ROI alignment layer and an FCN network;
the target feature extraction network is used for extracting a target feature map of the sample image, wherein the target feature map comprises an eye feature map, an overall facial feature map and a fusion feature map;
the RPN network is used for generating a pre-selection frame corresponding to the target feature map;
the ROI alignment layer is used for fusing the pre-selection frame with the target feature map, and for segmenting the pre-selection frame and pooling its endpoints, so as to generate a labeling feature map;
and the FCN network is used for predicting each pixel point of the labeling feature map to obtain a prediction result corresponding to the sample image.
Optionally, in a second implementation manner of the first aspect of the present invention, the training method of the MASK R-CNN model includes:
inputting the sample image and corresponding label information into a preset MASK R-CNN model;
extracting a target feature map of the sample image through the target feature extraction network;
inputting the target feature map into the RPN network so as to generate a pre-selected frame corresponding to the target feature map through the RPN network according to preset anchor frame information;
inputting the pre-selection frame and the target feature map into the ROI alignment layer, so as to fuse the pre-selection frame with the target feature map through the ROI alignment layer, and to segment the pre-selection frame and pool its endpoints to obtain a labeling feature map;
inputting the labeling feature map into the FCN network so as to predict each pixel point of the labeling feature map through the FCN network, obtaining a prediction result corresponding to the sample image and outputting the prediction result;
and optimizing parameters of the MASK R-CNN model according to the prediction result and the label information until the MASK R-CNN model converges to obtain an identification model.
Optionally, in a third implementation manner of the first aspect of the present invention, the generating, by the RPN network, a pre-selected box corresponding to the target feature map according to the preset anchor box information includes:
Acquiring preset anchor frame information through the RPN network, and generating candidate frames of each pixel point in the target feature map according to the anchor frame information;
judging whether the candidate frame contains a face wearing a mask;
if yes, reserving the candidate frame, and adjusting the candidate frame to obtain a preselected frame of the target feature map.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the extracting, by the target feature extraction network, a target feature map of the sample image includes:
extracting an overall facial feature map corresponding to the sample image through the target feature extraction network;
extracting an eye feature map from the overall facial feature map based on a preset eye feature attention mechanism;
and carrying out multi-level feature fusion on the eye feature map and the overall facial feature map to obtain a fusion feature map.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the predicting, by the FCN network, each pixel point of the labeling feature map, to obtain a prediction result corresponding to the sample image, and output the prediction result includes:
convolving the labeling feature map through the FCN network to generate a mask corresponding to the preselected frame and a first heat map containing predicted values corresponding to all pixel points;
Upsampling the first heat map to obtain a second heat map consistent with the size of the sample image;
and outputting the mask, the pre-selection frame and the second heat map as a prediction result corresponding to the sample image.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the optimizing parameters of the MASK R-CNN model according to the prediction result and the tag information until the MASK R-CNN model converges, and obtaining the identification model includes:
calculating a loss value between the prediction result and the tag information according to a preset loss function;
back-propagating the loss value to the MASK R-CNN model, and optimizing the parameters of the MASK R-CNN model according to a stochastic gradient descent method;
and if the MASK R-CNN model converges, taking the current MASK R-CNN model as an identification model.
A second aspect of the present invention provides a facial mask recognition device comprising:
the acquisition module is used for acquiring sample images and labeling each sample image to obtain label information of each sample image, wherein the sample images comprise facial images with a mask worn and facial images without a mask worn;
the training module is used for inputting the sample image and the corresponding label information into a preset MASK R-CNN model, so as to extract an eye feature map and an overall facial feature map of the sample image through the MASK R-CNN model for training, and to obtain a recognition model for identifying whether a mask is worn on the face;
the identification module is used for acquiring an image to be detected, inputting the image to be detected into the identification model for identification, and outputting an identification result.
Optionally, in a first implementation manner of the second aspect of the present invention, the MASK R-CNN model includes a target feature extraction network, an RPN network, an ROI alignment layer, and an FCN network;
the target feature extraction network is used for extracting a target feature map of the sample image, wherein the target feature map comprises an eye feature map, a face integral feature map and a fusion feature map;
the RPN network is used for generating a pre-selection frame corresponding to the target feature map;
the ROI alignment layer is used for fusing the pre-selection frame with the target feature map, and for segmenting the pre-selection frame and pooling its endpoints, so as to generate a labeling feature map;
and the FCN network is used for predicting each pixel point of the labeling feature map to obtain a prediction result corresponding to the sample image.
Optionally, in a second implementation manner of the second aspect of the present invention, the training module includes:
the input unit is used for inputting the sample image and the corresponding label information into a preset MASK R-CNN model;
an extraction unit configured to extract a target feature map of the sample image through the target feature extraction network;
the preselection frame unit is used for inputting the target feature map into the RPN network so as to generate a preselection frame corresponding to the target feature map through the RPN network according to preset anchor frame information;
the processing unit is used for inputting the pre-selection frame and the target feature map into the ROI alignment layer, so as to fuse the pre-selection frame with the target feature map through the ROI alignment layer, and to segment the pre-selection frame and pool its endpoints to obtain a labeling feature map;
the output unit is used for inputting the labeling feature diagram into the FCN network so as to predict each pixel point of the labeling feature diagram through the FCN network, and obtaining and outputting a corresponding prediction result of the sample image;
and the convergence unit is used for optimizing the parameters of the MASK R-CNN model according to the prediction result and the label information until the MASK R-CNN model converges to obtain an identification model.
Optionally, in a third implementation manner of the second aspect of the present invention, the pre-selection box unit is specifically configured to:
acquiring preset anchor frame information through the RPN network, and generating candidate frames of each pixel point in the target feature map according to the anchor frame information;
judging whether the candidate frame contains a face wearing a mask;
if yes, the candidate frame is reserved, and coordinates of the candidate frame are adjusted and regressed to obtain a preselected frame of the target feature map.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the extracting unit is specifically configured to:
extracting an overall facial feature map corresponding to the sample image through the target feature extraction network;
extracting an eye feature map from the overall facial feature map based on a preset eye feature attention mechanism;
and carrying out multi-level feature fusion on the eye feature map and the overall facial feature map to obtain a fusion feature map.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the output unit is specifically configured to:
convolving the labeling feature map through the FCN network to generate a mask corresponding to the pre-selection frame and a first heat map containing predicted values corresponding to all pixel points;
Upsampling the first heat map to obtain a second heat map consistent with the size of the sample image;
and outputting the mask, the pre-selection frame and the second heat map as a prediction result corresponding to the sample image.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the convergence unit is specifically configured to:
calculating a loss value between the prediction result and the tag information according to a preset loss function;
back-propagating the loss value to the MASK R-CNN model, and optimizing the parameters of the MASK R-CNN model according to a stochastic gradient descent method;
and if the MASK R-CNN model converges, taking the current MASK R-CNN model as an identification model.
A third aspect of the present invention provides a face-worn mask recognition apparatus comprising: the system comprises a memory and at least one processor, wherein instructions are stored in the memory, and the memory and the at least one processor are interconnected through a line; the at least one processor invokes the instructions in the memory to cause the facial mask recognition device to perform the facial mask recognition method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having instructions stored therein that, when run on a computer, cause the computer to perform the above-described face-worn mask recognition method.
In the technical solution provided by the invention, facial images with a mask worn and facial images without a mask worn are first taken as sample images, and the sample images are then labeled to obtain the corresponding label information. The sample images and the label information are input into a preset MASK R-CNN model. The MASK R-CNN model is a target recognition model capable of instance segmentation, and has higher accuracy and faster training speed than existing models. In the present invention, the MASK R-CNN model comprises a target feature extraction network, an RPN network, an ROI alignment layer and an FCN network. The target feature extraction network contains an eye attention mechanism, can extract the overall facial feature map and the eye feature map of a sample image, and generates a fusion feature map that fuses the overall facial feature map and the eye feature map; the multi-level fusion of the overall facial feature map and the eye feature map improves the accuracy of identifying whether a face is wearing a mask.
Drawings
Fig. 1 is a schematic view showing a first embodiment of a face-wearing mask recognition method according to an embodiment of the present invention;
fig. 2 is a schematic view of a second embodiment of a face-wearing mask recognition method according to an embodiment of the present invention;
fig. 3 is a schematic view of a third embodiment of a face-wearing mask recognition method according to an embodiment of the present invention;
Fig. 4 is a schematic view of a fourth embodiment of a face-wearing mask recognition method according to an embodiment of the present invention;
fig. 5 is a schematic view showing a fifth embodiment of a face-wearing mask recognition method according to an embodiment of the present invention;
fig. 6 is a schematic view of an embodiment of a face-worn mask recognition device according to an embodiment of the present invention;
fig. 7 is a schematic view of another embodiment of a face-worn mask recognition device according to an embodiment of the present invention;
fig. 8 is a schematic view of an embodiment of a face-worn mask recognition apparatus according to an embodiment of the present invention.
Detailed Description
The embodiments of the invention provide a facial MASK recognition method, device, equipment and storage medium. The MASK R-CNN model is used as the initial model; it is a target recognition model capable of instance segmentation and, compared with existing models, offers higher accuracy and faster training speed. In the invention, the MASK R-CNN model comprises a target feature extraction network that contains an eye attention mechanism, can extract the overall facial feature map and the eye feature map of a sample image, and generates a fusion feature map that fuses the overall facial feature map and the eye feature map; the multi-level fusion of the overall facial feature map and the eye feature map improves the accuracy of identifying whether a face is wearing a mask.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of an embodiment of the present invention is described below with reference to fig. 1, where an embodiment of a face wearing mask recognition method according to an embodiment of the present invention includes:
101. obtaining sample images, and labeling each sample image to obtain label information of each sample image, wherein the sample images comprise facial images of a worn mask and facial images of a non-worn mask;
It should be understood that the execution subject of the present invention may be a facial mask recognition device, a terminal or a server, which is not limited herein. The embodiments of the invention are described taking a server as the execution subject as an example.
In this embodiment, face images of the wearing mask and the non-wearing mask are acquired through a network, a camera, or the like, and taken as sample images.
The face wearing a mask in each sample image is then marked with a circle, a rectangle or an irregular polygon, and the marked position coordinates are stored as the label information.
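As an illustration only, the label information for a single sample image could be organized as follows; the field names and values in this sketch are assumptions and are not prescribed by this disclosure.

```python
# A minimal sketch of label information for one sample image.
# The field names ("image_path", "polygon", "category") are illustrative assumptions.
sample_label = {
    "image_path": "images/sample_0001.jpg",                      # hypothetical path
    "polygon": [(112, 80), (260, 80), (260, 240), (112, 240)],   # marked position coordinates
    "category": "mask_worn",                                     # or "mask_not_worn"
}
```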
102. Inputting the sample image and corresponding label information into a preset MASK R-CNN model, extracting an eye feature image and a facial integral feature image of the sample image through the MASK R-CNN model, and training to obtain an identification model for identifying whether a MASK is worn on the face;
MASK R-CNN is a target detection model proposed in 2017 that supports instance segmentation. It adds a fully convolutional segmentation sub-network on the basis of the earlier Faster R-CNN, turning the original classification and regression tasks into classification, regression and segmentation in a single framework, and it mainly comprises a target feature extraction network, an RPN network, an ROI alignment layer and an FCN network.
The target feature extraction network may be any of various convolutional neural networks, including the VGG series, the AlexNet series, and the like. The sample image is first input into the target feature extraction network, which extracts the eye feature map and the overall facial feature map of the sample image, and then obtains, through multi-level fusion, a fusion feature map combining the eye feature map and the overall facial feature map. These three feature maps are passed to the RPN network as the target feature maps.
The RPN (Region Proposal Network) is responsible for region selection. In the RPN network, preset anchor frame information is acquired, and it is judged whether each anchor frame contains the recognition target; if so, the anchor frame is retained and its position is adjusted, so as to obtain the pre-selection frames of the sample image.
The target feature map and the pre-selection frame are then fused through the ROI alignment (Region of Interest Align) layer to obtain the labeling feature map. ROI alignment is a region-of-interest matching method that solves the region mismatch caused by the two quantization steps in the ROI pooling operation.
Then, through the FCN (Fully Convolutional Network), the pre-selection frame of the labeling feature map, the probability value of each pixel point and the mask are obtained and output as the recognition result. The fully convolutional network replaces the conventional fully connected layer and is used for outputting a prediction result for the whole image.
Finally, the parameters of the model are optimized according to a stochastic gradient descent method, so as to obtain the recognition model.
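The following sketch illustrates, in rough outline only, how such a training loop could look using an off-the-shelf Mask R-CNN implementation; it uses a stock torchvision backbone rather than the modified network described in this disclosure (there is no eye-attention branch here), and the class count, learning rate and momentum are assumptions.

```python
# Sketch: fine-tuning a generic Mask R-CNN for "face wearing a mask" detection with SGD.
# Hyper-parameters and the two-class setup are assumptions, not values from this disclosure.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2  # background + face wearing a mask (assumed labeling scheme)

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
mask_in = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(mask_in, 256, num_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)  # stochastic gradient descent

def train_one_step(images, targets):
    """images: list of CxHxW tensors; targets: list of dicts with 'boxes', 'labels', 'masks'."""
    model.train()
    loss_dict = model(images, targets)   # classification, box and mask losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()                      # back-propagate the loss value
    optimizer.step()                     # optimize parameters by SGD
    return loss.item()
```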
103. And acquiring an image to be detected, inputting the image to be detected into the recognition model for recognition, and outputting a recognition result.
After the recognition model is obtained, the image to be detected is input into the recognition model. The recognition model extracts the target feature map through the target feature extraction network, the RPN network generates the pre-selection frame on the target feature map, the ROI alignment layer generates the labeling feature map, and finally the labeling feature map is input into the FCN network to obtain the recognition result.
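Continuing the earlier sketch, the recognition step might look roughly as follows; the score threshold and the helper name are assumptions.

```python
# Sketch of the recognition step with a trained model (continuation of the training sketch).
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor

def recognize(model, image_path, score_threshold=0.5):
    """Returns boxes, labels, scores and masks for detections above the (assumed) threshold."""
    model.eval()
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        output = model([image])[0]                 # dict with 'boxes', 'labels', 'scores', 'masks'
    keep = output["scores"] > score_threshold      # discard low-confidence pre-selection frames
    return {key: value[keep] for key, value in output.items()}
```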
In the embodiments of the invention, facial images with a mask worn and facial images without a mask worn are taken as sample images, which are then labeled to obtain the corresponding label information. The sample images and the label information are input into a preset MASK R-CNN model. The MASK R-CNN model is a target recognition model capable of instance segmentation, and has higher accuracy and faster training speed than existing models. In addition, during training of the MASK R-CNN model, the eye feature map and the overall facial feature information of the sample image can be extracted, thereby improving the recognition accuracy.
Referring to fig. 2, a second embodiment of a face mask recognition method according to an embodiment of the present invention includes:
201. Acquiring sample images, and labeling each sample image to obtain label information of each sample image;
202. inputting the sample image and corresponding label information into a preset MASK R-CNN model, wherein the MASK R-CNN model sequentially comprises: a target feature extraction network, an RPN network, an ROI alignment layer and an FCN network;
the target feature extraction network is used for extracting a target feature map of the sample image, wherein the target feature map comprises an eye feature map, a facial integral feature map and a fusion feature map;
the RPN network is used for generating a pre-selection frame corresponding to the target feature map;
the ROI alignment layer is used for dividing a preselected frame in the target feature map and fusing the preselected frame and the target feature map in an endpoint pooling mode to generate and obtain a marked feature map;
the FCN network is used for predicting each pixel point of the labeling feature map to obtain a prediction result corresponding to the sample image;
203. extracting a facial integral feature map corresponding to the sample image through the target feature extraction network;
the present embodiment preferably employs ResNet as the target feature extraction network. In the ResNet network constructed in this embodiment, there are a total of 5 stages, and each stage performs convolution operation. As stage deepens, the depth of the obtained feature map becomes deeper. The resulting graphs for these five stages are designated C1, C2, C3, C4 and C5, respectively. stage 3 extracts a face overall feature map, and C3 is a face overall feature extraction map.
204. Extracting an eye feature map in the facial overall feature map based on a preset eye feature attention mechanism;
Stage 5 extracts the eye feature map of the face, so C5 is the eye feature map. Notably, stage 5 works together with the preset eye feature attention mechanism when extracting the eye feature map.
In this embodiment, the eye attention mechanism is used to extract the eye features, which improves the accuracy of eye recognition and allows the face to be localized better. This embodiment preferably uses a soft attention mechanism to build the eye feature attention mechanism.
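The exact attention design is not specified here, so the following is only a minimal sketch of a soft spatial-attention module of the kind referred to above; the channel width and tensor shapes are assumptions.

```python
# Minimal soft spatial-attention sketch: each position of the feature map is weighted
# between 0 and 1 so that eye regions can be emphasized. Channel width is an assumption.
import torch
import torch.nn as nn

class SoftSpatialAttention(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)   # one attention score per position

    def forward(self, feature_map):
        weights = torch.sigmoid(self.score(feature_map))        # soft (differentiable) weights in [0, 1]
        return feature_map * weights                            # re-weighted eye feature map

# Example with a hypothetical C5 feature map of shape [N, 2048, H, W]
attention = SoftSpatialAttention(2048)
c5 = torch.randn(1, 2048, 7, 7)
eye_features = attention(c5)
```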
205. Performing multi-level feature fusion on the eye feature map and the whole facial feature map to obtain a fusion feature map;
At the feature fusion layer, a 1x1 convolution is performed on C5 to form the feature map P5. C4 also undergoes a convolution, and the resulting feature map is summed with P5 to generate P4; P2, P3, P4 and P5 are generated in the same way. The feature maps obtained at each stage are input into the feature fusion layer. To prevent jitter, the feature fusion layer preferably performs a 3x3 convolution on all the input feature maps and then downsamples P5, thereby achieving multi-level feature fusion and generating P6, where P6 is the fusion feature map.
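A minimal sketch of this kind of multi-level fusion is given below (an FPN-style construction); only C3 to C5 are shown, and the ResNet channel widths and the 256-channel output are assumptions.

```python
# Sketch of FPN-style multi-level feature fusion: 1x1 lateral convolutions, top-down sums,
# 3x3 smoothing convolutions, then a down-sampled P6. Channel widths are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        self.lat3 = nn.Conv2d(512, out_channels, 1)
        self.lat4 = nn.Conv2d(1024, out_channels, 1)
        self.lat5 = nn.Conv2d(2048, out_channels, 1)
        self.smooth3 = nn.Conv2d(out_channels, out_channels, 3, padding=1)  # 3x3 convs against "jitter"
        self.smooth4 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.smooth5 = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, c3, c4, c5):
        p5 = self.lat5(c5)
        p4 = self.lat4(c4) + F.interpolate(p5, size=c4.shape[-2:], mode="nearest")
        p3 = self.lat3(c3) + F.interpolate(p4, size=c3.shape[-2:], mode="nearest")
        p3, p4, p5 = self.smooth3(p3), self.smooth4(p4), self.smooth5(p5)
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)   # down-sample P5 to obtain the fused P6
        return p3, p4, p5, p6

fusion = FeatureFusion()
c3, c4, c5 = torch.randn(1, 512, 64, 64), torch.randn(1, 1024, 32, 32), torch.randn(1, 2048, 16, 16)
p3, p4, p5, p6 = fusion(c3, c4, c5)
```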
206. Inputting the target feature map into the RPN network so as to generate a pre-selected frame corresponding to the target feature map through the RPN network according to preset anchor frame information;
The target feature map is input into the RPN (Region Proposal Network). The RPN network acquires the preset anchor frame information, establishes anchor frames for all pixel points in the target feature map, and adjusts the anchor frames that contain a face wearing a mask, so that the selected range is more accurate.
207. Inputting the pre-selection frame and the target feature map into the ROI alignment layer, so as to fuse the pre-selection frame with the target feature map through the ROI alignment layer, and to segment the pre-selection frame and pool its endpoints to obtain a labeling feature map;
the ROI alignment identifies the location of the candidate box on the target feature map, thereby fusing the candidate box and the target feature map. The ROI alignment uses bilinear interpolation to divide the pre-selected frame, and then the divided end points are maximally pooled. The bilinear interpolation value effectively avoids the accumulation of boundary non-integer factors in the segmentation process, so that the fusion precision of the two is higher.
208. Inputting the labeling feature map into the FCN network so as to predict each pixel point of the labeling feature map through the FCN network, obtaining a prediction result corresponding to the sample image and outputting the prediction result;
The FCN network classifies the image at the pixel level, thereby solving image segmentation at the semantic level. The FCN converts the fully connected layers of a conventional CNN into convolutional layers; the map generated after the last convolution layer is called a heat map, and each pixel point on the heat map carries the probability that the pixel belongs to a certain class.
If, for example, the probability that a pixel point belongs to the mask-wearing class is 80%, the corresponding pixel of the image is marked with colour, so that the mask for that pixel point is obtained. The pixel points within the range of the pre-selection frame in the first heat map are classified and marked in this way, so that the mask corresponding to the pre-selection frame is obtained.
The first heat map is then up-sampled, expanding its size, to obtain a second heat map consistent with the size of the sample image.
And finally, outputting the mask, the pre-selection frame and the second heat map as a prediction result corresponding to the sample image.
209. Optimizing parameters of the MASK R-CNN model according to the prediction result and the label information until the MASK R-CNN model converges to obtain an identification model;
In the MASK R-CNN model, a loss value between the prediction result and the label information is calculated based on a preset loss function. The loss value is then passed back into the MASK R-CNN model by back-propagation, and the parameters of each network are optimized according to a stochastic gradient descent method. If the model converges, the current model is stored as the recognition model.
210. And acquiring an image to be detected, inputting the image to be detected into the recognition model for recognition, and outputting a recognition result.
A common object detection model performs multi-level feature extraction. As the number of layers of the neural network increases, feature extraction is refined from the low level to the high level. For example, the low level extracts the contour of a person's face, while the high level extracts finer features such as the eyes and nose. However, as the network deepens, each layer may lose some information, and eventually more information is lost. The invention aims to combine the eye feature data and the overall facial feature data of a person to judge whether a mask is worn, so the feature maps of the image are extracted in a multi-scale feature fusion manner, which greatly improves the recognition accuracy.
Referring to fig. 3, a third embodiment of a face wearing mask recognition method according to an embodiment of the present invention includes:
301. acquiring sample images, and labeling each sample image to obtain label information of each sample image;
302. inputting the sample image and corresponding label information into a preset MASK R-CNN model;
303. extracting a target feature map of the sample image through the target feature extraction network;
304. Acquiring preset anchor frame information through the RPN network, and generating candidate frames of each pixel point in the target feature map according to the anchor frame information;
anchor frame information such as size and number is preset. In this embodiment, the RPN network establishes nine anchor frames with different sizes for each pixel point in the feature map, and uses all the anchor frames as candidate frames.
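A small sketch of this candidate-frame generation step is given below: nine anchors per feature-map pixel, using three scales and three aspect ratios whose values are assumptions.

```python
# Sketch: nine anchor frames (3 scales x 3 aspect ratios -- assumed values) centred on
# every feature-map pixel, returned as (x1, y1, x2, y2) boxes in image coordinates.
import numpy as np

def make_candidate_frames(feat_h, feat_w, stride=16,
                          scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride    # pixel centre in image space
            for s in scales:
                for r in ratios:                               # r = width / height
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.asarray(anchors, dtype=np.float32)

candidate_frames = make_candidate_frames(feat_h=38, feat_w=50)   # 38 * 50 * 9 = 17100 candidates
```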
305. Judging whether the candidate frame contains a face wearing a mask;
The RPN network is also provided with a softmax classifier, through which it can be judged whether a candidate frame contains a face wearing a mask.
306. If yes, reserving the candidate frame, and adjusting the candidate frame to obtain a preselected frame of the target feature map;
if yes, the candidate box is saved, and the candidate box is input into a reshape layer and a proposal layer in the RPN network.
Since several candidate frames may overlap in the selected area, the overlapping area between candidate frames can be calculated by a non-maximum suppression method. In this embodiment, a threshold for the overlapping area may be set in advance; candidate frames whose overlapping area is larger than the threshold are retained, and candidate frames below the threshold are deleted. The coordinates of the candidate frames are then adjusted and regressed through a preset candidate-frame offset formula, so that the selected area is closer to the real range.
307. Inputting the pre-selection frame and the target feature map into the ROI alignment layer, so as to fuse the pre-selection frame with the target feature map through the ROI alignment layer, and to segment the pre-selection frame and pool its endpoints to obtain a labeling feature map;
308. inputting the labeling feature map into the FCN network so as to predict each pixel point of the labeling feature map through the FCN network, obtaining a prediction result corresponding to the sample image and outputting the prediction result;
309. optimizing parameters of the MASK R-CNN model according to the prediction result and the label information until the MASK R-CNN model converges to obtain the identification model;
310. and acquiring an image to be detected, inputting the image to be detected into the recognition model for recognition, and outputting a recognition result.
In this embodiment, the process by which the MASK R-CNN model generates the pre-selection frame of a sample image is described in detail. Because a softmax classifier is used for the judgement, and the reshape layer and the proposal layer are used to further adjust the candidate frames, the softmax classifier, the reshape layer and the proposal layer can all be trained during model training, thereby improving the accuracy with which the pre-selection frame selects the recognition target.
Referring to fig. 4, a fourth embodiment of the facial mask recognition method according to the embodiment of the present invention includes:
401. acquiring sample images, and labeling each sample image to obtain label information of each sample image;
402. inputting the sample image and corresponding label information into a preset MASK R-CNN model;
403. extracting a target feature map of the sample image through the target feature extraction network;
404. inputting the target feature map into the RPN network so as to generate a pre-selected frame corresponding to the target feature map through the RPN network according to preset anchor frame information;
405. inputting the pre-selection frame and the target feature map into the ROI alignment layer, so as to fuse the pre-selection frame with the target feature map through the ROI alignment layer, and to segment the pre-selection frame and pool its endpoints to obtain a labeling feature map;
406. convolving the labeling feature map through the FCN network to generate a mask corresponding to the preselected frame and a first heat map containing predicted values corresponding to all pixel points;
the FCN comprises a plurality of convolution layers, and after the input labeling feature images are rolled and mapped for a plurality of times, the obtained images are smaller and the resolution is lower. After the convolution of the last layer, the generated graph is called a heat graph, and each pixel point, i.e. the probability that the pixel point is of a certain type, of each pixel point on the heat graph.
If the probability of a pixel point is 80% of the probability of wearing the mask and the probability of not wearing the mask is 20%, the pixel point corresponding to the image is color-marked, so that the mask of the pixel point is obtained. And classifying and marking pixel points in the range of the preselected frame in the first heat map, so as to obtain a mask corresponding to the preselected frame.
407. Upsampling the first heat map to obtain a second heat map consistent with the size of the sample image;
The first heat map is then up-sampled, expanding its size, to obtain a second heat map consistent with the size of the sample image.
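A rough sketch of turning such a heat map into a mask and up-sampling it to the sample-image size follows; the class index, threshold and tensor sizes are assumptions.

```python
# Sketch: per-pixel class probabilities -> binary mask, plus bilinear up-sampling of the
# first heat map to the sample-image size. Class index, threshold and sizes are assumed.
import torch
import torch.nn.functional as F

def heatmap_to_mask(first_heatmap, sample_size, threshold=0.5):
    """first_heatmap: [N, num_classes, h, w] raw scores from the last FCN convolution."""
    probs = torch.softmax(first_heatmap, dim=1)                               # per-pixel probabilities
    second_heatmap = F.interpolate(probs, size=sample_size,
                                   mode="bilinear", align_corners=False)      # up-sample to image size
    mask = second_heatmap[:, 1] > threshold                                   # class 1 = mask-wearing (assumed)
    return mask, second_heatmap

first_heatmap = torch.randn(1, 2, 28, 28)                                     # hypothetical last-layer heat map
mask, second_heatmap = heatmap_to_mask(first_heatmap, sample_size=(512, 512))
```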
408. Outputting the mask, the pre-selection frame and the second heat map as prediction results corresponding to the sample image;
409. optimizing parameters of the MASK R-CNN model according to the prediction result and the label information until the MASK R-CNN model converges to obtain the identification model;
410. and acquiring an image to be detected, inputting the image to be detected into the recognition model for recognition, and outputting a recognition result.
In this embodiment, since the final recognition result includes the mask, the heat map and the pre-selection frame, the model can learn and be trained on the mask, the probability value of each pixel point and the position of the pre-selection frame, thereby improving the accuracy of the recognition result.
Referring to fig. 5, a fifth embodiment of a face mask recognition method according to an embodiment of the present invention includes:
501. acquiring sample images, and labeling each sample image to obtain label information of each sample image;
502. inputting the sample image and corresponding label information into a preset MASK R-CNN model;
503. extracting a target feature map of the sample image through the target feature extraction network;
504. inputting the target feature map into the RPN network so as to generate a pre-selected frame corresponding to the target feature map through the RPN network according to preset anchor frame information;
505. inputting the pre-selection frame and the target feature map into the ROI alignment layer, so as to fuse the pre-selection frame with the target feature map through the ROI alignment layer, and to segment the pre-selection frame and pool its endpoints to obtain a labeling feature map;
506. inputting the labeling feature map into the FCN network so as to predict each pixel point of the labeling feature map through the FCN network, obtaining a prediction result corresponding to the sample image and outputting the prediction result;
507. calculating a loss value between the prediction result and the tag information according to a preset loss function;
In the MASK R-CNN model, the loss function is L = L_cls + L_box + L_mask, where L_cls is the classification loss, i.e. the loss in the accuracy of pixel-point classification, L_box is the loss of the pre-selection frame, obtained by comparing the position coordinates of the pre-selection frame with the position coordinates of the selected frame in the label information, and L_mask is the loss of the mask. The loss value between the prediction result and the label information can thus be calculated by the preset loss function.
508. Back-propagating the loss value to the MASK R-CNN model, and optimizing the parameters of the MASK R-CNN model according to a stochastic gradient descent method;
the loss value is passed back into the MASK R-CNN model through back-propagation, and the parameters of each network are optimized according to a stochastic gradient descent method.
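The following short sketch shows how the combined loss and the stochastic-gradient-descent update fit together; the individual loss functions passed in are placeholders and their exact forms are assumptions.

```python
# Sketch of L = L_cls + L_box + L_mask and one SGD update. The component loss
# functions are placeholders; their exact forms are assumptions.
import torch

def combined_loss(pred, target, cls_loss_fn, box_loss_fn, mask_loss_fn):
    l_cls = cls_loss_fn(pred["scores"], target["labels"])     # classification loss
    l_box = box_loss_fn(pred["boxes"], target["boxes"])       # pre-selection frame loss
    l_mask = mask_loss_fn(pred["masks"], target["masks"])     # mask loss
    return l_cls + l_box + l_mask

# Typical use with an SGD optimizer over the model parameters:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
# loss = combined_loss(prediction, label_info, cls_fn, box_fn, mask_fn)
# loss.backward()      # propagate the loss value back through the MASK R-CNN model
# optimizer.step()     # optimize the parameters by stochastic gradient descent
```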
509. If the MASK R-CNN model converges, taking the current MASK R-CNN model as the identification model;
510. and acquiring an image to be detected, inputting the image to be detected into the recognition model for recognition, and outputting a recognition result.
This embodiment specifically describes how the model learns and is trained according to the prediction result once the prediction result has been obtained. The invention mainly back-propagates the loss value into the model and optimizes the parameters by a stochastic gradient descent method; since the calculated loss covers the mask, the pre-selection frame and the classification, these three results can be further optimized by this learning and training scheme, thereby improving the recognition efficiency.
The method for identifying the face-worn mask in the embodiment of the present invention is described above, and the face-worn mask identifying device in the embodiment of the present invention is described below, referring to fig. 6, where one embodiment of the face-worn mask identifying device in the embodiment of the present invention includes:
the acquiring module 601 is configured to acquire sample images, and to label each sample image to obtain label information of each sample image, where the sample images include facial images with a mask worn and facial images without a mask worn;
the training module 602 is configured to input the sample image and the corresponding label information into a preset MASK R-CNN model, so as to extract an eye feature map and an overall facial feature map of the sample image through the MASK R-CNN model for training, and to obtain a recognition model for identifying whether a mask is worn on the face;
the recognition module 603 is configured to obtain an image to be detected, input the image to be detected into the recognition model for recognition, and output a recognition result.
In the embodiments of the invention, facial images with a mask worn and facial images without a mask worn are taken as sample images, which are then labeled to obtain the corresponding label information. The sample images and the label information are input into a preset MASK R-CNN model. The MASK R-CNN model is a target recognition model capable of instance segmentation, and has higher accuracy and faster training speed than existing models. In addition, during training of the MASK R-CNN model, the eye feature map and the overall facial feature information of the sample image can be extracted, thereby improving the recognition accuracy.
Referring to fig. 7, another embodiment of the facial mask recognition device according to the present invention includes:
the acquiring module 701 is configured to acquire sample images, and to label each sample image to obtain label information of each sample image, where the sample images include facial images with a mask worn and facial images without a mask worn;
the training module 702 is configured to input the sample image and the corresponding label information into a preset MASK R-CNN model, so as to extract an eye feature map and an overall facial feature map of the sample image through the MASK R-CNN model for training, and to obtain a recognition model for identifying whether a mask is worn on the face;
the recognition module 703 is configured to obtain an image to be detected, input the image to be detected into the recognition model for recognition, and output a recognition result.
Optionally, the MASK R-CNN model includes a target feature extraction network, an RPN network, an ROI alignment layer, and an FCN network,
the target feature extraction network is used for extracting a target feature map of the sample image, wherein the target feature map comprises an eye feature map, a face integral feature map and a fusion feature map;
the RPN network is used for generating a pre-selection frame corresponding to the target feature map;
The ROI alignment layer is used for dividing a preselected frame in the target feature map and carrying out endpoint pooling to generate a labeling feature map;
and the FCN network is used for predicting each pixel point of the labeling feature map to obtain a prediction result corresponding to the sample image.
Wherein the training module 702 comprises:
an input unit 7021, configured to input the sample image and the corresponding label information into a preset MASK R-CNN model;
an extraction unit 7022 for extracting a target feature map of the sample image through the target feature extraction network;
a preselection frame unit 7023, configured to input the target feature map into the RPN network, so that a preselection frame corresponding to the target feature map is generated through the RPN network according to preset anchor frame information;
a processing unit 7024, configured to input the pre-selection frame and the target feature map into the ROI alignment layer, so that the pre-selection frame and the target feature map are fused through the ROI alignment layer, and the pre-selection frame is segmented and its endpoints pooled to obtain a labeling feature map;
the output unit 7025 is configured to input the labeling feature map into the FCN network, so that each pixel point of the labeling feature map is predicted by the FCN network, and a prediction result corresponding to the sample image is obtained and output;
and a convergence unit 7026, configured to optimize the parameters of the MASK R-CNN model according to the prediction result and the label information until the MASK R-CNN model converges, so as to obtain the recognition model.
Optionally, the preselection frame unit 7023 is specifically configured to:
acquiring preset anchor frame information through the RPN network, and generating candidate frames of each pixel point in the target feature map according to the anchor frame information;
judging whether the candidate frame contains a face wearing a mask;
if yes, the candidate frame is reserved, and coordinates of the candidate frame are adjusted and regressed to obtain a preselected frame of the target feature map.
Optionally, the extracting unit 7022 is specifically configured to:
extracting a facial integral feature map corresponding to the sample image through the target feature extraction network;
extracting an eye feature map in the whole facial feature map based on a preset eye feature attention mechanism;
and adopting the feature fusion layer to perform multistage feature fusion on the eye feature map and the whole facial feature map to obtain a fusion feature map.
Optionally, the output unit 7025 is specifically configured to:
convolve the labeling feature map through the FCN network to generate a mask corresponding to the pre-selection frame and a first heat map containing predicted values corresponding to all pixel points;
Upsampling the first heat map to obtain a second heat map consistent with the size of the sample image;
and outputting the mask, the pre-selection frame and the second heat map as a prediction result corresponding to the sample image.
Optionally, the convergence unit 7026 is specifically configured to:
calculating a loss value between the prediction result and the label information according to a preset loss function;
back-propagating the loss value to the MASK R-CNN model, and optimizing the parameters of the MASK R-CNN model according to a stochastic gradient descent method;
and if the MASK R-CNN model converges, taking the current MASK R-CNN model as an identification model.
In the technical solution provided by the invention, on the basis of the above embodiments, the MASK R-CNN model comprises a target feature extraction network, an RPN network, an ROI alignment layer and an FCN network. The target feature extraction network is used for generating a fusion feature map that fuses the overall facial feature map and the eye feature map, and the multi-level fusion of the overall facial feature map and the eye feature map improves the accuracy of identifying whether a mask is worn on the face. Since the final recognition result comprises the mask, the heat map and the pre-selection frame, the model can learn and be trained on the mask, the probability value of each pixel point and the position of the pre-selection frame, thereby improving the accuracy of the mask, the classification and the position of the pre-selection frame in the recognition result.
The face-worn mask recognition device in the embodiments of the present invention has been described above from the point of view of modularized functional entities with reference to fig. 6 and 7; the face-worn mask recognition apparatus in the embodiments of the present invention is described in detail below from the point of view of hardware processing.
Fig. 8 is a schematic structural diagram of a face-worn mask recognition apparatus according to an embodiment of the present invention. The face-worn mask recognition apparatus 800 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 810 (e.g., one or more processors), a memory 820, and one or more storage media 830 (e.g., one or more mass storage devices) storing application programs 833 or data 832. The memory 820 and the storage media 830 may be transitory or persistent storage. The programs stored in the storage media 830 may include one or more modules (not shown), each of which may include a series of instruction operations on the face-worn mask recognition apparatus 800. Further, the processor 810 may be configured to communicate with the storage media 830 and to execute the series of instruction operations in the storage media 830 on the face-worn mask recognition apparatus 800.
The face-worn mask recognition apparatus 800 may also include one or more power supplies 840, one or more wired or wireless network interfaces 880, one or more input/output interfaces 880, and/or one or more operating systems 831, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the structure of the face-worn mask recognition apparatus illustrated in fig. 8 does not constitute a limitation; the apparatus may include more or fewer components than illustrated, or combine certain components, or arrange the components differently.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, and which may also be a volatile computer readable storage medium, the computer readable storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform the steps of the facial mask recognition method.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments, which are not described in detail herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A facial mask recognition method, characterized by comprising the following steps:
obtaining sample images and labeling each sample image to obtain label information of each sample image, wherein the sample images comprise facial images with a mask worn and facial images without a mask worn;
inputting the sample images and the corresponding label information into a preset MASK R-CNN model, extracting an eye feature map and a whole-face feature map of each sample image through the MASK R-CNN model, and training to obtain a recognition model for identifying whether a mask is worn on a face;
acquiring an image to be detected, inputting the image to be detected into the recognition model for recognition, and outputting a recognition result;
wherein the MASK R-CNN model sequentially comprises: a target feature extraction network, an RPN network, an ROI Align layer and an FCN network;
the target feature extraction network is configured to extract a target feature map of the sample image, the target feature map comprising an eye feature map, a whole-face feature map and a fusion feature map;
the RPN network is configured to generate pre-selection boxes corresponding to the target feature map;
the ROI Align layer is configured to divide the pre-selection boxes in the target feature map and fuse the pre-selection boxes with the target feature map by means of endpoint pooling, so as to generate a labeled feature map; and
the FCN network is configured to predict each pixel point of the labeled feature map to obtain a prediction result corresponding to the sample image.
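As a rough illustration of how the four networks recited above might be wired together, the sketch below composes assumed backbone, rpn and fcn_head modules in the claimed order; torchvision's roi_align is used here merely as a stand-in for the claimed division and endpoint-pooling fusion, and the output size and spatial scale are assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class MaskRecognitionModel(nn.Module):
    def __init__(self, backbone: nn.Module, rpn: nn.Module, fcn_head: nn.Module):
        super().__init__()
        self.backbone = backbone    # target feature extraction network (eye/face/fusion maps)
        self.rpn = rpn              # generates pre-selection boxes from the target feature map
        self.fcn_head = fcn_head    # FCN that predicts a label for each pixel point

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)             # target feature map
        boxes = self.rpn(feats)                   # list of per-image pre-selection boxes
        rois = roi_align(feats, boxes, output_size=(14, 14),
                         spatial_scale=1.0 / 16)  # stand-in for the endpoint-pooling fusion
        return self.fcn_head(rois)                # per-pixel prediction for each box
```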
2. The facial mask recognition method according to claim 1, wherein the training of the recognition model comprises:
inputting the sample images and the corresponding label information into the preset MASK R-CNN model;
extracting a target feature map of the sample image through the target feature extraction network;
inputting the target feature map into the RPN network, so as to generate pre-selection boxes corresponding to the target feature map through the RPN network according to preset anchor box information;
inputting the pre-selection boxes and the target feature map into the ROI Align layer, so as to fuse the pre-selection boxes with the target feature map through the ROI Align layer, and dividing the pre-selection boxes and performing endpoint pooling to obtain a labeled feature map;
inputting the labeled feature map into the FCN network, so as to predict each pixel point of the labeled feature map through the FCN network, obtain a prediction result corresponding to the sample image and output the prediction result; and
optimizing parameters of the MASK R-CNN model according to the prediction result and the label information until the MASK R-CNN model converges, so as to obtain the recognition model.
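A minimal sketch of one pass of this training procedure follows, reusing the model composition sketched after claim 1; the criterion argument stands in for the preset loss function, and the tensor shapes are assumed to match.

```python
import torch
import torch.nn as nn

def training_step(model: nn.Module, images: torch.Tensor,
                  labels: torch.Tensor, criterion: nn.Module):
    predictions = model(images)             # feature extraction -> RPN -> ROI Align -> FCN
    loss = criterion(predictions, labels)   # compare the prediction result with the labels
    return predictions, loss
```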
3. The facial mask recognition method according to claim 2, wherein the generating, by the RPN network, pre-selection boxes corresponding to the target feature map according to preset anchor box information comprises:
acquiring the preset anchor box information through the RPN network, and generating candidate boxes for each pixel point in the target feature map according to the anchor box information;
judging whether each candidate box contains a face wearing a mask; and
if yes, retaining the candidate box and adjusting the candidate box to obtain a pre-selection box of the target feature map.
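Purely as an illustration of the candidate-box generation described above, the sketch below places boxes at every feature-map pixel from preset scales and aspect ratios; the stride, scales and ratios are assumed values, and the subsequent judgment of which boxes contain a masked face is left to the RPN head.

```python
import itertools
import torch

def generate_candidate_boxes(feat_h, feat_w, stride=16,
                             scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Place candidate boxes (x1, y1, x2, y2) at every feature-map pixel."""
    boxes = []
    for y, x in itertools.product(range(feat_h), range(feat_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride      # pixel centre in image space
        for s, r in itertools.product(scales, ratios):       # r is the width/height ratio
            w, h = s * (r ** 0.5), s / (r ** 0.5)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(boxes)                               # (feat_h * feat_w * 9, 4)
```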
4. The facial mask recognition method according to claim 2, wherein the extracting a target feature map of the sample image through the target feature extraction network comprises:
extracting a whole-face feature map corresponding to the sample image through the target feature extraction network;
extracting an eye feature map from the whole-face feature map based on a preset eye feature attention mechanism; and
performing multi-level feature fusion on the eye feature map and the whole-face feature map to obtain a fusion feature map.
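The sketch below is one possible reading of the eye-feature attention and fusion steps: a learned attention map emphasises the eye region of the whole-face feature map, and the two maps are concatenated and mixed. The layer shapes and the single fusion level are assumptions, not the patent's exact design.

```python
import torch
import torch.nn as nn

class EyeAttentionFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.attention = nn.Sequential(               # predicts an eye-region weighting map
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, face_feat: torch.Tensor):
        eye_feat = face_feat * self.attention(face_feat)             # eye feature map
        fused = self.fuse(torch.cat([face_feat, eye_feat], dim=1))   # fusion feature map
        return eye_feat, fused
```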
5. The facial mask recognition method according to claim 2, wherein the predicting each pixel point of the labeled feature map through the FCN network, obtaining a prediction result corresponding to the sample image and outputting the prediction result comprises:
convolving the labeled feature map through the FCN network to generate a mask corresponding to the pre-selection box and a first heat map containing predicted values corresponding to all pixel points;
upsampling the first heat map to obtain a second heat map of the same size as the sample image; and
outputting the mask, the pre-selection box and the second heat map as the prediction result corresponding to the sample image.
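A hedged sketch of this per-pixel prediction: a small FCN produces the first heat map, upsampling restores the sample-image size as the second heat map, and an argmax yields the mask. Channel counts, the number of classes, and the use of bilinear interpolation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskFCNHead(nn.Module):
    def __init__(self, in_channels: int = 256, num_classes: int = 2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, num_classes, kernel_size=1),   # first heat map: a value per pixel
        )

    def forward(self, roi_feat: torch.Tensor, image_size: tuple):
        heat = self.conv(roi_feat)
        heat_full = F.interpolate(heat, size=image_size, mode="bilinear",
                                  align_corners=False)    # second heat map, sample-image size
        mask = heat_full.argmax(dim=1)                     # mask for each pre-selection box
        return mask, heat_full
```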
6. The facial mask recognition method according to claim 2, wherein the optimizing parameters of the MASK R-CNN model according to the prediction result and the label information until the MASK R-CNN model converges to obtain the recognition model comprises:
calculating a loss value between the prediction result and the label information according to a preset loss function;
propagating the loss value back into the MASK R-CNN model, and optimizing the parameters of the MASK R-CNN model according to a stochastic gradient descent method; and
if the MASK R-CNN model converges, taking the current MASK R-CNN model as the recognition model.
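The following sketch illustrates this optimisation step, reusing the training_step helper sketched after claim 2; the cross-entropy loss, learning rate, and loss-stability convergence test are assumptions, since the claim only requires a preset loss function and stochastic gradient descent.

```python
import torch
import torch.nn as nn

def optimize_until_converged(model, data_loader, lr=0.01, tol=1e-4, max_epochs=50):
    criterion = nn.CrossEntropyLoss()            # assumed preset loss function
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    previous = float("inf")
    for _ in range(max_epochs):
        total = 0.0
        for images, labels in data_loader:
            # training_step is the helper sketched after claim 2
            _, loss = training_step(model, images, labels, criterion)
            optimizer.zero_grad()
            loss.backward()                      # propagate the loss back into the model
            optimizer.step()                     # stochastic gradient descent update
            total += loss.item()
        if abs(previous - total) < tol:          # treat a stable epoch loss as convergence
            break
        previous = total
    return model                                 # the resulting recognition model
```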
7. A facial mask recognition device, characterized in that the facial mask recognition device comprises:
an acquisition module, configured to acquire sample images and label each sample image to obtain label information of each sample image, wherein the sample images comprise facial images with a mask worn and facial images without a mask worn;
a training module, configured to input the sample images and the corresponding label information into a preset MASK R-CNN model, so as to extract an eye feature map and a whole-face feature map of each sample image through the MASK R-CNN model for training and obtain a recognition model for identifying whether a mask is worn on a face;
a recognition module, configured to acquire an image to be detected, input the image to be detected into the recognition model for recognition, and output a recognition result;
wherein the MASK R-CNN model comprises a target feature extraction network, an RPN network, an ROI Align layer and an FCN network;
the target feature extraction network is configured to extract a target feature map of the sample image, the target feature map comprising an eye feature map, a whole-face feature map and a fusion feature map;
the RPN network is configured to generate pre-selection boxes corresponding to the target feature map;
the ROI Align layer is configured to divide the pre-selection boxes in the target feature map and fuse the pre-selection boxes with the target feature map by means of endpoint pooling, so as to generate a labeled feature map; and
the FCN network is configured to predict each pixel point of the labeled feature map to obtain a prediction result corresponding to the sample image.
8. A facial mask recognition device, characterized in that the facial mask recognition device comprises: a memory and at least one processor, wherein the memory stores instructions, and the memory and the at least one processor are interconnected by a line;
the at least one processor invokes the instructions in the memory to cause the facial mask recognition device to perform the facial mask recognition method according to any one of claims 1-6.
9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the facial mask recognition method according to any one of claims 1-6.
CN202010194398.2A 2020-03-19 2020-03-19 Facial mask recognition method, device, equipment and storage medium Active CN111428604B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010194398.2A CN111428604B (en) 2020-03-19 2020-03-19 Facial mask recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010194398.2A CN111428604B (en) 2020-03-19 2020-03-19 Facial mask recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111428604A CN111428604A (en) 2020-07-17
CN111428604B true CN111428604B (en) 2023-06-13

Family

ID=71548133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010194398.2A Active CN111428604B (en) 2020-03-19 2020-03-19 Facial mask recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111428604B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111739027B (en) * 2020-07-24 2024-04-26 腾讯科技(深圳)有限公司 Image processing method, device, equipment and readable storage medium
CN112036245A (en) * 2020-07-30 2020-12-04 拉扎斯网络科技(上海)有限公司 Image detection method, information interaction method and device and electronic equipment
CN111931623A (en) * 2020-07-31 2020-11-13 南京工程学院 Face mask wearing detection method based on deep learning
CN111985374B (en) * 2020-08-12 2022-11-15 汉王科技股份有限公司 Face positioning method and device, electronic equipment and storage medium
CN112001872B (en) 2020-08-26 2021-09-14 北京字节跳动网络技术有限公司 Information display method, device and storage medium
CN112115818B (en) * 2020-09-01 2022-03-11 燕山大学 Mask wearing identification method
CN112052789B (en) * 2020-09-03 2024-05-14 腾讯科技(深圳)有限公司 Face recognition method and device, electronic equipment and storage medium
CN112183461A (en) * 2020-10-21 2021-01-05 广州市晶华精密光学股份有限公司 Vehicle interior monitoring method, device, equipment and storage medium
CN112597867B (en) * 2020-12-17 2024-04-26 佛山科学技术学院 Face recognition method and system for wearing mask, computer equipment and storage medium
CN112560756A (en) * 2020-12-24 2021-03-26 北京嘀嘀无限科技发展有限公司 Method, device, electronic equipment and storage medium for recognizing human face
US11436881B2 (en) 2021-01-19 2022-09-06 Rockwell Collins, Inc. System and method for automated face mask, temperature, and social distancing detection
CN113239739B (en) * 2021-04-19 2023-08-01 深圳市安思疆科技有限公司 Wearing article identification method and device
CN114663966B (en) * 2022-05-25 2023-06-16 深圳市博德致远生物技术有限公司 Information acquisition management method and related device based on artificial intelligence

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011118588A (en) * 2009-12-02 2011-06-16 Honda Motor Co Ltd Mask wearing determination apparatus
CN109101923A (en) * 2018-08-14 2018-12-28 罗普特(厦门)科技集团有限公司 A kind of personnel wear the detection method and device of mask situation
CN109902584A (en) * 2019-01-28 2019-06-18 深圳大学 A kind of recognition methods, device, equipment and the storage medium of mask defect

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deng Huangxiao. A mask-wearing detection method based on transfer learning and RetinaNet. Electronic Technology & Software Engineering, 2020, (05), full text. *

Also Published As

Publication number Publication date
CN111428604A (en) 2020-07-17

Similar Documents

Publication Publication Date Title
CN111428604B (en) Facial mask recognition method, device, equipment and storage medium
US11188783B2 (en) Reverse neural network for object re-identification
WO2021212659A1 (en) Video data processing method and apparatus, and computer device and storage medium
CN106548127B (en) Image recognition method
JP6330385B2 (en) Image processing apparatus, image processing method, and program
CN109858372B (en) Lane-level precision automatic driving structured data analysis method
CN103136504B (en) Face identification method and device
CN109242869A (en) A kind of image instance dividing method, device, equipment and storage medium
CN113361495B (en) Method, device, equipment and storage medium for calculating similarity of face images
WO2020062360A1 (en) Image fusion classification method and apparatus
KR20200060194A (en) Method of predicting depth values of lines, method of outputting 3d lines and apparatus thereof
JPWO2015025704A1 (en) Video processing apparatus, video processing method, and video processing program
CN104915642B (en) Front vehicles distance measuring method and device
JP2005190400A (en) Face image detection method, system, and program
CN117157678A (en) Method and system for graph-based panorama segmentation
CN117392733B (en) Acne grading detection method and device, electronic equipment and storage medium
CN112966618A (en) Dressing identification method, device, equipment and computer readable medium
CN112669343A (en) Zhuang minority nationality clothing segmentation method based on deep learning
CN110969642B (en) Video filtering method and device, electronic equipment and storage medium
CN113705294A (en) Image identification method and device based on artificial intelligence
WO2023279799A1 (en) Object identification method and apparatus, and electronic system
JP2019109843A (en) Classification device, classification method, attribute recognition device, and machine learning device
CN112766176B (en) Training method of lightweight convolutional neural network and face attribute recognition method
CN113221667A (en) Face and mask attribute classification method and system based on deep learning
Agunbiade et al. Enhancement performance of road recognition system of autonomous robots in shadow scenario

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant