
Method and related device for training a face occlusion recognition model

Info

Publication number
CN113536965A
Authority
CN
China
Prior art keywords
image
layer
face
training
detected
Prior art date
Legal status
Granted
Application number
CN202110711009.3A
Other languages
Chinese (zh)
Other versions
CN113536965B (en)
Inventor
曾梦萍 (Zeng Mengping)
Current Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Original Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority to CN202110711009.3A
Publication of CN113536965A
Application granted
Publication of CN113536965B
Status: Active
Anticipated expiration

Classifications

    • G06F 18/214 - Pattern recognition; analysing; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/2415 - Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
    • G06N 3/045 - Computing arrangements based on biological models; neural networks; combinations of networks
    • G06N 3/08 - Computing arrangements based on biological models; neural networks; learning methods


Abstract

The embodiments of the invention relate to the technical field of intelligent recognition and disclose a method and a related device for training a face occlusion recognition model. The preset neural network used to train the model comprises a feature extraction network, which comprises a standard convolutional layer and N depthwise separable convolutional layers arranged layer by layer; the stride of the depthwise convolutional layer in the first M of the N depthwise separable convolutional layers is a preset value greater than 1, and M is less than or equal to N. Because the depthwise separable convolutional layers have few parameters and low computational cost, the preset neural network is lightweight and fast to train; and because the M depthwise separable convolutional layers with stride greater than 1 perform downsampling, the feature map they generate has low resolution, a large receptive field, and spatial invariance, so the accuracy of the trained face occlusion recognition model can be improved.

Description

Method and related device for training a face occlusion recognition model
Technical Field
The embodiments of the invention relate to the technical field of intelligent recognition, and in particular to a method and a related device for training a face occlusion recognition model.
Background
With the continuous progress of machine learning technology, recognition technology is ever more widely applied in daily life. When analyzing a face, some application scenarios require detecting whether the face is occluded and identifying the category of the occlusion.
For example, during the current epidemic, public places need to detect whether people are wearing masks. In daily skin analysis, accessories on the face may need to be detected, such as whether a hat or glasses are worn. Existing occlusion recognition models merely judge whether an occlusion exists based on pixel differences and cannot identify the type of the occlusion. Even if an existing generic object recognition algorithm is applied to face occlusion recognition, it is easily disturbed by facial features and has low accuracy.
Disclosure of Invention
The embodiments of the invention mainly solve the technical problem of providing a method for training a face occlusion recognition model, a method for recognizing face occlusion, an electronic device, and a storage medium. The face occlusion recognition model trained with this method can accurately recognize multiple occlusion types, and the model converges rapidly during training.
In order to solve the above technical problem, in a first aspect, an embodiment of the present invention provides a method for training a face occlusion recognition model, including:
acquiring an image sample set, wherein each image in the image sample set comprises a human face;
cropping the face area of a target image to generate a face area image, wherein the target image is any image in the image sample set;
dividing the face area image into at least one local area image, wherein each local area image is labeled with a real label, one local area image and its real label form a sample pair, and the real label comprises an occlusion type;
taking the at least one sample pair corresponding to each image in the image sample set as a training set, inputting the training set into a preset neural network for training, and stopping training when an iteration termination condition is met, so as to obtain a face occlusion recognition model;
wherein the preset neural network comprises a feature extraction network, the feature extraction network comprises a standard convolutional layer and N depthwise separable convolutional layers arranged layer by layer, and each depthwise separable convolutional layer comprises a depthwise convolutional layer and a pointwise convolutional layer arranged layer by layer;
the stride of the depthwise convolutional layer in the first M of the N depthwise separable convolutional layers is a preset value, the preset value is greater than 1, and M is less than or equal to N.
In some embodiments, each depthwise separable convolutional layer further comprises a first linear convolutional layer in which each convolution kernel is 1 × 1 in size and a second linear convolutional layer in which each convolution kernel is 1 × 1 in size;
wherein the first linear convolutional layer is located between the depthwise convolutional layer and the pointwise convolutional layer, and the second linear convolutional layer is located after the pointwise convolutional layer;
the number of convolution kernels in the first linear convolutional layer is a preset multiple of the number of convolution kernels in the depthwise convolutional layer, the preset multiple being greater than 1;
the number of convolution kernels in the second linear convolutional layer is the same as the number of convolution kernels in the pointwise convolutional layer.
In some embodiments, the method further comprises:
smoothing the real label of each sample pair to obtain smoothed real labels, so that the smoothed real labels participate in training the preset neural network, wherein smoothing means adding noise to the real label.
In some embodiments, the step of smoothing the real label of each sample pair to obtain each smoothed real label includes:
smoothing the target real label according to the following formula to obtain a smoothed target real label, where the target real label is any real label:

$\hat{y}_k = y_k(1 - \alpha) + \alpha / K$

where k is the occlusion class, $\hat{y}_k$ is the probability of class k in the smoothed target real label, $y_k$ is the probability of class k in the target real label ($y_k$ equals 1 when occlusion class k is the correct classification and 0 when it is an incorrect classification), $\alpha$ is a preset parameter value, and K is the total number of occlusion classes in the training set.
In some embodiments, before the step of inputting the training set into a preset neural network for training, the method further includes:
performing data augmentation processing on the training set.
In some embodiments, the local area image is an image of a local area of the face area image that exhibits a geometric feature.
In order to solve the above technical problem, in a second aspect, an embodiment of the present invention provides a method for recognizing face occlusion, including:
acquiring an image to be detected, wherein the image to be detected comprises a human face;
cropping the face area of the image to be detected to generate a face area image to be detected;
dividing the face area image to be detected into at least one local area image to be detected;
inputting the at least one local area image to be detected into the face occlusion recognition model obtained by the method of the first aspect, the face occlusion recognition model outputting the occlusion type of each local area image to be detected;
and determining the occlusion condition of the image to be detected according to the occlusion type of each local area image to be detected.
In some embodiments, the method further comprises:
acquiring the area attribute of a target local area image to be detected, wherein the area attribute reflects the geometric feature of the face included in the target local area image to be detected, and the target local area image to be detected is any local area image to be detected;
judging whether the area attribute matches the occlusion type of the target local area image to be detected;
and if not, determining that the target local area image to be detected contains no occlusion.
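For illustration only, the following minimal Python sketch shows how the recognition flow of the second aspect could be wired together; the helper functions, class names, and region-to-occlusion mapping are hypothetical assumptions, not part of the original disclosure.

```python
import numpy as np

# Hypothetical class list and region/occlusion compatibility map (not from the patent).
CLASS_NAMES = ["mask", "hat", "glasses", "bangs", "facial_mask", "nose_patch", "none"]
REGION_CLASSES = {
    "forehead": {"hat", "bangs", "none"},
    "eyes": {"glasses", "facial_mask", "none"},
    "nose_to_chin": {"mask", "nose_patch", "none"},
}

def recognize_occlusions(image, model, crop_face_region, split_local_regions):
    face = crop_face_region(image)        # crop the face area of the image to be detected
    regions = split_local_regions(face)   # {region name: 64x64x3 local area image}
    occlusions = {}
    for name, region_img in regions.items():
        probs = model.predict(region_img[np.newaxis, ...])[0]
        label = CLASS_NAMES[int(np.argmax(probs))]
        # Region-attribute check: a predicted occlusion that cannot occur in this
        # region (e.g. "hat" in the nose-to-chin area) is treated as no occlusion.
        if label not in REGION_CLASSES[name]:
            label = "none"
        occlusions[name] = label
    return occlusions                     # occlusion condition of the image to be detected
```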
In order to solve the above technical problem, in a third aspect, an embodiment of the present invention provides an electronic device, including:
at least one processor, and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
In order to solve the technical problem described above, in a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer-executable instructions for causing an electronic device to perform the method according to the first aspect.
The embodiments of the invention have the following beneficial effects. Unlike the prior art, in the method for training a face occlusion recognition model provided by the embodiments of the invention, the images used for training are all local area images, so the preset neural network learns the features of local areas of the image. Compared with learning the features of the whole image, learning from local area images lets the network learn the local features of a specific area better and excludes interference from other areas, so the preset neural network converges rapidly to yield the face occlusion recognition model, and the classification accuracy of the trained model is improved. Second, dividing the image into local areas reduces the image size, which speeds up computation in both model training and model prediction. In addition, the preset neural network comprises a feature extraction network comprising a standard convolutional layer and N depthwise separable convolutional layers arranged layer by layer, each depthwise separable convolutional layer comprising a depthwise convolutional layer and a pointwise convolutional layer arranged layer by layer, where the stride of the depthwise convolutional layer in the first M of the N depthwise separable convolutional layers is a preset value greater than 1 and M is less than or equal to N. Because the depthwise separable convolutional layers have few parameters and low computational cost, the parameter count and computation of the feature extraction network, and hence of the whole preset neural network, are effectively reduced, making the network lightweight and fast to train. Downsampling with the M depthwise separable convolutional layers of stride greater than 1 enlarges the receptive field of the network and emphasizes the invariance of the feature space, so that the finally generated feature map has low resolution, a large receptive field, and spatial invariance, and thus the accuracy of the trained face occlusion recognition model can be improved.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements, and which are not to scale unless otherwise specified.
FIG. 1 is a schematic diagram of the operating environment of the method for training a face occlusion recognition model and the method for recognizing face occlusion according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for training a face occlusion recognition model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the layer structure of a neural network according to an embodiment of the present invention;
FIG. 5(a) is a schematic diagram illustrating the convolution operation of a standard convolutional layer according to an embodiment of the present invention, FIG. 5(b) is a schematic diagram illustrating the convolution operation of a depthwise convolutional layer according to an embodiment of the present invention, and FIG. 5(c) is a schematic diagram illustrating the convolution operation of a pointwise convolutional layer according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the layer structure of a depthwise separable convolutional layer according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating a method for recognizing face occlusion according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating a method for recognizing face occlusion according to another embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that, where they do not conflict, the various features of the embodiments of the invention may be combined with each other within the scope of protection of the present application. Additionally, although functional modules are divided in the device schematics and logical sequences are shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the module division or the flowchart order. Further, the terms "first", "second", "third", and the like used herein do not limit the data or the execution order, but merely distinguish identical or similar items having substantially the same function and effect.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
FIG. 1 is a schematic diagram of the operating environment of the related methods in an embodiment of the present invention, where the related methods include the method for training a face occlusion recognition model and the method for recognizing face occlusion. Referring to FIG. 1, the operating environment includes an electronic device 10 and an image acquisition apparatus 20, which are communicatively connected.
The communication connection may be a wired connection, for example a fiber optic cable, or a wireless communication connection, such as a WIFI connection, a Bluetooth connection, a 4G wireless communication connection, a 5G wireless communication connection, and so on.
The image acquisition apparatus 20 is configured to acquire an image sample set, where each image in the image sample set includes a human face, and may also be configured to acquire an image to be detected, where the image to be detected includes a human face. The image acquisition apparatus 20 may be a terminal capable of capturing images, for example a mobile phone, a tablet computer, a video recorder, or a camera.
The electronic device 10 is a device capable of automatically processing massive data at high speed according to a program, and generally consists of a hardware system and a software system, for example a computer or a smartphone. The electronic device 10 may be a local device directly connected to the image acquisition apparatus 20; it may also be a cloud device, for example a cloud server, cloud host, cloud service platform, or cloud computing platform. The cloud device is connected to the image acquisition apparatus 20 through a network, and the two communicate via a predetermined protocol, which in some embodiments may be TCP/IP, NETBEUI, IPX/SPX, etc.
It can be understood that the image acquisition apparatus 20 and the electronic device 10 may also be integrated as a single device, such as a computer with a camera or a smartphone.
The electronic device 10 receives the image sample set sent by the image acquisition apparatus 20, where each image in the image sample set includes a human face; the electronic device 10 trains on the image sample set to obtain a face occlusion recognition model, and uses the face occlusion recognition model to detect the occlusion types on the face in an image to be detected sent by the image acquisition apparatus 20. It is understood that the method for training the face occlusion recognition model and the method for recognizing face occlusion may be executed on the same electronic device or on different electronic devices.
Based on FIG. 1, another embodiment of the present invention provides an electronic device 10. Please refer to FIG. 2, which is a hardware structure diagram of the electronic device 10 according to an embodiment of the present invention. Specifically, as shown in FIG. 2, the electronic device 10 includes at least one processor 11 and a memory 12 that are communicatively connected (in FIG. 2, connection by a bus and a single processor are taken as an example).
The processor 11 is configured to provide computing and control capabilities to control the electronic device 10 to perform corresponding tasks, for example, to control the electronic device 10 to perform any of the methods for training a face occlusion recognition model or the methods for recognizing face occlusion provided by the embodiments of the invention below.
It is understood that the processor 11 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The memory 12, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for training a face occlusion recognition model in the embodiments of the present invention, or the program instructions/modules corresponding to the method for recognizing face occlusion in the embodiments of the present invention. By executing the non-transitory software programs, instructions, and modules stored in the memory 12, the processor 11 can implement the method for training a face occlusion recognition model in any of the method embodiments described below, and can implement the method for recognizing face occlusion in any of the method embodiments described below. In particular, the memory 12 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 12 may also include memory located remotely from the processor, which may be connected to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The method for training a face occlusion recognition model according to an embodiment of the present invention is described in detail below. Referring to FIG. 3, the method S20 includes, but is not limited to, the following steps:
s21: an image sample set is obtained, and each image in the image sample set comprises a human face.
S22: and intercepting a face area of a target image to generate a face area image, wherein the target image is any one image in the image sample set.
S23: dividing the face region image into at least one local region image, wherein each local region image is marked with a real label, one local region image and the real label marked by the local region image are used as a sample pair, and the real label comprises a blocking object type.
S24: taking at least one sample pair corresponding to each image in the image sample set as a training set, inputting the training set into a preset neural network for training, and stopping training until an iteration termination condition is met so as to obtain a face shielding recognition model;
the preset neural network comprises a feature extraction network, the feature extraction network comprises a common convolutional layer and N depth separable convolutional layers which are arranged layer by layer, and one depth separable convolutional layer comprises a depth convolutional layer and a point-by-point convolutional layer which are arranged layer by layer;
the step length of a depth convolution layer in the first M depth separable convolution layers in the N depth separable convolution layers is a preset value, the preset value is larger than 1, and M is smaller than or equal to N.
Each image in the image sample set includes a human face and can be acquired by the image acquisition apparatus; for example, the image sample set may consist of ID photos or self-portrait photos acquired by the image acquisition apparatus. It can be understood that these images contain occlusions, that is, at least part of the face is occluded, and the occlusion types can be set according to the scenario the model is meant to recognize.
In one optional scenario, for example in a public place where it is necessary to detect whether people wear masks, the face in each image of the image sample set wears a mask. In another optional scenario, when detecting accessories on the face, for example whether a user wears a hat or glasses, has bangs, or wears a facial mask or a nose patch, the faces in some images wear hats, the faces in some images wear glasses, the faces in some images have bangs, the faces in some images wear facial masks, and the faces in some images wear nose patches. It is understood that one or more occlusion categories may exist in the same image; for example, a face in one image may have both glasses and bangs. Of course, the two usage scenarios may also be combined, that is, six types of occlusions need to be detected: mask, hat, glasses, bangs, facial mask, and nose patch. In this scenario, the face in one image may, for example, simultaneously wear glasses and a mask and have bangs.
It will be appreciated that an image comprises a face and a background, where the face is the target area for occlusion detection. In order to reduce the interference of the background with occlusion detection and to reduce the training time of the subsequent algorithm model, each image in the sample set is cropped. The following takes a target image as an example, where the target image is any image in the image sample set. The face area of the target image is cropped to generate a face area image. Specifically, as shown in FIG. 4, several key points of the face, including points in areas such as the eyebrows, eyes, nose, mouth, and face contour, can be located by an existing face keypoint algorithm. Then, the face area is cropped according to the face contour.
The existing face keypoint algorithm may be Active Appearance Models (AAMs), Constrained Local Models (CLMs), Explicit Shape Regression (ESR), or the Supervised Descent Method (SDM).
Then, the face area image is divided into at least one local area image, where a local area image is an image of a local area of the face area image that exhibits a geometric feature; for example, the eye area, forehead area, nose area, and mouth area all belong to local areas of the face area image that exhibit geometric features. Accordingly, the local area images include any one or more of the following: a forehead area image, an eye area image, and a nose-to-chin area image. For example, when it is only necessary to detect whether a person wears a mask in a public place, the at least one local area image may include only the nose-to-chin area image. When detecting whether the user wears a hat or glasses, has bangs, or wears a facial mask or nose patch, the at least one local area image includes the forehead area image, the eye area image, and the nose-to-chin area image. When all six occlusion types (mask, hat, glasses, bangs, facial mask, and nose patch) need to be detected, the at least one local area image likewise includes the forehead area image, the eye area image, and the nose-to-chin area image.
Each local area image can likewise be obtained by using the face keypoint algorithm to locate key points of areas such as the eyes, nose, mouth, and face contour, and then cropping the local areas according to the coordinate information of the key points of each area. To facilitate network learning, the cropped local area images are scaled to a consistent size, for example 64 × 64 × 3.
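For illustration, the following is a minimal sketch of the cropping and scaling just described, assuming the keypoints are already available as (x, y) pixel coordinates from any of the keypoint algorithms named above; the index groups in the usage comments are hypothetical and depend on the detector used.

```python
import cv2
import numpy as np

def crop_region(image, points, size=(64, 64)):
    """Crop the bounding box of the given keypoints and scale to a fixed size."""
    pts = np.asarray(points, dtype=np.int32)
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    region = image[y0:y1, x0:x1]
    return cv2.resize(region, size)   # every local area image becomes 64 x 64 x 3

# Hypothetical usage; FOREHEAD_IDX etc. depend on the keypoint detector:
# forehead_img  = crop_region(face_img, keypoints[FOREHEAD_IDX])
# eye_img       = crop_region(face_img, keypoints[EYE_IDX])
# nose_chin_img = crop_region(face_img, keypoints[NOSE_CHIN_IDX])
```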
It is understood that each local area image is labeled with a real label, and the real label includes an occlusion type. For example, if the local area image is a forehead area image, the occlusion type may be bangs or hat; if it is an eye area image, the occlusion type may be glasses or a facial mask; and if it is a nose-to-chin area image, the occlusion type may be a mask or a nose patch. It can be understood that if the local area image is not occluded, its real label is "not occluded".
A local area image and its real label are taken as a sample pair; for example, if the real label of local area image 1# is "mask", then (local area image 1#, mask) is a sample pair. It is understood that if the image sample set includes 400 images and 3 local area images are obtained from each image, then 400 × 3 = 1200 sample pairs can be obtained.
The at least one sample pair corresponding to each image in the image sample set is used as the training set; for example, in the above example the 1200 sample pairs are used as the training set. Then, the training set is input into the preset neural network for training, and training stops when the iteration termination condition is met, at which point the face occlusion recognition model is obtained.
The preset neural network can be set up in advance based on an existing deep learning framework, for example using the Keras framework. It is understood that the preset neural network may directly invoke an existing algorithm network in the deep learning framework, such as MobileNetV1, MobileNetV2, or MobileNetV3, and may also modify the existing network structure according to the training set and the requirements, such as adding or removing layers, or changing parameters within layers, where the layers include, but are not limited to, the convolutional layers, normalization layers, or activation function layers in the network.
When the training set is input into the preset neural network, the preset neural network learns the training set using its model parameters and outputs a prediction result. A prediction result is output at each learning iteration; in each iteration the preset neural network adjusts its model parameters, continuously correcting the error between the output prediction results and the real labels, i.e., the training-set error. Training stops when the iteration termination condition is met; the corresponding optimal model parameters are then obtained, and the preset neural network with the optimal model parameters is the trained face occlusion recognition model. In some embodiments, the iteration termination condition may be reaching an iteration count threshold, that is, training stops when the number of training iterations reaches the threshold, and the model parameters at that point are taken as the optimal model parameters to obtain the face occlusion recognition model. It is understood that, in some embodiments, the iteration termination condition may also be that the training-set error fluctuates within a preset range, and the model parameters at that point are taken as the optimal model parameters to obtain the face occlusion recognition model.
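For illustration, a hedged Keras sketch of this training loop follows; build_network() is an assumed constructor for the preset neural network described below, x_train and y_train are assumed to hold the local area images and their real labels, and the optimizer, batch size, and thresholds are illustrative choices.

```python
import tensorflow as tf

model = build_network()   # assumed constructor for the preset neural network
model.compile(optimizer="adam",
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=["accuracy"])

# Two possible iteration termination conditions from the text: a fixed iteration
# threshold (epochs) or the training-set loss fluctuating within a preset range.
stop = tf.keras.callbacks.EarlyStopping(monitor="loss", min_delta=1e-4, patience=5)
model.fit(x_train, y_train, batch_size=64, epochs=100, callbacks=[stop])
model.save("face_occlusion_model.h5")
```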
As can be seen from the above, the images in the training set are all local area images, so the preset neural network learns the features of local areas of the image. Compared with learning the features of the whole image, learning from local area images lets the network learn the local features of a specific area better and excludes interference from other areas; for example, when the network learns the features of the forehead area, it is not disturbed by the eye area or the nose-to-chin area. Therefore the preset neural network can converge rapidly to yield the face occlusion recognition model, and the classification accuracy of the trained model is improved. In addition, dividing the image into local areas reduces the image size, which speeds up computation in both model training and model prediction.
In order to further lighten the network model while maintaining model accuracy, in this embodiment the preset neural network comprises a feature extraction network, which extracts features from the local area image to generate feature map data.
The feature extraction network comprises a standard convolutional layer and N depthwise separable convolutional layers arranged layer by layer, where each depthwise separable convolutional layer comprises a depthwise convolutional layer and a pointwise convolutional layer arranged layer by layer; the stride of the depthwise convolutional layer in the first M of the N depthwise separable convolutional layers is a preset value, the preset value is greater than 1, and M is less than or equal to N.
It can be understood that, as shown in FIG. 4, the local area image is first input to the standard convolutional layer, and the output of each layer in the feature extraction network is the input of the next layer, until the feature map output by the last layer is the final output of the feature extraction network.
Because the depthwise separable convolutional layers have fewer parameters and require less computation, the preset neural network is lightweight and fast to train.
Specifically, the standard convolutional layer contains standard convolution kernels with parameters DK1 × DK1 × M1 × N1, where DK1 × DK1 is the kernel size, M1 is the number of input channels of the kernel, and N1 is the number of output channels (i.e., the number of kernels); the number of input channels M1 must match the number of channels of the input image. As shown in FIG. 5(a), during the convolution operation one kernel convolves all channels of the input image simultaneously, and one feature map is obtained after weighted summation, so the number of output feature maps equals the number of output channels. For example, if the input image has dimensions DF1 × DF1 × M1, where M1 is the number of channels, and the standard convolution kernels have parameters DK1 × DK1 × M1 × N1, then the output after convolution has dimensions DT1 × DT1 × N1, i.e., N1 feature maps of size DT1 × DT1. During convolution, each standard kernel performs DT1 × DT1 convolution operations on each channel of the input image, with a weighted summation across channels at each position. Thus the parameter count of the standard convolutional layer is DK1 × DK1 × M1 × N1, and its computation count is N1 × DF1 × DF1 × M1 × DK1 × DK1.
Each depthwise separable convolutional layer includes a depthwise convolutional layer and a pointwise convolutional layer arranged layer by layer. Specifically, the depthwise convolutional layer contains depthwise convolution kernels with parameters DK2 × DK2 × 1 × M2, where DK2 × DK2 is the kernel size and M2 is the number of output channels (the number of kernels), and M2 matches the number of channels M of the input image DF × DF × M. It is understood that the input image sizes of the standard convolutional layer and the depthwise convolutional layer differ; the input DF × DF × M is used only to illustrate the computation of the different types of convolutional layers. As shown in FIG. 5(b), during the convolution operation each depthwise kernel convolves only one channel of the input image to obtain one output feature map, so the number of output feature maps matches the number of kernel output channels M2 (equivalently, the number of channels M of the input image). For example, if the input image has dimensions DF2 × DF2 × M2 and the depthwise kernels have parameters DK2 × DK2 × 1 × M2, then the feature maps after convolution have dimensions DT2 × DT2 × M2, i.e., M2 feature maps of size DT2 × DT2. During convolution, each depthwise kernel performs only DT2 × DT2 convolution operations on one channel of the input image, yielding M2 output feature maps. Thus the parameter count of the depthwise convolution is DK2 × DK2 × M2, and its computation count is M2 × DF2 × DF2 × DK2 × DK2.
The pointwise convolutional layer contains pointwise convolution kernels with parameters 1 × 1 × M3 × N3, where the kernel size is 1 × 1, M3 is the number of input channels, and N3 is the number of output channels. As shown in FIG. 5(c), the pointwise convolutional layer has the same structure and the same convolution computation as the standard convolutional layer; the only difference is that the kernel size is 1 × 1. Thus the parameter count of the pointwise convolutional layer is 1 × 1 × M3 × N3, and its computation count is N3 × DF3 × DF3 × M3.
Given the structures of the standard convolutional layer and the depthwise separable convolutional layer: because each depthwise kernel in the depthwise separable convolutional layer convolves only one image channel to produce an output feature map, no weighted summation over multiple channels is needed, and because the pointwise kernel size is 1 × 1, the depthwise separable convolutional layer has fewer parameters and less computation, making the preset neural network lightweight and fast to train.
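To make the savings concrete, the following back-of-the-envelope computation compares the counts derived above for one illustrative layer; the sizes are examples chosen for this sketch, not values from the patent.

```python
# 3x3 convolution, 64 input channels, 128 output channels, 32x32 output feature map
DK, M, N, DF = 3, 64, 128, 32

standard_params = DK * DK * M * N              # 73,728
standard_ops    = N * DF * DF * M * DK * DK    # 75,497,472

depthwise_params = DK * DK * M                 # 576
depthwise_ops    = M * DF * DF * DK * DK       # 589,824
pointwise_params = 1 * 1 * M * N               # 8,192
pointwise_ops    = N * DF * DF * M             # 8,388,608

separable_params = depthwise_params + pointwise_params   # 8,768
separable_ops    = depthwise_ops + pointwise_ops         # 8,978,432

print(standard_params / separable_params)      # ~8.4x fewer parameters
print(standard_ops / separable_ops)            # ~8.4x less computation
```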
In order to further reduce the computation of the network and increase training speed, in this embodiment the stride of the depthwise convolutional layer in each of the first M of the N depthwise separable convolutional layers is set to a preset value greater than 1.
For example, M may be 3, i.e., the stride of the depthwise convolutional layers in the first 3 depthwise separable convolutional layers 1#, 2#, and 3# is set to the preset value. In some embodiments, referring again to FIG. 4, the preset value may be 2, i.e., the strides of depthwise separable convolutional layers 1#, 2#, and 3# are all 2. When the depthwise convolution kernel performs the convolution operation with a stride greater than 1, the size of the feature map output by each of the first M depthwise separable convolutional layers is reduced relative to its input, i.e., the resolution of the feature image drops rapidly, which reduces the computation of the network structure.
Equivalently, the preset network structure performs M consecutive downsamplings using depthwise separable convolutional layers with stride greater than 1, which enlarges the receptive field of the network and emphasizes the invariance of the feature space, so that the finally generated feature map has low resolution, a large receptive field, and spatial invariance, improving the accuracy of the classification network model.
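For illustration, a hedged Keras sketch of such a feature extraction network follows: one standard convolutional layer, then N depthwise separable convolutional layers, the first M of which use stride 2. N, M, and the channel widths are illustrative assumptions, not values fixed by the patent.

```python
from tensorflow.keras import layers, models

def separable_block(x, filters, stride):
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same")(x)  # depthwise convolution
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, 1, padding="same")(x)                  # pointwise (1x1) convolution
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return x

def build_feature_extractor(n_layers=6, m_stride2=3):
    inp = layers.Input(shape=(64, 64, 3))          # a local area image
    x = layers.Conv2D(32, 3, padding="same")(inp)  # standard convolutional layer
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    filters = 32
    for i in range(n_layers):
        stride = 2 if i < m_stride2 else 1         # first M layers downsample
        filters = min(filters * 2, 256)
        x = separable_block(x, filters, stride)
    return models.Model(inp, x)
```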
In this embodiment, because the depthwise separable convolutional layers have few parameters and low computational cost, the parameter count and computation of the feature extraction network, and hence of the whole preset neural network, are effectively reduced, making the preset neural network lightweight and fast to train. In addition, the preset network structure uses M depthwise separable convolutional layers with stride greater than 1 for downsampling, which enlarges the receptive field of the network and emphasizes the invariance of the feature space; the finally generated feature map thus has low resolution, a large receptive field, and spatial invariance, so the accuracy of the trained face occlusion recognition model can be improved.
In summary, in the method for training a face occlusion recognition model according to the embodiments of the invention, the images used for training are all local area images, so the preset neural network learns the features of local areas of the image. Compared with learning the features of the whole image, learning from local area images lets the network learn the local features of a specific area better and excludes interference from other areas, so the preset neural network converges rapidly to yield the face occlusion recognition model, and the classification accuracy of the trained model is improved. Second, dividing the image into local areas reduces the image size, which speeds up computation in both model training and model prediction. In addition, the preset neural network comprises a feature extraction network comprising a standard convolutional layer and N depthwise separable convolutional layers arranged layer by layer, each depthwise separable convolutional layer comprising a depthwise convolutional layer and a pointwise convolutional layer arranged layer by layer, where the stride of the depthwise convolutional layer in the first M of the N depthwise separable convolutional layers is a preset value greater than 1 and M is less than or equal to N. Because the depthwise separable convolutional layers have few parameters and low computational cost, the parameter count and computation of the feature extraction network, and hence of the whole preset neural network, are effectively reduced, making the network lightweight and fast to train; downsampling with the M depthwise separable convolutional layers of stride greater than 1 enlarges the receptive field of the network and emphasizes the invariance of the feature space, so that the finally generated feature map has low resolution, a large receptive field, and spatial invariance, and thus the accuracy of the trained face occlusion recognition model can be improved.
It is understood that the standard convolutional layer, the depthwise convolutional layer, and the pointwise convolutional layer are each followed by a normalization layer and an activation function layer. The normalization layer can be implemented with the existing Batch Normalization algorithm to normalize the data, i.e., to make the data input to the next layer have mean 0 and variance 1, which improves the generalization capability of the network and increases training speed. The activation function layer can be implemented with the existing ReLU function to increase the nonlinearity of the model and overcome the vanishing gradient problem.
However, the ReLU activation maps negative values directly to 0, an operation that loses much feature-space information, which is detrimental to fitting the network model and to its feature expression. To address this problem, in some embodiments the depthwise separable convolutional layers further include a first linear convolutional layer in which each convolution kernel is 1 × 1 in size and a second linear convolutional layer in which each convolution kernel is 1 × 1 in size.
As shown in FIG. 6, the first linear convolutional layer is located between the depthwise convolutional layer and the pointwise convolutional layer, and the second linear convolutional layer is located after the pointwise convolutional layer. Within a depthwise separable convolutional layer, the feature image data output by the depthwise convolutional layer is normalized by a normalization layer and passed through the activation function; the activated data is input to the first linear convolutional layer for convolution and then passed sequentially through the pointwise convolutional layer, a normalization layer, an activation function layer, and the second linear convolutional layer. A 1 × 1 convolution kernel does not change the size of the feature map, but it can increase the nonlinear capacity and the depth of the model.
The number of convolution kernels in the first linear convolutional layer is a preset multiple of the number of convolution kernels in the depthwise convolutional layer, the preset multiple being greater than 1. For example, the number of convolution kernels in the first linear convolutional layer may be set to 2 times the number in the depthwise convolutional layer, which guarantees redundancy of information, so that after the subsequent activation by the activation function layer the diversity of the feature space is preserved, compensating for the feature-space information lost in activation.
The number of convolution kernels in the second linear convolutional layer is the same as the number of convolution kernels in the pointwise convolutional layer, which increases the nonlinear capacity of the model and makes the model accurate.
In this embodiment, adding the first linear convolutional layer and the second linear convolutional layer to the depthwise separable convolutional layer guarantees redundancy of information, so that after subsequent activation by the activation function layer the diversity of the feature space is preserved, compensating for the feature-space information lost in activation, while also increasing the nonlinear capacity of the model, making the model accurate.
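For illustration, a hedged Keras sketch of this modified depthwise separable block follows; "linear" here means no activation follows the 1 × 1 convolution, the 2× expansion follows the example in the text, and the remaining dimensions are assumptions.

```python
from tensorflow.keras import layers

def separable_block_with_linear(x, depth_channels, out_channels, stride):
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same")(x)  # depthwise convolution
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # First linear convolutional layer: kernel count = preset multiple (here 2x)
    # of the depthwise kernels; no activation, preserving feature-space diversity.
    x = layers.Conv2D(2 * depth_channels, 1, padding="same")(x)
    x = layers.Conv2D(out_channels, 1, padding="same")(x)             # pointwise convolution
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Second linear convolutional layer: same kernel count as the pointwise layer.
    x = layers.Conv2D(out_channels, 1, padding="same")(x)
    return x
```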
It can be understood that, referring again to FIG. 4, the preset neural network further includes a fully connected layer and a softmax layer. Each feature map output by the last layer of the feature extraction network described above is input into the fully connected layer for an integration computation, which outputs a one-dimensional vector whose dimension equals the number of neurons in the fully connected layer. It can be understood that in the fully connected layer each neuron is configured with one convolution kernel whose size is the same as that of the input feature map and whose number of input channels is the same as the number of channels of the input feature map. The number of neurons is determined by the number of occlusion classes in the training set; for example, when the occlusion classes in the training set include the six types mask, hat, glasses, bangs, facial mask, and nose patch, the number of neurons is 6.
For example, with the six occlusion classes mask, hat, glasses, bangs, facial mask, and nose patch, if the last layer of the feature extraction network outputs a feature map of 12 × 12 × 20, i.e., 20 feature maps of size 12 × 12, then the fully connected layer contains 6 neurons, each configured with a 12 × 12 × 20 convolution kernel (i.e., the kernel has 20 input channels). Each 12 × 12 × 20 kernel convolves the 20 feature maps and performs a weighted summation to obtain one value, i.e., one neuron outputs one value, and 6 neurons yield a 1 × 6 vector.
Then, the vector output by the fully connected layer is input to the softmax layer, which contains the softmax function; after the vector passes through the softmax function, a one-dimensional predicted label probability distribution is output whose dimension equals the number of neurons, representing the probability of each class for the local area image. Those skilled in the art will understand that the softmax function is a standard function, so its formula is not repeated here. For example, after the 1 × 6 vector of the above example is input to the softmax function, a 1 × 6 predicted label probability distribution is output representing the probabilities of the six occlusion classes mask, hat, glasses, bangs, facial mask, and nose patch; the 6 probability values sum to 1. For example, the predicted label probability distribution [0.7, 0.1, 0.08, 0.06, 0.04, 0.02] indicates that the probability of the occlusion type being a mask is 0.7, a hat 0.1, glasses 0.08, bangs 0.06, a facial mask 0.04, and a nose patch 0.02. It can be understood that the predicted label output by the neural network is this predicted label probability distribution.
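For illustration, a minimal Keras sketch of this classification head follows; flattening followed by a dense layer is mathematically equivalent to convolving each neuron's full-size kernel over the feature map, and the class count of 6 follows the example above.

```python
from tensorflow.keras import layers

def classification_head(feature_map, num_classes=6):
    x = layers.Flatten()(feature_map)    # e.g. a 12 x 12 x 20 feature map
    x = layers.Dense(num_classes)(x)     # one neuron per occlusion class -> 1 x 6 vector
    return layers.Softmax()(x)           # predicted label probability distribution
```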
The real label is identified by one-hot encoding; for example, the class represented by the real label [1,0,0,0,0,0] is mask. When computing the loss, except for the class with true probability 1, the classes with true probability 0 do not participate in the loss computation, so the relations between the true class and the other classes are ignored. As a result, the trained model becomes overconfident, overfits, and generalizes poorly.
In order to solve the problem caused by one-hot encoding, in some embodiments, the method S20 further includes:
s25: and smoothing the real label of each sample pair to obtain the smoothed real label so as to enable each smoothed real label to participate in the training of the preset neural network.
Smoothing means adding noise to the real label, where the noise is a random positive or negative number; the real label and the noise are summed to obtain the smoothed real label. The relationship between the real label, the predicted label probability distribution, and the smoothed real label is illustrated by the example in Table 1 below:
TABLE 1

Label name                     Mask   Hat    Glasses  Bangs  Facial mask  Nose patch
Predicted label probability    0.7    0.1    0.08     0.06   0.04         0.02
Real label                     1      0      0        0      0            0
Smoothed real label            0.75   0.05   0.05     0.05   0.05         0.05
As can be seen from Table 1, the probability distribution of the smoothed real label contains no absolute 0 or 1. When the smoothed real labels participate in training the preset neural network and the loss is computed, the probability of every class participates in the loss computation, so the preset neural network can learn the relations between the true class and the other classes. This alleviates, to a certain extent, the problem of the model being overconfident, makes the clusters of each class more compact, increases the inter-class distance, and reduces the intra-class distance, thereby improving the generalization capability of the model.
In some embodiments, the step S25 specifically includes:
smoothing the target real label according to the following formula to obtain the smoothed target real label, where the target real label is any real label:

$\hat{y}_k = y_k(1 - \alpha) + \alpha / K$

where k is the occlusion class, $\hat{y}_k$ is the probability of class k in the smoothed target real label, $y_k$ is the probability of class k in the target real label ($y_k$ equals 1 when occlusion class k is the correct classification and 0 when it is an incorrect classification), $\alpha$ is a preset parameter value, and K is the total number of occlusion classes in the training set.
The parameter value $\alpha$ is an empirical value and may be 0.1. K is the total number of occlusion classes; in the above embodiment with the six classes mask, hat, glasses, bangs, facial mask, and nose patch, K = 6. If the target real label is [1,0,0,0,0,0], then when k is the mask class, $y_k$ equals 1 and the smoothed probability is 1 × (1 − 0.1) + 0.1/6 ≈ 0.917; when k is hat, glasses, bangs, facial mask, or nose patch, the smoothed probability is 0 × (1 − 0.1) + 0.1/6 ≈ 0.017. The smoothed target real label is therefore [0.917, 0.017, 0.017, 0.017, 0.017, 0.017].
in the embodiment, the real label is subjected to smoothing processing through the formula, noise is added into the real label, the model is restrained, the overfitting degree of the model is reduced, clusters among the classifications are more compact, the inter-class distance is increased, the intra-class distance is reduced, and therefore the generalization capability of the model is improved.
In some embodiments, before the step S24, the method further includes:
S26: performing data enhancement processing on the training set.
The data enhancement processing includes flipping, rotating, translating, scaling, denoising or adjusting the brightness of some of the local region images in the training set to generate new local region images. This enlarges the training set and enhances sample diversity, which helps improve the generalization ability of the model; a plausible augmentation pipeline is sketched below.
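One plausible way to implement such augmentation, assuming the torchvision library is used (the operations and parameter ranges below are illustrative assumptions, not the patent's values):

```python
from torchvision import transforms

# One plausible augmentation pipeline for the local-region crops; denoising
# would be handled separately, e.g. with an image-processing library.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),            # flipping
    transforms.RandomRotation(degrees=10),             # rotation
    transforms.RandomAffine(degrees=0,
                            translate=(0.05, 0.05),
                            scale=(0.9, 1.1)),         # translation and zoom
    transforms.ColorJitter(brightness=0.2),            # brightness adjustment
])
```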
In summary, in the method for training a face occlusion recognition model according to the embodiment of the present invention, a face region is cropped from each image in the image sample set to obtain face region images, each face region image is divided into at least one local region image (such as a forehead region, an eye region, a nose region or a chin region), and each local region image is annotated with a real label that includes an occlusion type, such as mask, glasses, facial mask, bangs, hat or nose patch. A local region image and its real label form a sample pair, so the sample pairs corresponding to the images in the image sample set can be used as a training set and input into a preset neural network for training, which stops once an iteration termination condition is met, yielding the face occlusion recognition model.

Because the images used for training are all local region images, the preset neural network learns the features of each local region of an image. Compared with learning the features of a whole image, learning from local region images captures the local features of a specific region better and excludes interference from other regions, so the preset neural network converges quickly and the classification accuracy of the trained face occlusion recognition model is improved. Dividing the image into local regions also reduces the image size, which speeds up computation in both training and prediction.

In addition, the preset neural network includes a feature extraction network consisting of a common convolutional layer followed by N depthwise separable convolutional layers, where each depthwise separable convolutional layer consists of a depthwise convolutional layer followed by a pointwise convolutional layer, and the stride of the depthwise convolutional layer in the first M of the N depthwise separable convolutional layers is a preset value greater than 1, with M less than or equal to N. Depthwise separable convolutional layers have few parameters and a small computational cost, so they reduce the parameter count and computation of the feature extraction network and hence of the whole preset neural network, making it lightweight and fast to train. Using the M depthwise separable convolutional layers with stride greater than 1 for downsampling enlarges the receptive field of the network and emphasizes the invariance of the feature space, so the final feature map has low resolution, a large receptive field and spatial invariance, which further improves the accuracy of the trained face occlusion recognition model. A sketch of such a layer follows.
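For illustration, a generic depthwise separable convolution block of this kind might be sketched in PyTorch as follows; the kernel size, batch normalization and activation are common choices assumed here, not details given by the patent:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv followed by a pointwise (1x1) conv. A stride > 1 in
    the depthwise layer performs the downsampling described above. This is
    a generic sketch, not the patent's exact layer configuration."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # groups=in_ch makes the convolution depthwise (one filter per channel)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # 1x1 convolution mixes channels (pointwise)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```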
The method for identifying face occlusions provided in the embodiment of the present application is described in detail below. Referring to fig. 7, the method S30 includes, but is not limited to, the following steps:
S31: acquiring an image to be detected, wherein the image to be detected includes a human face.
S32: cropping the face region of the image to be detected to generate a face region image to be detected.
S33: dividing the face region image to be detected into at least one local region image to be detected.
S34: inputting the at least one local region image to be detected into the face occlusion recognition model of any one of the above embodiments, the face occlusion recognition model outputting the occlusion type of each local region image to be detected.
S35: determining the occlusion condition of the image to be detected according to the occlusion type of each local region image to be detected.
The image to be detected includes a human face and can be acquired by an image acquisition device; for example, it may be an ID photo or a self-portrait captured by the device.
It is understood that the image to be detected includes both a human face and a background, and the face is the target region for occlusion detection. To reduce background interference with occlusion detection and shorten recognition time, the face region of the image to be detected is cropped to generate the face region image to be detected. Specifically, a number of facial key points, covering regions such as the eyebrows, eyes, nose, mouth and face contour, can be located with an existing face keypoint algorithm; the face region is then cropped along the face contour to generate the face region image to be detected.
The existing face keypoint algorithm may be Active Appearance Models (AAMs), Constrained Local Models (CLMs), Explicit Shape Regression (ESR) or the Supervised Descent Method (SDM).
The face region image to be detected is then divided into at least one local region image to be detected, where a local region image to be detected is an image in which a local region of the face region image exhibits geometric features; for example, the eye region, forehead region, nose region and mouth region are all such local regions. The local region image to be detected therefore includes any one or more of the following: a forehead region image, an eye region image and a nose-to-chin region image, which can be chosen according to the recognition requirements.
It can be understood that, to divide out each local region image to be detected, the above face keypoint algorithm can likewise locate the key points of the eyes, nose, mouth, face contour and other regions, after which each local region image to be detected is cropped according to the coordinate information of the key points of each region, as sketched below.
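As an illustrative sketch, assuming the landmarks are given as an array of (x, y) coordinates, such a crop might look as follows; the margin and the landmark index ranges in the usage comment are hypothetical:

```python
import numpy as np

def crop_region(image, points, margin=10):
    # Crop the bounding box of a set of landmark points, padded by a margin;
    # the margin value is an assumption for illustration.
    xs, ys = points[:, 0], points[:, 1]
    h, w = image.shape[:2]
    x0 = max(int(xs.min()) - margin, 0)
    x1 = min(int(xs.max()) + margin, w)
    y0 = max(int(ys.min()) - margin, 0)
    y1 = min(int(ys.max()) + margin, h)
    return image[y0:y1, x0:x1]

# Usage sketch: 'landmarks' would come from a face keypoint detector, and
# the index ranges below (eyes, nose) are hypothetical.
# eye_img  = crop_region(face_img, landmarks[36:48])
# nose_img = crop_region(face_img, landmarks[27:36])
```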
Each local region image to be detected is input into the face occlusion recognition model of any one of the above embodiments, and the model outputs the occlusion type of each local region image to be detected, from which the occlusion condition of the image to be detected can be determined. The overall flow is sketched below.

It can be understood that the face occlusion recognition model is obtained by the method of training a face occlusion recognition model in the above embodiments, and has the same structure and functions as described there, which are not repeated here.
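Tying steps S31 to S35 together, a minimal inference sketch is given below; every callable passed in (landmark detector, face cropper, region splitter, classifier) is a hypothetical stand-in for the components described above:

```python
def recognize_occlusions(image, locate_landmarks, crop_face, split_regions, classify):
    """End-to-end sketch of S31-S35; all callables are hypothetical stand-ins."""
    landmarks = locate_landmarks(image)            # keypoint detection (S32)
    face_img = crop_face(image, landmarks)         # face region crop (S32)
    regions = split_regions(face_img, landmarks)   # e.g. {"forehead": ...} (S33)
    # Classify each local region and collect the occlusion types (S34-S35)
    return {name: classify(region) for name, region in regions.items()}
```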
In order to further verify the accuracy of occlusion recognition and reduce model misjudgments, in some embodiments the occlusion condition output by the model is checked logically. Specifically, referring to fig. 8, the method S30 further includes:
S36: acquiring the region attribute of a target local region image to be detected, wherein the region attribute reflects the geometric features of the face included in the target local region image to be detected, and the target local region image to be detected is any local region image to be detected.
S37: judging whether the region attribute matches the occlusion type of the target local region image to be detected.
S38: if not, determining that the target local region image to be detected has no occlusion.
For any local region image to be detected, i.e. the target local region image to be detected, the region attribute reflects the geometric features of the face it contains: if the target is a forehead region image, the region attribute is the forehead; if it is an eye region image, the region attribute is the eyes; and if it is a nose-to-chin region image, the region attribute is the mouth.
It is understood that occlusion types correspond to region attributes; for example, an occlusion in the forehead region cannot be glasses, a nose patch or a mask, and an occlusion in the eye region cannot be a mask or a nose patch.
Therefore, after the model outputs the occlusion type of the target local region image to be detected, whether the region attribute matches that occlusion type is further judged to verify its correctness. If they do not match, for example when the region attribute of the target local region image to be detected is the forehead but the predicted occlusion type is a mask, the target local region image to be detected is determined to have no occlusion. A minimal sketch of this check follows.
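A minimal sketch of this consistency check is given below; the allowed-occlusion table is an illustrative assumption, not the patent's exact mapping:

```python
# Which occlusion types can plausibly occur in each region; the mapping
# below is an assumption for illustration.
ALLOWED = {
    "forehead": {"hat", "bangs"},
    "eyes": {"glasses", "bangs", "hat"},
    "nose_to_chin": {"mask", "nose patch", "facial mask"},
}

def verify(region_attribute, predicted_occlusion):
    # If the predicted occlusion type cannot occur in this region,
    # treat the region as having no occlusion.
    if predicted_occlusion in ALLOWED.get(region_attribute, set()):
        return predicted_occlusion
    return "no occlusion"

print(verify("forehead", "mask"))  # -> "no occlusion"
```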
In this embodiment, logically checking the occlusion condition output by the model reduces the misjudgment rate of the face occlusion recognition model, so that its output better matches objective facts.
Another embodiment of the present invention also provides a non-transitory computer-readable storage medium storing computer-executable instructions for causing an electronic device to perform the above method of training a face occlusion recognition model or the above method of identifying face occlusions.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly also by hardware. Those skilled in the art will understand that all or part of the processes of the above method embodiments can be performed by a computer program instructing the relevant hardware; the program can be stored in a computer-readable storage medium and, when executed, can include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; within the idea of the invention, also technical features in the above embodiments or in different embodiments may be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of training a facial occlusion recognition model, comprising:
acquiring an image sample set, wherein each image in the image sample set comprises a human face;
cropping a face region of a target image to generate a face region image, wherein the target image is any one image in the image sample set;
dividing the face region image into at least one local region image, wherein each local region image is marked with a real label, a local region image and the real label with which it is marked are used as a sample pair, and the real label comprises an occlusion type;
taking at least one sample pair corresponding to each image in the image sample set as a training set, inputting the training set into a preset neural network for training, and stopping training once an iteration termination condition is met, so as to obtain a face occlusion recognition model;
wherein the preset neural network comprises a feature extraction network, the feature extraction network comprises a common convolutional layer and N depthwise separable convolutional layers arranged layer by layer, and each depthwise separable convolutional layer comprises a depthwise convolutional layer and a pointwise convolutional layer arranged layer by layer;
the stride of the depthwise convolutional layer in the first M of the N depthwise separable convolutional layers is a preset value, the preset value is greater than 1, and M is less than or equal to N.
2. The method of claim 1, wherein the depthwise separable convolutional layer further comprises a first linear convolutional layer and a second linear convolutional layer, each convolution kernel in the first linear convolutional layer having a size of 1 x 1, and each convolution kernel in the second linear convolutional layer having a size of 1 x 1;
wherein the first linear convolutional layer is located between the depthwise convolutional layer and the pointwise convolutional layer, and the second linear convolutional layer is located after the pointwise convolutional layer;
the number of convolution kernels in the first linear convolutional layer is a preset multiple of the number of convolution kernels in the depthwise convolutional layer, the preset multiple being greater than 1;
the number of convolution kernels in the second linear convolutional layer is the same as the number of convolution kernels in the pointwise convolutional layer.
3. The method of claim 1, further comprising:
smoothing the real label of each sample pair to obtain smoothed real labels, so that each smoothed real label participates in the training of the preset neural network, wherein the smoothing is to add noise into the real label.
4. The method according to claim 3, wherein the step of smoothing the real label of each sample pair to obtain each smoothed real label comprises:
smoothing the target real label according to the following formula to obtain a smoothed target real label, wherein the target real label is any real label;
$$\hat{y}_k = y_k(1-\alpha) + \frac{\alpha}{K}$$
wherein $k$ is the occlusion class, $\hat{y}_k$ is the probability of class $k$ in the smoothed target real label, and $y_k$ is the probability of class $k$ in the target real label, $y_k$ being equal to 1 when occlusion class $k$ is the correct class and equal to 0 when occlusion class $k$ is an incorrect class; $\alpha$ is a preset parameter value, and $K$ is the total number of occlusion types in the training set.
5. The method of claim 1, further comprising, before the step of inputting the training set into the preset neural network for training:
performing data enhancement processing on the training set.
6. The method according to claim 1, wherein the local region image is an image in which a local region in the face region image exhibits geometric features.
7. A method of identifying facial occlusions, comprising:
acquiring an image to be detected, wherein the image to be detected includes a human face;
cropping a face region of the image to be detected to generate a face region image to be detected;
dividing the face region image to be detected into at least one local region image to be detected;
inputting the at least one local region image to be detected into the face occlusion recognition model according to any one of claims 1-6, the face occlusion recognition model outputting an occlusion type for each local region image to be detected;
and determining the occlusion condition of the image to be detected according to the occlusion type of each local region image to be detected.
8. The method of claim 7, further comprising:
acquiring the region attribute of a target local region image to be detected, wherein the region attribute reflects the geometric features of the face included in the target local region image to be detected, and the target local region image to be detected is any local region image to be detected;
judging whether the region attribute matches the occlusion type of the target local region image to be detected;
and if not, determining that the target local region image to be detected has no occlusion.
9. An electronic device, comprising:
at least one processor, and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A non-transitory computer-readable storage medium having stored thereon computer-executable instructions for causing an electronic device to perform the method of any one of claims 1-8.