CN111310718A - High-accuracy detection and comparison method for occluded face images - Google Patents

High-accuracy detection and comparison method for occluded face images

Info

Publication number
CN111310718A
Authority
CN
China
Prior art keywords
network
face
loss
training
size
Prior art date
Legal status
Pending
Application number
CN202010156376.7A
Other languages
Chinese (zh)
Inventor
孙冰
潘召军
Current Assignee
Kehong New Technology Institute of Sichuan University
Original Assignee
Kehong New Technology Institute of Sichuan University
Priority date
Filing date
Publication date
Application filed by Kehong New Technology Institute of Sichuan University
Priority to CN202010156376.7A
Publication of CN111310718A
Legal status: Pending

Classifications

    • G06V 40/171 — Image or video recognition; human faces; feature extraction and face representation; local features and components; facial parts; occluding parts, e.g. glasses; geometrical relationships
    • G06N 3/045 — Computing arrangements based on biological models; neural networks; architecture; combinations of networks
    • G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V 40/172 — Image or video recognition; human faces; classification, e.g. identification


Abstract

The invention discloses a high-accuracy detection and comparison method for occluded face images, comprising the following steps: preprocessing the training data; building a set of generation-target pictures for training the generation branch; constructing a feature enhancement branch to obtain local features focused on the face; constructing a parallel feature extraction network branch to strengthen the extraction and use of detail features; constructing the complete network model and performing face region classification and bounding-box regression based on the fused features and parallel feature extraction; training the network end to end and updating the network parameters to obtain the trained detection model; and inputting an occluded face image sample and using the trained detection model to box the face positions, completing detection of the occluded face image. The method effectively increases the proportion of visible face region features in the overall features, improves the robustness of the detection model to occluded face images, and achieves higher detection accuracy and recall on occluded face images.

Description

High-accuracy detection and comparison method for occluded face images
Technical Field
The invention relates to face image detection methods, and in particular to a high-accuracy detection and comparison method for occluded face images.
Background
Face image detection, usually called face detection for short, is the process of determining whether an input image contains any face and locating every face region in it. With the growing adoption of intelligent recognition technology, automatic face detection has important applications in a wide range of scenarios such as criminal investigation, identity verification, mobile social networking, and photo beautification.
Face image detection techniques fall broadly into traditional detection methods and deep-learning-based methods. Traditional face detection mainly relies on hand-crafted features such as grayscale, contour and skin-color features to classify image regions as face or non-face. The VJ detector proposed by Paul Viola et al. is a representative traditional algorithm: it uses Haar features and an AdaBoost cascade strategy, building a strong detector by training weak classifiers stage by stage, and achieves real-time detection speed with reasonably good accuracy.
Compared with traditional machine learning methods, neural networks have clear advantages in fitting nonlinear functions. With recent progress in deep learning, related models perform very well at image feature extraction, classification and detection, so deep learning is increasingly widely applied to face image detection. For example, the classical RCNN family of detection models extracts features from the input image through convolution and pooling layers, generates candidate regions of different scales on the feature map, and then classifies each candidate region as face or non-face and regresses its bounding box.
Existing face detection models achieve good results under constrained conditions, but practical scenes usually contain occlusions of various kinds, and face images with partially missing features make accurate detection difficult. For example, Faster RCNN reaches high accuracy on the public VOC2007 dataset, but when processing heavily occluded face images it produces a large number of missed and false detections.
Disclosure of Invention
The present invention aims to solve the above problems by providing a high-accuracy detection and comparison method for occluded face images, which significantly improves detection accuracy on occluded faces.
The invention achieves this object through the following technical solution:
A high-accuracy detection and comparison method for occluded face images comprises the following steps:
step 1, preprocessing the training data;
step 2, building a set of generation-target pictures for training the generation branch;
step 3, constructing a feature enhancement branch to obtain local features focused on the face, and constructing a parallel feature extraction network branch to strengthen the extraction and use of detail features;
step 4, constructing the complete network model and performing face region classification and bounding-box regression based on the fused features and parallel feature extraction;
step 5, training the network end to end and updating the network parameters to obtain the trained detection model;
step 6, inputting an occluded face image sample and using the trained detection model to box the face positions, completing detection of the occluded face image.
Preferably, in step 1 the training data set is the WiderFace public data set, and the preprocessing includes scaling all input images so that they do not occupy too much GPU memory.
More specifically, in step 1 the WiderFace public data set contains a large number of pictures with facial occlusion; the occlusion field in the annotations indicates the degree of occlusion, with levels 0, 1 and 2 denoting no occlusion, slight occlusion and large-area occlusion respectively. 50% of the samples with occlusion level 0 are randomly selected; for each such picture a square background patch is cropped from a non-ground-truth region, with side length randomly chosen in [0.2, 0.8] times the side length of the largest GT box, and the cropped patch is pasted over part of a GT box to create artificial occlusion. Before being fed into the network, all input images are resized so that the short side is no more than 600 pixels and the long side no more than 800 pixels.
Preferably, the method of step 2 is: based on the WiderFace data set, a corresponding generation-target picture is made for each training picture and is used to compute the similarity (generation) loss.
Further, in step 2 the pixel values outside the GT regions of each input image are set to zero, giving an image that contains only the face regions; this image serves as the generation target of the enhancement branch.
Preferably, step 3 comprises the following steps:
step 3.1, constructing the feature enhancement branch to obtain local features focused on the face, specifically:
step 3.1.1, constructing the feature enhancement branch, which consists of two networks: a convolutional network for feature screening and a deconvolutional network for picture generation;
step 3.1.2, passing the features output by the backbone feature extraction stage through 3 convolutional layers with 3 × 3 kernels, padding 1 and stride 1, giving 512-channel intermediate features at unchanged scale;
step 3.1.3, generating the target region from the intermediate features through a decoder module; specifically, under the adopted framework, 4 deconvolution (deconv) layers produce a 1-channel output image of the same size as the input;
step 3.1.4, computing the similarity loss between the generated image and the generation-target image produced in step 2, and adjusting the enhancement branch parameters based on this loss; the similarity loss is an L2 loss, computed as
Lsim = α·Lf + (1 − α)·Lnf
where Lsim is the generation loss, α is a parameter adjusting how much the face region contributes to the total loss, Lf is the loss over the face region and Lnf is the loss over the non-face region; Lf and Lnf use the same L2 form,
L = Σi (yi − yi*)²
where yi is a pixel value of the generated picture and yi* is the corresponding value of the target (label) picture;
step 3.2, constructing the parallel feature extraction network and fusing it with the backbone network to strengthen the extraction of face detail features, specifically:
step 3.2.1, constructing the parallel feature extraction network; like the backbone, the parallel network adopts the front convolutional modules of VGG16 and contains 5 convolution modules Conv1′, Conv2′, …, Conv5′: Conv1′ has two 3 × 3 convolutions with 64 channels and a max-pooling layer, its output feature map being 1/2 of the original size; Conv2′ has two 3 × 3 convolutions with 128 channels and a max-pooling layer, output 1/4 of the original size; Conv3′ has three 3 × 3 convolutions with 256 channels and a max-pooling layer, output 1/8 of the original size; Conv4′ has three 3 × 3 convolutions with 512 channels and a max-pooling layer, output 1/16 of the original size; Conv5′ has three 3 × 3 convolutions with 512 channels;
step 3.2.2, on the basis of the backbone structure, laterally connecting each convolution module to the backbone through a 1 × 1 convolution, the rest of the structure being identical to the backbone;
step 3.2.3, except for the first module, having each conv module fuse the feature map produced by the previous module with the corresponding backbone feature map before passing the result to the next module;
step 3.2.4, fusing the output feature map of the conv5_3′ layer with the output feature map of the backbone conv5_3 layer, the fused map serving as the input of the enhancement branch and the subsequent network.
Preferably, step 4 comprises the following steps:
step 4.1, constructing the feature extraction backbone network; the backbone adopts the front convolutional modules of VGG16 and contains 5 convolution modules Conv1, Conv2, …, Conv5: Conv1 has two 3 × 3 convolutions with 64 channels and a max-pooling layer, its output feature map being 1/2 of the original size; Conv2 has two 3 × 3 convolutions with 128 channels and a max-pooling layer, output 1/4 of the original size; Conv3 has three 3 × 3 convolutions with 256 channels and a max-pooling layer, output 1/8 of the original size; Conv4 has three 3 × 3 convolutions with 512 channels and a max-pooling layer, output 1/16 of the original size; Conv5 has three 3 × 3 convolutions with 512 channels;
step 4.2, taking the fusion of the backbone conv5_3 output feature map with the conv5_3′ feature map of the parallel feature extraction network as the input of the enhancement branch and the subsequent network;
step 4.3, fusing the same-size features output by the enhancement branch and the parallel branch with the original conv5_3 features by element-wise (point) multiplication, increasing the weight of the visible face region in the classification features and strengthening the extraction of face detail features;
step 4.4, on the basis of the fused features, obtaining normalized proposal regions using the RPN module and the ROI module;
step 4.5, completing face/non-face binary classification and bounding-box fine-tuning of the proposal regions through the classification branch and the regression branch, where the classification-regression loss is computed as
L({pi}, {ti}) = (1/Ncls)·Σi Lcls(pi, pi*) + λ·(1/Nreg)·Σi pi*·Lreg(ti, ti*)
where Lcls is the classification loss, pi is the classification score, pi* is the anchor label (1 for positive anchors, 0 for negative anchors), and Lreg is the regression loss; multiplying by pi* means that bounding boxes are regressed only for anchors classified as foreground; ti is one of the predicted bounding-box parameters (x, y, w, h), and ti* is the ground-truth box parameter corresponding to an anchor marked positive.
Preferably, step 5 comprises the following steps:
step 5.1, setting the enhancement branch loss function and the classification-regression loss function;
step 5.2, training the network end to end, updating the network parameters based on the joint loss, and obtaining the trained detection model.
More specifically, in step 5.2 a VGG16 pre-trained model is used to initialize the network, and the parameters are trained with stochastic gradient descent with momentum and weight decay: momentum 0.8, weight decay 0.0005, 2 pictures per mini-batch, initial learning rate 0.001, decayed by a factor of 0.1 every 18000 steps.
The invention has the following beneficial effects:
To counter the interference that occlusion causes in face image detection, the method takes a Faster RCNN model as the backbone and adds a feature enhancement branch, driven by generation of the visible region, together with a parallel feature extraction network branch that strengthens the extraction of face detail features. Superimposing the original image features, the parallel-branch features and the generated face-region features effectively increases the proportion of visible face region features in the overall features, improves the robustness of the detection model to occluded face images, suppresses the feature loss and interference caused by occlusion, localizes and extracts face images in image samples better, and yields higher detection accuracy and recall on occluded face images.
Drawings
FIG. 1 is the overall flow chart of the high-accuracy detection method for occluded face images according to the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings in which:
As shown in FIG. 1, the high-accuracy detection method for occluded face images comprises the following steps:
Step 1, preprocessing the training data.
In this step the training data set is the WiderFace public data set, and the preprocessing includes scaling all input images so that they do not occupy too much GPU memory. The WiderFace data set contains a large number of pictures with facial occlusion; the occlusion field in the annotations indicates the degree of occlusion, with levels 0, 1 and 2 denoting no occlusion, slight occlusion and large-area occlusion respectively. 50% of the samples with occlusion level 0 are randomly selected; for each such picture a square background patch is cropped from a non-ground-truth region, with side length randomly chosen in [0.2, 0.8] times the side length of the largest GT box, and the cropped patch is pasted over part of a GT box to create artificial occlusion (a sketch of this augmentation follows below). Before being fed into the network, all input images are resized so that the short side is no more than 600 pixels and the long side no more than 800 pixels.
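As an illustration of this preprocessing, the following sketch shows one way the artificial-occlusion augmentation and the size scaling could be implemented; the helper names, the (x1, y1, x2, y2) box format and the use of OpenCV for resizing are our own assumptions, not from the patent.

```python
import random
import numpy as np
import cv2  # OpenCV assumed available for resizing


def add_artificial_occlusion(image, gt_boxes, min_ratio=0.2, max_ratio=0.8):
    """Cut a square background patch from a non-GT area and paste it over part of a GT box.

    image: HxWx3 uint8 array; gt_boxes: list of (x1, y1, x2, y2) ground-truth face boxes.
    Sketch of the step-1 augmentation, applied to ~50% of the occlusion-level-0 samples.
    """
    h, w = image.shape[:2]
    if not gt_boxes:
        return image
    # patch side length: random [0.2, 0.8] x side of the largest GT box
    max_side = max(max(x2 - x1, y2 - y1) for x1, y1, x2, y2 in gt_boxes)
    side = int(max_side * random.uniform(min_ratio, max_ratio))
    if side < 2 or side >= min(h, w):
        return image

    def overlaps_gt(px, py):
        return any(px < x2 and px + side > x1 and py < y2 and py + side > y1
                   for x1, y1, x2, y2 in gt_boxes)

    # sample a source patch location that does not overlap any GT box
    for _ in range(50):
        sx, sy = random.randint(0, w - side), random.randint(0, h - side)
        if not overlaps_gt(sx, sy):
            break
    else:
        return image  # no suitable background patch found
    patch = image[sy:sy + side, sx:sx + side].copy()

    # paste the patch so that it covers part of a randomly chosen GT box
    x1, y1, x2, y2 = random.choice(gt_boxes)
    tx = int(np.clip(random.uniform(x1 - side / 2, x2 - side / 2), 0, w - side))
    ty = int(np.clip(random.uniform(y1 - side / 2, y2 - side / 2), 0, h - side))
    image[ty:ty + side, tx:tx + side] = patch
    return image


def rescale_image(image, short_max=600, long_max=800):
    """Downscale so the short side is at most 600 px and the long side at most 800 px (one reading of the constraint)."""
    h, w = image.shape[:2]
    scale = min(short_max / min(h, w), long_max / max(h, w), 1.0)
    return cv2.resize(image, (int(w * scale), int(h * scale)))
```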
Step 2, building the set of generation-target pictures for training the generation branch: based on the WiderFace data set, a corresponding generation-target picture is made for each training picture and is used to compute the similarity (generation) loss. The pixel values outside the GT regions of each input image are set to zero, giving an image that contains only the face regions; this image serves as the generation target of the enhancement branch (see the sketch below).
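A generation target of this kind can be produced by zeroing out every pixel outside the GT boxes; the sketch below is our own illustration, and the single-channel grayscale target is an assumption made to match the 1-channel output of the enhancement branch in step 3.1.3.

```python
import numpy as np


def make_generation_target(image, gt_boxes):
    """Zero all pixels outside the GT face boxes to build the enhancement-branch target.

    image: HxWx3 uint8 array; gt_boxes: list of (x1, y1, x2, y2) boxes.
    Returns a 1-channel float target in [0, 1] plus the binary face mask used by the similarity loss.
    """
    gray = image.mean(axis=2) / 255.0          # simple grayscale conversion (assumption)
    mask = np.zeros_like(gray)
    for x1, y1, x2, y2 in gt_boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = 1.0
    return gray * mask, mask                   # non-GT pixels set to zero
```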
Step 3, constructing the feature enhancement branch to obtain local features focused on the face, and constructing the parallel feature extraction network branch to strengthen the extraction and use of detail features.
This step specifically comprises:
Step 3.1, constructing the feature enhancement branch to obtain local features focused on the face, specifically:
step 3.1.1, constructing the feature enhancement branch, which consists of two networks: a convolutional network for feature screening and a deconvolutional network for picture generation;
step 3.1.2, passing the features output by the backbone feature extraction stage through 3 convolutional layers with 3 × 3 kernels, padding 1 and stride 1, giving 512-channel intermediate features at unchanged scale;
step 3.1.3, generating the target region from the intermediate features through a decoder module; specifically, under the adopted framework, 4 deconvolution (deconv) layers produce a 1-channel output image of the same size as the input;
step 3.1.4, computing the similarity loss between the generated image and the generation-target image produced in step 2, and adjusting the enhancement branch parameters based on this loss; the similarity loss is an L2 loss, computed as
Lsim = α·Lf + (1 − α)·Lnf
where Lsim is the generation loss, α is a parameter adjusting how much the face region contributes to the total loss, Lf is the loss over the face region and Lnf is the loss over the non-face region; Lf and Lnf use the same L2 form,
L = Σi (yi − yi*)²
where yi is a pixel value of the generated picture and yi* is the corresponding value of the target (label) picture.
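The following PyTorch-style sketch illustrates steps 3.1.1 to 3.1.4: a three-layer 3 × 3 convolutional encoder producing 512-channel features at unchanged scale, a four-layer deconvolution decoder producing a 1-channel image at the input resolution, and the weighted L2 similarity loss Lsim = α·Lf + (1 − α)·Lnf. The deconvolution channel widths, the stride-2 upsampling, the sigmoid output and the default value of α are not stated in the patent and are assumptions here.

```python
import torch.nn as nn


class FeatureEnhanceBranch(nn.Module):
    """Feature enhancement branch: conv encoder for feature screening + deconv decoder for generation."""

    def __init__(self, in_channels=512):
        super().__init__()
        # step 3.1.2: three 3x3 conv layers, padding 1, stride 1, 512-channel output, scale unchanged
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 512, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, stride=1, padding=1), nn.ReLU(inplace=True),
        )
        # step 3.1.3: four stride-2 deconv layers (channel widths assumed) back to input resolution, 1 channel
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, feat):
        screened = self.encoder(feat)        # 512-channel screened features, later fused by point multiplication
        generated = self.decoder(screened)   # 1-channel generated face-region image
        return screened, generated


def similarity_loss(generated, target, face_mask, alpha=0.5):
    """Lsim = alpha * Lf + (1 - alpha) * Lnf with L2 losses over face / non-face pixels (alpha value assumed)."""
    sq_err = (generated - target) ** 2
    l_f = (sq_err * face_mask).sum() / face_mask.sum().clamp(min=1)
    l_nf = (sq_err * (1 - face_mask)).sum() / (1 - face_mask).sum().clamp(min=1)
    return alpha * l_f + (1 - alpha) * l_nf
```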
Step 3.2, constructing the parallel feature extraction network and fusing it with the backbone network to strengthen the extraction of face detail features, specifically:
step 3.2.1, constructing the parallel feature extraction network; like the backbone, the parallel network adopts the front convolutional modules of VGG16 and contains 5 convolution modules Conv1′, Conv2′, …, Conv5′: Conv1′ has two 3 × 3 convolutions with 64 channels and a max-pooling layer, its output feature map being 1/2 of the original size; Conv2′ has two 3 × 3 convolutions with 128 channels and a max-pooling layer, output 1/4 of the original size; Conv3′ has three 3 × 3 convolutions with 256 channels and a max-pooling layer, output 1/8 of the original size; Conv4′ has three 3 × 3 convolutions with 512 channels and a max-pooling layer, output 1/16 of the original size; Conv5′ has three 3 × 3 convolutions with 512 channels;
step 3.2.2, on the basis of the backbone structure, laterally connecting each convolution module to the backbone through a 1 × 1 convolution, the rest of the structure being identical to the backbone;
step 3.2.3, except for the first module, having each conv module fuse the feature map produced by the previous module with the corresponding backbone feature map before passing the result to the next module;
step 3.2.4, fusing the output feature map of the conv5_3′ layer with the output feature map of the backbone conv5_3 layer, the fused map serving as the input of the enhancement branch and the subsequent network.
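A minimal sketch of this parallel branch follows (PyTorch-style; module and variable names are our own): a VGG16-style tower Conv1′–Conv5′ whose stages receive the backbone stage outputs through 1 × 1 lateral convolutions before each following stage. The patent does not state the fusion operator used at these intermediate stages, so element-wise addition is assumed; the final conv5_3′/conv5_3 fusion of step 3.2.4 is done outside this module.

```python
import torch.nn as nn


def vgg_stage(in_ch, out_ch, n_convs, pool=True):
    """One VGG-style stage: n_convs 3x3 convolutions (+ ReLU), optionally followed by 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2, 2))
    return nn.Sequential(*layers)


class ParallelFeatureBranch(nn.Module):
    """Parallel VGG16-style branch (Conv1'..Conv5') laterally fused with the backbone stage outputs."""

    def __init__(self):
        super().__init__()
        # (in_channels, out_channels, n_convs, pool) for Conv1'..Conv5'; Conv5' keeps the 1/16 scale
        cfg = [(3, 64, 2, True), (64, 128, 2, True), (128, 256, 3, True),
               (256, 512, 3, True), (512, 512, 3, False)]
        self.stages = nn.ModuleList([vgg_stage(i, o, n, p) for i, o, n, p in cfg])
        # 1x1 lateral convolutions projecting the first four backbone stage outputs into this branch
        self.laterals = nn.ModuleList([nn.Conv2d(o, o, 1) for _, o, _, _ in cfg[:-1]])

    def forward(self, image, backbone_feats):
        """backbone_feats: outputs of the 5 backbone stages (conv1..conv5_3), matching the parallel scales."""
        x = self.stages[0](image)                          # Conv1': no fusion before the first stage
        for k in range(1, len(self.stages)):
            # fuse the previous parallel feature with the laterally projected backbone feature (addition assumed)
            x = x + self.laterals[k - 1](backbone_feats[k - 1])
            x = self.stages[k](x)
        return x                                           # conv5_3'-level map, later fused with backbone conv5_3
```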
Step 4, constructing the complete network model and performing face region classification and bounding-box regression based on the fused features and parallel feature extraction.
This step specifically comprises:
step 4.1, constructing the feature extraction backbone network; the backbone adopts the front convolutional modules of VGG16 and contains 5 convolution modules Conv1, Conv2, …, Conv5: Conv1 has two 3 × 3 convolutions with 64 channels and a max-pooling layer, its output feature map being 1/2 of the original size; Conv2 has two 3 × 3 convolutions with 128 channels and a max-pooling layer, output 1/4 of the original size; Conv3 has three 3 × 3 convolutions with 256 channels and a max-pooling layer, output 1/8 of the original size; Conv4 has three 3 × 3 convolutions with 512 channels and a max-pooling layer, output 1/16 of the original size; Conv5 has three 3 × 3 convolutions with 512 channels;
step 4.2, taking the fusion of the backbone conv5_3 output feature map with the conv5_3′ feature map of the parallel feature extraction network as the input of the enhancement branch and the subsequent network;
step 4.3, fusing the same-size features output by the enhancement branch and the parallel branch with the original conv5_3 features by element-wise (point) multiplication, increasing the weight of the visible face region in the classification features and strengthening the extraction of face detail features;
step 4.4, on the basis of the fused features, obtaining normalized proposal regions using the RPN module and the ROI module;
step 4.5, completing face/non-face binary classification and bounding-box fine-tuning of the proposal regions through the classification branch and the regression branch, where the classification-regression loss is computed as
L({pi}, {ti}) = (1/Ncls)·Σi Lcls(pi, pi*) + λ·(1/Nreg)·Σi pi*·Lreg(ti, ti*)
where Lcls is the classification loss, pi is the classification score, pi* is the anchor label (1 for positive anchors, 0 for negative anchors), and Lreg is the regression loss; multiplying by pi* means that bounding boxes are regressed only for anchors classified as foreground; ti is one of the predicted bounding-box parameters (x, y, w, h), and ti* is the ground-truth box parameter corresponding to an anchor marked positive.
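Steps 4.3–4.5 can be illustrated with the sketch below: a point-multiplication fusion of the backbone, parallel-branch and enhancement-branch features (the exact combination order is our reading of steps 4.2–4.3), followed by the joint classification/regression loss reconstructed above. The normalizers Ncls and Nreg, λ = 1 and the smooth L1 regression loss follow common Faster R-CNN practice and are assumptions here.

```python
import torch.nn.functional as F


def fuse_features(conv5_3, parallel_feat, enhance_feat):
    """Point-multiplication fusion of backbone conv5_3, parallel conv5_3' and enhancement-branch features.

    All three tensors are assumed to be (N, 512, H, W) at the same 1/16 scale; the patent specifies
    point multiplication but not the combination order, so this is one possible reading.
    """
    return conv5_3 * parallel_feat * enhance_feat


def detection_loss(cls_scores, anchor_labels, box_preds, box_targets, lam=1.0):
    """Joint loss: (1/Ncls) * sum Lcls(pi, pi*) + lam * (1/Nreg) * sum pi* * Lreg(ti, ti*).

    cls_scores:    (N, 2) face / non-face logits per anchor or proposal
    anchor_labels: (N,)  pi* in {1: positive, 0: negative, -1: ignored}
    box_preds:     (N, 4) predicted offsets ti = (x, y, w, h)
    box_targets:   (N, 4) ground-truth offsets ti* for the matched boxes
    """
    valid = anchor_labels >= 0
    cls_loss = F.cross_entropy(cls_scores[valid], anchor_labels[valid].long())  # mean over Ncls anchors

    pos = anchor_labels == 1                      # pi* mask: regress boxes only for foreground anchors
    if pos.any():
        reg_loss = F.smooth_l1_loss(box_preds[pos], box_targets[pos])           # mean over Nreg positives
    else:
        reg_loss = box_preds.sum() * 0.0          # no positive anchors in this batch
    return cls_loss + lam * reg_loss
```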
Step 5, training the network end to end, updating the network parameters and obtaining the trained detection model.
This step specifically comprises:
step 5.1, setting the enhancement branch loss function and the classification-regression loss function;
step 5.2, training the network end to end, updating the network parameters based on the joint loss, and obtaining the trained detection model. In this step a VGG16 pre-trained model is used to initialize the network, and the parameters are trained with stochastic gradient descent with momentum and weight decay: momentum 0.8, weight decay 0.0005, 2 pictures per mini-batch, initial learning rate 0.001, decayed by a factor of 0.1 every 18000 steps.
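The training configuration of step 5.2 maps naturally onto SGD with momentum and weight decay plus a step learning-rate schedule. The sketch below is PyTorch-style; the forward signature, the target dictionary keys, the total step count and the equal weighting of the two losses are assumptions, and similarity_loss / detection_loss refer to the sketches above.

```python
import torch


def train_detector(model, train_loader, num_steps=70000):
    """End-to-end training sketch for step 5.2; `model` is the full network initialized from VGG16 weights."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.8, weight_decay=0.0005)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=18000, gamma=0.1)  # x0.1 every 18000 steps
    step = 0
    while step < num_steps:                                     # total step count not stated in the patent
        for images, targets in train_loader:                    # 2 pictures per mini-batch
            generated, cls_scores, box_preds = model(images)    # hypothetical forward signature
            loss = similarity_loss(generated, targets["gen_target"], targets["face_mask"]) \
                 + detection_loss(cls_scores, targets["labels"], box_preds, targets["boxes"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()                                    # step-based (not epoch-based) decay
            step += 1
            if step >= num_steps:
                break
```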
Step 6, inputting an occluded face image sample and using the trained detection model to box the face positions, completing detection of the occluded face image.
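Step 6 then reduces to a standard detection forward pass; a minimal inference sketch follows (the predict helper and the score threshold are our own, not from the patent).

```python
import torch


@torch.no_grad()
def detect_occluded_faces(model, image_tensor, score_threshold=0.8):
    """Run the trained detector on one preprocessed image and return face boxes above a score threshold."""
    model.eval()
    boxes, scores = model.predict(image_tensor.unsqueeze(0))   # hypothetical inference helper
    keep = scores > score_threshold
    return boxes[keep], scores[keep]                           # boxes mark the detected (possibly occluded) faces
```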
Note on the drawing: the steps in FIG. 1 are not worded exactly as above, but correspond to them one to one; the wording is simplified so the method can be read as a flow chart.
The invention designs a feature enhancement branch based on generation of the face region and a parallel feature extraction network that strengthens the extraction of face detail features using an attention-like mechanism. The feature enhancement branch generates an image around the ground-truth region from the original image features; the trained features, which are able to generate a good target, are fused with the backbone convolutional features by point multiplication, increasing the proportion of visible facial features and reducing the interference of occlusion on the features. Experimental results show that the fused features significantly improve the accuracy of the model in detecting occluded faces. The feature extraction branch parallel to the backbone, laterally connected through 1 × 1 convolutions, effectively captures the detail features of face images, strengthens the localization of face regions, and effectively improves recognition accuracy for face images.
In order to recover the face region accurately from the features, the invention trains the enhancement branch in a supervised way by constructing a target data set: a generation-target picture is created for each input picture by zeroing its non-GT regions. The feature enhancement branch is integrated into the Faster RCNN detection model, and experimental results show that the model with the enhancement branch detects occluded faces better than the original model. Since face regions usually occupy only a small proportion of a picture, the constructed parallel feature extraction network branch can further extract low-level features, which are then processed step by step by convolution, pooling and related operations; this benefits the regression of face regions that occupy only a small proportion of the picture and the recognition of face images.
The above embodiments are only preferred embodiments of the present invention and do not limit its technical solutions; any solution that can be realized on the basis of the above embodiments without creative effort should be regarded as falling within the protection scope of this patent.

Claims (9)

1. A high-accuracy detection and comparison method for occluded face images, characterized by comprising the following steps:
step 1, preprocessing the training data;
step 2, building a set of generation-target pictures for training the generation branch;
step 3, constructing a feature enhancement branch to obtain local features focused on the face, and constructing a parallel feature extraction network branch to strengthen the extraction and use of detail features;
step 4, constructing the complete network model and performing face region classification and bounding-box regression based on the fused features and parallel feature extraction;
step 5, training the network end to end and updating the network parameters to obtain the trained detection model;
step 6, inputting an occluded face image sample and using the trained detection model to box the face positions, completing detection of the occluded face image.
2. The high-accuracy detection method for occluded face images according to claim 1, wherein in step 1 the training data set is the WiderFace public data set, and the preprocessing includes scaling all input images so that they do not occupy too much GPU memory.
3. The high-accuracy detection method for occluded face images according to claim 2, wherein in step 1 the WiderFace public data set contains a large number of pictures with facial occlusion; the occlusion field in the annotations indicates the degree of occlusion, with levels 0, 1 and 2 denoting no occlusion, slight occlusion and large-area occlusion respectively; 50% of the samples with occlusion level 0 are randomly selected; for each such picture a square background patch is cropped from a non-ground-truth region, with side length randomly chosen in [0.2, 0.8] times the side length of the largest GT box, and the cropped patch is pasted over part of a GT box to create artificial occlusion; before being fed into the network, all input images are resized so that the short side is no more than 600 pixels and the long side no more than 800 pixels.
4. The high-accuracy detection method for occluded face images according to claim 2 or 3, wherein the method of step 2 is: based on the WiderFace data set, a corresponding generation-target picture is made for each training picture and is used to compute the similarity (generation) loss.
5. The high-accuracy detection method for occluded face images according to claim 4, wherein in step 2 the pixel values outside the GT regions of each input image are set to zero, giving an image that contains only the face regions, which serves as the generation target of the enhancement branch.
6. The high-accuracy detection method for occluded face images according to claim 4, wherein step 3 comprises the following steps:
step 3.1, constructing the feature enhancement branch to obtain local features focused on the face, specifically:
step 3.1.1, constructing the feature enhancement branch, which consists of two networks: a convolutional network for feature screening and a deconvolutional network for picture generation;
step 3.1.2, passing the features output by the backbone feature extraction stage through 3 convolutional layers with 3 × 3 kernels, padding 1 and stride 1, giving 512-channel intermediate features at unchanged scale;
step 3.1.3, generating the target region from the intermediate features through a decoder module; specifically, under the adopted framework, 4 deconvolution (deconv) layers produce a 1-channel output image of the same size as the input;
step 3.1.4, computing the similarity loss between the generated image and the generation-target image produced in step 2, and adjusting the enhancement branch parameters based on this loss; the similarity loss is an L2 loss, computed as
Lsim = α·Lf + (1 − α)·Lnf
where Lsim is the generation loss, α is a parameter adjusting how much the face region contributes to the total loss, Lf is the loss over the face region and Lnf is the loss over the non-face region; Lf and Lnf use the same L2 form,
L = Σi (yi − yi*)²
where yi is a pixel value of the generated picture and yi* is the corresponding value of the target (label) picture;
step 3.2, constructing the parallel feature extraction network and fusing it with the backbone network to strengthen the extraction of face detail features, specifically:
step 3.2.1, constructing the parallel feature extraction network; like the backbone, the parallel network adopts the front convolutional modules of VGG16 and contains 5 convolution modules Conv1′, Conv2′, …, Conv5′: Conv1′ has two 3 × 3 convolutions with 64 channels and a max-pooling layer, its output feature map being 1/2 of the original size; Conv2′ has two 3 × 3 convolutions with 128 channels and a max-pooling layer, output 1/4 of the original size; Conv3′ has three 3 × 3 convolutions with 256 channels and a max-pooling layer, output 1/8 of the original size; Conv4′ has three 3 × 3 convolutions with 512 channels and a max-pooling layer, output 1/16 of the original size; Conv5′ has three 3 × 3 convolutions with 512 channels;
step 3.2.2, on the basis of the backbone structure, laterally connecting each convolution module to the backbone through a 1 × 1 convolution, the rest of the structure being identical to the backbone;
step 3.2.3, except for the first module, having each conv module fuse the feature map produced by the previous module with the corresponding backbone feature map before passing the result to the next module;
step 3.2.4, fusing the output feature map of the conv5_3′ layer with the output feature map of the backbone conv5_3 layer, the fused map serving as the input of the enhancement branch and the subsequent network.
7. The high-accuracy detection method for occluded face images according to claim 6, wherein step 4 comprises the following steps:
step 4.1, constructing the feature extraction backbone network; the backbone adopts the front convolutional modules of VGG16 and contains 5 convolution modules Conv1, Conv2, …, Conv5: Conv1 has two 3 × 3 convolutions with 64 channels and a max-pooling layer, its output feature map being 1/2 of the original size; Conv2 has two 3 × 3 convolutions with 128 channels and a max-pooling layer, output 1/4 of the original size; Conv3 has three 3 × 3 convolutions with 256 channels and a max-pooling layer, output 1/8 of the original size; Conv4 has three 3 × 3 convolutions with 512 channels and a max-pooling layer, output 1/16 of the original size; Conv5 has three 3 × 3 convolutions with 512 channels;
step 4.2, taking the fusion of the backbone conv5_3 output feature map with the conv5_3′ feature map of the parallel feature extraction network as the input of the enhancement branch and the subsequent network;
step 4.3, fusing the same-size features output by the enhancement branch and the parallel branch with the original conv5_3 features by element-wise (point) multiplication, increasing the weight of the visible face region in the classification features and strengthening the extraction of face detail features;
step 4.4, on the basis of the fused features, obtaining normalized proposal regions using the RPN module and the ROI module;
step 4.5, completing face/non-face binary classification and bounding-box fine-tuning of the proposal regions through the classification branch and the regression branch, where the classification-regression loss is computed as
L({pi}, {ti}) = (1/Ncls)·Σi Lcls(pi, pi*) + λ·(1/Nreg)·Σi pi*·Lreg(ti, ti*)
where Lcls is the classification loss, pi is the classification score, pi* is the anchor label (1 for positive anchors, 0 for negative anchors), and Lreg is the regression loss; multiplying by pi* means that bounding boxes are regressed only for anchors classified as foreground; ti is one of the predicted bounding-box parameters (x, y, w, h), and ti* is the ground-truth box parameter corresponding to an anchor marked positive.
8. The high-accuracy detection method for occluded face images according to claim 7, wherein step 5 comprises the following steps:
step 5.1, setting the enhancement branch loss function and the classification-regression loss function;
step 5.2, training the network end to end, updating the network parameters based on the joint loss, and obtaining the trained detection model.
9. The high-accuracy detection method for occluded face images according to claim 8, wherein in step 5.2 a VGG16 pre-trained model is used to initialize the network, and the parameters are trained with stochastic gradient descent with momentum and weight decay: momentum 0.8, weight decay 0.0005, 2 pictures per mini-batch, initial learning rate 0.001, decayed by a factor of 0.1 every 18000 steps.
CN202010156376.7A 2020-03-09 2020-03-09 High-accuracy detection and comparison method for face-shielding image Pending CN111310718A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010156376.7A CN111310718A (en) 2020-03-09 2020-03-09 High-accuracy detection and comparison method for face-shielding image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010156376.7A CN111310718A (en) 2020-03-09 2020-03-09 High-accuracy detection and comparison method for face-shielding image

Publications (1)

Publication Number Publication Date
CN111310718A true CN111310718A (en) 2020-06-19

Family

ID=71149579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010156376.7A Pending CN111310718A (en) 2020-03-09 2020-03-09 High-accuracy detection and comparison method for face-shielding image

Country Status (1)

Country Link
CN (1) CN111310718A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985181A (en) * 2018-06-22 2018-12-11 华中科技大学 A kind of end-to-end face mask method based on detection segmentation
CN109145854A (en) * 2018-08-31 2019-01-04 东南大学 A kind of method for detecting human face based on concatenated convolutional neural network structure
CN110399826A (en) * 2019-07-22 2019-11-01 清华大学深圳研究生院 A kind of end-to-end human face detection and recognition method
CN110728224A (en) * 2019-10-08 2020-01-24 西安电子科技大学 Remote sensing image classification method based on attention mechanism depth Contourlet network
CN110909690A (en) * 2019-11-26 2020-03-24 电子科技大学 Method for detecting occluded face image based on region generation

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860174A (en) * 2020-06-22 2020-10-30 西安工程大学 Method for detecting shielding face by fusing RepGT-RepBox function
CN111914665B (en) * 2020-07-07 2023-06-20 泰康保险集团股份有限公司 Face shielding detection method, device, equipment and storage medium
CN111914665A (en) * 2020-07-07 2020-11-10 泰康保险集团股份有限公司 Face shielding detection method, device, equipment and storage medium
CN112464701A (en) * 2020-08-26 2021-03-09 北京交通大学 Method for detecting whether people wear masks or not based on light weight characteristic fusion SSD
CN112417980A (en) * 2020-10-27 2021-02-26 南京邮电大学 Single-stage underwater biological target detection method based on feature enhancement and refinement
CN112232292A (en) * 2020-11-09 2021-01-15 泰康保险集团股份有限公司 Face detection method and device applied to mobile terminal
CN112232292B (en) * 2020-11-09 2023-12-26 泰康保险集团股份有限公司 Face detection method and device applied to mobile terminal
CN112270326A (en) * 2020-11-18 2021-01-26 珠海大横琴科技发展有限公司 Detection optimization method and device for ship sheltering and electronic equipment
CN112270326B (en) * 2020-11-18 2022-03-22 珠海大横琴科技发展有限公司 Detection optimization method and device for ship sheltering and electronic equipment
CN113723414A (en) * 2021-08-12 2021-11-30 中国科学院信息工程研究所 Mask face shelter segmentation method and device
CN113723414B (en) * 2021-08-12 2023-12-15 中国科学院信息工程研究所 Method and device for dividing mask face shielding object
CN114266946A (en) * 2021-12-31 2022-04-01 智慧眼科技股份有限公司 Feature identification method and device under shielding condition, computer equipment and medium
CN114998605A (en) * 2022-05-10 2022-09-02 北京科技大学 Target detection method for image enhancement guidance under severe imaging condition
CN115937906A (en) * 2023-02-16 2023-04-07 武汉图科智能科技有限公司 Occlusion scene pedestrian re-identification method based on occlusion inhibition and feature reconstruction
CN116883670A (en) * 2023-08-11 2023-10-13 智慧眼科技股份有限公司 Anti-shielding face image segmentation method
CN116883670B (en) * 2023-08-11 2024-05-14 智慧眼科技股份有限公司 Anti-shielding face image segmentation method
CN117275075A (en) * 2023-11-01 2023-12-22 浙江同花顺智能科技有限公司 Face shielding detection method, system, device and storage medium
CN117275075B (en) * 2023-11-01 2024-02-13 浙江同花顺智能科技有限公司 Face shielding detection method, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200619