CN111241924A - Face detection and alignment method and device based on scale estimation and storage medium - Google Patents

Face detection and alignment method and device based on scale estimation and storage medium

Info

Publication number
CN111241924A
Authority
CN
China
Prior art keywords
scale
face
attention
anchor
detection
Prior art date
Legal status
Granted
Application number
CN201911387732.XA
Other languages
Chinese (zh)
Other versions
CN111241924B (en)
Inventor
徐小丹
刘小扬
何学智
王欢
Current Assignee
Newland Digital Technology Co ltd
Original Assignee
Newland Digital Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Newland Digital Technology Co ltd filed Critical Newland Digital Technology Co ltd
Priority to CN201911387732.XA priority Critical patent/CN111241924B/en
Publication of CN111241924A publication Critical patent/CN111241924A/en
Application granted granted Critical
Publication of CN111241924B publication Critical patent/CN111241924B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification


Abstract

The invention discloses a face detection and alignment method based on scale estimation. A picture is input into a scale estimation network, which outputs the scales whose scale probability exceeds a preset threshold. When the scale estimation network is trained, attention weights are pre-assigned to the faces in the image according to face scale, and the training loss function includes the binary-classification loss of the face attention map. The image to be detected is zoomed by the scales obtained from the scale estimation network to produce images of multiple scales. The multi-scale images are input into anchor Pnet to obtain multiple candidate frames, and non-face candidate frames are removed by a non-maximum suppression algorithm to obtain preprocessed candidate frames. The preprocessed candidate frames are cut from the original image, zoomed to a preset size, and input into anchor Rnet; redundant frames are removed with a non-maximum suppression algorithm to obtain the detection frames, and the corresponding face feature points are extracted according to the detection frames. The method has strong adaptability and a higher detection rate for small-scale faces.

Description

Face detection and alignment method and device based on scale estimation and storage medium
Technical Field
The invention relates to the technical field of video monitoring and image processing, in particular to a face detection and alignment method and device based on scale estimation and a storage medium.
Background
With the rapid development of science and technology, computer vision is increasingly present in social life. Face detection and alignment is one of its research hotspots, with numerous real-life applications such as face-recognition access control, mobile phone unlocking, security monitoring, and identity verification, which bring great convenience to daily life. In an actual scene, an image may simultaneously contain faces of different scales, both small and large. To detect faces of different scales simultaneously, existing methods either use a uniformly distributed image pyramid and detect on the dense pyramid images, or design a large network that detects on multi-scale feature maps. However, these methods suffer from high computational complexity. To reduce the number of pyramid levels, some detection techniques use a scale estimation method; but when multi-scale faces exist in an image, such methods easily ignore small-scale faces, causing missed face detections and bringing inconvenience to applications of face detection.
Disclosure of Invention
The technical problem the invention aims to solve is how to provide a face detection and alignment method and device that have low computational complexity and do not easily ignore small-scale faces.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a face detection and alignment method based on scale estimation comprises the following steps:
inputting the picture into a scale estimation network, and outputting the scale with the scale probability vector larger than a preset threshold value; when the scale estimation network is used for training, attention weights are pre-distributed to faces in the images according to the face scales so as to make a face attention diagram; the loss function of the scale estimation network during training comprises the binary loss of the face attention diagram;
scaling an image to be detected through a scale obtained by a scale estimation network to obtain images of multiple scales;
inputting the images with multiple scales into an anchor Pnet to obtain multiple candidate frames, and removing non-face candidate frames through a non-maximum suppression algorithm to obtain a pre-processing candidate frame;
and cutting the preprocessing candidate frame on the original image, zooming the preprocessing candidate frame to a preset size, inputting the preprocessing candidate frame into anchor Rnet, removing redundant frames by using a non-maximum suppression algorithm to obtain a detection frame, and extracting corresponding human face characteristic points according to the detection frame.
Preferably, the training of the scale estimation network comprises:
labeling a face scale vector: presetting a plurality of scale intervals, taking the average value of the width and the height of the face as the face scale, and if the face belonging to one interval scale exists, setting the corresponding score on the score vector as 1; if no face belonging to the interval scale exists, setting the corresponding score on the score vector as 0;
making a face attention diagram: making a face mask, and pre-distributing attention weight according to the face scale, wherein the formula for pre-distributing the attention weight comprises the following steps:
$$w(s)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(s-\mu)^{2}}{2\sigma^{2}}\right)$$
wherein s is the face scale, and σ and μ are probability distribution parameters;
loss of class two classes Using metricssAnd loss of binary classification of face attention mapsaAs a loss function, the loss is losss+λlossaWherein λ is a weight coefficient.
Preferably,
$$loss_{s}=-\frac{1}{N_{a}}\sum_{n=1}^{N_{a}}\left[p_{n}\log\hat{p}_{n}+(1-p_{n})\log(1-\hat{p}_{n})\right]$$
wherein N_a denotes the number of scale intervals, p_n denotes the label of the nth scale interval, and \hat{p}_n denotes the estimation result of the nth scale interval.
Preferably,
$$loss_{a}=-\frac{1}{N_{a}}\sum_{n=1}^{N_{a}}\left[q_{n}\log\hat{q}_{n}+(1-q_{n})\log(1-\hat{q}_{n})\right]$$
wherein N_a denotes the number of pixels of the face attention map, q_n denotes the label of the nth pixel, and \hat{q}_n denotes the estimation result of the nth pixel.
Preferably, the model training process of anchor Pnet and anchor Rnet comprises:
anchor Pnet training: the anchor Pnet is a full convolution network, K anchors with different proportions are preset, if the intersection ratio of a predefined frame and a marking frame corresponding to the anchors is greater than a first preset value, the anchors are marked as positive samples, and meanwhile classification and regression calculation are involved; if the intersection ratio is smaller than a second preset value, the negative sample is considered to be only involved in classification and not involved in regression calculation; if the intersection ratio is larger than a second preset value and smaller than a first preset value, the samples are not classified and judged, and only participate in regression; during training, K anchors are required to be classified and detected simultaneously;
anchor Rnet training: and generating required training data by using the result and the labeling frame after the Anchor Pnet detection and a preset anchor, and simultaneously performing tasks during training, wherein the tasks comprise face classification, boundary frame regression and feature point positioning on K preset anchors.
Preferably, when the non-maximum suppression algorithm is used to remove the redundant frames to obtain the detection frames, an additional constraint is imposed: a local maximum must cover at least N_n candidate frames, wherein N_n is the coverage threshold.
Preferably, the scale estimation network comprises a feature extraction module, an attention-assisted prediction module and a prediction module;
the feature extraction module is a full convolution network and is used for generating features;
the attention auxiliary prediction module is used for deconvoluting the feature map into the size of an original map and learning a human face attention map and human face attention features;
and the prediction module is used for obtaining a scale probability vector by combining the characteristics of the characteristic module and the attention characteristics of the human face and outputting the scale with the scale probability vector larger than a preset threshold value.
In a second aspect, the present invention further provides a face detection and alignment system based on scale estimation, including:
a scale estimation module: inputting the picture into a scale estimation network, and outputting the scale with the scale probability vector larger than a preset threshold value; when the scale estimation network is used for training, attention weights are pre-distributed to faces in the images according to the face scales so as to make a face attention diagram; the loss function of the scale estimation network during training comprises the binary loss of the face attention diagram;
a scaling module: scaling an image to be detected through a scale obtained by a scale estimation network to obtain images of multiple scales;
anchor Pnet Module: inputting the images with multiple scales into an anchor Pnet to obtain multiple candidate frames, and removing non-face candidate frames through a non-maximum suppression algorithm to obtain a pre-processing candidate frame;
anchor Rnet module: and cutting the preprocessing candidate frame on the original image, zooming the preprocessing candidate frame to a preset size, inputting the preprocessing candidate frame into anchor Rnet, removing redundant frames by using a non-maximum suppression algorithm to obtain a detection frame, and extracting corresponding human face characteristic points according to the detection frame.
In a third aspect, the present invention further provides an electronic device for face detection and alignment based on scale estimation, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above face detection and alignment method based on scale estimation when executing the program.
In a fourth aspect, the present invention further provides a computer-readable storage medium for face detection and alignment based on scale estimation, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the face detection and alignment based on scale estimation method.
By adopting the above technical scheme, because an anchor-based cascade face detection method is used, the candidate regions of the face are extracted with the simple and fast anchor Pnet and then gradually corrected with the relatively complex anchor Rnet, so face detection is faster and more accurate and adapts to a certain range of scales; adaptability to the scale estimation results is strengthened, and the two tasks of face detection and alignment are performed simultaneously. In addition, because the attention-based face scale estimation network is adopted, parameters need not be adjusted for different scenes and the method adapts to different scenes by itself; the attention-based scale estimation network achieves a higher detection rate for small-scale faces and prevents small-scale faces from being ignored during detection.
Drawings
FIG. 1 is a flowchart illustrating steps of an embodiment of a scale estimation-based face detection and alignment method according to the present invention;
FIG. 2 is an original image to be processed for face detection and alignment based on scale estimation according to the present invention;
FIG. 3 is a prepared face attention diagram of the face detection and alignment based on scale estimation of the present invention;
FIG. 4 is a block diagram of a scale estimation network;
FIG. 5 is a diagram showing the structure of anchor Pnet;
FIG. 6 is a structural diagram of anchor Rnet.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The technical scheme of the invention provides a face detection and alignment method based on scale estimation, which comprises the following steps:
s10: inputting the picture into a scale estimation network, and outputting the scale with the scale probability vector larger than a preset threshold value; when the scale estimation network is used for training, attention weights are pre-distributed to faces in the images according to the face scales so as to make a face attention diagram; the loss function of the scale estimation network during training comprises the binary loss of the face attention diagram;
s20: scaling an image to be detected through a scale obtained by a scale estimation network to obtain a plurality of scale images;
s30: inputting the images with multiple scales into an anchor Pnet to obtain multiple candidate frames, and removing non-face candidate frames through a non-maximum suppression algorithm to obtain a pre-processing candidate frame;
s40: and cutting the preprocessing candidate frame on the original image, zooming the preprocessing candidate frame to a preset size, inputting the preprocessing candidate frame into anchor Rnet, removing redundant frames by using a non-maximum suppression algorithm to obtain a detection frame, and extracting corresponding human face characteristic points according to the detection frame.
In step S10, the training process of the scale estimation network is:
labeling a face scale vector: presetting a plurality of scale intervals and taking the mean of the face width and height as the face scale; if a face whose scale belongs to an interval exists, the corresponding score in the score vector is set to 1; if no face belonging to the interval exists, the corresponding score is set to 0. Making a face attention map: making a face mask and pre-assigning attention weights according to the face scale. Using the multi-class binary-classification loss of scale, loss_s, and the binary-classification loss of the face attention map, loss_a, as the loss function: loss = loss_s + λ·loss_a, wherein λ is a weight coefficient.
By adopting the above technical scheme, because an anchor-based cascade face detection method is used, the candidate regions of the face are extracted with the simple and fast anchor Pnet and then gradually corrected with the relatively complex anchor Rnet, so face detection is faster and more accurate and adapts to a certain range of scales. Adaptability to the scale estimation results is strengthened, the two tasks of face detection and alignment are performed simultaneously, and because the two networks adapt to each other, a small network structure can achieve good performance. In addition, because the attention-based face scale estimation network is adopted, parameters need not be adjusted for different scenes and the method adapts to different scenes by itself; the attention-based scale estimation network achieves a higher detection rate for small-scale faces and prevents small-scale faces from being ignored during detection.
In an embodiment of the present invention, the step of implementing face scale estimation includes:
the method comprises the following steps: attention-based face scale estimation.
Designing an attention-based scale estimation network for generating a scale probability vector, and then taking the scales whose face scale probability is larger than a threshold T_1 as the final scale set S = {s_1, s_2, s_3, …, s_n};
The method comprises the following specific steps:
step 1: attention-based scale estimation network training
Referring to fig. 4, the scale estimation network is composed of a feature extraction module, an attention-assisted prediction module, and a prediction module. The feature extraction module is a full convolution network and is used for generating features; the attention auxiliary prediction module deconvolves the feature graph into the size of an original graph, learns the human face attention graph and learns the human face attention feature; and the prediction module combines the feature module features and the human face attention features to obtain a scale probability vector, and outputs the scale with the scale probability vector larger than a preset threshold value.
Step 1.1: labeling and making the face scale vector. Owing to the adaptability of the detection network, the scale interval of the face scale can be set relatively large; the interval may be 2^1. The preset scales are X = {2^2.5, 2^3.5, 2^4.5, …, 2^n}, and the corresponding scale spaces are XS = {(2^2, 2^3), (2^3, 2^4), …, (2^(n-0.5), 2^(n+0.5))}. The face scale s is taken as the mean of the face width and height, and each scale interval is labeled 0 or 1 according to the corresponding preset scale. In the implementation, the preset scales are X = {2^2, 2^3, 2^4, …, 2^8}, giving 7 scale spaces in total with scale range [2^2, 2^8].
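The labeling rule of step 1.1 can be sketched as follows. This is a minimal sketch under the embodiment's preset scales 2^2…2^8; the function name and the bin-by-rounded-log2 implementation are illustrative assumptions.

```python
import math

import numpy as np

def label_scale_vector(face_sizes, n_bins=7, lo_exp=2):
    """0/1 scale label vector: bin n covers (2**(lo_exp+n-0.5), 2**(lo_exp+n+0.5)).

    face_sizes: list of (width, height); the face scale s is (w + h) / 2.
    Rounding log2(s) to the nearest integer picks exactly that half-octave bin.
    """
    labels = np.zeros(n_bins, dtype=np.float32)
    for w, h in face_sizes:
        s = (w + h) / 2.0
        n = int(round(math.log2(s))) - lo_exp
        if 0 <= n < n_bins:  # faces outside [2^2, 2^8] are ignored
            labels[n] = 1.0
    return labels
```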
Step 1.2, making the human face attention map. Taking a binary segmentation image formed by inscribed ellipses in a face labeling frame as a face mask, and pre-distributing attention weight to the face in the image according to the face scale, wherein the pre-distributing attention weight formula is as follows:
$$w(s)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(s-\mu)^{2}}{2\sigma^{2}}\right)$$
wherein s is a face scale, and sigma and mu are probability distribution parameters;
as shown in the figure, fig. 2 is an original figure, and fig. 3 is a prepared human face attention diagram.
Step 1.3 uses a multitask penalty function.
The training loss is composed of two parts: one is the multi-class binary-classification loss of scale, loss_s; the other is the binary-classification loss of the face attention map, loss_a. The training loss is loss = loss_s + λ·loss_a.
wherein
$$loss_{s}=-\frac{1}{N_{a}}\sum_{n=1}^{N_{a}}\left[p_{n}\log\hat{p}_{n}+(1-p_{n})\log(1-\hat{p}_{n})\right]$$
N_a denotes the number of scale intervals, p_n denotes the label of the nth scale interval, and \hat{p}_n denotes the estimation result of the nth scale interval.
$$loss_{a}=-\frac{1}{N_{a}}\sum_{n=1}^{N_{a}}\left[q_{n}\log\hat{q}_{n}+(1-q_{n})\log(1-\hat{q}_{n})\right]$$
In the formula, N_a represents the number of pixels of the face attention map, q_n denotes the label of the nth pixel, and \hat{q}_n denotes the estimation result of the nth pixel. For the weighting coefficient λ, 2 may be used in the implementation.
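Both terms are binary cross-entropies over labels and estimates, combined as loss_s + λ·loss_a. A minimal numpy sketch (function names are illustrative):

```python
import numpy as np

def bce(labels, preds, eps=1e-7):
    """Mean binary cross-entropy -- the common form of loss_s and loss_a."""
    preds = np.clip(preds, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(labels * np.log(preds)
                          + (1.0 - labels) * np.log(1.0 - preds)))

def scale_estimation_loss(scale_labels, scale_preds,
                          attn_labels, attn_preds, lam=2.0):
    # loss = loss_s + lambda * loss_a, with lambda = 2 as in the embodiment
    return bce(scale_labels, scale_preds) + lam * bce(attn_labels, attn_preds)
```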
Step 2: the network test is estimated based on the scale of attention.
Step 2.1: in the test, the attention auxiliary prediction module does not participate; only the forward feature extraction module and the prediction module are needed. In the implementation, the picture is down-sampled to 256 × 256 and input into the attention-based scale estimation network to obtain a 1 × 7 scale probability vector. The scales whose probability is greater than the threshold T_0 are taken as the suggested face scales S = {s_1, s_2, …, s_n}.
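The test-time thresholding then reduces to a simple filter over the 1 × 7 probability vector. A sketch: the preset scale values follow step 1.1 and T_0 = 0.6 follows the embodiment, but the names are illustrative.

```python
def select_scales(probs, preset_scales=(4, 8, 16, 32, 64, 128, 256), t0=0.6):
    """Keep the preset scales 2^2..2^8 whose estimated probability exceeds T0."""
    return [s for s, p in zip(preset_scales, probs) if p > t0]
```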
Step two: the method for cascade face detection and alignment based on anchors refers to fig. 5 and 6.
The anchor-based cascade face detection and alignment method is composed of a cascade of two convolutional neural networks, anchor Pnet and anchor Rnet. First, candidate regions of the face are extracted with the simple and fast anchor Pnet, and then gradually corrected with the relatively complex anchor Rnet, so face detection is faster and more accurate. The specific steps are as follows:
step 1: anchor-based cascading face detection and alignment method training
Step 1.1: anchor Pnet training.
The anchor Pnet is a full convolution network. K anchors with different proportions, A = {a_1, a_2, …, a_K}, are designed and matched with the labeling frames for training. If the IoU of the predefined frame corresponding to an anchor and the labeling frame is greater than 0.65, the anchor is marked as a positive sample and participates in both classification and regression calculation; if it is less than 0.3, it is considered a negative sample and participates only in classification, not regression; samples with IoU in [0.4, 0.65] are not used for classification judgment and participate only in regression. During training, the K anchors are classified and detected simultaneously.
The anchors may have any aspect ratio; in this embodiment, for convenience, the aspect ratio is 1, and the anchor sizes are obtained with the formula a_k = γ·a_{k−1}, where a_1 = 16, γ = 0.709, and the number of anchors is 3. The 3 dashed boxes on the 16 × 16 diagram are the 3 preset anchors.
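The geometric anchor progression a_k = γ·a_{k−1} can be sketched directly, with a_1 = 16 and γ = 0.709 as in the embodiment (the function name is illustrative):

```python
def make_anchors(a1=16.0, gamma=0.709, k=3):
    """Square anchor side lengths a_k = gamma * a_(k-1), aspect ratio 1."""
    sides = [a1]
    for _ in range(k - 1):
        sides.append(sides[-1] * gamma)
    return sides
```

With γ = 0.709 ≈ 1/√2, each successive anchor roughly halves the area of the previous one.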
Step 1.2: anchor Rnet is trained.
Required training data are generated using the detection results of anchor Pnet, the labeling frames, and the preset anchors. Three tasks are performed simultaneously during training on the K preset anchors A = {a_1, a_2, …, a_K}: face classification, bounding-box regression, and feature-point localization, with 48 × 48 inputs. The anchor design rule is consistent with step 1.1; in the implementation a_1 = 48, γ = 0.709, and the number of anchors is 3.
Step 2: anchor-based face detection and alignment method test
Step 2.1: anchor Pnet generates candidate boxes. The scales S = {s_1, s_2, …, s_n} obtained from the scale estimation network are used to zoom the image into images of multiple scales. Since anchor Pnet is a full convolution network, it accepts input of any size; the multi-scale images are input into Pnet in turn to obtain a large number of candidate frames. Tests show that regions with dense candidate frames have a high probability of containing faces, while isolated candidate frames have a high probability of being non-face regions. Therefore, more non-face candidate boxes can be removed using an improved non-maximum suppression algorithm, which adds to the standard algorithm the constraint that a local maximum must cover at least N_n candidate frames. The improved non-maximum suppression algorithm proceeds as follows:
(The improved NMS procedure is given in the original as a figure.) Here iou denotes the intersection-over-union ratio:
$$\mathrm{iou}(A,B)=\frac{\mathrm{area}(A\cap B)}{\mathrm{area}(A\cup B)}$$
in this embodiment, the coverage threshold NnNMS threshold N2t0.5, confidence threshold T1=0.6。
Step 2.2: anchor Rnet produces the final result. The candidate frames generated in the first stage are clipped from the original image, scaled to 48 × 48, and input into anchor Rnet; each 48 × 48 input yields K candidate frames, corresponding to the K anchors. The frames whose confidence is greater than the threshold T_2 are kept, redundant frames are removed with the non-maximum suppression algorithm to obtain the detection frames, and the corresponding face feature points are extracted according to the detection frames. The non-maximum suppression algorithm proceeds as follows:
(The standard NMS procedure is given in the original as a figure: repeatedly keep the highest-scoring remaining box and discard all boxes whose iou with it exceeds the threshold N_t.)
in practice, NMS threshold Nt0.5, confidence threshold T2=0.7。
The invention also provides a face detection and alignment system based on scale estimation, which comprises:
a scale estimation module: inputting the picture into a scale estimation network, and outputting the scale with the scale probability vector larger than a preset threshold value; when the scale estimation network is used for training, attention weights are pre-distributed to faces in the images according to the face scales so as to make a face attention diagram; the loss function of the scale estimation network during training comprises the binary loss of the face attention diagram;
a scaling module: scaling an image to be detected through a scale obtained by a scale estimation network to obtain a plurality of scale images;
anchor Pnet Module: inputting the images with multiple scales into an anchor Pnet to obtain multiple candidate frames, and removing non-face candidate frames through a non-maximum suppression algorithm to obtain a pre-processing candidate frame;
anchor Rnet module: and cutting the preprocessing candidate frame on the original image, zooming the preprocessing candidate frame to a preset size, inputting the preprocessing candidate frame into anchor Rnet, removing redundant frames by using a non-maximum suppression algorithm to obtain a detection frame, and extracting corresponding human face characteristic points according to the detection frame.
The invention also provides electronic equipment for detecting and aligning the face based on the scale estimation, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the steps of the method for detecting and aligning the face scale are realized when the processor executes the program. The method comprises the following steps:
inputting the picture into a scale estimation network, and outputting the scale with the scale probability vector larger than a preset threshold value; when the scale estimation network is used for training, attention weights are pre-distributed to faces in the images according to the face scales so as to make a face attention diagram; the loss function of the scale estimation network during training comprises the binary loss of the face attention diagram;
scaling an image to be detected through a scale obtained by a scale estimation network to obtain a plurality of scale images;
inputting the images with multiple scales into an anchor Pnet to obtain multiple candidate frames, and removing non-face candidate frames through a non-maximum suppression algorithm to obtain a pre-processing candidate frame;
and cutting the preprocessing candidate frame on the original image, zooming the preprocessing candidate frame to a preset size, inputting the preprocessing candidate frame into anchor Rnet, removing redundant frames by using a non-maximum suppression algorithm to obtain a detection frame, and extracting corresponding human face characteristic points according to the detection frame.
The invention also proposes a computer-readable storage medium for face detection and alignment based on scale estimation, on which a computer program is stored, the computer program being executed by a processor to implement the steps of the above-mentioned method for face scale detection and alignment.
The method comprises the following steps:
inputting the picture into a scale estimation network, and outputting the scale with the scale probability vector larger than a preset threshold value; when the scale estimation network is used for training, attention weights are pre-distributed to faces in the images according to the face scales so as to make a face attention diagram; the loss function of the scale estimation network during training comprises the binary loss of the face attention diagram;
scaling an image to be detected through a scale obtained by a scale estimation network to obtain a plurality of scale images;
inputting the images with multiple scales into an anchor Pnet to obtain multiple candidate frames, and removing non-face candidate frames through a non-maximum suppression algorithm to obtain a pre-processing candidate frame;
and cutting the preprocessing candidate frame on the original image, zooming the preprocessing candidate frame to a preset size, inputting the preprocessing candidate frame into anchor Rnet, removing redundant frames by using a non-maximum suppression algorithm to obtain a detection frame, and extracting corresponding human face characteristic points according to the detection frame.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, and such variations still fall within the scope of protection of the invention.

Claims (10)

1. A face detection and alignment method based on scale estimation is characterized by comprising the following steps:
inputting a picture into the scale estimation network, and outputting the scales whose entries in the scale probability vector are greater than a preset threshold; during training of the scale estimation network, attention weights are pre-assigned to the faces in the image according to their scales to produce a face attention map, and the training loss function of the scale estimation network includes the binary classification loss of the face attention map;
scaling the image to be detected by the scales obtained from the scale estimation network to obtain images of multiple scales;
inputting the multi-scale images into the anchor Pnet to obtain a plurality of candidate boxes, and removing non-face candidate boxes with a non-maximum suppression algorithm to obtain pre-processed candidate boxes;
and cropping the pre-processed candidate boxes from the original image, scaling them to a preset size, and inputting them into the anchor Rnet; removing redundant boxes with a non-maximum suppression algorithm to obtain detection boxes, and extracting the corresponding face feature points according to the detection boxes.
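The scaling step of claim 1 can be sketched as follows: keep the scale bins the network scores above the threshold, then turn each into a resize factor that maps faces of that scale onto the detector's reference anchor size. The 0.5 probability threshold and the 12-pixel reference size are hypothetical defaults (the patent specifies only "a preset threshold"):

```python
def select_scales(probs, scale_bins, threshold=0.5):
    """Keep the scale bins whose predicted probability exceeds the
    preset threshold (the threshold value is an assumed default)."""
    return [s for p, s in zip(probs, scale_bins) if p > threshold]

def pyramid_factors(face_scales, anchor_size=12.0):
    """Resize factor per selected scale: faces of that scale are
    mapped onto the detector's reference anchor size (12 px is an
    assumed reference, not a value fixed by the patent)."""
    return [anchor_size / s for s in face_scales]
```

Scaling the input image by each returned factor yields the multi-scale images fed to the anchor Pnet, without the dense pyramid an exhaustive cascade would need.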
2. The scale estimation-based face detection and alignment method according to claim 1, wherein the training of the scale estimation network comprises:
labeling the face scale vector: presetting a plurality of scale intervals and taking the mean of the face width and height as the face scale; if a face whose scale falls within an interval exists, the corresponding entry of the score vector is set to 1; if no face falls within the interval, the corresponding entry is set to 0;
making the face attention map: making a face mask and pre-assigning attention weights according to the face scale, wherein the formula for pre-assigning the attention weights is:
[formula image: attention weight as a function of the face scale s with parameters σ and μ]
wherein s is the face scale, and σ and μ are probability distribution parameters;
using the binary classification loss loss_s of the scale vector and the binary classification loss loss_a of the face attention map as the loss function: loss = loss_s + λ·loss_a, wherein λ is a weight coefficient.
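The attention-weight formula is given only as a figure in the patent; the sketch below assumes the Gaussian form suggested by the claim's "probability distribution parameters" σ and μ, so the true formula may differ. The combined loss follows the claim directly:

```python
import math

def attention_weight(s, mu, sigma):
    """Hypothetical pre-assigned attention weight for a face of scale s.
    A Gaussian in the probability-distribution parameters mu and sigma
    is assumed here; the patent's exact formula is in a figure."""
    return math.exp(-((s - mu) ** 2) / (2.0 * sigma ** 2))

def total_loss(loss_s, loss_a, lam):
    """Combined training loss from claim 2: loss_s + lambda * loss_a."""
    return loss_s + lam * loss_a
```

Under this assumption, faces near the reference scale μ receive weight close to 1 and faces far from it are down-weighted.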
3. The scale estimation-based face detection and alignment method of claim 2, wherein:
loss_s = −(1/N_a) · Σ_{n=1}^{N_a} [ p_n·log(p̂_n) + (1 − p_n)·log(1 − p̂_n) ]
wherein N_a denotes the number of scale intervals, p_n denotes the label of the nth scale interval, and p̂_n denotes the estimate for the nth scale interval.
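Claim 3's loss_s is a binary classification loss over the N_a scale intervals; a Python transcription of the conventional binary cross-entropy form (the patent presents the formula only as a figure, and the epsilon clamp is an added numerical-safety detail):

```python
import math

def scale_bce_loss(labels, preds, eps=1e-7):
    """Binary cross-entropy over the Na scale intervals: labels holds
    the 0/1 interval labels p_n, preds holds the estimates."""
    na = len(labels)
    total = 0.0
    for pn, qn in zip(labels, preds):
        qn = min(max(qn, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += pn * math.log(qn) + (1 - pn) * math.log(1 - qn)
    return -total / na
```

The loss_a of claim 4 has the same form, summed over attention-map pixels instead of scale intervals.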
4. The scale estimation-based face detection and alignment method of claim 2, wherein:
loss_a = −(1/N_a) · Σ_{n=1}^{N_a} [ q_n·log(q̂_n) + (1 − q_n)·log(1 − q̂_n) ]
wherein N_a denotes the number of pixels of the face attention map, q_n denotes the label of the nth pixel, and q̂_n denotes the estimate for the nth pixel.
5. The scale estimation-based face detection and alignment method of claim 1, wherein the model training process for the anchor Pnet and the anchor Rnet comprises:
anchor Pnet training: the anchor Pnet is a fully convolutional network with K anchors of different aspect ratios preset; if the intersection-over-union of the predefined box corresponding to an anchor and a labeled box is greater than a first preset value, the anchor is marked as a positive sample and participates in both classification and regression; if the intersection-over-union is smaller than a second preset value, it is considered a negative sample that participates only in classification and not in regression; if the intersection-over-union lies between the second and the first preset value, the sample is excluded from classification and participates only in regression; during training, classification and detection are performed on the K anchors simultaneously;
anchor Rnet training: the required training data are generated from the anchor Pnet detection results, the labeled boxes, and the preset anchors; multiple tasks are performed simultaneously during training, including face classification, bounding-box regression, and feature-point localization for the K preset anchors.
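The anchor labeling rule of claim 5 can be sketched as follows; the 0.65 and 0.3 thresholds stand in for the claim's unspecified "first" and "second" preset values:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor_box, gt_box, pos_thresh=0.65, neg_thresh=0.3):
    """Training role of one anchor per claim 5: positives join
    classification and regression, negatives join classification only,
    anchors in between participate only in regression.  The threshold
    values are illustrative assumptions."""
    v = iou(anchor_box, gt_box)
    if v > pos_thresh:
        return "positive"          # classification + regression
    if v < neg_thresh:
        return "negative"          # classification only
    return "regression_only"       # no classification judgment
```

Ignoring the ambiguous middle band in the classification loss keeps near-boundary anchors from injecting noisy labels while still using them for box refinement.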
6. The scale estimation-based face detection and alignment method of claim 1, wherein: when executing the step of removing non-face candidate boxes with the non-maximum suppression algorithm and the step of removing redundant boxes with the non-maximum suppression algorithm to obtain the detection boxes, the algorithm further includes the restriction that a local maximum must cover a number of non-maxima no smaller than N_n, wherein N_n is the coverage threshold.
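One possible reading of claim 6's coverage restriction: during greedy NMS, a local maximum is kept only if it covers at least N_n overlapping non-maximum boxes, so isolated single detections are discarded as likely false positives. This is an interpretation, since the claim does not fully specify the rule:

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms_with_coverage(boxes, scores, iou_thresh=0.5, n_cover=1):
    """Greedy NMS with an extra restriction: a local maximum is kept
    only if it covers at least n_cover overlapping non-maximum boxes
    (one reading of the coverage threshold Nn in claim 6)."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        covered = [j for j in order if _iou(boxes[i], boxes[j]) > iou_thresh]
        if len(covered) >= n_cover:
            keep.append(i)         # enough supporting detections
        order = [j for j in order if j not in covered]
    return keep
```

With n_cover=0 this reduces to ordinary greedy NMS.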
7. The scale estimation-based face detection and alignment method according to claim 1, wherein the scale estimation network comprises a feature extraction module, an attention-assisted prediction module and a prediction module;
the feature extraction module is a fully convolutional network for generating features;
the attention-assisted prediction module deconvolves the feature map to the size of the original image and learns the face attention map and face attention features;
and the prediction module combines the features of the feature extraction module with the face attention features to obtain a scale probability vector, and outputs the scales whose probabilities are greater than a preset threshold.
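A shape-level sketch of how the prediction module might fuse the backbone features with the face attention map; `w` and `b` stand in for the module's learned parameters, and the fusion scheme (element-wise weighting plus global pooling) is an assumption, not the patent's exact architecture:

```python
import numpy as np

def predict_scales(features, attention_map, w, b):
    """Weight the backbone features by the learned face attention map,
    pool globally, and map the fused vector to per-interval scale
    probabilities with independent sigmoids (multiple scale intervals
    may be active at once)."""
    attended = features * attention_map        # (C, H, W) element-wise
    fused = attended.mean(axis=(1, 2))         # global average pool -> (C,)
    logits = w @ fused + b                     # (Na, C) @ (C,) -> (Na,)
    return 1.0 / (1.0 + np.exp(-logits))
```

Independent sigmoids rather than a softmax match the claim's thresholding of a probability vector, since several face scales can be present in one image.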
8. A face detection and alignment device based on scale estimation is characterized by comprising:
a scale estimation module: inputting a picture into the scale estimation network, and outputting the scales whose entries in the scale probability vector are greater than a preset threshold; during training of the scale estimation network, attention weights are pre-assigned to the faces in the image according to their scales to produce a face attention map, and the training loss function includes the binary classification loss of the face attention map;
a scaling module: scaling the image to be detected by the scales obtained from the scale estimation network to obtain images of multiple scales;
an anchor Pnet module: inputting the multi-scale images into the anchor Pnet to obtain a plurality of candidate boxes, and removing non-face candidate boxes with a non-maximum suppression algorithm to obtain pre-processed candidate boxes;
and an anchor Rnet module: cropping the pre-processed candidate boxes from the original image, scaling them to a preset size, and inputting them into the anchor Rnet; removing redundant boxes with a non-maximum suppression algorithm to obtain detection boxes, and extracting the corresponding face feature points according to the detection boxes.
9. An apparatus for scale estimation based face detection and alignment, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein: the processor, when executing the program, implements the steps of the scale estimation based face detection and alignment method of any one of claims 1-7.
10. A storage medium for scale estimation based face detection and alignment, having a computer program stored thereon, wherein: the computer program is executed by a processor to perform the steps of the scale estimation based face detection and alignment method of any one of claims 1 to 7.
CN201911387732.XA 2019-12-30 2019-12-30 Face detection and alignment method, device and storage medium based on scale estimation Active CN111241924B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911387732.XA CN111241924B (en) 2019-12-30 2019-12-30 Face detection and alignment method, device and storage medium based on scale estimation


Publications (2)

Publication Number Publication Date
CN111241924A true CN111241924A (en) 2020-06-05
CN111241924B CN111241924B (en) 2024-06-07

Family

ID=70864141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911387732.XA Active CN111241924B (en) 2019-12-30 2019-12-30 Face detection and alignment method, device and storage medium based on scale estimation

Country Status (1)

Country Link
CN (1) CN111241924B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107403141A (en) * 2017-07-05 2017-11-28 中国科学院自动化研究所 Method for detecting human face and device, computer-readable recording medium, equipment
CN107844785A (en) * 2017-12-08 2018-03-27 浙江捷尚视觉科技股份有限公司 A kind of method for detecting human face based on size estimation
CN109670452A (en) * 2018-12-20 2019-04-23 北京旷视科技有限公司 Method for detecting human face, device, electronic equipment and Face datection model
WO2019091271A1 (en) * 2017-11-13 2019-05-16 苏州科达科技股份有限公司 Human face detection method and human face detection system
CN109886128A (en) * 2019-01-24 2019-06-14 南京航空航天大学 A kind of method for detecting human face under low resolution
CN110135243A (en) * 2019-04-02 2019-08-16 上海交通大学 A kind of pedestrian detection method and system based on two-stage attention mechanism


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783784A (en) * 2020-06-30 2020-10-16 创新奇智(合肥)科技有限公司 Method and device for detecting building cavity, electronic equipment and storage medium
CN112037118A (en) * 2020-07-16 2020-12-04 新大陆数字技术股份有限公司 Image scaling hardware acceleration method, device and system and readable storage medium
CN112037118B (en) * 2020-07-16 2024-02-02 新大陆数字技术股份有限公司 Image scaling hardware acceleration method, device and system and readable storage medium
CN111860510A (en) * 2020-07-29 2020-10-30 浙江大华技术股份有限公司 X-ray image target detection method and device
CN111860510B (en) * 2020-07-29 2021-06-18 浙江大华技术股份有限公司 X-ray image target detection method and device
CN112183463A (en) * 2020-10-23 2021-01-05 珠海大横琴科技发展有限公司 Ship identification model verification method and device based on radar image
CN112183463B (en) * 2020-10-23 2021-10-15 珠海大横琴科技发展有限公司 Ship identification model verification method and device based on radar image
CN112733671A (en) * 2020-12-31 2021-04-30 新大陆数字技术股份有限公司 Pedestrian detection method, device and readable storage medium

Also Published As

Publication number Publication date
CN111241924B (en) 2024-06-07

Similar Documents

Publication Publication Date Title
CN109543606B (en) Human face recognition method with attention mechanism
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN111241924A (en) Face detection and alignment method and device based on scale estimation and storage medium
CN110633745B (en) Image classification training method and device based on artificial intelligence and storage medium
CN111814902A (en) Target detection model training method, target identification method, device and medium
CN112508975A (en) Image identification method, device, equipment and storage medium
CN111079739B (en) Multi-scale attention feature detection method
CN108846404B (en) Image significance detection method and device based on related constraint graph sorting
CN111695421B (en) Image recognition method and device and electronic equipment
CN110245621B (en) Face recognition device, image processing method, feature extraction model, and storage medium
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN112614136A (en) Infrared small target real-time instance segmentation method and device
CN114092833A (en) Remote sensing image classification method and device, computer equipment and storage medium
CN112084952B (en) Video point location tracking method based on self-supervision training
CN111428664A (en) Real-time multi-person posture estimation method based on artificial intelligence deep learning technology for computer vision
CN112348116A (en) Target detection method and device using spatial context and computer equipment
CN110969110A (en) Face tracking method and system based on deep learning
CN113065379B (en) Image detection method and device integrating image quality and electronic equipment
CN111951283A (en) Medical image identification method and system based on deep learning
CN113807237B (en) Training of in vivo detection model, in vivo detection method, computer device, and medium
CN110222568B (en) Cross-visual-angle gait recognition method based on space-time diagram
CN114677357A (en) Model, method and equipment for detecting self-explosion defect of aerial photographing insulator and storage medium
CN113158860A (en) Deep learning-based multi-dimensional output face quality evaluation method and electronic equipment
CN111639537A (en) Face action unit identification method and device, electronic equipment and storage medium
CN113591647B (en) Human motion recognition method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant