CN111553230A - Feature enhancement based progressive cascade face detection method under unconstrained scene - Google Patents

Feature enhancement based progressive cascade face detection method under unconstrained scene

Info

Publication number
CN111553230A
Authority
CN
China
Prior art keywords
branch
loss
feature
progressive
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010319149.1A
Other languages
Chinese (zh)
Inventor
徐琴珍
杨哲
刘杨
王路
王驭扬
杨绿溪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202010319149.1A priority Critical patent/CN111553230A/en
Publication of CN111553230A publication Critical patent/CN111553230A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a progressive cascade face detection scheme based on feature enhancement in unconstrained scenes, belonging to the field of multimedia signal processing. The method performs data augmentation on the training set, takes VGGNet-16 as the basic feature extraction network, uses a feature enhancement module to realize a dual-branch architecture, and applies a Max-Both-Out strategy to each branch before prediction. During training, an iterative cascade structure is built and a progressive loss is designed, namely a weighted sum of the first-branch and second-branch multi-task losses, to guide the learning process until convergence and finally realize detection of the target face. The method not only attends to context information but also emphatically mines the current-layer features, enriching the way facial features are extracted; it is suitable for unconstrained scenes with high detection difficulty and can accurately detect tiny, blurred and occluded faces.

Description

Feature enhancement based progressive cascade face detection method under unconstrained scene
Technical Field
The invention belongs to the technical field of image processing, and relates to a progressive cascade face detection method based on feature enhancement in an unconstrained scene.
Background
The popularization of intelligent terminal devices has profoundly changed how people live and interact. Face detection is one of the computer vision applications closest to daily life: it frees humans from heavy visual processing work by letting machines analyze and summarize specific information in images and videos, and it has deeply influenced the development of modern society. On smartphones, the iPhone X and the Mate 20 Pro brought 3D face recognition unlocking to the iOS and Android platforms respectively, better protecting user privacy; in security monitoring, face recognition technology helps track and capture lawbreakers, strengthening public security; in terms of property safety, Alipay pioneered face-scanning payment and uses it for identity authentication in credit services, improving efficiency while ensuring safety.
Early mainstream face detection methods were mostly based on manually designed template matching. They detect clear, unoccluded frontal faces well, are easy to implement, and are hardly affected by illumination or image quality; however, because the face is highly deformable, no fixed template can fully adapt to changes in pose, scale and so on, so their accuracy is limited. Traditional face detection methods, which decide whether a face exists in an image merely by mechanically comparing the correlation between handcrafted features and the target face, are therefore not suitable for unconstrained scenes.
With the rapid development of deep learning, face detection methods based on convolutional neural networks have gradually replaced traditional methods thanks to their strong representation learning and nonlinear modeling capability; detection performance has improved markedly, and accuracy on clear, unoccluded faces is close to one hundred percent. However, unconstrained faces in natural scenes are easily disturbed by external environmental factors such as occlusion, illumination, expression and pose, so facial features are insufficiently extracted and exploited. In addition, small, low-resolution faces remain a bottleneck: densely sampling small faces with small-scale anchors easily produces too many background negative samples and raises the false detection rate. The accuracy of existing face detection methods in unconstrained scenes is therefore still insufficient and does not reach a satisfactory level.
Disclosure of Invention
In order to solve the above problems, the present invention provides a progressive cascade face detection method based on feature enhancement in unconstrained scenes, which focuses on improvement and optimization in two aspects. On one hand, the current-layer features are fully mined: a feature enhancement module expands the single-branch architecture into a dual-branch architecture, and a progressive loss is designed to match the progressive learning capability of each branch and each level of feature map, enriching the way facial features are extracted. On the other hand, a Max-Both-Out strategy is applied and an iterative cascade structure is built; through cascaded sub-detectors with step-by-step increasing intersection-over-union thresholds, each stage is matched with a more appropriate sample distribution.
In order to achieve the purpose, the invention provides the following technical scheme:
the progressive cascade face detection method based on feature enhancement under the unconstrained scene comprises the following steps:
step 1, carrying out data augmentation on the WIDER FACE training set (currently the most authoritative face detection benchmark);
step 2, based on the augmented picture in the step 1, taking VGGNet-16 (a classical deep convolutional neural network) as a basic feature extraction network, utilizing a feature enhancement module to realize a dual-branch architecture, and applying a Max-Both-Out strategy to each branch and each level of feature graph for prediction;
and 3, after the training parameters are initialized, building an iterative cascade structure, guiding and supervising the autonomous learning process of the model by utilizing progressive loss, storing the model after the model is converged, and detecting the model.
Further, the step 1 specifically includes the following sub-steps:
step 1.1: the pictures in the WIDER FACE training set are horizontally flipped and randomly cropped as preliminary preprocessing, with the following specific operations: first, the input image is extended to 4 times its original size; then each picture is mirror-flipped horizontally; finally a 640 × 640 region is randomly cropped, i.e., the following formula is applied:
x_preprocess = Crop(Flip(Extend(x_input)))
where x_input represents an input training-set picture, the Extend operation enlarges the picture using mean-value padding, the Flip operation denotes a random horizontal flip, the Crop operation is random, and x_preprocess represents the corresponding preliminary preprocessing result, whose size is unified to 640 × 640;
step 1.2: color dithering and noise disturbance are adopted to simulate the interference in unconstrained scenes, and the preliminary preprocessing result x_preprocess obtained in step 1.1 is further enhanced to different degrees to obtain the comprehensively processed augmented picture x_process, as shown in the following formula:
x_process ∈ {Color(x_preprocess), Noise(Gaussian)(x_preprocess), Noise(Salt & pepper)(x_preprocess)}
where the Color operation denotes the color dithering method, and the Noise(Gaussian) and Noise(Salt & pepper) operations denote adding Gaussian noise and salt-and-pepper noise to the picture, respectively.
Further, the step 2 specifically includes the following sub-steps:
step 2.1: basic feature extraction is performed on the augmented input picture through VGGNet-16, where conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2 and conv7_2 are selected for final prediction, with feature map sizes of 160 × 160, 80 × 80, 40 × 40, 20 × 20, 10 × 10 and 5 × 5 respectively;
step 2.2: the feature enhancement module is used to realize the dual-branch architecture, and the original feature maps used for prediction in step 2.1 are enhanced with information of different dimensions; the neuron cell of the upper-layer original feature map is denoted oc_(i,j,l+1) and that of the current-layer original feature map oc_(i,j,l), the non-local neuron cells of the current-layer original feature map are nc_(i-,j-,l), nc_(i-,j,l), …, nc_(i,j+,l), nc_(i+,j+,l), and the neuron cell ec_(i,j,l) of the enhanced feature map is expressed as:
ec_(i,j,l) = f_Concat(f_Dilation(nc_(i,j,l)))
nc_(i,j,l) = f_Element-wise(oc_(i,j,l), f_Up(oc_(i,j,l+1)))
where c_(i,j,l) denotes the cell unit mapped to coordinates (i, j) in the feature map of the l-th layer, and f denotes the corresponding basic operations, namely concatenation, dilated convolution, element-wise multiplication and up-sampling;
step 2.3: a Max-Both-Out strategy is applied to the feature maps used for prediction on each branch and each level obtained in the above steps to reduce false positives among the training samples; the Max-Both-Out strategy simultaneously predicts Cp positive-sample face scores and Cn negative-sample background scores and selects the highest face score and the highest background score as the final target and the final background respectively, which is equivalent to integrating Cn + Cp classifiers.
Further, the feature enhancing module in step 2.2 is specifically implemented as follows:
(1) the feature map is normalized with a 1 × 1 convolution kernel;
(2) the up-sampled upper-layer feature map is multiplied element by element with the current feature layer;
(3) the feature map is split into three branches, which are then fed into sub-networks containing different numbers of dilated convolution layers;
(4) the three branches are restored to the dimension of the initial feature map by channel concatenation.
Further, the step 3 specifically includes the following sub-steps:
step 3.1: initializing the training parameters;
step 3.2: an iterative cascade structure is built during training, and model performance is optimized using the intersection-over-union (IoU) threshold between candidate boxes and ground-truth boxes; each sub-detector is trained with positive and negative samples defined by a different threshold, the output of the previous sub-detector serves as the input of the next, iterative computation proceeds stage by stage, and the IoU threshold for positive and negative samples increases progressively to match detection boxes of higher confidence;
step 3.3: according to the progressive learning capability of each branch and each level of feature map, the progressive loss is adopted to guide and supervise the autonomous learning process of the model, where the progressive loss is obtained by weighted summation of the first-branch multi-task loss and the second-branch multi-task loss;
step 3.4: when the progressive loss no longer decreases and stabilizes within a small value range, training is stopped, and the model is saved and used for detection; otherwise, the procedure returns to step 3.1.
Further, in step 3.1, the optimizer is a stochastic gradient descent method with a momentum value of 0.9, and the weight decay value is set to 10^-5.
Further, when the number of iterations reaches the values in the set step list {40000, 60000, 80000}, the learning rate is multiplied by 0.1.
Further, in 3.2, the iterative cascade structure is a three-level structure.
Further, in step 3.3, obtaining the progressive loss by weighted summation of the first branch multitask loss and the second branch multitask loss comprises the following steps:
(1) basic category scoring is guided by softmax loss training, and the expression is as follows:
f(z_m) = exp(z_m) / Σ_{j=1}^{T} exp(z_j)
L_softmax = -Σ_{k=1}^{T} x_k · log f(z_k)
where x_k indicates the actual class label, z_m denotes the input of the softmax layer, f(z_m) represents the output predicted by the softmax layer, and T is the number of classes on the training data set;
the basic position regression is trained by smooth L1 loss guidance, and the expression is as follows:
L_reg = Σ_{i∈Ω} smooth_L1(y^(i) - ŷ^(i))
smooth_L1(x) = 0.5x², if |x| < 1; |x| - 0.5, otherwise
where y^(i) represents the true position label, ŷ^(i) represents the coordinate label information predicted by the CRFD model, and Ω represents the set of regions whose prior boxes are positive samples;
(2) the multi-task loss of the original first branch resulting from step 2.1 is defined as follows:
L_FBML(a) = (1/N)·Σ_i L_conf(p_i, p_i*) + (β/N_P)·Σ_i p_i*·L_loc(t_i, g_i, a_i)
where N represents the total number of dense positive and negative anchor boxes, N_P indicates the number of matched positive anchor boxes, L_conf refers to the softmax loss over the face and background categories, L_loc refers to the smooth L1 loss between the predicted box t_i and the ground-truth box g_i parameterized with respect to anchor a_i, p_i* indicates whether the prediction anchor p_i frames a positive sample (in which case L_loc is activated), and β is used to balance the weight between position regression and category scoring;
(3) the multi-task loss of the enhanced second branch resulting from step 2.2 is defined as follows:
L_SBML(sa) = (1/N)·Σ_i L_conf(p_i, p_i*) + (β/N_P)·Σ_i p_i*·L_loc(t_i, g_i, sa_i)
where N represents the total number of dense positive and negative anchor boxes, N_P indicates the number of matched positive anchor boxes, L_conf refers to the softmax loss over the face and background categories, and L_loc refers to the smooth L1 loss between the predicted box t_i and the ground-truth box g_i parameterized with respect to anchor sa_i; the original first branch uses anchors a_i for detection, while the enhanced second branch uses anchors sa_i;
(4) and weighting and summing the loss functions of the two branches to obtain the progressive loss, wherein the expression is as follows:
L_PL = L_FBML(a) + λ·L_SBML(sa)
in the formula, λ is a weighting coefficient.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention remedies the neglect of the current feature layer in existing methods: while attending to contextual cues, it fully mines the feature-map information of the current layer and realizes the dual-branch architecture through the feature enhancement module. The progressive loss is designed correspondingly, remedying the deficiency of existing methods in the progressive learning capability of feature maps at different levels.
2. The method further improves the distribution situation of the positive and negative samples, applies a Max-Both-Out strategy and builds an iterative cascade structure, relieves the adverse effect of sample distribution on the precision in the prior method and obtains good gain.
3. The invention can keep higher detection accuracy rate, stronger interference resistance and extremely high plasticity and comprehensiveness when facing to the human face with the attributes of inconsistent scales, fuzziness, strong illumination, different postures, facial shielding, makeup and the like in an unconstrained scene.
Drawings
FIG. 1 is a flowchart of a progressive cascade face detection method based on feature enhancement according to the present invention.
FIG. 2 is a network model diagram of the progressive cascade human face detection method based on feature enhancement.
Fig. 3 is a schematic diagram of a human face image processing enhancement mode.
FIG. 4 is a feature map output visualization of a basic feature extraction network.
FIG. 5 is a diagram of a feature enhancement module.
Fig. 6 is a feature map output visualization of a dual-branch architecture.
FIG. 7 is a Max-Both-Out strategy diagram.
Fig. 8 is a schematic diagram of an iterative cascade structure.
Fig. 9 is a diagram illustrating the detection effect of the trained model on WIDER FACE face samples in the test set.
FIG. 10 shows the detection accuracy of the trained model on the Easy, Medium, Hard validation set of WIDER FACE.
Fig. 11 is a diagram illustrating the effect of detecting an unconstrained face by using a trained model.
The original pictures of the photos in the drawings are color pictures, and are modified into a gray form according to the requirements of patent filing.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.
Taking the WIDER FACE data set (the most authoritative face detection benchmark) as an example, the specific implementation of the feature-enhancement-based progressive cascade face detection method in unconstrained scenes of the present invention is described in further detail below with reference to the accompanying drawings. The flow is shown in fig. 1 and includes the following steps:
step 1: the data augmentation of WIDERFACE training set mainly includes the following two aspects:
step 1.1: the pictures in the WIDER FACE training set are horizontally flipped and randomly cropped as preliminary preprocessing, with the following specific operations: first, the input image is extended to 4 times its original size; then each picture is mirror-flipped horizontally; finally a 640 × 640 region is randomly cropped, i.e., the following formula is applied:
x_preprocess = Crop(Flip(Extend(x_input)))
where x_input represents an input training-set picture, the Extend operation enlarges the picture using mean-value padding, the Flip operation denotes a random horizontal flip, the Crop operation is random, and x_preprocess represents the corresponding preliminary preprocessing result, whose size is unified to 640 × 640. An example of the data augmentation operations is shown in fig. 3, where the first row is the original input image of arbitrary size, the second row scales the corresponding image to 4 times its original size, and the third and fourth rows show the preliminary preprocessing results of flipping and cropping for some of the samples.
Step 1.2: and simulating the interference in an unconstrained scene by adopting a color dithering and noise disturbance mode. These two data enhancement modes are briefly described below:
color dithering: considering different illumination intensity, background atmosphere, shooting conditions and the like, the saturation, brightness, contrast and sharpness of the input image are respectively adjusted according to random factors generated randomly.
Noise disturbance: the method mainly relates to the addition of Gaussian white noise and salt and pepper noise, wherein the Gaussian noise refers to that the noise amplitude obeys Gaussian distribution, namely the number of noise points with certain intensity is the largest, and the number of noise points which are farther away from the intensity is smaller, so that the noise is additive noise; the salt and pepper noise is an impulse noise, and the alternating black and white bright and dark point noise can be generated on an original image by randomly changing the values of some pixel points, so that the salt and pepper noise is vivid, is just like spreading salt and pepper on the image, and is a logic noise.
To sum up, the preliminary pre-processing result x obtained in step 1.1 is again subjected topreprocessEnhancing in different degrees to obtain an extended picture x after comprehensive treatmentprocessAs shown in the following formula:
x_process ∈ {Color(x_preprocess), Noise(Gaussian)(x_preprocess), Noise(Salt & pepper)(x_preprocess)}
where the Color operation denotes the color dithering method, and the Noise(Gaussian) and Noise(Salt & pepper) operations denote adding Gaussian noise and salt-and-pepper noise to the picture, respectively. An example of these operations is shown in fig. 3, where the fifth row applies color dithering to the picture cropped in the fourth row, and the sixth and seventh rows add Gaussian noise and salt-and-pepper noise of different degrees to that cropped picture, which strengthens the model's detection stability against arbitrary external environmental factors.
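For concreteness, the whole augmentation pipeline of step 1 can be sketched as follows. This is a minimal illustration only: the choice of PIL/NumPy, the jitter ranges, the noise strengths and all function names are assumptions made for illustration, not values taken from this patent.

```python
import random
import numpy as np
from PIL import Image, ImageEnhance, ImageOps

def extend(img, scale=4):
    # Enlarge the canvas to `scale` times the original size, padding with the mean color.
    w, h = img.size
    mean = tuple(int(c) for c in np.asarray(img).reshape(-1, 3).mean(axis=0))
    canvas = Image.new("RGB", (w * scale, h * scale), mean)
    canvas.paste(img, ((w * scale - w) // 2, (h * scale - h) // 2))
    return canvas

def flip(img, p=0.5):
    # Random mirror (horizontal) flip.
    return ImageOps.mirror(img) if random.random() < p else img

def crop(img, size=640):
    # Random size x size crop; the extended image is assumed to be at least size x size.
    w, h = img.size
    x0, y0 = random.randint(0, w - size), random.randint(0, h - size)
    return img.crop((x0, y0, x0 + size, y0 + size))

def color_jitter(img):
    # Randomly perturb saturation, brightness, contrast and sharpness.
    for enhancer in (ImageEnhance.Color, ImageEnhance.Brightness,
                     ImageEnhance.Contrast, ImageEnhance.Sharpness):
        img = enhancer(img).enhance(random.uniform(0.6, 1.4))
    return img

def gaussian_noise(img, sigma=10.0):
    arr = np.asarray(img).astype(np.float32) + np.random.normal(0.0, sigma, np.asarray(img).shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def salt_pepper_noise(img, amount=0.01):
    arr = np.asarray(img).copy()
    mask = np.random.rand(arr.shape[0], arr.shape[1])
    arr[mask < amount / 2] = 0          # pepper
    arr[mask > 1 - amount / 2] = 255    # salt
    return Image.fromarray(arr)

def augment(x_input):
    # x_preprocess = Crop(Flip(Extend(x_input))), then one of the step-1.2 enhancements.
    x_preprocess = crop(flip(extend(x_input)))
    enhance = random.choice([color_jitter, gaussian_noise, salt_pepper_noise])
    return enhance(x_preprocess)
```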
Step 2: based on the augmented picture in step 1, VGGNet-16 is taken as a basic feature extraction network, a dual-branch architecture is realized by using a feature enhancement module, and a Max-Both-Out strategy is applied to each branch and each level of feature graph for prediction, and the method mainly comprises the following steps:
step 2.1: basic feature extraction is performed on the augmented input picture through VGGNet-16, where conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2 and conv7_2 are selected for final prediction, with feature map sizes of 160 × 160, 80 × 80, 40 × 40, 20 × 20, 10 × 10 and 5 × 5 respectively. These feature maps are visualized in turn, as shown in fig. 4.
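As a rough sketch of how the six prediction feature maps of step 2.1 could be tapped, the following PyTorch snippet slices a VGG-16 backbone and appends three extra blocks standing in for conv_fc7, conv6_2 and conv7_2. The slicing indices, channel counts and extra-block designs are assumptions; the patent only names the layers and their output sizes.

```python
import torch
import torch.nn as nn
import torchvision

class MultiLevelVGG16(nn.Module):
    """Returns six feature maps (160/80/40/20/10/5 for a 640x640 input), mirroring step 2.1."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None).features
        self.to_conv3_3 = vgg[:16]    # 640 -> 160, 256 channels
        self.to_conv4_3 = vgg[16:23]  # 160 -> 80, 512 channels
        self.to_conv5_3 = vgg[23:30]  # 80  -> 40, 512 channels
        self.conv_fc7 = nn.Sequential(nn.MaxPool2d(2),                                   # 40 -> 20
                                      nn.Conv2d(512, 1024, 3, padding=1), nn.ReLU(inplace=True))
        self.conv6_2 = nn.Sequential(nn.Conv2d(1024, 512, 3, stride=2, padding=1),       # 20 -> 10
                                     nn.ReLU(inplace=True))
        self.conv7_2 = nn.Sequential(nn.Conv2d(512, 256, 3, stride=2, padding=1),        # 10 -> 5
                                     nn.ReLU(inplace=True))

    def forward(self, x):
        c3 = self.to_conv3_3(x)
        c4 = self.to_conv4_3(c3)
        c5 = self.to_conv5_3(c4)
        fc7 = self.conv_fc7(c5)
        c6 = self.conv6_2(fc7)
        c7 = self.conv7_2(c6)
        return [c3, c4, c5, fc7, c6, c7]

feats = MultiLevelVGG16()(torch.randn(1, 3, 640, 640))
print([f.shape[-1] for f in feats])  # [160, 80, 40, 20, 10, 5]
```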
Step 2.2: and (3) realizing a double-branch architecture by using a feature enhancement module, and enhancing the original feature map used for prediction in the step 2.1 through different dimension information. The neuron cells in the upper primitive feature map are marked as oc(i,j,l)The non-local neuron cell in the original feature map of the current layer is nc(i-,j-,l),nc(i-,j,l),…,nc(i,j+,l),nc(i+,j+,l)And the neuron cell ec of the characteristic diagram after the strengthening(i,j,l)Can be expressed as:
ec_(i,j,l) = f_Concat(f_Dilation(nc_(i,j,l)))
nc_(i,j,l) = f_Element-wise(oc_(i,j,l), f_Up(oc_(i,j,l+1)))
where c_(i,j,l) is the cell unit mapped to coordinates (i, j) in the feature map of the l-th layer, and f denotes the corresponding basic operations, namely concatenation (Concatenate), dilated convolution (Dilation Convolution), element-wise multiplication (Element-wise Production) and up-sampling (Up Sampling).
The structure of the feature enhancing module is shown in fig. 5, and is specifically realized as follows:
(1) the feature map is normalized with a 1 × 1 convolution kernel;
(2) the up-sampled upper-layer feature map is multiplied element by element with the current feature layer;
(3) the feature map is split into three branches, which are then fed into sub-networks containing different numbers of dilated convolution layers. Briefly, dilation of a convolutional layer means injecting holes into the standard convolution kernel to enlarge the receptive field; here the convolution kernel size is set to 3 × 3 and the dilation rate to 3. This hyper-parameter defines the spacing between the values of the convolution kernel when processing data; a normal convolutional layer has a dilation rate of 1.
(4) the three branches are restored to the dimension of the initial feature map by channel concatenation.
The feature maps of the expanded dual-branch architecture are visualized in sequence, as shown in fig. 6, for a selected training picture of size 640 × 640. The upper row is the original first branch, i.e., the feature maps output by the basic feature extraction network VGGNet-16 in step 2.1; the lower row is the enhanced second branch, i.e., the corresponding first-branch feature maps after passing through the feature enhancement module. It can be seen that the enhanced second branch contains more semantic information than the original first branch, which promotes detection performance.
Step 2.3: and applying a Max-Both-Out strategy to the feature maps used for prediction of each branch and each hierarchy obtained in the steps to reduce the false positive of the training sample, namely the probability that the prediction is true and actually is false, wherein the index can reflect the classification capability of the model. A schematic diagram of the Max-bouh-Out strategy is shown in fig. 7, wherein Cp positive sample face scores and Cn negative sample background scores are respectively predicted by a left branch, and then the final target and background with the highest positive score and the highest negative score are respectively selected from the Cp positive sample face scores and the Cn negative sample background scores, which is equivalent to integrating Cn + Cp classifiers. Therefore, the prediction probability of the negative samples can be effectively weakened, and the effect of reducing the false detection rate is achieved. In the invention, a Max-Both-Out strategy simultaneously predicts Cp positive sample face scores and Cn negative sample background scores, Cp is set to be 1 and Cn is set to be 3 for the first layer and the second layer of each branch, and multiple times of prediction of negative sample background scores is beneficial to detecting small facet holes; cp is set to 3 and Cn is set to 1 in all the other layers of each branch, and multiple times of predicting the face score of the positive sample can recall more faces as much as possible.
And step 3: after the training parameters are initialized, an iterative cascade structure is built, the model can be stored and detected after the model is converged by utilizing the self-learning process of progressive loss guidance and supervision of the model, and the method mainly comprises the following steps:
step 3.1: the training parameters are initialized, and the specific settings are shown in table 1 below.
TABLE 1 training parameter settings
(Table 1 is reproduced as an image in the original publication; the key settings it contains are summarized in the following paragraph.)
Wherein, the optimizer selects a random gradient descent (SGD) method with a momentum value of 0.9; meanwhile, to prevent overfitting, the weight attenuation value is set to 10-5. It should be noted that, in consideration of the continuous depth of the network learning process, the following settings are set for the learning rate: as the number of iterations increases, when the number of iterations is in the set stepping list {35000,45000,55000}, the learning rate decreases to 0.1, which can prevent the unexpected situation that the network parameter is close to the global optimal solution, and the optimal value is missed due to the excessive learning rate.
Step 3.2: and (3) an iterative cascade structure is set up during training, the model performance is optimized by utilizing the intersection ratio threshold between the candidate frame and the truth frame, namely, each sub-detector is obtained based on the training of positive and negative samples with different thresholds, the output of the former sub-detector is used as the input of the latter sub-detector, iterative calculation is carried out step by step, and the intersection ratio threshold of the positive and negative samples is increased progressively so as to match the detection frame with higher confidence coefficient. A schematic diagram of an iterative cascade structure is shown in fig. 8, and a three-stage structure is provided in the present invention, where Hi, Ci, and Bi (i ═ 1,2, and 3) respectively represent a network header, a classification result, and a coordinate position of an i-th detector. In the invention, intersection ratio thresholds are set to be [0.35,0.5 and 0.6] respectively for each stage of the three-stage iterative cascade structure.
Step 3.3: according to the asymptotic learning capacity of each branch and each level of feature graph, adopting an autonomous learning process of a progressive loss guide and supervision model, wherein progressive loss is obtained by weighted summation of first branch multitask loss and second branch multitask loss, and the progressive loss is elaborated as follows:
(1) basic category scoring is guided by softmax loss training, and the expression is as follows:
f(z_m) = exp(z_m) / Σ_{j=1}^{T} exp(z_j)
L_softmax = -Σ_{k=1}^{T} x_k · log f(z_k)
where x_k indicates the actual class label, z_m denotes the input of the softmax layer, f(z_m) represents the output predicted by the softmax layer, and T is the number of classes on the training data set.
The basic position regression is trained by smooth L1 loss guidance, and the expression is as follows:
L_reg = Σ_{i∈Ω} smooth_L1(y^(i) - ŷ^(i))
smooth_L1(x) = 0.5x², if |x| < 1; |x| - 0.5, otherwise
where y^(i) represents the true position label, ŷ^(i) represents the coordinate label information predicted by the CRFD model, and Ω represents the set of regions whose prior boxes are positive samples.
(2) The original First branch multitask Loss (FBML, First Branch Multi-task Loss) resulting from step 2.1 is defined as follows:
L_FBML(a) = (1/N)·Σ_i L_conf(p_i, p_i*) + (β/N_P)·Σ_i p_i*·L_loc(t_i, g_i, a_i)
where N represents the total number of dense positive and negative anchor boxes, N_P indicates the number of matched positive anchor boxes, L_conf refers to the softmax loss over the face and background categories, and L_loc refers to the smooth L1 loss between the predicted box t_i and the ground-truth box g_i parameterized with respect to anchor a_i. When p_i* = 1, the prediction anchor p_i frames a positive sample and L_loc is activated; β is used to balance the weight between position regression and category scoring.
(3) The multitask Loss (SBML, Second Branch Multi-task Loss) of the enhanced Second branch resulting from step 2.2 is defined as follows:
L_SBML(sa) = (1/N)·Σ_i L_conf(p_i, p_i*) + (β/N_P)·Σ_i p_i*·L_loc(t_i, g_i, sa_i)
where N represents the total number of dense positive and negative anchor boxes, N_P indicates the number of matched positive anchor boxes, L_conf refers to the softmax loss over the face and background categories, and L_loc refers to the smooth L1 loss between the predicted box t_i and the ground-truth box g_i parameterized with respect to anchor sa_i. When p_i* = 1, the prediction anchor p_i frames a positive sample and L_loc is activated, and β balances the weight between position regression and category scoring. The difference between the two branches is the anchors used: the original first branch uses a_i for detection, while the enhanced second branch uses sa_i.
(4) The loss functions of the two branches are weighted and summed to obtain the Progressive Loss (PL), which is expressed as:
L_PL = L_FBML(a) + λ·L_SBML(sa)
in the formula, lambda is a weighting coefficient, and lambda takes a value of 0.5 in the invention so as to match the compensation of the anchor point scale.
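Under the same symbol definitions, the progressive loss can be sketched as follows. The tensor shapes, the normalization by N and N_P, and the use of PyTorch's built-in cross-entropy and smooth L1 losses are our reading of the formulas above, not the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

def branch_multitask_loss(cls_logits, cls_targets, loc_preds, loc_targets, beta=1.0):
    """One branch's multi-task loss: softmax loss over all anchors plus smooth L1 on positive anchors.
    Shapes (our convention): cls_logits [N, 2], cls_targets [N] with 1 = face, loc_* [N, 4]."""
    n = cls_logits.shape[0]
    pos = cls_targets == 1
    n_pos = pos.sum().clamp(min=1)
    conf_loss = F.cross_entropy(cls_logits, cls_targets, reduction="sum") / n
    loc_loss = F.smooth_l1_loss(loc_preds[pos], loc_targets[pos], reduction="sum") / n_pos
    return conf_loss + beta * loc_loss

def progressive_loss(first_branch, second_branch, lam=0.5):
    """L_PL = L_FBML(a) + lambda * L_SBML(sa); each argument is a tuple of the tensors above."""
    return branch_multitask_loss(*first_branch) + lam * branch_multitask_loss(*second_branch)

# Toy usage with 6 anchors, 2 of them positive
cls_logits = torch.randn(6, 2)
cls_targets = torch.tensor([0, 1, 0, 0, 1, 0])
loc_preds, loc_targets = torch.randn(6, 4), torch.randn(6, 4)
branch = (cls_logits, cls_targets, loc_preds, loc_targets)
print(progressive_loss(branch, branch).item())
```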
In summary, the overall network structure of the feature-enhanced progressive cascade face detection method of the present invention is shown in fig. 2.
Step 3.4: when the progressive loss no longer rises and settles in a smaller range (e.g., (0, 1)), the training may be stopped, otherwise, the process returns to step 3.1.
Step 3.5: stopping training, saving the model and detecting. It is noted here that to avoid introducing additional computational costs, only the output of the enhanced second branch is used as a reference when the model is put into the actual detection process. The trained model is used for detecting partial human face samples related to attributes of inconsistent scales, fuzziness, strong and weak illumination, different postures, facial occlusion and makeup in the WIDER FACE test set, and the human face is marked by a rectangular frame, so that higher detection accuracy can be still maintained in the high-difficulty unconstrained scenes as shown in FIG. 9. The accuracy of the invention on the Easy, Medium and Hard verification sets of the disclosed WIDER FACE respectively reaches 95.3%, 94.1% and 88.5%, and as shown in figure 10, good gain is obtained. The method has wide application scenes, is suitable for face detection tasks in various unconstrained scenes, has extremely high comprehensiveness and generalization, and still has higher accuracy when the method is used for detecting the arbitrarily captured unconstrained faces as shown in figure 11.
The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims (9)

1. The progressive cascade face detection method based on feature enhancement under the unconstrained scene is characterized by comprising the following steps:
step 1, carrying out data augmentation on WIDERFACE training sets;
step 2, based on the augmented picture in the step 1, taking VGGNet-16 as a basic feature extraction network, utilizing a feature enhancement module to realize a dual-branch architecture, and applying a Max-Both-Out strategy to each branch and each level of feature graph for prediction;
and 3, after the training parameters are initialized, building an iterative cascade structure, guiding and supervising the autonomous learning process of the model by utilizing progressive loss, storing the model after the model is converged, and detecting the model.
2. The method for progressive cascade face detection based on feature enhancement under the unconstrained scene as claimed in claim 1, wherein the step 1 specifically comprises the following sub-steps:
step 1.1: the pictures in the WIDER FACE training set are horizontally flipped and randomly cropped as preliminary preprocessing, with the following specific operations: first, the input image is extended to 4 times its original size; then each picture is mirror-flipped horizontally; finally a 640 × 640 region is randomly cropped, i.e., the following formula is applied:
x_preprocess = Crop(Flip(Extend(x_input)))
where x_input represents an input training-set picture, the Extend operation enlarges the picture using mean-value padding, the Flip operation denotes a random horizontal flip, the Crop operation is random, and x_preprocess represents the corresponding preliminary preprocessing result, whose size is unified to 640 × 640;
step 1.2: color dithering and noise disturbance are adopted to simulate the interference in unconstrained scenes, and the preliminary preprocessing result x_preprocess obtained in step 1.1 is further enhanced to different degrees to obtain the comprehensively processed augmented picture x_process, as shown in the following formula:
x_process ∈ {Color(x_preprocess), Noise(Gaussian)(x_preprocess), Noise(Salt & pepper)(x_preprocess)}
where the Color operation denotes the color dithering method, and the Noise(Gaussian) and Noise(Salt & pepper) operations denote adding Gaussian noise and salt-and-pepper noise to the picture, respectively.
3. The method for progressive cascade face detection based on feature enhancement under the unconstrained scene as claimed in claim 1, wherein the step 2 specifically comprises the following sub-steps:
step 2.1: basic feature extraction is performed on the augmented input picture through VGGNet-16, where conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2 and conv7_2 are selected for final prediction, with feature map sizes of 160 × 160, 80 × 80, 40 × 40, 20 × 20, 10 × 10 and 5 × 5 respectively;
step 2.2: the feature enhancement module is used to realize the dual-branch architecture, and the original feature maps used for prediction in step 2.1 are enhanced with information of different dimensions; the neuron cell of the upper-layer original feature map is denoted oc_(i,j,l+1) and that of the current-layer original feature map oc_(i,j,l), the non-local neuron cells of the current-layer original feature map are nc_(i-,j-,l), nc_(i-,j,l), …, nc_(i,j+,l), nc_(i+,j+,l), and the neuron cell ec_(i,j,l) of the enhanced feature map is expressed as:
ec_(i,j,l) = f_Concat(f_Dilation(nc_(i,j,l)))
nc_(i,j,l) = f_Element-wise(oc_(i,j,l), f_Up(oc_(i,j,l+1)))
where c_(i,j,l) denotes the cell unit mapped to coordinates (i, j) in the feature map of the l-th layer, and f denotes the corresponding basic operations, namely concatenation, dilated convolution, element-wise multiplication and up-sampling;
step 2.3: a Max-Both-Out strategy is applied to the feature maps used for prediction on each branch and each level obtained in the above steps to reduce false positives among the training samples; the Max-Both-Out strategy simultaneously predicts Cp positive-sample face scores and Cn negative-sample background scores and selects the highest face score and the highest background score as the final target and the final background respectively, which is equivalent to integrating Cn + Cp classifiers.
4. The progressive cascade face detection method based on feature enhancement under the unconstrained scene according to claim 3, wherein the feature enhancement module in step 2.2 is specifically implemented as follows:
(1) the feature map is normalized with a 1 × 1 convolution kernel;
(2) the up-sampled upper-layer feature map is multiplied element by element with the current feature layer;
(3) the feature map is split into three branches, which are then fed into sub-networks containing different numbers of dilated convolution layers;
(4) the three branches are restored to the dimension of the initial feature map by channel concatenation.
5. The feature-enhancement-based progressive cascade face detection method under the unconstrained scene according to claim 1, wherein the step 3 specifically includes the following sub-steps:
step 3.1: initializing the training parameters;
step 3.2: an iterative cascade structure is set up during training, the model performance is optimized by utilizing the intersection ratio threshold between a candidate frame and a truth frame, each sub-detector is obtained based on the training of positive and negative samples with different thresholds, the output of the former sub-detector is used as the input of the latter sub-detector, iterative calculation is carried out step by step, and the intersection ratio threshold of the positive and negative samples is increased progressively to match a detection frame with higher confidence;
step 3.3: according to the progressive learning capability of each branch and each level of feature map, the progressive loss is adopted to guide and supervise the autonomous learning process of the model, where the progressive loss is obtained by weighted summation of the first-branch multi-task loss and the second-branch multi-task loss;
step 3.4: when the progressive loss no longer decreases and stabilizes within a small value range, training is stopped, and the model is saved and used for detection; otherwise, the procedure returns to step 3.1.
6. The progressive cascade face detection method based on feature enhancement under the unconstrained scene of claim 5, wherein in step 3.1, the optimizer is a stochastic gradient descent method with a momentum value of 0.9, and the weight decay value is set to 10^-5.
7. The progressive cascade face detection method based on feature enhancement under the unconstrained scene of claim 6, wherein when the number of iterations reaches the values in the set step list {40000, 60000, 80000}, the learning rate is multiplied by 0.1.
8. The progressive cascade face detection method based on feature enhancement under the unconstrained scene of claim 5, wherein in step 3.2, the iterative cascade structure is a three-level structure.
9. The feature-enhancement-based progressive cascade face detection method under the unconstrained scenario of claim 5, wherein in step 3.3, obtaining the progressive loss by weighted summation of the first branch multitask loss and the second branch multitask loss includes the following steps:
(1) basic category scoring is guided by softmax loss training, and the expression is as follows:
f(z_m) = exp(z_m) / Σ_{j=1}^{T} exp(z_j)
L_softmax = -Σ_{k=1}^{T} x_k · log f(z_k)
where x_k indicates the actual class label, z_m denotes the input of the softmax layer, f(z_m) represents the predicted output of the softmax layer, and T is the number of classes on the training data set;
the basic position regression is trained by smooth L1 loss guidance, and the expression is as follows:
L_reg = Σ_{i∈Ω} smooth_L1(y^(i) - ŷ^(i))
smooth_L1(x) = 0.5x², if |x| < 1; |x| - 0.5, otherwise
where y^(i) represents the true position label, ŷ^(i) represents the coordinate label information predicted by the CRFD model, and Ω represents the set of regions whose prior boxes are positive samples;
(2) the multi-task loss of the original first branch resulting from step 2.1 is defined as follows:
L_FBML(a) = (1/N)·Σ_i L_conf(p_i, p_i*) + (β/N_P)·Σ_i p_i*·L_loc(t_i, g_i, a_i)
where N represents the total number of dense positive and negative anchor boxes, N_P indicates the number of matched positive anchor boxes, L_conf refers to the softmax loss over the face and background categories, L_loc refers to the smooth L1 loss between the predicted box t_i and the ground-truth box g_i parameterized with respect to anchor a_i, p_i* indicates whether the prediction anchor p_i frames a positive sample (in which case L_loc is activated), and β is used to balance the weight between position regression and category scoring;
(3) the multi-task loss of the enhanced second branch resulting from step 2.2 is defined as follows:
L_SBML(sa) = (1/N)·Σ_i L_conf(p_i, p_i*) + (β/N_P)·Σ_i p_i*·L_loc(t_i, g_i, sa_i)
where N represents the total number of dense positive and negative anchor boxes, N_P indicates the number of matched positive anchor boxes, L_conf refers to the softmax loss over the face and background categories, and L_loc refers to the smooth L1 loss between the predicted box t_i and the ground-truth box g_i parameterized with respect to anchor sa_i; the original first branch uses anchors a_i for detection, while the enhanced second branch uses anchors sa_i;
(4) and weighting and summing the loss functions of the two branches to obtain the progressive loss, wherein the expression is as follows:
L_PL = L_FBML(a) + λ·L_SBML(sa)
in the formula, λ is a weighting coefficient.
CN202010319149.1A 2020-04-21 2020-04-21 Feature enhancement based progressive cascade face detection method under unconstrained scene Pending CN111553230A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010319149.1A CN111553230A (en) 2020-04-21 2020-04-21 Feature enhancement based progressive cascade face detection method under unconstrained scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010319149.1A CN111553230A (en) 2020-04-21 2020-04-21 Feature enhancement based progressive cascade face detection method under unconstrained scene

Publications (1)

Publication Number Publication Date
CN111553230A true CN111553230A (en) 2020-08-18

Family

ID=72007533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010319149.1A Pending CN111553230A (en) 2020-04-21 2020-04-21 Feature enhancement based progressive cascade face detection method under unconstrained scene

Country Status (1)

Country Link
CN (1) CN111553230A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472193A (en) * 2018-09-21 2019-03-15 北京飞搜科技有限公司 Method for detecting human face and device
CN109214353A (en) * 2018-09-27 2019-01-15 云南大学 A kind of facial image based on beta pruning model quickly detects training method and device
CN110674714A (en) * 2019-09-13 2020-01-10 东南大学 Human face and human face key point joint detection method based on transfer learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘其嘉: "Face Detection and Recognition Based on a Multi-task Cascaded Convolutional Network Model" *
姚树春 et al.: "Multi-scale Rotated Face Detection Method Based on Cascaded Regression Networks" *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069993A (en) * 2020-09-04 2020-12-11 西安西图之光智能科技有限公司 Dense face detection method and system based on facial features mask constraint and storage medium
CN112069993B (en) * 2020-09-04 2024-02-13 西安西图之光智能科技有限公司 Dense face detection method and system based on five-sense organ mask constraint and storage medium
CN112132140A (en) * 2020-09-23 2020-12-25 平安国际智慧城市科技股份有限公司 Vehicle brand identification method, device, equipment and medium based on artificial intelligence
CN112132140B (en) * 2020-09-23 2022-08-12 平安国际智慧城市科技股份有限公司 Vehicle brand identification method, device, equipment and medium based on artificial intelligence
CN113688785A (en) * 2021-09-10 2021-11-23 深圳市同为数码科技股份有限公司 Multi-supervision-based face recognition method and device, computer equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination