CN111553230A - Feature enhancement based progressive cascade face detection method under unconstrained scene - Google Patents

Feature enhancement based progressive cascade face detection method under unconstrained scene

Info

Publication number
CN111553230A
Authority
CN
China
Prior art keywords
branch
loss
feature
progressive
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010319149.1A
Other languages
Chinese (zh)
Inventor
徐琴珍
杨哲
刘杨
王路
王驭扬
杨绿溪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202010319149.1A priority Critical patent/CN111553230A/en
Publication of CN111553230A publication Critical patent/CN111553230A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a progressive cascade face detection scheme based on feature enhancement in unconstrained scenes, belonging to the field of multimedia signal processing. The method performs data augmentation on the training set, takes VGGNet-16 as the basic feature extraction network, uses a feature enhancement module to realize a dual-branch architecture, and applies a Max-Both-Out strategy to each branch before prediction. During training, an iterative cascade structure is built and a progressive loss is designed, namely a weighted sum of the first-branch and second-branch multi-task losses, to guide the learning process until convergence and finally realize detection of the target face. The method not only attends to context information but also emphatically mines the current-layer features, enriching the way facial features are extracted; it is suitable for unconstrained scenes with high detection difficulty and can accurately detect tiny, blurred and occluded faces.

Description

Feature enhancement based progressive cascade face detection method under unconstrained scene
Technical Field
The invention belongs to the technical field of image processing, and relates to a progressive cascade face detection method based on feature enhancement in an unconstrained scene.
Background
The popularization of intelligent terminal devices has profoundly changed how people live and interact. Face detection is one of the computer vision applications closest to daily life: it frees humans from heavy visual processing work by letting machines analyze and summarize specific information in images and videos, and it has deeply influenced the development of modern society. On smartphones, the iPhone X and the Mate 20 Pro brought 3D face recognition unlocking to the iOS and Android platforms respectively, better protecting user privacy; in security monitoring, face recognition technology helps track and capture lawbreakers, strengthening public security; in terms of property safety, Alipay pioneered face-scanning payment and uses it for identity authentication in credit services, improving efficiency while ensuring safety.
Early mainstream face detection methods were mostly based on manually designed template matching. They detect clear, unoccluded frontal faces well, are easy to implement, and are hardly affected by illumination or image quality; however, because the face is highly deformable, no fixed template can fully adapt to changes in pose, scale and so on, so their accuracy is limited. Traditional face detection methods, which decide whether a face exists in an image merely by mechanically comparing the correlation between handcrafted features and the target face, are therefore not suitable for unconstrained scenes.
With the rapid development of deep learning, face detection methods based on convolutional neural networks have gradually replaced traditional methods thanks to their strong representation learning and nonlinear modeling capability; detection performance has improved markedly, and accuracy on clear, unoccluded faces is close to one hundred percent. However, unconstrained faces in natural scenes are easily disturbed by external environmental factors such as occlusion, illumination, expression and pose, so facial features are insufficiently extracted and exploited. In addition, small, low-resolution faces remain a bottleneck: densely sampling small faces with small-scale anchors easily produces too many background negative samples and raises the false detection rate. The accuracy of existing face detection methods in unconstrained scenes is therefore still insufficient and does not reach a satisfactory level.
Disclosure of Invention
In order to solve the above problems, the present invention provides a progressive cascade face detection method based on feature enhancement in unconstrained scenes, which focuses on improvement and optimization in two aspects. On one hand, the current-layer features are fully mined: a feature enhancement module expands the single-branch architecture into a dual-branch architecture, and a progressive loss is designed to match the progressive learning capability of each branch and each level of feature map, enriching the way facial features are extracted. On the other hand, a Max-Both-Out strategy is applied and an iterative cascade structure is built; through cascaded sub-detectors with step-by-step increasing intersection-over-union thresholds, each stage is matched with a more appropriate sample distribution.
In order to achieve the purpose, the invention provides the following technical scheme:
the progressive cascade face detection method based on feature enhancement under the unconstrained scene comprises the following steps:
step 1, carrying out data augmentation on the WIDER FACE training set (currently the most authoritative face detection benchmark);
step 2, based on the augmented picture in the step 1, taking VGGNet-16 (a classical deep convolutional neural network) as a basic feature extraction network, utilizing a feature enhancement module to realize a dual-branch architecture, and applying a Max-Both-Out strategy to each branch and each level of feature graph for prediction;
and 3, after the training parameters are initialized, building an iterative cascade structure, guiding and supervising the autonomous learning process of the model by utilizing progressive loss, storing the model after the model is converged, and detecting the model.
Further, the step 1 specifically includes the following sub-steps:
step 1.1: the pictures in the WIDER FACE training set are horizontally flipped and randomly cropped as preliminary preprocessing, with the following specific operations: first, the input image is extended to 4 times its original size; then each picture is mirror-flipped horizontally; finally a 640 × 640 region is randomly cropped, i.e., the following formula is applied:
x_preprocess = Crop(Flip(Extend(x_input)))
where x_input represents an input training-set picture, the Extend operation enlarges the picture using mean-value padding, the Flip operation denotes a random horizontal flip, the Crop operation is random, and x_preprocess represents the corresponding preliminary preprocessing result, whose size is unified to 640 × 640;
step 1.2: color dithering and noise disturbance are adopted to simulate the interference in unconstrained scenes, and the preliminary preprocessing result x_preprocess obtained in step 1.1 is further enhanced to different degrees to obtain the comprehensively processed augmented picture x_process, as shown in the following formula:
x_process ∈ {Color(x_preprocess), Noise(Gaussian)(x_preprocess), Noise(Salt & pepper)(x_preprocess)}
where the Color operation denotes the color dithering method, and the Noise(Gaussian) and Noise(Salt & pepper) operations denote adding Gaussian noise and salt-and-pepper noise to the picture, respectively.
Further, the step 2 specifically includes the following sub-steps:
step 2.1: basic feature extraction is performed on the augmented input picture through VGGNet-16, where conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2 and conv7_2 are selected for final prediction, with feature map sizes of 160 × 160, 80 × 80, 40 × 40, 20 × 20, 10 × 10 and 5 × 5 respectively;
step 2.2: the feature enhancement module is used to realize the dual-branch architecture, and the original feature maps used for prediction in step 2.1 are enhanced with information of different dimensions; the neuron cell of the upper-layer original feature map is denoted oc_(i,j,l+1) and that of the current-layer original feature map oc_(i,j,l), the non-local neuron cells of the current-layer original feature map are nc_(i-,j-,l), nc_(i-,j,l), …, nc_(i,j+,l), nc_(i+,j+,l), and the neuron cell ec_(i,j,l) of the enhanced feature map is expressed as:
ec_(i,j,l) = f_Concat(f_Dilation(nc_(i,j,l)))
nc_(i,j,l) = f_Element-wise(oc_(i,j,l), f_Up(oc_(i,j,l+1)))
where c_(i,j,l) denotes the cell unit mapped to coordinates (i, j) in the feature map of the l-th layer, and f denotes the corresponding basic operations, namely concatenation, dilated convolution, element-wise multiplication and up-sampling;
step 2.3: a Max-Both-Out strategy is applied to the feature maps used for prediction on each branch and each level obtained in the above steps to reduce false positives among the training samples; the Max-Both-Out strategy simultaneously predicts Cp positive-sample face scores and Cn negative-sample background scores and selects the highest face score and the highest background score as the final target and the final background respectively, which is equivalent to integrating Cn + Cp classifiers.
Further, the feature enhancing module in step 2.2 is specifically implemented as follows:
(1) the feature map is normalized with a 1 × 1 convolution kernel;
(2) the up-sampled upper-layer feature map is multiplied element by element with the current feature layer;
(3) the feature map is split into three branches, which are then fed into sub-networks containing different numbers of dilated convolution layers;
(4) the three branches are restored to the dimension of the initial feature map by channel concatenation.
Further, the step 3 specifically includes the following sub-steps:
step 3.1: initializing the training parameters;
step 3.2: an iterative cascade structure is built during training, and model performance is optimized using the intersection-over-union (IoU) threshold between candidate boxes and ground-truth boxes; each sub-detector is trained with positive and negative samples defined by a different threshold, the output of the previous sub-detector serves as the input of the next, iterative computation proceeds stage by stage, and the IoU threshold for positive and negative samples increases progressively to match detection boxes of higher confidence;
step 3.3: according to the progressive learning capability of each branch and each level of feature map, the progressive loss is adopted to guide and supervise the autonomous learning process of the model, where the progressive loss is obtained by weighted summation of the first-branch multi-task loss and the second-branch multi-task loss;
step 3.4: when the progressive loss no longer decreases and stabilizes within a small value range, training is stopped, and the model is saved and used for detection; otherwise, the procedure returns to step 3.1.
Further, in step 3.1, the optimizer is a stochastic gradient descent method with a momentum value of 0.9, and the weight decay value is set to 10^-5.
Further, when the number of iterations reaches the values in the set step list {40000, 60000, 80000}, the learning rate is multiplied by 0.1.
Further, in 3.2, the iterative cascade structure is a three-level structure.
Further, in step 3.3, obtaining the progressive loss by weighted summation of the first branch multitask loss and the second branch multitask loss comprises the following steps:
(1) basic category scoring is guided by softmax loss training, and the expression is as follows:
f(z_m) = exp(z_m) / Σ_{j=1}^{T} exp(z_j)
L_softmax = -Σ_{k=1}^{T} x_k · log f(z_k)
where x_k indicates the actual class label, z_m denotes the input of the softmax layer, f(z_m) represents the output predicted by the softmax layer, and T is the number of classes on the training data set;
the basic position regression is trained by smooth L1 loss guidance, and the expression is as follows:
L_reg = Σ_{i∈Ω} smooth_L1(y^(i) - ŷ^(i))
smooth_L1(x) = 0.5x², if |x| < 1; |x| - 0.5, otherwise
where y^(i) represents the true position label, ŷ^(i) represents the coordinate label information predicted by the CRFD model, and Ω represents the set of regions whose prior boxes are positive samples;
(2) the multi-task loss of the original first branch resulting from step 2.1 is defined as follows:
L_FBML(a) = (1/N)·Σ_i L_conf(p_i, p_i*) + (β/N_P)·Σ_i p_i*·L_loc(t_i, g_i, a_i)
where N represents the total number of dense positive and negative anchor boxes, N_P indicates the number of matched positive anchor boxes, L_conf refers to the softmax loss over the face and background categories, L_loc refers to the smooth L1 loss between the predicted box t_i and the ground-truth box g_i parameterized with respect to anchor a_i, p_i* indicates whether the prediction anchor p_i frames a positive sample (in which case L_loc is activated), and β is used to balance the weight between position regression and category scoring;
(3) the multi-task loss of the enhanced second branch resulting from step 2.2 is defined as follows:
L_SBML(sa) = (1/N)·Σ_i L_conf(p_i, p_i*) + (β/N_P)·Σ_i p_i*·L_loc(t_i, g_i, sa_i)
where N represents the total number of dense positive and negative anchor boxes, N_P indicates the number of matched positive anchor boxes, L_conf refers to the softmax loss over the face and background categories, and L_loc refers to the smooth L1 loss between the predicted box t_i and the ground-truth box g_i parameterized with respect to anchor sa_i; the original first branch uses anchors a_i for detection, while the enhanced second branch uses anchors sa_i;
(4) and weighting and summing the loss functions of the two branches to obtain the progressive loss, wherein the expression is as follows:
L_PL = L_FBML(a) + λ·L_SBML(sa)
in the formula, λ is a weighting coefficient.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention remedies the neglect of the current feature layer in existing methods: while attending to contextual cues, it fully mines the feature-map information of the current layer and realizes the dual-branch architecture through the feature enhancement module. The progressive loss is designed correspondingly, remedying the deficiency of existing methods in the progressive learning capability of feature maps at different levels.
2. The method further improves the distribution situation of the positive and negative samples, applies a Max-Both-Out strategy and builds an iterative cascade structure, relieves the adverse effect of sample distribution on the precision in the prior method and obtains good gain.
3. The invention can keep higher detection accuracy rate, stronger interference resistance and extremely high plasticity and comprehensiveness when facing to the human face with the attributes of inconsistent scales, fuzziness, strong illumination, different postures, facial shielding, makeup and the like in an unconstrained scene.
Drawings
FIG. 1 is a flowchart of a progressive cascade face detection method based on feature enhancement according to the present invention.
FIG. 2 is a network model diagram of the progressive cascade human face detection method based on feature enhancement.
Fig. 3 is a schematic diagram of a human face image processing enhancement mode.
FIG. 4 is a feature map output visualization of a basic feature extraction network.
FIG. 5 is a diagram of a feature enhancement module.
Fig. 6 is a feature map output visualization of a dual-branch architecture.
FIG. 7 is a Max-Both-Out strategy diagram.
Fig. 8 is a schematic diagram of an iterative cascade structure.
Fig. 9 is a diagram illustrating the detection effect of the trained model on WIDER FACE face samples in the test set.
FIG. 10 shows the detection accuracy of the trained model on the Easy, Medium, Hard validation set of WIDER FACE.
Fig. 11 is a diagram illustrating the effect of detecting an unconstrained face by using a trained model.
The original pictures of the photos in the drawings are color pictures, and are modified into a gray form according to the requirements of patent filing.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.
Taking the WIDER FACE data set (the most authoritative face detection benchmark) as an example, the specific implementation of the feature-enhancement-based progressive cascade face detection method in unconstrained scenes of the present invention is described in further detail below with reference to the accompanying drawings. The flow is shown in fig. 1 and includes the following steps:
step 1: the data augmentation of WIDERFACE training set mainly includes the following two aspects:
step 1.1: the pictures in the WIDER FACE training set are horizontally flipped and randomly cropped as preliminary preprocessing, with the following specific operations: first, the input image is extended to 4 times its original size; then each picture is mirror-flipped horizontally; finally a 640 × 640 region is randomly cropped, i.e., the following formula is applied:
x_preprocess = Crop(Flip(Extend(x_input)))
where x_input represents an input training-set picture, the Extend operation enlarges the picture using mean-value padding, the Flip operation denotes a random horizontal flip, the Crop operation is random, and x_preprocess represents the corresponding preliminary preprocessing result, whose size is unified to 640 × 640. An example of the data augmentation operations is shown in fig. 3, where the first row is the original input image of arbitrary size, the second row scales the corresponding image to 4 times its original size, and the third and fourth rows show the preliminary preprocessing results of flipping and cropping for some of the samples.
Step 1.2: and simulating the interference in an unconstrained scene by adopting a color dithering and noise disturbance mode. These two data enhancement modes are briefly described below:
color dithering: considering different illumination intensity, background atmosphere, shooting conditions and the like, the saturation, brightness, contrast and sharpness of the input image are respectively adjusted according to random factors generated randomly.
Noise disturbance: the method mainly relates to the addition of Gaussian white noise and salt and pepper noise, wherein the Gaussian noise refers to that the noise amplitude obeys Gaussian distribution, namely the number of noise points with certain intensity is the largest, and the number of noise points which are farther away from the intensity is smaller, so that the noise is additive noise; the salt and pepper noise is an impulse noise, and the alternating black and white bright and dark point noise can be generated on an original image by randomly changing the values of some pixel points, so that the salt and pepper noise is vivid, is just like spreading salt and pepper on the image, and is a logic noise.
To sum up, the preliminary pre-processing result x obtained in step 1.1 is again subjected topreprocessEnhancing in different degrees to obtain an extended picture x after comprehensive treatmentprocessAs shown in the following formula:
x_process ∈ {Color(x_preprocess), Noise(Gaussian)(x_preprocess), Noise(Salt & pepper)(x_preprocess)}
where the Color operation denotes the color dithering method, and the Noise(Gaussian) and Noise(Salt & pepper) operations denote adding Gaussian noise and salt-and-pepper noise to the picture, respectively. An example of these operations is shown in fig. 3, where the fifth row applies color dithering to the picture cropped in the fourth row, and the sixth and seventh rows add Gaussian noise and salt-and-pepper noise of different degrees to that cropped picture, which strengthens the model's detection stability against arbitrary external environmental factors.
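For concreteness, the whole augmentation pipeline of step 1 can be sketched as follows. This is a minimal illustration only: the choice of PIL/NumPy, the jitter ranges, the noise strengths and all function names are assumptions made for illustration, not values taken from this patent.

```python
import random
import numpy as np
from PIL import Image, ImageEnhance, ImageOps

def extend(img, scale=4):
    # Enlarge the canvas to `scale` times the original size, padding with the mean color.
    w, h = img.size
    mean = tuple(int(c) for c in np.asarray(img).reshape(-1, 3).mean(axis=0))
    canvas = Image.new("RGB", (w * scale, h * scale), mean)
    canvas.paste(img, ((w * scale - w) // 2, (h * scale - h) // 2))
    return canvas

def flip(img, p=0.5):
    # Random mirror (horizontal) flip.
    return ImageOps.mirror(img) if random.random() < p else img

def crop(img, size=640):
    # Random size x size crop; the extended image is assumed to be at least size x size.
    w, h = img.size
    x0, y0 = random.randint(0, w - size), random.randint(0, h - size)
    return img.crop((x0, y0, x0 + size, y0 + size))

def color_jitter(img):
    # Randomly perturb saturation, brightness, contrast and sharpness.
    for enhancer in (ImageEnhance.Color, ImageEnhance.Brightness,
                     ImageEnhance.Contrast, ImageEnhance.Sharpness):
        img = enhancer(img).enhance(random.uniform(0.6, 1.4))
    return img

def gaussian_noise(img, sigma=10.0):
    arr = np.asarray(img).astype(np.float32) + np.random.normal(0.0, sigma, np.asarray(img).shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def salt_pepper_noise(img, amount=0.01):
    arr = np.asarray(img).copy()
    mask = np.random.rand(arr.shape[0], arr.shape[1])
    arr[mask < amount / 2] = 0          # pepper
    arr[mask > 1 - amount / 2] = 255    # salt
    return Image.fromarray(arr)

def augment(x_input):
    # x_preprocess = Crop(Flip(Extend(x_input))), then one of the step-1.2 enhancements.
    x_preprocess = crop(flip(extend(x_input)))
    enhance = random.choice([color_jitter, gaussian_noise, salt_pepper_noise])
    return enhance(x_preprocess)
```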
Step 2: based on the augmented picture in step 1, VGGNet-16 is taken as a basic feature extraction network, a dual-branch architecture is realized by using a feature enhancement module, and a Max-Both-Out strategy is applied to each branch and each level of feature graph for prediction, and the method mainly comprises the following steps:
step 2.1: basic feature extraction is performed on the augmented input picture through VGGNet-16, where conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2 and conv7_2 are selected for final prediction, with feature map sizes of 160 × 160, 80 × 80, 40 × 40, 20 × 20, 10 × 10 and 5 × 5 respectively. These feature maps are visualized in turn, as shown in fig. 4.
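As a rough sketch of how the six prediction feature maps of step 2.1 could be tapped, the following PyTorch snippet slices a VGG-16 backbone and appends three extra blocks standing in for conv_fc7, conv6_2 and conv7_2. The slicing indices, channel counts and extra-block designs are assumptions; the patent only names the layers and their output sizes.

```python
import torch
import torch.nn as nn
import torchvision

class MultiLevelVGG16(nn.Module):
    """Returns six feature maps (160/80/40/20/10/5 for a 640x640 input), mirroring step 2.1."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None).features
        self.to_conv3_3 = vgg[:16]    # 640 -> 160, 256 channels
        self.to_conv4_3 = vgg[16:23]  # 160 -> 80, 512 channels
        self.to_conv5_3 = vgg[23:30]  # 80  -> 40, 512 channels
        self.conv_fc7 = nn.Sequential(nn.MaxPool2d(2),                                   # 40 -> 20
                                      nn.Conv2d(512, 1024, 3, padding=1), nn.ReLU(inplace=True))
        self.conv6_2 = nn.Sequential(nn.Conv2d(1024, 512, 3, stride=2, padding=1),       # 20 -> 10
                                     nn.ReLU(inplace=True))
        self.conv7_2 = nn.Sequential(nn.Conv2d(512, 256, 3, stride=2, padding=1),        # 10 -> 5
                                     nn.ReLU(inplace=True))

    def forward(self, x):
        c3 = self.to_conv3_3(x)
        c4 = self.to_conv4_3(c3)
        c5 = self.to_conv5_3(c4)
        fc7 = self.conv_fc7(c5)
        c6 = self.conv6_2(fc7)
        c7 = self.conv7_2(c6)
        return [c3, c4, c5, fc7, c6, c7]

feats = MultiLevelVGG16()(torch.randn(1, 3, 640, 640))
print([f.shape[-1] for f in feats])  # [160, 80, 40, 20, 10, 5]
```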
Step 2.2: and (3) realizing a double-branch architecture by using a feature enhancement module, and enhancing the original feature map used for prediction in the step 2.1 through different dimension information. The neuron cells in the upper primitive feature map are marked as oc(i,j,l)The non-local neuron cell in the original feature map of the current layer is nc(i-,j-,l),nc(i-,j,l),…,nc(i,j+,l),nc(i+,j+,l)And the neuron cell ec of the characteristic diagram after the strengthening(i,j,l)Can be expressed as:
ec_(i,j,l) = f_Concat(f_Dilation(nc_(i,j,l)))
nc_(i,j,l) = f_Element-wise(oc_(i,j,l), f_Up(oc_(i,j,l+1)))
where c_(i,j,l) is the cell unit mapped to coordinates (i, j) in the feature map of the l-th layer, and f denotes the corresponding basic operations, namely concatenation (Concatenate), dilated convolution (Dilation Convolution), element-wise multiplication (Element-wise Production) and up-sampling (Up Sampling).
The structure of the feature enhancing module is shown in fig. 5, and is specifically realized as follows:
(1) the feature map is normalized with a 1 × 1 convolution kernel;
(2) the up-sampled upper-layer feature map is multiplied element by element with the current feature layer;
(3) the feature map is split into three branches, which are then fed into sub-networks containing different numbers of dilated convolution layers. Briefly, dilation of a convolutional layer means injecting holes into the standard convolution kernel to enlarge the receptive field; here the convolution kernel size is set to 3 × 3 and the dilation rate to 3. This hyper-parameter defines the spacing between the values of the convolution kernel when processing data; a normal convolutional layer has a dilation rate of 1.
(4) the three branches are restored to the dimension of the initial feature map by channel concatenation.
The feature maps of the expanded dual-branch architecture are visualized in sequence, as shown in fig. 6, for a selected training picture of size 640 × 640. The upper row is the original first branch, i.e., the feature maps output by the basic feature extraction network VGGNet-16 in step 2.1; the lower row is the enhanced second branch, i.e., the corresponding first-branch feature maps after passing through the feature enhancement module. It can be seen that the enhanced second branch contains more semantic information than the original first branch, which promotes detection performance.
Step 2.3: and applying a Max-Both-Out strategy to the feature maps used for prediction of each branch and each hierarchy obtained in the steps to reduce the false positive of the training sample, namely the probability that the prediction is true and actually is false, wherein the index can reflect the classification capability of the model. A schematic diagram of the Max-bouh-Out strategy is shown in fig. 7, wherein Cp positive sample face scores and Cn negative sample background scores are respectively predicted by a left branch, and then the final target and background with the highest positive score and the highest negative score are respectively selected from the Cp positive sample face scores and the Cn negative sample background scores, which is equivalent to integrating Cn + Cp classifiers. Therefore, the prediction probability of the negative samples can be effectively weakened, and the effect of reducing the false detection rate is achieved. In the invention, a Max-Both-Out strategy simultaneously predicts Cp positive sample face scores and Cn negative sample background scores, Cp is set to be 1 and Cn is set to be 3 for the first layer and the second layer of each branch, and multiple times of prediction of negative sample background scores is beneficial to detecting small facet holes; cp is set to 3 and Cn is set to 1 in all the other layers of each branch, and multiple times of predicting the face score of the positive sample can recall more faces as much as possible.
And step 3: after the training parameters are initialized, an iterative cascade structure is built, the model can be stored and detected after the model is converged by utilizing the self-learning process of progressive loss guidance and supervision of the model, and the method mainly comprises the following steps:
step 3.1: the training parameters are initialized, and the specific settings are shown in table 1 below.
TABLE 1 training parameter settings
(Table 1 is reproduced as an image in the original publication; the key settings it contains are summarized in the following paragraph.)
Wherein, the optimizer selects a random gradient descent (SGD) method with a momentum value of 0.9; meanwhile, to prevent overfitting, the weight attenuation value is set to 10-5. It should be noted that, in consideration of the continuous depth of the network learning process, the following settings are set for the learning rate: as the number of iterations increases, when the number of iterations is in the set stepping list {35000,45000,55000}, the learning rate decreases to 0.1, which can prevent the unexpected situation that the network parameter is close to the global optimal solution, and the optimal value is missed due to the excessive learning rate.
Step 3.2: and (3) an iterative cascade structure is set up during training, the model performance is optimized by utilizing the intersection ratio threshold between the candidate frame and the truth frame, namely, each sub-detector is obtained based on the training of positive and negative samples with different thresholds, the output of the former sub-detector is used as the input of the latter sub-detector, iterative calculation is carried out step by step, and the intersection ratio threshold of the positive and negative samples is increased progressively so as to match the detection frame with higher confidence coefficient. A schematic diagram of an iterative cascade structure is shown in fig. 8, and a three-stage structure is provided in the present invention, where Hi, Ci, and Bi (i ═ 1,2, and 3) respectively represent a network header, a classification result, and a coordinate position of an i-th detector. In the invention, intersection ratio thresholds are set to be [0.35,0.5 and 0.6] respectively for each stage of the three-stage iterative cascade structure.
Step 3.3: according to the asymptotic learning capacity of each branch and each level of feature graph, adopting an autonomous learning process of a progressive loss guide and supervision model, wherein progressive loss is obtained by weighted summation of first branch multitask loss and second branch multitask loss, and the progressive loss is elaborated as follows:
(1) basic category scoring is guided by softmax loss training, and the expression is as follows:
f(z_m) = exp(z_m) / Σ_{j=1}^{T} exp(z_j)
L_softmax = -Σ_{k=1}^{T} x_k · log f(z_k)
where x_k indicates the actual class label, z_m denotes the input of the softmax layer, f(z_m) represents the output predicted by the softmax layer, and T is the number of classes on the training data set.
The basic position regression is trained by smooth L1 loss guidance, and the expression is as follows:
L_reg = Σ_{i∈Ω} smooth_L1(y^(i) - ŷ^(i))
smooth_L1(x) = 0.5x², if |x| < 1; |x| - 0.5, otherwise
where y^(i) represents the true position label, ŷ^(i) represents the coordinate label information predicted by the CRFD model, and Ω represents the set of regions whose prior boxes are positive samples.
(2) The original First branch multitask Loss (FBML, First Branch Multi-task Loss) resulting from step 2.1 is defined as follows:
L_FBML(a) = (1/N)·Σ_i L_conf(p_i, p_i*) + (β/N_P)·Σ_i p_i*·L_loc(t_i, g_i, a_i)
where N represents the total number of dense positive and negative anchor boxes, N_P indicates the number of matched positive anchor boxes, L_conf refers to the softmax loss over the face and background categories, and L_loc refers to the smooth L1 loss between the predicted box t_i and the ground-truth box g_i parameterized with respect to anchor a_i. When p_i* = 1, the prediction anchor p_i frames a positive sample and L_loc is activated; β is used to balance the weight between position regression and category scoring.
(3) The multitask Loss (SBML, Second Branch Multi-task Loss) of the enhanced Second branch resulting from step 2.2 is defined as follows:
L_SBML(sa) = (1/N)·Σ_i L_conf(p_i, p_i*) + (β/N_P)·Σ_i p_i*·L_loc(t_i, g_i, sa_i)
where N represents the total number of dense positive and negative anchor boxes, N_P indicates the number of matched positive anchor boxes, L_conf refers to the softmax loss over the face and background categories, and L_loc refers to the smooth L1 loss between the predicted box t_i and the ground-truth box g_i parameterized with respect to anchor sa_i. When p_i* = 1, the prediction anchor p_i frames a positive sample and L_loc is activated, and β balances the weight between position regression and category scoring. The difference between the two branches is the anchors used: the original first branch uses a_i for detection, while the enhanced second branch uses sa_i.
(4) The loss functions of the two branches are weighted and summed to obtain the Progressive Loss (PL), which is expressed as:
L_PL = L_FBML(a) + λ·L_SBML(sa)
in the formula, lambda is a weighting coefficient, and lambda takes a value of 0.5 in the invention so as to match the compensation of the anchor point scale.
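Under the same symbol definitions, the progressive loss can be sketched as follows. The tensor shapes, the normalization by N and N_P, and the use of PyTorch's built-in cross-entropy and smooth L1 losses are our reading of the formulas above, not the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

def branch_multitask_loss(cls_logits, cls_targets, loc_preds, loc_targets, beta=1.0):
    """One branch's multi-task loss: softmax loss over all anchors plus smooth L1 on positive anchors.
    Shapes (our convention): cls_logits [N, 2], cls_targets [N] with 1 = face, loc_* [N, 4]."""
    n = cls_logits.shape[0]
    pos = cls_targets == 1
    n_pos = pos.sum().clamp(min=1)
    conf_loss = F.cross_entropy(cls_logits, cls_targets, reduction="sum") / n
    loc_loss = F.smooth_l1_loss(loc_preds[pos], loc_targets[pos], reduction="sum") / n_pos
    return conf_loss + beta * loc_loss

def progressive_loss(first_branch, second_branch, lam=0.5):
    """L_PL = L_FBML(a) + lambda * L_SBML(sa); each argument is a tuple of the tensors above."""
    return branch_multitask_loss(*first_branch) + lam * branch_multitask_loss(*second_branch)

# Toy usage with 6 anchors, 2 of them positive
cls_logits = torch.randn(6, 2)
cls_targets = torch.tensor([0, 1, 0, 0, 1, 0])
loc_preds, loc_targets = torch.randn(6, 4), torch.randn(6, 4)
branch = (cls_logits, cls_targets, loc_preds, loc_targets)
print(progressive_loss(branch, branch).item())
```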
In summary, the overall network structure of the feature-enhanced progressive cascade face detection method of the present invention is shown in fig. 2.
Step 3.4: when the progressive loss no longer rises and settles in a smaller range (e.g., (0, 1)), the training may be stopped, otherwise, the process returns to step 3.1.
Step 3.5: stopping training, saving the model and detecting. It is noted here that to avoid introducing additional computational costs, only the output of the enhanced second branch is used as a reference when the model is put into the actual detection process. The trained model is used for detecting partial human face samples related to attributes of inconsistent scales, fuzziness, strong and weak illumination, different postures, facial occlusion and makeup in the WIDER FACE test set, and the human face is marked by a rectangular frame, so that higher detection accuracy can be still maintained in the high-difficulty unconstrained scenes as shown in FIG. 9. The accuracy of the invention on the Easy, Medium and Hard verification sets of the disclosed WIDER FACE respectively reaches 95.3%, 94.1% and 88.5%, and as shown in figure 10, good gain is obtained. The method has wide application scenes, is suitable for face detection tasks in various unconstrained scenes, has extremely high comprehensiveness and generalization, and still has higher accuracy when the method is used for detecting the arbitrarily captured unconstrained faces as shown in figure 11.
The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims (9)

1. The progressive cascade face detection method based on feature enhancement under the unconstrained scene is characterized by comprising the following steps:
step 1, carrying out data augmentation on WIDERFACE training sets;
step 2, based on the augmented picture in the step 1, taking VGGNet-16 as a basic feature extraction network, utilizing a feature enhancement module to realize a dual-branch architecture, and applying a Max-Both-Out strategy to each branch and each level of feature graph for prediction;
and 3, after the training parameters are initialized, building an iterative cascade structure, guiding and supervising the autonomous learning process of the model by utilizing progressive loss, storing the model after the model is converged, and detecting the model.
2. The method for progressive cascade face detection based on feature enhancement under the unconstrained scene as claimed in claim 1, wherein the step 1 specifically comprises the following sub-steps:
step 1.1: the pictures in the WIDER FACE training set are horizontally flipped and randomly cropped as preliminary preprocessing, with the following specific operations: first, the input image is extended to 4 times its original size; then each picture is mirror-flipped horizontally; finally a 640 × 640 region is randomly cropped, i.e., the following formula is applied:
x_preprocess = Crop(Flip(Extend(x_input)))
where x_input represents an input training-set picture, the Extend operation enlarges the picture using mean-value padding, the Flip operation denotes a random horizontal flip, the Crop operation is random, and x_preprocess represents the corresponding preliminary preprocessing result, whose size is unified to 640 × 640;
step 1.2: color dithering and noise disturbance are adopted to simulate the interference in unconstrained scenes, and the preliminary preprocessing result x_preprocess obtained in step 1.1 is further enhanced to different degrees to obtain the comprehensively processed augmented picture x_process, as shown in the following formula:
x_process ∈ {Color(x_preprocess), Noise(Gaussian)(x_preprocess), Noise(Salt & pepper)(x_preprocess)}
where the Color operation denotes the color dithering method, and the Noise(Gaussian) and Noise(Salt & pepper) operations denote adding Gaussian noise and salt-and-pepper noise to the picture, respectively.
3. The method for progressive cascade face detection based on feature enhancement under the unconstrained scene as claimed in claim 1, wherein the step 2 specifically comprises the following sub-steps:
step 2.1: basic feature extraction is performed on the augmented input picture through VGGNet-16, where conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2 and conv7_2 are selected for final prediction, with feature map sizes of 160 × 160, 80 × 80, 40 × 40, 20 × 20, 10 × 10 and 5 × 5 respectively;
step 2.2: the feature enhancement module is used to realize the dual-branch architecture, and the original feature maps used for prediction in step 2.1 are enhanced with information of different dimensions; the neuron cell of the upper-layer original feature map is denoted oc_(i,j,l+1) and that of the current-layer original feature map oc_(i,j,l), the non-local neuron cells of the current-layer original feature map are nc_(i-,j-,l), nc_(i-,j,l), …, nc_(i,j+,l), nc_(i+,j+,l), and the neuron cell ec_(i,j,l) of the enhanced feature map is expressed as:
ec_(i,j,l) = f_Concat(f_Dilation(nc_(i,j,l)))
nc_(i,j,l) = f_Element-wise(oc_(i,j,l), f_Up(oc_(i,j,l+1)))
where c_(i,j,l) denotes the cell unit mapped to coordinates (i, j) in the feature map of the l-th layer, and f denotes the corresponding basic operations, namely concatenation, dilated convolution, element-wise multiplication and up-sampling;
step 2.3: a Max-Both-Out strategy is applied to the feature maps used for prediction on each branch and each level obtained in the above steps to reduce false positives among the training samples; the Max-Both-Out strategy simultaneously predicts Cp positive-sample face scores and Cn negative-sample background scores and selects the highest face score and the highest background score as the final target and the final background respectively, which is equivalent to integrating Cn + Cp classifiers.
4. The progressive cascade face detection method based on feature enhancement under the unconstrained scene according to claim 3, wherein the feature enhancement module in step 2.2 is specifically implemented as follows:
(1) the feature map is normalized with a 1 × 1 convolution kernel;
(2) the up-sampled upper-layer feature map is multiplied element by element with the current feature layer;
(3) the feature map is split into three branches, which are then fed into sub-networks containing different numbers of dilated convolution layers;
(4) the three branches are restored to the dimension of the initial feature map by channel concatenation.
5. The feature-enhancement-based progressive cascade face detection method under the unconstrained scene according to claim 1, wherein the step 3 specifically includes the following sub-steps:
step 3.1: initializing the training parameters;
step 3.2: an iterative cascade structure is set up during training, the model performance is optimized by utilizing the intersection ratio threshold between a candidate frame and a truth frame, each sub-detector is obtained based on the training of positive and negative samples with different thresholds, the output of the former sub-detector is used as the input of the latter sub-detector, iterative calculation is carried out step by step, and the intersection ratio threshold of the positive and negative samples is increased progressively to match a detection frame with higher confidence;
step 3.3: according to the progressive learning capability of each branch and each level of feature map, the progressive loss is adopted to guide and supervise the autonomous learning process of the model, where the progressive loss is obtained by weighted summation of the first-branch multi-task loss and the second-branch multi-task loss;
step 3.4: when the progressive loss no longer decreases and stabilizes within a small value range, training is stopped, and the model is saved and used for detection; otherwise, the procedure returns to step 3.1.
6. The progressive cascade face detection method based on feature enhancement under the unconstrained scene of claim 5, wherein in step 3.1, the optimizer is a stochastic gradient descent method with a momentum value of 0.9, and the weight decay value is set to 10^-5.
7. The progressive cascade face detection method based on feature enhancement under the unconstrained scene of claim 6, wherein when the number of iterations reaches the values in the set step list {40000, 60000, 80000}, the learning rate is multiplied by 0.1.
8. The progressive cascade face detection method based on feature enhancement under the unconstrained scene of claim 5, wherein in step 3.2, the iterative cascade structure is a three-level structure.
9. The feature-enhancement-based progressive cascade face detection method under the unconstrained scenario of claim 5, wherein in step 3.3, obtaining the progressive loss by weighted summation of the first branch multitask loss and the second branch multitask loss includes the following steps:
(1) basic category scoring is guided by softmax loss training, and the expression is as follows:
f(z_m) = exp(z_m) / Σ_{j=1}^{T} exp(z_j)
L_softmax = -Σ_{k=1}^{T} x_k · log f(z_k)
where x_k indicates the actual class label, z_m denotes the input of the softmax layer, f(z_m) represents the predicted output of the softmax layer, and T is the number of classes on the training data set;
the basic position regression is trained by smooth L1 loss guidance, and the expression is as follows:
L_reg = Σ_{i∈Ω} smooth_L1(y^(i) - ŷ^(i))
smooth_L1(x) = 0.5x², if |x| < 1; |x| - 0.5, otherwise
where y^(i) represents the true position label, ŷ^(i) represents the coordinate label information predicted by the CRFD model, and Ω represents the set of regions whose prior boxes are positive samples;
(2) the multi-task loss of the original first branch resulting from step 2.1 is defined as follows:
L_FBML(a) = (1/N)·Σ_i L_conf(p_i, p_i*) + (β/N_P)·Σ_i p_i*·L_loc(t_i, g_i, a_i)
where N represents the total number of dense positive and negative anchor boxes, N_P indicates the number of matched positive anchor boxes, L_conf refers to the softmax loss over the face and background categories, L_loc refers to the smooth L1 loss between the predicted box t_i and the ground-truth box g_i parameterized with respect to anchor a_i, p_i* indicates whether the prediction anchor p_i frames a positive sample (in which case L_loc is activated), and β is used to balance the weight between position regression and category scoring;
(3) the multi-task loss of the enhanced second branch resulting from step 2.2 is defined as follows:
L_SBML(sa) = (1/N)·Σ_i L_conf(p_i, p_i*) + (β/N_P)·Σ_i p_i*·L_loc(t_i, g_i, sa_i)
where N represents the total number of dense positive and negative anchor boxes, N_P indicates the number of matched positive anchor boxes, L_conf refers to the softmax loss over the face and background categories, and L_loc refers to the smooth L1 loss between the predicted box t_i and the ground-truth box g_i parameterized with respect to anchor sa_i; the original first branch uses anchors a_i for detection, while the enhanced second branch uses anchors sa_i;
(4) and weighting and summing the loss functions of the two branches to obtain the progressive loss, wherein the expression is as follows:
L_PL = L_FBML(a) + λ·L_SBML(sa)
in the formula, λ is a weighting coefficient.
CN202010319149.1A 2020-04-21 2020-04-21 Feature enhancement based progressive cascade face detection method under unconstrained scene Pending CN111553230A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010319149.1A CN111553230A (en) 2020-04-21 2020-04-21 Feature enhancement based progressive cascade face detection method under unconstrained scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010319149.1A CN111553230A (en) 2020-04-21 2020-04-21 Feature enhancement based progressive cascade face detection method under unconstrained scene

Publications (1)

Publication Number Publication Date
CN111553230A true CN111553230A (en) 2020-08-18

Family

ID=72007533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010319149.1A Pending CN111553230A (en) 2020-04-21 2020-04-21 Feature enhancement based progressive cascade face detection method under unconstrained scene

Country Status (1)

Country Link
CN (1) CN111553230A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472193A (en) * 2018-09-21 2019-03-15 北京飞搜科技有限公司 Method for detecting human face and device
CN109214353A (en) * 2018-09-27 2019-01-15 云南大学 A kind of facial image based on beta pruning model quickly detects training method and device
CN110674714A (en) * 2019-09-13 2020-01-10 东南大学 Human face and human face key point joint detection method based on transfer learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘其嘉: "Face Detection and Recognition Based on a Multi-task Cascaded Convolutional Network Model" *
姚树春 et al.: "Multi-scale Rotated Face Detection Method Based on Cascaded Regression Networks" *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069993A (en) * 2020-09-04 2020-12-11 西安西图之光智能科技有限公司 Dense face detection method and system based on facial features mask constraint and storage medium
CN112069993B (en) * 2020-09-04 2024-02-13 西安西图之光智能科技有限公司 Dense face detection method and system based on five-sense organ mask constraint and storage medium
CN112132140A (en) * 2020-09-23 2020-12-25 平安国际智慧城市科技股份有限公司 Vehicle brand identification method, device, equipment and medium based on artificial intelligence
CN112132140B (en) * 2020-09-23 2022-08-12 平安国际智慧城市科技股份有限公司 Vehicle brand identification method, device, equipment and medium based on artificial intelligence
CN113688785A (en) * 2021-09-10 2021-11-23 深圳市同为数码科技股份有限公司 Multi-supervision-based face recognition method and device, computer equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination