CN108520219B - Multi-scale rapid face detection method based on convolutional neural network feature fusion


Info

Publication number
CN108520219B
CN108520219B
Authority
CN
China
Prior art keywords
neural network
detection
layer
model
deep neural
Prior art date
Legal status
Active
Application number
CN201810276795.7A
Other languages
Chinese (zh)
Other versions
CN108520219A (en)
Inventor
钱学明
韩振
张宇奇
邹屹洋
侯兴松
Current Assignee
Taizhou Zhibi'an Technology Co ltd
Original Assignee
Taizhou Zhibi'an Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Taizhou Zhibi'an Technology Co ltd
Priority to CN201810276795.7A
Publication of CN108520219A
Application granted
Publication of CN108520219B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale rapid face detection method with convolutional neural network feature fusion, which comprises the following steps. Step 1: based on the model structure of the SSD fast target detection method, modify the feature extraction method in the SSD and add a feature fusion method, obtaining a modified detection model. Step 2: train the detection model modified in step 1 for face detection, obtaining a trained deep neural network model. Step 3: run the deep neural network model trained in step 2 on the picture to be detected, obtaining the model output result. The invention can quickly identify faces in an image and position them accurately, separating the faces from a complex background and providing a basis for identity verification and tracking of the persons in the image.

Description

Multi-scale rapid face detection method based on convolutional neural network feature fusion
Technical Field
The invention belongs to the technical field of computer digital image processing and pattern recognition, and particularly relates to a face detection method.
Background
With the growing number of surveillance cameras in China, massive amounts of surveillance video are generated every day. In this context, computer-aided analysis of surveillance video content has become necessary. In surveillance video the monitored objects are mainly people, and facial features are the most important information for identity recognition and verification from image data. Face detection locates all faces in a picture and separates them from the background, which is the prerequisite for face representation and face recognition. Face detection is therefore the first step in surveillance video content analysis.
Current target detection methods fall mainly into traditional methods and deep learning-based methods. The basic flow is the same for both: first extract features from the image, then classify each part of the image as foreground or background on the feature map. The features used in traditional target detection are mostly hand-designed from human experience, such as Haar, HOG and LBP; being based on human experience, such features cannot fully characterize an image. In addition, traditional methods mainly adopt classifiers such as SVM and AdaBoost, whose classification accuracy on images is inferior to classification with a convolutional neural network. Deep learning-based detection trains a convolutional neural network (CNN) to extract deep features and trains a classifier on top of those features. Such methods can be roughly divided into two types: detection based on candidate regions, and detection by direct regression. A representative candidate-region method is Faster R-CNN, proposed by Ross Girshick's team; see: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015: 91-99. Its basic idea is to extract a coarse feature map from the image through a CNN, propose candidate boxes that may contain objects from that feature map, then crop the corresponding parts of the feature map and send them to a final classifier and detection-box regressor. This method has high detection accuracy, but because it has two stages of judging whether an object is present, detection is very slow and cannot run in real time. A representative direct-regression method is the SSD target detection algorithm; see: W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single Shot MultiBox Detector. ECCV, pages 21-37, 2016. Its basic idea is to abandon the two-stage judgment of Faster R-CNN and perform classification and regression directly on the feature maps extracted by the CNN. This single-stage detection is fast and can detect simultaneously on features of different depths in the CNN. Although its detection speed is high, its accuracy is worse: the missed-detection and false-detection rates are high and the detection boxes are not accurate enough.
Because the face detection method here processes video, the requirement on single-frame detection speed is higher. The Faster R-CNN method is too slow to detect in real time. The SSD method is fast, but its detection accuracy indicators, such as the missed-detection and false-detection rates, are relatively poor, and the positioning of its detection boxes is not accurate enough.
Disclosure of Invention
The invention aims to provide a multi-scale rapid face detection method with deep feature fusion that locates the positions of all faces in a picture, overcoming the high missed-detection rate, high false-detection rate and poor box-positioning accuracy of the SSD fast target detection method.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-scale rapid face detection method with convolutional neural network feature fusion comprises the following steps:
step 1: based on the model structure of the SSD fast target detection method, a feature extraction method in the SSD is modified, a feature fusion method is added, and a modified detection model is obtained;
step 2: training the detection model modified in the step 1 aiming at face detection to obtain a trained deep neural network model;
and step 3: and (3) calculating the picture to be detected by using the deep neural network model trained in the step (2) to obtain a model output result.
Further, step 1 specifically includes:
the features of the SSD512 input detector are conv4_3 layer features of VGG, and subsequently added fc7, conv6_2, conv7_2, conv8_2, conv9_2 and conv10_2 layer features, respectively; starting from the conv8_2 layer feature, the feature fusion is carried out by upsampling and conv7_2 layer feature fusion; the fused features are up-sampled again and fused with the conv6_2 layer features; by analogy, fusing to conv4_3, and respectively sequentially fusing to obtain fuse7_2, fuse6_2, fuse7 and fuse4_ 3; and replacing the original conv4_3 layer, fc7 layer, conv6_2 layer and conv7_2 layer characteristics with the fused fuse4_3, fuse fc7, fuse6_2 and fuse7_2 characteristics, feeding the original conv8_2 layer and conv9_2 layer and conv10_2 layer characteristics which do not participate in fusion into the detector identical to the SSD model, and performing regression and classification on detection frames by the detector consisting of a plurality of convolution layers on the basis of the 7 characteristics to obtain the modified detection model.
Further, step 1 performs feature fusion on the features of the conv4_3 layer, fc7 layer, conv6_2 layer, conv7_2 layer and conv8_2 layer, and the two-layer feature fusion step is as follows:
1.1, firstly, the deep feature f_d to be fused is enlarged by nearest-neighbor upsampling to obtain f_d↑; the width and height of f_d↑ then equal those of the shallow feature f_s to be fused;
1.2, f_d↑ and f_s are spliced along the channel dimension into the longer feature f_{d+s};
1.3, one 3 × 3 convolution layer reduces the channel dimension of f_{d+s} to suppress unwanted noise, bringing its c_d + c_s channels down to a uniform 256; a ReLU activation layer then yields the final fused feature f_fuse;
1.4, f_fuse, as the fusion of the deep feature with the shallower one, is fused onward with shallower features until conv4_3, yielding fuse7_2, fuse6_2, fuse_fc7 and fuse4_3 in sequence.
Further, step 2 specifically includes:
2.1, initializing detection model parameters by adopting parameters of a VGG pre-training model given by SSD;
2.2, adopting the public WIDER Face detection data set; randomly extracting a number of pictures from WIDER Face as a batch, and performing data enhancement and preprocessing on the batch;
2.3, inputting the batch of pictures subjected to data enhancement and preprocessing into the deep neural network model, and obtaining the output results for the batch through the model's computation; the deep neural network model structure comprises the convolutional neural network used by the SSD to extract features, a feature fusion part and a detector part, wherein the feature fusion part is as described in step 1, and the feature-extraction convolutional neural network and the detector part follow the model settings of the SSD;
2.4, comparing the output result of the deep neural network model with a label given by a data set, and calculating loss through a loss function;
2.5, updating the parameters of the deep neural network model by using a random gradient descent method;
2.6, judging whether the deep neural network model reaches a convergence condition, and returning to the step 2.2 if the deep neural network model does not reach the convergence condition; if so, finishing the training to obtain the trained deep neural network model.
Further, step 2.2 specifically includes: data enhancement is performed as follows: the brightness is fine-tuned with probability 0.5, the adjustment drawn uniformly from ±32; the contrast is fine-tuned with probability 0.5, the factor drawn uniformly between 0.5× and 1.5×; the hue is fine-tuned with probability 0.5, the adjustment drawn uniformly from ±18; the saturation is fine-tuned with probability 0.5, the factor drawn uniformly between 0.5× and 1.5×; after data enhancement, the image is preprocessed as follows: the enhanced picture is resized to a fixed 512 × 512 size by bilinear interpolation; and the RGB means over all pixels of the WIDER Face data set, computed in advance, are subtracted from the three RGB channels of the 512 × 512 picture.
Further, step 3 specifically includes:
3.1, preprocessing the picture to be detected: as in the preprocessing of step 2.2, the picture to be detected is resized to a fixed 512 × 512 size by bilinear interpolation, and the precomputed RGB channel means over all pixels of the WIDER Face data set are subtracted from the RGB channels of the resized picture;
3.2, inputting the preprocessed picture to be detected into the deep neural network model trained in the step 2, and respectively obtaining output results of the picture through model calculation;
and 3.3, performing statistical non-maximum suppression on the output result of the deep neural network model trained in the step 2 to obtain a model output result.
Further, step 3.3 specifically includes:
(a) the detection boxes output by the deep neural network model share a uniform format: each box is given by five numbers x_1, y_1, x_2, y_2 and s, where (x_1, y_1) and (x_2, y_2) are the top-left and bottom-right coordinates of the box, and s is the model's prediction confidence for the box, called its score, taking a value between 0 and 1 (the higher the score, the more confident the model is in the box); denote the set of all detection boxes output by the model as B; find the box b_max with the highest score in B, with coordinates ((x_m1, y_m1), (x_m2, y_m2)) and score s_m;
initialize five accumulators x_sum1 = s_m · x_m1, x_sum2 = s_m · x_m2, y_sum1 = s_m · y_m1, y_sum2 = s_m · y_m2, s_sum = s_m, where x_sum1, x_sum2, y_sum1, y_sum2 hold score-weighted sums of box coordinates and s_sum holds the sum of scores;
(b) find all boxes in B that frame the same object as b_max, denoted B_same: compute the IOU between b_max and every other box b_i in B, and if the IOU exceeds the threshold θ, consider b_i and b_max to frame the same object and add b_i to B_same; let the top-left and bottom-right coordinates of b_i be ((x_1, y_1), (x_2, y_2)); then the area I covered by both b_i and b_max is defined as:
I = (min(x_m2, x_2) - max(x_m1, x_1)) · (min(y_m2, y_2) - max(y_m1, y_1))
the total area U covered by b_i and b_max is defined as:
U = (x_2 - x_1)(y_2 - y_1) + (x_m2 - x_m1)(y_m2 - y_m1) - I
and the overlap IOU of b_i and b_max is defined as:
IOU = I / U
The IOU reflects the overlapping proportion of the two boxes, and 0 ≤ IOU ≤ 1. Let θ = 0.3; whenever IOU > θ, update
x_sum1 ← x_sum1 + s_i · x_i1
x_sum2 ← x_sum2 + s_i · x_i2
y_sum1 ← y_sum1 + s_i · y_i1
y_sum2 ← y_sum2 + s_i · y_i2
s_sum ← s_sum + s_i
where s_i and ((x_i1, y_i1), (x_i2, y_i2)) denote the score and coordinates of b_i;
(c) take the weighted average of the coordinates of all boxes in B_same together with b_max, using each box's score as its weight; the averaged coordinates ((x_mean1, y_mean1), (x_mean2, y_mean2)) define the box b_mean, computed as follows:
x_mean1 = x_sum1 / s_sum
y_mean1 = y_sum1 / s_sum
x_mean2 = x_sum2 / s_sum
y_mean2 = y_sum2 / s_sum
(d) remove b_max and the boxes in B_same from B, and add b_mean to the output set D:
B ← B \ ({b_max} ∪ B_same), D ← D ∪ {b_mean};
(e) repeat steps (a), (b), (c) and (d) until B is empty; the boxes collected in D are then the output results of statistical non-maximum suppression.
In order to ensure faster detection speed, the invention modifies the whole network structure based on the SSD fast target detection method, and trains the modified network aiming at face detection;
In order to reduce the missed detections and false detections of the SSD, the invention adds a feature fusion method to the SSD network structure: detection requires features that contain both local features for positioning and semantic features for classification. In general, deep CNN features are rich in semantics but lack local positioning information, while shallow features are rich in local detail but, limited by network depth, lack semantics. To reduce the missed and false detections caused by classification errors, the invention fuses features between different depths of the convolutional neural network to obtain more holistic and comprehensive features, which aids the classifier's judgment, reducing missed and false detections, and aids the regressor's positioning, improving localization accuracy;
in order to further improve the positioning accuracy of the detection frame, the statistical non-maximum suppression method is used in the merging process of the final result: and a statistic-based non-maximum suppression algorithm is used for screening the generated final result to replace the traditional non-maximum suppression so as to eliminate the contingency caused by the traditional non-maximum suppression algorithm and improve the positioning accuracy of the detection frame.
The statistical non-maximum suppression of the invention differs from conventional non-maximum suppression in step 3.3: conventional non-maximum suppression directly discards the detection boxes with IOU > θ, whereas the invention takes the weighted average of their coordinates together with the highest-scoring box. Statistical information thus replaces the information of an individual box, eliminating the chance event that the highest-scoring box deviates from the object.
Compared with the prior art, the invention has the following beneficial effects: training the SSD on a face detection data set achieves rapid target detection; fusing convolutional neural network features of different depths strengthens the expressive power of the features, improving classification and positioning and, to a degree, overcoming the high missed-detection and false-detection rates and poor box positioning of current fast detection algorithms; and using statistical non-maximum suppression when merging the final results eliminates chance effects and further improves the positioning accuracy of the detection boxes.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of a neural network model of the present invention, including a deep convolutional network feature model, a deep feature fusion branch;
FIG. 2 is a neural network training and detection flow diagram;
FIG. 3 is a flow chart of a statistical non-maximum suppression algorithm;
FIGS. 4(a) and 4(b) are partial face detection result graphs on FDDB data sets;
fig. 5(a) and fig. 5(b) compare partial face detection results of SSD-512 and the present invention on a smart-city data set sampled at a frame-extraction interval of 10 frames.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
As shown in fig. 1, the invention relates to a multi-scale rapid face detection method with convolutional neural network feature fusion, which comprises the following steps:
Step 1, taking the SSD (Single Shot MultiBox Detector) target detection algorithm as the basic framework; see: W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single Shot MultiBox Detector. ECCV, pages 21-37, 2016.
As shown in fig. 1, SSD is a multi-scale target detection method: the detector performs classification and detection-box regression separately on CNN features of different depths, i.e. features of different scales. The features fed to the SSD512 detector are the conv4_3 layer features of VGG and the subsequently added fc7, conv6_2, conv7_2, conv8_2, conv9_2 and conv10_2 layer features; the detailed model structure is given in the literature and is not repeated here. The invention finds that the conv9_2 and conv10_2 layer features are too coarse to help the classification of small objects at shallow layers. Feature fusion therefore starts from the conv8_2 layer features, which are upsampled and fused with the conv7_2 layer features; the fused features are upsampled again and fused with the conv6_2 layer features; and so on, until conv4_3. The fused fuse4_3, fuse_fc7, fuse6_2 and fuse7_2 features replace the original conv4_3, fc7, conv6_2 and conv7_2 layer features and are fed into the detector together with the original conv8_2, conv9_2 and conv10_2 layer features that do not participate in fusion.
The invention performs feature fusion on the conv4_3, fc7, conv6_2, conv7_2 and conv8_2 layer features. The two-layer feature fusion procedure is as follows:
1.1, firstly, the deep feature f_d to be fused is enlarged by nearest-neighbor upsampling to obtain f_d↑; the width and height of f_d↑ then equal those of the shallow feature f_s to be fused. For example, when the input is a 3 × 512 × 512 RGB color image, the conv8_2 layer feature of the SSD feature-extraction network has size 256 × 4 × 4; upsampling the conv8_2 layer features by nearest-neighbor interpolation gives features of size 256 × 8 × 8, matching the height and width of the conv7_2 layer;
1.2, f_d↑ and f_s are spliced along the channel dimension into the longer feature f_{d+s}. For example, the 256 × 8 × 8 features obtained by upsampling the conv8_2 layer are spliced in the channel dimension with the conv7_2 layer features of the same 256 × 8 × 8 size, giving features of size 512 × 8 × 8;
1.3, one 3 × 3 convolution layer reduces the channel dimension of f_{d+s} to suppress unwanted noise, bringing its c_d + c_s channels down to a uniform 256; a ReLU activation layer then yields the final fused feature f_fuse. For example, the features spliced from the upsampled conv8_2 and conv7_2 are fed into a combination of a convolution layer and a ReLU activation layer for dimensionality reduction, reducing the 512 × 8 × 8 features to 256 × 8 × 8;
1.4, f_fuse serves as the fusion of the deep feature with the shallower one. For example, the feature obtained from upsampling conv8_2, splicing with conv7_2 and reducing dimensions is in turn upsampled and fused with the conv6_2 layer features, and so on until conv4_3;
the characteristics of the final feed detector are the characteristics of the conv8_2 layer and the conv9_2 layer and the conv10_2 layer which are not fused, and the characteristics of fuse4_2, fuse4_3, fuse fc7, fuse6_2 and fuse7_2 which are fused respectively. Here, using the same detector as the SSD model, a detector consisting of multiple convolutional layers would perform regression and classification of the detection frame on the basis of these 7 features.
Step 2, training the detection model reconstructed in the step 1 for face detection:
as shown in fig. 1, a wire Face data set is used in the training process, and the steps are as follows:
2.1, initializing detection model parameters by adopting parameters of a VGG pre-training model given by SSD;
2.2, the data set is the public WIDER Face detection data set; see: Shuo Yang, Ping Luo, Chen Change Loy, Xiaoou Tang. WIDER FACE: A Face Detection Benchmark. CVPR 2016: 5525-5533. The invention randomly extracts 16 pictures from WIDER Face as a batch (the number of pictures in a batch can be adjusted according to the performance of the machine) and performs data enhancement and preprocessing on these 16 pictures. Data enhancement mainly comprises: fine-tuning the brightness with probability 0.5, the adjustment drawn uniformly from ±32; fine-tuning the contrast with probability 0.5, the factor drawn uniformly between 0.5× and 1.5×; fine-tuning the hue with probability 0.5, the adjustment drawn uniformly from ±18; and fine-tuning the saturation with probability 0.5, the factor drawn uniformly between 0.5× and 1.5×. The image is then preprocessed, mainly by: resizing the enhanced picture to a fixed 512 × 512 size by bilinear interpolation; and subtracting the precomputed RGB means over all pixels of the WIDER Face data set from the three RGB channels of the 512 × 512 picture, so that the data fed into the model has zero mean;
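As a concrete reading of the enhancement and preprocessing above, the following Python sketch (using numpy and OpenCV) applies each photometric perturbation with probability 0.5 in the stated ranges, then resizes and zero-centers. The WIDER Face channel means below are a placeholder that would in practice be precomputed from the data set.

```python
import numpy as np
import cv2

# placeholder: per-channel RGB means precomputed over all WIDER Face pixels
WIDER_MEAN_RGB = np.array([123.0, 117.0, 104.0], dtype=np.float32)

def augment(img_rgb):
    img = img_rgb.astype(np.float32)
    if np.random.rand() < 0.5:                      # brightness: uniform in +/-32
        img += np.random.uniform(-32, 32)
    if np.random.rand() < 0.5:                      # contrast: factor in [0.5, 1.5]
        img *= np.random.uniform(0.5, 1.5)
    img = np.clip(img, 0, 255).astype(np.uint8)
    hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV).astype(np.float32)
    if np.random.rand() < 0.5:                      # hue: uniform in +/-18
        hsv[..., 0] = (hsv[..., 0] + np.random.uniform(-18, 18)) % 180
    if np.random.rand() < 0.5:                      # saturation: factor in [0.5, 1.5]
        hsv[..., 1] = np.clip(hsv[..., 1] * np.random.uniform(0.5, 1.5), 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)

def preprocess(img_rgb):
    img = cv2.resize(img_rgb, (512, 512), interpolation=cv2.INTER_LINEAR)  # bilinear
    return img.astype(np.float32) - WIDER_MEAN_RGB                         # zero mean
```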
2.3, inputting the 16 pictures subjected to data enhancement and preprocessing into the deep neural network model, and obtaining the output results for the batch through the model's computation. The deep neural network model structure comprises the convolutional neural network used by the SSD to extract features, a feature fusion part and a detector part. The feature fusion part is as described in step 1; the feature-extraction convolutional neural network and the detector part follow the model settings of the SSD.
2.4, comparing the output result of the deep neural network model with a label given by a data set, and calculating loss through a loss function;
2.5, updating the parameters of the deep neural network model by using a Stochastic Gradient Descent (SGD) method;
2.6, judging whether the deep neural network model reaches a convergence condition, and returning to the step 2.2 if the deep neural network model does not reach the convergence condition; if so, finishing the training to obtain the trained deep neural network model.
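Steps 2.1 to 2.6 amount to a standard SGD training loop. The sketch below assumes PyTorch; `model`, `multibox_loss`, `wider_face_loader` and `converged` are hypothetical stand-ins for the fused detector, an SSD-style loss, a WIDER Face batch loader with the enhancement above, and the convergence test, none of which the patent specifies in code.

```python
import torch

# step 2.1 corresponds to loading the SSD's VGG pre-trained weights into `model`
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

done = False
while not done:
    for images, targets in wider_face_loader:    # step 2.2: an augmented batch
        outputs = model(images)                  # step 2.3: forward pass
        loss = multibox_loss(outputs, targets)   # step 2.4: compare with labels
        optimizer.zero_grad()
        loss.backward()                          # step 2.5: stochastic gradient descent
        optimizer.step()
        if converged(loss):                      # step 2.6: convergence check
            done = True
            break
```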
Step 3, calculating the picture by using the deep neural network model trained in the step 2 to obtain a model output result:
as shown in fig. 1, the model used in the detection process is the deep neural network model after the training in step 2, and the steps are as follows:
3.1, preprocessing the picture to be detected: as in the preprocessing of step 2.2, the picture to be detected is resized to a fixed 512 × 512 size by bilinear interpolation, and the precomputed RGB channel means over all pixels of the WIDER Face data set are subtracted from the RGB channels of the resized picture;
and 3.2, inputting the preprocessed picture to be detected into the deep neural network model trained in the step 2, and respectively obtaining output results of the picture through calculation of the model.
3.3, as shown in fig. 3, performing statistical non-maximum suppression on the output result of the deep neural network model trained in the step 2, and the steps are as follows:
(a) the detection boxes output by the deep neural network model share a uniform format: each box is given by five numbers x_1, y_1, x_2, y_2 and s, where (x_1, y_1) and (x_2, y_2) are the top-left and bottom-right coordinates of the box, and s is the model's prediction confidence for the box, called its score, taking a value between 0 and 1 (the higher the score, the more confident the model is in the box). Denote the set of all detection boxes output by the model as B. Find the box b_max with the highest score in B, with coordinates ((x_m1, y_m1), (x_m2, y_m2)) and score s_m.
Initialize five accumulators x_sum1 = s_m · x_m1, x_sum2 = s_m · x_m2, y_sum1 = s_m · y_m1, y_sum2 = s_m · y_m2, s_sum = s_m, where x_sum1, x_sum2, y_sum1, y_sum2 hold score-weighted sums of box coordinates and s_sum holds the sum of scores.
(b) Find all boxes in B that frame the same object as b_max, denoted B_same: compute the IOU between b_max and every other box b_i in B, and if the IOU exceeds the threshold θ, consider b_i and b_max to frame the same object and add b_i to B_same. Let the top-left and bottom-right coordinates of b_i be ((x_1, y_1), (x_2, y_2)); then the area I covered by both b_i and b_max is defined as follows:
I = (min(x_m2, x_2) - max(x_m1, x_1)) · (min(y_m2, y_2) - max(y_m1, y_1))
the total area U covered by b_i and b_max is defined as follows:
U = (x_2 - x_1)(y_2 - y_1) + (x_m2 - x_m1)(y_m2 - y_m1) - I
and the overlap IOU of b_i and b_max is defined as follows:
IOU = I / U
The IOU reflects the overlapping proportion of the two boxes, and 0 ≤ IOU ≤ 1. Let θ = 0.3; whenever IOU > θ, update
x_sum1 ← x_sum1 + s_i · x_i1
x_sum2 ← x_sum2 + s_i · x_i2
y_sum1 ← y_sum1 + s_i · y_i1
y_sum2 ← y_sum2 + s_i · y_i2
s_sum ← s_sum + s_i
where s_i and ((x_i1, y_i1), (x_i2, y_i2)) denote the score and coordinates of b_i.
(c) Take the weighted average of the coordinates of all boxes in B_same together with b_max, using each box's score as its weight; the averaged coordinates ((x_mean1, y_mean1), (x_mean2, y_mean2)) define the box b_mean, computed as follows:
x_mean1 = x_sum1 / s_sum
y_mean1 = y_sum1 / s_sum
x_mean2 = x_sum2 / s_sum
y_mean2 = y_sum2 / s_sum
(d) Remove b_max and the boxes in B_same from B, and add b_mean to the output set D:
B ← B \ ({b_max} ∪ B_same), D ← D ∪ {b_mean};
(e) Repeat steps (a), (b), (c) and (d) until B is empty. The boxes collected in D are then the output results of statistical non-maximum suppression.
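For concreteness, the following numpy sketch implements steps (a) through (e) above; the function name is illustrative, and since the patent does not specify the score of the merged box b_mean, this sketch keeps the score of b_max.

```python
import numpy as np

def statistical_nms(boxes, theta=0.3):
    """boxes: (N, 5) array of rows (x1, y1, x2, y2, score)."""
    remaining = list(range(len(boxes)))
    results = []
    while remaining:                                   # (e) repeat until B is empty
        m = max(remaining, key=lambda i: boxes[i, 4])  # (a) b_max = highest score
        group = [m]                                    # becomes {b_max} plus B_same
        for i in remaining:
            if i == m:
                continue
            # (b) I, U and IOU as defined above; max(0, .) guards disjoint boxes
            iw = max(0.0, min(boxes[m, 2], boxes[i, 2]) - max(boxes[m, 0], boxes[i, 0]))
            ih = max(0.0, min(boxes[m, 3], boxes[i, 3]) - max(boxes[m, 1], boxes[i, 1]))
            inter = iw * ih
            union = ((boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
                     + (boxes[m, 2] - boxes[m, 0]) * (boxes[m, 3] - boxes[m, 1]) - inter)
            if inter / union > theta:
                group.append(i)
        # (c) score-weighted average of the group's coordinates gives b_mean
        scores = boxes[group, 4]
        b_mean = (boxes[group, :4] * scores[:, None]).sum(axis=0) / scores.sum()
        results.append(np.append(b_mean, boxes[m, 4]))  # keep b_max's score (a choice)
        # (d) remove b_max and B_same from B
        remaining = [i for i in remaining if i not in group]
    return np.array(results)
```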
Experimental results show that the technical scheme accurately captures and positions faces against complex backgrounds at a detection speed above 15 frames per second. Partial detection results using the FDDB face detection data set as the test set are shown in fig. 4. The FDDB data set comprises 2845 pictures and 5171 faces; with false detections held to no more than 10%, i.e. 500 false detections, the accuracy reaches 95.10%, an improvement of 1.02% over SSD-512. As shown in fig. 5(a) and fig. 5(b), the false detections of the invention are significantly reduced compared with SSD-512. The invention not only achieves a higher detection rate and lower false-detection rate, but also sharpens the precision of the detection boxes, realizing rapid face detection.

Claims (6)

1. A multi-scale rapid face detection method with convolutional neural network feature fusion is characterized by comprising the following steps:
step 1: based on the model structure of the SSD fast target detection method, a feature extraction method in the SSD is modified, a feature fusion method is added, and a modified detection model is obtained;
step 2: training the detection model modified in the step 1 aiming at face detection to obtain a trained deep neural network model;
and step 3: calculating the picture to be detected by using the deep neural network model trained in the step 2 to obtain a model output result;
the step 1 specifically comprises the following steps:
the features of the SSD512 input detector are conv4_3 layer features of VGG, and subsequently added fc7, conv6_2, conv7_2, conv8_2, conv9_2 and conv10_2 layer features, respectively; starting from the conv8_2 layer feature, the feature fusion is carried out by upsampling and conv7_2 layer feature fusion; the fused features are up-sampled again and fused with the conv6_2 layer features; by analogy, fusing to conv4_3, and respectively sequentially fusing to obtain fuse7_2, fuse6_2, fuse7 and fuse4_ 3; and replacing the original conv4_3 layer, fc7 layer, conv6_2 layer and conv7_2 layer characteristics with the fused fuse4_3, fuse7, fuse6_2 and fuse7_2 characteristics, feeding the original conv8_2 layer and conv9_2 layer and conv10_2 layer characteristics which do not participate in fusion into a detector identical to the SSD model, and performing regression and classification on detection frames by the detector consisting of a plurality of convolution layers on the basis of the 7 characteristics to obtain the modified detection model.
2. The convolutional neural network feature fusion multi-scale rapid face detection method according to claim 1, wherein step 1 performs feature fusion on the conv4_3 layer, fc7 layer, conv6_2 layer, conv7_2 layer and conv8_2 layer features, and the two-layer feature fusion step is as follows:
1.1, firstly, the deep feature f_d to be fused is enlarged by nearest-neighbor upsampling to obtain f_d↑; the width and height of f_d↑ then equal those of the shallow feature f_s to be fused;
1.2, f_d↑ and f_s are spliced along the channel dimension into the longer feature f_{d+s};
1.3, one 3 × 3 convolution layer reduces the channel dimension of f_{d+s} to suppress unwanted noise, bringing its c_d + c_s channels down to a uniform 256; a ReLU activation layer then yields the final fused feature f_fuse;
1.4, f_fuse, as the fusion of the deep feature with the shallower one, is fused onward with shallower features until conv4_3, yielding fuse7_2, fuse6_2, fuse_fc7 and fuse4_3 in sequence.
3. The method for multi-scale rapid face detection with convolutional neural network feature fusion according to claim 1, wherein step 2 specifically comprises:
2.1, initializing detection model parameters by adopting parameters of a VGG pre-training model given by SSD;
2.2, adopting the public WIDER Face detection data set; randomly extracting a number of pictures from WIDER Face as a batch, and performing data enhancement and preprocessing on the batch;
2.3, inputting the batch of pictures subjected to data enhancement and preprocessing into the deep neural network model, and obtaining the output results for these pictures through the model's computation; the model structure of the deep neural network model comprises the convolutional neural network used by the SSD to extract features, a feature fusion part and a detector part, wherein the feature fusion part is as described in step 1, and the feature-extraction convolutional neural network and the detector part follow the model settings of the SSD;
2.4, comparing the output result of the deep neural network model with a label given by a data set, and calculating loss through a loss function;
2.5, updating the parameters of the deep neural network model by using a random gradient descent method;
2.6, judging whether the deep neural network model reaches a convergence condition, and returning to the step 2.2 if the deep neural network model does not reach the convergence condition; if so, finishing the training to obtain the trained deep neural network model.
4. The method for multi-scale rapid face detection with convolutional neural network feature fusion according to claim 3, wherein step 2.2 specifically includes: data enhancement is performed as follows: the brightness is fine-tuned with probability 0.5, the adjustment drawn uniformly from ±32; the contrast is fine-tuned with probability 0.5, the factor drawn uniformly between 0.5× and 1.5×; the hue is fine-tuned with probability 0.5, the adjustment drawn uniformly from ±18; the saturation is fine-tuned with probability 0.5, the factor drawn uniformly between 0.5× and 1.5×; after data enhancement, the image is preprocessed as follows: the enhanced picture is resized to a fixed 512 × 512 size by bilinear interpolation; and the RGB means over all pixels of the WIDER Face data set, computed in advance, are subtracted from the three RGB channels of the 512 × 512 picture.
5. The method for multi-scale rapid face detection with convolutional neural network feature fusion according to claim 3, wherein step 3 specifically comprises:
3.1, preprocessing the picture to be detected: as in the preprocessing of step 2.2, the picture to be detected is resized to a fixed 512 × 512 size by bilinear interpolation, and the precomputed RGB channel means over all pixels of the WIDER Face data set are subtracted from the RGB channels of the resized picture;
3.2, inputting the preprocessed picture to be detected into the deep neural network model trained in the step 2, and respectively obtaining output results of the picture through model calculation;
and 3.3, performing statistical non-maximum suppression on the output result of the deep neural network model trained in the step 2 to obtain a model output result.
6. The method for multi-scale rapid face detection with convolutional neural network feature fusion according to claim 5, wherein step 3.3 specifically comprises:
(a) the detection boxes output by the deep neural network model share a uniform format: each box is given by five numbers x_1, y_1, x_2, y_2 and s, where (x_1, y_1) and (x_2, y_2) are the top-left and bottom-right coordinates of the box, and s is the prediction confidence of the deep neural network model for the box, called its score, taking a value between 0 and 1 (the higher the score, the more confident the model is in the box); denote the set of all detection boxes output by the model as B; find the box b_max with the highest score in B, with coordinates ((x_m1, y_m1), (x_m2, y_m2)) and score s_m;
initialize five accumulators x_sum1 = s_m · x_m1, x_sum2 = s_m · x_m2, y_sum1 = s_m · y_m1, y_sum2 = s_m · y_m2, s_sum = s_m, where x_sum1, x_sum2, y_sum1, y_sum2 hold score-weighted sums of box coordinates and s_sum holds the sum of scores;
(b) find all boxes in B that frame the same object as b_max, denoted B_same: compute the IOU between b_max and every other box b_i in B, and if the IOU exceeds the threshold θ, consider b_i and b_max to frame the same object and add b_i to B_same; let the top-left and bottom-right coordinates of b_i be ((x_1, y_1), (x_2, y_2)); then the area I covered by both b_i and b_max is defined as follows:
I = (min(x_m2, x_2) - max(x_m1, x_1)) · (min(y_m2, y_2) - max(y_m1, y_1))
the total area U covered by b_i and b_max is defined as follows:
U = (x_2 - x_1)(y_2 - y_1) + (x_m2 - x_m1)(y_m2 - y_m1) - I
and the overlap IOU of b_i and b_max is defined as follows:
IOU = I / U
The IOU reflects the overlapping proportion of the two boxes, and 0 ≤ IOU ≤ 1. Let θ = 0.3; whenever IOU > θ, update
x_sum1 ← x_sum1 + s_i · x_i1
x_sum2 ← x_sum2 + s_i · x_i2
y_sum1 ← y_sum1 + s_i · y_i1
y_sum2 ← y_sum2 + s_i · y_i2
s_sum ← s_sum + s_i
where s_i and ((x_i1, y_i1), (x_i2, y_i2)) denote the score and coordinates of b_i;
(c) take the weighted average of the coordinates of all boxes in B_same together with b_max, using each box's score as its weight; the averaged coordinates ((x_mean1, y_mean1), (x_mean2, y_mean2)) define the box b_mean, computed as follows:
x_mean1 = x_sum1 / s_sum
y_mean1 = y_sum1 / s_sum
x_mean2 = x_sum2 / s_sum
y_mean2 = y_sum2 / s_sum
(d) remove b_max and the boxes in B_same from B, and add b_mean to the output set D:
B ← B \ ({b_max} ∪ B_same), D ← D ∪ {b_mean};
(e) repeat steps (a), (b), (c) and (d) until B is empty; the boxes collected in D are then the output results of statistical non-maximum suppression.
CN201810276795.7A 2018-03-30 2018-03-30 Multi-scale rapid face detection method based on convolutional neural network feature fusion Active CN108520219B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810276795.7A CN108520219B (en) 2018-03-30 2018-03-30 Multi-scale rapid face detection method based on convolutional neural network feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810276795.7A CN108520219B (en) 2018-03-30 2018-03-30 Multi-scale rapid face detection method based on convolutional neural network feature fusion

Publications (2)

Publication Number Publication Date
CN108520219A CN108520219A (en) 2018-09-11
CN108520219B (en) 2020-05-12

Family

ID=63430934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810276795.7A Active CN108520219B (en) 2018-03-30 2018-03-30 Multi-scale rapid face detection method based on convolutional neural network feature fusion

Country Status (1)

Country Link
CN (1) CN108520219B (en)

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101958A (en) * 2018-11-01 2018-12-28 钟祥博谦信息科技有限公司 Face detection system based on deep learning
CN111435418B (en) * 2018-12-26 2024-01-02 深圳市优必选科技有限公司 Method and device for identifying personalized object of robot, storage medium and robot
CN109919013A (en) * 2019-01-28 2019-06-21 浙江英索人工智能科技有限公司 Method for detecting human face and device in video image based on deep learning
CN109886312B (en) * 2019-01-28 2023-06-06 同济大学 Bridge vehicle wheel detection method based on multilayer feature fusion neural network model
CN109858547A (en) * 2019-01-29 2019-06-07 东南大学 A kind of object detection method and device based on BSSD
CN109886159B (en) * 2019-01-30 2021-03-26 浙江工商大学 Face detection method under non-limited condition
CN109977793B (en) * 2019-03-04 2022-03-04 东南大学 Roadside image pedestrian segmentation method based on variable-scale multi-feature fusion convolutional network
CN109977790A (en) * 2019-03-04 2019-07-05 浙江工业大学 A kind of video smoke detection and recognition methods based on transfer learning
CN110008853B (en) * 2019-03-15 2023-05-30 华南理工大学 Pedestrian detection network and model training method, detection method, medium and equipment
CN109993089B (en) * 2019-03-22 2020-11-24 浙江工商大学 Video target removing and background restoring method based on deep learning
CN111738036B (en) * 2019-03-25 2023-09-29 北京四维图新科技股份有限公司 Image processing method, device, equipment and storage medium
CN110008876A (en) * 2019-03-26 2019-07-12 电子科技大学 A kind of face verification method based on data enhancing and Fusion Features
CN111753581A (en) * 2019-03-27 2020-10-09 虹软科技股份有限公司 Target detection method and device
CN110245675B (en) * 2019-04-03 2023-02-10 复旦大学 Dangerous object detection method based on millimeter wave image human body context information
CN110033505A (en) * 2019-04-16 2019-07-19 西安电子科技大学 A kind of human action capture based on deep learning and virtual animation producing method
CN110189307B (en) * 2019-05-14 2021-11-23 慧影医疗科技(北京)有限公司 Pulmonary nodule detection method and system based on multi-model fusion
CN110210538B (en) * 2019-05-22 2021-10-19 雷恩友力数据科技南京有限公司 Household image multi-target identification method and device
TWI738009B (en) 2019-06-20 2021-09-01 和碩聯合科技股份有限公司 Object detection system and object detection method
CN110427821A (en) * 2019-06-27 2019-11-08 高新兴科技集团股份有限公司 A kind of method for detecting human face and system based on lightweight convolutional neural networks
CN110472634B (en) * 2019-07-03 2023-03-14 中国民航大学 Change detection method based on multi-scale depth feature difference fusion network
CN110473185B (en) * 2019-08-07 2022-03-15 Oppo广东移动通信有限公司 Image processing method and device, electronic equipment and computer readable storage medium
CN110427912A (en) * 2019-08-12 2019-11-08 深圳市捷顺科技实业股份有限公司 A kind of method for detecting human face and its relevant apparatus based on deep learning
CN110495962A (en) * 2019-08-26 2019-11-26 赫比(上海)家用电器产品有限公司 The method and its toothbrush and equipment of monitoring toothbrush position
CN110765886B (en) * 2019-09-29 2022-05-03 深圳大学 Road target detection method and device based on convolutional neural network
CN111191508A (en) * 2019-11-28 2020-05-22 浙江省北大信息技术高等研究院 Face recognition method and device
CN110910415A (en) * 2019-11-28 2020-03-24 重庆中星微人工智能芯片技术有限公司 Parabolic detection method, device, server and computer readable medium
CN111144248B (en) * 2019-12-16 2024-02-27 上海交通大学 People counting method, system and medium based on ST-FHCD network model
CN111232200B (en) * 2020-02-10 2021-07-16 北京建筑大学 Target detection method based on micro aircraft
CN111401290A (en) * 2020-03-24 2020-07-10 杭州博雅鸿图视频技术有限公司 Face detection method and system and computer readable storage medium
CN112464701B (en) * 2020-08-26 2023-06-30 北京交通大学 Method for detecting whether person wears mask or not based on lightweight feature fusion SSD
CN112926506B (en) * 2021-03-24 2022-08-12 重庆邮电大学 Non-controlled face detection method and system based on convolutional neural network
CN115346114A (en) * 2022-07-21 2022-11-15 中铁二院工程集团有限责任公司 Method and equipment for identifying and positioning bad geologic body by railway tunnel aviation electromagnetic method
CN115200784B (en) * 2022-09-16 2022-12-02 福建(泉州)哈工大工程技术研究院 Powder leakage detection method and device based on improved SSD network model and readable medium
CN116851856B (en) * 2023-03-27 2024-05-10 浙江万能弹簧机械有限公司 Pure waterline cutting processing technology and system thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
NL2010887C2 (en) * 2013-05-29 2014-12-02 Univ Delft Tech Memristor.
CN106709568B (en) * 2016-12-16 2019-03-22 北京工业大学 The object detection and semantic segmentation method of RGB-D image based on deep layer convolutional network
CN107705324A (en) * 2017-10-20 2018-02-16 中山大学 A kind of video object detection method based on machine learning

Also Published As

Publication number Publication date
CN108520219A (en) 2018-09-11

Similar Documents

Publication Publication Date Title
CN108520219B (en) Multi-scale rapid face detection method based on convolutional neural network feature fusion
CN110909690B (en) Method for detecting occluded face image based on region generation
CN109657595B (en) Key feature region matching face recognition method based on stacked hourglass network
US20200410212A1 (en) Fast side-face interference resistant face detection method
CN110543846B (en) Multi-pose face image obverse method based on generation countermeasure network
CN112818862B (en) Face tampering detection method and system based on multi-source clues and mixed attention
CN112950661A (en) Method for generating antithetical network human face cartoon based on attention generation
CN111310718A (en) High-accuracy detection and comparison method for face-shielding image
CN1975759A (en) Human face identifying method based on structural principal element analysis
CN104794693B (en) A kind of portrait optimization method of face key area automatic detection masking-out
US20100111375A1 (en) Method for Determining Atributes of Faces in Images
CN107066963B (en) A kind of adaptive people counting method
CN111368758A (en) Face ambiguity detection method and device, computer equipment and storage medium
CN105956570B (en) Smiling face's recognition methods based on lip feature and deep learning
CN114550268A (en) Depth-forged video detection method utilizing space-time characteristics
CN112434647A (en) Human face living body detection method
CN116110100A (en) Face recognition method, device, computer equipment and storage medium
Cai et al. Perception preserving decolorization
CN116453232A (en) Face living body detection method, training method and device of face living body detection model
CN111882525A (en) Image reproduction detection method based on LBP watermark characteristics and fine-grained identification
Booysens et al. Ear biometrics using deep learning: A survey
CN112200008A (en) Face attribute recognition method in community monitoring scene
KR20180092453A (en) Face recognition method Using convolutional neural network and stereo image
CN111191549A (en) Two-stage face anti-counterfeiting detection method
CN113014914B (en) Neural network-based single face-changing short video identification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant