CN108520219B - Multi-scale rapid face detection method based on convolutional neural network feature fusion


Info

Publication number
CN108520219B
CN108520219B
Authority
CN
China
Prior art keywords
neural network
detection
layer
model
deep neural
Prior art date
Legal status
Active
Application number
CN201810276795.7A
Other languages
Chinese (zh)
Other versions
CN108520219A (en)
Inventor
钱学明
韩振
张宇奇
邹屹洋
侯兴松
Current Assignee
Taizhou Zhibi'an Technology Co ltd
Original Assignee
Taizhou Zhibi'an Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Taizhou Zhibi'an Technology Co ltd
Priority to CN201810276795.7A
Publication of CN108520219A
Application granted
Publication of CN108520219B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale rapid face detection method with convolutional neural network feature fusion, which comprises the following steps. Step 1: based on the model structure of the SSD fast target detection method, modify the feature extraction method in the SSD and add a feature fusion method, obtaining a modified detection model. Step 2: train the detection model modified in step 1 for face detection, obtaining a trained deep neural network model. Step 3: run the deep neural network model trained in step 2 on the picture to be detected, obtaining the model output result. The invention can quickly identify faces in an image and position them accurately, separating the faces from a complex background and providing a basis for identity verification and tracking of the persons in the image.

Description

Multi-scale rapid face detection method based on convolutional neural network feature fusion
Technical Field
The invention belongs to the technical field of computer digital image processing and pattern recognition, and particularly relates to a face detection method.
Background
With the growing number of surveillance cameras in China, massive amounts of surveillance video are generated every day. In this context, computer-aided analysis of surveillance video content has become necessary. In surveillance video the monitored objects are mainly people, and facial features are the most important information for identity recognition and verification from image data. Face detection locates all faces in a picture and separates them from the background, which is the prerequisite for face representation and face recognition. Face detection is therefore the first step in surveillance video content analysis.
Current target detection methods fall mainly into traditional methods and deep learning-based methods. The basic flow is the same for both: first extract features from the image, then classify each part of the image as foreground or background on the feature map. The features used in traditional target detection are mostly hand-designed from human experience, such as Haar, HOG and LBP; being based on human experience, such features cannot fully characterize an image. In addition, traditional methods mainly adopt classifiers such as SVM and AdaBoost, whose classification accuracy on images is inferior to classification with a convolutional neural network. Deep learning-based detection trains a convolutional neural network (CNN) to extract deep features and trains a classifier on top of those features. Such methods can be roughly divided into two types: detection based on candidate regions, and detection by direct regression. A representative candidate-region method is Faster R-CNN, proposed by Ross Girshick's team; see: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015: 91-99. Its basic idea is to extract a coarse feature map from the image through a CNN, propose candidate boxes that may contain objects from that feature map, then crop the corresponding parts of the feature map and send them to a final classifier and detection-box regressor. This method has high detection accuracy, but because it has two stages of judging whether an object is present, detection is very slow and cannot run in real time. A representative direct-regression method is the SSD target detection algorithm; see: W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single Shot MultiBox Detector. ECCV, pages 21-37, 2016. Its basic idea is to abandon the two-stage judgment of Faster R-CNN and perform classification and regression directly on the feature maps extracted by the CNN. This single-stage detection is fast and can detect simultaneously on features of different depths in the CNN. Although its detection speed is high, its accuracy is worse: the missed-detection and false-detection rates are high and the detection boxes are not accurate enough.
Because the face detection method here processes video, the requirement on single-frame detection speed is higher. The Faster R-CNN method is too slow to detect in real time. The SSD method is fast, but its detection accuracy indicators, such as the missed-detection and false-detection rates, are relatively poor, and the positioning of its detection boxes is not accurate enough.
Disclosure of Invention
The invention aims to provide a multi-scale rapid face detection method with deep feature fusion that locates the positions of all faces in a picture, overcoming the high missed-detection rate, high false-detection rate and poor box-positioning accuracy of the SSD fast target detection method.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-scale rapid face detection method with convolutional neural network feature fusion comprises the following steps:
step 1: based on the model structure of the SSD fast target detection method, a feature extraction method in the SSD is modified, a feature fusion method is added, and a modified detection model is obtained;
step 2: training the detection model modified in the step 1 aiming at face detection to obtain a trained deep neural network model;
and step 3: and (3) calculating the picture to be detected by using the deep neural network model trained in the step (2) to obtain a model output result.
Further, step 1 specifically includes:
the features of the SSD512 input detector are conv4_3 layer features of VGG, and subsequently added fc7, conv6_2, conv7_2, conv8_2, conv9_2 and conv10_2 layer features, respectively; starting from the conv8_2 layer feature, the feature fusion is carried out by upsampling and conv7_2 layer feature fusion; the fused features are up-sampled again and fused with the conv6_2 layer features; by analogy, fusing to conv4_3, and respectively sequentially fusing to obtain fuse7_2, fuse6_2, fuse7 and fuse4_ 3; and replacing the original conv4_3 layer, fc7 layer, conv6_2 layer and conv7_2 layer characteristics with the fused fuse4_3, fuse fc7, fuse6_2 and fuse7_2 characteristics, feeding the original conv8_2 layer and conv9_2 layer and conv10_2 layer characteristics which do not participate in fusion into the detector identical to the SSD model, and performing regression and classification on detection frames by the detector consisting of a plurality of convolution layers on the basis of the 7 characteristics to obtain the modified detection model.
Further, step 1 performs feature fusion on the features of the conv4_3 layer, fc7 layer, conv6_2 layer, conv7_2 layer and conv8_2 layer, and the two-layer feature fusion step is as follows:
1.1, firstly, the deep feature f_d to be fused is enlarged by nearest-neighbor upsampling to obtain f_d↑; the width and height of f_d↑ then equal those of the shallow feature f_s to be fused;
1.2, f_d↑ and f_s are spliced along the channel dimension into the longer feature f_{d+s};
1.3, one 3 × 3 convolution layer reduces the channel dimension of f_{d+s} to suppress unwanted noise, bringing its c_d + c_s channels down to a uniform 256; a ReLU activation layer then yields the final fused feature f_fuse;
1.4, f_fuse, as the fusion of the deep feature with the shallower one, is fused onward with shallower features until conv4_3, yielding fuse7_2, fuse6_2, fuse_fc7 and fuse4_3 in sequence.
Further, step 2 specifically includes:
2.1, initializing detection model parameters by adopting parameters of a VGG pre-training model given by SSD;
2.2, adopting the public WIDER Face detection data set; randomly extracting a number of pictures from WIDER Face as a batch, and performing data enhancement and preprocessing on the batch;
2.3, inputting the batch of pictures subjected to data enhancement and preprocessing into the deep neural network model, and obtaining the output results for the batch through the model's computation; the deep neural network model structure comprises the convolutional neural network used by the SSD to extract features, a feature fusion part and a detector part, wherein the feature fusion part is as described in step 1, and the feature-extraction convolutional neural network and the detector part follow the model settings of the SSD;
2.4, comparing the output result of the deep neural network model with a label given by a data set, and calculating loss through a loss function;
2.5, updating the parameters of the deep neural network model by using a random gradient descent method;
2.6, judging whether the deep neural network model reaches a convergence condition, and returning to the step 2.2 if the deep neural network model does not reach the convergence condition; if so, finishing the training to obtain the trained deep neural network model.
Further, step 2.2 specifically includes: data enhancement is performed as follows: the brightness is fine-tuned with probability 0.5, the adjustment drawn uniformly from ±32; the contrast is fine-tuned with probability 0.5, the factor drawn uniformly between 0.5× and 1.5×; the hue is fine-tuned with probability 0.5, the adjustment drawn uniformly from ±18; the saturation is fine-tuned with probability 0.5, the factor drawn uniformly between 0.5× and 1.5×; after data enhancement, the image is preprocessed as follows: the enhanced picture is resized to a fixed 512 × 512 size by bilinear interpolation; and the RGB means over all pixels of the WIDER Face data set, computed in advance, are subtracted from the three RGB channels of the 512 × 512 picture.
Further, step 3 specifically includes:
3.1, preprocessing the picture to be detected: as in the preprocessing of step 2.2, the picture to be detected is resized to a fixed 512 × 512 size by bilinear interpolation, and the precomputed RGB channel means over all pixels of the WIDER Face data set are subtracted from the RGB channels of the resized picture;
3.2, inputting the preprocessed picture to be detected into the deep neural network model trained in the step 2, and respectively obtaining output results of the picture through model calculation;
and 3.3, performing statistical non-maximum suppression on the output result of the deep neural network model trained in the step 2 to obtain a model output result.
Further, step 3.3 specifically includes:
(a) the detection boxes output by the deep neural network model share a uniform format: each box is given by five numbers x_1, y_1, x_2, y_2 and s, where (x_1, y_1) and (x_2, y_2) are the top-left and bottom-right coordinates of the box, and s is the model's prediction confidence for the box, called its score, taking a value between 0 and 1 (the higher the score, the more confident the model is in the box); denote the set of all detection boxes output by the model as B; find the box b_max with the highest score in B, with coordinates ((x_m1, y_m1), (x_m2, y_m2)) and score s_m;
initialize five accumulators x_sum1 = s_m · x_m1, x_sum2 = s_m · x_m2, y_sum1 = s_m · y_m1, y_sum2 = s_m · y_m2, s_sum = s_m, where x_sum1, x_sum2, y_sum1, y_sum2 hold score-weighted sums of box coordinates and s_sum holds the sum of scores;
(b) find all boxes in B that frame the same object as b_max, denoted B_same: compute the IOU between b_max and every other box b_i in B, and if the IOU exceeds the threshold θ, consider b_i and b_max to frame the same object and add b_i to B_same; let the top-left and bottom-right coordinates of b_i be ((x_1, y_1), (x_2, y_2)); then the area I covered by both b_i and b_max is defined as:
I = (min(x_m2, x_2) - max(x_m1, x_1)) · (min(y_m2, y_2) - max(y_m1, y_1))
the total area U covered by b_i and b_max is defined as:
U = (x_2 - x_1)(y_2 - y_1) + (x_m2 - x_m1)(y_m2 - y_m1) - I
and the overlap IOU of b_i and b_max is defined as:
IOU = I / U
The IOU reflects the overlapping proportion of the two boxes, and 0 ≤ IOU ≤ 1. Let θ = 0.3; whenever IOU > θ, update
x_sum1 ← x_sum1 + s_i · x_i1
x_sum2 ← x_sum2 + s_i · x_i2
y_sum1 ← y_sum1 + s_i · y_i1
y_sum2 ← y_sum2 + s_i · y_i2
s_sum ← s_sum + s_i
where s_i and ((x_i1, y_i1), (x_i2, y_i2)) denote the score and coordinates of b_i;
(c) take the weighted average of the coordinates of all boxes in B_same together with b_max, using each box's score as its weight; the averaged coordinates ((x_mean1, y_mean1), (x_mean2, y_mean2)) define the box b_mean, computed as follows:
x_mean1 = x_sum1 / s_sum
y_mean1 = y_sum1 / s_sum
x_mean2 = x_sum2 / s_sum
y_mean2 = y_sum2 / s_sum
(d) remove b_max and the boxes in B_same from B, and add b_mean to the output set D:
B ← B \ ({b_max} ∪ B_same), D ← D ∪ {b_mean};
(e) repeat steps (a), (b), (c) and (d) until B is empty; the boxes collected in D are then the output results of statistical non-maximum suppression.
In order to ensure faster detection speed, the invention modifies the whole network structure based on the SSD fast target detection method, and trains the modified network aiming at face detection;
In order to reduce the missed detections and false detections of the SSD, the invention adds a feature fusion method to the SSD network structure: detection requires features that contain both local features for positioning and semantic features for classification. In general, deep CNN features are rich in semantics but lack local positioning information, while shallow features are rich in local detail but, limited by network depth, lack semantics. To reduce the missed and false detections caused by classification errors, the invention fuses features between different depths of the convolutional neural network to obtain more holistic and comprehensive features, which aids the classifier's judgment, reducing missed and false detections, and aids the regressor's positioning, improving localization accuracy;
in order to further improve the positioning accuracy of the detection frame, the statistical non-maximum suppression method is used in the merging process of the final result: and a statistic-based non-maximum suppression algorithm is used for screening the generated final result to replace the traditional non-maximum suppression so as to eliminate the contingency caused by the traditional non-maximum suppression algorithm and improve the positioning accuracy of the detection frame.
The statistical non-maximum suppression of the invention differs from conventional non-maximum suppression in step 3.3: conventional non-maximum suppression directly discards the detection boxes with IOU > θ, whereas the invention takes the weighted average of their coordinates together with the highest-scoring box. Statistical information thus replaces the information of an individual box, eliminating the chance event that the highest-scoring box deviates from the object.
Compared with the prior art, the invention has the following beneficial effects: training the SSD on a face detection data set achieves rapid target detection; fusing convolutional neural network features of different depths strengthens the expressive power of the features, improving classification and positioning and, to a degree, overcoming the high missed-detection and false-detection rates and poor box positioning of current fast detection algorithms; and using statistical non-maximum suppression when merging the final results eliminates chance effects and further improves the positioning accuracy of the detection boxes.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of a neural network model of the present invention, including a deep convolutional network feature model, a deep feature fusion branch;
FIG. 2 is a neural network training and detection flow diagram;
FIG. 3 is a flow chart of a statistical non-maximum suppression algorithm;
FIGS. 4(a) and 4(b) are partial face detection result graphs on FDDB data sets;
fig. 5(a) and fig. 5(b) compare partial face detection results of SSD-512 and the present invention on a smart-city data set sampled at a frame-extraction interval of 10 frames.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
As shown in fig. 1, the invention relates to a multi-scale rapid face detection method with convolutional neural network feature fusion, which comprises the following steps:
Step 1, taking the SSD (Single Shot MultiBox Detector) target detection algorithm as the basic framework; see: W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single Shot MultiBox Detector. ECCV, pages 21-37, 2016.
As shown in fig. 1, SSD is a multi-scale target detection method: the detector performs classification and detection-box regression separately on CNN features of different depths, i.e. features of different scales. The features fed to the SSD512 detector are the conv4_3 layer features of VGG and the subsequently added fc7, conv6_2, conv7_2, conv8_2, conv9_2 and conv10_2 layer features; the detailed model structure is given in the literature and is not repeated here. The invention finds that the conv9_2 and conv10_2 layer features are too coarse to help the classification of small objects at shallow layers. Feature fusion therefore starts from the conv8_2 layer features, which are upsampled and fused with the conv7_2 layer features; the fused features are upsampled again and fused with the conv6_2 layer features; and so on, until conv4_3. The fused fuse4_3, fuse_fc7, fuse6_2 and fuse7_2 features replace the original conv4_3, fc7, conv6_2 and conv7_2 layer features and are fed into the detector together with the original conv8_2, conv9_2 and conv10_2 layer features that do not participate in fusion.
The invention performs feature fusion on the conv4_3, fc7, conv6_2, conv7_2 and conv8_2 layer features. The two-layer feature fusion procedure is as follows:
1.1, firstly, the deep feature f_d to be fused is enlarged by nearest-neighbor upsampling to obtain f_d↑; the width and height of f_d↑ then equal those of the shallow feature f_s to be fused. For example, when the input is a 3 × 512 × 512 RGB color image, the conv8_2 layer feature of the SSD feature-extraction network has size 256 × 4 × 4; upsampling the conv8_2 layer features by nearest-neighbor interpolation gives features of size 256 × 8 × 8, matching the height and width of the conv7_2 layer;
1.2, f_d↑ and f_s are spliced along the channel dimension into the longer feature f_{d+s}. For example, the 256 × 8 × 8 features obtained by upsampling the conv8_2 layer are spliced in the channel dimension with the conv7_2 layer features of the same 256 × 8 × 8 size, giving features of size 512 × 8 × 8;
1.3, one 3 × 3 convolution layer reduces the channel dimension of f_{d+s} to suppress unwanted noise, bringing its c_d + c_s channels down to a uniform 256; a ReLU activation layer then yields the final fused feature f_fuse. For example, the features spliced from the upsampled conv8_2 and conv7_2 are fed into a combination of a convolution layer and a ReLU activation layer for dimensionality reduction, reducing the 512 × 8 × 8 features to 256 × 8 × 8;
1.4, f_fuse serves as the fusion of the deep feature with the shallower one. For example, the feature obtained from upsampling conv8_2, splicing with conv7_2 and reducing dimensions is in turn upsampled and fused with the conv6_2 layer features, and so on until conv4_3;
the characteristics of the final feed detector are the characteristics of the conv8_2 layer and the conv9_2 layer and the conv10_2 layer which are not fused, and the characteristics of fuse4_2, fuse4_3, fuse fc7, fuse6_2 and fuse7_2 which are fused respectively. Here, using the same detector as the SSD model, a detector consisting of multiple convolutional layers would perform regression and classification of the detection frame on the basis of these 7 features.
Step 2, training the detection model reconstructed in the step 1 for face detection:
as shown in fig. 1, a wire Face data set is used in the training process, and the steps are as follows:
2.1, initializing detection model parameters by adopting parameters of a VGG pre-training model given by SSD;
2.2, the data set is the public WIDER Face detection data set; see: Shuo Yang, Ping Luo, Chen Change Loy, Xiaoou Tang. WIDER FACE: A Face Detection Benchmark. CVPR 2016: 5525-5533. The invention randomly extracts 16 pictures from WIDER Face as a batch (the number of pictures in a batch can be adjusted according to the performance of the machine) and performs data enhancement and preprocessing on these 16 pictures. Data enhancement mainly comprises: fine-tuning the brightness with probability 0.5, the adjustment drawn uniformly from ±32; fine-tuning the contrast with probability 0.5, the factor drawn uniformly between 0.5× and 1.5×; fine-tuning the hue with probability 0.5, the adjustment drawn uniformly from ±18; and fine-tuning the saturation with probability 0.5, the factor drawn uniformly between 0.5× and 1.5×. The image is then preprocessed, mainly by: resizing the enhanced picture to a fixed 512 × 512 size by bilinear interpolation; and subtracting the precomputed RGB means over all pixels of the WIDER Face data set from the three RGB channels of the 512 × 512 picture, so that the data fed into the model has zero mean;
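As a concrete reading of the enhancement and preprocessing above, the following Python sketch (using numpy and OpenCV) applies each photometric perturbation with probability 0.5 in the stated ranges, then resizes and zero-centers. The WIDER Face channel means below are a placeholder that would in practice be precomputed from the data set.

```python
import numpy as np
import cv2

# placeholder: per-channel RGB means precomputed over all WIDER Face pixels
WIDER_MEAN_RGB = np.array([123.0, 117.0, 104.0], dtype=np.float32)

def augment(img_rgb):
    img = img_rgb.astype(np.float32)
    if np.random.rand() < 0.5:                      # brightness: uniform in +/-32
        img += np.random.uniform(-32, 32)
    if np.random.rand() < 0.5:                      # contrast: factor in [0.5, 1.5]
        img *= np.random.uniform(0.5, 1.5)
    img = np.clip(img, 0, 255).astype(np.uint8)
    hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV).astype(np.float32)
    if np.random.rand() < 0.5:                      # hue: uniform in +/-18
        hsv[..., 0] = (hsv[..., 0] + np.random.uniform(-18, 18)) % 180
    if np.random.rand() < 0.5:                      # saturation: factor in [0.5, 1.5]
        hsv[..., 1] = np.clip(hsv[..., 1] * np.random.uniform(0.5, 1.5), 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)

def preprocess(img_rgb):
    img = cv2.resize(img_rgb, (512, 512), interpolation=cv2.INTER_LINEAR)  # bilinear
    return img.astype(np.float32) - WIDER_MEAN_RGB                         # zero mean
```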
2.3, inputting the 16 pictures subjected to data enhancement and preprocessing into the deep neural network model, and obtaining the output results for the batch through the model's computation. The deep neural network model structure comprises the convolutional neural network used by the SSD to extract features, a feature fusion part and a detector part. The feature fusion part is as described in step 1; the feature-extraction convolutional neural network and the detector part follow the model settings of the SSD.
2.4, comparing the output result of the deep neural network model with a label given by a data set, and calculating loss through a loss function;
2.5, updating the parameters of the deep neural network model by using a Stochastic Gradient Descent (SGD) method;
2.6, judging whether the deep neural network model reaches a convergence condition, and returning to the step 2.2 if the deep neural network model does not reach the convergence condition; if so, finishing the training to obtain the trained deep neural network model.
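Steps 2.1 to 2.6 amount to a standard SGD training loop. The sketch below assumes PyTorch; `model`, `multibox_loss`, `wider_face_loader` and `converged` are hypothetical stand-ins for the fused detector, an SSD-style loss, a WIDER Face batch loader with the enhancement above, and the convergence test, none of which the patent specifies in code.

```python
import torch

# step 2.1 corresponds to loading the SSD's VGG pre-trained weights into `model`
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

done = False
while not done:
    for images, targets in wider_face_loader:    # step 2.2: an augmented batch
        outputs = model(images)                  # step 2.3: forward pass
        loss = multibox_loss(outputs, targets)   # step 2.4: compare with labels
        optimizer.zero_grad()
        loss.backward()                          # step 2.5: stochastic gradient descent
        optimizer.step()
        if converged(loss):                      # step 2.6: convergence check
            done = True
            break
```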
Step 3, calculating the picture by using the deep neural network model trained in the step 2 to obtain a model output result:
as shown in fig. 1, the model used in the detection process is the deep neural network model after the training in step 2, and the steps are as follows:
3.1, preprocessing the picture to be detected: as in the preprocessing of step 2.2, the picture to be detected is resized to a fixed 512 × 512 size by bilinear interpolation, and the precomputed RGB channel means over all pixels of the WIDER Face data set are subtracted from the RGB channels of the resized picture;
and 3.2, inputting the preprocessed picture to be detected into the deep neural network model trained in the step 2, and respectively obtaining output results of the picture through calculation of the model.
3.3, as shown in fig. 3, performing statistical non-maximum suppression on the output result of the deep neural network model trained in the step 2, and the steps are as follows:
(a) the detection boxes output by the deep neural network model share a uniform format: each box is given by five numbers x_1, y_1, x_2, y_2 and s, where (x_1, y_1) and (x_2, y_2) are the top-left and bottom-right coordinates of the box, and s is the model's prediction confidence for the box, called its score, taking a value between 0 and 1 (the higher the score, the more confident the model is in the box). Denote the set of all detection boxes output by the model as B. Find the box b_max with the highest score in B, with coordinates ((x_m1, y_m1), (x_m2, y_m2)) and score s_m.
Initialize five accumulators x_sum1 = s_m · x_m1, x_sum2 = s_m · x_m2, y_sum1 = s_m · y_m1, y_sum2 = s_m · y_m2, s_sum = s_m, where x_sum1, x_sum2, y_sum1, y_sum2 hold score-weighted sums of box coordinates and s_sum holds the sum of scores.
(b) Find all boxes in B that frame the same object as b_max, denoted B_same: compute the IOU between b_max and every other box b_i in B, and if the IOU exceeds the threshold θ, consider b_i and b_max to frame the same object and add b_i to B_same. Let the top-left and bottom-right coordinates of b_i be ((x_1, y_1), (x_2, y_2)); then the area I covered by both b_i and b_max is defined as follows:
I = (min(x_m2, x_2) - max(x_m1, x_1)) · (min(y_m2, y_2) - max(y_m1, y_1))
the total area U covered by b_i and b_max is defined as follows:
U = (x_2 - x_1)(y_2 - y_1) + (x_m2 - x_m1)(y_m2 - y_m1) - I
and the overlap IOU of b_i and b_max is defined as follows:
IOU = I / U
The IOU reflects the overlapping proportion of the two boxes, and 0 ≤ IOU ≤ 1. Let θ = 0.3; whenever IOU > θ, update
x_sum1 ← x_sum1 + s_i · x_i1
x_sum2 ← x_sum2 + s_i · x_i2
y_sum1 ← y_sum1 + s_i · y_i1
y_sum2 ← y_sum2 + s_i · y_i2
s_sum ← s_sum + s_i
where s_i and ((x_i1, y_i1), (x_i2, y_i2)) denote the score and coordinates of b_i.
(c) Take the weighted average of the coordinates of all boxes in B_same together with b_max, using each box's score as its weight; the averaged coordinates ((x_mean1, y_mean1), (x_mean2, y_mean2)) define the box b_mean, computed as follows:
x_mean1 = x_sum1 / s_sum
y_mean1 = y_sum1 / s_sum
x_mean2 = x_sum2 / s_sum
y_mean2 = y_sum2 / s_sum
(d) Remove b_max and the boxes in B_same from B, and add b_mean to the output set D:
B ← B \ ({b_max} ∪ B_same), D ← D ∪ {b_mean};
(e) Repeat steps (a), (b), (c) and (d) until B is empty. The boxes collected in D are then the output results of statistical non-maximum suppression.
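For concreteness, the following numpy sketch implements steps (a) through (e) above; the function name is illustrative, and since the patent does not specify the score of the merged box b_mean, this sketch keeps the score of b_max.

```python
import numpy as np

def statistical_nms(boxes, theta=0.3):
    """boxes: (N, 5) array of rows (x1, y1, x2, y2, score)."""
    remaining = list(range(len(boxes)))
    results = []
    while remaining:                                   # (e) repeat until B is empty
        m = max(remaining, key=lambda i: boxes[i, 4])  # (a) b_max = highest score
        group = [m]                                    # becomes {b_max} plus B_same
        for i in remaining:
            if i == m:
                continue
            # (b) I, U and IOU as defined above; max(0, .) guards disjoint boxes
            iw = max(0.0, min(boxes[m, 2], boxes[i, 2]) - max(boxes[m, 0], boxes[i, 0]))
            ih = max(0.0, min(boxes[m, 3], boxes[i, 3]) - max(boxes[m, 1], boxes[i, 1]))
            inter = iw * ih
            union = ((boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
                     + (boxes[m, 2] - boxes[m, 0]) * (boxes[m, 3] - boxes[m, 1]) - inter)
            if inter / union > theta:
                group.append(i)
        # (c) score-weighted average of the group's coordinates gives b_mean
        scores = boxes[group, 4]
        b_mean = (boxes[group, :4] * scores[:, None]).sum(axis=0) / scores.sum()
        results.append(np.append(b_mean, boxes[m, 4]))  # keep b_max's score (a choice)
        # (d) remove b_max and B_same from B
        remaining = [i for i in remaining if i not in group]
    return np.array(results)
```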
Experimental results show that the technical scheme accurately captures and positions faces against complex backgrounds at a detection speed above 15 frames per second. Partial detection results using the FDDB face detection data set as the test set are shown in fig. 4. The FDDB data set comprises 2845 pictures and 5171 faces; with false detections held to no more than 10%, i.e. 500 false detections, the accuracy reaches 95.10%, an improvement of 1.02% over SSD-512. As shown in fig. 5(a) and fig. 5(b), the false detections of the invention are significantly reduced compared with SSD-512. The invention not only achieves a higher detection rate and lower false-detection rate, but also sharpens the precision of the detection boxes, realizing rapid face detection.

Claims (6)

1. A multi-scale rapid face detection method with convolutional neural network feature fusion is characterized by comprising the following steps:
step 1: based on the model structure of the SSD fast target detection method, a feature extraction method in the SSD is modified, a feature fusion method is added, and a modified detection model is obtained;
step 2: training the detection model modified in the step 1 aiming at face detection to obtain a trained deep neural network model;
and step 3: calculating the picture to be detected by using the deep neural network model trained in the step 2 to obtain a model output result;
the step 1 specifically comprises the following steps:
the features of the SSD512 input detector are conv4_3 layer features of VGG, and subsequently added fc7, conv6_2, conv7_2, conv8_2, conv9_2 and conv10_2 layer features, respectively; starting from the conv8_2 layer feature, the feature fusion is carried out by upsampling and conv7_2 layer feature fusion; the fused features are up-sampled again and fused with the conv6_2 layer features; by analogy, fusing to conv4_3, and respectively sequentially fusing to obtain fuse7_2, fuse6_2, fuse7 and fuse4_ 3; and replacing the original conv4_3 layer, fc7 layer, conv6_2 layer and conv7_2 layer characteristics with the fused fuse4_3, fuse7, fuse6_2 and fuse7_2 characteristics, feeding the original conv8_2 layer and conv9_2 layer and conv10_2 layer characteristics which do not participate in fusion into a detector identical to the SSD model, and performing regression and classification on detection frames by the detector consisting of a plurality of convolution layers on the basis of the 7 characteristics to obtain the modified detection model.
2. The convolutional neural network feature fusion multi-scale rapid face detection method according to claim 1, wherein step 1 performs feature fusion on the conv4_3 layer, fc7 layer, conv6_2 layer, conv7_2 layer and conv8_2 layer features, and the two-layer feature fusion step is as follows:
1.1, firstly, the deep feature f_d to be fused is enlarged by nearest-neighbor upsampling to obtain f_d↑; the width and height of f_d↑ then equal those of the shallow feature f_s to be fused;
1.2, f_d↑ and f_s are spliced along the channel dimension into the longer feature f_{d+s};
1.3, one 3 × 3 convolution layer reduces the channel dimension of f_{d+s} to suppress unwanted noise, bringing its c_d + c_s channels down to a uniform 256; a ReLU activation layer then yields the final fused feature f_fuse;
1.4, f_fuse, as the fusion of the deep feature with the shallower one, is fused onward with shallower features until conv4_3, yielding fuse7_2, fuse6_2, fuse_fc7 and fuse4_3 in sequence.
3. The method for multi-scale rapid face detection with convolutional neural network feature fusion according to claim 1, wherein step 2 specifically comprises:
2.1, initializing detection model parameters by adopting parameters of a VGG pre-training model given by SSD;
2.2, adopting the public WIDER Face detection data set; randomly extracting a number of pictures from WIDER Face as a batch, and performing data enhancement and preprocessing on the batch;
2.3, inputting the batch of pictures subjected to data enhancement and preprocessing into the deep neural network model, and obtaining the output results for these pictures through the model's computation; the model structure of the deep neural network model comprises the convolutional neural network used by the SSD to extract features, a feature fusion part and a detector part, wherein the feature fusion part is as described in step 1, and the feature-extraction convolutional neural network and the detector part follow the model settings of the SSD;
2.4, comparing the output result of the deep neural network model with a label given by a data set, and calculating loss through a loss function;
2.5, updating the parameters of the deep neural network model by using a random gradient descent method;
2.6, judging whether the deep neural network model reaches a convergence condition, and returning to the step 2.2 if the deep neural network model does not reach the convergence condition; if so, finishing the training to obtain the trained deep neural network model.
4. The method for multi-scale rapid face detection with convolutional neural network feature fusion according to claim 3, wherein step 2.2 specifically includes: data enhancement is performed as follows: the brightness is fine-tuned with probability 0.5, the adjustment drawn uniformly from ±32; the contrast is fine-tuned with probability 0.5, the factor drawn uniformly between 0.5× and 1.5×; the hue is fine-tuned with probability 0.5, the adjustment drawn uniformly from ±18; the saturation is fine-tuned with probability 0.5, the factor drawn uniformly between 0.5× and 1.5×; after data enhancement, the image is preprocessed as follows: the enhanced picture is resized to a fixed 512 × 512 size by bilinear interpolation; and the RGB means over all pixels of the WIDER Face data set, computed in advance, are subtracted from the three RGB channels of the 512 × 512 picture.
5. The method for multi-scale rapid face detection with convolutional neural network feature fusion according to claim 3, wherein step 3 specifically comprises:
3.1, preprocessing the picture to be detected: as in the preprocessing of step 2.2, the picture to be detected is resized to a fixed 512 × 512 size by bilinear interpolation, and the precomputed RGB channel means over all pixels of the WIDER Face data set are subtracted from the RGB channels of the resized picture;
3.2, inputting the preprocessed picture to be detected into the deep neural network model trained in the step 2, and respectively obtaining output results of the picture through model calculation;
and 3.3, performing statistical non-maximum suppression on the output result of the deep neural network model trained in the step 2 to obtain a model output result.
6. The method for multi-scale rapid face detection with convolutional neural network feature fusion according to claim 5, wherein step 3.3 specifically comprises:
(a) the detection boxes output by the deep neural network model share a uniform format: each box is given by five numbers x_1, y_1, x_2, y_2 and s, where (x_1, y_1) and (x_2, y_2) are the top-left and bottom-right coordinates of the box, and s is the prediction confidence of the deep neural network model for the box, called its score, taking a value between 0 and 1 (the higher the score, the more confident the model is in the box); denote the set of all detection boxes output by the model as B; find the box b_max with the highest score in B, with coordinates ((x_m1, y_m1), (x_m2, y_m2)) and score s_m;
initialize five accumulators x_sum1 = s_m · x_m1, x_sum2 = s_m · x_m2, y_sum1 = s_m · y_m1, y_sum2 = s_m · y_m2, s_sum = s_m, where x_sum1, x_sum2, y_sum1, y_sum2 hold score-weighted sums of box coordinates and s_sum holds the sum of scores;
(b) find all boxes in B that frame the same object as b_max, denoted B_same: compute the IOU between b_max and every other box b_i in B, and if the IOU exceeds the threshold θ, consider b_i and b_max to frame the same object and add b_i to B_same; let the top-left and bottom-right coordinates of b_i be ((x_1, y_1), (x_2, y_2)); then the area I covered by both b_i and b_max is defined as follows:
I = (min(x_m2, x_2) - max(x_m1, x_1)) · (min(y_m2, y_2) - max(y_m1, y_1))
the total area U covered by b_i and b_max is defined as follows:
U = (x_2 - x_1)(y_2 - y_1) + (x_m2 - x_m1)(y_m2 - y_m1) - I
and the overlap IOU of b_i and b_max is defined as follows:
IOU = I / U
The IOU reflects the overlapping proportion of the two boxes, and 0 ≤ IOU ≤ 1. Let θ = 0.3; whenever IOU > θ, update
x_sum1 ← x_sum1 + s_i · x_i1
x_sum2 ← x_sum2 + s_i · x_i2
y_sum1 ← y_sum1 + s_i · y_i1
y_sum2 ← y_sum2 + s_i · y_i2
s_sum ← s_sum + s_i
where s_i and ((x_i1, y_i1), (x_i2, y_i2)) denote the score and coordinates of b_i;
(c) take the weighted average of the coordinates of all boxes in B_same together with b_max, using each box's score as its weight; the averaged coordinates ((x_mean1, y_mean1), (x_mean2, y_mean2)) define the box b_mean, computed as follows:
x_mean1 = x_sum1 / s_sum
y_mean1 = y_sum1 / s_sum
x_mean2 = x_sum2 / s_sum
y_mean2 = y_sum2 / s_sum
(d) remove b_max and the boxes in B_same from B, and add b_mean to the output set D:
B ← B \ ({b_max} ∪ B_same), D ← D ∪ {b_mean};
(e) repeat steps (a), (b), (c) and (d) until B is empty; the boxes collected in D are then the output results of statistical non-maximum suppression.
CN201810276795.7A 2018-03-30 2018-03-30 Multi-scale rapid face detection method based on convolutional neural network feature fusion Active CN108520219B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810276795.7A CN108520219B (en) 2018-03-30 2018-03-30 Multi-scale rapid face detection method based on convolutional neural network feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810276795.7A CN108520219B (en) 2018-03-30 2018-03-30 Multi-scale rapid face detection method based on convolutional neural network feature fusion

Publications (2)

Publication Number Publication Date
CN108520219A CN108520219A (en) 2018-09-11
CN108520219B (en) 2020-05-12

Family

ID=63430934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810276795.7A Active CN108520219B (en) 2018-03-30 2018-03-30 Multi-scale rapid face detection method based on convolutional neural network feature fusion

Country Status (1)

Country Link
CN (1) CN108520219B (en)

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101958A (en) * 2018-11-01 2018-12-28 钟祥博谦信息科技有限公司 Face detection system based on deep learning
CN111435418B (en) * 2018-12-26 2024-01-02 深圳市优必选科技有限公司 Method and device for identifying personalized object of robot, storage medium and robot
CN109919013A (en) * 2019-01-28 2019-06-21 浙江英索人工智能科技有限公司 Method for detecting human face and device in video image based on deep learning
CN109886312B (en) * 2019-01-28 2023-06-06 同济大学 Bridge vehicle wheel detection method based on multilayer feature fusion neural network model
CN109858547A (en) * 2019-01-29 2019-06-07 东南大学 A kind of object detection method and device based on BSSD
CN109886159B (en) * 2019-01-30 2021-03-26 浙江工商大学 Face detection method under non-limited condition
CN109977793B (en) * 2019-03-04 2022-03-04 东南大学 Roadside image pedestrian segmentation method based on variable-scale multi-feature fusion convolutional network
CN109977790A (en) * 2019-03-04 2019-07-05 浙江工业大学 A kind of video smoke detection and recognition methods based on transfer learning
CN110008853B (en) * 2019-03-15 2023-05-30 华南理工大学 Pedestrian detection network and model training method, detection method, medium and equipment
CN109993089B (en) * 2019-03-22 2020-11-24 浙江工商大学 Video target removing and background restoring method based on deep learning
CN111738036B (en) * 2019-03-25 2023-09-29 北京四维图新科技股份有限公司 Image processing method, device, equipment and storage medium
CN110008876A (en) * 2019-03-26 2019-07-12 电子科技大学 A kind of face verification method based on data enhancing and Fusion Features
CN111753581A (en) * 2019-03-27 2020-10-09 虹软科技股份有限公司 Target detection method and device
CN110245675B (en) * 2019-04-03 2023-02-10 复旦大学 Dangerous object detection method based on millimeter wave image human body context information
CN110033505A (en) * 2019-04-16 2019-07-19 西安电子科技大学 A kind of human action capture based on deep learning and virtual animation producing method
CN110189307B (en) * 2019-05-14 2021-11-23 慧影医疗科技(北京)有限公司 Pulmonary nodule detection method and system based on multi-model fusion
CN110210538B (en) * 2019-05-22 2021-10-19 雷恩友力数据科技南京有限公司 Household image multi-target identification method and device
TWI738009B (en) 2019-06-20 2021-09-01 和碩聯合科技股份有限公司 Object detection system and object detection method
CN110427821A (en) * 2019-06-27 2019-11-08 高新兴科技集团股份有限公司 A kind of method for detecting human face and system based on lightweight convolutional neural networks
CN110472634B (en) * 2019-07-03 2023-03-14 中国民航大学 Change detection method based on multi-scale depth feature difference fusion network
CN110473185B (en) * 2019-08-07 2022-03-15 Oppo广东移动通信有限公司 Image processing method and device, electronic equipment and computer readable storage medium
CN110427912A (en) * 2019-08-12 2019-11-08 深圳市捷顺科技实业股份有限公司 A kind of method for detecting human face and its relevant apparatus based on deep learning
CN110495962A (en) * 2019-08-26 2019-11-26 赫比(上海)家用电器产品有限公司 The method and its toothbrush and equipment of monitoring toothbrush position
CN110765886B (en) * 2019-09-29 2022-05-03 深圳大学 Road target detection method and device based on convolutional neural network
CN111191508A (en) * 2019-11-28 2020-05-22 浙江省北大信息技术高等研究院 Face recognition method and device
CN110910415A (en) * 2019-11-28 2020-03-24 重庆中星微人工智能芯片技术有限公司 Parabolic detection method, device, server and computer readable medium
CN111144248B (en) * 2019-12-16 2024-02-27 上海交通大学 People counting method, system and medium based on ST-FHCD network model
CN111232200B (en) * 2020-02-10 2021-07-16 北京建筑大学 Target detection method based on micro aircraft
CN111401290A (en) * 2020-03-24 2020-07-10 杭州博雅鸿图视频技术有限公司 Face detection method and system and computer readable storage medium
CN112464701B (en) * 2020-08-26 2023-06-30 北京交通大学 Method for detecting whether person wears mask or not based on lightweight feature fusion SSD
CN112926506B (en) * 2021-03-24 2022-08-12 重庆邮电大学 Non-controlled face detection method and system based on convolutional neural network
CN115346114A (en) * 2022-07-21 2022-11-15 中铁二院工程集团有限责任公司 Method and equipment for identifying and positioning bad geologic body by railway tunnel aviation electromagnetic method
CN115200784B (en) * 2022-09-16 2022-12-02 福建(泉州)哈工大工程技术研究院 Powder leakage detection method and device based on improved SSD network model and readable medium
CN116851856B (en) * 2023-03-27 2024-05-10 浙江万能弹簧机械有限公司 Pure waterline cutting processing technology and system thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
NL2010887C2 (en) * 2013-05-29 2014-12-02 Univ Delft Tech Memristor.
CN106709568B (en) * 2016-12-16 2019-03-22 北京工业大学 The object detection and semantic segmentation method of RGB-D image based on deep layer convolutional network
CN107705324A (en) * 2017-10-20 2018-02-16 中山大学 A kind of video object detection method based on machine learning

Also Published As

Publication number Publication date
CN108520219A (en) 2018-09-11

Similar Documents

Publication Publication Date Title
CN108520219B (en) Multi-scale rapid face detection method based on convolutional neural network feature fusion
CN110909690B (en) Method for detecting occluded face image based on region generation
CN109657595B (en) Key feature region matching face recognition method based on stacked hourglass network
US20200410212A1 (en) Fast side-face interference resistant face detection method
CN110543846B (en) Multi-pose face image obverse method based on generation countermeasure network
CN112818862B (en) Face tampering detection method and system based on multi-source clues and mixed attention
CN112950661A (en) Method for generating antithetical network human face cartoon based on attention generation
CN111310718A (en) High-accuracy detection and comparison method for face-shielding image
CN1975759A (en) Human face identifying method based on structural principal element analysis
CN104794693B (en) A kind of portrait optimization method of face key area automatic detection masking-out
US20100111375A1 (en) Method for Determining Atributes of Faces in Images
CN107066963B (en) A kind of adaptive people counting method
CN111368758A (en) Face ambiguity detection method and device, computer equipment and storage medium
CN105956570B (en) Smiling face's recognition methods based on lip feature and deep learning
CN114550268A (en) Depth-forged video detection method utilizing space-time characteristics
CN112434647A (en) Human face living body detection method
CN116110100A (en) Face recognition method, device, computer equipment and storage medium
Cai et al. Perception preserving decolorization
CN116453232A (en) Face living body detection method, training method and device of face living body detection model
CN111882525A (en) Image reproduction detection method based on LBP watermark characteristics and fine-grained identification
Booysens et al. Ear biometrics using deep learning: A survey
CN112200008A (en) Face attribute recognition method in community monitoring scene
KR20180092453A (en) Face recognition method Using convolutional neural network and stereo image
CN111191549A (en) Two-stage face anti-counterfeiting detection method
CN113014914B (en) Neural network-based single face-changing short video identification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant