CN107909027B

CN107909027B - Rapid human body target detection method with shielding treatment

Info

Publication number: CN107909027B
Application number: CN201711121852.6A
Authority: CN
Inventors: 周雪; 徐雨亭; 邹见效; 徐红兵
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2017-11-14
Filing date: 2017-11-14
Publication date: 2020-08-11
Anticipated expiration: 2037-11-14
Also published as: CN107909027A

Abstract

The invention discloses a fast human body target detection method with occlusion handling. Building on the existing region-based fully convolutional neural network model for human body target detection, it further improves the fusion of the human body target detection frames obtained from images acquired in real time by adopting a non-maximum suppression algorithm with a put-back sampling strategy. As a result, the detection of human body targets is insensitive to the threshold value, missed detection and repeated detection of human body targets are effectively avoided, and two human body targets that are close to each other and partially occlude one another can be detected well.

Description

Rapid human body target detection method with shielding treatment

Technical Field

The invention belongs to the technical fields of computer vision, pattern recognition and machine learning, and in particular relates to a rapid human body target detection method with occlusion handling, based on a region-based fully convolutional neural network, for monitoring scenes.

Background

In recent years, with advances in science and technology, many industries have begun to pay more attention to security. Monitoring cameras are installed for video surveillance in important areas and public places such as banks, airports, subways, stations and residential communities. These cameras are typically mounted high up and monitor the scene from a top-down view. A monitoring scene refers to the picture captured by such a camera.

Generally, a person is a main body of a monitoring scene, and tracking and subsequent behavior recognition analysis of a human target heavily depend on the precision of human target detection, so how to accurately detect the human target in the monitoring scene has become one of the hot spots of wide attention in academic and industrial fields.

Early researchers generally solved the human target detection problem in two steps: first extracting features with a manually designed model, and then training a detection model by designing a classifier on those features. For example, Dalal N and Triggs B proposed a human target detection method based on histogram of oriented gradients (HOG) features and a support vector machine (SVM); for the specific algorithm, see: Dalal N, Triggs B. Histograms of oriented gradients for human detection [C]. Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. IEEE, 2005, 1: 886-. Shanshan Zhang, Rodrigo Benenson et al. extracted features using histograms of oriented gradients (HOG) and the LUV color space transform, and trained a human target classifier with boosted decision trees; for the specific algorithm, see: Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele. Filtered channel features for pedestrian detection [C]. Computer Vision and Pattern Recognition, 2015. CVPR 2015: 1751-. These methods achieve good results for human target detection in simple monitoring scenes, but their results in complex monitoring scenes still cannot meet practical requirements, and the traditional algorithms are slow, falling far short of real-time detection.

With the rise of deep learning in recent years, methods based on deep learning have achieved excellent performance in image classification, and many researchers have accordingly tried to apply deep learning to object detection. Shaoqing Ren et al. proposed the Faster region-based convolutional neural network (Faster R-CNN) method, which divides the human target detection problem into three stages: first obtaining region candidate frames for human targets, then extracting target features with a convolutional neural network, and finally training a classifier on the target features to obtain a model. Compared with the traditional human target detection methods, the detection accuracy is improved by 57%. For the specific algorithm, see: Ren, Shaoqing, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems. 2015.

Subsequently, Jifeng Dai, Yi Li et al. proposed a detection model based on a region-based fully convolutional network (R-FCN); for the specific algorithm, see Dai J, Li Y, He K, et al. The R-FCN method uses position-sensitive score maps to handle translation variance in image detection, so that the network can perform fully convolutional computation over the whole picture, which effectively reduces both the training time and the detection time of the network model. The model uses a residual network (ResNet) as its feature extractor. Compared with Faster R-CNN, R-FCN both improves the accuracy of target detection and reduces the detection time on the general object detection benchmark Pascal VOC.

Although the R-FCN method achieves good results in both general target detection and human target detection, some problems remain: for example, when human targets occlude one another, two people may be detected as a single person, and when a human target is small, it may not be detected at all. Furthermore, in complex monitoring scenes, such as those with cluttered backgrounds, many human targets and severe occlusion between people, existing human target detection methods still suffer from a certain amount of missed detection and false detection.

In the Chinese invention patent application published on June 20, 2017 with publication number CN106874894A, the applicant proposed "a human target detection method based on regional full convolution neural network". That method improves the rule for generating anchors. In addition, it calculates the loss value of each region candidate frame of a human target image, selects the top B region candidate frames with the largest loss values as hard examples, feeds their loss values back to the region-based fully convolutional neural network model, and updates the parameters of the model by stochastic gradient descent. This improves the accuracy of human target detection in complex scenes and reduces the missed detection rate and the false detection rate.

When the above region-based fully convolutional neural network model is used to detect human targets in an image acquired in a monitoring scene, the detection frames obtained still need to be fused. The existing frame fusion method uses a single threshold and has difficulty handling two human targets that are close to each other and both need to be detected: when the threshold is small, human targets are easily missed, and when the threshold is large, a human target is detected repeatedly.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a rapid human body target detection method with occlusion handling, so as to further reduce the missed detection rate and the false detection rate and improve the detection accuracy.

In order to achieve the above object, the present invention provides a rapid human body target detection method with occlusion handling, comprising the following steps:

(1) training a region-based fully convolutional neural network model for human body target detection, to be used for detecting human body targets in images collected in a monitoring scene;

(2) inputting an image acquired in real time into the region-based fully convolutional neural network model for human body target detection to obtain detection frames and their confidence scores;

(3) fusion of detection frames

3.1) deleting the detection frames with the confidence scores lower than 0.5 for all the detection frames, and then arranging the rest detection frames in a descending order according to the confidence scores and storing the rest detection frames in an ordered queue Q;

3.2) for the detection frames in the ordered queue Q, calculating the overlapping degree and the similarity between the first detection frame and each subsequent detection frame; wherein the similarity is calculated according to the following formula:

metric = e^{-(λ·d_xy + d_w + d_h)}

wherein d_xy, d_w and d_h are the deviations in center position, width and height between the two detection frames; x and y are the coordinates of the center position of the first detection frame, and w and h are its width and height; x′ and y′ are the coordinates of the center position of the subsequent detection frame, and w′ and h′ are its width and height; ‖·‖ denotes the L1 norm; and λ is a weight balance factor determined according to the specific implementation;

3.3) moving the detection frame with the overlapping degree larger than 0.3 in the ordered queue Q to the buffer queue B;

3.4) searching the buffer queue B for the detection frames whose overlapping degree is less than 0.5 and whose similarity exceeds a set threshold T, arranging the detection frames found in descending order of confidence score and putting them back at the tail of the ordered queue Q, and then moving the first detection frame of the ordered queue Q to an output list L;

3.5) if the ordered queue Q is empty, the processing is finished and the detection frames in the output list L are the detected human body targets; otherwise, returning to step 3.2).
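Step 3.1) and the overlapping degree used in steps 3.2) to 3.4) can be illustrated with a minimal sketch. It assumes that the overlapping degree is the intersection-over-union (IoU) of two frames and that each detection is stored as a tuple (cx, cy, w, h, score); the function names `iou` and `build_queue` are hypothetical and do not come from the patent.

```python
from typing import List, Tuple

# Assumed layout of one detection: center x, center y, width, height, confidence score.
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Overlapping degree of two center-format frames, assumed here to be IoU."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def build_queue(dets: List[Box], min_score: float = 0.5) -> List[Box]:
    """Step 3.1): delete frames scoring below min_score, sort the rest in descending order."""
    kept = [d for d in dets if d[4] >= min_score]
    return sorted(kept, key=lambda d: d[4], reverse=True)
```

For example, two 2x2 frames whose centers are one unit apart horizontally share half of their area, which gives an overlapping degree of 1/3 under this definition.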

The object of the invention is thus achieved.

The rapid human body target detection method with occlusion handling of the invention builds on the existing region-based fully convolutional neural network model for human body target detection and further improves the fusion of the human body target detection frames obtained from images acquired in real time. By adopting a non-maximum suppression algorithm with a put-back sampling strategy, the detection of human body targets becomes insensitive to the threshold value, missed detection and repeated detection of human body targets are effectively avoided, and two human body targets that are close to each other and partially occlude one another can be detected well.

Drawings

FIG. 1 is a schematic block diagram of a specific embodiment of training and detection in the fast human target detection method with occlusion processing according to the present invention;

FIG. 2 compares the final detection frame fusion results of the non-maximum suppression algorithm with a put-back sampling strategy of the present invention and the conventional non-maximum suppression method;

FIG. 3 is a schematic diagram of the specific process of the non-maximum suppression algorithm with a put-back sampling strategy shown in FIG. 1;

FIG. 4 compares the P-R curves of the present invention, R-FCN and Faster R-CNN on different datasets;

FIG. 5 shows the detection results of the present invention, R-FCN and Faster R-CNN in actual scenes, wherein the first row shows the detection result images obtained with the Faster R-CNN method, the second row shows the corresponding detection results obtained with the R-FCN method, and the third row shows the corresponding detection results obtained with the method of the present invention.

Detailed Description

Specific embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the present invention. It is to be expressly noted that, in the following description, detailed descriptions of known functions and designs are omitted where they would obscure the subject matter of the present invention.

FIG. 1 is a schematic block diagram of a specific embodiment of training and detection in the fast human target detection method with occlusion processing according to the present invention.

In this embodiment, as shown in fig. 1, the method for detecting a human target with occlusion processing according to the present invention includes the following steps:

1. Training of the region-based fully convolutional neural network model for human body target detection

1.1) human body target calibration

In images of a monitoring scene, when there are many human targets, the lower half of the body is easily occluded. In human target detection, a whole-body image of the human target is generally used as its feature representation. In a monitoring scene, however, the lower body of a human target is easily occluded, so the detection frames of two human targets overlap heavily, and a network trained with such calibration data may have difficulty separating the two targets. To reduce the probability that a human target is occluded in the monitoring scene, the invention uses images calibrated with a human head-and-shoulder model as the feature representation of the human target. Detection frames (calibration frames during training) calibrated on the head-and-shoulder region overlap less within the same monitoring scene, and features calibrated on this region are robust to some extent to changes in human posture and viewing angle. Learning the features of the head-and-shoulder region through the network therefore addresses the occlusion problem in monitoring scenes and reduces missed detection and false detection of human targets to a certain extent.

In this example, two datasets, Caltech and Bronze, were used for training and testing. The public Caltech pedestrian dataset of the California Institute of Technology is currently a large-scale pedestrian dataset. 25482 of its pictures were selected and re-calibrated by human head-and-shoulder region; 17799 samples were selected as the training set and the remaining 7629 samples as the validation set. Bronze is our self-created dataset, which contains images of monitoring scenes taken from a top-down view, including many complex scenes with severe occlusion and dense crowds. For each human target image, the position of the head-and-shoulder region of each human target is calibrated as the calibration frame of that target. This dataset is likewise divided into separate training and validation sets at a ratio of 5:1.

1.2) generating region candidate frame

In this embodiment, the method of Faster R-CNN is used when the RPN (region proposal network) generates candidate frames. After the convolutional features are obtained from the residual network, frames with several different scales and aspect ratios are generated at each sliding-window position. Unlike the rule for generating anchors in general object detection, this embodiment uses two aspect ratios, {0.8, 1.2}, and five scales, {48, 96, 144, 192, 240}, as the rule for generating anchors.
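The anchor rule can be illustrated with a short sketch. The patent gives only the two ratio values and the five scale values; the sketch assumes the common Faster R-CNN convention in which a scale is the square root of the anchor area and an aspect ratio is height divided by width. The function name `make_anchors` is hypothetical.

```python
import math
from typing import List, Tuple


def make_anchors(
    cx: float,
    cy: float,
    scales: Tuple[int, ...] = (48, 96, 144, 192, 240),
    ratios: Tuple[float, ...] = (0.8, 1.2),
) -> List[Tuple[float, float, float, float]]:
    """Return (x1, y1, x2, y2) anchors centered at one sliding-window position.

    Assumption: area = scale**2 and ratio = height / width.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

Under these assumptions each sliding-window position yields 5 x 2 = 10 anchors, all close to square, which suits head-and-shoulder regions better than the elongated anchors used for whole-body pedestrians.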

1.3) calculating the position-sensitive score map

In this embodiment, following the method proposed in R-FCN, a position-sensitive score map is calculated by a set of convolution filters from the convolutional features and the region candidate frames, and position-sensitive region pooling is then used to obtain the confidence score and the frame regression values of each candidate frame. This yields the probability S_i that the region candidate frame is a positive candidate frame and the probability S_j that it is a negative candidate frame. Meanwhile, the true class probability S of the region candidate frame is obtained from the human target calibration frames: when the intersection-over-union of the region candidate frame and a real human target calibration frame is greater than or equal to 0.5, the region candidate frame is judged to be a positive candidate frame sample and the true class probability S is 1; when the intersection-over-union is less than 0.5, the region candidate frame is judged to be a negative candidate frame sample and the true class probability S is 0.
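The assignment of the true class probability S can be sketched as follows. The sketch assumes corner-format frames (x1, y1, x2, y2) and takes, for each candidate frame, the largest intersection-over-union over all calibration frames; the names `iou_xyxy` and `label_candidates` are hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)


def iou_xyxy(a: Frame, b: Frame) -> float:
    """Intersection-over-union of two corner-format frames."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_candidates(
    candidates: Sequence[Frame], gt_frames: Sequence[Frame], pos_iou: float = 0.5
) -> List[int]:
    """Return S for each candidate: 1 if its best IoU is >= pos_iou, else 0."""
    labels = []
    for c in candidates:
        best = max((iou_xyxy(c, g) for g in gt_frames), default=0.0)
        labels.append(1 if best >= pos_iou else 0)
    return labels
```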

1.4) calculating the classification loss value and the regression loss value of the region candidate frame

In this embodiment, the cross-entropy loss value of the region candidate frame is used as the classification loss value L_cls of the region candidate frame. The specific calculation formula is as follows:

The first-order smooth loss value of the region candidate frame is used as the regression loss value L_reg of the region candidate frame. The specific calculation formula is as follows:

L_reg = smooth_L1(x^* - x) + smooth_L1(y^* - y) + smooth_L1(w^* - w) + smooth_L1(h^* - h)   (2)

wherein x and y represent the upper-left position coordinates of the region candidate frame, w and h represent the width and height of the region candidate frame, x^* and y^* represent the upper-left position coordinates of the real human target calibration frame, and w^* and h^* represent the width and height of the real human target calibration frame;

wherein the first-order smoothing function smooth_L1 is calculated as follows:

wherein σ is determined according to the specific monitoring scene and is generally 3.0, and z is the difference in parentheses in formula (2).
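The two losses can be sketched in code. Formula (2) is taken from the text; the forms of the cross-entropy and of smooth_L1 are not reproduced in the text, so the sketch assumes the usual binary cross-entropy over (S_i, S_j, S) and the σ-parameterized smooth L1 function found in common Faster R-CNN implementations. All function names are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(s_pos: float, s_neg: float, s_true: int, eps: float = 1e-12) -> float:
    """Assumed form of L_cls: -[S*log(S_i) + (1 - S)*log(S_j)]."""
    return -(s_true * math.log(s_pos + eps) + (1 - s_true) * math.log(s_neg + eps))


def smooth_l1(z: float, sigma: float = 3.0) -> float:
    """Assumed form: quadratic for |z| < 1/sigma^2, linear otherwise."""
    s2 = sigma * sigma
    if abs(z) < 1.0 / s2:
        return 0.5 * s2 * z * z
    return abs(z) - 0.5 / s2


def regression_loss(
    pred: Sequence[float], target: Sequence[float], sigma: float = 3.0
) -> float:
    """Formula (2): sum of smooth_L1 over (x, y, w, h) differences, target minus prediction."""
    return sum(smooth_l1(t - p, sigma) for p, t in zip(pred, target))
```

With σ = 3.0 the two branches of this assumed smooth_L1 meet at |z| = 1/9, so the loss is continuous and behaves like an L1 loss for all but very small errors.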

1.5) online hard example mining

For each region candidate frame, its loss value is calculated by the following formula:

wherein λ is a balance factor between the classification loss and the regression loss, which is determined according to the specific implementation and is usually set to 1.

To improve the detection of human targets in complex monitoring scenes, the online hard example mining algorithm in R-FCN is adopted: the loss value of each region candidate frame is obtained according to formula (4), the region candidate frames are sorted by loss value, and the half of the region candidate frames with the largest loss values is selected as hard examples. The loss values of these hard examples are then fed back to the region-based fully convolutional neural network model, and the parameters of the model are updated by stochastic gradient descent.
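The selection of hard examples can be sketched as follows. Since formula (4) is not reproduced in the text, the sketch assumes the per-frame loss is L_cls + λ·L_reg; the function names `total_loss` and `select_hard_examples` are hypothetical.

```python
from typing import List, Sequence


def total_loss(l_cls: float, l_reg: float, lam: float = 1.0) -> float:
    """Assumed form of formula (4): classification loss plus lam times regression loss."""
    return l_cls + lam * l_reg


def select_hard_examples(losses: Sequence[float], keep_fraction: float = 0.5) -> List[int]:
    """Return indices of the candidate frames with the largest losses.

    keep_fraction = 0.5 corresponds to keeping the top half, as in the text.
    """
    if not losses:
        return []
    n_keep = max(1, int(len(losses) * keep_fraction))
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:n_keep]
```

Only the losses of the selected indices would then contribute to the gradient used in the stochastic gradient descent update.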

For each human body target image, the parameters of the region-based fully convolutional neural network are updated continuously according to steps 1.2) to 1.5), thereby obtaining a region-based fully convolutional neural network model for human body target detection, which is used for detecting human body targets in images collected in monitoring scenes.

2. Testing (detection) using the non-maximum suppression algorithm with a put-back sampling strategy

The invention provides a detection frame fusion method, called the non-maximum suppression algorithm with a put-back sampling strategy. Research shows that existing detection frame fusion methods have difficulty handling two targets that are close to each other and both need to be detected. The traditional non-maximum suppression method uses a single threshold, and missed detection easily occurs when the threshold is small.

FIG. 2 compares the final detection frame fusion results of the non-maximum suppression algorithm with a put-back sampling strategy and the conventional non-maximum suppression method.

In this embodiment, a small threshold causes missed detection, as shown in the first column of FIG. 2, and a larger threshold causes repeated detection, as shown in the second column of FIG. 2. The invention determines whether two detection frames belong to the same target by defining a detection frame similarity function. The detection frame similarity metric is defined as follows:

metric = e^{-(λ·d_xy + d_w + d_h)}

wherein x and y are the coordinates of the center position of the first detection frame, namely the detection frame with the maximum confidence score, and w and h are its width and height; x′ and y′ are the coordinates of the center position of the subsequent detection frame, and w′ and h′ are its width and height; ‖·‖ denotes the L1 norm; and the weight λ is set to 1. The final similarity distance is a weighted sum of the three deviations (d_xy, d_w, d_h).
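The similarity metric can be sketched in code. The exponential form and the weight λ come from the text, but the exact definitions of d_xy, d_w and d_h are not reproduced there. The sketch therefore assumes that d_xy is the L1 norm of the center offset normalized by the size of the first frame, and that d_w and d_h are the absolute width and height differences normalized in the same way. The function name `similarity` is hypothetical.

```python
import math
from typing import Tuple

Frame = Tuple[float, float, float, float]  # assumed (cx, cy, w, h)


def similarity(first: Frame, other: Frame, lam: float = 1.0) -> float:
    """metric = exp(-(lam * d_xy + d_w + d_h)), with assumed deviation terms."""
    x, y, w, h = first
    x2, y2, w2, h2 = other
    # Assumption: deviations are normalized by the size of the first frame.
    d_xy = abs(x - x2) / w + abs(y - y2) / h
    d_w = abs(w - w2) / w
    d_h = abs(h - h2) / h
    return math.exp(-(lam * d_xy + d_w + d_h))
```

Under these assumptions the metric equals 1 for identical frames and decays toward 0 as the frames drift apart or differ in size, so it can be compared directly against a threshold T in (0, 1).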

FIG. 3 is a process diagram of the non-maximum suppression algorithm with a put-back sampling strategy according to the present invention.

In the invention, the specific steps of the non-maximum suppression algorithm with a put-back sampling strategy are as follows:

1) For all the detection frames, delete the detection frames with a confidence score C lower than 0.5, then arrange the remaining detection frames in descending order of confidence score C and store them in an ordered queue Q. In FIG. 3, there are n detection frames in the queue; C_1, C_2, C_3, …, C_{n-1}, C_n denote the confidence scores of the detection frames, and F_1, F_2, F_3, …, F_{n-1}, F_n denote the positions of the detection frames, each including the center position coordinates, width and height.

2) For the detection frames in the ordered queue Q, calculate the overlapping degrees r_2, r_3, …, r_{n-1}, r_n and the similarities m_2, m_3, …, m_{n-1}, m_n between the first detection frame and each subsequent detection frame.

3) Move the detection frames in the ordered queue Q whose overlapping degree r is greater than 0.3 into the buffer queue B, such as the i-th, j-th and k-th detection frames in FIG. 3.

4) Search the buffer queue B for the detection frames whose overlapping degree r is less than 0.5 and whose similarity m exceeds a set threshold T, such as the i-th and k-th detection frames; arrange the detection frames found in descending order of confidence score and put them back at the tail of the ordered queue Q; then move the first detection frame of the ordered queue Q to an output list L.

5) If the ordered queue Q is empty, the processing is finished and the detection frames in the output list L are the detected human body targets; otherwise, return to step 2).

The improved non-maximum suppression algorithm provided by the invention is insensitive to the threshold T, and the final fusion result is shown in the third column of FIG. 2.
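Steps 1) to 5) can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the overlapping degree is taken to be intersection-over-union, the deviation terms of the similarity metric are assumed as described in the comments, frames left in buffer queue B are discarded after each pass, and the default value of the threshold T (`sim_thresh`) is hypothetical.

```python
import math
from typing import List, Tuple

Det = Tuple[float, float, float, float, float]  # assumed (cx, cy, w, h, score)


def iou(a: Det, b: Det) -> float:
    """Overlapping degree, assumed to be IoU of center-format frames."""
    iw = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    ih = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def similarity(a: Det, b: Det, lam: float = 1.0) -> float:
    """exp(-(lam*d_xy + d_w + d_h)); the normalized deviations are an assumption."""
    d_xy = abs(a[0] - b[0]) / a[2] + abs(a[1] - b[1]) / a[3]
    d_w = abs(a[2] - b[2]) / a[2]
    d_h = abs(a[3] - b[3]) / a[3]
    return math.exp(-(lam * d_xy + d_w + d_h))


def nms_with_put_back(
    dets: List[Det],
    sim_thresh: float = 0.5,  # threshold T; this default is hypothetical
    min_score: float = 0.5,
    move_iou: float = 0.3,
    back_iou: float = 0.5,
    lam: float = 1.0,
) -> List[Det]:
    # Step 1): drop low-confidence frames and sort by score, descending.
    queue = sorted((d for d in dets if d[4] >= min_score), key=lambda d: d[4], reverse=True)
    output: List[Det] = []
    while queue:  # Step 5): repeat until Q is empty.
        first, rest = queue[0], queue[1:]
        stay, buffer = [], []
        for d in rest:
            # Step 2): overlapping degree and similarity against the first frame.
            r, m = iou(first, d), similarity(first, d, lam)
            # Step 3): frames overlapping the first frame by more than 0.3 go to B.
            if r > move_iou:
                buffer.append((d, r, m))
            else:
                stay.append(d)
        # Step 4): put moderately overlapping, similar frames back at the tail of Q.
        back = [d for d, r, m in buffer if r < back_iou and m > sim_thresh]
        back.sort(key=lambda d: d[4], reverse=True)
        queue = stay + back
        output.append(first)
    return output
```

As a usage example, a frame at (50, 50) with score 0.9, a near duplicate at (52, 50) with score 0.8 and an equally sized neighbor at (70, 50) with score 0.7, all 40x40, yield two outputs: the duplicate overlaps the first frame by more than 0.5 and is discarded, while the neighbor overlaps by 1/3 and is put back and later output as a separate target.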

To verify the effectiveness of the invention, the model is first trained and tested (detection) on the newly calibrated datasets, and the detection results are then compared on human target images acquired in complex monitoring scenes. In this example, a deep learning framework commonly used in the field of image processing is used for training and testing, and a ResNet-50 residual network model trained on the ImageNet image dataset is used as the pre-trained model. As for the other parameters of the network model, the learning rate is set to 0.001 and is reduced by a factor of 10 after 20,000 iterations, with 60,000 iterations in total. The momentum is set to 0.9 and the weight decay is set to 0.0005. On an NVIDIA GTX 1080 GPU, the fine-tuned network model tests at 86 milliseconds per picture, which approaches real-time detection speed.
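The training settings can be summarized in a small sketch. The numeric values are those stated above; the text does not say whether the learning rate is reduced once or at every 20,000 iterations, so the sketch assumes a step schedule that divides the rate by 10 every 20,000 iterations. The names `SOLVER` and `learning_rate` are hypothetical.

```python
# Hyperparameters stated in the text, gathered in one place.
SOLVER = {
    "base_lr": 0.001,
    "step_size": 20000,  # assumption: the reduction repeats every 20,000 iterations
    "gamma": 0.1,
    "max_iter": 60000,
    "momentum": 0.9,
    "weight_decay": 0.0005,
}


def learning_rate(iteration: int, solver: dict = SOLVER) -> float:
    """Step schedule: base_lr * gamma ** floor(iteration / step_size)."""
    return solver["base_lr"] * solver["gamma"] ** (iteration // solver["step_size"])
```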

In this embodiment, the precision-recall (P-R) curve, which is widely used in human target detection, serves as the criterion for judging the algorithms. A P-R curve is drawn from the pairs of precision and recall obtained when different confidence thresholds are applied to the detected prediction windows. When comparing different algorithms, the recall is usually fixed and the corresponding precision of each algorithm is examined; the higher the precision, the better the algorithm detects the target. To also characterize the detection performance numerically, this embodiment uses the average precision as the quantitative evaluation criterion, which is generally calculated as the area enclosed by the P-R curve and the recall axis.
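The average precision can be sketched as the area under the P-R curve. The sketch assumes that each detection has already been marked as a true or false positive by matching against the calibration frames, and it uses a simple rectangular sum without the interpolation applied by some benchmarks. The function name `average_precision` is hypothetical.

```python
from typing import Sequence


def average_precision(
    scores: Sequence[float], is_true_positive: Sequence[bool], num_gt: int
) -> float:
    """Area between the P-R curve and the recall axis (no interpolation).

    num_gt is the number of calibrated human targets and must be positive.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for i in order:  # lower the confidence threshold one detection at a time
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recall = tp / num_gt
        precision = tp / (tp + fp)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

For three detections with scores 0.9, 0.8 and 0.7, of which the first and third are correct, and two calibrated targets, the sum is 0.5 x 1 + 0.5 x 2/3 = 5/6.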

In this embodiment, a residual network ResNet-50 model is used for training, images collected in actual monitoring scenes are selected, and the human target detection results are compared with those of network models fine-tuned using the conventional Faster R-CNN and R-FCN methods. The P-R curves obtained from the comparative experiment are shown in FIG. 4. It can be seen that, on the two different datasets, the present invention offers a small improvement in human target detection over the Faster R-CNN and R-FCN methods. FIG. 5 compares the detection results of the Faster R-CNN method, the R-FCN method and the method of the present invention on individual frames in actual detection: the first row shows the detection result images obtained with the Faster R-CNN method, the second row shows the corresponding detection results obtained with the R-FCN method, and the third row shows the corresponding detection results obtained with the method of the present invention. The method detects human targets well under occlusion and has fewer missed detections in complex monitoring scenes.

Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the present invention, it should be understood that the present invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are apparent as long as they fall within the spirit and scope of the present invention as defined by the appended claims, and all inventions that make use of the inventive concept are protected.

Claims

1. A rapid human body target detection method with shielding processing is characterized by comprising the following steps:

(3) fusion of detection frames

3.1) deleting the detection frames with the confidence scores lower than 0.5 for all the detection frames, and then arranging the rest detection frames in a descending order according to the scores and storing the rest detection frames in an ordered queue Q;

metric＝e^{-(λ*dxy+dw+dh)}

wherein: