CN110245675B - Dangerous object detection method based on millimeter wave image human body context information - Google Patents

Dangerous object detection method based on millimeter wave image human body context information

Info

Publication number
CN110245675B
Authority
CN
China
Prior art keywords
candidate
candidate frame
millimeter wave
human body
context information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910264671.1A
Other languages
Chinese (zh)
Other versions
CN110245675A (en)
Inventor
张铂
王斌
吴晓峰
张立明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201910264671.1A priority Critical patent/CN110245675B/en
Publication of CN110245675A publication Critical patent/CN110245675A/en
Application granted granted Critical
Publication of CN110245675B publication Critical patent/CN110245675B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of image processing, and particularly relates to a dangerous object detection method based on human body context information in millimeter wave images. First, a convolutional neural network down-samples the input millimeter wave image; a top-down structure then recovers human body context information in the high-level feature space, and the features of objects carried on the human body obtained in the down-sampling stage are fused with the human body context information obtained by the top-down structure to jointly predict foreground targets. In addition, to address the problem that initialized candidate boxes cannot be effectively matched with the Ground Truth, the invention uses an auxiliary supervision function to perform coordinate regression on the initialized candidate boxes, which improves the detection rate of the model on the standard test set and in practical inspection scenarios. The invention can automatically identify dangerous objects in millimeter wave images in real time and greatly improves the efficiency of security inspection.

Description

Dangerous object detection method based on millimeter wave image human body context information
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a method for detecting dangerous objects carried by a human body.
Background
With the escalation of terrorist activities, security systems in airports, customs, and crowded areas are constantly being upgraded. A traditional security system mainly consists of a metal detector for the human body and an X-ray system for the objects the person carries. This person/baggage-separated inspection scheme is widely used in security scenarios and has effectively improved the protection of important places such as airports, customs and crowded areas. However, it still has shortcomings. On one hand, metal detectors can only find metallic contraband such as guns and knives; non-metallic dangerous objects such as explosives, dangerous liquids and ceramic knives are difficult to detect [1]. On the other hand, inspecting the person and the carried objects requires different imaging principles, and in airports the inspection of the person takes longer than the inspection of the carried objects, so the throughput of the whole person/baggage-separated system is limited by its slower subsystem.
The millimeter wave imaging system [2] can effectively solve the above problems. Millimeter waves can penetrate clothing and insulating covers, and the radiation is non-ionizing and therefore harmless to the human body. Close-range millimeter wave imaging systems are divided by working mode into passive millimeter wave imaging (PMMW) and active millimeter wave imaging (AMMW). The former uses a millimeter wave radiometer to collect the thermal radiation or scattering distribution of the measured target and then forms an image. The latter illuminates the measured object with millimeter wave signals of certain power in different bands and uses a receiver to collect the returned signal in order to reconstruct the spatial scattering intensity of the object [3]. Of the two modes, the AMMW system can achieve real-time imaging and its image quality is better than that of the passive system.
Automatic contraband detection algorithms based on active millimeter wave imaging systems have been studied extensively in recent years. In 2017, the U.S. Department of Homeland Security published a competition named "Passenger Screening Algorithm Challenge" on Kaggle; its data set consists of human body images acquired with AMMW imaging equipment, and the challenge requires computer vision techniques to automatically identify body images carrying contraband. Document [4] treats contraband identification as an image segmentation and classification task: the human body is divided into 17 regions with segmentation, potential dangerous objects are identified in each region with a classifier, and good results are obtained in the Kaggle challenge. Apart from this approach, [5] and [6] treat contraband recognition as a foreground object detection task and, given the data set of [3], regress the position and confidence of dangerous objects on the human body. [5] proposes a probability accumulation map that effectively obtains the position information of foreground objects. [6] extracts features with a convolutional neural network and locates potential dangerous objects on the human body with a two-stage object detector.
These millimeter wave contraband detection algorithms extract features with convolutional neural networks and have achieved a certain performance breakthrough. However, none of them considers human body context information, which is an important prior when millimeter wave imaging is applied in the security field. In all millimeter wave images related to security inspection the scene is fixed, the contextual relationships in the image are therefore also fixed, and the human body context is correlated with the distribution of contraband. Predicting only from the appearance features of the contraband in the training set is therefore unreasonable and can lead to missed detections, false alarms, and so on.
Some important concepts related to the present invention are presented below:
1. two-stage target detection algorithm
A two-stage object detection algorithm is one in which the extraction of candidate boxes (anchors) and the prediction of their relative position offsets and class probabilities are completed in different stages, controlled by different cost functions. The input of the second stage is the candidate boxes extracted in the first stage. The second stage corrects the positions of the candidate boxes generated in the first stage and judges the object class inside each candidate box.
The concept of candidate boxes is illustrated in Fig. 1: the dotted bounding boxes are the real labels (Ground Truth), and the bounding boxes in the remaining colors are candidate boxes generated by the detection algorithm before detection starts; the candidate boxes consist of bounding boxes with different scales and different aspect ratios. Refer to step 2.1 for the specific way candidate boxes are generated.
The detector selects positive and negative samples from the candidate boxes according to formula (1):

$$
\text{label}(P)=\begin{cases}\text{positive}, & \text{IoU}(P,G)\ge\theta_1\\ \text{negative}, & \text{IoU}(P,G)<\theta_2\end{cases}\qquad(1)
$$

where θ1 and θ2 are the thresholds that decide positive and negative samples, and the IoU is computed by formula (2), in which P denotes a candidate box, G denotes the Ground Truth, and Area(X) denotes the area of bounding box X:

$$
\text{IoU}(P,G)=\frac{\text{Area}(P\cap G)}{\text{Area}(P\cup G)}\qquad(2)
$$
After the positive samples are selected, only a portion of hard negative samples are chosen for training at a positive-to-negative ratio of 1:3, rather than sending all negative samples to the second stage. In two-stage detectors such as Faster R-CNN [7] and FPN [8], the thresholds θ1 and θ2 in formula (1) for deciding positive and negative samples are typically set to 0.5 and 0.3, respectively.
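As a concrete illustration of formulas (1) and (2), the following Python sketch labels candidate boxes against the ground truth; the function and variable names are illustrative and not part of the patent.

```python
def iou(p, g):
    """Formula (2): intersection-over-union of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(p[0], g[0]), max(p[1], g[1])
    ix2, iy2 = min(p[2], g[2]), min(p[3], g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (p[2] - p[0]) * (p[3] - p[1])
    area_g = (g[2] - g[0]) * (g[3] - g[1])
    return inter / (area_p + area_g - inter)


def label_candidates(candidates, ground_truths, theta1=0.5, theta2=0.3):
    """Formula (1): a candidate is positive if its best IoU >= theta1,
    negative if that IoU is < theta2, and ignored otherwise."""
    labels = []
    for p in candidates:
        best = max(iou(p, g) for g in ground_truths)
        if best >= theta1:
            labels.append("positive")
        elif best < theta2:
            labels.append("negative")
        else:
            labels.append("ignore")
    return labels
```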
2. One-stage target detection algorithm
A one-stage (also called single-shot) object detection algorithm extracts candidate boxes and uses them to predict the Ground Truth within a single stage; it is generally an end-to-end deep learning architecture.
The way a one-stage detector initializes candidate boxes and decides positive and negative samples is the same as for the two-stage detector described above. In SSD [9], θ1 and θ2 are both set to 0.5. After a reasonable proportion of positive and negative samples is picked from the candidate boxes, the training cost function is:

$$
L(I,C,P,G)=\frac{1}{N}\Big(L_{cls}(I,C)+\alpha L_{loc}(I,P,G)\Big)\qquad(3)
$$

where N is the number of positive samples selected, $L_{cls}(I,C)$ denotes the class prediction term, $L_{loc}(I,P,G)$ denotes the position regression term, α is the penalty factor, C is the number of classes in the training set, and I is an indicator term with $I_{ij}\in\{0,1\}$, where $I_{ij}=1$ if and only if the ith candidate box matches the jth Ground Truth.
The regression term is given in formula (4). Here $(g_j^{cx},g_j^{cy})$ and $(p_i^{cx},p_i^{cy})$ denote the center coordinates of the jth Ground Truth and the ith candidate box, $(g_j^{w},g_j^{h})$ and $(p_i^{w},p_i^{h})$ denote their widths and heights, $\hat g_j^{m}$ is the relative offset of the Ground Truth with respect to the candidate box, and $l_i^{m}$ is the regression prediction of the offset for the ith candidate box:

$$
L_{loc}(I,P,G)=\sum_{i\in Pos}\;\sum_{m\in\{cx,cy,w,h\}} I_{ij}\,\mathrm{smooth}_{L1}\big(l_i^{m}-\hat g_j^{m}\big)\qquad(4)
$$

$$
\hat g_j^{cx}=\frac{g_j^{cx}-p_i^{cx}}{p_i^{w}},\quad
\hat g_j^{cy}=\frac{g_j^{cy}-p_i^{cy}}{p_i^{h}},\quad
\hat g_j^{w}=\log\frac{g_j^{w}}{p_i^{w}},\quad
\hat g_j^{h}=\log\frac{g_j^{h}}{p_i^{h}}\qquad(5)
$$
The category prediction term is given in formula (6), where $\hat c_i^{k}$ is the predicted probability of the ith candidate box for the kth class and $\hat c_i^{0}$ is its predicted probability for the background:

$$
L_{cls}(I,C)=-\sum_{i\in Pos} I_{ij}\log\hat c_i^{k}-\sum_{i\in Neg}\log\hat c_i^{0},\qquad
\hat c_i^{k}=\frac{\exp(c_i^{k})}{\sum_{k}\exp(c_i^{k})}\qquad(6)
$$
It can be seen that the size of the initialized candidate boxes affects the number of positive and negative samples: choosing reasonable widths and heights for the initialization effectively increases the number of positive samples that can be selected. The specific embodiments also verify this hypothesis experimentally.
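To make the cost function concrete, here is a minimal Python sketch of formulas (3)-(6) for already-matched samples; the helper names and the data layout (tuples of offsets, boxes and softmax scores) are assumptions made for illustration, not the patent's implementation.

```python
import math

def encode_offsets(p, g):
    """Formula (5): offsets of ground truth g relative to candidate box p, both (cx, cy, w, h)."""
    return [(g[0] - p[0]) / p[2], (g[1] - p[1]) / p[3],
            math.log(g[2] / p[2]), math.log(g[3] / p[3])]

def smooth_l1(x):
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multibox_loss(pos_pairs, neg_scores, alpha=1.0):
    """Formula (3): (1/N) * (L_cls + alpha * L_loc).
    pos_pairs: (l, p, g, class_prob) per matched positive, where l is the predicted
    offset vector and class_prob the softmax score of the matched class;
    neg_scores: background softmax scores of the selected negatives."""
    n = max(len(pos_pairs), 1)
    l_loc = sum(smooth_l1(l[m] - encode_offsets(p, g)[m])     # formula (4)
                for l, p, g, _ in pos_pairs for m in range(4))
    l_cls = (-sum(math.log(c) for *_, c in pos_pairs)         # formula (6)
             - sum(math.log(c0) for c0 in neg_scores))
    return (l_cls + alpha * l_loc) / n
```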
Disclosure of Invention
The invention aims to provide a dangerous object detection method based on human body context for a millimeter wave image, so that a machine can automatically identify dangerous objects in the millimeter wave image, real-time performance is achieved, and safety inspection and security efficiency is improved.
The idea of the technical scheme is shown in Fig. 2. The process of the invention is divided into two processes: 1. a bottom-up process; 2. a top-down process. The bottom-up process is responsible for detecting potentially dangerous objects. The top-down process: a. recovers the human body context information; b. fuses the feature maps from the bottom-up process with the feature maps carrying human body context information; and c. is responsible for the final prediction of dangerous objects. After the bottom-up process has extracted feature maps related to dangerous objects, the saliency module (Attention Module) uses an attention mechanism to select the part of the bottom-up feature maps that expresses dangerous-object characteristics and combines the attended result with the top-down human body context.
In addition, the initialized candidate boxes (described in the Background section) cannot produce a large overlap with the Ground Truth, so they contain considerable noise. To solve this problem, the invention adopts multi-task learning and adds Auxiliary Supervision: the candidate boxes of the top-down process are initialized from the regression results of the candidate boxes of the bottom-up process. The model is optimized in a multi-objective manner that combines the auxiliary supervision with the SSD cost function.
The invention provides a dangerous object detection method based on human body context information in millimeter wave images, including the construction of the network structure and the design of the cost functions; training and testing are described in the Detailed Description. The specific steps are as follows:
step 1, from bottom to top: and downsampling the millimeter wave image, and selecting the feature maps of three levels for prediction.
1.1: the millimeter wave image is input into a Convolutional Neural Network (CNN), features are extracted, and a downsampling operation is performed. The convolutional neural network comprises 10 convolutional layers in total and is used for extracting features; the 10 convolutional layers are: conv1_1, conv1_2, conv2_1, conv2_2, conv3_1, conv3_2, conv3_3, conv4_1, conv4_2, conv4_3; the size of the convolution kernel is set to be 3 multiplied by 3, the size of the convolution kernel moving step is 1, and 0 is supplemented at the convolution boundary. As shown in fig. 2, the feature maps visualized in the figure are features extracted from the convolution kernels of Conv1_2, conv2_2, conv3_3, conv4_3, namely, feature maps extracted from the convolution kernels of Conv1_2, conv2_2, conv3_3, conv4_3, which are still labeled as Conv1_2, conv2_2, conv3_3, conv4_3; the purpose of adopting convolution operation is to obtain local abstract features of the image;
and (3) pooling operation: the Conv1_2, conv2_2, conv3 _3convolution kernels are all followed by maximum pooling operations to achieve a downsampling operation, each maximum pooling operation downsampling twice the image size;
1.2: performing down-sampling operation twice on Conv4_3 by adopting the maximum pooling introduced in the step 1.1, wherein the down-sampling operation is twice, and the feature map obtained after down-sampling is named as fc7, conv6_2;
1.3: and selecting feature maps of three levels conv4_3, fc7 and conv6_2 for prediction. The feature map conv4_3 downsamples the original image by 8 times, the feature map fc7 downsamples the original image by 16 times, and the feature map conv6_2 downsamples the original image by 32 times; the three different levels of feature maps represent dangerous objects with different scales respectively.
Step 2, from bottom to top: initialize candidate boxes in the millimeter wave image according to the feature maps, and select positive and negative samples.
2.1: respectively initializing the ith candidate frame in the original image for each feature point in the feature maps of the three levels conv4_3, fc7 and conv6 \2
Figure BDA0002016402010000041
Where cx denotes an abscissa of the center point of the candidate frame, cy denotes an ordinate of the center point of the candidate frame, w denotes a width of the candidate frame, and h denotes a height of the candidate frame. The method for initializing candidate boxes is according to the formula (7) to the formula (9), and the initialization result is shown in fig. 1. In fig. 1, the dotted bounding box represents Ground Truth (Ground Truth), and the remaining colors represent candidate boxes for algorithm initialization;
Figure BDA0002016402010000042
Figure BDA0002016402010000043
Figure BDA0002016402010000044
in the above formula, s k E { conv4_3, fc7, conv6_2}, which represents the scale factor (width-height ratio for the millimeter wave image) of the hierarchical feature map initialization candidate box participating in prediction; n represents the number of hierarchical feature maps participating in prediction, and n =3 in the present invention; s min Represents a global minimum scale; s max Representing the global maximum scale. r is a radical of hydrogen j Representing a collection of different aspect ratios. W represents the width of the millimetric-wave image and H represents the height of the millimetric-wave image.
In SSD [9], for natural image data sets, $s_{min}=0.2$ and $s_{max}=0.9$. However, the present invention targets millimeter wave images, in which the area of a foreground object is much smaller than in natural images, as shown in Fig. 3. Therefore, in the embodiments of the invention $s_{min}$ is set to 0.1 and $s_{max}$ to 0.4.
2.2: after step 2.1 is finished, the candidate box can already cover the original image. At this point, each candidate box generated by step 2.1 is scaled as a positive or negative sample according to equation (1). In the embodiment of the invention, the threshold value theta 1 And theta 2 Set to 0.3 and 0.3, respectively.
Step 3, from top to bottom: up-sample the feature map conv6_2 obtained in step 1 to recover the human body context information.
3.1: As shown in Fig. 2, step 1 down-samples the original image by 32× with the CNN to obtain the feature map conv6_2. conv6_2 is passed through the saliency fusion module (Attention Module) to obtain the feature map E6. In the bottom-up process, conv4_3, fc7 and conv6_2 extract foreground features. Based on these features, the saliency fusion module selects the more representative foreground features to be fused with the human body context information, i.e., it screens the bottom-up features (focuses attention on a subset of them). The invention uses the S-E architecture [11] to realize the saliency fusion.
3.2: The feature map E6 obtained in step 3.1 is used to recover the human body context information; E6 is up-sampled by deconvolution [12] to obtain E6'.
3.3: fusing the feature map fc7 obtained in the step 1.1 with the feature map E6' obtained in the step 3.2 through a significance fusion module (by adopting addition), and obtaining a feature map E5 after fusing;
3.4: again, following steps 3.2 and 3.3, a feature map E4 is obtained.
Step 4, from top to bottom: generate candidate boxes in the original millimeter wave image, and select positive and negative samples.
Steps 3.1-3.4 generate feature maps of three levels, E4, E5 and E6, at 1/8, 1/16 and 1/32 of the original image size respectively; the final dangerous objects (i.e., forbidden objects) are predicted from these three levels.
4.1: Initialize candidate boxes for the three feature maps E4, E5 and E6.
For each feature point in the feature maps E4, E5 and E6, a corrected ith candidate box $D_i=(d_i^{cx},d_i^{cy},d_i^{w},d_i^{h})$ is generated according to the following rule:

$$
d_i^{cx}=p_i^{cx}+l_i^{cx}\,p_i^{w},\qquad d_i^{cy}=p_i^{cy}+l_i^{cy}\,p_i^{h}
$$

$$
d_i^{w}=p_i^{w}\exp\big(l_i^{w}\big),\qquad d_i^{h}=p_i^{h}\exp\big(l_i^{h}\big)
$$

Here $l_i^{m}$, m ∈ {cx, cy, w, h}, is the regression prediction of the offset of the ith candidate box, as in formula (5), where cx denotes the abscissa of the center point of the candidate box, cy its ordinate, w its width and h its height. $l_i^{m}$ is the correction vector learned by the auxiliary supervision function in the bottom-up stage; this vector and the candidate boxes $P_i$ generated in step 2.1 are used to initialize the candidate boxes $D_i$ of E4, E5 and E6.
For ease of understanding, Fig. 5 visualizes this flow. The dotted bounding boxes represent the Ground Truth and the bounding boxes in other colors represent candidate boxes. A is the input image, B shows the candidate boxes generated in step 2.1, C shows the candidate boxes corrected in step 4.1, and D is the regression result obtained through the SSD cost function.
4.2: after step 4.1 is finished, the corrected candidate frame can be overlapped with the group Truth to a greater extent. Each candidate box generated by step 4.1 is then scaled as a positive or negative sample according to equation (1). In an embodiment of the invention, the threshold value theta 1 And theta 2 Set to 0.7 and 0.3, respectively. (since the candidate frame has been corrected, so with respect to θ set in step 2.2 1 =0.3 and θ 2 =0.3, the positive sample threshold θ is appropriately raised 1 =0.7, the basis for selecting the positive sample threshold is: on the premise of not reducing the model performance, the theta is improved as much as possible 1 )
Step 5: optimize the result of step 2 using Auxiliary Supervision.
After step 2.2, a series of positive and negative samples is obtained, and the bottom-up process is optimized with formula (3). The advantages are: a. the shallow layers of the network can effectively learn the appearance features of small targets; b. the regression term $l_i^{m}$ of the candidate boxes can be learned, and this term corrects the candidate boxes used to initialize the top-down process.
For the positive samples obtained in step 2.2, formula (4) is used to learn the candidate box regression term $l_i^{m}$. For the positive and negative samples obtained in step 2.2, formula (6) is used to discriminate them correctly. To further ease model training, the position regression penalty is set to α = 1 through a multi-task learning mechanism, and the regression term $l_i^{m}$ and the positive/negative discrimination term are learned jointly.
Step 6: optimize the result of step 4 using the SSD cost function.
After step 4.2, the candidate boxes have completed the correction of their position coordinates; the top-down process is then optimized with formula (3).
For the corrected positive samples obtained in step 4.2, formula (4) is used to learn the final regression term $l_i^{m}$, m ∈ {cx, cy, w, h}, of the ith candidate box, where cx denotes the abscissa of the center point of the candidate box, cy its ordinate, w its width and h its height.
For the corrected positive and negative samples obtained in step 4.2, formula (6) is used to discriminate them correctly. To further ease model training, the position regression penalty is set to α = 1 through a multi-task learning mechanism, and the regression term $l_i^{m}$ and the positive/negative discrimination term are learned jointly.
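Conceptually, steps 5 and 6 are optimized together: the auxiliary-supervision loss on the bottom-up predictions and the SSD cost on the top-down predictions are both instances of formula (3). A hedged sketch, reusing the multibox_loss() example from the background section and assuming equal weighting of the two stages (a weighting the patent does not specify):

```python
def total_training_loss(bottom_up_out, top_down_out):
    """Each argument supplies the matched positives and selected negative background
    scores in the form expected by the multibox_loss() sketch given earlier."""
    aux_loss = multibox_loss(bottom_up_out["pos_pairs"],
                             bottom_up_out["neg_scores"], alpha=1.0)   # step 5, thresholds 0.3/0.3
    ssd_loss = multibox_loss(top_down_out["pos_pairs"],
                             top_down_out["neg_scores"], alpha=1.0)    # step 6, thresholds 0.7/0.3
    return aux_loss + ssd_loss
```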
For millimeter wave images, the invention proposes acquiring the human body context information through a top-down process, and fuses a representative subset of foreground target features with this context information through the saliency fusion module. Finally, an auxiliary supervision function is used to optimize the bottom-up process and to correct the candidate boxes initialized in the bottom-up stage, so that the corrected candidate boxes describe the Ground Truth accurately and the accuracy of the model is improved. Experiments show that fusing human body context information and saliency improves the recall rate by 29.13%, and correcting the candidate boxes with the auxiliary supervision function further improves the accuracy of the model.
Drawings
Fig. 1 shows the result of candidate frames (anchors) generated in the original image by the feature maps involved in the prediction.
Fig. 2 is the overall architecture of the present invention. Human context information is fused through bottom-up and top-down processes.
Fig. 3 is a comparison graph of the area size of the foreground object in the natural image and the area size of the foreground object in the millimetric-wave image. The left side of the figure is the statistical result of the natural image, and the right side of the figure is the statistical result of the millimeter wave image. The abscissa GT area represents the size of the area of the foreground object, and the ordinate Number represents the Number of foreground objects.
Fig. 4 shows different network architectures, where A is HRF_1, B is HRF_2, and C is HRF_3.
Fig. 5 is a schematic diagram of the candidate box correction process. A is the image originally input to the network, and the dashed bounding boxes represent the Ground Truth. B shows the candidate boxes initialized by the bottom-up stage. C shows the candidate boxes after the correction of step 4.1. D is the result of prediction using E4, E5, E6 and the candidate boxes of panel C.
Fig. 6 is a comparison of saliency modules on the validation set. The abscissa (Training iterations) is the number of training iterations and the ordinate (mAP) is the performance index of the detector.
Fig. 7 is a comparison of the present invention with the DSOD [13] model. The abscissa (Training iterations) is the number of training iterations and the ordinate (mAP) is the performance index of the detector.
Fig. 8 compares the prediction of the bottom-up process with the prediction of the top-down process (the threshold for picking positive samples is set to 0.01).
Fig. 9 is a comparison of the prediction results of the Baseline model, the HRF_1 model and the HRF_1_AS model. The red bounding boxes represent predictions and the dashed bounding boxes represent the Ground Truth.
Fig. 10 is a visualization of the feature maps of the Baseline model, the HRF_1 model and the HRF_1_AS model. The left image is the input image and the right image is the feature map.
Detailed Description
In the following, embodiments of the present invention are described on the millimeter wave data set.
Description of the data set: the data set used in the invention comes from [3]; it contains 150,000 training images with forbidden objects, 6454 validation images with forbidden objects, and 9 standard test sets.
1. Ablation experiment:
training experiment setup:
Training is carried out on the 150,000 images of the training data set; the code is written with Caffe [14]. All experiments in this section follow the settings below (a sketch of the corresponding optimizer configuration is given after this list):
the learning rate is initialized to 0.001;
the training traverses the training set about 20 times (i.e., about 20 epochs);
number of training iterations: 45,000; batch size: 64;
optimization algorithm: SGD with momentum, with momentum set to 0.9;
regularization: L2, with the penalty factor (weight decay) set to 0.0005;
pre-training the model: the optimal results of the training of the SSD [9] model on the VOC0712 data set are loaded as initialization parameters.
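For reference, the hyperparameters listed above translate into a standard SGD configuration; the sketch below uses PyTorch, whereas the patent's experiments were run with Caffe, so this is only an illustrative mapping of the listed values.

```python
import torch

def make_optimizer(model):
    # initial learning rate 0.001, SGD with momentum 0.9, L2 weight decay 0.0005
    return torch.optim.SGD(model.parameters(), lr=0.001,
                           momentum=0.9, weight_decay=0.0005)

MAX_ITERATIONS = 45000   # roughly 20 epochs over the 150,000-image training set at batch size 64
BATCH_SIZE = 64
```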
Test experiment setup:
the test was done in 9 standard test sets.
Construction of the test set: the test images come from people of different heights and body types, acquired at different times and in different postures. 50% of the test images contain dangerous objects and 50% contain no dangerous objects at all.
During testing, in all of the following experiments the positive-sample threshold is set to 0.5 and the detection overlap threshold to 0.1 (i.e., a prediction of the network is counted as a detection when its overlap with the Ground Truth is greater than 0.1).
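The detection criterion above can be written as a small check, reusing the iou() sketch from the background section; the argument names are illustrative.

```python
def counts_as_detection(pred_box, gt_box, score,
                        score_threshold=0.5, overlap_threshold=0.1):
    """A prediction is counted as detecting a ground-truth box when its score passes
    the positive-sample threshold and its overlap (IoU) with the Ground Truth exceeds 0.1."""
    return score >= score_threshold and iou(pred_box, gt_box) > overlap_threshold
```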
1.1, effect of the bottom-up and top-down processes:
the overlap is defined as the threshold value of selecting the positive sample during training (coincidence degree of the candidate frame and the Ground Truth)
As shown in Table 1, the baseline model is SSD [9]. Because the area of foreground objects in the millimeter wave data set is small (see Fig. 3), the invention modifies the scale coefficients of the SSD candidate-box initialization: $s_{min}$ is set to 0.1 and $s_{max}$ to 0.4, and this setting is kept in all subsequent experiments; see formula (7).
Considering that the scale of objects in the millimeter wave data set is essentially fixed, the invention removes conv7_2, conv8_2 and conv9_2 of the SSD [9] model; experiments show that deleting these high-level feature maps does not degrade the model, so in the later experiments we only down-sample to the conv6_2 layer.
The substantially scale-invariant nature of the millimeter-wave data set is due to the actual usage scenario. In actual use, the distance between the millimeter wave imaging device and a measured human body is always fixed, so that the network only needs to consider the imaging information of different statures, heights and postures based on the same distance without considering remote human body information and dangerous objects carried by the human body. The experimental results also demonstrate this hypothesis of the invention.
Table 1. Comparison of predictions from feature maps of different levels; all data trained with overlap = 0.3 (Recall: detection rate; Precision: accuracy; AVG: mean over the nine test sets; F1: F1 score)
HRF_1, HRF_2 and HRF_3 in Table 1 correspond to A, B and C in Fig. 4, respectively: A predicts the final result from the three hierarchical feature maps E4, E5, E6; B from E3, E4, E5; and C from E5, E6, E7. Experiments show that all three ways of fusing human context information effectively improve the baseline model [9]. HRF_2 performs best because its architecture uses more high-resolution feature maps for prediction, but it is also the most time-consuming. Therefore, to trade off performance against computation time, the HRF_1 architecture is adopted in the following research.
1.2, significance fusion module (Attention module) effect:
the overlap is defined as the threshold value of selecting the positive sample during training (coincidence degree of the candidate frame and the Ground Truth)
Purpose of the saliency module: after the top-down process generates feature maps carrying human body context information, we expect the feature maps expressing foreground objects acquired in the shallow layers to be fused effectively with them. This section therefore investigates the fusion mode.
HRF_1_Conv256: 256 features are selected from the bottom-up process by convolution to represent the foreground object, and are fused with the feature maps generated by the top-down process by addition.
HRF_1_Concat: 256 features are selected from the bottom-up process by convolution to represent the foreground object, and are fused with the feature maps generated by the top-down process by concatenation.
HRF_1_SE: features representing foreground objects are selected from the bottom-up process using the SE [11] architecture and fused with the top-down feature maps by addition.
HRF_1 w/o Attention Module: no fusion with the bottom-up feature maps.
The experimental results are shown in table 2:
table 2. Overlap =0.3 for all data training for comparison of significance modules (Recall for detection rate, precision for accuracy, AVG for mean of nine test sets, F1 for F1 score)
Table 2 shows that without fusing the bottom-up feature maps the detection rate of the model on each test set is low, and that the detection rate rises as the extraction capability of the saliency module becomes stronger. Fig. 6 shows the results on the 6454-image validation set: HRF_1_Conv256 is superior to the model without the attention mechanism in both convergence and the mAP metric. Finally, the experiments show that attention by convolution combined with additive fusion works best, so HRF_1_Conv256 is used for further study in subsequent experiments.
1.3, auxiliary Supervision (Auxiliary Supervision) effect:
the overlap is defined as the threshold value of selecting the positive sample during training (coincidence degree of the candidate frame and the Ground Truth)
Purpose of auxiliary supervision: in the bottom-up process, conv4_3, fc7 and conv6_2 initialize candidate boxes in the original image, as detailed in step 2.1. The candidate boxes are initialized a priori from statistics of foreground object sizes, as in Fig. 3. Initializing candidate boxes from prior statistics can estimate the size of the Ground Truth effectively, but it cannot adapt as the Ground Truth size changes. The invention therefore proposes an Auxiliary Supervision function whose purpose is to learn the offsets $l_i^{m}$ of the candidate boxes initialized on conv4_3, fc7 and conv6_2 relative to the Ground Truth (see step 4.1 for details), and, through these offsets, to re-initialize candidate boxes with a higher overlap with the Ground Truth, as shown in Fig. 5(C).
Here, the present invention will explain the research results of the auxiliary supervision function by table 3.
HRF_1_Conv256 is the architecture with the saliency module; HRF_1_AS denotes HRF_1_Conv256 with the auxiliary supervision function added.
The invention studies the influence of different overlap values on the effect of the auxiliary supervision function, where overlap is defined as θ1 in formula (1).
HRF_1_AS, overlap = 0.3: after the auxiliary supervision function is added to the HRF_1_Conv256 model, the average detection rate on the test sets rises to 84.92%, but the average accuracy drops to 73.21%. We attribute this to the fact that the HRF_1_AS model already has an accurately positioned candidate-box initialization thanks to candidate-box rectification, so a low overlap threshold introduces more noise samples.
Thus in subsequent experiments, overlap was increased.
HRF_1_AS, overlap = 0.7: after the overlap is raised from 0.3 to 0.7, experiments show that the average detection rate on the test sets drops only slightly, while the average accuracy increases by 8.81% and the F1 score by 0.0349.
*HRF_1_AS, overlap = 0.7: to verify whether the performance improvement comes from the auxiliary supervision function itself or from the candidate-box correction by the learned offsets $l_i^{m}$, we design the *HRF_1_AS, overlap = 0.7 model, which differs from HRF_1_AS, overlap = 0.7 in that its candidate boxes are initialized by step 2.1, i.e., their sizes are estimated a priori. The experimental results show that this prior initialization lowers the average detection rate on the test sets by 6.3%, while the excessively high training overlap gives the model a higher accuracy.
Table 3. Comparative study of the auxiliary supervision function (Recall: detection rate; Precision: accuracy; AVG: mean over the nine test sets; F1: F1 score)
2. Comparative experiment:
experimental setup:
The training and test settings in this section are the same as in the ablation experiments. Note, however, that when implementing SSD [9], DSOD [13] and DSSD [12], since HRF_1_AS, overlap = 0.7 initializes candidate boxes with $s_{min}$ = 0.1 and $s_{max}$ = 0.4, we likewise set $s_{min}$ to 0.1 and $s_{max}$ to 0.4 (instead of the original $s_{min}$ = 0.2, $s_{max}$ = 0.9), unlike the original SSD, DSOD and DSSD implementations. In fact, the experimental results show that this prior modification also helps improve the performance of those models on the test sets.
2.1, comparison with training from scratch:
Because millimeter wave imaging results differ to some extent from natural optical images, while the pre-training model used above was trained on the VOC0712 data set, a natural image data set acquired with optical imaging equipment, the motivation of this experiment is to compare training from scratch with training initialized from VOC0712, and thereby to show that on the millimeter wave data set a model pre-trained on a large-scale natural image data set is effective.
Experimental setup:
the invention uses the DSOD [13] model as a de novo training model on the millimeter wave data set, since the training of the DSOD requires traversing 600 (epochs) training sets or more. For a 15-ten-thousand millimeter wave training set, training on 2 NVIDIA TITANXP devices requires 833 hours (34.72 days). Therefore, considering that the present embodiment selects a subset from the 15 ten thousand millimeter wave data sets, we select 19097 pictures (these pictures are distributed at different times, and include millimeter wave imaging results of different statures, different heights, and different sexes), and divide 19097 pictures into 14491 training sets and 4606 verification sets.
Batch size = 64; the remaining experimental settings are the same as described in Section 1.
Fig. 7 shows the experimental results. HRF_1_AS, overlap = 0.7 denotes the HRF architecture with the auxiliary supervision function introduced above, using the model pre-trained on the VOC0712 data set; DSOD uses $s_{min}$ = 0.1 and $s_{max}$ = 0.4.
2.2, comparison with the best models:
table 4 shows the results of the experiments compared to SSD, DSSD respectively, where DSSD is s of DSSD min Set to 0.1,s max Set to 0.4. The run-time test was done on NVIDIA TITAN Xp, batch size =4, taking the average of 1000 iterations.
Comparing DSSD with DSSD, the difference between foreground targets of natural image and millimeter wave image can reduce the size of the initialized candidate frame reasonably to raise the number of positive samples selected by the model effectively and raise the detection rate of the model.
Table 4. Comparison with the best models (Recall: detection rate; Precision: accuracy; AVG: mean over the nine test sets; F1: F1 score; Time: model inference time in milliseconds)
3. Analysis of results
Note on the figures: for the purposes of this study, the threshold for positive samples is set to 0.01; the dashed bounding boxes represent the Ground Truth and the red bounding boxes represent the predictions.
The experimental results of the specific implementation are shown in Fig. 8 and analyzed in this section. In Fig. 8, the first row shows the detection results of the bottom-up process of the HRF_1_AS, overlap = 0.7 model, and the second row shows the detection results of its top-down process.
Through the auxiliary supervision function, the prediction results of the first row are used to initialize the candidate boxes for the predictions of the second row, and the dangerous objects are predicted again in combination with the human body context information.
Comparing columns A, B, C and D in Fig. 8, some dangerous objects that are difficult to detect in the bottom-up process can be classified as positive samples with high probability after the initialized candidate boxes are corrected by the auxiliary supervision function and the human body context information is fused.
Comparing columns E to J in fig. 8, it is shown that effective frame candidate initialization can effectively remove sample noise and improve the detection rate of dangerous objects.
In conclusion, for millimeter wave security inspection data, the algorithm provided by the invention effectively combines the millimeter wave-based human body context information to predict the result, and effectively initializes the candidate frame by adding the auxiliary supervision function, thereby improving the prediction performance of the model. Compared with other algorithms of the same type, the method has higher algorithm performance and faster algorithm running speed.
The details introduced in the examples are not intended to limit the scope of the claims but to aid in the understanding of the process described herein. Those skilled in the art will understand that: various modifications, changes or substitutions to the preferred embodiment steps are possible without departing from the spirit and scope of the invention and its appended claims. Therefore, the present invention should not be limited to the disclosure of the preferred embodiments and the accompanying drawings.
Reference to the literature
[1] Sheen D M, Mcmakin D L, Hall T E. Three-dimensional millimeter-wave imaging for concealed weapon detection[J]. IEEE Transactions on Microwave Theory and Techniques, 2001, 49(9): 1581-1592.
[2]Huguenin G R,Goldsmith P F,Deo N C,et al.Contraband detection system.U.S.Patent 5073782,Dec.17,1991.
[3]Zhu Y Z Y,Yang M Y M,Wu L W L,et al.Practical millimeter-wave holographic imaging system with good robustness[J].Chinese Optics Letters,2016,14(10):101101-101105.
[4]Guimaraes A A R.Detecting zones and threat on 3D body in security airports using deep learning machine[J].arXiv:1802.00565,2018.
[5] Yao Guxiong, Yang Minghui, Zhu Yu, et al. Millimeter wave image contraband object localization using convolutional neural networks [J]. Journal of Infrared and Millimeter Waves, 2017, 36(3).
[6]Liu C,Yang M H,Sun X W.TOWARDS ROBUST HUMAN MILLIMETER WAVE IMAGING INSPECTION SYSTEM IN REAL TIME WITH DEEP LEARNING[J].Progress In Electromagnetics Research,2018,161:87-100.
[7]Ren S,He K,Girshick R,et al.Faster R-CNN:Towards Real-Time Object Detection with Region Proposal Networks[J].IEEE Transactions on Pattern Analysis&Machine Intelligence,2015,39(6):1137-1149.
[8] Lin T Y, Dollár P, Girshick R, et al. Feature Pyramid Networks for Object Detection[C]. In CVPR, 2017.
[9]Liu W,Anguelov D,Erhan D,et al.SSD:Single Shot MultiBox Detector[C].In ECCV,2016.
[10]K.Simonyan and A.Zisserman.Very deep convolutional networks for large-scale image recognition.In ICLR,2015.
[11]Hu J,Shen L,Albanie S,et al.Squeeze-and-Excitation Networks[J].In CVPR,2017.
[12]Fu C Y,Liu W,Ranga A,et al.DSSD:Deconvolutional Single Shot Detector[J].In CVPR,2017.
[13]Shen Z,Liu Z,Li J,et al.DSOD:Learning Deeply Supervised Object Detectors from Scratch[J].In ICCV,2017.
[14] Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T. Caffe: Convolutional architecture for fast feature embedding. In: MM, 2014.

Claims (6)

1. A dangerous object detection method based on human body context information in millimeter wave images, characterized in that a convolutional neural network is used to down-sample the millimeter wave image to obtain abstract features, the human body context information is recovered through a top-down process, the features of objects carried on the human body and the human body context features are fused through a saliency fusion module, and finally an auxiliary supervision function is used to correct the offsets of the initialized candidate boxes; the method comprises the following specific steps:
step 1, from bottom to top: down-sampling the millimeter wave image, and selecting three levels of feature maps for prediction;
1.1: inputting the millimeter wave image into a convolutional neural network (CNN), extracting features, and performing a down-sampling operation; the convolutional neural network comprises 10 convolutional layers in total for feature extraction, namely: Conv1_1, Conv1_2, Conv2_1, Conv2_2, Conv3_1, Conv3_2, Conv3_3, Conv4_1, Conv4_2, Conv4_3; the convolution kernel size is set to 3×3, the stride is set to 1, and zero padding is applied at the convolution boundary; the feature maps extracted by the Conv1_2, Conv2_2, Conv3_3, Conv4_3 convolution kernels are still labeled Conv1_2, Conv2_2, Conv3_3, Conv4_3;
the Conv1_2, Conv2_2 and Conv3_3 convolution kernels are each followed by a max pooling operation to realize down-sampling, and each max pooling halves the image size;
1.2: Conv4_3 is down-sampled twice more using the max pooling introduced in step 1.1, and the feature maps obtained after down-sampling are named fc7 and conv6_2;
1.3: selecting feature maps of three levels conv4_3, fc7 and conv6_2 for prediction; the feature map conv4_3 downsamples the original image by 8 times, the feature map fc7 downsamples the original image by 16 times, and the feature map conv6_2 downsamples the original image by 32 times; the three different levels of feature maps respectively represent dangerous objects with different scales;
step 2, from bottom to top: initializing a candidate frame in the millimeter wave image according to the feature map, and selecting positive and negative samples;
2.1: for each feature point in the three-level feature maps conv4_3, fc7 and conv6_2, respectively initializing the ith candidate box in the original image as $P_i=(p_i^{cx},p_i^{cy},p_i^{w},p_i^{h})$, wherein cx represents the abscissa of the center point of the candidate box, cy the ordinate of the center point, w the width of the candidate box, and h the height of the candidate box; the candidate boxes are initialized according to formulas (7) to (9):

$$
s_k=s_{min}+\frac{s_{max}-s_{min}}{n-1}(k-1),\qquad k\in\{1,\dots,n\}\qquad(7)
$$

$$
p^{w}=s_k\sqrt{r_j}\,W\qquad(8)
$$

$$
p^{h}=\frac{s_k}{\sqrt{r_j}}\,H\qquad(9)
$$

in the above formulas, $s_k$, k ∈ {conv4_3, fc7, conv6_2}, represents the scale factor, relative to the width and height of the millimeter wave image, used to initialize candidate boxes on the kth prediction feature map; n represents the number of feature map levels participating in prediction, and n = 3 is taken; $s_{min}$ represents the global minimum scale; $s_{max}$ represents the global maximum scale; $r_j$ represents a set of different aspect ratios; W represents the width of the millimeter wave image, and H represents its height;
2.2: after step 2.1, the candidate boxes already cover the original image; each candidate box obtained in step 2.1 is then labeled as a positive or negative sample according to formula (1):

$$
\text{label}(P)=\begin{cases}\text{positive}, & \text{IoU}(P,G)\ge\theta_1\\ \text{negative}, & \text{IoU}(P,G)<\theta_2\end{cases}\qquad(1)
$$

wherein θ1 and θ2 are respectively the thresholds that decide positive and negative samples, and the IoU is calculated by formula (2), in which P represents a candidate box, G represents the ground truth, and Area(X) represents the area of bounding box X:

$$
\text{IoU}(P,G)=\frac{\text{Area}(P\cap G)}{\text{Area}(P\cup G)}\qquad(2)
$$
step 3, from top to bottom: performing up-sampling on the characteristic diagram conv6_2 obtained in the step 1 to recover the context information of the human body;
3.1: step 1, carrying out CNN downsampling on an original image by 32 times to obtain a feature map conv6_2; passing conv6_2 through a significance fusion module to obtain a feature map E6; in the bottom-up process, foreground features are extracted from conv4_3, fc7, conv6_2; based on the foreground features extracted by conv4_3, fc7 and conv6_2, the significance fusion module selects a part of more representative foreground features from the effective foreground features to be fused with the context information of the human body, and the features in the bottom-up process are screened; the significance fusion module adopts an S-E framework;
3.2: restoring the human body context information by using the characteristic diagram E6 obtained in the step 3.1; upsampling by a deconvolution operation to obtain E6';
3.3: fusing the feature map fc7 obtained in the step 1.1 with the feature map E6' obtained in the step 3.2 through a saliency fusion module, and obtaining a feature map E5 after fusing;
3.4: thirdly, obtaining a characteristic diagram E4 by following the step 3.2 and the step 3.3;
step 4, from top to bottom: generating a candidate frame in the millimeter wave original image, and selecting positive and negative samples;
steps 3.1 to 3.4 generate feature maps of three levels, E4, E5 and E6, at 1/8, 1/16 and 1/32 of the original image size respectively; the final dangerous objects are predicted using the feature maps of these three levels;
4.1: initializing candidate boxes for the three feature maps E4, E5 and E6;
generating, for each feature point in the feature maps E4, E5 and E6, a corrected ith candidate box $D_i=(d_i^{cx},d_i^{cy},d_i^{w},d_i^{h})$ according to the following rule:

$$
d_i^{cx}=p_i^{cx}+l_i^{cx}\,p_i^{w},\qquad d_i^{cy}=p_i^{cy}+l_i^{cy}\,p_i^{h}
$$

$$
d_i^{w}=p_i^{w}\exp\big(l_i^{w}\big),\qquad d_i^{h}=p_i^{h}\exp\big(l_i^{h}\big)
$$

here $l_i^{m}$ is the regression prediction of the offset of the ith candidate box, with m ∈ {cx, cy, w, h}, wherein cx represents the abscissa of the center point of the candidate box, cy the ordinate of the center point, w the width of the candidate box, and h the height of the candidate box; $l_i^{m}$ is the correction vector learned by the auxiliary supervision function in the bottom-up stage; this vector and the candidate boxes $P_i$ generated in step 2.1 are used to initialize the candidate boxes $D_i$ of E4, E5 and E6;
4.2: after step 4.1, the corrected candidate boxes overlap the Ground Truth to a large extent; each candidate box obtained in step 4.1 is then labeled as a positive or negative sample according to formula (1);
step 5, optimizing the result of the step 2 by using auxiliary supervision
after step 2.2, a series of positive samples and negative samples is obtained, and the bottom-up process is optimized using formula (3), which has the advantages that: a. the shallow layers of the network can effectively learn the appearance features of small targets; b. the regression term $l_i^{m}$ of the candidate boxes can be learned, and the candidate boxes initialized for the top-down process are corrected by this term;

$$
L(I,C,P,G)=\frac{1}{N}\Big(L_{cls}(I,C)+\alpha L_{loc}(I,P,G)\Big)\qquad(3)
$$

where N is the number of positive samples selected, $L_{cls}(I,C)$ denotes the class prediction term, $L_{loc}(I,P,G)$ denotes the position regression term, α represents the penalty factor, C is the number of classes in the training set, and I is an indicator term with $I_{ij}\in\{0,1\}$, where $I_{ij}=1$ if and only if the ith candidate box matches the jth Ground Truth;
for the positive samples obtained in step 2.2, the candidate box regression term $l_i^{m}$ is learned using formula (4);
for the positive and negative samples obtained in step 2.2, formula (6) is used to discriminate them correctly;
in the regression term formula (4), $(g_j^{cx},g_j^{cy})$ and $(p_i^{cx},p_i^{cy})$ respectively represent the center coordinates of the jth Ground Truth and the ith candidate box, $(g_j^{w},g_j^{h})$ and $(p_i^{w},p_i^{h})$ respectively represent their widths and heights, $\hat g_j^{m}$ is the relative offset of the Ground Truth with respect to the candidate box, and $l_i^{m}$ is the regression prediction of the offset of the ith candidate box:

$$
L_{loc}(I,P,G)=\sum_{i\in Pos}\;\sum_{m\in\{cx,cy,w,h\}} I_{ij}\,\mathrm{smooth}_{L1}\big(l_i^{m}-\hat g_j^{m}\big)\qquad(4)
$$

$$
\hat g_j^{cx}=\frac{g_j^{cx}-p_i^{cx}}{p_i^{w}},\quad
\hat g_j^{cy}=\frac{g_j^{cy}-p_i^{cy}}{p_i^{h}},\quad
\hat g_j^{w}=\log\frac{g_j^{w}}{p_i^{w}},\quad
\hat g_j^{h}=\log\frac{g_j^{h}}{p_i^{h}}\qquad(5)
$$

in formula (6), $\hat c_i^{k}$ is the predicted probability of the ith candidate box for the kth class, and $\hat c_i^{0}$ is the predicted probability of the ith candidate box for the background;

$$
L_{cls}(I,C)=-\sum_{i\in Pos} I_{ij}\log\hat c_i^{k}-\sum_{i\in Neg}\log\hat c_i^{0},\qquad
\hat c_i^{k}=\frac{\exp(c_i^{k})}{\sum_{k}\exp(c_i^{k})}\qquad(6)
$$
step 6, optimizing the result of the step 4 by utilizing an SSD cost function;
after the step 4.2 is finished, the candidate frame finishes the correction of the position coordinates, and then the process from top to bottom is optimized by adopting a formula (3);
for the corrected positive samples obtained in step 4.2, formula (4) is used to learn the final regression term of the i-th candidate frame, with m ∈ {cx, cy, w, h}, where cx represents the abscissa of the center point of the candidate frame, cy represents the ordinate of the center point of the candidate frame, w represents the width of the candidate frame, and h represents the height of the candidate frame;
for the corrected positive samples and negative samples obtained in step 4.2, formula (6) is used to correctly discriminate positive samples from negative samples.
2. The method for detecting dangerous objects based on millimeter wave image human body context information according to claim 1, wherein in step (2.1), $s_{min}$ is set to 0.1 and $s_{max}$ is set to 0.4.
3. The method for detecting dangerous objects based on millimeter wave image human body context information according to claim 2, wherein in step (2.2), the thresholds $\theta_1$ and $\theta_2$ are set to 0.3 and 0.3, respectively.
4. The method for detecting dangerous objects based on millimeter wave image human body context information according to claim 3, wherein in step (4.2), the thresholds are $\theta_1 = 0.7$ and $\theta_2 = 0.3$.
5. The method for detecting dangerous objects based on millimeter wave image human body context information according to claim 4, wherein in step (5), in order to further facilitate model training, the position-regression penalty factor is set to $\alpha = 1$, and through a multi-task learning mechanism the candidate-box regression term $\Delta l_i^{m}$ and the positive/negative sample discrimination term are learned jointly.
6. The method for detecting dangerous objects based on millimeter wave image human body context information according to claim 4, wherein in step (6), in order to further facilitate model training, the position-regression penalty factor is set to $\alpha = 1$, and through a multi-task learning mechanism the final candidate-box regression term and the positive/negative sample discrimination term are learned jointly.
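Purely as an illustration of the candidate-frame correction in step 4.1 of claim 1, the NumPy sketch below applies predicted offsets to initial candidate boxes, assuming the standard SSD-style decoding that inverts the regression targets of formula (4); the function and variable names are hypothetical and not taken from the patent.

```python
import numpy as np

def refine_candidate_boxes(boxes, offsets):
    """Correct initial candidate frames with predicted offsets.

    boxes:   (N, 4) array of candidate frames in [cx, cy, w, h] form
    offsets: (N, 4) array of predicted offsets for [cx, cy, w, h]
    Returns the corrected candidate frames, also in [cx, cy, w, h] form.
    """
    cx, cy, w, h = boxes.T
    dcx, dcy, dw, dh = offsets.T
    # Shift the center by a fraction of the box size and rescale the
    # width/height exponentially -- the inverse of the targets in (4).
    return np.stack([cx + dcx * w,
                     cy + dcy * h,
                     w * np.exp(dw),
                     h * np.exp(dh)], axis=1)
```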
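Formula (1), which assigns candidate boxes to positive or negative samples, is not reproduced in the claims above; the sketch below illustrates the usual IoU-threshold rule that claims 3 and 4 parameterize with $\theta_1$ and $\theta_2$, under the assumption that a best IoU at or above $\theta_1$ gives a positive, below $\theta_2$ a negative, and anything in between is ignored. The helper names are illustrative only.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [cx, cy, w, h]."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def label_candidates(candidates, ground_truths, theta1, theta2):
    """Label each candidate: +1 positive, 0 negative, -1 ignored."""
    labels = np.full(len(candidates), -1, dtype=int)
    for i, cand in enumerate(candidates):
        best = max((iou(cand, gt) for gt in ground_truths), default=0.0)
        if best >= theta1:
            labels[i] = 1   # overlaps a Ground Truth strongly enough
        elif best < theta2:
            labels[i] = 0   # treated as background
    return labels
```

Under this reading, the claim-3 setting $\theta_1 = \theta_2 = 0.3$ leaves no candidate ignored, while the claim-4 setting $\theta_1 = 0.7$, $\theta_2 = 0.3$ excludes boxes whose best IoU falls between the two thresholds.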
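A one-line reading of the cost in formula (3), assuming the conventional SSD weighting of the class term and the position term; the signature is illustrative.

```python
def detection_loss(cls_loss, loc_loss, num_positives, alpha=1.0):
    # Formula (3): class term plus alpha-weighted position term,
    # normalised by the number of matched positive samples N.
    n = max(num_positives, 1)  # guard against images with no positives
    return (cls_loss + alpha * loc_loss) / n
```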
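A sketch of the regression targets of formula (4) together with a smooth-L1 penalty between those targets and the predicted offsets; the one-to-one matching between each candidate and its Ground Truth is assumed to be given, and the smooth-L1 choice follows common SSD practice rather than anything stated in the claims.

```python
import numpy as np

def encode_targets(candidates, ground_truths):
    """Relative shifts g_hat of matched Ground Truths w.r.t. candidate frames.

    candidates, ground_truths: (N, 4) arrays in [cx, cy, w, h] form,
    already matched one-to-one (candidate i <-> its Ground Truth j).
    """
    g_cx = (ground_truths[:, 0] - candidates[:, 0]) / candidates[:, 2]
    g_cy = (ground_truths[:, 1] - candidates[:, 1]) / candidates[:, 3]
    g_w = np.log(ground_truths[:, 2] / candidates[:, 2])
    g_h = np.log(ground_truths[:, 3] / candidates[:, 3])
    return np.stack([g_cx, g_cy, g_w, g_h], axis=1)

def smooth_l1(x):
    # Quadratic near zero, linear elsewhere.
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def regression_loss(pred_offsets, candidates, ground_truths):
    # Penalise the gap between predicted offsets and encoded shifts,
    # summed over the positive samples only.
    targets = encode_targets(candidates, ground_truths)
    return smooth_l1(pred_offsets - targets).sum()
```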
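Finally, a sketch of the confidence term of formula (6): positive samples are scored against their matched class and negative samples against the background class, both as the negative log-likelihood of a softmax. The label convention (column 0 = background) is an assumption of this sketch.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def classification_loss(logits, labels):
    """Confidence term in the style of formula (6).

    logits: (N, C + 1) raw class scores, column 0 reserved for background.
    labels: (N,) integers; 0 for a negative sample, k > 0 for a positive
            sample matched to class k.
    """
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked + 1e-12).sum()
```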
CN201910264671.1A 2019-04-03 2019-04-03 Dangerous object detection method based on millimeter wave image human body context information Active CN110245675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910264671.1A CN110245675B (en) 2019-04-03 2019-04-03 Dangerous object detection method based on millimeter wave image human body context information

Publications (2)

Publication Number Publication Date
CN110245675A CN110245675A (en) 2019-09-17
CN110245675B true CN110245675B (en) 2023-02-10

Family

ID=67883047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910264671.1A Active CN110245675B (en) 2019-04-03 2019-04-03 Dangerous object detection method based on millimeter wave image human body context information

Country Status (1)

Country Link
CN (1) CN110245675B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909604B (en) * 2019-10-23 2024-04-19 深圳市重投华讯太赫兹科技有限公司 Security check image detection method, terminal equipment and computer storage medium
CN110992324B (en) * 2019-11-26 2022-08-26 南京邮电大学 Intelligent dangerous goods detection method and system based on X-ray image
CN110956225B (en) * 2020-02-25 2020-05-29 浙江啄云智能科技有限公司 Contraband detection method and system, computing device and storage medium
CN112597989B * 2020-12-18 2021-09-14 中国科学院上海微系统与信息技术研究所 Millimeter wave three-dimensional holographic image concealed article detection method and system
CN113762029A (en) * 2021-04-08 2021-12-07 北京京东振世信息技术有限公司 Dangerous goods identification method, device, equipment and storage medium
CN113111892B (en) * 2021-05-12 2021-10-22 中国科学院地理科学与资源研究所 Crop planting row extraction method based on unmanned aerial vehicle image
CN113376609B (en) * 2021-06-18 2021-12-31 北京邮电大学 Liquid identification method based on millimeter wave radar
CN116503912B (en) * 2023-06-25 2023-08-25 山东艾克斯智能科技有限公司 Security check early warning method based on electronic graph bag

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101858989A (en) * 2010-05-25 2010-10-13 深圳黎明镒清图像技术有限公司 Ultra-low-dose X-ray human body security inspection system with double exchange platforms
CN106326834A (en) * 2016-07-29 2017-01-11 华讯方舟科技有限公司 Human body gender automatic identification method and apparatus
CN107316058A (en) * 2017-06-15 2017-11-03 国家新闻出版广电总局广播科学研究院 Improve the method for target detection performance by improving target classification and positional accuracy
CN108520219A (en) * 2018-03-30 2018-09-11 台州智必安科技有限责任公司 A kind of multiple dimensioned fast face detecting method of convolutional neural networks Fusion Features

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Practical millimeter-wave holographic imaging system with good robustness; Zhu Yukun et al.; Chinese Optics Letters; 2016-10-31; full text *
Pseudo Lateration: Millimeter-Wave Localization Using a Single RF Chain; Chen J et al.; IEEE Wireless Communications and Networking Conference (WCNC); 2017-12-31; full text *
Human concealed object detection in millimeter wave images based on convolutional neural networks; Luo Shang et al.; Journal of Fudan University (Natural Science); 2018-08-31 (No. 4); full text *
Research on concealed target detection methods for millimeter wave imaging; Sang Xiangxin; CNKI Master's Theses Electronic Journals; 2018-02-15; full text *

Also Published As

Publication number Publication date
CN110245675A (en) 2019-09-17

Similar Documents

Publication Publication Date Title
CN110245675B (en) Dangerous object detection method based on millimeter wave image human body context information
Majid et al. Attention based CNN model for fire detection and localization in real-world images
Zakria et al. Multiscale and direction target detecting in remote sensing images via modified YOLO-v4
CN110298226B (en) Cascading detection method for millimeter wave image human body carried object
CN111079739B (en) Multi-scale attention feature detection method
CN113159120A (en) Contraband detection method based on multi-scale cross-image weak supervision learning
Javadi et al. Vehicle detection in aerial images based on 3D depth maps and deep neural networks
Zhao et al. A deep learning method for oriented and small wheat spike detection (OSWSDet) in UAV images
Hu et al. An infrared target intrusion detection method based on feature fusion and enhancement
Bobrovsky et al. Automatic detection of objects on star sky images by using the convolutional neural network
Cao et al. Learning spatial-temporal representation for smoke vehicle detection
Zhu et al. AOPDet: Automatic organized points detector for precisely localizing objects in aerial imagery
Biswas et al. Small object difficulty (sod) modeling for objects detection in satellite images
Xiao et al. FDLR-Net: A feature decoupling and localization refinement network for object detection in remote sensing images
Pires et al. An efficient cascaded model for ship segmentation in aerial images
Gomes et al. Robust underwater object detection with autonomous underwater vehicle: A comprehensive study
Zhang et al. Nearshore vessel detection based on Scene-mask R-CNN in remote sensing image
Azam et al. Aircraft detection in satellite imagery using deep learning-based object detectors
CN114627339B (en) Intelligent recognition tracking method and storage medium for cross border personnel in dense jungle area
Ajaz et al. Small object detection using deep learning
Shustanov et al. A Method for Traffic Sign Recognition with CNN using GPU.
Zhou et al. A fusion algorithm of object detection and tracking for unmanned surface vehicles
Yasmine et al. Anti-drone systems: An attention based improved YOLOv7 model for a real-time detection and identification of multi-airborne target
Roychowdhury et al. Fast proposals for image and video annotation using modified echo state networks
CN115984568A (en) Target detection method in haze environment based on YOLOv3 network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant