CN110795991A - Mining locomotive pedestrian detection method based on multi-information fusion - Google Patents
- Publication number
- CN110795991A (application number CN201910860797.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- model
- target detection
- training
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
- G06V40/25—Recognition of walking or running movements, e.g. gait recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a mine locomotive pedestrian detection method based on multi-information fusion. A data set of visible-light and infrared pedestrian images captured under a coal mine is established. To cope with the special underground environment, the visible-light images are processed with contrast-limited adaptive histogram equalization (CLAHE) and the infrared images with an optimal notch denoising method based on a 3rd-order Butterworth function. The YOLOv3 target detection network is adjusted and improved with dense connections and a multi-scale pooling structure so that the features of the two image types are extracted and fused, and the loss function is optimized with a cross-entropy function to establish the model, improving its accuracy, real-time performance and stability in the complex underground environment. A transfer learning method further improves the training precision of the network and shortens the training time. The target detection model obtained by training then detects pedestrian targets, and the pedestrian detection result in front of the mining locomotive is output in real time, meeting the detection speed required for mine locomotive operation.
Description
Technical Field
The invention relates to the technical field of underground coal mine detection, in particular to a mine locomotive pedestrian detection method based on multi-information fusion.
Background
With the continued growth of the coal resource market, the underground transportation burden grows heavier. The data show that, owing to dust, poor illumination and other factors of the operating environment, safety accidents caused by mine locomotive transportation account for 20%-30% of all accidents. Transportation accidents may also occur through driver fatigue, improper operation, or rule violations by miners during locomotive operation, seriously endangering miners' lives and causing great losses in production efficiency for the mine.
Generally, installing an object detection device on an underground locomotive is the main means of reducing accidents. In the image acquisition stage, however, a single visible-light sensor is easily affected by lighting, penetrates fine dust particles poorly, and has difficulty adapting to the complex underground environment. An infrared sensor is less affected by dim light and dust and compensates well for these shortcomings of the visible-light sensor. Current image processing techniques, meanwhile, suffer from low processing precision and low processing speed.
TABLE 1 comparison of visible and infrared sensors
Therefore, the invention fully fuses the advantages of visible and infrared light and applies convolutional neural network technology to pedestrian target detection in front of the mine locomotive, in order to prevent locomotive collision accidents.
Disclosure of Invention
In order to solve the above problems, the invention provides a mining locomotive pedestrian detection method based on multi-information fusion, which overcomes the susceptibility of existing visible-light sensors to light and dust in the special underground environment, offers strong adaptability and anti-interference capability, and further improves detection precision and real-time performance.
In order to achieve the purpose, the invention provides a mining locomotive pedestrian detection method based on multi-information fusion, which comprises the following steps:
step 1, acquiring visible-light and infrared videos of pedestrians in front of the mining locomotive, extracting the videos into images, preprocessing the images with contrast-limited adaptive histogram equalization (CLAHE) and the optimal notch denoising method respectively, then labeling the images with the LabelImg software, and expanding the data set with image enhancement methods;
step 2, dividing the data set into a training set, a cross validation set and a test set according to the ratio of 8:1:1, wherein the training set is used for model training, the cross validation set is used for measuring the performance of the model so as to select optimal parameters, and the test set is used for final evaluation of the model; each data set is expanded into a plurality of scales through an image scaling method and is used for subsequent multi-scale training;
step 3, improving the YOLOv3 target detection network by adopting dense connection and a multi-scale pooling structure, and optimizing a loss function of the YOLOv3 target detection network;
step 4, initializing the weight parameters of the first 43 convolutional layers of the improved YOLOv3 target detection network by the transfer learning method, using the pre-trained weight parameters of the first 43 convolutional layers of the original YOLOv3 target detection network;
step 5, adjusting the training parameters, and training the improved YOLOv3 target detection network with the training set;
step 6, selecting a model with the highest detection precision as an optimal model according to the detection result of the cross validation set, and then evaluating the performance of the model by using the test set;
step 7, analyzing the evaluation result, if the performance does not meet the expected requirement, executing the step 5 again, otherwise, directly outputting the trained target detection model;
and 8, detecting the re-acquired visible light and infrared light videos by using the trained target detection model, and outputting a pedestrian detection result in front of the mining locomotive in real time.
Optionally, improving the YOLOv3 target detection network with dense connections includes adding skip connections from the 52 × 52 × 256 feature map in the network so that the adjusted feature map is superimposed on the two subsequent feature maps of 26 × 26 × 512 and 13 × 13 × 512; the 26 × 26 × 512 feature map is likewise superimposed on the two subsequent feature maps of 13 × 13 × 512 and 26 × 26 × 256, while the 13 × 13 × 512 feature map is superimposed on only one subsequent feature map, of 26 × 26 × 256.
Optionally, improving the YOLOv3 target detection network with a multi-scale pooling structure includes extracting 4 feature maps of different sizes from the 13 × 13 × 512, 26 × 26 × 256 and 52 × 52 × 128 feature maps in the network through 4 pooling layers of different scales, combining the context information of the global region and sub-regions, and then merging the 4 feature maps with the original features to form the final feature expression for convolutional output.
Optionally, optimizing the loss function includes defining the class loss with a cross-entropy loss function so that the model fits more easily, i.e. the modified loss function is:

$$
\begin{aligned}
\text{Loss} ={}& \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ (x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2 \right] \\
&+ \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ \left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left(C_i-\hat{C}_i\right)^2 + \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} \left(C_i-\hat{C}_i\right)^2 \\
&- \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left[ p_i(c) \log \hat{p}_i(c) + \left(1-p_i(c)\right) \log\left(1-\hat{p}_i(c)\right) \right]
\end{aligned}
\tag{1}
$$

where S represents the grid size and is 13 × 13, 26 × 26, or 52 × 52, B is the number of candidate boxes, the variables x_i and y_i are the coordinates of the center point of the candidate box, w_i and h_i are the width and height of the bounding box, C_i is the confidence of the predicted object, p_i(c) is the class probability of the object, and hatted variables denote predicted values;

$\mathbb{1}_{ij}^{\mathrm{obj}}$ indicates the presence of an object in bounding box j of grid cell i.
Optionally, expanding each data set to multiple scales with the image scaling method means scaling each image to 10 sizes: {320, 352, 384, 416, 448, 480, 512, 544, 576, 608}.
Optionally, in step 1, the visible-light image is processed with the CLAHE contrast-limited adaptive histogram equalization method, and the infrared image is denoised with the filtering method of an optimal notch filter based on a 3rd-order Butterworth function.
Optionally, denoising the infrared image with the filtering method of the optimal notch filter based on a 3rd-order Butterworth function includes:

first, applying a Fourier transform to the infrared image g(x, y) containing periodic noise to obtain its spectrum image G(u, v);

placing 5 pairs of 3rd-order Butterworth notch band-pass filters H(u, v) at the positions of the noise peaks to extract the principal frequency components of the noise, the filter being expressed mathematically as:

$$
H(u,v) = 1 - \prod_{k=1}^{5} \frac{1}{1+\left[\dfrac{D_k(u,v)\,W}{D_k^2(u,v)-D_0^2}\right]^{2n}} \cdot \frac{1}{1+\left[\dfrac{D_{-k}(u,v)\,W}{D_{-k}^2(u,v)-D_0^2}\right]^{2n}}, \qquad n=3
\tag{2}
$$

where, for a notch with center point (u_k, v_k), D_k(u, v) is the distance from the filter center; the notch symmetric to it about the origin has center point (−u_k, −v_k) and distance D_{−k}(u, v) from the filter center; W is the width of the band, D_0 is the center radius of the frequency band, and k is a natural number;

the spectrum image of the extracted noise can then be represented as:

N(u, v) = H(u, v) G(u, v)   (3);

applying the inverse Fourier transform to this noise spectrum yields the corresponding spatial-domain noise image n(x, y);

weighting the noise with a modulation function w(x, y) and subtracting the modulated noise image from the noisy infrared image in the spatial domain gives an estimate of the denoised image:

$$\hat{f}(x,y)=g(x,y)-w(x,y)\,n(x,y) \tag{4}$$

the modulation function is then chosen to minimize the variance of $\hat{f}$ over a given neighborhood (2a + 1)(2b + 1) of each point (x, y), namely:

$$\sigma^2(x,y)=\frac{1}{(2a+1)(2b+1)}\sum_{s=-a}^{a}\sum_{t=-b}^{b}\left[\hat{f}(x+s,\,y+t)-\bar{\hat{f}}(x,y)\right]^2 \tag{5}$$

setting the partial derivative of the variance with respect to w(x, y) to zero,

$$\frac{\partial\sigma^2(x,y)}{\partial w(x,y)}=0 \tag{6}$$

yields the modulation function:

$$w(x,y)=\frac{\overline{g(x,y)\,n(x,y)}-\bar{g}(x,y)\,\bar{n}(x,y)}{\overline{n^2}(x,y)-\bar{n}^2(x,y)} \tag{7}$$

where the bar denotes the mean over the neighborhood; and finally, the denoised spatial-domain infrared image is obtained through formula (4).
Optionally, in step 5, the improved YOLOv3 target detection network is trained with different learning rates, weight decay coefficients and momentum coefficients, generating 10 different models in total.
Optionally, in step 6, the 10 models are verified on the cross-validation set to obtain their respective loss function values, and the model corresponding to the minimum value is determined as the optimal model.
In addition, the present invention also provides an electronic device including:
a memory for storing a computer program;
and the processor is used for executing the computer program stored in the memory, and when the computer program is executed, the pedestrian detection method for the mining locomotive based on multi-information fusion is realized.
The advantages and beneficial effects of the invention are as follows. Compared with existing mine locomotive pedestrian detection technology, the mine locomotive pedestrian detection method based on multi-information fusion provided by the invention has these advantages:
(1) by utilizing a multi-sensor fusion technology, the defect of a single visible light sensor is overcome, and the method can adapt to complex underground coal mine environments.
(2) For visible light, the CLAHE contrast-limited adaptive histogram equalization method is used to process the visible-light image, effectively alleviating the loss of detail in visible-light images under dim lighting.
(3) For the infrared image, the filtering method of the optimal notch filter based on the 3-order Butterworth function is adopted to carry out denoising processing on the infrared image, and the influence of periodic noise is effectively reduced.
(4) By adopting the idea of dense connection, feature maps at different levels are superimposed via skip connections, realizing feature reuse, effectively reducing the training parameters, improving backpropagation, and making the model easier to train.
(5) By introducing a multi-scale pooling model structure, the network can be combined with feature maps of different levels, the context information of the whole area and the sub-area can be combined, and the detection precision can be effectively improved.
(6) The training precision of the network is improved and the training time is shortened by using the transfer learning technology.
(7) The method lays a foundation for the application of auxiliary driving of the mining locomotive and the like in the future.
Drawings
Fig. 1 is a flow chart of a training phase of a multi-information fusion mining locomotive pedestrian detection method in an embodiment of the invention.
FIG. 2 is a flow chart of a detection phase of a multi-information fusion mining locomotive pedestrian detection method in an embodiment of the invention.
Fig. 3 is a schematic diagram of the convolutional layer of the improved YOLOv3 network structure of the present invention.
Fig. 4 is a schematic diagram of a residual block of the improved YOLOv3 network structure in the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
In one embodiment, as shown in fig. 1, the invention provides a mining locomotive pedestrian detection method based on multi-information fusion, which includes the following steps:
step 1, acquiring visible-light and infrared videos of pedestrians in front of the mining locomotive, extracting the videos into images, preprocessing the images with contrast-limited adaptive histogram equalization (CLAHE) and the optimal notch denoising method respectively, then labeling the images with the LabelImg software, and expanding the data set with image enhancement methods;
step 2, dividing the data set into a training set, a cross validation set and a test set according to the ratio of 8:1:1, wherein the training set is used for model training, the cross validation set is used for measuring the performance of the model so as to select optimal parameters, and the test set is used for final evaluation of the model; each data set is expanded into a plurality of scales through an image scaling method and is used for subsequent multi-scale training;
step 3, improving the YOLOv3 target detection network by adopting dense connection and a multi-scale pooling structure, and optimizing a loss function of the YOLOv3 target detection network;
step 4, initializing the weight parameters of the first 43 convolutional layers of the improved YOLOv3 target detection network by the transfer learning method, using the pre-trained weight parameters of the first 43 convolutional layers of the original YOLOv3 target detection network;
step 5, adjusting the training parameters, and training the improved YOLOv3 target detection network with the training set;
step 6, selecting a model with the highest detection precision as an optimal model according to the detection result of the cross validation set, and then evaluating the performance of the model by using the test set;
step 7, analyzing the evaluation result, if the performance does not meet the expected requirement, executing the step 5 again, otherwise, directly outputting the trained target detection model;
and 8, as shown in fig. 2, detecting the re-acquired visible light and infrared light videos by using the trained target detection model, and outputting a pedestrian detection result in front of the mining locomotive in real time.
Here, the real-time object detection method uses the You Only Look Once (YOLO) network structure of YOLOv3 (v3 denotes version three), which takes the whole image as network input and directly regresses, at the output layer, the positions of the bounding boxes and the categories to which they belong. The improved YOLOv3 network structure consists overall of convolutional layers of 1 × 1 (default stride 1), 3 × 3 and 3 × 3/2 (stride 2), residual blocks, multi-scale average pooling layers, upsampling (Up Sample), and feature map concatenation.
FIG. 3 is a schematic view of a convolutional layer. The convolutional layer consists of a convolution, batch normalization and a Leaky ReLU activation function (a rectified linear unit with leakage). The convolution extracts features, batch normalization improves the convergence efficiency of the algorithm and accelerates fitting, and the Leaky ReLU activation function provides the network's nonlinearity.
Fig. 4 is a schematic diagram of a residual block. The input x passes through two convolutional layers to produce F(x); a skip connection directly adds the input x to this output, forming the final output F(x) + x.
Taking a three-channel 416 × 416 color image as an example, the input image passes through four groups of convolutions (in sequence: 32 3 × 3 convolution kernels, 64 3 × 3/2 kernels, 32 1 × 1 kernels, and 64 3 × 3 kernels), and the resulting 208 × 208 × 64 feature map is calculated according to the following formula:

$$out = \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1 \tag{8}$$

where n denotes the input size, p the number of padding pixels, f the size of the convolution or pooling kernel, s the stride, and out the output size.
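The size formula above can be checked with a short sketch (plain Python, illustrative only; the function name is our own, not from the patent):

```python
def conv_out_size(n, p, f, s):
    """Spatial output size of a convolution or pooling layer:
    out = floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# A 3x3/2 convolution with padding 1 halves a 416x416 input to 208x208,
# matching the 208 x 208 x 64 feature map in the text:
print(conv_out_size(416, 1, 3, 2))  # 208
# A 3x3 convolution with stride 1 and padding 1 preserves the size:
print(conv_out_size(208, 1, 3, 1))  # 208
```

The same formula covers pooling: a global 13 × 13 average pool over a 13 × 13 map gives an output size of 1.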
The obtained 208 × 208 × 64 feature map is then used as input: a 52 × 52 × 256 feature map is obtained through the subsequent 22 groups of convolutional layers, a 26 × 26 × 512 feature map is obtained from it through the next 17 convolutional layers, and a 13 × 13 × 512 feature map through the next 14 convolutional layers.
The 13 × 13 × 512 feature map passes through 5 convolutional layers to give a 13 × 13 × 256 feature map, which is then upsampled to 26 × 26 with a bilinear interpolation algorithm, giving a 26 × 26 × 256 feature map. By the same method, a 52 × 52 × 128 feature map is obtained through 5 convolutional layers and upsampling.
The 52 × 52 × 256 feature map is given skip connections. The first connection reduces its dimensions to 26 × 26 × 128 with 128 3 × 3/2 convolution kernels and superimposes the result on the first subsequent feature map. The second connection reduces it to 13 × 13 × 64 with 128 3 × 3/2 kernels followed by 64 3 × 3/2 kernels and superimposes the result on the second subsequent feature map.
Similarly, the 26 × 26 × 512 feature map is given skip connections, adjusted by convolutional layers, and superimposed on the two subsequent feature maps, while the 13 × 13 × 512 feature map is superimposed on only one subsequent feature map.
The 13 × 13 × 512 feature map is reduced by the multi-scale pooling method: it is fed simultaneously through 4 pooling layers (a global average pooling layer and 7 × 7/6, 5 × 5/4 and 3 × 3/2 average pooling layers), and each result then passes through 128 1 × 1 convolution kernels, giving 4 feature maps. These are uniformly enlarged to 13 × 13 × 128 by upsampling and fused with the earlier 13 × 13 × 512 feature map by concatenation; the fused feature map finally passes through 18 1 × 1 convolution kernels to give an output of size 13 × 13 × 18.
By using the above method for the 26 × 26 × 256 and 52 × 52 × 128 feature maps, 26 × 26 × 18 and 52 × 52 × 18 outputs can be obtained, respectively.
The first two dimensions of an output give the number of grid cells into which the input image is divided. Each grid cell is responsible for detecting objects whose center point falls within it, and with the prior-box method each cell can detect 3 objects simultaneously. Each object carries its size and coordinates (4 values), a confidence and a class probability, i.e. 6 parameters per object, for a total of 3 × 6 = 18 parameters.
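The 18-channel layout per grid cell can be illustrated with a minimal sketch (plain Python; names are our own, not from the patent):

```python
# Each grid cell predicts 3 prior boxes; each box carries 6 values:
# 4 bounding-box parameters (center x, center y, width, height),
# 1 objectness confidence and 1 class probability (pedestrian only),
# so each cell outputs 3 * 6 = 18 channels.
BOXES_PER_CELL = 3
PARAMS_PER_BOX = 4 + 1 + 1

def split_cell(cell):
    """Split one grid cell's flat 18-vector into 3 per-box 6-vectors."""
    assert len(cell) == BOXES_PER_CELL * PARAMS_PER_BOX
    return [cell[i * PARAMS_PER_BOX:(i + 1) * PARAMS_PER_BOX]
            for i in range(BOXES_PER_CELL)]

print(split_cell(list(range(18))))
```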
Instead of the traditional manual selection method, a modified K-means algorithm is used to cluster the data set to obtain the sizes and number of the prior boxes: the original distance metric of the K-means algorithm is replaced with an IOU-based measure according to the following formula, i.e. the distance measure d is 1 minus the intersection-over-union of each bounding box (box) and the cluster-center bounding box (centroid).
d(box,centroid)=1-IOU(box,centroid) (9);
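Formula (9) can be sketched as follows (plain Python; representing boxes as (width, height) pairs aligned at a common corner is our assumption — the usual convention when clustering prior-box sizes — since the patent does not spell it out):

```python
def iou_wh(box, centroid):
    """IoU of two boxes given as (w, h) and aligned at a common corner."""
    w1, h1 = box
    w2, h2 = centroid
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    return inter / union

def distance(box, centroid):
    """Distance measure of formula (9): d = 1 - IOU(box, centroid)."""
    return 1.0 - iou_wh(box, centroid)

print(distance((10, 20), (10, 20)))  # 0.0 - identical boxes cluster together
print(distance((10, 10), (20, 20)))  # 0.75 - IoU is 100/400
```

Similar boxes get a small distance and end up in the same cluster, regardless of their absolute position in the image.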
For each of the 3 outputs, 3 prior boxes are set, for a total of 9 clustered prior-box sizes. In assignment, the larger prior boxes are applied to the smallest, 13 × 13, feature map (largest receptive field) to detect larger objects; the medium prior boxes are applied to the medium, 26 × 26, feature map (medium receptive field) to detect medium-sized objects; and the smaller prior boxes are applied to the larger, 52 × 52, feature map (smaller receptive field) to detect smaller objects.
During gradient descent on the weight parameters, the sum-of-squares function used by the original class loss is likely not convex, i.e. it has multiple local optima and a global optimum is difficult to obtain; the class loss is therefore defined with a cross-entropy function in place of the original sum-of-squares function. The modified loss function is:

$$
\begin{aligned}
\text{Loss} ={}& \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ (x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2 \right] \\
&+ \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ \left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left(C_i-\hat{C}_i\right)^2 + \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} \left(C_i-\hat{C}_i\right)^2 \\
&- \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left[ p_i(c) \log \hat{p}_i(c) + \left(1-p_i(c)\right) \log\left(1-\hat{p}_i(c)\right) \right]
\end{aligned}
\tag{1}
$$

where S represents the grid size and is 13 × 13, 26 × 26, or 52 × 52, B is the number of candidate boxes (3), the variables x_i and y_i are the coordinates of the center point of the candidate box, w_i and h_i are the width and height of the bounding box, C_i is the confidence of the predicted object, p_i(c) is the class probability of the object, and hatted variables denote predicted values;

$\mathbb{1}_{i}^{\mathrm{obj}}$ indicates the presence of an object in grid cell i, meaning that grid cells containing an object account for the error;

$\mathbb{1}_{ij}^{\mathrm{obj}}$ indicates the presence of an object in bounding box j of grid cell i, meaning that only the bounding box "responsible" for the prediction (with relatively large overlap) accounts for the error;

$\mathbb{1}_{ij}^{\mathrm{noobj}}$ indicates that no object is present in bounding box j of grid cell i.
Since some large objects, or objects near the boundary of several grid cells, can be detected by multiple cells at the same time, multiple bounding boxes are output; the repeated outputs are filtered here with the non-maximum suppression method.
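The non-maximum suppression step can be sketched as a minimal greedy implementation (plain Python; the function names and the 0.5 threshold are our own choices, not taken from the patent):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop every remaining box
    that overlaps it above the threshold, then repeat on what is left."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Two near-duplicate detections of one pedestrian plus one distinct box:
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```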
The visible-light and infrared pedestrian videos recorded in front of the mining locomotive are extracted into images. The visible-light images are processed with the CLAHE contrast-limited adaptive histogram equalization method: the image is divided into an 8 × 8 grid of tiles, histogram equalization is performed on each tile separately, and finally, to remove the tile boundaries introduced by the algorithm, the tiles are stitched together with a bilinear interpolation algorithm.
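As a rough sketch of the per-tile equalization step (plain histogram equalization only; the contrast clipping and the bilinear stitching between tiles that make this CLAHE are omitted, and the function name is our own):

```python
def equalize_tile(tile, levels=256):
    """Histogram-equalize one tile, given as a flat list of gray values.
    Maps each value v to round((cdf(v) - cdf_min) / (N - cdf_min) * (L-1)),
    spreading the tile's gray levels over the full range."""
    hist = [0] * levels
    for v in tile:
        hist[v] += 1
    cdf, total = [], 0
    for h in hist:
        total += h
        cdf.append(total)
    cdf_min = min(c for c in cdf if c > 0)
    n = len(tile)
    if n == cdf_min:           # constant tile: nothing to spread
        return tile[:]
    return [round((cdf[v] - cdf_min) / (n - cdf_min) * (levels - 1))
            for v in tile]

# Four closely spaced dark values are spread across the full 0..255 range:
print(equalize_tile([52, 55, 61, 59]))  # [0, 85, 255, 170]
```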
Because the infrared image is easily disturbed by periodic noise, it is denoised with the filtering method of the optimal notch filter based on a 3rd-order Butterworth function, namely:
First, a Fourier transform is applied to the infrared image g(x, y) containing periodic noise to obtain its spectrum image G(u, v).

According to the noise characteristics of the spectrum image, 5 pairs of 3rd-order Butterworth notch band-pass filters H(u, v) are placed at the positions of the noise peaks to extract the principal frequency components of the noise. The filter is expressed mathematically as:

$$
H(u,v) = 1 - \prod_{k=1}^{5} \frac{1}{1+\left[\dfrac{D_k(u,v)\,W}{D_k^2(u,v)-D_0^2}\right]^{2n}} \cdot \frac{1}{1+\left[\dfrac{D_{-k}(u,v)\,W}{D_{-k}^2(u,v)-D_0^2}\right]^{2n}}, \qquad n=3
\tag{2}
$$

where, for a notch with center point (u_k, v_k), D_k(u, v) is the distance from the filter center; the notch symmetric to it about the origin has center point (−u_k, −v_k) and distance D_{−k}(u, v) from the filter center; W is the width of the band, D_0 is the center radius of the frequency band, and k is a natural number.

The spectrum image of the extracted noise can be represented as:

N(u, v) = H(u, v) G(u, v)   (3);

applying the inverse Fourier transform to this noise spectrum yields the corresponding spatial-domain noise image n(x, y).

Since this process usually yields only an approximation of the noise, the noise is weighted with a modulation function w(x, y), and the modulated noise image is subtracted from the noisy infrared image in the spatial domain to obtain an estimate of the denoised image:

$$\hat{f}(x,y)=g(x,y)-w(x,y)\,n(x,y) \tag{4}$$

The modulation function is then chosen to minimize the variance of $\hat{f}$ over a given neighborhood (2a + 1)(2b + 1) of each point (x, y), namely:

$$\sigma^2(x,y)=\frac{1}{(2a+1)(2b+1)}\sum_{s=-a}^{a}\sum_{t=-b}^{b}\left[\hat{f}(x+s,\,y+t)-\bar{\hat{f}}(x,y)\right]^2 \tag{5}$$

Setting the partial derivative of the variance with respect to w(x, y) to zero,

$$\frac{\partial\sigma^2(x,y)}{\partial w(x,y)}=0 \tag{6}$$

yields the modulation function:

$$w(x,y)=\frac{\overline{g(x,y)\,n(x,y)}-\bar{g}(x,y)\,\bar{n}(x,y)}{\overline{n^2}(x,y)-\bar{n}^2(x,y)} \tag{7}$$

where the bar denotes the mean over the neighborhood. Finally, the denoised spatial-domain infrared image is obtained through formula (4).
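The neighborhood computation of formula (7) can be sketched as follows (plain Python over one flattened neighborhood; illustrative only, names are our own):

```python
def local_mean(vals):
    return sum(vals) / len(vals)

def modulation_weight(g, n):
    """Optimal modulation weight over one (2a+1)(2b+1) neighborhood:
    w = (mean(g*n) - mean(g)*mean(n)) / (mean(n^2) - mean(n)^2),
    the value that minimizes the local variance of f = g - w*n."""
    gn = [gi * ni for gi, ni in zip(g, n)]
    nn = [ni * ni for ni in n]
    return ((local_mean(gn) - local_mean(g) * local_mean(n))
            / (local_mean(nn) - local_mean(n) ** 2))

# If the noisy neighborhood is exactly signal + 0.5 * noise, the optimal
# weight recovers 0.5, and f = g - w*n restores the constant signal:
signal = [3.0, 3.0, 3.0, 3.0]
noise = [1.0, -1.0, 2.0, -2.0]
g = [s + 0.5 * x for s, x in zip(signal, noise)]
w = modulation_weight(g, noise)
print(w)  # 0.5
```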
The data set is divided into a training set, a cross-validation set and a test set at 8:1:1, and each data set is expanded to 10 sizes by the image scaling method: {320, 352, 384, 416, 448, 480, 512, 544, 576, 608}, i.e. in increments of 32.
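The 8:1:1 split and the 10 training sizes can be sketched as follows (plain Python; the seed and function name are our own):

```python
import random

def split_dataset(samples, seed=0):
    """Shuffle and split into training / cross-validation / test at 8:1:1."""
    rng = random.Random(seed)
    samples = samples[:]
    rng.shuffle(samples)
    n = len(samples)
    n_val = n // 10
    n_test = n // 10
    n_train = n - n_val - n_test
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

# The 10 training sizes from 320 to 608 in increments of 32:
SIZES = list(range(320, 609, 32))
print(SIZES)  # [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]

train, val, test = split_dataset(list(range(1000)))
print(len(train), len(val), len(test))  # 800 100 100
```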
The weight parameters of the first 43 convolutional layers of the improved YOLOv3 target detection network are initialized by the transfer learning method, using the pre-trained weight parameters of the first 43 convolutional layers of YOLOv3.
The Adam optimization algorithm is adopted, with the number of iterations set to 100, 64 samples per iteration, a momentum coefficient of 0.9 and a learning-rate decay coefficient of 0.0005. Every 10 iterations a new input size is randomly selected for training (10 sizes in total). When all iterations are completed, one training run is finished and one model is obtained; the parameters are then modified and training repeated until 10 models in total have been output.
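The multi-scale schedule described above, a new input size drawn at random every 10 iterations, can be sketched as (plain Python; the seed is an arbitrary choice of ours):

```python
import random

SIZES = list(range(320, 609, 32))  # the 10 training sizes

def size_schedule(total_iters=100, switch_every=10, seed=0):
    """Pick a new random input size every `switch_every` iterations."""
    rng = random.Random(seed)
    schedule = []
    size = None
    for it in range(total_iters):
        if it % switch_every == 0:
            size = rng.choice(SIZES)
        schedule.append(size)
    return schedule

sched = size_schedule()
print(len(sched))  # 100
```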
The 10 models are verified on the cross-validation set to obtain their respective loss function values, and the model with the minimum value is determined to be the optimal model. The loss function value of this model is then calculated on the test set; if it fails to meet the expected requirement, the parameters are modified and the model retrained, otherwise the model is directly output as the target detection model.
The acquired visible-light and infrared videos are extracted into images, which are respectively processed with CLAHE contrast-limited adaptive histogram equalization and denoised with the filtering method of the optimal notch filter based on the 3rd-order Butterworth function; the images are then detected with the target detection model obtained by training, and the pedestrian detection result in front of the mining locomotive is output in real time.
In addition, the present invention also provides an electronic device including:
a memory for storing a computer program;
a processor for executing the computer program stored in the memory, and when the computer program is executed, implementing a mining locomotive pedestrian detection method based on multi-information fusion, at least comprising the following steps:
step 1, acquiring visible-light and infrared videos of pedestrians in front of the mining locomotive, extracting the videos into images, preprocessing the images with contrast-limited adaptive histogram equalization (CLAHE) and the optimal notch denoising method respectively, then labeling the images with the LabelImg software, and expanding the data set with image enhancement methods;
step 2, dividing the data set into a training set, a cross validation set and a test set according to the ratio of 8:1:1, wherein the training set is used for model training, the cross validation set is used for measuring the performance of the model so as to select optimal parameters, and the test set is used for final evaluation of the model; each data set is expanded into a plurality of scales through an image scaling method and is used for subsequent multi-scale training;
step 3, improving the YOLOv3 target detection network by adopting dense connection and a multi-scale pooling structure, and optimizing a loss function of the YOLOv3 target detection network;
step 4, initializing the first 43 convolutional layers of the improved YOLOv3 target detection network with the trained weight parameters of the first 43 convolutional layers of the original YOLOv3 network, using a transfer learning method;
step 5, adjusting training parameters, and training the improved YOLOv3 target detection network by using the training set;
step 6, selecting an optimal target detection model according to the detection result of the cross validation set, and then evaluating the performance of the model by using the test set;
step 7, analyzing the evaluation result, if the performance does not meet the expected requirement, executing the step 5 again, otherwise, directly outputting the trained target detection model;
and 8, detecting the re-acquired visible light and infrared light videos by using the trained target detection model, and outputting a pedestrian detection result in front of the mining locomotive in real time.
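The data preparation in steps 1–2 amounts to an 8:1:1 split plus a multi-scale expansion; a minimal sketch (a stand-in on sample identifiers only — labeling and image enhancement are omitted, and the seed is an illustrative choice):

```python
import random

def split_dataset(samples, ratios=(8, 1, 1), seed=42):
    """Split a list of samples into train / cross-validation / test sets
    at the 8:1:1 ratio of step 2."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# The 10 training scales used for multi-scale training (step 2 / claim 5):
# multiples of 32 from 320 to 608, matching YOLOv3's 32-pixel stride.
SCALES = [320 + 32 * i for i in range(10)]
```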
Alternatively, the electronic device may be a server or a personal computer, or the like.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the invention have been described with reference to the above specific embodiments, it is to be understood that the invention is not limited to the specific embodiments disclosed, nor is the division into aspects limiting, as that division is for convenience only and the features in these aspects can be combined to advantage. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Claims (10)
1. A mining locomotive pedestrian detection method based on multi-information fusion is characterized by comprising the following steps:
step 1, acquiring visible light and infrared light videos of pedestrians in front of a mining locomotive, extracting the videos into images, preprocessing the images respectively with CLAHE contrast-limited adaptive histogram equalization and an optimal notch denoising method, then labeling the images with LabelImg software, and expanding the data set with an image enhancement method;
step 2, dividing the data set into a training set, a cross validation set and a test set according to the ratio of 8:1:1, wherein the training set is used for model training, the cross validation set is used for measuring the performance of the model so as to select optimal parameters, and the test set is used for final evaluation of the model; each data set is expanded into a plurality of scales through an image scaling method and is used for subsequent multi-scale training;
step 3, improving the YOLOv3 target detection network by adopting dense connection and a multi-scale pooling structure, and optimizing a loss function of the YOLOv3 target detection network;
step 4, initializing the first 43 convolutional layers of the improved YOLOv3 target detection network with the trained weight parameters of the first 43 convolutional layers of the original YOLOv3 network, using a transfer learning method;
step 5, adjusting training parameters, and training the improved YOLOv3 target detection network by using the training set;
step 6, selecting a model with the highest detection precision as an optimal model according to the detection result of the cross validation set, and then evaluating the performance of the model by using the test set;
step 7, analyzing the evaluation result, if the performance does not meet the expected requirement, executing the step 5 again, otherwise, directly outputting the trained target detection model;
and 8, detecting the re-acquired visible light and infrared light videos by using the trained target detection model, and outputting a pedestrian detection result in front of the mining locomotive in real time.
2. The method of claim 1, wherein improving the YOLOv3 target detection network with dense connection comprises performing skip connections on the 52 × 52 × 256 feature map in the network, so that after adjustment it is superimposed with the two subsequent feature maps of 26 × 26 × 512 and 13 × 13 × 512; the 26 × 26 × 512 feature map is then superimposed with the two subsequent feature maps of 13 × 13 × 512 and 26 × 26 × 256, while the 13 × 13 × 512 feature map is superimposed with only one subsequent feature map of 26 × 26 × 256.
3. The method of claim 1, wherein improving the YOLOv3 target detection network with a multi-scale pooling structure comprises extracting 4 feature maps of different sizes from the 13 × 13 × 512, 26 × 26 × 256 and 52 × 52 × 128 feature maps in the network through 4 pooling layers of different scales, combining the context information of the global region and the sub-regions, and then merging the 4 feature maps with the original features to form the final feature expression for the convolutional output.
4. The method of claim 1, wherein optimizing the loss function comprises defining class losses using a cross-entropy loss function to make the model easier to fit, i.e., the modified loss function is as follows:
where S denotes the grid size, so that S × S is 13 × 13, 26 × 26 or 52 × 52; B is the number of candidate boxes; the variables x_i and y_i are the coordinates of the center point of the candidate box; w_i and h_i are the width and height of the bounding box, respectively; C_i is the confidence of the predicted object; p_i(c) is the class probability of the object; a hatted variable denotes the corresponding predicted value; and 1_ij^obj indicates the presence of an object in bounding box j of grid cell i;
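The loss-function expression of claim 4 appears only as an image in the original publication. The following is a hedged reconstruction, not the patent's exact formula: a standard YOLO-style loss in which the confidence and class terms use cross-entropy, consistent with the variables defined in the claim (λ_coord and λ_noobj are the usual balancing weights, assumed here):

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B}
  \mathbb{1}_{ij}^{\text{obj}}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B}
  \mathbb{1}_{ij}^{\text{obj}}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2
  +\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&- \sum_{i=0}^{S^2}\sum_{j=0}^{B}
  \mathbb{1}_{ij}^{\text{obj}}\left[\hat{C}_i\log C_i+(1-\hat{C}_i)\log(1-C_i)\right] \\
&- \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B}
  \mathbb{1}_{ij}^{\text{noobj}}\left[\hat{C}_i\log C_i+(1-\hat{C}_i)\log(1-C_i)\right] \\
&- \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
  \sum_{c\in\text{classes}}\left[\hat{p}_i(c)\log p_i(c)
  +(1-\hat{p}_i(c))\log(1-p_i(c))\right]
\end{aligned}
```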
5. The method of claim 1, wherein expanding each data set to multiple scales by an image scaling method comprises scaling each image to 10 sizes: {320, 352, 384, 416, 448, 480, 512, 544, 576, 608}.
6. The method of claim 1, wherein in step 1, the visible light images are processed with the CLAHE contrast-limited adaptive histogram equalization method, and the infrared light images are denoised with a filtering method based on an optimal notch filter built on a 3rd-order Butterworth function.
7. The method of claim 6, wherein denoising the infrared image with the filtering method based on an optimal notch filter built on a 3rd-order Butterworth function comprises:
firstly, applying a Fourier transform to the infrared light image g(x, y) containing periodic noise to obtain its frequency spectrum image G(u, v);
a5-pair 3-order Butterworth notch band-pass filter H (u, v) is placed at the position of a noise peak and used for extracting a main frequency part of noise, and the mathematical expression of the filter is as follows:
wherein, for a notch with center point coordinates (u_k, v_k), D_k(u, v) is its distance from the center of the filter; the notch symmetric to it about the origin has center point coordinates (-u_k, -v_k) and distance D_-k(u, v) from the center of the filter; W is the width of the band, D_0 is the center radius of the band, and k is a natural number;
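The filter expression itself is an image in the original publication. One plausible reconstruction consistent with the variables just defined — a hedged sketch, not the patent's exact formula — takes the notch band-pass as the complement of a product of 3rd-order Butterworth band-reject notch pairs:

```latex
H(u,v) \;=\; 1 \;-\; \prod_{k=1}^{5}
\frac{1}{1+\left[\dfrac{D_k(u,v)\,W}{D_k^2(u,v)-D_0^2}\right]^{2n}}
\cdot
\frac{1}{1+\left[\dfrac{D_{-k}(u,v)\,W}{D_{-k}^2(u,v)-D_0^2}\right]^{2n}},
\qquad n = 3,
```

with, for a spectrum centered at the origin, \(D_k(u,v)=\sqrt{(u-u_k)^2+(v-v_k)^2}\) and \(D_{-k}(u,v)=\sqrt{(u+u_k)^2+(v+v_k)^2}\).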
the spectral image of the extracted noise can be represented as:
N(u,v)=H(u,v)G(u,v) (3);
and obtaining the corresponding spatial-domain noise image n(x, y) by an inverse Fourier transform of the noise spectrum image:
weighting the noise with a modulation function w(x, y), and then subtracting the modulated noise image from the noisy infrared image in the spatial domain to obtain an estimate of the denoised image;
The modulation function w(x, y) is then chosen so as to minimize the variance of this estimate over a given neighborhood (2a + 1)(2b + 1) of each point (x, y), namely:
the second derivative is made to be zero, and then the modulation function can be obtained:
and finally, obtaining the denoised spatial-domain infrared image from the estimate above.
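The optimal-notch procedure of this claim can be sketched in NumPy. This is a minimal sketch under stated assumptions: an ideal notch mask stands in for the Butterworth band-pass filter, and the peak positions, window size, and test signal are illustrative, not values from the patent:

```python
import numpy as np

def box_mean(a, k=5):
    """Local mean over a k x k neighborhood, i.e. the (2a+1)(2b+1)
    window of the claim with a = b = k // 2, via an integral image."""
    pad = k // 2
    ap = np.pad(a, pad, mode="edge")
    c = np.cumsum(np.cumsum(ap, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))  # zero row/column for the integral image
    s = c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]
    return s / (k * k)

def optimal_notch_denoise(g, peaks, radius=2, k=5):
    """Extract periodic noise n(x, y) through a notch band-pass mask,
    then subtract the w(x, y)-modulated noise from the noisy image g."""
    G = np.fft.fftshift(np.fft.fft2(g))
    H = np.zeros(G.shape, dtype=float)
    for (pu, pv) in peaks:  # pass-bands around each noise peak (and its mirror)
        H[pu - radius:pu + radius + 1, pv - radius:pv + radius + 1] = 1.0
    n = np.real(np.fft.ifft2(np.fft.ifftshift(H * G)))  # spatial-domain noise
    # Variance-minimizing modulation: w = (<gn> - <g><n>) / (<n^2> - <n>^2)
    w = (box_mean(g * n, k) - box_mean(g, k) * box_mean(n, k)) / \
        (box_mean(n * n, k) - box_mean(n, k) ** 2 + 1e-12)
    return g - w * n  # estimate of the denoised image
```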
8. The method of claim 1, wherein in step 5, the improved YOLOv3 target detection network is trained with different settings of the learning rate, weight attenuation coefficient and momentum coefficient, thereby generating 10 different models.
9. The method of claim 8, wherein in step 6, the 10 models are validated on the cross-validation set to obtain their respective loss function values, and the model with the minimum value is identified as the optimal model.
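Claims 8–9 describe a small hyperparameter sweep with selection by cross-validation loss. A minimal sketch with stand-in train/evaluate functions — the grid values and the toy loss criterion are illustrative, not those of the patent:

```python
import itertools

# Illustrative hyperparameter grid producing 10 candidate settings (2 x 1 x 5).
learning_rates = [1e-3, 5e-4]
weight_decays = [5e-4]
momentums = [0.8, 0.85, 0.9, 0.95, 0.99]

def train_model(lr, wd, momentum):
    """Stand-in for training the improved YOLOv3 network with one setting."""
    return {"lr": lr, "wd": wd, "momentum": momentum}

def val_loss(model):
    """Stand-in for the loss function value on the cross-validation set."""
    return abs(model["lr"] - 5e-4) + abs(model["momentum"] - 0.9)

models = [train_model(lr, wd, m)
          for lr, wd, m in itertools.product(learning_rates, weight_decays, momentums)]
best = min(models, key=val_loss)  # claim 9: minimum cross-validation loss wins
```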
10. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing a computer program stored in the memory, and when executed, implementing the method of any of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910860797.5A CN110795991B (en) | 2019-09-11 | 2019-09-11 | Mining locomotive pedestrian detection method based on multi-information fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110795991A true CN110795991A (en) | 2020-02-14 |
CN110795991B CN110795991B (en) | 2023-03-31 |
Family
ID=69427241
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910860797.5A Active CN110795991B (en) | 2019-09-11 | 2019-09-11 | Mining locomotive pedestrian detection method based on multi-information fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110795991B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190265714A1 (en) * | 2018-02-26 | 2019-08-29 | Fedex Corporate Services, Inc. | Systems and methods for enhanced collision avoidance on logistics ground support equipment using multi-sensor detection fusion |
CN109815886A (en) * | 2019-01-21 | 2019-05-28 | 南京邮电大学 | A kind of pedestrian and vehicle checking method and system based on improvement YOLOv3 |
CN109934121A (en) * | 2019-02-21 | 2019-06-25 | 江苏大学 | A kind of orchard pedestrian detection method based on YOLOv3 algorithm |
CN109919058A (en) * | 2019-02-26 | 2019-06-21 | 武汉大学 | A kind of multisource video image highest priority rapid detection method based on Yolo V3 |
Non-Patent Citations (2)
Title |
---|
Wang Dianwei et al.: "Improved YOLOv3 pedestrian detection algorithm for infrared video images", Journal of Xi'an University of Posts and Telecommunications * |
Tan Kangxia et al.: "Pedestrian detection method for infrared images based on the YOLO model", Laser & Infrared * |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111382683A (en) * | 2020-03-02 | 2020-07-07 | 东南大学 | Target detection method based on feature fusion of color camera and infrared thermal imager |
CN111553289A (en) * | 2020-04-29 | 2020-08-18 | 中国科学院空天信息创新研究院 | Remote sensing image cloud detection method and system |
CN111898427A (en) * | 2020-06-22 | 2020-11-06 | 西北工业大学 | Multispectral pedestrian detection method based on feature fusion deep neural network |
CN111832489A (en) * | 2020-07-15 | 2020-10-27 | 中国电子科技集团公司第三十八研究所 | Subway crowd density estimation method and system based on target detection |
CN112070111A (en) * | 2020-07-28 | 2020-12-11 | 浙江大学 | Multi-target detection method and system adaptive to multiband images |
CN112070111B (en) * | 2020-07-28 | 2023-11-28 | 浙江大学 | Multi-target detection method and system adapting to multi-band image |
CN111950475A (en) * | 2020-08-15 | 2020-11-17 | 哈尔滨理工大学 | Yalhe histogram enhancement type target recognition algorithm based on yoloV3 |
CN111986240A (en) * | 2020-09-01 | 2020-11-24 | 交通运输部水运科学研究所 | Drowning person detection method and system based on visible light and thermal imaging data fusion |
CN112183265A (en) * | 2020-09-17 | 2021-01-05 | 国家电网有限公司 | Electric power construction video monitoring and alarming method and system based on image recognition |
CN112287839A (en) * | 2020-10-29 | 2021-01-29 | 广西科技大学 | SSD infrared image pedestrian detection method based on transfer learning |
CN112528934A (en) * | 2020-12-22 | 2021-03-19 | 燕山大学 | Improved YOLOv3 traffic sign detection method based on multi-scale feature layer |
CN112418358A (en) * | 2021-01-14 | 2021-02-26 | 苏州博宇鑫交通科技有限公司 | Vehicle multi-attribute classification method for strengthening deep fusion network |
CN112989924A (en) * | 2021-01-26 | 2021-06-18 | 深圳市优必选科技股份有限公司 | Target detection method, target detection device and terminal equipment |
CN112989924B (en) * | 2021-01-26 | 2024-05-24 | 深圳市优必选科技股份有限公司 | Target detection method, target detection device and terminal equipment |
CN114529879A (en) * | 2022-02-08 | 2022-05-24 | 安徽理工大学 | Real-time detection method for road conditions of electric locomotive in mine based on YOLOv4-Tiny |
CN114529879B (en) * | 2022-02-08 | 2024-07-16 | 安徽理工大学 | Mine electric locomotive road condition real-time detection method based on YOLOv-Tiny |
CN115311241B (en) * | 2022-08-16 | 2024-04-23 | 天地(常州)自动化股份有限公司 | Underground coal mine pedestrian detection method based on image fusion and feature enhancement |
CN115311241A (en) * | 2022-08-16 | 2022-11-08 | 天地(常州)自动化股份有限公司 | Coal mine down-hole person detection method based on image fusion and feature enhancement |
CN116664604B (en) * | 2023-07-31 | 2023-11-03 | 苏州浪潮智能科技有限公司 | Image processing method and device, storage medium and electronic equipment |
CN116664604A (en) * | 2023-07-31 | 2023-08-29 | 苏州浪潮智能科技有限公司 | Image processing method and device, storage medium and electronic equipment |
CN117315453A (en) * | 2023-11-21 | 2023-12-29 | 南开大学 | Underwater small target detection method based on underwater sonar image |
CN117315453B (en) * | 2023-11-21 | 2024-02-20 | 南开大学 | Underwater small target detection method based on underwater sonar image |
CN117830141A (en) * | 2024-03-04 | 2024-04-05 | 奥谱天成(成都)信息科技有限公司 | Method, medium, equipment and device for removing vertical stripe noise of infrared image |
CN117830141B (en) * | 2024-03-04 | 2024-05-03 | 奥谱天成(成都)信息科技有限公司 | Method, medium, equipment and device for removing vertical stripe noise of infrared image |
Also Published As
Publication number | Publication date |
---|---|
CN110795991B (en) | 2023-03-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110795991B (en) | Mining locomotive pedestrian detection method based on multi-information fusion | |
CN110956126B (en) | Small target detection method combined with super-resolution reconstruction | |
CN103870818B (en) | Smog detection method and device | |
CN113554089A (en) | Image classification countermeasure sample defense method and system and data processing terminal | |
KR101533925B1 (en) | Method and apparatus for small target detection in IR image | |
CN103226832B (en) | Based on the multi-spectrum remote sensing image change detecting method of spectral reflectivity mutation analysis | |
CN106251344A (en) | A kind of multiple dimensioned infrared target self-adapting detecting method of view-based access control model receptive field | |
JP7327077B2 (en) | Road obstacle detection device, road obstacle detection method, and road obstacle detection program | |
CN112862845A (en) | Lane line reconstruction method and device based on confidence evaluation | |
CN112307984B (en) | Safety helmet detection method and device based on neural network | |
CN109117746A (en) | Hand detection method and machine readable storage medium | |
CN105590301A (en) | Impulse noise elimination method of self-adaption normal-inclined double cross window mean filtering | |
KR101993085B1 (en) | Semantic image segmentation method based on deep learing | |
EP3671635B1 (en) | Curvilinear object segmentation with noise priors | |
CN114266894A (en) | Image segmentation method and device, electronic equipment and storage medium | |
CN110956602B (en) | Method and device for determining change area and storage medium | |
CN117333459A (en) | Image tampering detection method and system based on double-order attention and edge supervision | |
CN115170824A (en) | Change detection method for enhancing Siamese network based on space self-adaption and characteristics | |
CN115358952A (en) | Image enhancement method, system, equipment and storage medium based on meta-learning | |
Wu et al. | Pyramid edge detection based on stack filter | |
CN101739670A (en) | Non-local mean space domain time varying image filtering method | |
CN104616034B (en) | A kind of smog detection method | |
CN109785312B (en) | Image blur detection method and system and electronic equipment | |
CN115984712A (en) | Multi-scale feature-based remote sensing image small target detection method and system | |
CN115829875A (en) | Anti-patch generation method and device for non-shielding physical attack |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |