CN110472542A

CN110472542A - Infrared image pedestrian detection method and detection system based on deep learning

Info

Publication number: CN110472542A
Application number: CN201910716970.4A
Authority: CN
Inventors: 孙立坤; 林保均; 王忠荣; 焦玉海; 吕建峰; 时文忠
Original assignee: Shenzhen Beidou Communications Technology Co Ltd
Current assignee: Shenzhen Beidou Communications Technology Co Ltd
Priority date: 2019-08-05
Filing date: 2019-08-05
Publication date: 2019-11-19

Abstract

The present invention provides an infrared image pedestrian detection method and detection system based on deep learning, belonging to the technical field of computer vision. The infrared image pedestrian detection method comprises the following steps: acquiring and preprocessing data; constructing a target detection FIDN network based on a convolutional neural network; training the model; and predicting with the optimal model. The invention also provides a detection system that implements the infrared image pedestrian detection method. The beneficial effects of the invention are that it ensures high precision while meeting real-time requirements, and that it is highly robust.

Description

Infrared image pedestrian detection method and detection system based on deep learning

Technical field

The present invention relates to an image detection method, and more particularly to an infrared image pedestrian detection method and detection system based on deep learning.

Background art

Target detection is an important topic in computer vision. Its main task is to locate the targets of interest in an image, which requires accurately determining the category of each target and providing a bounding box for each. Factors such as viewing angle, occlusion, and pose deform the target, which makes target detection a challenging task.

Traditional target detection methods consist mainly of six steps: preprocessing, window sliding, feature extraction, feature selection, feature classification, and post-processing. They generally rely on well-designed handcrafted features that are then classified with a classifier. As the accuracy and speed demanded of target detection keep rising, traditional methods can no longer meet the requirements. In recent years, deep learning has been widely adopted and has produced a series of target detection algorithms, such as RCNN, Fast-RCNN, Faster-RCNN, YOLO, SSD, and their derivatives, but these techniques cannot be applied well in commercial products, either because their precision is low or because detection takes too long. Current target detection algorithms therefore struggle to meet the needs of practical applications. In research, most researchers focus only on detection precision (measured by mAP, mean Average Precision): they design very complex networks, add complicated methods and training tricks, and obtain good results on public datasets, but such approaches are difficult to apply directly in practice.

Infrared imaging obtains images through the thermal imaging of an infrared sensor and depends only on the temperature of an object and the heat it radiates. Consequently, when light is insufficient, such as at night, in rain, or in haze, infrared images have a clear advantage over visible-light images. As the main and most active element in a scene, the human body has long been a research focus in target tracking and detection. The non-rigid nature of the human body, together with the inherent shortcomings of infrared images, makes pedestrian detection based on infrared images difficult and challenging.

Summary of the invention

To solve the problems of the prior art, the present invention provides an infrared image pedestrian detection method and detection system based on deep learning, which ensure high precision while meeting real-time requirements.

The infrared image pedestrian detection method based on deep learning of the present invention comprises the following steps:

Step S1: data acquisition and data preprocessing: acquire infrared images containing pedestrians, preprocess the infrared images, manually annotate the preprocessed infrared images, and then divide them into a training set and a validation set for the detection model according to a set ratio;

Step S2: construct the target detection FIDN network based on a convolutional neural network: the target detection FIDN network comprises several convolutional layers and max pooling layers, and dilated convolutional layers arranged after the convolutional layers and max pooling layers; in the stacking of convolutional layers, once the number of channels reaches a set value, the number of channels of the dilated convolutional layers is no longer increased;

Step S3: model training: train the target detection FIDN network on the training set, and select the optimal model, that is, the one that performs best on the validation set;

Step S4: optimal model prediction: based on the optimal model, run prediction on a GPU server to perform target detection on a video stream.

As a further improvement of the present invention, in step S2 the target detection FIDN network further comprises a self-adaptive feature map channel weighting module, arranged at the output of the dilated convolutional layer, for weighting the channels of the feature map output by the dilated convolutional layer.

As a further improvement of the present invention, the processing method of the self-adaptive feature map channel weighting module is:

A1: use a global pooling layer to compress the feature map to 1*1*C, where C is the number of channels of the feature map;

A2: use a fully connected layer to compress the number of channels to C/16;

A3: after a ReLU activation function, use a fully connected layer to restore the number of channels to C;

A4: pass the output through a sigmoid activation layer to obtain a 1*1*C weight vector; after the sigmoid function, each weight in the vector lies between 0 and 1;

A5: weight the feature map along the channel dimension using the weights.

As a further improvement of the present invention, in step S1 the preprocessing includes median filtering, with the following formula:

g(x, y) = median{ f(x - k, y - l), (k, l) ∈ W }

where f(x, y) and g(x, y) are the original image and the processed image respectively, and W is a two-dimensional template.

As a further improvement of the present invention, manual annotation means using an annotation tool to outline every pedestrian in each picture with a rectangular box, the box being the minimum bounding rectangle of the target pedestrian, and generating a corresponding XML file. The XML file records the coordinates of each target in the picture, comprising the top-left x coordinate, the top-left y coordinate, the width w, and the height h. Pictures that are blurred or difficult to annotate are deleted. The resulting data are mixed and divided at a ratio of 9:1 into the training set and the validation set of the detection model.

As a further improvement of the present invention, in step S2 the target detection FIDN network is a fully convolutional network consisting of 7 layers of 1*1 or 3*3 convolutions. The candidate boxes are generated directly on the original image, as follows:

B1: divide the original image directly into S*S regions, where S is the size of the feature map of the last convolution;

B2: generate several candidate boxes with different aspect ratios in each region; the specific aspect ratios are obtained by applying the k-means algorithm to the rectangular boxes labeled in the dataset;

B3: compute the size distribution of the prior candidate boxes from the actual dataset, using (1-IoU) as the distance metric, where IoU is the intersection over union of the areas of a prior candidate box and a labeled rectangular box, calculated as follows:

IoU(A, B) = |A ∩ B| / |A ∪ B|

where A is the prior candidate box, B is the labeled rectangular box, ∩ denotes the intersection of A and B, and ∪ denotes the union of A and B.
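The clustering in B2 and B3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: boxes are reduced to (width, height) pairs and compared as if they shared a corner, the centroid update is a plain mean, and the function names `iou_wh` and `kmeans_anchors` and the sample box sizes are hypothetical.

```python
import random


def iou_wh(box, centroid):
    """IoU of two (width, height) boxes aligned at a shared corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=100, seed=0):
    """Cluster (width, height) pairs using d = 1 - IoU as the distance."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            distances = [1.0 - iou_wh(box, c) for c in centroids]
            clusters[distances.index(min(distances))].append(box)
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if not members:
                new_centroids.append(old)  # keep the centroid of an empty cluster
                continue
            mean_w = sum(w for w, _ in members) / len(members)
            mean_h = sum(h for _, h in members) / len(members)
            new_centroids.append((mean_w, mean_h))
        if new_centroids == centroids:  # assignments no longer change
            break
        centroids = new_centroids
    return sorted(centroids)


# Hypothetical labeled box sizes (width, height) in pixels.
labelled = [(20, 60), (22, 64), (18, 58), (40, 120), (44, 118), (42, 125)]
anchors = kmeans_anchors(labelled, k=2)
```

Using 1-IoU rather than Euclidean distance keeps large boxes from dominating the clustering, since the distance depends on overlap rather than on absolute size.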

As a further improvement of the present invention, the target detection FIDN network uses a lightweight convolutional neural network as its backbone and, following the target detection algorithm, predicts with a single 1*1 convolution. The positioning loss function of the target detection algorithm is:

where λ is a coefficient controlling the share of the positioning loss in the total loss; S is the size of the feature map of the last convolution; A is the number of anchor boxes generated in each region; the indicator term is a 0-1 function whose value is 1 if there is a target in the region at row i, column j, and 0 otherwise; and x, y, h, w denote the coordinates of the center point and the height and width of the prediction box respectively. Symbols carrying ^ denote true values, and symbols without ^ denote predicted values.

As a further improvement of the present invention, in step S3 the model is trained from scratch: the weight parameters are randomly initialized, the data are augmented by horizontal flipping, random cropping, and color jitter, and the target detection FIDN network is trained while hyperparameters such as the learning rate, batch size, and optimization method are adjusted repeatedly.

As a further improvement of the present invention, in step S4 the prediction method is: construct the forward inference process of the network, whose input parameter is image data and whose return value is the prediction result; when performing target detection on video, a Kalman filter is added for tracking.

The present invention also provides a detection system that implements the infrared image pedestrian detection method, comprising:

A data acquisition module: for acquiring infrared images containing pedestrians;

A data preprocessing module: for preprocessing the infrared images, manually annotating the preprocessed infrared images, and then dividing them into a training set and a validation set for the detection model according to a set ratio;

A target detection FIDN network construction module: for constructing the target detection FIDN network based on a convolutional neural network, the target detection FIDN network comprising several convolutional layers and max pooling layers, and dilated convolutional layers arranged after the convolutional layers and max pooling layers; in the stacking of convolutional layers, once the number of channels reaches a set value, the number of channels of the dilated convolutional layers is no longer increased;

A model training module: for training the target detection FIDN network on the training set and selecting the optimal model, that is, the one that performs best on the validation set;

An optimal model prediction module: for running prediction on a GPU server based on the optimal model, to perform target detection on a video stream.

Compared with the prior art, the beneficial effects of the present invention are as follows: it takes full advantage of the high accuracy of deep learning, is robust, and can adapt to varied changes in the external environment. The FIDN network designed and constructed here has high precision and a very low computational cost: it reaches 180 fps on a GPU and about 18 fps on a CPU, which satisfies the real-time requirement and makes the method highly practical.

Brief description of the drawings

Fig. 1 is a flowchart of the method of the present invention;

Fig. 2 is a schematic diagram of the structure of the target detection FIDN network;

Fig. 3 is a flowchart of the processing method of the feature map channel weighting module;

Fig. 4 is an original infrared image;

Fig. 5 is the image after detection.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and embodiments.

As shown in Fig. 1, the method of the present invention constructs the FIDN (Fast-Infared-Detect-Network, fast infrared target detection) deep neural network and comprises the following steps:

Step S1: data acquisition and data preprocessing: acquire infrared images containing pedestrians, preprocess the infrared images, manually annotate the preprocessed infrared images, and then divide them into a training set and a validation set for the detection model according to a set ratio.

After a large number of pictures containing pedestrians have been acquired, some preprocessing is needed because the image quality of infrared images is usually poor. The processed infrared images are then annotated manually; each annotation has two parts, the target category and the target bounding box.

Step S2: construct the target detection FIDN network (FIDN network for short) based on a convolutional neural network: the target detection FIDN network comprises several convolutional layers and max pooling layers, and dilated convolutional layers arranged after the convolutional layers and max pooling layers; in the stacking of convolutional layers, once the number of channels reaches a set value, the number of channels of the dilated convolutional layers is no longer increased;

Step S3: model training: train the target detection FIDN network on the training set, and select the optimal model, that is, the one that performs best on the validation set;

Step S4: optimal model prediction: based on the optimal model, run prediction on a GPU server to perform target detection on a video stream. On a GPU, the speed can reach 180 fps (real-time video detection speed, the number of frames detected per second) or more. The specific prediction flow is shown in Fig. 3.

In step S1, the preprocessing includes median filtering. Owing to the external environment and the imaging principle of infrared cameras, infrared imaging produces considerable noise, which degrades picture quality and sharpness and makes pedestrian detection and recognition harder. The images are therefore preprocessed first to filter out the noise. The median filtering formula is as follows:

g(x, y) = median{ f(x - k, y - l), (k, l) ∈ W }

where f(x, y) and g(x, y) are the original image and the processed image respectively, W is a two-dimensional template, and k and l are the offsets along the two dimensions of W.
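The median filtering formula above can be sketched in a few lines. This is an illustrative version only: the square (2r+1)*(2r+1) template, the border handling by clamping coordinates, and the name `median_filter` are assumptions, since the source does not fix the shape of W or the treatment of image borders.

```python
from statistics import median


def median_filter(image, radius=1):
    """Apply g(x, y) = median{f(x - k, y - l), (k, l) in W}.

    `image` is a list of rows of grey levels. W is taken to be a square
    (2*radius+1) x (2*radius+1) window, and coordinates outside the image
    are clamped to the nearest border pixel (both are assumptions).
    """
    height, width = len(image), len(image[0])
    output = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            ]
            output[y][x] = median(window)
    return output


# A flat patch with two isolated noise pixels (255 and 0).
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 10, 0],
    [10, 10, 10, 10],
]
denoised = median_filter(noisy)
```

Because an isolated outlier never reaches the middle of the sorted window, both noise pixels are replaced by the surrounding grey level, which is why median filtering suits impulse-like noise.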

Manual annotation in this example means using an annotation tool to outline every pedestrian in each picture with a rectangular box, the box being the minimum bounding rectangle of the target pedestrian, and generating a corresponding XML file. The XML file records the coordinates of each target in the picture, comprising the top-left x coordinate, the top-left y coordinate, the width w, and the height h. Pictures that are blurred or difficult to annotate are deleted. The resulting data are mixed and divided at a ratio of 9:1 into the training set and the validation set of the detection model. The training set is used for model training; the validation set takes no part in training and is used to verify the training effect of the model.
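The mixing and 9:1 division can be sketched as below. The record layout, the file names, and the function `split_dataset` are hypothetical and only stand in for the annotated pictures; the fixed seed is an assumption made so the split is reproducible.

```python
import random


def split_dataset(samples, train_ratio=0.9, seed=0):
    """Shuffle the samples and split them into training and validation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical annotation records: one (x, y, w, h) box per picture.
records = [
    {"image": f"ir_{i:04d}.png", "boxes": [(12, 30, 20, 60)]}
    for i in range(100)
]
train_set, val_set = split_dataset(records)
```

Shuffling before the cut corresponds to the mixing step, so that pictures from the same capture session do not all fall into one set.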

In step S2, the FIDN network is a fully convolutional network consisting of 7 layers of 1*1 or 3*3 convolutions. The method as a whole is a single-stage detector, so no candidate boxes need to be generated inside the network; the candidate boxes are generated directly on the original image. The original image is divided directly into S*S parts (where S is the size of the feature map of the last convolution, usually 13*13 for a 416*416 original image), and 5 candidate boxes with different aspect ratios are then generated in each region. The specific aspect ratios are obtained by applying the k-means algorithm to the labeled boxes of the dataset. The size distribution of the anchors (prior candidate boxes) is computed from the actual dataset by the k-means algorithm, using (1-IoU) as the distance metric, where IoU is the intersection over union of the areas of a prior candidate box and a labeled box. The calculation formula is as follows:

IoU(A, B) = |A ∩ B| / |A ∪ B|

where A is the prior candidate box and B is the labeled box.
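As an illustration of the grid division and the IoU measure described above, the sketch below places one candidate box per anchor at the center of each of the 13*13 regions of a 416*416 image and computes the IoU of two boxes. The five anchor sizes are made-up placeholders (in the method they come from k-means on the labeled boxes), and centering the boxes on each region is an assumption.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def grid_candidates(image_size, s, anchors):
    """Centre one candidate box per anchor in each of the s*s regions."""
    cell = image_size / s
    boxes = []
    for row in range(s):
        for col in range(s):
            cx, cy = (col + 0.5) * cell, (row + 0.5) * cell
            for w, h in anchors:
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


# Hypothetical anchor sizes (width, height) in pixels.
anchors = [(16, 48), (24, 72), (32, 96), (48, 144), (64, 192)]
candidates = grid_candidates(416, 13, anchors)
print(len(candidates))  # 13 * 13 * 5 candidate boxes
```

With S = 13 and 5 anchors per region, the detector scores 845 candidate boxes per image in a single pass.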

As shown in Fig. 2, conv denotes a convolutional layer, Dilated conv denotes a dilated convolution, maxpool denotes max pooling, and the prediction part is a single 1*1 convolution. The target detection FIDN network of this example comprises 5 convolutional layers with max pooling layers, and 2 dilated convolutional layers arranged after them. In the stacking of convolutional layers, once the number of channels reaches the set value of 256, the number of channels of the dilated convolutional layers stays at 256 and is no longer increased.
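The effect of this layout on feature map size and receptive field can be traced with a short calculation. The sketch assumes 3*3 convolutions with stride 1, 2*2 max pooling with stride 2, 'same' padding, and a dilation rate of 2 for the two dilated layers; the source states none of these values, and the helper `trace` is hypothetical.

```python
def trace(input_size, layers):
    """Return (output size, receptive field) for a stack of layers.

    Each layer is (kernel, stride, dilation); 'same' padding is assumed,
    so the spatial size only changes through the stride.
    """
    size, rf, jump = input_size, 1, 1
    for kernel, stride, dilation in layers:
        effective = dilation * (kernel - 1) + 1  # kernel extent after dilation
        rf += (effective - 1) * jump
        jump *= stride
        size = -(-size // stride)  # ceiling division
    return size, rf


conv, pool = (3, 1, 1), (2, 2, 1)
backbone = [conv, pool] * 5          # 5 conv + max pooling stages
dilated = backbone + [(3, 1, 2)] * 2  # 2 dilated convolutions (rate 2 assumed)
plain = backbone + [(3, 1, 1)] * 2    # same depth without dilation

print(trace(416, dilated))
print(trace(416, plain))
```

Under these assumptions both variants end on a 13*13 feature map, but the dilated stack sees a noticeably larger region of the input per output position, without any extra pooling.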

The last two convolutional layers use dilated convolution. The great advantage of dilated convolution is that, without any pooling or downsampling, it enlarges the receptive field so that each convolution output covers a larger range of information, while retaining as much as possible of the spatial information of the larger feature map and of the image. This is critical for small-target detection, so for the target detection problem dilated convolution preserves spatial information to a large extent. However, because the feature map is not reduced, dilated convolution greatly increases the amount of computation. Unlike general network structures, the FIDN network therefore sets the number of channels of every convolution in the last module to 256. Because the number of layers has been compressed, a self-adaptive feature map channel weighting module is attached after this convolutional layer. The module is arranged at the output of the dilated convolutional layer and weights the channels of the feature map output by the dilated convolutional layer.
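The saving from capping the channel count can be checked with simple arithmetic. The sketch counts multiply-accumulate operations for one convolution on a 13*13 feature map; the 3*3 kernel and equal input and output channel counts are assumptions made for illustration, and `conv_macs` is a hypothetical helper.

```python
def conv_macs(feature_size, in_channels, out_channels, kernel=3):
    """Multiply-accumulate count of one convolution with 'same' padding."""
    return feature_size * feature_size * kernel * kernel * in_channels * out_channels


macs_256 = conv_macs(13, 256, 256)
macs_1024 = conv_macs(13, 1024, 1024)
print(macs_256, macs_1024, macs_1024 // macs_256)
```

Since the cost grows with the product of input and output channels, a 4 times wider layer costs 16 times as much, which is why holding the last module at 256 channels keeps the dilated layers cheap.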

As shown in Fig. 3, the processing method of the self-adaptive feature map channel weighting module is:

A1: use a global pooling layer to compress the feature map to 1*1*C, where C is the number of channels of the feature map, here 256;

A2: use a fully connected layer to compress the number of channels to C/16;

A3: apply a ReLU activation function, then use a fully connected layer to restore the number of channels to C;

A4: pass the output through a sigmoid activation layer, which yields a 1*1*C weight vector; after the sigmoid function, each weight in the vector lies between 0 and 1. These weights serve as the channel weights of the output feature map of the preceding convolutional layer. The network thus learns the channel weights itself, because different channels of a feature map play different roles and therefore differ in importance;

A5: weight the feature map along the channel dimension using the weights.

In Fig. 3, conv denotes a convolutional layer, avgpool denotes an average pooling layer, fc denotes a fully connected layer, ReLU denotes the use of the relu function as the activation function, and Sigmoid denotes the use of the sigmoid function as the activation layer. ReWeight denotes weighting the feature map along the channel dimension with the weights obtained from the right-hand branch.
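Steps A1 to A5 can be sketched in plain Python as follows. This is a toy forward pass under stated assumptions: the fully connected layers have no bias terms, the weights are random rather than learned, and a small C = 32 is used instead of 256 to keep the example light. The function name `channel_weighting` is hypothetical.

```python
import math
import random


def channel_weighting(feature_map, w1, w2):
    """Forward pass of the channel weighting steps A1-A5.

    feature_map: C channels, each an H x W list of lists.
    w1: (C/16) x C weights of the compressing fully connected layer.
    w2: C x (C/16) weights of the restoring fully connected layer.
    Bias terms are omitted (an assumption).
    """
    # A1: global average pooling, one value per channel (1*1*C).
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # A2: fully connected layer down to C/16, followed by ReLU (start of A3).
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    # A3 + A4: fully connected layer back to C, then sigmoid -> weights in (0, 1).
    weights = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # A5: scale every channel by its weight.
    reweighted = [
        [[value * g for value in row] for row in ch]
        for ch, g in zip(feature_map, weights)
    ]
    return weights, reweighted


rng = random.Random(0)
channels, reduced = 32, 32 // 16
feature_map = [[[rng.random() for _ in range(4)] for _ in range(4)] for _ in range(channels)]
w1 = [[rng.gauss(0.0, 0.1) for _ in range(channels)] for _ in range(reduced)]
w2 = [[rng.gauss(0.0, 0.1) for _ in range(reduced)] for _ in range(channels)]
weights, reweighted = channel_weighting(feature_map, w1, w2)
```

The two small fully connected layers add only 2*C*C/16 weights, so the module is cheap relative to the convolutions it follows.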

Experiments compared a convolutional layer channel number of 256 (denoted the FIDN-256 network) with a channel number of 1024 (denoted the FIDN-1024 network): on a self-built dataset, the detection accuracy is 80.1% (FIDN-256 network) and 80.6% (FIDN-1024 network) respectively. The overall FIDN network structure is shown in Fig. 2. The whole network uses a lightweight convolutional neural network as its backbone. The detection part is similar to that of most common one-stage target detection algorithms, but instead of predicting with a fully connected layer, FIDN predicts with a single 1*1 convolution. This example improves the loss function of the network. In target detection algorithms, the loss function generally has two parts, a positioning loss and a classification loss. For the positioning loss, detection boxes of different sizes affect the loss differently, so this example uses the following positioning loss function:

where λ is a coefficient controlling the share of the positioning loss in the total loss, 5 by default; because the positioning loss matters more than the classification loss, it is given a larger weight. S is the size of the feature map of the last convolution; A is the number of anchor boxes generated in each region, 5 by default; the indicator term is a 0-1 function whose value is 1 if there is a target in the region at row i, column j, and 0 otherwise. x, y, h, w denote the coordinates of the center point and the height and width of the prediction box respectively. Symbols carrying ^ denote true values, and symbols without ^ denote predicted values.

In step S3, the model is trained from scratch. Because the network is small, training from scratch is fast and carries no risk of over-fitting, so the network is trained directly on the dataset from step S1. All weight parameters are randomly initialized, the data are augmented by operations such as horizontal flipping, random cropping, and color jitter, and the FIDN network is trained while hyperparameters such as the learning rate, batch size (batch_size), and optimization method are adjusted repeatedly.
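Of the augmentations listed, horizontal flipping is the one that also has to update the annotations. A minimal sketch, assuming the (x, y, w, h) top-left box format used in the XML files and images stored as lists of rows; the function name `flip_horizontal` is hypothetical.

```python
def flip_horizontal(image, boxes):
    """Mirror an image left-right and move its boxes accordingly.

    image: list of rows of pixel values.
    boxes: list of (x, y, w, h) with (x, y) the top-left corner.
    """
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # The new left edge is the mirrored position of the old right edge.
    flipped_boxes = [(width - x - w, y, w, h) for x, y, w, h in boxes]
    return flipped_image, flipped_boxes


image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
boxes = [(0, 0, 1, 2)]  # a box covering the leftmost column
flipped_image, flipped_boxes = flip_horizontal(image, boxes)
```

Width and height are unchanged by the flip; only the x coordinate of the top-left corner moves.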

The optimal model is obtained as follows: during training, a model is saved after every epoch (one epoch means that every picture in the dataset has been trained on once); training generally runs for 60 epochs. The saved models are tested on the validation set, and the optimal model is selected according to the pedestrian detection precision mAP.
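The selection rule amounts to keeping the saved model with the highest validation mAP. In the sketch below the checkpoint file names and mAP values are invented placeholders, not results from the source, and `select_best` is a hypothetical helper.

```python
def select_best(checkpoints):
    """Return the checkpoint with the highest validation mAP."""
    return max(checkpoints, key=lambda c: c["val_map"])


# Placeholder records: one entry per saved epoch (values are made up).
checkpoints = [
    {"epoch": 1, "path": "fidn_epoch_01.ckpt", "val_map": 0.41},
    {"epoch": 2, "path": "fidn_epoch_02.ckpt", "val_map": 0.58},
    {"epoch": 3, "path": "fidn_epoch_03.ckpt", "val_map": 0.55},
]
best = select_best(checkpoints)
```

Selecting on the validation set rather than the training set matters because the validation pictures take no part in training.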

In step S4, the prediction method is as follows: construct the forward inference process of the network. The network structure used for forward inference is identical to the structure used in training, except that no loss is computed and no loss is back-propagated. The input parameter is image data and the return value is the prediction result. The input image undergoes simple preprocessing and is then passed to the input of the network. The network accepts pictures of any size, which are scaled automatically inside the network. Some post-processing can also be applied: when performing target detection on video, a Kalman filter is added for tracking, which makes the detection process smoother and more stable. The result of applying the target detection method of the present invention to Fig. 4 is shown in Fig. 5.
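The source does not describe the state model of the Kalman filter, so the following is only one plausible sketch: a constant-velocity filter applied to a single box coordinate (for example the x coordinate of a box center), with made-up noise variances and measurements. The class name `Kalman1D` is hypothetical.

```python
class Kalman1D:
    """Constant-velocity Kalman filter for one coordinate (assumed model)."""

    def __init__(self, position, process_var=1e-2, measurement_var=4.0):
        self.x = [position, 0.0]            # state: position and velocity
        self.p = [[1.0, 0.0], [0.0, 1.0]]   # state covariance
        self.q = process_var
        self.r = measurement_var

    def step(self, measurement, dt=1.0):
        # Predict: x <- F x, P <- F P F^T + Q with F = [[1, dt], [0, 1]].
        pos = self.x[0] + dt * self.x[1]
        vel = self.x[1]
        (p00, p01), (p10, p11) = self.p
        a00 = p00 + dt * (p10 + p01) + dt * dt * p11 + self.q
        a01 = p01 + dt * p11
        a10 = p10 + dt * p11
        a11 = p11 + self.q
        # Update with a position-only measurement (H = [1, 0]).
        s = a00 + self.r
        k0, k1 = a00 / s, a10 / s
        residual = measurement - pos
        self.x = [pos + k0 * residual, vel + k1 * residual]
        self.p = [[(1 - k0) * a00, (1 - k0) * a01],
                  [a10 - k1 * a00, a11 - k1 * a01]]
        return self.x[0]


# Hypothetical per-frame detections of a box centre's x coordinate.
detections = [100.0, 103.5, 101.0, 106.5, 105.0, 109.0]
tracker = Kalman1D(detections[0])
smoothed = [tracker.step(z) for z in detections[1:]]
```

In a full tracker one such filter, or a joint filter over all box parameters, would run per tracked pedestrian, with the detector output supplying the measurement each frame.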

The infrared pedestrian detection method based on deep learning of the present invention takes full advantage of the high accuracy of deep learning, is robust, and can adapt to varied changes in the external environment. The FIDN network designed and constructed here has high precision and a very low computational cost: it reaches 180 fps on a GPU and about 18 fps on a CPU, which satisfies the real-time requirement and makes the method highly practical.

The present invention has the following two main innovations:

(1) A new target detection network, FIDN. The method proposes a new, efficient target detection network for infrared image pedestrian detection. It is a single-stage target detection method: the distribution of the prior candidate boxes of the dataset is obtained by the k-means method, and the target boxes are then located by regression. The whole network has only 7 convolutional layers (not counting the channel weighting part), comprising some convolutional layers and max pooling layers followed by dilated convolutions at the end, which give a sufficient receptive field without reducing the size of the feature map and help improve pedestrian detection precision. In the stacking of convolutional layers, the number of channels is not doubled continually as in conventional networks; once it reaches 256 it is no longer increased, which greatly reduces the amount of computation.

(2) A self-adaptive feature map channel weighting method. Because the network design does not follow the usual practice of doubling the number of channels, the feature map has fewer channels, which can affect performance to some degree. The present invention therefore provides a self-adaptive feature map channel weighting method. A feature map usually has many channels, hundreds or even thousands, but the information they provide and their importance differ. With the method of the present invention, the network itself learns a set of weighting parameters, which are then merged into the feature map. The method is reasonably general and can be added to many networks: feature map channel weighting can be appended after any chosen convolutional layers.

The specific embodiments described above are preferred embodiments of the present invention and do not limit its specific scope of implementation. The scope of the present invention includes, but is not limited to, these embodiments, and all equivalent changes made in accordance with the present invention fall within its scope of protection.

Claims

1. An infrared image pedestrian detection method based on deep learning, characterized in that the infrared image pedestrian detection method comprises the following steps:

Step S1: data acquisition and data preprocessing: acquire infrared images containing pedestrians, preprocess the infrared images, manually annotate the preprocessed infrared images, and then divide them into a training set and a validation set for the detection model according to a set ratio;

Step S2: construct the target detection FIDN network based on a convolutional neural network: the target detection FIDN network comprises several convolutional layers and max pooling layers, and dilated convolutional layers arranged after the convolutional layers and max pooling layers; in the stacking of convolutional layers, once the number of channels reaches a set value, the number of channels of the dilated convolutional layers is no longer increased;

Step S3: model training: train the target detection FIDN network on the training set, and select the optimal model, that is, the one that performs best on the validation set;

Step S4: optimal model prediction: based on the optimal model, run prediction on a GPU server to perform target detection on a video stream.

2. The infrared image pedestrian detection method according to claim 1, characterized in that: in step S2, the target detection FIDN network further comprises a self-adaptive feature map channel weighting module, arranged at the output of the dilated convolutional layer, for weighting the channels of the feature map output by the dilated convolutional layer.

3. The infrared image pedestrian detection method according to claim 2, characterized in that the processing method of the self-adaptive feature map channel weighting module is:

A1: use a global pooling layer to compress the feature map to 1*1*C, where C is the number of channels of the feature map;

A2: use a fully connected layer to compress the number of channels to C/16;

A3: after a ReLU activation function, use a fully connected layer to restore the number of channels to C;

A4: pass the output through a sigmoid activation layer to obtain a 1*1*C weight vector; after the sigmoid function, each weight in the vector lies between 0 and 1;

A5: weight the feature map along the channel dimension using the weights.

4. The infrared image pedestrian detection method according to any one of claims 1-3, characterized in that: in step S1, the preprocessing includes median filtering, with the following formula:

g(x, y) = median{ f(x - k, y - l), (k, l) ∈ W }

where f(x, y) and g(x, y) are the original image and the processed image respectively, and W is a two-dimensional template.

5. The infrared image pedestrian detection method according to claim 4, characterized in that: manual annotation means using an annotation tool to outline every pedestrian in each picture with a rectangular box, the box being the minimum bounding rectangle of the target pedestrian, and generating a corresponding XML file; the XML file records the coordinates of each target in the picture, comprising the top-left x coordinate, the top-left y coordinate, the width w, and the height h; pictures that are blurred or difficult to annotate are deleted; and the resulting data are mixed and divided at a ratio of 9:1 into the training set and the validation set of the detection model.

6. The infrared image pedestrian detection method according to claim 5, characterized in that: in step S2, the target detection FIDN network is a fully convolutional network consisting of 7 layers of 1*1 or 3*3 convolutions, and the candidate boxes are generated directly on the original image, as follows:

B1: divide the original image directly into S*S regions, where S is the size of the feature map of the last convolution;

B2: generate several candidate boxes with different aspect ratios in each region; the specific aspect ratios are obtained by applying the k-means algorithm to the rectangular boxes labeled in the dataset;

B3: compute the size distribution of the prior candidate boxes from the actual dataset, using (1-IoU) as the distance metric, where IoU is the intersection over union of the areas of a prior candidate box and a labeled rectangular box, calculated as follows:

IoU(A, B) = |A ∩ B| / |A ∪ B|

where A is the prior candidate box, B is the labeled rectangular box, ∩ denotes the intersection of A and B, and ∪ denotes the union of A and B.

7. The infrared image pedestrian detection method according to claim 6, characterized in that: the target detection FIDN network uses a lightweight convolutional neural network as its backbone and, following the target detection algorithm, predicts with a single 1*1 convolution; the positioning loss function of the target detection algorithm is:

where λ is a coefficient controlling the share of the positioning loss in the total loss; S is the size of the feature map of the last convolution; A is the number of anchor boxes generated in each region; the indicator term is a 0-1 function whose value is 1 if there is a target in the region at row i, column j, and 0 otherwise; and x, y, h, w denote the coordinates of the center point and the height and width of the prediction box respectively. Symbols carrying ^ denote true values, and symbols without ^ denote predicted values.

8. The infrared image pedestrian detection method according to any one of claims 1-3, characterized in that: in step S3, the model is trained from scratch; the weight parameters are randomly initialized, the data are augmented by horizontal flipping, random cropping, and color jitter, and the target detection FIDN network is trained while hyperparameters such as the learning rate, batch size, and optimization method are adjusted repeatedly.

9. The infrared image pedestrian detection method according to claim 8, characterized in that: in step S4, the prediction method is: construct the forward inference process of the network, whose input parameter is image data and whose return value is the prediction result; when performing target detection on video, a Kalman filter is added for tracking.

10. A detection system implementing the infrared image pedestrian detection method according to any one of claims 1-9, characterized by comprising:

A data acquisition module: for acquiring infrared images containing pedestrians;

A data preprocessing module: for preprocessing the infrared images, manually annotating the preprocessed infrared images, and then dividing them into a training set and a validation set for the detection model according to a set ratio;

A target detection FIDN network construction module: for constructing the target detection FIDN network based on a convolutional neural network, the target detection FIDN network comprising several convolutional layers and max pooling layers, and dilated convolutional layers arranged after the convolutional layers and max pooling layers; in the stacking of convolutional layers, once the number of channels reaches a set value, the number of channels of the dilated convolutional layers is no longer increased;

A model training module: for training the target detection FIDN network on the training set and selecting the optimal model, that is, the one that performs best on the validation set;

An optimal model prediction module: for running prediction on a GPU server based on the optimal model, to perform target detection on a video stream.