CN112446436A - Anti-blur unmanned vehicle multi-target tracking method based on generative adversarial network - Google Patents

Anti-blur unmanned vehicle multi-target tracking method based on generative adversarial network

Info

Publication number
CN112446436A
CN112446436A CN202011460523.6A CN202011460523A CN112446436A CN 112446436 A CN112446436 A CN 112446436A CN 202011460523 A CN202011460523 A CN 202011460523A CN 112446436 A CN112446436 A CN 112446436A
Authority
CN
China
Prior art keywords
image
network
target
loss function
fuzzy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011460523.6A
Other languages
Chinese (zh)
Inventor
梁军
马皓月
刘创
张婳
张智源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202011460523.6A priority Critical patent/CN112446436A/en
Publication of CN112446436A publication Critical patent/CN112446436A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/10Image enhancement or restoration using non-spatial domain filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/73Deblurring; Sharpening
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle
    • G06T2207/30256Lane; Road marking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an anti-blur unmanned vehicle multi-target tracking method based on a generative adversarial network. Aimed at situations in which the images collected by the camera are blurred by vehicle body shake and similar disturbances, the method processes the blurred video sequence collected by the unmanned vehicle and performs multi-target tracking on it. The method is simple to implement and flexible in its means, effectively alleviates the degradation of multi-target tracking caused by shake, and improves the accuracy of multi-target tracking.

Description

Anti-blur unmanned vehicle multi-target tracking method based on generative adversarial network
Technical Field
The invention relates to the technical field of computer networks, and in particular to an anti-blur unmanned vehicle multi-target tracking method based on a generative adversarial network.
Background
Tracking targets while an unmanned vehicle is moving is a challenging problem. A good tracking algorithm must detect and track obstacles ahead of the vehicle, especially pedestrians and vehicles, and accurately judge their behavioral intent so that reasonable path planning can be performed. Multi-target tracking detects moving targets in a video sequence, associates the targets across different frames one by one, produces the motion trajectories of the different targets, predicts their short-term motion trends, and judges their behavioral intent. The targets may be arbitrary: pedestrians, vehicles, animals, and so on. The multi-target tracking results can then be used for the obstacle-avoidance strategy and dynamic path planning of the unmanned vehicle.
Current multi-target tracking algorithms rarely consider whether the acquired image is sharp. In practice, an uneven road surface makes the vehicle body shake, and the pictures captured by the camera suffer motion blur. Blurred images greatly degrade tracking performance, so dangerous targets may not be found in time and detected targets cannot be tracked reliably, which is critical for the stability and safety of the unmanned vehicle. How to handle motion-induced image blur is therefore an urgent problem in unmanned vehicle multi-target tracking.
Disclosure of Invention
The invention aims to provide an anti-blur unmanned vehicle multi-target tracking method based on a generative adversarial network, addressing the problem of image blur in existing unmanned vehicle multi-target tracking.
The purpose of the invention is achieved by the following technical scheme: an anti-blur unmanned vehicle multi-target tracking method based on a generative adversarial network, comprising the following steps:
Step one: acquire a road-condition video sequence using the on-board camera of the unmanned vehicle.
Step two: use a blurred-image detection method to determine whether each image in the video sequence acquired in step one is blurred; if the image is sharp, go directly to step four, and if it is blurred, go to step three.
Step three: apply a deblurring generative adversarial network to the blurred image from step two to remove the blur and make the image sharp.
Step four: detect the targets appearing in each frame of the sharp images obtained in steps two and three using YOLO (You Only Look Once, a single-neural-network target detection algorithm), and determine the main targets to be tracked by the multi-target tracking algorithm.
Step five: for the main targets from step four, perform data association using a re-identification model and the Hungarian algorithm, associating the currently detected targets with the historical target tracks to obtain complete target trajectories.
Step six: use a Kalman filter as the tracker to estimate the position of the current target in the next frame as prior information for target tracking, and fuse the predicted position with the detector's detection as the output to smooth the trajectory and complete the target tracking task.
Further, step two is realized by the following sub-steps:
(2.1) Graying and Laplacian filtering: convert the RGB color image into a grayscale image and filter it with the Laplacian operator to preprocess the image.
(2.2) Variance calculation: the more severe the blur, the lower the variance of the filtered image, while a sharp image has a higher variance; when the variance exceeds the threshold 200, the image is judged to be non-blurred.
(2.3) Preventing misjudgment using the previous frame: if the previous frame is blurred, the current image is judged non-blurred when the variance ratio of the current frame to the previous frame exceeds the threshold 5; if the previous frame is not blurred, the current image is judged non-blurred when that variance ratio exceeds the threshold 0.3. Otherwise, the image is blurred.
Further, step three is realized by the following sub-steps:
(3.1) Construct the deblurring generative adversarial network. Constructing the generator network: the network structure is designed on the basis of a super-resolution reconstruction deep network; iterative residual fitting is simulated through convolution operations to approximate the sharp image. The improved neural network consists of 2 convolutional layers and 9 blocks, where each block consists of 2 series-connected 3 x 3 convolutional layers, a normalization layer and a linear rectification unit. Constructing the discriminator network: the discriminator is designed with the patch-based discriminator used in the conditional-GAN image-translation algorithm; this patch discriminator turns the GAN discriminator into a fully convolutional network that maps the input to an N x N matrix X, where the value of X_ij represents the probability that the corresponding patch is a real sample, and the average of the matrix is the final output of the discriminator.
(3.2) Determine the loss function of the deblurring generative adversarial network. The total loss L_all comprises four parts: the conditional GAN loss L_cGAN, the sum-of-squared-error loss L_2, the structural similarity loss L_ssim and the perceptual loss L_perceptual, where k_n (n = 1, 2, 3) are the corresponding hyperparameters.
L_all = L_cGAN + k_1 · L_2 + k_2 · L_ssim + k_3 · L_perceptual
Conditional GAN loss function:
L_cGAN = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 − D(x, G(x, z)))]
where G denotes the generator, D the discriminator, E(·) the expected value over the corresponding distribution, x the blurred image, y the sharp image, z the noise and P_data(x) the sample distribution; the cross-entropy loss is used as the conditional GAN loss function.
Structural similarity loss function SSIM(x, y):
SSIM(x, y) = [(2·μ_x·μ_y + C_1)(2·σ_xy + C_2)] / [(μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2)]
SSIM(x, y) can be viewed as the product of two terms: the image luminance similarity l(x, y) and the image contrast similarity c(x, y).
l(x, y) = (2·μ_x·μ_y + C_1) / (μ_x² + μ_y² + C_1)
c(x, y) = (2·σ_xy + C_2) / (σ_x² + σ_y² + C_2)
where μ_x and σ_x² denote the mean and variance of the deblurred image, μ_y and σ_y² the mean and variance of the original sharp image, σ_xy their covariance, and C_1 and C_2 are constants used for numerical stability. For a three-channel RGB image, the channel values are averaged before the local mean and variance are computed. The corresponding structural loss L_ssim is:
L_ssim = 1 − SSIM(x, y)
Perceptual loss function L_perceptual:
L_perceptual = (1 / (C_j · W_j · H_j)) · || φ_j(I^S) − φ_j(G(I^B)) ||²
C_j, W_j and H_j are the number of channels, the width and the height of the j-th feature map of the network, φ_j is the output of the network at the j-th convolutional layer, G(I^B) is the generator output for the input blurred image, I^S is the corresponding sharp image, and I^B is the corresponding blurred image.
(3.3) Train the deblurring generative adversarial network. During training, the network convolution kernels are 3 x 3, the batch size is 8, the initial learning rates of the generator and the discriminator are both 0.01, and training is carried out on two GTX 1080 Ti graphics cards with the Adam optimizer. The network training process is as follows:
Initialization: set the initial learning rates ρ_G and ρ_D to 0.01 and the loss function weights k_n (n = 1, 2, 3).
Update the generator G: sample N pairs from the training set, (x, y) = {(x_1, y_1), …, (x_N, y_N)}.
Update the parameters of G by gradient descent on the total loss:
θ_G ← θ_G − ρ_G · ∇_{θ_G} (1/N) · Σ_{i=1}^{N} L_all(x_i, y_i)
Train the discriminator D several times, updating the parameters of D by gradient ascent on the adversarial objective:
θ_D ← θ_D + ρ_D · ∇_{θ_D} (1/N) · Σ_{i=1}^{N} [log D(x_i, y_i) + log(1 − D(x_i, G(x_i)))]
After training, the trained generative adversarial network model is obtained.
(3.4) Deblur the blurred image: input the blurred image into the trained generative adversarial network to obtain a deblurred sharp image.
The method has the advantage that it copes with camera images blurred by vehicle body shake, processes the blurred video sequences collected by the unmanned vehicle and performs multi-target tracking on them; it is simple to implement, flexible in its means, and effectively improves the accuracy of multi-target tracking.
Drawings
FIG. 1 is a flow chart of the anti-blur unmanned vehicle multi-target tracking method based on a generative adversarial network;
FIG. 2 is a flow chart of the blurred-image detection method of step two.
Detailed Description
The present invention is described in detail below with reference to the accompanying drawings.
The invention relates to an anti-blur unmanned vehicle multi-target tracking method based on a generative adversarial network, which comprises the following steps:
Step one: acquire a road-condition video sequence with the on-board camera. While driving, the unmanned vehicle must capture information about its surroundings, especially about pedestrians and vehicles in the environment. In this method a camera sensor is selected, and multi-target tracking is performed on the image sequence acquired by the camera.
Step two: and detecting whether the image in the video sequence is a blurred image or not by using a blurred image detection method, if so, directly performing the step four, and if so, performing the step three. In an actual vehicle driving scene, various interferences exist, for example, a captured image is blurred due to unevenness of a road, the captured image is polluted due to rain and snow weather, and before multi-target tracking, the image needs to be subjected to blur detection, that is, blurred image detection. This is one of the keys of the present invention, and the flow chart is shown in fig. 2. The detection of the blurred image mainly depends on the variance of the image, the image is clearer when the image variance is larger, and the image is blurred when the image variance is smaller.
This step is achieved by the following substeps:
(2.1) Graying and Laplacian filtering: the RGB color image is converted into a grayscale image and filtered with a 3 x 3 Laplacian operator to preprocess the image.
(2.2) Variance calculation: the more severe the blur, the lower the variance of the filtered image, while a sharp image has a higher variance; when the variance exceeds the threshold 200, the image is judged to be non-blurred.
(2.3) Preventing misjudgment using the previous frame: if the previous frame is blurred, the current image is judged non-blurred when the variance ratio of the current frame to the previous frame exceeds the threshold 5; if the previous frame is not blurred, the current image is judged non-blurred when that variance ratio exceeds the threshold 0.3. Otherwise, the image is blurred.
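For illustration, the following is a minimal Python/OpenCV sketch of this variance-based decision rule; the thresholds (200, 5, 0.3) come from the description above, while the function and variable names are assumptions.

```python
import cv2
import numpy as np

# Thresholds taken from steps (2.2)-(2.3); everything else is illustrative.
VARIANCE_THRESHOLD = 200.0    # absolute variance threshold for a sharp image
RATIO_IF_PREV_BLURRED = 5.0   # ratio threshold when the previous frame was blurred
RATIO_IF_PREV_SHARP = 0.3     # ratio threshold when the previous frame was sharp

def laplacian_variance(frame_bgr: np.ndarray) -> float:
    """Step (2.1): grayscale conversion, 3x3 Laplacian filtering, then the variance."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F, ksize=3).var())

def is_blurred(curr_var: float, prev_var: float, prev_blurred: bool) -> bool:
    """Steps (2.2)-(2.3): absolute threshold plus the inter-frame variance ratio check."""
    if curr_var > VARIANCE_THRESHOLD:
        return False
    ratio = curr_var / max(prev_var, 1e-6)
    threshold = RATIO_IF_PREV_BLURRED if prev_blurred else RATIO_IF_PREV_SHARP
    return not ratio > threshold
```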
Step three: and removing the blur of the blurred image by using a deblurring generation countermeasure method to make the blurred image clear. Step three is one of the keys of the present invention. Clear images are generated mainly by generating games against the network.
This step is achieved by the following sub-steps:
and (3.1) constructing a defuzzification generation countermeasure network. Constructing a generator network: the method comprises the steps of designing a network structure based on VDSR (Very Deep network for Super-Resolution reconstruction depth network), simulating iterative fitting residual errors through convolution operation, and approximating a clear image so as to be more suitable for an image deblurring task. The improved neural network consists of 2 convolutional layers and 9 blocks, and each block consists of 2 3 × 3 convolutional layers connected in series, an InstanceNorm normalization layer and a leak activation layer. Constructing a countermeasure network: the design of the countermeasure network is realized by using PatchGAN in pix2pix, the PatchGAN converts the GAN discriminator into a full convolution network, and the input is mapped into an NxN matrix X, XijThe value of (d) represents the probability that each matrix is a true sample, and the average value is the final output of the discriminator.
(3.2) Determine the loss function of the deblurring generative adversarial network. The total loss comprises four parts: the conditional GAN loss L_cGAN, the sum-of-squared-error loss L_2, the structural similarity loss L_ssim and the perceptual loss L_perceptual, where k_n (n = 1, 2, 3) are the corresponding hyperparameters.
L_all = L_cGAN + k_1 · L_2 + k_2 · L_ssim + k_3 · L_perceptual
The choice of the loss function is motivated by four considerations. (1) In common image-to-image tasks such as image deblurring and style transfer, the loss function is usually the sum-of-squared-error loss L_2. (2) The sum-of-squared-error loss captures the low-frequency information of the image well, so the reconstruction matches the ground truth at the global, macroscopic level, but high-frequency information, i.e. local texture and detail, is distorted; the structural similarity loss L_ssim handles this problem well and is often used to measure local texture similarity. (3) The perceptual loss L_perceptual, also used in image segmentation and style transfer, takes an intermediate layer of a pre-trained model as high-level image features and computes the Euclidean distance between the generated image and the real image at the feature level as a reconstruction loss on image content. (4) The conditional GAN loss L_cGAN is required for adversarial training of the network.
Conditional GAN loss function:
L_cGAN = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 − D(x, G(x, z)))]
where G denotes the generator and D the discriminator; the cross-entropy loss is used as the conditional GAN loss function.
Structural similarity loss function:
SSIM(x, y) = [(2·μ_x·μ_y + C_1)(2·σ_xy + C_2)] / [(μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2)]
SSIM(x, y) can be viewed as the product of two terms: the image luminance similarity l(x, y) and the image contrast similarity c(x, y).
l(x, y) = (2·μ_x·μ_y + C_1) / (μ_x² + μ_y² + C_1)
c(x, y) = (2·σ_xy + C_2) / (σ_x² + σ_y² + C_2)
where μ_x and σ_x² denote the mean and variance of the deblurred image, μ_y and σ_y² the mean and variance of the original sharp image, and σ_xy their covariance; C_1 and C_2 are constants used for numerical stability. For a three-channel RGB image, the channel values are averaged before the local mean and variance are computed. The corresponding structural loss is:
L_ssim = 1 − SSIM(x, y)
Perceptual loss function:
L_perceptual = (1 / (C_j · W_j · H_j)) · || φ_j(I^S) − φ_j(G(I^B)) ||²
C_j, W_j and H_j are the number of channels, the width and the height of the j-th feature map of the network, φ_j is the output of the network at the j-th convolutional layer, G(I^B) is the generator output for the input blurred image, and I^S is the corresponding sharp image.
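As an illustration, a hedged PyTorch sketch of how these four terms could be combined into L_all is given below; the VGG-16 feature layer, the placeholder weights k_1–k_3 and the use of the third-party pytorch_msssim package for the SSIM term are all assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16
from pytorch_msssim import ssim  # third-party SSIM implementation, used here as a stand-in

# Pre-trained VGG-16 features up to an intermediate conv layer (layer index is an assumption).
vgg_features = vgg16(pretrained=True).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad = False

def generator_loss(discriminator, blurred, restored, sharp, k1=100.0, k2=1.0, k3=1.0):
    """Total generator loss L_all = L_cGAN + k1*L2 + k2*L_ssim + k3*L_perceptual.
    The weight values k1, k2, k3 here are placeholders, not values from the patent."""
    # Conditional adversarial term: the discriminator sees (blurred, restored) pairs.
    d_fake = discriminator(torch.cat([blurred, restored], dim=1))
    l_cgan = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    # Sum-of-squared-error term.
    l2 = F.mse_loss(restored, sharp)
    # Structural similarity term: L_ssim = 1 - SSIM.
    l_ssim = 1.0 - ssim(restored, sharp, data_range=1.0)
    # Perceptual term: feature-space distance on a pre-trained network
    # (input normalization omitted for brevity).
    l_perc = F.mse_loss(vgg_features(restored), vgg_features(sharp))
    return l_cgan + k1 * l2 + k2 * l_ssim + k3 * l_perc
```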
(3.3) Train the deblurring generative adversarial network. During training, the network convolution kernels are 3 x 3, the batch size is 8, the initial learning rates of the generator and the discriminator are both 0.01, and training is carried out on two GTX 1080 Ti graphics cards with the Adam optimizer. The network training process is as follows:
Initialization: set the initial learning rates ρ_G and ρ_D to 0.01 and the loss function weights k_n (n = 1, 2, 3).
Update the generator G: sample N pairs from the training set, (x, y) = {(x_1, y_1), …, (x_N, y_N)}.
Update the parameters of G by gradient descent on the total loss:
θ_G ← θ_G − ρ_G · ∇_{θ_G} (1/N) · Σ_{i=1}^{N} L_all(x_i, y_i)
Train the discriminator D several times, updating the parameters of D by gradient ascent on the adversarial objective:
θ_D ← θ_D + ρ_D · ∇_{θ_D} (1/N) · Σ_{i=1}^{N} [log D(x_i, y_i) + log(1 − D(x_i, G(x_i)))]
A trained generative adversarial network model is obtained after training.
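A compact PyTorch sketch of this alternating update is shown below for illustration, reusing the generator_loss function sketched above; the data loader, the number of discriminator steps per generator step and the number of epochs are assumptions.

```python
import torch

def train_deblur_gan(generator, discriminator, loader, epochs=100, d_steps=2, lr=0.01):
    """Alternating Adam updates for G and D (epochs and d_steps are placeholders)."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)
    bce = torch.nn.BCELoss()
    for _ in range(epochs):
        for blurred, sharp in loader:  # batches of (blurred, sharp) image pairs
            # --- update the generator G on the total loss L_all ---
            restored = generator(blurred)
            opt_g.zero_grad()
            generator_loss(discriminator, blurred, restored, sharp).backward()
            opt_g.step()
            # --- update the discriminator D several times on real vs. generated pairs ---
            for _ in range(d_steps):
                opt_d.zero_grad()
                d_real = discriminator(torch.cat([blurred, sharp], dim=1))
                d_fake = discriminator(torch.cat([blurred, restored.detach()], dim=1))
                loss_d = (bce(d_real, torch.ones_like(d_real))
                          + bce(d_fake, torch.zeros_like(d_fake)))
                loss_d.backward()
                opt_d.step()
```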
(3.4) Deblur the blurred image: input the blurred image into the trained generative adversarial network to obtain a deblurred sharp image.
Step four: and detecting the targets appearing in each frame of image by using a YOLO algorithm, and determining the main targets to be tracked by multi-target tracking. The YOLO algorithm gives consideration to efficiency and accuracy, and can realize real-time accurate detection of the target.
This step is achieved by the following sub-steps:
and (4.1) loading a base model pre-trained by a YOLO algorithm. The YOLO algorithm is a representative one-stage target detection algorithm, the target detection is regarded as a simple regression problem, a target frame and a target category are regressed from pixel points, and the frame and the category of the target are obtained by using only one simple convolutional neural network. The convolution neural network of the YOLO algorithm mainly comprises the following steps: dividing the image into a 7 x 7 grid; if an object center falls within a grid, the grid is used to detect the class of the object. The convolutional neural network is composed of 24 convolutional layers and 2 fully-connected layers. In the method, the darknet model with trained weights is first downloaded from YOLO Real-Time Object Detection (pjredbie. com).
(4.2) Feed each sharp frame into the YOLO algorithm in sequence to obtain the target objects in the image sequence, realizing the initial target recognition. YOLO is trained mainly on the COCO dataset, which has 80 classes; the main labels include car, bird, cat, dog, and so on.
(4.3) Filter the detections by label, keeping pedestrians and vehicles as the targets of multi-target tracking. The method focuses on pedestrian and vehicle targets, so the classes car, bus and person are kept as tracking targets.
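For illustration, a small Python sketch of this detect-and-filter step follows; it uses the ultralytics YOLO package as a stand-in for the darknet model named above, and the model file name and confidence threshold are assumptions.

```python
from ultralytics import YOLO  # stand-in detector API, not the original darknet pipeline

TRACKED_CLASSES = {"person", "car", "bus"}  # classes kept for multi-target tracking

model = YOLO("yolov8n.pt")  # hypothetical COCO-pretrained model file

def detect_targets(frame):
    """Run detection on one sharp frame and keep only pedestrian/vehicle boxes."""
    detections = []
    for result in model(frame, verbose=False):
        for box in result.boxes:
            label = model.names[int(box.cls)]
            if label in TRACKED_CLASSES and float(box.conf) > 0.5:  # threshold is an assumption
                x1, y1, x2, y2 = map(float, box.xyxy[0])
                detections.append((label, float(box.conf), (x1, y1, x2, y2)))
    return detections
```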
Step five: and (3) realizing data association by using a re-recognition model and a Hungarian algorithm. And associating the currently detected target with the historical target track to obtain a complete target track. The step and the step six are iterated circularly continuously until the image sequence is finished.
This step is achieved by the following sub-steps:
and (5.1) loading a pre-trained re-recognition model, and distinguishing different pedestrians or vehicles in the detected target. In the method, the re-recognition mainly comprises pedestrian re-recognition and vehicle re-recognition, i.e. recognizing the same object (pedestrian or vehicle) in different frames of the video sequence. The pre-trained re-recognition model used in the method is obtained by training based on a large-scale ReiD data set, and a residual error network with 2 convolutional layers and 6 residual error blocks is constructed based on the pre-trained network to extract the appearance characteristics of the target.
(5.2) Use the Hungarian algorithm (an algorithm for finding the maximum matching in a bipartite graph) to find the optimal matching of the multiple targets across two consecutive frames, and associate the detected targets with the historical target tracks. The two sides of the matching are the targets of the previous and current frames, and both motion information and appearance information are used. Motion matching: the detected target state (u, v, r, h) is matched against the state (u, v, r, h) predicted by the Kalman filter of step six, and if their distance is below the threshold t = 9.4877 they are considered the same target. Appearance matching: the two sides are the appearance information of the targets detected in the two frames; appearance features are extracted with the re-identification model of (5.1), and the minimum cosine distance between feature vectors is used as the appearance similarity measure. Once the two matching measures are determined, a weighted Hungarian algorithm finds the optimal matching between the targets of the two frames.
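The following is a minimal sketch of this weighted assignment using SciPy's Hungarian solver; apart from the stated gate of 9.4877, the cost-fusion weight is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm solver

MOTION_GATE = 9.4877  # threshold on the motion distance, as stated in the description

def associate(tracks, detections, motion_cost, appearance_cost, alpha=0.3):
    """Weighted assignment between historical tracks and current detections.

    motion_cost[i, j]     - ndarray: distance between track i's predicted state and detection j
    appearance_cost[i, j] - ndarray: minimum cosine distance between re-ID feature vectors
    alpha                 - fusion weight between the two costs (assumed value)
    """
    cost = (alpha * np.asarray(motion_cost, dtype=float)
            + (1.0 - alpha) * np.asarray(appearance_cost, dtype=float))
    # Disallow pairs whose motion distance exceeds the gate.
    cost[np.asarray(motion_cost) > MOTION_GATE] = 1e6
    rows, cols = linear_sum_assignment(cost)
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] < 1e6]
    matched_t = {m[0] for m in matches}
    matched_d = {m[1] for m in matches}
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(detections)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```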
Step six: and using a Kalman filter as a tracker to estimate the position of the current target in the next frame, and fusing the predicted position and the detection position of the detector as output to smooth the track and realize multi-target tracking.
This step is achieved by the following sub-steps:
and (6.1) state definition of the tracking target. Using 8-dimensional space
Figure BDA0002831394550000071
Defining the state of a tracking target, (u, v) represents the center position of a two-dimensional frame, r represents the aspect ratio, h represents the height,
Figure BDA0002831394550000072
representing the rate of change of each of the aforementioned states in the coordinate system.
And (6.2) solving by using a Kalman filter. And the Kalman filtering utilizes a linear system state equation, observation data is input and output through the system, and the optimal estimation is carried out on the system state. The method assumes that a Kalman filter adopts a uniform motion model and a linear observation model, and uses a standard Kalman filter to solve, wherein the observation variables are (u, v, r, h).
(6.3) Multi-target track management and prediction. A threshold A_max is defined, and a variable a_k records the time elapsed since the last successful match of track k; when a_k exceeds A_max, the track is considered ended and its recording stops. A newly generated track is observed over the next 3 frames: if it can be matched successfully it is confirmed as a new track, otherwise it is deleted.
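As an illustration, a compact sketch of a constant-velocity Kalman filter over this 8-dimensional state follows; the process and measurement noise levels and the one-frame timestep are assumptions, while the state layout and the observed variables (u, v, r, h) follow the description.

```python
import numpy as np

class BoxKalmanFilter:
    """Constant-velocity Kalman filter over the state (u, v, r, h, du, dv, dr, dh)."""
    def __init__(self, initial_box, q=1e-2, r_obs=1e-1):
        self.x = np.zeros(8)
        self.x[:4] = initial_box            # (u, v, r, h) from the first detection
        self.P = np.eye(8)                  # state covariance
        self.F = np.eye(8)                  # state transition: position += velocity * dt
        self.F[:4, 4:] = np.eye(4)          # dt = 1 frame (assumption)
        self.H = np.eye(4, 8)               # only (u, v, r, h) is observed
        self.Q = q * np.eye(8)              # process noise (assumed)
        self.R = r_obs * np.eye(4)          # measurement noise (assumed)

    def predict(self):
        """Prior estimate of the target state in the next frame."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]

    def update(self, detected_box):
        """Fuse the detector's box with the prediction to smooth the trajectory."""
        z = np.asarray(detected_box, dtype=float)
        y = z - self.H @ self.x                          # innovation
        S = self.H @ self.P @ self.H.T + self.R          # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)         # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(8) - K @ self.H) @ self.P
        return self.x[:4]
```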

Claims (5)

1. An anti-blur unmanned vehicle multi-target tracking method based on a generative adversarial network, characterized by comprising the following steps:
Step one: acquire a road-condition video sequence using the on-board camera of the unmanned vehicle.
Step two: use a blurred-image detection method to determine whether each image in the road-condition video sequence acquired in step one is a blurred image or not; if the image is sharp, go directly to step four, and if the image is blurred, go to step three.
Step three: apply a deblurring generative adversarial network to the blurred image from step two to remove the blur and make the image sharp.
Step four: detect the targets appearing in each frame of the sharp images obtained in steps two and three using a target detection algorithm based on a single neural network, and determine the main targets to be tracked by multi-target tracking.
Step five: for the main targets from step four, perform data association using a re-identification model and the Hungarian algorithm, associating the currently detected targets with the historical target tracks to obtain complete target trajectories.
Step six: use a Kalman filter as the tracker to estimate the position of the current target in the next frame as prior information for target tracking, and fuse the predicted position with the detector's detection as the output to smooth the trajectory and complete the target tracking task.
2. The anti-blur unmanned vehicle multi-target tracking method according to claim 1, wherein the second step is realized by the following sub-steps:
(2.1) Graying and Laplacian filtering: convert the RGB color image into a grayscale image and filter it with the Laplacian operator to preprocess the image.
(2.2) Variance calculation: the more severe the blur, the lower the variance of the filtered image, while a sharp image has a higher variance; when the variance exceeds the threshold 200, the image is judged to be non-blurred.
(2.3) Preventing misjudgment using the previous frame: if the previous frame is blurred, the current image is judged non-blurred when the variance ratio of the current frame to the previous frame exceeds the threshold 5; if the previous frame is not blurred, the current image is judged non-blurred when that variance ratio exceeds the threshold 0.3. Otherwise, the image is blurred.
3. The anti-blur unmanned vehicle target tracking method according to claim 1, wherein the third step is realized by the following sub-steps:
and (3.1) constructing a defuzzification generation countermeasure network. Constructing a generator network: the method comprises the steps of designing a network structure based on a super-resolution reconstruction depth network, simulating iterative fitting residual errors through convolution operation, approximating a clear image, wherein an improved neural network consists of 2 convolution layers and 9 blocks, and each block consists of 2 series-connected 3 x 3 convolution layers, a normalization layer and a linear rectification function. Constructing a discriminator network: the design of discriminator network is realized by using the area generation countermeasure network in the image translation algorithm based on the condition generation countermeasure network, the area generation countermeasure network changes the generation countermeasure network discriminator into the full convolution network, and the input is mapped into the matrix X, X of NxNijThe value of (d) represents the probability that each matrix is a true sample, and averaging is the final output of the discriminator.
(3.2) determining the deblurring generates a countering network loss function, countering loss function LallComprises four parts, namely a conditional generation antagonistic network loss function LcGANError, error ofSum of squares loss function L2Structural similarity loss function LssimThe perceptual loss function Lperceptual,kn(n-1, 2,3) is the corresponding hyperparameter.
Lall=LcGAN+(k1)L2+(k2)Lssim+(k3)Lperceptual
Conditional GAN loss function:
L_cGAN = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 − D(x, G(x, z)))]
where G denotes the generator, D the discriminator, E(·) the expected value over the corresponding distribution, x the blurred image, y the sharp image, z the noise and P_data(x) the sample distribution; the cross-entropy loss is used as the conditional GAN loss function.
Structural similarity loss function SSIM(x, y):
SSIM(x, y) = [(2·μ_x·μ_y + C_1)(2·σ_xy + C_2)] / [(μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2)]
SSIM(x, y) can be viewed as the product of two terms: the image luminance similarity l(x, y) and the image contrast similarity c(x, y).
l(x, y) = (2·μ_x·μ_y + C_1) / (μ_x² + μ_y² + C_1)
c(x, y) = (2·σ_xy + C_2) / (σ_x² + σ_y² + C_2)
where μ_x and σ_x² denote the mean and variance of the deblurred image, μ_y and σ_y² the mean and variance of the original sharp image, σ_xy their covariance, and C_1 and C_2 are constants used for numerical stability. For a three-channel RGB image, the channel values are averaged before the local mean and variance are computed. The corresponding structural loss L_ssim is:
L_ssim = 1 − SSIM(x, y)
Perceptual loss function L_perceptual:
L_perceptual = (1 / (C_j · W_j · H_j)) · || φ_j(I^S) − φ_j(G(I^B)) ||²
C_j, W_j and H_j are the number of channels, the width and the height of the j-th feature map of the network, φ_j is the output of the network at the j-th convolutional layer, G(I^B) is the generator output for the input blurred image, I^S is the corresponding sharp image, and I^B is the corresponding blurred image.
(3.3) Train the deblurring generative adversarial network. During training, the network convolution kernels are 3 x 3, the batch size is 8, the initial learning rates of the generator and the discriminator are both 0.01, and training is carried out on two GTX 1080 Ti graphics cards with the Adam optimizer. The network training process is as follows:
Initialization: set the initial learning rates ρ_G and ρ_D to 0.01 and the loss function weights k_n (n = 1, 2, 3);
Update the generator G: sample N pairs from the training set, (x, y) = {(x_1, y_1), …, (x_N, y_N)};
Update the parameters of G by gradient descent on the total loss:
θ_G ← θ_G − ρ_G · ∇_{θ_G} (1/N) · Σ_{i=1}^{N} L_all(x_i, y_i)
Train the discriminator D several times, updating the parameters of D by gradient ascent on the adversarial objective:
θ_D ← θ_D + ρ_D · ∇_{θ_D} (1/N) · Σ_{i=1}^{N} [log D(x_i, y_i) + log(1 − D(x_i, G(x_i)))]
A trained generative adversarial network model is obtained after training;
(3.4) Deblur the blurred image: input the blurred image into the trained generative adversarial network to obtain a deblurred sharp image.
4. The anti-blur unmanned vehicle target tracking method according to claim 1, wherein the fourth step is realized by the following sub-steps:
and (4.1) loading a base model pre-trained by a target detection algorithm based on a single neural network.
And (4.2) inputting the obtained clear image into a target detection algorithm based on a single neural network to realize primary target identification.
And (4.3) the labels of the screened targets are pedestrians and vehicles which are used as targets for multi-target tracking.
5. The anti-blur unmanned vehicle target tracking method according to claim 1, wherein the step five is realized by the following sub-steps:
and (5.1) loading a pre-trained re-recognition model, and distinguishing different pedestrians or vehicles in the detected target.
And (5.2) finding the optimal matching solution of a plurality of targets of the front frame and the rear frame by using the Hungarian algorithm, and associating the detected targets with the historical target tracks.
CN202011460523.6A 2020-12-11 2020-12-11 Anti-blur unmanned vehicle multi-target tracking method based on generative adversarial network Pending CN112446436A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011460523.6A CN112446436A (en) 2020-12-11 2020-12-11 Anti-blur unmanned vehicle multi-target tracking method based on generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011460523.6A CN112446436A (en) 2020-12-11 2020-12-11 Anti-blur unmanned vehicle multi-target tracking method based on generative adversarial network

Publications (1)

Publication Number Publication Date
CN112446436A true CN112446436A (en) 2021-03-05

Family

ID=74740344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011460523.6A Pending CN112446436A (en) 2020-12-11 2020-12-11 Anti-blur unmanned vehicle multi-target tracking method based on generative adversarial network

Country Status (1)

Country Link
CN (1) CN112446436A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113095135A (en) * 2021-03-09 2021-07-09 武汉理工大学 System, method, device and medium for beyond-the-horizon target detection based on GAN
CN113313012A (en) * 2021-05-26 2021-08-27 北京航空航天大学 Dangerous driving behavior identification method based on convolution generation countermeasure network
CN114663318A (en) * 2022-05-25 2022-06-24 江西财经大学 Fundus image generation method and system based on generation countermeasure network
CN114820389A (en) * 2022-06-23 2022-07-29 北京科技大学 Face image deblurring method based on unsupervised decoupling representation
WO2023272414A1 (en) * 2021-06-28 2023-01-05 华为技术有限公司 Image processing method and image processing apparatus
CN115731516A (en) * 2022-11-21 2023-03-03 国能九江发电有限公司 Behavior recognition method and device based on target tracking and storage medium
CN116523754A (en) * 2023-05-10 2023-08-01 广州民航职业技术学院 Method and system for enhancing quality of automatically-identified image of aircraft skin damage
CN113298007B (en) * 2021-06-04 2024-05-03 西北工业大学 Small sample SAR image target recognition method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106710653A (en) * 2016-12-05 2017-05-24 浙江大学 Real-time data abnormal diagnosis method for monitoring operation of nuclear power unit
CN111428575A (en) * 2020-03-02 2020-07-17 武汉大学 Tracking method for fuzzy target based on twin network
CN111488795A (en) * 2020-03-09 2020-08-04 天津大学 Real-time pedestrian tracking method applied to unmanned vehicle
CN111914664A (en) * 2020-07-06 2020-11-10 同济大学 Vehicle multi-target detection and track tracking method based on re-identification

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106710653A (en) * 2016-12-05 2017-05-24 浙江大学 Real-time data abnormal diagnosis method for monitoring operation of nuclear power unit
CN111428575A (en) * 2020-03-02 2020-07-17 武汉大学 Tracking method for fuzzy target based on twin network
CN111488795A (en) * 2020-03-09 2020-08-04 天津大学 Real-time pedestrian tracking method applied to unmanned vehicle
CN111914664A (en) * 2020-07-06 2020-11-10 同济大学 Vehicle multi-target detection and track tracking method based on re-identification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘创: "Research on Multi-Target Fault-Tolerant Tracking and Trajectory Prediction for Unmanned Vehicles", China Excellent Master's and Doctoral Dissertations Full-text Database (Master's), Engineering Science and Technology II *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113095135A (en) * 2021-03-09 2021-07-09 武汉理工大学 System, method, device and medium for beyond-the-horizon target detection based on GAN
CN113313012A (en) * 2021-05-26 2021-08-27 北京航空航天大学 Dangerous driving behavior identification method based on convolution generation countermeasure network
CN113298007B (en) * 2021-06-04 2024-05-03 西北工业大学 Small sample SAR image target recognition method
WO2023272414A1 (en) * 2021-06-28 2023-01-05 华为技术有限公司 Image processing method and image processing apparatus
CN114663318A (en) * 2022-05-25 2022-06-24 江西财经大学 Fundus image generation method and system based on generation countermeasure network
CN114663318B (en) * 2022-05-25 2022-08-30 江西财经大学 Fundus image generation method and system based on generation countermeasure network
CN114820389A (en) * 2022-06-23 2022-07-29 北京科技大学 Face image deblurring method based on unsupervised decoupling representation
CN114820389B (en) * 2022-06-23 2022-09-23 北京科技大学 Face image deblurring method based on unsupervised decoupling representation
CN115731516A (en) * 2022-11-21 2023-03-03 国能九江发电有限公司 Behavior recognition method and device based on target tracking and storage medium
CN116523754A (en) * 2023-05-10 2023-08-01 广州民航职业技术学院 Method and system for enhancing quality of automatically-identified image of aircraft skin damage

Similar Documents

Publication Publication Date Title
CN112446436A (en) Anti-blur unmanned vehicle multi-target tracking method based on generative adversarial network
US10984532B2 (en) Joint deep learning for land cover and land use classification
EP3614308B1 (en) Joint deep learning for land cover and land use classification
Bautista et al. Convolutional neural network for vehicle detection in low resolution traffic videos
CN113223059B (en) Weak and small airspace target detection method based on super-resolution feature enhancement
CN106683119B (en) Moving vehicle detection method based on aerial video image
CN109767454B (en) Unmanned aerial vehicle aerial video moving target detection method based on time-space-frequency significance
Sharma et al. Single image defogging using deep learning techniques: past, present and future
CN111046880A (en) Infrared target image segmentation method and system, electronic device and storage medium
CN110827332B (en) Convolutional neural network-based SAR image registration method
CN109345474A (en) Image motion based on gradient field and deep learning obscures blind minimizing technology
CN108804992B (en) Crowd counting method based on deep learning
CN114897728A (en) Image enhancement method and device, terminal equipment and storage medium
CN112560717A (en) Deep learning-based lane line detection method
CN112418149A (en) Abnormal behavior detection method based on deep convolutional neural network
Malav et al. DHSGAN: An end to end dehazing network for fog and smoke
Kadim et al. Deep-learning based single object tracker for night surveillance.
CN113129336A (en) End-to-end multi-vehicle tracking method, system and computer readable medium
Babu et al. An efficient image dahazing using Googlenet based convolution neural networks
CN110751670A (en) Target tracking method based on fusion
CN113627481A (en) Multi-model combined unmanned aerial vehicle garbage classification method for smart gardens
Abraham et al. A fuzzy based road network extraction from degraded satellite images
KR et al. Moving vehicle identification using background registration technique for traffic surveillance
Kim et al. Unsupervised moving object segmentation and recognition using clustering and a neural network
Yufeng et al. Research on SAR image change detection algorithm based on hybrid genetic FCM and image registration

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210305

WD01 Invention patent application deemed withdrawn after publication