CN113269118B - Monocular vision forward vehicle distance detection method based on depth estimation - Google Patents

Monocular vision forward vehicle distance detection method based on depth estimation

Info

Publication number
CN113269118B
CN113269118B (application CN202110633046.7A)
Authority
CN
China
Prior art keywords
forward vehicle
categories
vehicle distance
target
distance detection
Prior art date
Legal status: Active
Application number
CN202110633046.7A
Other languages
Chinese (zh)
Other versions
CN113269118A (en)
Inventor
赵敏
孙棣华
周璇
Current Assignee
Chongqing University
Original Assignee
Chongqing University
Priority date
Filing date
Publication date
Application filed by Chongqing University
Priority to CN202110633046.7A
Publication of CN113269118A
Application granted
Publication of CN113269118B
Status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques with a fixed number of clusters, e.g. K-means clustering
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58: Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/584: Recognition of vehicle lights or traffic lights
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems


Abstract

The invention discloses a monocular vision forward vehicle distance detection method based on depth estimation, comprising the following steps: Step 1, building a forward vehicle distance detection model based on depth estimation; Step 2, introducing the DORN algorithm and building a DORN-based forward vehicle distance detection model; Step 3, optimizing the target key point fitting method; Step 4, designing the loss function used in network training; and Step 5, accelerating the forward vehicle distance detection model with a model compression and acceleration tool. The method can predict the forward vehicle distance efficiently and accurately.

Description

Monocular vision forward vehicle distance detection method based on depth estimation
Technical Field
The invention relates to a monocular vision forward vehicle distance detection method based on depth estimation.
Background
Vehicle distance detection is an important part of environment perception in an intelligent driving system: by detecting the motion trends of vehicles and pedestrians ahead, collisions that may occur while driving can be predicted in real time, so it plays a very important role. However, forward vehicle distance detection faces great challenges in real traffic scenes from vehicle-to-vehicle occlusion, vehicle-to-environment occlusion, the complexity and variability of unstructured roads, and pitch angle changes during driving. Detecting the distance to the forward vehicle rapidly and accurately is therefore a difficult research point for intelligent driving systems. According to the detection mode, distance detection methods can be divided mainly into electromagnetic ranging, ultrasonic ranging, visual ranging, and the like. At present, ranging methods based on active sensors such as millimeter-wave radar and lidar are expensive, limited in scanning range and speed, and easily disturbed by external signals [3]. Vision-based distance detection has the advantages of low cost, convenient installation and debugging, and rich acquired information, and thus a great application prospect. Vision-based methods can in turn be divided, by the number of cameras, into monocular, binocular, and multi-camera detection. Monocular ranging offers convenient equipment installation and debugging, low computing resource consumption, and good dynamic real-time performance, and therefore has a good application prospect.
Existing mainstream monocular distance detection methods generally estimate the forward vehicle distance from similar-triangle geometry combined with camera parameter matching. However, such methods need geometric information about the obstacle, so some of them cannot judge non-standard obstacles; camera parameter matching is also difficult; pitching and rolling during driving and unstructured road scenes are not fully considered; and the methods suffer from short effective ranging distance, complex operation, and large calculation errors. With the great advance of computer technology, artificial intelligence algorithms based on deep learning have been widely applied in industry with good results, and some experts and scholars have applied deep learning to monocular visual distance detection to improve its performance. However, most existing research performs depth estimation on all pixels of the RGB image and does not carry out distance detection for a specific target in the traffic scene; moreover, obvious occlusion and environment changes in real traffic lead to large errors in distance detection for a specific target. Among the few deep-learning algorithms that do detect the distance of a specific target, the vehicle distance is regressed directly with an end-to-end distance regression algorithm, but this easily loses spatial information, damaging the spatial structure and affecting depth prediction accuracy to some extent.
Disclosure of Invention
In view of this analysis and the defects of the prior art, the invention builds a forward vehicle distance detection model using related deep learning algorithms from the field of computer vision, proposes corresponding optimization strategies, and realizes forward vehicle distance detection based on monocular vision.
Specifically, monocular forward vehicle distance detection is realized from three aspects: model building, target key point fitting, and loss function design. In addition, considering the practical application requirements of the model, the invention adopts the TensorRT tool to optimize the model and improve its detection speed.
The technical method provided by the invention comprises the following five steps:
Step one: building a forward vehicle distance detection model based on depth estimation, mainly comprising the following three parts:
1) Determining the input and the output of a forward vehicle distance detection model;
2) Selecting a convolution neural network for extracting image features;
3) Designing a vehicle target key point fitting method.
Step two: introducing the DORN algorithm and building a DORN-based forward vehicle distance detection model, mainly comprising the following four parts:
1) Replacing a common feature extraction network with a dense feature extractor;
2) A scene understanding module is added to realize the comprehensive understanding of the network to the input image;
3) Dividing the discrete depth values into a plurality of categories with an ordinal regression module to convert the regression problem into a classification problem;
4) Selecting a fitting method for the vehicle target key points.
Step three: optimizing the target key point fitting method, mainly comprising the following two parts:
1) Introducing a k-means clustering algorithm to realize the fitting of the key points of the vehicle targets;
2) Improving the fitting accuracy of the vehicle target key points through effective parameter configuration.
Step four: designing the loss function used in network training, mainly comprising the following two parts:
1) Designing a regression loss function of a target key point by using an L1 norm loss function;
2) Combining the ordinal regression function to realize the network training regression.
Step five: using a model compression and acceleration tool to accelerate the forward vehicle distance detection model, mainly comprising the following two parts:
1) Processing data which cannot be directly converted in a network;
2) Converting the forward vehicle distance detection model into a TensorRT model.
Advantageous effects:
the forward vehicle distance detection based on monocular vision is realized from three aspects of model building, target key point fitting and loss function design. On the other hand, in consideration of the practical application requirement of the forward vehicle distance detection model, the TensorRT tool is adopted to optimize the model, so that the detection speed of the model is improved, and the detection speed is improved from 0.0284s of each graph on average to 0.0003s of each graph on average, wherein the error is about 1.5259e-05.
Drawings
FIG. 1 is a schematic flow diagram of a DORN-based forward vehicle distance detection model;
FIG. 2 is a schematic structural diagram of a DORN-based forward vehicle distance detection model.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to elements that are the same or similar or have the same or similar functions throughout. The embodiments described below with reference to the drawings are illustrative, intended only to explain the present application, and are not to be construed as limiting it. On the contrary, the embodiments of the application include all changes, modifications, and equivalents coming within the spirit and scope of the appended claims.
Example 1: as shown in Figures 1 and 2, this embodiment provides a monocular vision forward vehicle distance detection method based on depth estimation, which comprises the following five steps:
Step one: a forward vehicle distance detection model based on depth estimation is built, mainly comprising the following three parts:
1) First, the forward vehicle distance detection model is built. The overall model structure is divided into three parts: input, intermediate processing, and output. The input part comprises the RGB original image, the vehicle target frame coordinates, and the depth map: the RGB original image is the input value that the whole network analyses and detects; the depth map is the real value compared with the predicted value during network training and learning; and the vehicle target frame coordinates are needed to finally display the output map of the vehicle target. The intermediate processing part comprises feature extraction, pooling, and regression, the process the network goes through to learn and predict. In addition, a key point fitting part is added, which obtains the target key point information from the predicted depth map. The output part is an RGB map with the vehicle target frame and distance values.
2) The feature extraction network is a convolutional neural network; VGG16 or ResNet50 is usually used as the basic feature extraction network in the field of computer vision.
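As a small illustration (a sketch, not the patent's implementation), either backbone can be obtained from torchvision and stripped of its classification head; PyTorch is the framework the patent names in step five:

```python
import torch.nn as nn
import torchvision

# ResNet50 trunk as the basic feature extractor: keep the convolutional
# stages, drop the average-pooling and fully-connected classification head.
resnet = torchvision.models.resnet50()
backbone = nn.Sequential(*list(resnet.children())[:-2])
# For a 3x224x224 input, backbone(x) yields a 2048-channel feature map.
```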
3) In the key point fitting part, a vehicle target key point fitting method is designed.
Step two: the DORN algorithm is introduced and a DORN-based forward vehicle distance detection model is built, mainly comprising the following four parts:
1) The dense feature extractor module of the DORN network is introduced as the feature extraction network: the last downsampling operators in the feature extractor's DCNNs are removed, and spaces are inserted into the subsequent convolution layers for filtering, which expands the filter's field of view without reducing the spatial resolution or increasing the number of parameters, forming dilated (atrous) convolutions.
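As an illustration of this trade-off (a minimal PyTorch sketch, not the patent's exact network; channel counts and input size are assumptions): a stride-2 convolution halves the feature map, while a dilated convolution with the stride removed keeps the resolution and still enlarges the receptive field.

```python
import torch
import torch.nn as nn

# Stride-2 convolution: downsamples, shrinking the spatial resolution.
downsample = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)
# Dilated convolution: stride removed and spaces inserted into the filter
# (dilation=2), so resolution is kept while the receptive field grows.
dilated = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=2, dilation=2)

x = torch.randn(1, 256, 64, 64)
print(downsample(x).shape)  # torch.Size([1, 256, 32, 32])
print(dilated(x).shape)     # torch.Size([1, 256, 64, 64])
```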
2) A scene understanding module is introduced, consisting of three parallel components: an atrous spatial pyramid pooling (ASPP) module, a cross-channel learner, and a full-image encoder. The ASPP extracts features from several larger receptive fields through dilated convolutions with dilation rates of 6, 12, and 18; the convolution branch with a 1 × 1 kernel acts as the cross-channel learner and can learn complex cross-channel interactions; and the full-image encoder captures global context information, reducing local aliasing problems.
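For illustration, a minimal PyTorch sketch of such a three-branch block is given below; the channel counts, the projection layer, and the bilinear upsampling of the global branch are assumptions of this sketch, not details fixed by the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneUnderstanding(nn.Module):
    """Sketch: dilated branches at rates 6/12/18, a 1x1 cross-channel
    branch, and a global (full-image) pooling branch, concatenated and
    projected back to out_ch channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch1x1 = nn.Conv2d(in_ch, out_ch, 1)
        self.branch6 = nn.Conv2d(in_ch, out_ch, 3, padding=6, dilation=6)
        self.branch12 = nn.Conv2d(in_ch, out_ch, 3, padding=12, dilation=12)
        self.branch18 = nn.Conv2d(in_ch, out_ch, 3, padding=18, dilation=18)
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(5 * out_ch, out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        g = F.interpolate(self.global_branch(x), size=(h, w),
                          mode="bilinear", align_corners=False)
        feats = [self.branch1x1(x), self.branch6(x),
                 self.branch12(x), self.branch18(x), g]
        return self.project(torch.cat(feats, dim=1))
```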
3) An ordinal regression module is introduced to divide the discrete depth values into multiple categories, converting the regression problem into a classification problem; a Softmax function is finally applied to obtain the regression loss values used to train the network. Because the depth values are treated as an ordered set, they exhibit strong ordinal correlation, and an ordinal loss is used to learn the network parameters. Let $\chi = \Phi(I, \varphi)$ be the feature map of size $W \times H \times C$ obtained from the input image $I$, where $\varphi$ denotes the parameters involved in both the dense feature extractor and the scene understanding module. Let $\mathcal{Y} = \psi(\chi, \Theta)$, of size $W \times H \times 2K$, be the ordinal output at each spatial position, where $\Theta = (\theta_0, \theta_1, \ldots, \theta_{2K-1})$ contains the weight vectors. Here $2K$ is the number of convolution channels: each of the $K$ depth thresholds receives a pair of binary-classification scores, the target being 1 when the quantized ground-truth value is larger than the threshold $k$ and 0 when it is not. Let $l(w,h) \in \{0, 1, \ldots, K-1\}$ be the discrete label generated by SID (spacing-increasing discretization) at spatial position $(w,h)$. The ordinal loss value $\mathcal{L}(\chi, \Theta)$ is defined as the average of the pixel-level ordinal loss $\Psi(w, h, \chi, \Theta)$ over the entire image domain:

$$\mathcal{L}(\chi, \Theta) = -\frac{1}{N} \sum_{w=0}^{W-1} \sum_{h=0}^{H-1} \Psi(w, h, \chi, \Theta)$$

$$\Psi(w, h, \chi, \Theta) = \sum_{k=0}^{l(w,h)-1} \log\!\left(\mathcal{P}^{k}_{(w,h)}\right) + \sum_{k=l(w,h)}^{K-1} \log\!\left(1 - \mathcal{P}^{k}_{(w,h)}\right)$$

$$\mathcal{P}^{k}_{(w,h)} = P\!\left(\hat{l}(w,h) > k \mid \chi, \Theta\right)$$

where $N = W \times H$, $\hat{l}(w,h)$ is the estimated discrete value decoded from $y(w,h)$, and $\mathcal{P}$ is the predicted probability. $\Psi$ sums the logarithms of the per-threshold probabilities at a pixel, split into a front part running from $0$ up to the true label and a rear part running from the true label to the farthest bin $K$; the rear part uses $(1 - \mathcal{P})$ because the output $y$ used to compute the probability encodes the event that the true quantized value is larger than $k$. Furthermore, the regression probability $\mathcal{P}^{k}_{(w,h)}$ is calculated from $y(w,h,2k)$ and $y(w,h,2k+1)$ with the Softmax function:

$$\mathcal{P}^{k}_{(w,h)} = \frac{e^{y(w,h,2k+1)}}{e^{y(w,h,2k)} + e^{y(w,h,2k+1)}}$$

where $k$ ranges from $0$ to $K-1$ and the channel $y(w,h,2k+1)$ scores the event that the quantized ground-truth depth value exceeds $k$. Minimizing the ordinal loss value $\mathcal{L}(\chi, \Theta)$ ensures that predicted values farther from the true value label obtain a lower score than those closer to it.
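A minimal PyTorch sketch of this ordinal loss follows the formulas above; the tensor layout (channel pairs $(2k, 2k+1)$ per threshold) and the epsilon guard are assumptions of this sketch rather than details fixed by the patent:

```python
import torch
import torch.nn.functional as F

def ordinal_loss(y, labels):
    """Sketch of the ordinal loss above. y: ordinal output of shape
    (N, 2K, H, W); labels: discrete SID labels of shape (N, H, W) with
    values in {0, ..., K-1}."""
    n, c, h, w = y.shape
    k = c // 2
    # Softmax over each (y_2k, y_2k+1) pair gives P(l > k) per threshold k.
    p_gt = F.softmax(y.view(n, k, 2, h, w), dim=2)[:, :, 1]   # (N, K, H, W)
    # Binary targets: 1 where the true label exceeds threshold k, else 0.
    ks = torch.arange(k, device=y.device).view(1, k, 1, 1)
    target = (labels.unsqueeze(1) > ks).float()               # (N, K, H, W)
    # Front part uses log(P), rear part log(1 - P); average over pixels.
    eps = 1e-8
    per_pixel = (target * torch.log(p_gt + eps)
                 + (1 - target) * torch.log(1 - p_gt + eps)).sum(dim=1)
    return -per_pixel.mean()
```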
4) A preliminary fitting rule is selected: the coordinates at half the length and width of the target frame are taken as the threshold, and the pixel values within the threshold are averaged to give the key point distance value drawn by the model, according to the formula:

$$D = \frac{1}{N} \sum_{w=0}^{W} \sum_{h=0}^{H} d(w, h)$$

where $w$ and $h$ respectively represent the abscissa and ordinate of a pixel, $W$ and $H$ are the thresholds after the target frame is reduced, and $N$ represents the number of pixels contained in the threshold range; the target distance value is fitted from this average.
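For illustration, the preliminary rule might be sketched as follows, assuming a predicted depth map tensor and a target box in pixel coordinates (all names are illustrative):

```python
import torch

def box_keypoint_distance(depth, box, shrink=0.5):
    """Sketch of the preliminary rule: shrink the target frame to half
    its width and height around its center and average the predicted
    depth values inside the shrunken window.
    depth: (H, W) predicted depth map; box: (x1, y1, x2, y2) pixels."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w, half_h = (x2 - x1) * shrink / 2.0, (y2 - y1) * shrink / 2.0
    window = depth[int(cy - half_h):int(cy + half_h) + 1,
                   int(cx - half_w):int(cx + half_w) + 1]
    return window.mean()
```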
Step three: the target key point fitting method is optimized, mainly comprising the following two parts:
1) A k-means clustering algorithm is introduced to fit the vehicle target key points. The pixels within the predicted target frame are clustered, the two categories ranked first and second by pixel count are obtained, and the two are analyzed. When the count of the first category is more than 1.5 times that of the second, the first category is selected as the final category if its center point distance value is below the 80 m threshold; otherwise, if the center value of the first category exceeds the 80 m threshold, the second category is selected as the final category. When the count of the first category is not more than 1.5 times that of the second, the category with the smaller center point distance value is taken as the final category, which separates the target to be measured from the environmental interference around it. The center point of the final category is then taken as the target key point, and its distance value is selected as the predicted value of the forward vehicle target distance in the experiments.
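A sketch of this clustering rule, using scikit-learn's KMeans on the depth values inside the cropped target frame; the function name and tie-breaking details are assumptions of this sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_keypoint_distance(depths, k=4, ratio=1.5, far_threshold=80.0):
    """Cluster depth values from the cropped target frame, compare the
    two largest clusters, and pick the final cluster per the 1.5x-count
    and 80 m rules described above."""
    depths = np.asarray(depths, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10).fit(depths)
    counts = np.bincount(km.labels_, minlength=k)
    first, second = np.argsort(counts)[::-1][:2]   # two largest clusters
    c1 = km.cluster_centers_[first, 0]
    c2 = km.cluster_centers_[second, 0]
    if counts[first] > ratio * counts[second]:
        # Dominant cluster wins unless its center is beyond 80 m.
        return c1 if c1 < far_threshold else c2
    # Comparable sizes: take the nearer cluster, separating the target
    # from surrounding environmental interference.
    return min(c1, c2)
```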
2) Effective parameter configuration is carried out: the cropping ratio of the target frame and the value of K are determined through several groups of comparison tests, the final typical settings being K = 4 and a cropping ratio of 1/2, which improves the fitting accuracy of the vehicle target key points.
Step four: the loss function used in network training is designed, mainly comprising the following two parts:
1) The regression loss function of the target key point is designed with the L1 norm loss. Let the center point regression loss be $L_d$, let $D(i)$ be the distance value output by the original network for the $i$-th vehicle target, let $D$ be the true center point distance value, and let $N_2$ be the number of target frames in the image; the center point regression loss is then:

$$L_d = \frac{1}{N_2} \sum_{i=1}^{N_2} \mathrm{Smooth}_{L1}\bigl(D(i) - D\bigr)$$

The L1 norm loss, also called the minimum absolute value error, minimizes the sum of absolute differences between the target and estimated values. $\mathrm{Smooth}_{L1}$ denotes the smoothed L1 norm loss, which removes the instability caused by the break point in the L1 loss curve:

$$\mathrm{Smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

When the input $x$ is greater than 1 or less than $-1$, the derivative has magnitude 1, so outliers are suppressed at the initial stage of network training and gradient explosion is avoided; when $x$ lies between $-1$ and $1$, the derivative grows linearly between $-1$ and $1$, giving a stable transition that promotes convergence.
2) Combined with the ordinal regression function, the forward vehicle distance detection network model is trained by regression, with the total loss function:

$$L = \lambda_1 \mathcal{L}(\chi, \Theta) + \lambda_2 L_d$$

where $\lambda_1$ and $\lambda_2$ are user-defined parameters set during network training. By minimizing the total loss $L$, the model ensures that predictions closer to the true value label obtain a higher score than test values farther away; the loss is minimized iteratively with the SGD (stochastic gradient descent) algorithm.
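Putting the two terms together, a minimal training-objective sketch (assuming the ordinal_loss sketch above; PyTorch's built-in smooth_l1_loss implements the Smooth L1 formula with the same break point at 1, and the lambda values here are illustrative):

```python
import torch
import torch.nn.functional as F

lambda1, lambda2 = 1.0, 1.0  # illustrative weights, tuned during training

def total_loss(y_ord, labels, pred_dist, true_dist):
    l_ord = ordinal_loss(y_ord, labels)           # depth classification term
    l_d = F.smooth_l1_loss(pred_dist, true_dist)  # key point regression term
    return lambda1 * l_ord + lambda2 * l_d

# Iterative minimization with stochastic gradient descent, as in the text:
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```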
Step five: a model compression and acceleration tool is used to accelerate the forward vehicle distance detection model, mainly comprising the following two parts:
1) The model output in the network cannot be in dictionary form, so outputs containing dictionary types are converted into an operable tensor form.
2) When the forward vehicle distance detection model is built, the deep learning framework used is PyTorch. In a Python environment, a PyTorch model can be converted to TensorRT by two methods: converting the .pt model to ONNX and then to TensorRT, or converting the .pt model to TensorRT directly. Because the ONNX model conversion supports only a fixed input size, while the torch2trt library can be imported and used directly, the invention adopts the method of direct conversion to a TensorRT model by means of the torch2trt library.
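A minimal sketch of the direct conversion route via torch2trt (the NVIDIA-AI-IOT library the text names); a ResNet50 backbone stands in for the full detection model here, whose dictionary-type outputs must first be converted to plain tensors as described in part 1):

```python
import torch
import torchvision
from torch2trt import torch2trt  # NVIDIA-AI-IOT/torch2trt library

model = torchvision.models.resnet50().eval().cuda()  # stand-in model
x = torch.ones((1, 3, 224, 224)).cuda()  # illustrative input size
model_trt = torch2trt(model, [x])        # builds the TensorRT engine
y_trt = model_trt(x)                     # inference now runs through TensorRT
torch.save(model_trt.state_dict(), 'model_trt.pth')
```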
Finally, the above examples are intended only to illustrate the technical solution of the present invention and not to limit it, and although the present invention has been described in detail with reference to preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention defined by the appended claims.

Claims (4)

1. A monocular vision forward vehicle distance detection method based on depth estimation, characterized in that the method comprises the following steps:
s1, building a forward vehicle distance detection model based on depth estimation;
the structure of the forward vehicle distance detection model is divided into three parts: the method comprises the steps of inputting, intermediate processing and outputting, wherein the input part comprises RGB (red, green and blue) original images, vehicle target frame coordinates and a depth map, the RGB original images are input values used for analyzing and detecting the whole network, the depth map is a real value used for comparing with a predicted value and training and learning by the network, and the vehicle target frame coordinates are coordinates required for finally displaying an output map of a vehicle target; the intermediate processing part comprises feature extraction, pooling, regression and key point fitting, wherein the key point fitting is to acquire the information of target key points from the predicted depth map; the output part is an RGB graph with a vehicle target frame and distance values;
s2, introducing a DORN algorithm, and building a forward vehicle distance detection model based on the DORN;
s2.1, introducing a dense feature extractor module in the DORN network as a feature extraction network, removing the last downsampling operators in the DCNNs of the feature extractor, and inserting spaces into a subsequent convolution layer for filtering, so that the visual field of a filter is expanded on the basis of not reducing the spatial resolution or increasing the number of parameters, and an expanded convolution is formed;
s2.2, a scene understanding module is introduced, wherein the scene understanding module consists of three parallel components, namely an empty space convolution pooling pyramid (ASPP) module, a cross-channel leaner and a fullimage full-image encoder;
s2.3, introducing an ordinal regression module, dividing the discrete depth values into multiple categories, converting the regression problem into a classification problem, and adopting a Softmax function at the end of the network for regression loss values to realize network training;
s2.4 vehicle target key point fitting method, selecting the coordinate of half of the length and width of a target frame as a threshold, taking the average of pixel values in the threshold as a key point distance value drawn by a model, wherein the formula is as follows:
Figure FDA0003815622970000011
in the formula, W and H respectively represent the abscissa and the ordinate of a pixel, W and H are thresholds after the target frame is reduced, N represents the number of pixels contained in the threshold range, and therefore a target distance value is fitted;
s3, optimizing a target key point fitting method;
s3.1, introducing a k-means clustering algorithm to realize the fitting of the key points of the vehicle target; clustering pixels in a prediction target frame to obtain a first category and a second category which are arranged in front of each other in quantity, and then analyzing: when the number of the first categories is larger than 1.5 times of that of the second categories, if the distance value of the center points of the first categories is smaller than a threshold value 80m, the first categories are selected as final categories, and if the center value of the first categories is larger than the threshold value 80m, the second categories are selected as final categories; when the difference between the number of the first categories and the number of the second categories is not 1.5 times, the central point with a small distance value is taken as a final category, then the central point of the final category is taken as a key point of a target, and the distance value of the target key point is selected as a predicted value of the forward vehicle target distance;
s3.2, improving the fitting precision of the key points of the vehicle target through parameter configuration;
s4, designing a loss function in network training;
and S5, the acceleration of the forward vehicle distance detection model is realized by utilizing a model compression acceleration tool.
2. The depth estimation-based monocular vision forward vehicle distance detection method of claim 1, characterized in that: in S1, the feature extraction network in the intermediate processing part is a convolutional neural network, with VGG16 or ResNet50 used as the feature extraction network.
3. The depth estimation-based monocular vision forward vehicle distance detection method of claim 1, characterized in that S4 comprises the following steps: 1) designing the regression loss function of the target key points with the L1 norm loss function; 2) combining the ordinal regression function to realize the network training regression.
4. The depth estimation-based monocular vision forward vehicle distance detection method of claim 1, characterized in that S5 comprises the following steps: 1) converting the outputs of data that cannot be directly converted in the network into an operable tensor form; 2) converting the forward vehicle distance detection model into a TensorRT model.
CN202110633046.7A 2021-06-07 2021-06-07 Monocular vision forward vehicle distance detection method based on depth estimation Active CN113269118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110633046.7A CN113269118B (en) 2021-06-07 2021-06-07 Monocular vision forward vehicle distance detection method based on depth estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110633046.7A CN113269118B (en) 2021-06-07 2021-06-07 Monocular vision forward vehicle distance detection method based on depth estimation

Publications (2)

Publication Number  Publication Date
CN113269118A (en)   2021-08-17
CN113269118B (en)   2022-10-11

Family

Family ID: 77234468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110633046.7A Active CN113269118B (en) 2021-06-07 2021-06-07 Monocular vision forward vehicle distance detection method based on depth estimation

Country Status (1)

Country Link
CN (1) CN113269118B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115731436B (en) * 2022-09-21 2023-09-26 东南大学 Highway vehicle image retrieval method based on deep learning fusion model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801074A (en) * 2021-04-15 2021-05-14 速度时空信息科技股份有限公司 Depth map estimation method based on traffic camera

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108759667B (en) * 2018-05-29 2019-11-12 福州大学 Front truck distance measuring method under vehicle-mounted camera based on monocular vision and image segmentation
CN109029363A (en) * 2018-06-04 2018-12-18 泉州装备制造研究所 A kind of target ranging method based on deep learning
CN109461178A (en) * 2018-09-10 2019-03-12 中国科学院自动化研究所 A kind of monocular image depth estimation method and device merging sparse known label
CN109509223A (en) * 2018-11-08 2019-03-22 西安电子科技大学 Front vehicles distance measuring method based on deep learning
CN109766769A (en) * 2018-12-18 2019-05-17 四川大学 A kind of road target detection recognition method based on monocular vision and deep learning
US10984543B1 (en) * 2019-05-09 2021-04-20 Zoox, Inc. Image-based depth data and relative depth data
US12026899B2 (en) * 2019-07-22 2024-07-02 Toyota Motor Europe Depth maps prediction system and training method for such a system
CN110706271B (en) * 2019-09-30 2022-02-15 清华大学 Vehicle-mounted vision real-time multi-vehicle-mounted target transverse and longitudinal distance estimation method
CN111292366B (en) * 2020-02-17 2023-03-10 华侨大学 Visual driving ranging algorithm based on deep learning and edge calculation

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801074A (en) * 2021-04-15 2021-05-14 速度时空信息科技股份有限公司 Depth map estimation method based on traffic camera

Also Published As

Publication number Publication date
CN113269118A (en) 2021-08-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant