CN116311113A

CN116311113A - Driving environment sensing method based on vehicle-mounted monocular camera

Info

Publication number: CN116311113A
Application number: CN202310093603.XA
Authority: CN
Inventors: 朱宗卫; 魏冉; 王超; 周学海; 李曦
Original assignee: Suzhou Institute Of Higher Studies University Of Science And Technology Of China
Current assignee: Suzhou Institute Of Higher Studies University Of Science And Technology Of China
Priority date: 2023-02-10
Filing date: 2023-02-10
Publication date: 2023-06-23

Abstract

The invention discloses a driving environment sensing method based on a vehicle-mounted monocular camera, which comprises the steps of carrying out structural re-parameterization and automatic driving multi-task sensing on an up-sampling module. Compared with the common linear interpolation and transpose convolution, the method has the advantage that the accuracy of the network model is improved to a certain extent by using the RepUPSAmple. On the task of the semantic segmentation model, compared with the precision performance of the deep Labv3 model, the FPN model and the U-Net model when different upsampling modules are used, the performance of the semantic segmentation network can be improved by using the RepUPSAmple as an upsampling method according to different network models, different upsampling positions and different network scales. The mIOU can be improved by 1.77% on average and 0.74% on average by P.A. and 1.16% by P.A. and 0.35% by transpose convolution compared to bilinear interpolation algorithm.

Description

Driving environment sensing method based on vehicle-mounted monocular camera

Technical Field

The invention relates to the field of automatic driving, in particular to a driving environment sensing method based on a vehicle-mounted monocular camera.

Background

In recent years, deep learning and computer vision technologies are rapidly emerging, and automatic driving technologies based on the deep learning and the computer vision technologies provide new solutions for improving traffic safety and efficiency. The automatic driving system is not tired, can strictly adhere to traffic rules, and has great potential in reducing accident occurrence rate. Autopilot incorporates a number of technologies such as artificial intelligence, communications, semiconductors, automobiles, and the like. The related industry chain is wide, the space of value that can be created is huge, and the technology has become the necessary place of competing for the cross-border of the automobile industry and the science and technology industry of various countries, and huge companies such as ***, tesla, general purpose, hundred degrees and the like develop the automatic driving technology in great numbers. Technological progress, policy pushing, huge head entry, capital entry, cost reduction, clear scene and other factors are comprehensively promoted, and after more than ten years of exploration and development of automatic driving technology, the automatic driving technology stands at key commercial landing nodes at present.

The automatic driving system is a comprehensive system integrating the functions of environment sensing, decision control, action execution and the like. The automatic driving system is huge and complex, and can be generally divided into three modules of perception, decision and control according to functions. Wherein the sensing module is defined as the collection and processing of environmental information and in-vehicle information, which senses information of the surrounding environment through various sensors. The method relates to multiple tasks such as road boundary detection, drivable area detection, traffic sign recognition, vehicle and pedestrian detection, road surface information perception and the like. The sensing module is equivalent to the basic stone of the whole system, and the decision system can perform correct judgment only if the sensing system provides accurate information. The robustness of the perception system directly affects the reliability of the overall autopilot system. The decision module can be understood to perform decision judgment according to the perception information, determine a proper working model, and assign a corresponding control strategy to replace a driver to make driving decisions. The execution module is used for controlling the vehicle by installing a decision result after the system makes a decision, each operation system of the vehicle needs to be connected with the decision system through a bus, and the installation instruction accurately controls driving actions such as an acceleration program, a braking program, a steering amplitude, light control and the like so as to realize autonomous control of the vehicle.

In the sensing module, if the cooperation of multiple sensors is needed to achieve accurate sensing, the existing methods can be divided into the following three types: a perception method based on monocular images, laser point cloud and millimeter wave radar. The laser radar and the millimeter wave radar both belong to active sensors, the detection signals are emitted to the periphery, the signals reflected by the object are received, and the azimuth and the distance of the object are calculated through comparison between the emitted signals and the received signals. The camera belongs to a passive sensor and senses an object by light reflection. The three sensors have different advantages and disadvantages. The laser radar can generate high-precision three-dimensional point cloud information, but the laser radar in the current stage has high cost and short service life, and a large amount of noise exists in rainy and snowy weather. Millimeter wave radar has small weather effect, is sensitive to moving objects, and has the defects of lower resolution and poorer discrimination capability. The monocular camera has the advantages of low cost, high resolution, and capability of providing rich visual information, but cannot acquire depth information. For safety reasons, autopilot vehicles are typically equipped with three types of sensors, which are used by advanced driving assistance systems (Advanced Driving Assistance System, ADAS) for environmental perception. However, on common civilian automobiles, the price for installing the sensors is too high, and the common monocular cameras have a wider hardware base. ADAS based on monocular cameras is therefore of greater practical value.

Disclosure of Invention

The invention aims at: the driving environment sensing method based on the vehicle-mounted monocular camera aims to solve the following problems:

an optimized neural network upsampling module: the depth neural network generally adopts transposed convolution or bilinear interpolation as an up-sampling module, the bilinear interpolation has no learnable parameters, the reasoning speed is high, the expression capacity is weak, and the transposed convolution has the learnable parameters and has stronger expression capacity. Note the identity between them, using multi-branched up-sampling structures in the training phase, while the lossless merging of multiple branches into a single branch in the reasoning phase, mentions the network capacity while not causing a loss of reasoning speed.

Autopilot multitasking awareness: the automatic driving sensing module needs to sense the surrounding environment by utilizing various sensors of the vehicle body, including target detection, road drivable area segmentation and lane line detection. The monocular camera has a wider hardware base because of low cost, and the monocular camera-based automatic driving perception system has higher practicability. Based on the image shot by the vehicle-mounted monocular camera, the automatic driving perception multitasking algorithm is designed by combining with the structural heavy parameterization up-sampling module, and a plurality of tasks are completed through one-time reasoning, so that the throughput speed of the whole system is improved, and the power consumption and the use of a memory are reduced.

The technical scheme of the invention is as follows:

a driving environment sensing method based on a vehicle-mounted monocular camera comprises the following steps:

s1, carrying out structural re-parameterization on an up-sampling module: comprising a training phase and an reasoning phase, wherein,

in the training stage, expanding one transposed convolution layer of an up-sampling module into multiple branches, wherein one branch uses a linear interpolation algorithm, and the other branches use transposed convolutions with different convolution kernel sizes;

in the reasoning stage, the multi-branch structure is subjected to re-parameterization and is converted into a single-branch structure in a lossless manner;

s2, automatic driving multitasking awareness: through a multi-task deep learning model, perception real-time reasoning based on a monocular camera is realized, and three tasks of target detection, road drivable region segmentation and lane line segmentation are completed.

Preferably, in the training stage in S1, the number of channels of the output feature map is changed by adding a convolution layer of 1*1 to the linear interpolation; batch normalizaiton is added after each upsampling branch to further improve model performance.

Preferably, in the reasoning stage in S1, for double up-sampling, three branches are respectively convolved with bilinear interpolation bins 1*1, 2 x 2 transpose convolutions, 4*4 transpose convolutions, and thereafter each concatenated batch normalization layers; inputting a feature map

After upsampling a feature map is obtained>

Convolution kernel of 4×4 transposed convolution

Bias->

Because of the transposed convolution, the first padding and then the convolution are needed, and the convolution kernel of the convolution process is +.>

The corresponding relation is shown in formula 1, and is transposed first

Is inverted +.>

Is the last two dimensions of (2);

after the convolution, the normalization is carried out through the BN layer, in the reasoning process, the BN layer parameters are fused by the convolution, and transposed convolution parameters after the fusion are shown as formulas 2 and 3:

wherein gamma, beta, sigma and mu respectively correspond to the weight, bias, variance and mean of BN layer to obtain weight W _4×4 Finally, obtaining the final transposed convolution weight through the transformation of the formula 3

Convolution kernel of 2 x 2 transpose convolution

The weight is obtained by filling the convolution kernel outer circle of 2×2 with zero value to 4×4 size and fusing the parameters of BN layer>

Bilinear interpolation is followed by 1*1 convolution, whose convolution kernel

Convolution kernel of a 1 x 1 convolution of the number of transform channels without bias>

Bias->

The obtained 4×4 transpose convolution is firstly subjected to parameter fusion with 1×1 convolution, and the obtained new weights are shown in formulas 4 and 5:

W _bilinear ←W _1×1 ×W _bilinear (4)

b _bilinear ←b _1×1 (5)

finally, againFusing BN layer parameters to obtain weights

After each of the three branches is independently transformed into a 4×4 transposed convolution, the weights and offsets of the three transposed convolutions are added to obtain the final result, as shown in equations 6, 7:

and through structural re-parameterization, the multi-branch complex structure during training is compressed in a lossless manner.

Preferably, in S2, the overall structure of the multi-task deep learning model adopts an encoding-decoding network structure, and for three tasks, different decoding header networks are used respectively, and the encoding network is shared; the coding network is divided into a backhaul network and a back network according to different actions and positions, the backhaul network directly receives input from a camera and is used for mining shallow characteristic information of the network, the back network receives the characteristic information from the backhaul network, further characteristic fusion and characteristic extraction are carried out, deeper characteristic information is obtained, and the deeper characteristic information is transmitted to different decoding networks.

Preferably, the Backbone network adopts ResNet, and images shot by the monocular camera are scaled to 640 x 340, and feature images with different sizes are obtained through a multi-layer residual structure.

Preferably, the Neck network adopts an improved BiF PN structure, the Bi FPN structure comprises a plurality of upsampling operations, and upsampling in the Bi FPN is replaced by a repa psample capable of being re-parameterized;

a plurality of Bi FPNs are connected to form a Neck structure of the whole perception network, the characteristic diagram input into the Neck is P3, P4 and P5 correspondingly, P3-P7 with deeper characteristics is obtained after the first Bi FPN, and the P3-P7 is input into the next Bi FPN structure; the output result of the Neck network is obtained through the series connection of 4 BiFPN structures, and the sizes of the characteristic diagrams are 5*3, 10×6, 20×12, 40×24 and 80×48 respectively.

Preferably, the road drivable region segmentation adopts a semantic segmentation algorithm to carry out pixel-by-pixel identification classification, receives characteristic information obtained from a back bone network and a network, adopts a BiFPN structure and FCNs to carry out classification, and relates to a plurality of up-sampling operations, and substitution is carried out by using a RepU PSAmple; road drivable region segmentation employs cross entropy loss.

Preferably, the target detection adopts an anchor-base detection scheme, a priori frame is preset, the probability of each grid existence target and the probability of each target class are judged according to feature images of different scales, and finally, repeated detection frames are removed by inhibiting N MS through a non-maximum value, so that a final detection result is obtained; the target detection loss function includes a classification loss for the target, a location loss for the target, and a confidence loss for the target.

Preferably, the lane line segmentation is based on key point detection, the picture is horizontally divided into a plurality of strips, each strip is further divided into a plurality of blocks, the positions occupied by the lane lines in each strip are predicted, the positions of the key points on the lane lines are obtained through the lane line detection, the key points belonging to the same lane line are connected into lines, and the loss is calculated by using cross entropy.

The invention has the advantages that:

compared with the common linear interpolation and transpose convolution, the method has the advantage that the accuracy of the network model is improved to a certain extent by using the RepUPSAmple. On the task of the semantic segmentation model, compared with the precision performance of the deep Labv3 model, the FPN model and the U-Net model when different upsampling modules are used, the performance of the semantic segmentation network can be improved by using the RepUPSAmple as an upsampling method according to different network models, different upsampling positions and different network scales. The mIOU can be improved by 1.77% on average and 0.74% on average by P.A. and 1.16% by P.A. and 0.35% by transpose convolution compared to bilinear interpolation algorithm.

Drawings

The invention is further described below with reference to the accompanying drawings and examples:

FIG. 1 is a block diagram of a reparameterized upsampling RepUPSAmple;

FIG. 2 is a diagram of the overall architecture of a multitasking aware network;

FIG. 3 is a block diagram of a backhaul network;

fig. 4 is a diagram of a bipin network structure.

Detailed Description

The invention provides a driving environment sensing method based on a vehicle-mounted monocular camera, which comprises the steps of carrying out structural re-parameterization and automatic driving multi-task sensing on an up-sampling module.

S1, carrying out structural re-parameterization on an up-sampling module

Upsampling plays an irreplaceable role in neural networks, and many network models require upsampling to recover feature map dimensions and fuse multi-channel information. The common up-sampling module comprises linear interpolation and transposed convolution, wherein the linear interpolation has no learnable parameters, and has high reasoning speed and weak expression capability; the transposed convolution reasoning has leachable parameters and has stronger expression capability. In practice, linear interpolation is a special transposed convolution, each of which can be replaced without loss by a transposed convolution. Based on this, in combination with the idea of reparameterization, an upsampling layer structure reparameterization method repappable is proposed.

In the training phase, one transposed convolution layer is extended to multiple branches, where one branch uses a common linear interpolation algorithm and the other branches use transposed convolutions of different convolution kernel sizes. Linear interpolation can only change the size of the feature map, but not the number of channels outputting the feature map, which can be changed by adding a convolution layer of 1*1. In addition, batch normalization can normalize the feature map, improve the generalization ability of the network, and adding batch normalizaiton after each upsampling branch can further improve model performance. Adding interpolation branches is equivalent to providing one jump connection for transposed convolution, which is focused on learning residual errors.

In the reasoning stage, the multi-branch structure can be re-parameterized and converted into a single-branch structure without damage. Twice of the productionAs an example, three branches during training are respectively convolved with bilinear interpolation, the 1*1 convolution, the 2 x 2 transposed convolution, and the 4*4 transposed convolution, and then each concatenated batch normalization layers, as shown in fig. 1. Inputting a feature map

After upsampling a feature map is obtained>

(1) 4*4 transposed convolution branches

Convolution kernel of 4×4 transposed convolution

Bias->

Their correspondence is shown in formula 1, transpose +.>

Is inverted +.>

Is a part of the two dimensions.

After the convolution, normalization is carried out through the BN layer, and in the reasoning process, the convolution can fuse the BN layer parameters so as to achieve the effect of reasoning acceleration. The transposed convolution parameters after fusion are shown in equations 2, 3:

(2) 2 x 2 transposed convolutional branches

Convolution kernel of 2 x 2 transpose convolution

The weight can be obtained by filling the convolution kernel outer circle of 2×2 with zero value to 4×4 size and then fusing the parameters of BN layer>

(3) Bilinear interpolation 1*1 convolution branch

Bilinear interpolation layer can be transformed into transposed convolution of 4×4 convolution kernel size with its convolution kernel lossless

Bias->

W _bilinear ←W _1×1 ×W _bilinear (4)

b _bilinear ←b _1×1 (5)

finally, the BN layer parameters are fused to obtain the weight

(4) Multi-branch fusion

through the structure heavy parameterization, the multi-branch complex structure during training can be compressed in a lossless manner, the accuracy is kept unchanged, and the reasoning efficiency is improved.

S2, automatic driving multitasking awareness

The automatic driving scene pays attention to precision and speed at the same time, and through a multi-task deep learning model, perception real-time reasoning based on a monocular camera can be realized, and three tasks of target detection, road drivable area segmentation and lane line segmentation are completed. The overall structure of the algorithm model adopts an encoding-decoding network structure, and different decoding head networks are respectively used for three tasks and share the encoding network. The coding network is divided into a back box and a back structure according to different actions and positions, the back box directly receives input from a camera, belongs to a relatively shallow position in the network and is used for mining shallow characteristic information of the network, the back belongs to the middle part of the whole network, receives the characteristic information from the back box, performs further characteristic fusion and characteristic extraction, acquires deeper characteristic information, and transmits the information to different decoding networks. The overall structure of the entire network is shown in fig. 2.

(1) Backbone network

The Backbone network adopts a more classical ResNet in a neural network, and has a good effect on image feature extraction. The image shot by the monocular camera is scaled to 640 x 340 size, and the characteristic diagrams of different sizes of P1-P5 are obtained through a multi-layer residual structure, as shown in fig. 3.

(2) Neck network

The Neck network adopts an improved BiFPN structure, and the bidirectional pyramid structure is beneficial to generating and fusing features with different scales, so that the finally generated feature map simultaneously contains multi-scale and multi-semantic information. The Bi FPN structure comprises a plurality of upsampling operations, upsampling in the Bi FPN is replaced by a Rep upscan which can be re-parameterized, the flexibility of the upsampling module is improved, the change of the dimension of the feature map is better adapted, and the overall performance of the network is improved. The single Bi FPN structure is shown in fig. 4.

And connecting a plurality of Bi FPNs to form a Neck structure of the whole sensing network, inputting P3, P4 and P5 corresponding to the feature map of the Neck, obtaining P3-P7 with deeper features after the first Bi FPN, and inputting the P3-P7 into the next Bi FPN structure. The output result of the Neck network is obtained through the series connection of 4 BiFPN structures, and the sizes of the characteristic diagrams are 5*3, 10×6, 20×12, 40×24 and 80×48 respectively.

(3) Road drivable area dividing head network

The road drivable region segmentation adopts a semantic segmentation algorithm to carry out pixel-by-pixel identification classification, receives characteristic information obtained from backbones and negs, and also adopts a Bi FPN structure and FCNs to carry out classification, wherein a plurality of up-sampling operations are involved, and Rep U PSAmple is also used for replacement. Road drivable region segmentation employs cross entropy loss.

(4) Target detection header network

The target detection head is similar to a YOLOv5 network, an anchor-base detection scheme is adopted, a priori frame is preset, the probability of each grid existence target and the probability of each target class are judged according to feature graphs of different scales, and finally the repeated detection frame is removed through non-maximum value inhibition N MS, so that a final detection result is obtained. The target detection loss function includes a classification loss for the target, a location loss for the target, and a confidence loss for the target.

(5) Lane line detection head network

The lane line detection is based on the key point detection, the picture is horizontally divided into a plurality of strips, each strip is divided into a plurality of blocks, the positions occupied by the lane lines in each strip are predicted, and compared with the method for detecting the lane lines by semantic segmentation, the method for detecting the lane lines by using the key points can greatly reduce the calculated amount of the network model. And detecting the lane lines to obtain the positions of key points on the lane lines, connecting the key points belonging to the same lane line into lines, and calculating the loss by using cross entropy.

Compared with the common linear interpolation and transpose convolution, the Rep U PSAmple has a certain improvement on the precision of the network model. On the task of the semantic segmentation model, the precision performance of the deep Labv3 model, the FPN model and the U-Net model when different upsampling modules are used is compared, and the results are shown in Table 1.

Table 1 influence of different upsampling methods on semantic segmentation network

According to experimental results, the performance of the semantic segmentation network can be improved by taking the Rep U PSAmple as an up-sampling method according to different network models, different up-sampling positions and different network scales. The m IOU can be raised by 1.77% on average, 0.74% on average, 1.16% on average, and 0.35% on p.a. on average, compared to the transposed convolution, compared to the bilinear interpolation algorithm.

The automatic driving technology is still in a high-speed development stage, and is not easy to imagine, and human traffic in the future world is necessarily automatic driving based on artificial intelligence. However, at present, the automatic driving technology is still not mature, besides the high technical requirement of software, the automatic driving technology has additional requirements on hardware, and if accurate sensing needs to be completed, the mutual matching of various sensors is needed, and a laser radar, a millimeter wave radar, a camera, an inertial measurement unit and the like are needed. It is often very difficult to equip old cars with an on-board artificial intelligence system having some intelligence.

The network model that this patent provided, based on monocular camera, the hardware requirement is low, and most vehicles all are equipped with, say the camera on the vehicle event data recorder, and required computational performance is not high, detects pedestrian, the condition of deviating from the lane again under, can give the timely warning of driver, can assist the driving to a certain extent. In addition, along with the development of intelligent traffic, the cooperation of vehicles and the cooperation of vehicles and roads are possible in the future, and different vehicles exchange information at intersections to provide early warning for the vision blind area information of the vehicle.

The sky-eye system of China is widely covered, most urban intersections are provided with cameras for monitoring illegal and illegal behaviors, and the network model provided by the patent can be applied to intersection monitoring equipment to provide artificial intelligence for edge-end equipment. The intersection camera can detect traffic flow and control traffic light change according to the traffic flow, ensure the traffic efficiency of the intersection, detect red light running behavior of pedestrians and give warning, detect overspeed, stop violations, red light running behavior of vehicles and the like, record license plate numbers of vehicles and report to the supervision department.

The above embodiments are only for illustrating the technical concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement the same according to the content of the present invention, and are not intended to limit the scope of the present invention. All modifications made according to the spirit of the main technical proposal of the invention should be covered in the protection scope of the invention.

Claims

1. The driving environment sensing method based on the vehicle-mounted monocular camera is characterized by comprising the following steps of:

2. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 1, wherein in the training stage in S1, the number of channels of the output feature map is changed by adding a convolution layer of 1*1 to the linear interpolation; batch normalizaiton is added after each upsampling branch to further improve model performance.

3. The driving environment sensing method based on an on-vehicle monocular camera according to claim 2, wherein in the reasoning stage in S1, for double up-sampling, three branches during training use bilinear interpolation tap 1*1 convolution, 2 x 2 transpose convolution, 4*4 transpose convolution, respectively, and thereafter each concatenate batch normalization layers; inputting a feature map

After upsampling a feature map is obtained>

4. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 3, wherein the convolution kernel of the 4×4 transpose convolution is

Bias->

The corresponding relation is shown in formula 1, which is transposed first>

Is inverted +.>

Is the last two dimensions of (2);

Convolution kernel of 2 x 2 transpose convolution

Bilinear interpolation is followed by 1*1 convolution, whose convolution kernel

Bias->

W _bilinear ←W _1×1 ×W _bilinear (4)

b _bilinear ←b _1×1 (5)

finally, the BN layer parameters are fused to obtain the weight

5. The driving environment perception method based on the vehicle-mounted monocular camera according to claim 1, wherein in S2, the overall structure of the multi-task deep learning model adopts an encoding-decoding network structure, and for three tasks, different decoding head networks are used respectively and share the encoding network; the coding network is divided into a backhaul network and a back network according to different actions and positions, the backhaul network directly receives input from a camera and is used for mining shallow characteristic information of the network, the back network receives the characteristic information from the backhaul network, further characteristic fusion and characteristic extraction are carried out, deeper characteristic information is obtained, and the deeper characteristic information is transmitted to different decoding networks.

6. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 5, wherein the Backbone network adopts ResNet, and images shot by the monocular camera are scaled to 640 x 340, and feature maps with different sizes are obtained through a multi-layer residual structure.

7. The vehicle-mounted monocular camera-based driving environment sensing method of claim 5, wherein the Neck network employs an improved BiFPN structure, the BiFPN structure comprising a plurality of upsampling operations, the upsampling in the BiFPN being replaced with a repapppattern that can be re-parameterized;

a plurality of BiFPNs are connected to form a Neck structure of the whole sensing network, the characteristic diagrams input into the Neck are P3, P4 and P5 correspondingly, P3-P7 with deeper characteristics are obtained after the first BiFPN is processed, and the P3-P7 is input into the next BiFPN structure; the output result of the Neck network is obtained through the series connection of 4 BiFPN structures, and the sizes of the characteristic diagrams are 5*3, 10×6, 20×12, 40×24 and 80×48 respectively.

8. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 7, wherein the road drivable region segmentation adopts a semantic segmentation algorithm to perform pixel-by-pixel recognition classification, receives characteristic information obtained from a back bone network and a network, performs classification by adopting a BiFPN structure and FCN, and involves a plurality of up-sampling operations to replace with a repappamp; road drivable region segmentation employs cross entropy loss.

9. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 8, wherein the target detection adopts an anchor-base detection scheme, a priori frame is preset, the probability of each grid existence target and the probability of each target class are judged according to feature maps of different scales, and finally, repeated detection frames are removed through non-maximum suppression NMS to obtain a final detection result; the target detection loss function includes a classification loss for the target, a location loss for the target, and a confidence loss for the target.

10. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 9, wherein the lane line segmentation is based on key point detection, the picture is horizontally divided into a plurality of strips, each strip is divided into a plurality of blocks, the position occupied by the lane line in each strip is predicted, the position of the key point on the lane line is obtained by the lane line detection, the key points belonging to the same lane line are connected into a line, and the loss is calculated by using cross entropy.