CN116311113A - Driving environment sensing method based on vehicle-mounted monocular camera - Google Patents
- Publication number
- CN116311113A (application number CN202310093603.XA)
- Authority
- CN
- China
- Prior art keywords
- convolution
- network
- monocular camera
- vehicle
- driving environment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 35
- 230000011218 segmentation Effects 0.000 claims abstract description 26
- 238000005070 sampling Methods 0.000 claims abstract description 18
- 238000001514 detection method Methods 0.000 claims description 32
- 230000008447 perception Effects 0.000 claims description 13
- 238000012549 training Methods 0.000 claims description 13
- 238000010586 diagram Methods 0.000 claims description 10
- 230000004927 fusion Effects 0.000 claims description 10
- 238000010606 normalization Methods 0.000 claims description 7
- 230000008569 process Effects 0.000 claims description 6
- 230000009471 action Effects 0.000 claims description 5
- 238000013136 deep learning model Methods 0.000 claims description 5
- 230000006870 function Effects 0.000 claims description 5
- 238000000605 extraction Methods 0.000 claims description 4
- 238000005065 mining Methods 0.000 claims description 3
- 230000009466 transformation Effects 0.000 claims description 3
- 230000001629 suppression Effects 0.000 claims 1
- 230000008901 benefit Effects 0.000 abstract description 5
- 238000005516 engineering process Methods 0.000 description 12
- 238000013473 artificial intelligence Methods 0.000 description 4
- 238000013528 artificial neural network Methods 0.000 description 4
- 230000006399 behavior Effects 0.000 description 3
- 230000008859 change Effects 0.000 description 3
- 238000011161 development Methods 0.000 description 3
- 230000018109 developmental process Effects 0.000 description 3
- 230000000694 effects Effects 0.000 description 3
- 230000001133 acceleration Effects 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 2
- 230000007613 environmental effect Effects 0.000 description 2
- 238000012544 monitoring process Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000002457 bidirectional effect Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000011217 control strategy Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000002401 inhibitory effect Effects 0.000 description 1
- 230000005764 inhibitory process Effects 0.000 description 1
- 238000009434 installation Methods 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/588—Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/765—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Traffic Control Systems (AREA)
Abstract
The invention discloses a driving environment sensing method based on a vehicle-mounted monocular camera, comprising structural re-parameterization of an up-sampling module and automatic-driving multi-task perception. Compared with common linear interpolation and transposed convolution, using RepUpsample improves the accuracy of the network model to a certain extent. On the semantic segmentation task, comparing the accuracy of the DeepLabv3, FPN and U-Net models with different up-sampling modules shows that, across different network models, up-sampling positions and network scales, using RepUpsample as the up-sampling method can improve the performance of the semantic segmentation network. Compared with the bilinear interpolation algorithm, mIoU is improved by 1.77% and pixel accuracy (P.A.) by 0.74% on average; compared with transposed convolution, mIoU is improved by 1.16% and P.A. by 0.35% on average.
Description
Technical Field
The invention relates to the field of automatic driving, in particular to a driving environment sensing method based on a vehicle-mounted monocular camera.
Background
In recent years, deep learning and computer vision technologies have risen rapidly, and automatic driving technology built on them offers new solutions for improving traffic safety and efficiency. An automatic driving system does not tire, can strictly obey traffic rules, and has great potential to reduce the accident rate. Autonomous driving combines many technologies, such as artificial intelligence, communications, semiconductors and automobiles. The related industry chain is broad and the value it can create is huge; the technology has become a contested ground for cross-border competition between the automobile and technology industries of many countries, and giants such as ***, Tesla, General Motors and Baidu are vigorously developing automatic driving technology. Driven jointly by technological progress, policy support, the entry of major players and capital, falling costs and clear application scenarios, automatic driving technology now stands at a key node of commercial deployment after more than a decade of exploration and development.
An automatic driving system is a comprehensive system integrating environment perception, decision control, action execution and other functions. It is large and complex, and can generally be divided by function into three modules: perception, decision and control. The perception module is defined as the collection and processing of environmental and in-vehicle information; it senses the surrounding environment through various sensors. It involves multiple tasks such as road boundary detection, drivable-area detection, traffic sign recognition, vehicle and pedestrian detection, and road surface perception. The perception module is the cornerstone of the whole system: only if the perception system provides accurate information can the decision system make correct judgments, and the robustness of the perception system directly affects the reliability of the overall automatic driving system. The decision module makes judgments based on the perception information, determines a suitable working mode, and issues the corresponding control strategy, replacing the driver in making driving decisions. The execution module controls the vehicle according to the decision result: each operating system of the vehicle is connected to the decision system through a bus, and execution instructions precisely control driving actions such as acceleration, braking, steering amplitude and light control, to realize autonomous control of the vehicle.
In the perception module, accurate sensing requires the cooperation of multiple sensors. Existing methods can be divided into three types: perception based on monocular images, on laser point clouds, and on millimeter-wave radar. Lidar and millimeter-wave radar are both active sensors: they emit detection signals, receive the signals reflected by objects, and compute the direction and distance of objects by comparing the emitted and received signals. A camera is a passive sensor and perceives objects through reflected light. The three sensors have different advantages and disadvantages. Lidar produces high-precision three-dimensional point clouds, but at the current stage it is expensive, has a short service life, and suffers heavy noise in rain and snow. Millimeter-wave radar is little affected by weather and is sensitive to moving objects, but has lower resolution and poorer discrimination capability. A monocular camera is cheap, has high resolution and provides rich visual information, but cannot acquire depth information. For safety, autonomous vehicles are typically equipped with all three types of sensors, which advanced driving assistance systems (Advanced Driving Assistance System, ADAS) use for environmental perception. On ordinary civilian automobiles, however, installing all of these sensors is too expensive, while ordinary monocular cameras have a much wider hardware base. An ADAS based on a monocular camera therefore has greater practical value.
Disclosure of Invention
The invention aims at: the driving environment sensing method based on the vehicle-mounted monocular camera aims to solve the following problems:
an optimized neural network up-sampling module: deep neural networks generally adopt transposed convolution or bilinear interpolation as the up-sampling module. Bilinear interpolation has no learnable parameters; its inference is fast but its expressive capacity is weak. Transposed convolution has learnable parameters and stronger expressive capacity. Noting the equivalence between them, a multi-branch up-sampling structure is used in the training phase, and the multiple branches are merged losslessly into a single branch in the inference phase, improving network capacity without any loss of inference speed.
Automatic driving multi-task perception: the automatic driving perception module needs to sense the surrounding environment using the vehicle's sensors, covering target detection, road drivable-area segmentation and lane-line detection. Because of its low cost, the monocular camera has a wide hardware base, so a monocular-camera-based automatic driving perception system is highly practical. Based on images captured by the vehicle-mounted monocular camera, an automatic driving multi-task perception algorithm is designed in combination with the structurally re-parameterized up-sampling module; multiple tasks are completed in a single inference pass, improving the throughput of the whole system and reducing power consumption and memory use.
The technical scheme of the invention is as follows:
a driving environment sensing method based on a vehicle-mounted monocular camera comprises the following steps:
s1, carrying out structural re-parameterization on an up-sampling module: comprising a training phase and an reasoning phase, wherein,
in the training stage, expanding one transposed convolution layer of an up-sampling module into multiple branches, wherein one branch uses a linear interpolation algorithm, and the other branches use transposed convolutions with different convolution kernel sizes;
in the inference stage, the multi-branch structure is re-parameterized and converted losslessly into a single-branch structure;
s2, automatic driving multitasking awareness: through a multi-task deep learning model, perception real-time reasoning based on a monocular camera is realized, and three tasks of target detection, road drivable region segmentation and lane line segmentation are completed.
Preferably, in the training stage in S1, the number of channels of the output feature map is changed by adding a 1×1 convolution layer after the linear interpolation; batch normalization is added after each up-sampling branch to further improve model performance.
Preferably, in the inference stage in S1, for 2× up-sampling, the three branches are respectively bilinear interpolation with a 1×1 convolution, a 2×2 transposed convolution, and a 4×4 transposed convolution, each followed by a batch normalization layer; an input feature map X ∈ R^(C×H×W) yields, after up-sampling, a feature map X′ ∈ R^(C′×2H×2W).

The 4×4 transposed convolution has convolution kernel W4×4 ∈ R^(C×C′×4×4) and bias b4×4 ∈ R^(C′). Because a transposed convolution first pads and then convolves, the equivalent convolution process has kernel W′4×4; the correspondence is shown in formula 1: the first two dimensions of W4×4 are transposed and its last two dimensions are flipped:

W′4×4 = flip(transpose(W4×4)) (1)

After the convolution, normalization is performed by the BN layer; in the inference process the convolution fuses the BN-layer parameters. The fused transposed-convolution parameters are shown in formulas 2 and 3:

W4×4 ← (γ/σ)·W4×4 (2)
b4×4 ← β + γ·(b4×4 − μ)/σ (3)

where γ, β, σ and μ correspond respectively to the weight, bias, standard deviation and mean of the BN layer; formula 2 yields the fused weight W4×4 and formula 3 the fused bias, giving the final transposed-convolution parameters of this branch.

The 2×2 transposed convolution has kernel W2×2 ∈ R^(C×C′×2×2); its weight is obtained by zero-padding the outer ring of the 2×2 kernel to 4×4 size and then fusing the BN-layer parameters as in formulas 2 and 3.

Bilinear interpolation is followed by a 1×1 convolution. The bilinear layer can be rewritten losslessly as a 4×4 transposed convolution with fixed kernel Wbilinear; the channel-transforming 1×1 convolution, with kernel W1×1 and bias b1×1, is first parameter-fused with this 4×4 transposed convolution (the bilinear layer itself has no bias). The new weights are shown in formulas 4 and 5:

Wbilinear ← W1×1 × Wbilinear (4)
bbilinear ← b1×1 (5)

After each of the three branches has been independently transformed into a 4×4 transposed convolution, the weights and biases of the three transposed convolutions are added to obtain the final single-branch parameters, as shown in formulas 6 and 7:

W ← W4×4 + W2×2 + Wbilinear (6)
b ← b4×4 + b2×2 + bbilinear (7)

Through structural re-parameterization, the multi-branch complex structure used during training is thus compressed losslessly.
Preferably, in S2, the overall structure of the multi-task deep learning model adopts an encoding–decoding network structure; the three tasks use different decoding head networks and share the encoding network. The encoding network is divided, according to role and position, into a Backbone network and a Neck network: the Backbone network directly receives the input from the camera and mines the network's shallow feature information, while the Neck network receives the feature information from the Backbone, performs further feature fusion and feature extraction to obtain deeper feature information, and passes it to the different decoding networks.
Preferably, the Backbone network adopts ResNet; the image captured by the monocular camera is scaled to 640×340, and feature maps of different sizes are obtained through a multi-layer residual structure.
Preferably, the Neck network adopts an improved BiFPN structure; the BiFPN structure contains several up-sampling operations, and the up-sampling in the BiFPN is replaced by the re-parameterizable RepUpsample.

Several BiFPNs are connected to form the Neck structure of the whole perception network. The feature maps input into the Neck are P3, P4 and P5; after the first BiFPN, deeper features P3–P7 are obtained and input into the next BiFPN structure. The output of the Neck network is obtained by connecting 4 BiFPN structures in series, with feature-map sizes of 5×3, 10×6, 20×12, 40×24 and 80×48 respectively.
Preferably, road drivable-region segmentation uses a semantic segmentation algorithm for pixel-by-pixel classification. It receives the feature information obtained from the Backbone and Neck networks and performs classification with a BiFPN structure and an FCN head; the several up-sampling operations involved are likewise replaced with RepUpsample. Road drivable-region segmentation uses cross-entropy loss.
Preferably, target detection adopts an anchor-based detection scheme: a priori boxes are preset, the probability that each grid cell contains a target and the class probabilities of each target are predicted from feature maps of different scales, and finally duplicate detection boxes are removed by non-maximum suppression (NMS) to obtain the final detection result. The target-detection loss function includes a classification loss, a location loss, and a confidence loss for the target.
Preferably, lane-line segmentation is based on key-point detection: the image is divided horizontally into several strips, each strip is further divided into several cells, and the positions occupied by the lane lines in each strip are predicted. Lane-line detection yields the positions of the key points on the lane lines; key points belonging to the same lane line are connected into a line, and the loss is computed with cross entropy.
The invention has the advantages that:
compared with the common linear interpolation and transpose convolution, the method has the advantage that the accuracy of the network model is improved to a certain extent by using the RepUPSAmple. On the task of the semantic segmentation model, compared with the precision performance of the deep Labv3 model, the FPN model and the U-Net model when different upsampling modules are used, the performance of the semantic segmentation network can be improved by using the RepUPSAmple as an upsampling method according to different network models, different upsampling positions and different network scales. The mIOU can be improved by 1.77% on average and 0.74% on average by P.A. and 1.16% by P.A. and 0.35% by transpose convolution compared to bilinear interpolation algorithm.
Drawings
The invention is further described below with reference to the accompanying drawings and examples:
FIG. 1 is a structural diagram of the re-parameterized up-sampling module RepUpsample;
FIG. 2 is a diagram of the overall architecture of a multitasking aware network;
FIG. 3 is a structural diagram of the Backbone network;
FIG. 4 is a diagram of the BiFPN network structure.
Detailed Description
The invention provides a driving environment sensing method based on a vehicle-mounted monocular camera, comprising structural re-parameterization of an up-sampling module and automatic driving multi-task perception.
S1, carrying out structural re-parameterization on an up-sampling module
Up-sampling plays an irreplaceable role in neural networks; many network models rely on up-sampling to recover feature-map dimensions and fuse multi-channel information. Common up-sampling modules include linear interpolation and transposed convolution. Linear interpolation has no learnable parameters: its inference is fast but its expressive capacity is weak. Transposed convolution has learnable parameters and stronger expressive capacity. In fact, linear interpolation is a special case of transposed convolution, and each can be replaced losslessly by a transposed convolution. Based on this, combined with the idea of re-parameterization, an up-sampling layer structural re-parameterization method, RepUpsample, is proposed.
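The claim that linear interpolation is a special transposed convolution can be checked numerically. The following sketch (an illustration, not part of the patent) implements a 1-D transposed convolution with stride 2 and the fixed kernel [0.25, 0.75, 0.75, 0.25], and verifies that interior outputs match 2× linear interpolation sampled at positions o/2 − 0.25 (the half-pixel-centre convention, assumed here):

```python
import numpy as np

def conv_transpose1d(x, k, stride=2, pad=1):
    """Transposed 1-D convolution in scatter-add form; crops `pad` samples per end."""
    out = np.zeros((len(x) - 1) * stride + len(k))
    for i, v in enumerate(x):
        out[i * stride:i * stride + len(k)] += v * k
    return out[pad:len(out) - pad]

x = np.array([1.0, 2.0, 4.0, 7.0])
k_bilinear = np.array([0.25, 0.75, 0.75, 0.25])   # fixed "bilinear" kernel
up = conv_transpose1d(x, k_bilinear)              # length 2 * len(x)

# Reference: linear interpolation at output centres c = o/2 - 0.25.
ref = np.interp(np.arange(8) / 2 - 0.25, np.arange(4), x)
# Interior outputs (o = 1..6) agree exactly; the two borders differ because
# the transposed convolution sees zeros outside the signal while np.interp clamps.
assert up.shape == (8,)
assert np.allclose(up[1:7], ref[1:7])
```

The 2-D bilinear kernel is the outer product of this 1-D kernel with itself, which is why the bilinear branch can be absorbed into a 4×4 transposed convolution.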
In the training phase, one transposed convolution layer is extended into multiple branches, where one branch uses an ordinary linear interpolation algorithm and the other branches use transposed convolutions with different kernel sizes. Linear interpolation can only change the size of the feature map, not the number of output channels, which can be changed by adding a 1×1 convolution layer. In addition, batch normalization normalizes the feature map and improves the generalization ability of the network; adding batch normalization after each up-sampling branch further improves model performance. Adding the interpolation branch is equivalent to providing a skip connection for the transposed convolution, which can then focus on learning the residual.
In the inference stage, the multi-branch structure can be re-parameterized and converted losslessly into a single-branch structure. Taking 2× up-sampling as an example, the three branches during training are respectively bilinear interpolation with a 1×1 convolution, a 2×2 transposed convolution, and a 4×4 transposed convolution, each followed by a batch normalization layer, as shown in fig. 1. An input feature map X ∈ R^(C×H×W) yields, after up-sampling, a feature map X′ ∈ R^(C′×2H×2W).
(1) 4×4 transposed convolution branch
The 4×4 transposed convolution has convolution kernel W4×4 ∈ R^(C×C′×4×4) and bias b4×4 ∈ R^(C′). Because a transposed convolution first pads and then convolves, the equivalent convolution process has kernel W′4×4; their correspondence is shown in formula 1: the first two dimensions of W4×4 are transposed and its last two dimensions are flipped:

W′4×4 = flip(transpose(W4×4)) (1)
After the convolution, normalization is performed by the BN layer; in the inference process the convolution can fuse the BN-layer parameters to accelerate inference. The fused transposed-convolution parameters are shown in formulas 2 and 3:

W4×4 ← (γ/σ)·W4×4 (2)
b4×4 ← β + γ·(b4×4 − μ)/σ (3)

where γ, β, σ and μ correspond respectively to the weight, bias, standard deviation and mean of the BN layer; formula 2 yields the fused weight W4×4 and formula 3 the fused bias, giving the final transposed-convolution parameters of this branch.
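The BN fusion in formulas 2 and 3 is ordinary algebra: applying γ·(y − μ)/σ + β to y = Wx + b is itself an affine map. A minimal NumPy check (illustrative only; σ is taken here as the standard deviation √(var + ε), which is how the fused weights must be computed):

```python
import numpy as np

rng = np.random.default_rng(0)
C_out, C_in = 3, 4
W = rng.standard_normal((C_out, C_in))      # stands in for the conv weight
b = rng.standard_normal(C_out)
gamma, beta = rng.standard_normal(C_out), rng.standard_normal(C_out)
mu, var, eps = rng.standard_normal(C_out), rng.random(C_out) + 0.1, 1e-5
sigma = np.sqrt(var + eps)

x = rng.standard_normal(C_in)
bn_out = gamma * (W @ x + b - mu) / sigma + beta   # conv (here: linear map) then BN

W_fused = (gamma / sigma)[:, None] * W             # formula 2
b_fused = beta + gamma * (b - mu) / sigma          # formula 3
assert np.allclose(W_fused @ x + b_fused, bn_out)
```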
(2) 2×2 transposed convolution branch
The 2×2 transposed convolution has kernel W2×2 ∈ R^(C×C′×2×2); its weight is obtained by zero-padding the outer ring of the 2×2 kernel to 4×4 size and then fusing the BN-layer parameters as in formulas 2 and 3.
(3) Bilinear interpolation and 1×1 convolution branch
The bilinear interpolation layer can be transformed losslessly into a transposed convolution with a 4×4 kernel, with fixed kernel Wbilinear; the channel-transforming 1×1 convolution has kernel W1×1 and bias b1×1 (the bilinear layer itself has no bias). The obtained 4×4 transposed convolution is first parameter-fused with the 1×1 convolution; the new weights are shown in formulas 4 and 5:
Wbilinear ← W1×1 × Wbilinear (4)
bbilinear ← b1×1 (5)
(4) Multi-branch fusion
After each of the three branches has been independently transformed into a 4×4 transposed convolution, the weights and biases of the three transposed convolutions are added to obtain the final result, as shown in formulas 6 and 7:

W ← W4×4 + W2×2 + Wbilinear (6)
b ← b4×4 + b2×2 + bbilinear (7)
through the structure heavy parameterization, the multi-branch complex structure during training can be compressed in a lossless manner, the accuracy is kept unchanged, and the reasoning efficiency is improved.
S2, automatic driving multitasking awareness
Automatic driving scenarios demand both accuracy and speed. Through a multi-task deep learning model, real-time perception inference based on a monocular camera can be realized, completing the three tasks of target detection, road drivable-area segmentation and lane-line segmentation. The overall structure of the model adopts an encoding–decoding network structure: the three tasks use different decoding head networks and share the encoding network. The encoding network is divided, according to role and position, into a Backbone and a Neck. The Backbone directly receives the input from the camera, sits at the relatively shallow part of the network, and mines shallow feature information; the Neck forms the middle of the whole network, receives the feature information from the Backbone, performs further feature fusion and feature extraction to obtain deeper feature information, and passes it to the different decoding networks. The overall structure of the network is shown in fig. 2.
(1) Backbone network
The Backbone network adopts ResNet, a classical neural network that performs well at image feature extraction. The image captured by the monocular camera is scaled to 640×340, and feature maps P1–P5 of different sizes are obtained through a multi-layer residual structure, as shown in fig. 3.
(2) Neck network
The Neck network adopts an improved BiFPN structure; the bidirectional pyramid structure helps generate and fuse features of different scales, so that the final feature maps contain both multi-scale and multi-semantic information. The BiFPN structure contains several up-sampling operations; the up-sampling in the BiFPN is replaced by the re-parameterizable RepUpsample, which improves the flexibility of the up-sampling module, better adapts to changes in feature-map dimensions, and improves the overall performance of the network. A single BiFPN structure is shown in fig. 4.
Several BiFPNs are connected to form the Neck structure of the whole perception network. The feature maps input into the Neck are P3, P4 and P5; after the first BiFPN, deeper features P3–P7 are obtained and input into the next BiFPN structure. The output of the Neck network is obtained by connecting 4 BiFPN structures in series, with feature-map sizes of 5×3, 10×6, 20×12, 40×24 and 80×48 respectively.
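The listed Neck output sizes correspond to down-sampling strides of 8 to 128 relative to a 640×384 network input; 384 is an inference from the listed map sizes (the text's 640×340 does not divide evenly by these strides), so it is labelled here as an assumption for illustration:

```python
# P7..P3 strides for the feature pyramid; the 640x384 input size is an
# assumption inferred from the listed Neck output sizes, not stated in the text.
strides = [128, 64, 32, 16, 8]
sizes = [(640 // s, 384 // s) for s in strides]
print(sizes)  # [(5, 3), (10, 6), (20, 12), (40, 24), (80, 48)]
```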
(3) Road drivable area dividing head network
Road drivable-region segmentation uses a semantic segmentation algorithm for pixel-by-pixel classification. It receives the feature information obtained from the Backbone and Neck and likewise performs classification with a BiFPN structure and an FCN head; the several up-sampling operations involved are again replaced with RepUpsample. Road drivable-region segmentation uses cross-entropy loss.
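Pixel-wise cross-entropy for the segmentation head averages a per-pixel classification loss over the image. A generic formulation (shapes and class count are illustrative, not taken from the patent):

```python
import numpy as np

def pixel_cross_entropy(logits, labels):
    """logits: (C, H, W) raw class scores; labels: (H, W) integer class ids."""
    z = logits - logits.max(axis=0, keepdims=True)            # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=0, keepdims=True))  # log-softmax over C
    H, W = labels.shape
    return -log_p[labels, np.arange(H)[:, None], np.arange(W)[None, :]].mean()

# Toy check: logits that strongly predict class 1 everywhere, labels all 1.
logits = np.zeros((2, 2, 2))
logits[1] = 10.0
labels = np.ones((2, 2), dtype=int)
loss = pixel_cross_entropy(logits, labels)
assert 0 < loss < 1e-3   # near-perfect prediction gives near-zero loss
```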
(4) Target detection header network
The target detection head is similar to the YOLOv5 network and adopts an anchor-based detection scheme: a priori boxes are preset, the probability that each grid cell contains a target and the class probabilities of each target are predicted from feature maps of different scales, and finally duplicate detection boxes are removed by non-maximum suppression (NMS) to obtain the final detection result. The target-detection loss function includes a classification loss, a location loss, and a confidence loss for the target.
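Non-maximum suppression as described — keep the highest-confidence box, drop remaining boxes that overlap it beyond a threshold, repeat — can be sketched as follows (a generic greedy NMS, not the patent's exact implementation; the 0.5 IoU threshold is an assumed default):

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thr=0.5):
    """Greedy NMS; returns indices of kept boxes, highest score first."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = int(order.pop(0))
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thr]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 (IoU 0.81) and is removed
```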
(5) Lane line detection head network
Lane-line detection is based on key-point detection: the image is divided horizontally into several strips, each strip is further divided into several cells, and the positions occupied by the lane lines in each strip are predicted. Compared with detecting lane lines by semantic segmentation, key-point-based detection greatly reduces the computation of the network model. Lane-line detection yields the positions of the key points on the lane lines; key points belonging to the same lane line are connected into a line, and the loss is computed with cross entropy.
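The row-anchor scheme — one classification per strip over the cells, plus a "no lane" class — reduces each lane's prediction to a handful of argmax operations. A decoding sketch (toy shapes; the strip/cell counts and the trailing "no lane" class are assumptions for illustration):

```python
import numpy as np

def decode_lane(cell_logits, img_w):
    """cell_logits: (num_strips, num_cells + 1); the last column means
    'no lane in this strip'. Returns the x centre per strip, or None."""
    num_cells = cell_logits.shape[1] - 1
    points = []
    for row in cell_logits:
        idx = int(np.argmax(row))
        if idx == num_cells:                         # background: no key point here
            points.append(None)
        else:
            points.append((idx + 0.5) * img_w / num_cells)
    return points

# Toy prediction: 4 strips, 8 cells + background; the lane drifts to the
# right and vanishes in the last strip.
logits = np.full((4, 9), -5.0)
logits[0, 2] = logits[1, 3] = logits[2, 4] = 5.0
logits[3, 8] = 5.0                                   # background wins
pts = decode_lane(logits, img_w=640)
print(pts)  # [200.0, 280.0, 360.0, None]
```

The non-None points of one lane are then connected into a polyline, as the text describes.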
Compared with common bilinear interpolation and transposed convolution, RepUpsample brings a certain improvement in the accuracy of the network model. On the semantic segmentation task, the accuracy of the DeepLabv3, FPN and U-Net models with different upsampling modules is compared, and the results are shown in Table 1.
Table 1. Influence of different upsampling methods on the semantic segmentation network
According to the experimental results, using RepUpsample as the upsampling method improves the performance of the semantic segmentation network across different network models, different upsampling positions and different network scales. Compared with the bilinear interpolation algorithm, mIoU rises by 1.77% and pixel accuracy (PA) by 1.16% on average; compared with transposed convolution, mIoU rises by 0.74% and PA by 0.35% on average.
Automatic driving technology is still in a stage of rapid development, and it is not hard to imagine that future human transportation will be automatic driving based on artificial intelligence. At present, however, the technology remains immature: besides demanding software, it places additional requirements on hardware. Accurate perception requires the cooperation of multiple sensors, such as lidar, millimeter-wave radar, cameras and inertial measurement units. It is therefore often very difficult to equip older cars with an on-board artificial intelligence system of any real capability.
The network model provided in this patent is based on a monocular camera, so the hardware requirement is low; most vehicles are already equipped with one, for example the camera of a dashcam, and the required computing performance is modest. When a pedestrian is detected or the vehicle deviates from its lane, a timely warning can be given to the driver, assisting driving to a certain extent. In addition, with the development of intelligent transportation, vehicle-to-vehicle and vehicle-to-road cooperation will become possible in the future, with different vehicles exchanging information at intersections to provide early warning of blind-spot hazards.
China's "sky eye" surveillance system is widely deployed, and most urban intersections are equipped with cameras for monitoring illegal behavior. The network model provided by this patent can be applied to intersection monitoring equipment to bring artificial intelligence to edge devices. Intersection cameras can measure traffic flow and control traffic-light changes accordingly to ensure throughput, detect pedestrians running red lights and give warnings, detect speeding, illegal parking and red-light running by vehicles, and record license plate numbers for reporting to the supervising department.
The above embodiments only illustrate the technical concept and features of the present invention; they are intended to enable those skilled in the art to understand and implement the invention, and are not intended to limit its scope. All modifications made according to the spirit of the main technical solution of the invention shall fall within its protection scope.
Claims (10)
1. A driving environment sensing method based on a vehicle-mounted monocular camera, characterized by comprising the following steps:
S1, performing structural re-parameterization on an upsampling module, comprising a training stage and an inference stage, wherein,
in the training stage, one transposed convolution layer of the upsampling module is expanded into multiple branches, wherein one branch uses a linear interpolation algorithm and the other branches use transposed convolutions with different convolution kernel sizes;
in the inference stage, the multi-branch structure is re-parameterized and losslessly converted into a single-branch structure;
S2, automatic driving multi-task perception: through a multi-task deep learning model, real-time perception inference based on the monocular camera is realized, completing the three tasks of target detection, road drivable area segmentation and lane line segmentation.
2. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 1, wherein in the training stage in S1, a 1×1 convolution layer is added after the linear interpolation to change the number of channels of the output feature map, and batch normalization is added after each upsampling branch to further improve model performance.
3. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 2, wherein in the inference stage in S1, for twofold upsampling, the three branches during training use bilinear interpolation plus a 1×1 convolution, a 2×2 transposed convolution and a 4×4 transposed convolution respectively, each followed by a batch normalization layer; a feature map X ∈ R^(C×H×W) is input, and a feature map Y ∈ R^(C'×2H×2W) is obtained after upsampling.
4. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 3, wherein the convolution kernel of the 4×4 transposed convolution is W^T ∈ R^(C×C'×4×4), with bias b ∈ R^(C'); because a transposed convolution is equivalent to first padding and then convolving, the convolution kernel of the equivalent convolution process is W, with the correspondence shown in formula 1: W^T is first transposed along its two channel dimensions and then flipped along its last two spatial dimensions,
W = flip(transpose(W^T, (0, 1)), (2, 3))    (1)
after the convolution, the normalization is carried out through the BN layer, in the reasoning process, the BN layer parameters are fused by the convolution, and transposed convolution parameters after the fusion are shown as formulas 2 and 3:
wherein gamma, beta, sigma and mu respectively correspond to the weight, bias, variance and mean of BN layer to obtain weight W 4×4 Finally, obtaining the final transposed convolution weight through the transformation of the formula 3
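A scalar numerical check of the BN-folding step described above, under the usual convention that σ is the standard deviation √(var + ε); all parameter values here are illustrative, not from the patent.

```python
import math

# Numerical check (scalar, 1-channel case) of folding BN into a conv:
#   W <- (gamma / sigma) * W
#   b <- beta + gamma * (b - mu) / sigma
# gamma/beta/mu/var are illustrative; sigma = sqrt(var + eps).

gamma, beta, mu, var, eps = 1.5, 0.2, 0.4, 4.0, 1e-5
w, b = 2.0, 0.5                         # convolution weight and bias
sigma = math.sqrt(var + eps)

def conv_then_bn(x):
    y = w * x + b                       # "convolution"
    return gamma * (y - mu) / sigma + beta

w_f = gamma * w / sigma                 # folded weight
b_f = beta + gamma * (b - mu) / sigma   # folded bias

for x in (-1.0, 0.0, 2.5):
    assert abs(conv_then_bn(x) - (w_f * x + b_f)) < 1e-9
print("BN folded into the convolution without changing the output")
```

The same algebra applies per output channel of a real convolution, since BN acts channel-wise on the convolution output.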
The convolution kernel of the 2×2 transposed convolution is W^T_2×2 ∈ R^(C×C'×2×2); its weight W_2×2 and bias b_2×2 are obtained by zero-padding the outer ring of the 2×2 convolution kernel to 4×4 size and fusing the parameters of its BN layer in the same way.
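A small sketch of the zero-padding step just described; centring the 2×2 kernel inside one ring of zeros is an assumption about the padding layout, and the kernel values are illustrative.

```python
# Widen a 2x2 kernel to 4x4 by zero-padding one ring around it, so the
# branch can be merged with the 4x4 transposed-convolution branch.

def pad_kernel_2x2_to_4x4(k):
    out = [[0.0] * 4 for _ in range(4)]
    for i in range(2):
        for j in range(2):
            out[i + 1][j + 1] = k[i][j]   # centre the 2x2 block
    return out

k2 = [[1.0, 2.0],
      [3.0, 4.0]]
k4 = pad_kernel_2x2_to_4x4(k2)
for row in k4:
    print(row)   # original values surrounded by a ring of zeros
```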
The bilinear-interpolation branch is followed by a 1×1 convolution: bilinear interpolation is equivalent to a fixed 4×4 transposed convolution with kernel W_bilinear and no bias, and the channel-transforming 1×1 convolution has kernel W_1×1 and bias b_1×1; the obtained 4×4 transposed convolution is first fused with the parameters of the 1×1 convolution, and the resulting new weights are shown in formulas 4 and 5:
W bilinear ←W 1×1 ×W bilinear (4)
b bilinear ←b 1×1 (5)
After each of the three branches is independently transformed into a 4×4 transposed convolution, the weights and biases of the three transposed convolutions are added to obtain the final single-branch parameters, as shown in formulas 6 and 7:
W_final = W_bilinear + W_2×2 + W_4×4    (6)
b_final = b_bilinear + b_2×2 + b_4×4    (7)
and through structural re-parameterization, the multi-branch complex structure during training is compressed in a lossless manner.
5. The driving environment perception method based on the vehicle-mounted monocular camera according to claim 1, wherein in S2 the overall structure of the multi-task deep learning model adopts an encoding-decoding network structure; the three tasks use different decoding head networks respectively while sharing the encoding network; the encoding network is divided into a Backbone network and a Neck network according to their roles and positions: the Backbone network directly receives the input from the camera and mines the shallow feature information of the network, while the Neck network receives the feature information from the Backbone network and performs further feature fusion and feature extraction to obtain deeper feature information, which is transmitted to the different decoding networks.
6. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 5, wherein the Backbone network adopts ResNet; images shot by the monocular camera are scaled to 640×384, and feature maps of different sizes are obtained through a multi-layer residual structure.
7. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 5, wherein the Neck network employs an improved BiFPN structure, the BiFPN structure comprising a plurality of upsampling operations, the upsampling in the BiFPN being replaced with the re-parameterizable RepUpsample;
a plurality of BiFPNs are connected to form a Neck structure of the whole sensing network, the characteristic diagrams input into the Neck are P3, P4 and P5 correspondingly, P3-P7 with deeper characteristics are obtained after the first BiFPN is processed, and the P3-P7 is input into the next BiFPN structure; the output result of the Neck network is obtained through the series connection of 4 BiFPN structures, and the sizes of the characteristic diagrams are 5*3, 10×6, 20×12, 40×24 and 80×48 respectively.
8. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 7, wherein the road drivable area segmentation adopts a semantic segmentation algorithm to perform pixel-by-pixel recognition and classification; it receives the feature information obtained from the Backbone network and the Neck network, performs classification with a BiFPN structure and an FCN, and the several upsampling operations involved are replaced with RepUpsample; the road drivable area segmentation employs a cross-entropy loss.
9. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 8, wherein the target detection adopts an anchor-based detection scheme with preset prior boxes; the probability that each grid cell contains a target and the probability of each target class are judged according to feature maps of different scales, and finally duplicate detection boxes are removed through non-maximum suppression (NMS) to obtain the final detection result; the target detection loss function comprises a classification loss, a localization loss and a confidence loss for the target.
10. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 9, wherein the lane line segmentation is based on key-point detection: the picture is horizontally divided into a plurality of strips and each strip into a plurality of blocks, and the position occupied by the lane line in each strip is predicted; lane line detection obtains the positions of the key points on the lane lines, the key points belonging to the same lane line are connected into a line, and the loss is calculated using cross entropy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310093603.XA CN116311113A (en) | 2023-02-10 | 2023-02-10 | Driving environment sensing method based on vehicle-mounted monocular camera |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116311113A true CN116311113A (en) | 2023-06-23 |
Family
ID=86789716
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310093603.XA Pending CN116311113A (en) | 2023-02-10 | 2023-02-10 | Driving environment sensing method based on vehicle-mounted monocular camera |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116311113A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117789153A (en) * | 2024-02-26 | 2024-03-29 | 浙江驿公里智能科技有限公司 | Automobile oil tank outer cover positioning system and method based on computer vision |
CN117789153B (en) * | 2024-02-26 | 2024-05-03 | 浙江驿公里智能科技有限公司 | Automobile oil tank outer cover positioning system and method based on computer vision |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022206942A1 (en) | Laser radar point cloud dynamic segmentation and fusion method based on driving safety risk field | |
Han et al. | Research on road environmental sense method of intelligent vehicle based on tracking check | |
CN108345822B (en) | Point cloud data processing method and device | |
CN113313154A (en) | Integrated multi-sensor integrated automatic driving intelligent sensing device | |
CN112581612B (en) | Vehicle-mounted grid map generation method and system based on fusion of laser radar and all-round-looking camera | |
CN112633176B (en) | Rail transit obstacle detection method based on deep learning | |
GB2621048A (en) | Vehicle-road laser radar point cloud dynamic segmentation and fusion method based on driving safety risk field | |
EP4089659A1 (en) | Map updating method, apparatus and device | |
US11966234B2 (en) | System and method for monocular depth estimation from semantic information | |
CN116685874A (en) | Camera-laser radar fusion object detection system and method | |
CN102685516A (en) | Active safety type assistant driving method based on stereoscopic vision | |
CN112950678A (en) | Beyond-the-horizon fusion sensing system based on vehicle-road cooperation | |
CN114419874B (en) | Target driving safety risk early warning method based on road side sensing equipment data fusion | |
CN113359709A (en) | Unmanned motion planning method based on digital twins | |
CN115019043B (en) | Cross-attention mechanism-based three-dimensional object detection method based on image point cloud fusion | |
WO2022098511A2 (en) | Architecture for map change detection in autonomous vehicles | |
CN116311113A (en) | Driving environment sensing method based on vehicle-mounted monocular camera | |
CN115775378A (en) | Vehicle-road cooperative target detection method based on multi-sensor fusion | |
CN117387647A (en) | Road planning method integrating vehicle-mounted sensor data and road sensor data | |
Habib et al. | Lane departure detection and transmission using Hough transform method | |
Pan et al. | Vision-based Vehicle Forward Collision Warning System Using Optical Flow Algorithm. | |
US11555928B2 (en) | Three-dimensional object detection with ground removal intelligence | |
Jung et al. | Intelligent Hybrid Fusion Algorithm with Vision Patterns for Generation of Precise Digital Road Maps in Self-driving Vehicles. | |
CN116129553A (en) | Fusion sensing method and system based on multi-source vehicle-mounted equipment | |
CN116453205A (en) | Method, device and system for identifying stay behavior of commercial vehicle |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||