CN116311113A - Driving environment sensing method based on vehicle-mounted monocular camera - Google Patents

Driving environment sensing method based on vehicle-mounted monocular camera

Info

Publication number
CN116311113A
CN116311113A (application CN202310093603.XA)
Authority
CN
China
Prior art keywords
convolution
network
monocular camera
vehicle
driving environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310093603.XA
Other languages
Chinese (zh)
Inventor
朱宗卫
魏冉
王超
周学海
李曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Institute Of Higher Studies University Of Science And Technology Of China
Original Assignee
Suzhou Institute Of Higher Studies University Of Science And Technology Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Institute Of Higher Studies University Of Science And Technology Of China filed Critical Suzhou Institute Of Higher Studies University Of Science And Technology Of China
Priority to CN202310093603.XA priority Critical patent/CN116311113A/en
Publication of CN116311113A publication Critical patent/CN116311113A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/588Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a driving environment sensing method based on a vehicle-mounted monocular camera, which comprises structural re-parameterization of the upsampling module and automatic driving multi-task perception. Compared with common linear interpolation and transposed convolution, using RepUpsample improves the accuracy of the network model to a certain extent. On semantic segmentation tasks, comparing the accuracy of the DeepLabv3, FPN and U-Net models with different upsampling modules shows that, across different network models, different upsampling positions and different network scales, using RepUpsample as the upsampling method improves the performance of the semantic segmentation network. Compared with the bilinear interpolation algorithm, mIoU is improved by 1.77% on average and pixel accuracy (P.A.) by 0.74% on average; compared with transposed convolution, mIoU is improved by 1.16% on average and P.A. by 0.35% on average.

Description

Driving environment sensing method based on vehicle-mounted monocular camera
Technical Field
The invention relates to the field of automatic driving, in particular to a driving environment sensing method based on a vehicle-mounted monocular camera.
Background
In recent years, deep learning and computer vision technologies have developed rapidly, and automatic driving technology based on them offers new solutions for improving traffic safety and efficiency. An automatic driving system does not tire, strictly obeys traffic rules, and has great potential for reducing accident rates. Automatic driving combines many technologies, such as artificial intelligence, communications, semiconductors, and automobiles. The related industry chain is broad and the value it can create is enormous, so it has become a field that the automotive and technology industries of many countries compete to enter; giant companies such as Google, Tesla, General Motors, and Baidu are all vigorously developing automatic driving technology. Driven jointly by technological progress, policy support, the entry of major players and capital, falling costs, and clearer application scenarios, automatic driving technology now stands at a key node of commercial deployment after more than a decade of exploration and development.
An automatic driving system is a comprehensive system integrating environment perception, decision control, and action execution, and it is large and complex. By function it can generally be divided into three modules: perception, decision, and control. The perception module is defined as the collection and processing of environmental and in-vehicle information; it senses the surrounding environment through various sensors and involves multiple tasks such as road boundary detection, drivable area detection, traffic sign recognition, vehicle and pedestrian detection, and road surface perception. The perception module is the cornerstone of the whole system: the decision system can make correct judgments only if the perception system provides accurate information, so the robustness of the perception system directly affects the reliability of the overall automatic driving system. The decision module makes judgments based on the perception information, determines a suitable operating mode, and assigns a corresponding control strategy, replacing the driver in making driving decisions. The execution module controls the vehicle by executing the decision result after the system makes a decision; each operating system of the vehicle is connected to the decision system through a bus, and the execution instructions precisely control driving actions such as acceleration, braking, steering amplitude, and light control, so as to realize autonomous control of the vehicle.
In the perception module, accurate sensing requires the cooperation of multiple sensors. Existing methods can be divided into three types: perception based on monocular images, on laser point clouds, and on millimeter wave radar. Lidar and millimeter wave radar are active sensors: they emit detection signals, receive the signals reflected by objects, and calculate the direction and distance of an object by comparing the emitted and received signals. The camera is a passive sensor that perceives objects through reflected light. The three sensors have different advantages and disadvantages. Lidar can generate high-precision three-dimensional point cloud information, but at present it is expensive, has a short service life, and produces a large amount of noise in rain and snow. Millimeter wave radar is little affected by weather and is sensitive to moving objects, but its resolution is lower and its discrimination capability is weaker. The monocular camera is low in cost, high in resolution, and provides rich visual information, but cannot acquire depth information. For safety reasons, autonomous vehicles are typically equipped with all three types of sensors, which are used by advanced driving assistance systems (Advanced Driving Assistance System, ADAS) for environment perception. However, on ordinary civilian automobiles the cost of installing these sensors is too high, whereas ordinary monocular cameras have a much wider hardware base. ADAS based on monocular cameras is therefore of greater practical value.
Disclosure of Invention
The invention aims to provide a driving environment sensing method based on a vehicle-mounted monocular camera that solves the following problems:
an optimized neural network upsampling module: deep neural networks generally adopt transposed convolution or bilinear interpolation as the upsampling module. Bilinear interpolation has no learnable parameters, is fast at inference but weak in expressive power; transposed convolution has learnable parameters and stronger expressive power. Exploiting the equivalence between the two, a multi-branch upsampling structure is used in the training phase and the branches are merged losslessly into a single branch in the reasoning phase, which raises network capacity without any loss of inference speed.
Automatic driving multi-task perception: the automatic driving perception module needs to sense the surrounding environment using the various sensors on the vehicle body, including target detection, road drivable area segmentation, and lane line detection. Because of its low cost, the monocular camera has a wide hardware base, so an automatic driving perception system based on a monocular camera is highly practical. Based on the images shot by the vehicle-mounted monocular camera, and combined with the structurally re-parameterized upsampling module, an automatic driving multi-task perception algorithm is designed that completes multiple tasks in a single inference pass, which improves the throughput of the whole system and reduces power consumption and memory usage.
The technical scheme of the invention is as follows:
a driving environment sensing method based on a vehicle-mounted monocular camera comprises the following steps:
s1, carrying out structural re-parameterization on an up-sampling module: comprising a training phase and an reasoning phase, wherein,
in the training stage, expanding one transposed convolution layer of an up-sampling module into multiple branches, wherein one branch uses a linear interpolation algorithm, and the other branches use transposed convolutions with different convolution kernel sizes;
in the reasoning stage, the multi-branch structure is subjected to re-parameterization and is converted into a single-branch structure in a lossless manner;
s2, automatic driving multitasking awareness: through a multi-task deep learning model, perception real-time reasoning based on a monocular camera is realized, and three tasks of target detection, road drivable region segmentation and lane line segmentation are completed.
Preferably, in the training stage in S1, a 1×1 convolution layer is added after the linear interpolation to change the number of channels of the output feature map; batch normalization is added after each upsampling branch to further improve model performance.
Preferably, in the reasoning stage in S1, for double upsampling, the three branches use bilinear interpolation followed by a 1×1 convolution, a 2×2 transposed convolution, and a 4×4 transposed convolution respectively, each followed by a batch normalization layer; an input feature map X∈R^(Cin×H×W) is upsampled to obtain a feature map Y∈R^(Cout×2H×2W).
The 4×4 transposed convolution has convolution kernel W^T_4×4 and bias b_4×4. Because a transposed convolution is computed by first padding and then convolving, the convolution kernel of the convolution step is W_4×4; the correspondence is shown in formula 1: the first two (channel) dimensions of W^T_4×4 are transposed and its last two (spatial) dimensions are flipped:
W_4×4[j, i, :, :] = rot180(W^T_4×4[i, j, :, :])  (1)
After the convolution, normalization is carried out through the BN layer; in the reasoning process the BN layer parameters are fused into the convolution, and the fused transposed-convolution parameters are shown in formulas 2 and 3:
W_4×4 ← (γ/σ)·W_4×4  (2)
b_4×4 ← (γ/σ)·(b_4×4 − μ) + β  (3)
where γ, β, σ and μ correspond to the weight, bias, standard deviation and mean of the BN layer respectively, giving the fused weight W_4×4 and bias b_4×4; finally, transforming back through the relation in formula 1 yields the final transposed-convolution weight W'_4×4.
The 2×2 transposed convolution has convolution kernel W^T_2×2; its weight W'_2×2 is obtained by zero-padding the outer ring of the 2×2 kernel to 4×4 size and then fusing the BN layer parameters.
Bilinear interpolation is followed by a 1×1 convolution; the bilinear interpolation layer is converted losslessly into a 4×4 transposed convolution with kernel W_bilinear and no bias, and the 1×1 convolution that changes the number of channels has kernel W_1×1 and bias b_1×1. The obtained 4×4 transposed convolution is first fused with the 1×1 convolution, and the new weights are shown in formulas 4 and 5:
W_bilinear ← W_1×1 × W_bilinear (4)
b_bilinear ← b_1×1 (5)
Finally, the BN layer parameters are fused to obtain the weight W'_bilinear and bias b'_bilinear.
After each of the three branches has been independently transformed into a 4×4 transposed convolution, the weights and biases of the three transposed convolutions are added to obtain the final result, as shown in formulas 6 and 7:
W ← W'_4×4 + W'_2×2 + W'_bilinear (6)
b ← b'_4×4 + b'_2×2 + b'_bilinear (7)
and through structural re-parameterization, the multi-branch complex structure during training is compressed in a lossless manner.
Preferably, in S2, the overall structure of the multi-task deep learning model adopts an encoding-decoding network structure; the three tasks use different decoding head networks respectively and share the encoding network. The encoding network is divided into a Backbone network and a Neck network according to their roles and positions: the Backbone network directly receives the input from the camera and mines shallow feature information of the network, and the Neck network receives the feature information from the Backbone network, performs further feature fusion and feature extraction to obtain deeper feature information, and passes it to the different decoding networks.
Preferably, the Backbone network adopts ResNet; images shot by the monocular camera are scaled to 640×340, and feature maps of different sizes are obtained through a multi-layer residual structure.
Preferably, the Neck network adopts an improved BiFPN structure; the BiFPN structure comprises a plurality of upsampling operations, and the upsampling in the BiFPN is replaced with the re-parameterizable RepUpsample;
a plurality of BiFPNs are connected to form the Neck structure of the whole perception network; the feature maps input to the Neck correspond to P3, P4 and P5, deeper features P3-P7 are obtained after the first BiFPN and fed into the next BiFPN structure; the output of the Neck network is obtained through 4 BiFPN structures in series, and the feature map sizes are 5×3, 10×6, 20×12, 40×24 and 80×48 respectively.
Preferably, the road drivable region segmentation performs pixel-by-pixel recognition and classification with a semantic segmentation algorithm, receives the feature information obtained from the Backbone and Neck networks, and adopts a BiFPN structure and FCN for classification; the upsampling operations involved are replaced with RepUpsample; road drivable region segmentation employs a cross-entropy loss.
Preferably, the target detection adopts an anchor-based detection scheme: prior boxes are preset, the probability that each grid cell contains a target and the probability of each target class are determined from feature maps of different scales, and finally duplicate detection boxes are removed by non-maximum suppression (NMS) to obtain the final detection result; the target detection loss function includes a classification loss for the target, a location loss for the target, and a confidence loss for the target.
Preferably, the lane line segmentation is based on key point detection, the picture is horizontally divided into a plurality of strips, each strip is further divided into a plurality of blocks, the positions occupied by the lane lines in each strip are predicted, the positions of the key points on the lane lines are obtained through the lane line detection, the key points belonging to the same lane line are connected into lines, and the loss is calculated by using cross entropy.
The invention has the advantages that:
compared with the common linear interpolation and transpose convolution, the method has the advantage that the accuracy of the network model is improved to a certain extent by using the RepUPSAmple. On the task of the semantic segmentation model, compared with the precision performance of the deep Labv3 model, the FPN model and the U-Net model when different upsampling modules are used, the performance of the semantic segmentation network can be improved by using the RepUPSAmple as an upsampling method according to different network models, different upsampling positions and different network scales. The mIOU can be improved by 1.77% on average and 0.74% on average by P.A. and 1.16% by P.A. and 0.35% by transpose convolution compared to bilinear interpolation algorithm.
Drawings
The invention is further described below with reference to the accompanying drawings and examples:
FIG. 1 is a structural diagram of the re-parameterized upsampling module RepUpsample;
FIG. 2 is a diagram of the overall architecture of a multitasking aware network;
FIG. 3 is a structural diagram of the Backbone network;
FIG. 4 is a structural diagram of the BiFPN network.
Detailed Description
The invention provides a driving environment sensing method based on a vehicle-mounted monocular camera, which comprises structural re-parameterization of the upsampling module and automatic driving multi-task perception.
S1, carrying out structural re-parameterization on an up-sampling module
Upsampling plays an irreplaceable role in neural networks; many network models rely on upsampling to recover the feature-map size and to fuse multi-channel information. Common upsampling modules include linear interpolation and transposed convolution: linear interpolation has no learnable parameters, is fast at inference but weak in expressive power, while transposed convolution has learnable parameters and stronger expressive power. In fact, linear interpolation is a special case of transposed convolution, and every linear interpolation can be replaced losslessly by a transposed convolution. Based on this, and combined with the idea of re-parameterization, an upsampling-layer structural re-parameterization method, RepUpsample, is proposed.
In the training phase, one transposed convolution layer is expanded into multiple branches, where one branch uses an ordinary linear interpolation algorithm and the other branches use transposed convolutions with different convolution kernel sizes. Linear interpolation can only change the size of the feature map, not the number of output channels, so a 1×1 convolution layer is added to change the channel count. In addition, batch normalization normalizes the feature map and improves the generalization ability of the network, so adding batch normalization after each upsampling branch further improves model performance. Adding the interpolation branch is equivalent to providing a skip connection for the transposed convolution, which can then focus on learning the residual.
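To make the training-time structure concrete, the following is a minimal PyTorch sketch of such a multi-branch 2x upsampling block; the class name RepUpsampleTrain and its arguments are illustrative assumptions rather than code from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepUpsampleTrain(nn.Module):
    """Training-time multi-branch 2x upsampling block (illustrative sketch).

    Branch 1: bilinear interpolation followed by a 1x1 convolution (sets channel count)
    Branch 2: 2x2 transposed convolution, stride 2
    Branch 3: 4x4 transposed convolution, stride 2, padding 1
    Each branch has its own BatchNorm; the three outputs are summed.
    """
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.bn_bilinear = nn.BatchNorm2d(out_ch)
        self.tconv2 = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.tconv4 = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.bn4 = nn.BatchNorm2d(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        up = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        y = self.bn_bilinear(self.conv1x1(up))      # interpolation branch (skip-like path)
        y = y + self.bn2(self.tconv2(x))            # 2x2 transposed-conv branch
        y = y + self.bn4(self.tconv4(x))            # 4x4 transposed-conv branch
        return y
```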
In the reasoning stage, the multi-branch structure can be re-parameterized and converted losslessly into a single-branch structure. Taking double upsampling as an example, the three branches during training use bilinear interpolation followed by a 1×1 convolution, a 2×2 transposed convolution, and a 4×4 transposed convolution respectively, each followed by a batch normalization layer, as shown in fig. 1. An input feature map X∈R^(Cin×H×W) is upsampled to obtain a feature map Y∈R^(Cout×2H×2W).
(1) 4×4 transposed convolution branch
The 4×4 transposed convolution has convolution kernel W^T_4×4 and bias b_4×4. Because a transposed convolution is computed by first padding and then convolving, the convolution kernel of the convolution step is W_4×4; the correspondence is shown in formula 1: the first two (channel) dimensions of W^T_4×4 are transposed and its last two (spatial) dimensions are flipped:
W_4×4[j, i, :, :] = rot180(W^T_4×4[i, j, :, :])  (1)
After the convolution, normalization is performed by the BN layer; during reasoning, the convolution can absorb the BN layer parameters to accelerate inference. The fused transposed-convolution parameters are shown in formulas 2 and 3:
W_4×4 ← (γ/σ)·W_4×4  (2)
b_4×4 ← (γ/σ)·(b_4×4 − μ) + β  (3)
where γ, β, σ and μ correspond to the weight, bias, standard deviation and mean of the BN layer respectively, giving the fused weight W_4×4 and bias b_4×4; finally, transforming back through the relation in formula 1 yields the final transposed-convolution weight W'_4×4.
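The BN fusion of formulas 2 and 3 can be illustrated with a small helper that folds a trailing BatchNorm2d into the weight and bias of the preceding (transposed) convolution; the function name and the shape convention are assumptions of this sketch.

```python
import torch
import torch.nn as nn

def fuse_bn(weight: torch.Tensor, bias: torch.Tensor, bn: nn.BatchNorm2d):
    """Fold a trailing BatchNorm into the preceding convolution's parameters.

    For a ConvTranspose2d the weight shape is (in_ch, out_ch, kH, kW), so the
    output-channel axis is dimension 1 and the BN scale is broadcast along it.
    """
    std = torch.sqrt(bn.running_var + bn.eps)                 # per-channel standard deviation
    scale = bn.weight / std                                   # gamma / sigma  (formula 2)
    fused_weight = weight * scale.reshape(1, -1, 1, 1)        # scale each output channel
    fused_bias = (bias - bn.running_mean) * scale + bn.bias   # formula 3
    return fused_weight, fused_bias
```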
(2) 2×2 transposed convolution branch
The 2×2 transposed convolution has convolution kernel W^T_2×2; its weight W'_2×2 can be obtained by zero-padding the outer ring of the 2×2 kernel to 4×4 size and then fusing the BN layer parameters.
(3) Bilinear interpolation + 1×1 convolution branch
The bilinear interpolation layer can be converted losslessly into a transposed convolution with a 4×4 convolution kernel W_bilinear and no bias; the 1×1 convolution that changes the number of channels has kernel W_1×1 and bias b_1×1. The resulting 4×4 transposed convolution is first fused with the 1×1 convolution, and the new weights are shown in formulas 4 and 5:
W_bilinear ← W_1×1 × W_bilinear (4)
b_bilinear ← b_1×1 (5)
Finally, the BN layer parameters are fused to obtain the weight W'_bilinear and bias b'_bilinear.
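As an illustration of rewriting 2x bilinear interpolation as a 4x4 transposed convolution (stride 2, padding 1), the depthwise kernel below reproduces F.interpolate(mode="bilinear", align_corners=False) away from the image border; the helper name and the zero-padded border behavior are assumptions of this sketch, not statements from the patent.

```python
import torch
import torch.nn as nn

def bilinear_as_tconv(channels: int) -> nn.ConvTranspose2d:
    """Depthwise 4x4 transposed conv (stride 2, padding 1) that performs 2x bilinear
    upsampling; with zero padding it matches bilinear interpolation only away from
    the image border."""
    w1d = torch.tensor([0.25, 0.75, 0.75, 0.25])          # 1-D bilinear weights for scale 2
    kernel = torch.outer(w1d, w1d)                         # separable 4x4 bilinear kernel
    tconv = nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2,
                               padding=1, groups=channels, bias=False)
    with torch.no_grad():
        for c in range(channels):
            tconv.weight[c, 0] = kernel                    # same kernel for every channel
    return tconv
```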
(4) Multi-branch fusion
After each of the three branches has been independently transformed into a 4×4 transposed convolution, the weights and biases of the three transposed convolutions are added to obtain the final result, as shown in formulas 6 and 7:
W ← W'_4×4 + W'_2×2 + W'_bilinear (6)
b ← b'_4×4 + b'_2×2 + b'_bilinear (7)
through the structure heavy parameterization, the multi-branch complex structure during training can be compressed in a lossless manner, the accuracy is kept unchanged, and the reasoning efficiency is improved.
S2, automatic driving multitasking awareness
The automatic driving scenario demands both accuracy and speed. Through a multi-task deep learning model, real-time perception inference based on a monocular camera can be realized, completing the three tasks of target detection, road drivable area segmentation, and lane line segmentation. The overall structure of the model adopts an encoding-decoding network structure: the three tasks use different decoding head networks and share the encoding network. The encoding network is divided into a Backbone and a Neck according to their roles and positions. The Backbone directly receives the input from the camera, sits at a relatively shallow position in the network, and mines shallow feature information; the Neck sits in the middle of the whole network, receives the feature information from the Backbone, performs further feature fusion and feature extraction to obtain deeper feature information, and passes it to the different decoding networks. The overall structure of the entire network is shown in fig. 2.
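A minimal structural sketch of this shared-encoder, multi-head layout could look as follows; the class and attribute names are placeholders, and the concrete Backbone, Neck and head modules are those described in the subsections below.

```python
import torch
import torch.nn as nn

class MultiTaskPerception(nn.Module):
    """Shared encoder (Backbone + Neck) feeding three task-specific decoder heads."""

    def __init__(self, backbone: nn.Module, neck: nn.Module,
                 det_head: nn.Module, seg_head: nn.Module, lane_head: nn.Module):
        super().__init__()
        self.backbone, self.neck = backbone, neck
        self.det_head, self.seg_head, self.lane_head = det_head, seg_head, lane_head

    def forward(self, image: torch.Tensor) -> dict:
        feats = self.backbone(image)   # shallow multi-scale features (e.g. P3-P5)
        feats = self.neck(feats)       # fused deeper multi-scale features (e.g. P3-P7)
        return {
            "detection": self.det_head(feats),   # boxes, classes, confidences
            "drivable": self.seg_head(feats),    # per-pixel drivable-area mask
            "lane": self.lane_head(feats),       # lane-line predictions
        }
```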
(1) Backbone network
The Backbone network adopts ResNet, a classic neural network with good performance on image feature extraction. The image shot by the monocular camera is scaled to 640×340, and feature maps P1-P5 of different sizes are obtained through a multi-layer residual structure, as shown in fig. 3.
(2) Neck network
The Neck network adopts an improved BiFPN structure; the bidirectional pyramid structure helps generate and fuse features of different scales, so that the finally generated feature maps contain multi-scale and multi-semantic information at the same time. The BiFPN structure contains several upsampling operations; the upsampling in the BiFPN is replaced by the re-parameterizable RepUpsample, which improves the flexibility of the upsampling module, better adapts to changes in feature-map dimensions, and improves the overall performance of the network. A single BiFPN structure is shown in fig. 4.
Several BiFPNs are connected to form the Neck structure of the whole perception network; the feature maps input to the Neck correspond to P3, P4 and P5, and after the first BiFPN the deeper features P3-P7 are obtained and fed into the next BiFPN structure. The output of the Neck network is obtained through 4 BiFPN structures in series, and the feature map sizes are 5×3, 10×6, 20×12, 40×24 and 80×48 respectively.
(3) Road drivable area dividing head network
The road drivable region segmentation performs pixel-by-pixel recognition and classification with a semantic segmentation algorithm. It receives the feature information obtained from the Backbone and Neck, and likewise uses a BiFPN structure together with FCN for classification; the several upsampling operations involved are also replaced with RepUpsample. Road drivable region segmentation uses a cross-entropy loss.
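For illustration, the per-pixel cross-entropy loss of the drivable-area head could be computed as in the sketch below; the tensor shapes and the two-class (background / drivable) setting are assumptions.

```python
import torch
import torch.nn.functional as F

def drivable_area_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (N, num_classes, H, W) at input resolution; labels: (N, H, W) class indices
    return F.cross_entropy(logits, labels)

logits = torch.randn(2, 2, 96, 160, requires_grad=True)   # background vs drivable
labels = torch.randint(0, 2, (2, 96, 160))
drivable_area_loss(logits, labels).backward()
```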
(4) Target detection header network
The target detection head is similar to the YOLOv5 network and adopts an anchor-based detection scheme: prior boxes are preset, the probability that each grid cell contains a target and the probability of each target class are predicted from feature maps of different scales, and finally duplicate detection boxes are removed by non-maximum suppression (NMS) to obtain the final detection result. The target detection loss function includes a classification loss, a location loss, and a confidence loss for the target.
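A hedged post-processing sketch for this head is shown below, using torchvision's NMS operator; the score and IoU thresholds are illustrative values rather than parameters specified by the patent.

```python
import torch
from torchvision.ops import nms

def filter_detections(boxes: torch.Tensor, scores: torch.Tensor,
                      score_thr: float = 0.25, iou_thr: float = 0.45):
    """boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,) objectness * class probability."""
    keep = scores > score_thr                  # drop low-confidence predictions first
    boxes, scores = boxes[keep], scores[keep]
    keep_idx = nms(boxes, scores, iou_thr)     # remove duplicate, heavily overlapping boxes
    return boxes[keep_idx], scores[keep_idx]
```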
(5) Lane line detection head network
Lane line detection is based on key-point detection: the picture is horizontally divided into a number of strips, each strip is divided into a number of cells, and the position occupied by the lane line in each strip is predicted. Compared with detecting lane lines by semantic segmentation, detecting them with key points greatly reduces the computational cost of the network model. Lane line detection yields the positions of the key points on the lane lines; key points belonging to the same lane line are connected into a line, and the loss is calculated using cross entropy.
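The row-anchor formulation can be sketched as a per-strip classification over horizontal cells plus an extra "no lane in this strip" class, trained with cross entropy; the numbers of strips, cells and lanes below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def lane_row_anchor_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (N, num_cells + 1, num_strips, num_lanes), the last class meaning "absent"
    # targets: (N, num_strips, num_lanes), index of the occupied cell or num_cells
    return F.cross_entropy(logits, targets)

logits = torch.randn(2, 101, 56, 4, requires_grad=True)   # 100 cells + "absent"
targets = torch.randint(0, 101, (2, 56, 4))
lane_row_anchor_loss(logits, targets).backward()
```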
Compared with common linear interpolation and transposed convolution, RepUpsample improves the accuracy of the network model to a certain extent. On semantic segmentation tasks, the accuracy of the DeepLabv3, FPN and U-Net models with different upsampling modules is compared, and the results are shown in Table 1.
Table 1 influence of different upsampling methods on semantic segmentation network
According to the experimental results, across different network models, different upsampling positions, and different network scales, using RepUpsample as the upsampling method improves the performance of the semantic segmentation network. Compared with the bilinear interpolation algorithm, mIoU is raised by 1.77% on average and P.A. by 0.74% on average; compared with transposed convolution, mIoU is raised by 1.16% on average and P.A. by 0.35% on average.
Automatic driving technology is still in a stage of rapid development, and it is not hard to imagine that transportation in the future will inevitably be automated driving based on artificial intelligence. At present, however, the technology is still immature: besides the high demands on software, it places additional demands on hardware. Accurate perception requires the cooperation of multiple sensors, such as lidar, millimeter wave radar, cameras, and inertial measurement units, and it is often very difficult to retrofit older cars with an on-board artificial intelligence system of any real intelligence.
The network model provided by this patent is based on a monocular camera, so the hardware requirements are low; most vehicles already carry one, for example the camera of a dash cam, and the required computing performance is modest. When a pedestrian is detected or the vehicle drifts out of its lane, the driver can be warned in time, providing a degree of driving assistance. In addition, with the development of intelligent transportation, vehicle-to-vehicle and vehicle-to-road cooperation may become possible in the future, with different vehicles exchanging information at intersections to warn each other about blind-spot information.
China's sky-eye surveillance system is widely deployed, and most urban intersections are equipped with cameras for monitoring illegal behavior; the network model provided by this patent can be applied to intersection monitoring equipment to bring artificial intelligence to edge devices. An intersection camera can detect traffic flow and adjust traffic-light timing accordingly to keep the intersection flowing, detect pedestrians running red lights and issue warnings, detect vehicles that speed, park illegally, or run red lights, and record license plate numbers for reporting to the supervising authority.
The above embodiments are only for illustrating the technical concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement the same according to the content of the present invention, and are not intended to limit the scope of the present invention. All modifications made according to the spirit of the main technical proposal of the invention should be covered in the protection scope of the invention.

Claims (10)

1. The driving environment sensing method based on the vehicle-mounted monocular camera is characterized by comprising the following steps of:
s1, carrying out structural re-parameterization on an up-sampling module: comprising a training phase and an reasoning phase, wherein,
in the training stage, expanding one transposed convolution layer of an up-sampling module into multiple branches, wherein one branch uses a linear interpolation algorithm, and the other branches use transposed convolutions with different convolution kernel sizes;
in the reasoning stage, the multi-branch structure is subjected to re-parameterization and is converted into a single-branch structure in a lossless manner;
s2, automatic driving multitasking awareness: through a multi-task deep learning model, perception real-time reasoning based on a monocular camera is realized, and three tasks of target detection, road drivable region segmentation and lane line segmentation are completed.
2. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 1, wherein in the training stage in S1, a 1×1 convolution layer is added after the linear interpolation to change the number of channels of the output feature map; batch normalization is added after each upsampling branch to further improve model performance.
3. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 2, wherein in the reasoning stage in S1, for double upsampling, the three branches during training use bilinear interpolation followed by a 1×1 convolution, a 2×2 transposed convolution, and a 4×4 transposed convolution respectively, each followed by a batch normalization layer; an input feature map X∈R^(Cin×H×W) is upsampled to obtain a feature map Y∈R^(Cout×2H×2W).
4. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 3, wherein the 4×4 transposed convolution has convolution kernel W^T_4×4 and bias b_4×4; because a transposed convolution is computed by first padding and then convolving, the convolution kernel of the convolution step is W_4×4, and the correspondence is shown in formula 1: the first two (channel) dimensions of W^T_4×4 are transposed and its last two (spatial) dimensions are flipped:
W_4×4[j, i, :, :] = rot180(W^T_4×4[i, j, :, :])  (1)
after the convolution, normalization is carried out through the BN layer; in the reasoning process the BN layer parameters are fused into the convolution, and the fused transposed-convolution parameters are shown in formulas 2 and 3:
W_4×4 ← (γ/σ)·W_4×4  (2)
b_4×4 ← (γ/σ)·(b_4×4 − μ) + β  (3)
wherein γ, β, σ and μ respectively correspond to the weight, bias, standard deviation and mean of the BN layer, giving the fused weight W_4×4 and bias b_4×4, and the final transposed-convolution weight W'_4×4 is obtained by transforming back through the relation in formula 1;
the 2×2 transposed convolution has convolution kernel W^T_2×2, and its weight W'_2×2 is obtained by zero-padding the outer ring of the 2×2 kernel to 4×4 size and fusing the BN layer parameters;
the bilinear interpolation is followed by a 1×1 convolution; the bilinear interpolation layer is converted losslessly into a 4×4 transposed convolution with kernel W_bilinear and no bias, and the 1×1 convolution that changes the number of channels has kernel W_1×1 and bias b_1×1; the obtained 4×4 transposed convolution is first fused with the 1×1 convolution, and the new weights are shown in formulas 4 and 5:
W_bilinear ← W_1×1 × W_bilinear (4)
b_bilinear ← b_1×1 (5)
finally, the BN layer parameters are fused to obtain the weight W'_bilinear and bias b'_bilinear;
after each of the three branches is independently transformed into a 4×4 transposed convolution, the weights and biases of the three transposed convolutions are added to obtain the final result, as shown in formulas 6 and 7:
W ← W'_4×4 + W'_2×2 + W'_bilinear (6)
b ← b'_4×4 + b'_2×2 + b'_bilinear (7)
and through structural re-parameterization, the multi-branch complex structure during training is compressed in a lossless manner.
5. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 1, wherein in S2, the overall structure of the multi-task deep learning model adopts an encoding-decoding network structure; the three tasks use different decoding head networks respectively and share the encoding network; the encoding network is divided into a Backbone network and a Neck network according to their roles and positions, the Backbone network directly receives the input from the camera and mines shallow feature information of the network, and the Neck network receives the feature information from the Backbone network, performs further feature fusion and feature extraction to obtain deeper feature information, and passes it to the different decoding networks.
6. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 5, wherein the Backbone network adopts ResNet; images shot by the monocular camera are scaled to 640×340, and feature maps of different sizes are obtained through a multi-layer residual structure.
7. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 5, wherein the Neck network adopts an improved BiFPN structure, the BiFPN structure comprises a plurality of upsampling operations, and the upsampling in the BiFPN is replaced with the re-parameterizable RepUpsample;
a plurality of BiFPNs are connected to form the Neck structure of the whole perception network; the feature maps input to the Neck correspond to P3, P4 and P5, deeper features P3-P7 are obtained after the first BiFPN and input into the next BiFPN structure; the output of the Neck network is obtained through 4 BiFPN structures in series, and the feature map sizes are 5×3, 10×6, 20×12, 40×24 and 80×48 respectively.
8. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 7, wherein the road drivable region segmentation performs pixel-by-pixel recognition and classification with a semantic segmentation algorithm, receives the feature information obtained from the Backbone and Neck networks, and adopts a BiFPN structure and FCN for classification; the upsampling operations involved are replaced with RepUpsample; road drivable region segmentation employs a cross-entropy loss.
9. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 8, wherein the target detection adopts an anchor-based detection scheme: prior boxes are preset, the probability that each grid cell contains a target and the probability of each target class are determined from feature maps of different scales, and finally duplicate detection boxes are removed by non-maximum suppression (NMS) to obtain the final detection result; the target detection loss function includes a classification loss for the target, a location loss for the target, and a confidence loss for the target.
10. The driving environment sensing method based on the vehicle-mounted monocular camera according to claim 9, wherein the lane line segmentation is based on key point detection, the picture is horizontally divided into a plurality of strips, each strip is divided into a plurality of blocks, the position occupied by the lane line in each strip is predicted, the position of the key point on the lane line is obtained by the lane line detection, the key points belonging to the same lane line are connected into a line, and the loss is calculated by using cross entropy.
CN202310093603.XA 2023-02-10 2023-02-10 Driving environment sensing method based on vehicle-mounted monocular camera Pending CN116311113A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310093603.XA CN116311113A (en) 2023-02-10 2023-02-10 Driving environment sensing method based on vehicle-mounted monocular camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310093603.XA CN116311113A (en) 2023-02-10 2023-02-10 Driving environment sensing method based on vehicle-mounted monocular camera

Publications (1)

Publication Number Publication Date
CN116311113A true CN116311113A (en) 2023-06-23

Family

ID=86789716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310093603.XA Pending CN116311113A (en) 2023-02-10 2023-02-10 Driving environment sensing method based on vehicle-mounted monocular camera

Country Status (1)

Country Link
CN (1) CN116311113A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117789153A (en) * 2024-02-26 2024-03-29 浙江驿公里智能科技有限公司 Automobile oil tank outer cover positioning system and method based on computer vision
CN117789153B (en) * 2024-02-26 2024-05-03 浙江驿公里智能科技有限公司 Automobile oil tank outer cover positioning system and method based on computer vision

Similar Documents

Publication Publication Date Title
WO2022206942A1 (en) Laser radar point cloud dynamic segmentation and fusion method based on driving safety risk field
Han et al. Research on road environmental sense method of intelligent vehicle based on tracking check
CN108345822B (en) Point cloud data processing method and device
CN113313154A (en) Integrated multi-sensor integrated automatic driving intelligent sensing device
CN112581612B (en) Vehicle-mounted grid map generation method and system based on fusion of laser radar and all-round-looking camera
CN112633176B (en) Rail transit obstacle detection method based on deep learning
GB2621048A (en) Vehicle-road laser radar point cloud dynamic segmentation and fusion method based on driving safety risk field
EP4089659A1 (en) Map updating method, apparatus and device
US11966234B2 (en) System and method for monocular depth estimation from semantic information
CN116685874A (en) Camera-laser radar fusion object detection system and method
CN102685516A (en) Active safety type assistant driving method based on stereoscopic vision
CN112950678A (en) Beyond-the-horizon fusion sensing system based on vehicle-road cooperation
CN114419874B (en) Target driving safety risk early warning method based on road side sensing equipment data fusion
CN113359709A (en) Unmanned motion planning method based on digital twins
CN115019043B (en) Cross-attention mechanism-based three-dimensional object detection method based on image point cloud fusion
WO2022098511A2 (en) Architecture for map change detection in autonomous vehicles
CN116311113A (en) Driving environment sensing method based on vehicle-mounted monocular camera
CN115775378A (en) Vehicle-road cooperative target detection method based on multi-sensor fusion
CN117387647A (en) Road planning method integrating vehicle-mounted sensor data and road sensor data
Habib et al. Lane departure detection and transmission using Hough transform method
Pan et al. Vision-based Vehicle Forward Collision Warning System Using Optical Flow Algorithm.
US11555928B2 (en) Three-dimensional object detection with ground removal intelligence
Jung et al. Intelligent Hybrid Fusion Algorithm with Vision Patterns for Generation of Precise Digital Road Maps in Self-driving Vehicles.
CN116129553A (en) Fusion sensing method and system based on multi-source vehicle-mounted equipment
CN116453205A (en) Method, device and system for identifying stay behavior of commercial vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination