CN116665189B - Multi-mode-based automatic driving task processing method and system - Google Patents
- Publication number
- CN116665189B (application CN202310945276.6A)
- Authority
- CN
- China
- Prior art keywords
- perception
- voxel
- automatic driving
- type
- task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/00—Drive control systems specially adapted for autonomous road vehicles
- B60W60/001—Planning or execution of driving tasks
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W50/0097—Predicting future conditions
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W50/0098—Details of control systems ensuring comfort, safety or stability not otherwise provided for
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/00—Drive control systems specially adapted for autonomous road vehicles
- B60W60/001—Planning or execution of driving tasks
- B60W60/0027—Planning or execution of driving tasks using trajectory prediction for other traffic participants
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/771—Feature selection, e.g. selecting representative features from a multi-dimensional feature space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W2050/0001—Details of the control system
- B60W2050/0043—Signal treatments, identification of variables or parameters, parameter estimation or state estimation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a multi-mode-based automatic driving task processing method and system, wherein the method comprises the following steps: acquiring modal data collected by a plurality of perception sensors, and extracting voxel features of the modal data; unifying the feature dimension and resolution of the extracted voxel features, then performing feature fusion to obtain first-type voxel features that adaptively fuse the different modalities; and acquiring an automatic driving perception task, inputting the first-type voxel features and the perception task into a pre-established and trained Transformer model for automatic driving, and completing the tasks of autonomous-vehicle perception, surrounding-object action prediction and driving-behavior planning. The method can effectively reduce the training cost and the deep-learning-model deployment difficulty caused by using multiple independent models, and can fully utilize the correlation among the perception, prediction and planning tasks so that they mutually improve one another's performance.
Description
Technical Field
The invention relates to the technical field of automatic driving, in particular to a multi-mode-based automatic driving task processing method and system.
Background
Autonomous driving (Autonomous Driving) technology has triggered an industrial revolution in the automotive industry, and its development is inseparable from the continual innovation and advancement of autonomous-driving perception, prediction and planning technology. With the continuous improvement of perception-sensor technology and related artificial-intelligence algorithms, an autonomous vehicle can obtain more accurate and comprehensive scene information and complete the autonomous-driving Perception, Prediction and Planning tasks, thereby realizing safer and more efficient driving. Perception is the "visual system" of the autonomous vehicle, while prediction and planning are its "brain". Autonomous driving is a key technology for building intelligent transportation and smart cities, and provides important technical support for future smart-city construction in China.
Perception-sensor technology mainly involves lidar, millimeter-wave radar and cameras. Current mainstream autonomous-driving technology uses several independent deep learning models and, with multi-modal data from these three mainstream perception sensors, completes the perception, prediction and planning tasks separately. This approach has the following disadvantages: 1) the deep-learning network structure that extracts features from multi-modal data is common to every task and is one of the main components of each model's structure, so using multiple independent models increases the training cost; 2) the perception, prediction and planning tasks are correlated, but independent models cannot exploit this correlation to improve task accuracy; 3) multiple independent models increase the actual deployment cost of the deep learning models.
Disclosure of Invention
In order to solve the technical problems in the background technology, the invention provides a multi-mode-based automatic driving task processing method and a multi-mode-based automatic driving task processing system.
The invention provides a multi-mode-based automatic driving task processing method, which comprises the following steps:
s1, acquiring modal data acquired by a plurality of perception sensors, and extracting voxel characteristics of the modal data;
s2, unifying the feature dimension and resolution of the extracted voxel features, then performing feature fusion to obtain first-type voxel features that adaptively fuse the different modalities;
s3, acquiring an automatic driving perception task, inputting the first-type voxel features and the perception task into a pre-established and trained Transformer model for automatic driving, and completing the tasks of autonomous-vehicle perception, surrounding-object action prediction and driving-behavior planning; the Transformer model specifically comprises: an automatic-driving perception Transformer network, a surrounding-object action prediction Transformer network, and a driving-behavior planning Transformer network;
"S3" specifically includes:
acquiring an automatic driving perception task, inputting the first-type voxel features and the perception task into the automatic-driving perception Transformer network, completing the corresponding autonomous-vehicle perception task through a perception task output head, and acquiring a perception output result;
constructing the perception-related Key and Value from the perception output result;
inputting the perception output result into a voxel feature filter to obtain sparse second-type voxel features of interest, and constructing the first-type Key and Value related to the voxel environment from the second-type voxel features;
inputting the perception-related Key and Value and the first-type Key and Value simultaneously into the surrounding-object action prediction Transformer network to obtain the second-type Key and Value related to action prediction, and then completing the corresponding surrounding-object action prediction task of the autonomous vehicle through a surrounding-object action prediction output head;
and after the second-type Key and Value are input into the driving-behavior planning Transformer network, completing the corresponding driving-behavior planning task of the autonomous vehicle through a driving-behavior planning output head.
Preferably, the modal data collected by the plurality of perception sensors specifically includes: the image I_cam collected by the camera sensor, the point cloud P_LiDAR collected by the lidar sensor, and the point cloud P_Radar collected by the millimeter-wave radar sensor.
Preferably, the perception tasks of automatic driving include, but are not limited to, three-dimensional object detection, three-dimensional object tracking, three-dimensional semantic segmentation, three-dimensional space occupancy prediction, and online map generation.
Preferably, "inputting the perception output result into a voxel feature filter to obtain sparse second-type voxel features of interest, and constructing the first-type Key and Value related to the voxel environment from the second-type voxel features" specifically includes:
inputting the perception output result into the voxel feature filter, which uses the perception output result to select the voxel features of interest from the first-type voxel features, yielding sparse second-type voxel features of interest, and constructing the first-type Key and Value related to the voxel environment from the second-type voxel features.
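As a minimal illustration of the voxel feature filter described above, the sketch below selects a sparse subset of voxels using a perception-derived score map. The scoring rule, the top-k selection and every name here are assumptions for illustration only; the patent does not fix how the filter chooses the voxels of interest.

```python
import numpy as np

def screen_voxels(voxel_feats, perception_scores, k=4):
    """Keep the k voxels the perception stage scored highest
    (e.g. occupancy or objectness), returning their sparse features
    and grid coordinates."""
    c, x, y, z = voxel_feats.shape
    flat_scores = perception_scores.reshape(-1)                  # (X*Y*Z,)
    top = np.argsort(flat_scores)[-k:][::-1]                     # flat indices of the top-k scores
    coords = np.stack(np.unravel_index(top, (x, y, z)), axis=1)  # (k, 3) voxel coordinates
    sparse_feats = voxel_feats.reshape(c, -1)[:, top].T          # (k, C) features of interest
    return sparse_feats, coords

rng = np.random.default_rng(1)
feats = rng.standard_normal((8, 4, 4, 2))   # dense first-type voxel features (C=8)
scores = rng.random((4, 4, 2))              # hypothetical perception score per voxel
sparse, coords = screen_voxels(feats, scores, k=4)
print(sparse.shape, coords.shape)  # (4, 8) (4, 3)
```

The sparse `(k, C)` features would then serve as the basis for the first-type Key and Value fed to the action-prediction network.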
Preferably, the tasks of the driving behavior planning include, but are not limited to, keeping straight, turning left, turning right, accelerating, decelerating and stopping.
A multi-modal based autopilot task processing system comprising:
the feature extraction module is used for acquiring the modal data acquired by the plurality of perception sensors and extracting voxel features of the modal data;
the feature fusion module is used for carrying out feature fusion after unifying the feature dimension and the resolution of the extracted voxel features to obtain first type voxel features which are adaptively fused with different modes;
the task processing module, which acquires an automatic driving perception task, inputs the first-type voxel features and the perception task into a pre-established and trained Transformer model for automatic driving, and completes the tasks of autonomous-vehicle perception, surrounding-object action prediction and driving-behavior planning; the Transformer model specifically comprises: an automatic-driving perception Transformer network, a surrounding-object action prediction Transformer network, and a driving-behavior planning Transformer network; the task processing module comprises: an automatic-driving perception processing module, a surrounding-object action prediction processing module, a driving-behavior planning processing module and a voxel feature screening module;
the automatic-driving perception processing module is used for acquiring an automatic driving perception task, inputting the first-type voxel features and the perception task into the automatic-driving perception Transformer network, completing the corresponding autonomous-vehicle perception task through a perception task output head, acquiring a perception output result, and constructing the perception-related Key and Value from the perception output result;
the voxel feature screening module is used for inputting the perception output result into the voxel feature filter to obtain sparse second-type voxel features of interest, and constructing the first-type Key and Value related to the voxel environment from the second-type voxel features;
the surrounding-object action prediction processing module is used for simultaneously inputting the perception-related Key and Value and the first-type Key and Value into the surrounding-object action prediction Transformer network to obtain the second-type Key and Value related to action prediction, and then completing the corresponding surrounding-object action prediction task of the autonomous vehicle through a surrounding-object action prediction output head;
the driving-behavior planning processing module is used for inputting the second-type Key and Value into the driving-behavior planning Transformer network and then completing the corresponding driving-behavior planning task of the autonomous vehicle through the driving-behavior planning output head.
Preferably, the modal data collected by the plurality of perception sensors specifically includes: an image collected by a camera sensor, a point cloud collected by a lidar sensor, and a point cloud collected by a millimeter-wave radar sensor;
the perception tasks of the autopilot comprise, but are not limited to, three-dimensional target detection, three-dimensional target tracking, three-dimensional semantic segmentation, three-dimensional space occupation prediction and online map generation.
Preferably, the tasks of the driving behavior planning include, but are not limited to, keeping straight, turning left, turning right, accelerating, decelerating and stopping.
According to the multi-mode-based automatic driving task processing method and system, data from various sensors can be processed and fused into a unified voxel space in the multi-modal voxel-feature generation stage, so that adding or removing sensors is flexibly supported and the feature requirements of the multiple subsequent tasks are met. In the multi-task output stage, the multi-stage tasks of perception, prediction and planning are combined, which effectively reduces the increased training cost and deep-learning-model deployment difficulty caused by using multiple independent models, and fully utilizes the correlation among the perception, prediction and planning tasks so that they mutually improve one another's performance.
Drawings
FIG. 1 is a schematic diagram of a workflow of a multi-mode-based automatic driving task processing method according to the present invention;
FIG. 2 is a schematic diagram of the operation flow of the multi-mode-based automatic driving task processing method according to the present invention;
FIG. 3 is a schematic structural diagram of the multi-mode automatic driving algorithm system based on a unified large model according to the present invention;
FIG. 4 is a schematic structural diagram of the task processing module of the multi-mode automatic driving algorithm system based on the unified large model according to the present invention.
Detailed Description
The variable subscripts "cam", "LiDAR" and "Radar" are used to distinguish the camera, laser radar (LiDAR) and millimeter-wave radar (Radar) sensors, and the variable subscripts "perc", "pred" and "plan" are used to distinguish the Perception, Prediction and Planning tasks.
Referring to fig. 1 and 2, the multi-mode-based automatic driving task processing method provided by the invention comprises the following steps:
s1, acquiring modal data acquired by a plurality of perception sensors, and extracting voxel characteristics of the modal data.
In this embodiment, the perception sensors (camera, laser radar, millimeter-wave radar and the like) collect the modal data of the autonomous-driving application scene; correspondingly, the image I_cam collected by the camera sensor, the point cloud P_LiDAR collected by the lidar sensor and the point cloud P_Radar collected by the millimeter-wave radar sensor are each input into the corresponding voxel-feature generation network to obtain the corresponding voxel features.
And S2, after unifying the feature dimension and the resolution of the extracted voxel features, carrying out feature fusion to obtain first type voxel features which are adaptively fused with different modes.
In this embodiment, the voxel features corresponding to the image I_cam collected by the camera sensor, the point cloud P_LiDAR collected by the lidar sensor and the point cloud P_Radar collected by the millimeter-wave radar sensor are converted into a unified voxel feature space, forming voxel features V_cam, V_LiDAR and V_Radar that all share the same feature dimension C and spatial resolution X×Y×Z. Each voxel feature is input into the corresponding adaptive voxel-feature fusion network to obtain its adaptive fusion weight:
the voxel feature V_cam of the image modality generates the corresponding image-voxel-feature adaptive fusion weight W_cam;
the voxel feature V_LiDAR of the lidar point-cloud modality generates the corresponding lidar-point-cloud-voxel-feature adaptive fusion weight W_LiDAR;
the voxel feature V_Radar of the millimeter-wave-radar point-cloud modality generates the corresponding millimeter-wave-radar-point-cloud-voxel-feature adaptive fusion weight W_Radar.
The generated fusion weights W_cam, W_LiDAR and W_Radar are numerically normalized, where a Softmax function can be adopted as the normalization function.
The voxel features of the three modalities are multiplied by their corresponding normalized adaptive fusion weights and summed to obtain the adaptively fused voxel feature V_fused:
V_fused = W_cam · V_cam + W_LiDAR · V_LiDAR + W_Radar · V_Radar.
The fused voxel feature V_fused retains the same feature dimension C and resolution X×Y×Z, and the system can flexibly adapt to increases and decreases in the number of sensors: the input can be tri-modal (camera, lidar and millimeter-wave radar), bi-modal (camera and lidar; lidar and millimeter-wave radar; camera and millimeter-wave radar) or single-modal (camera; lidar; millimeter-wave radar), in every case yielding uniform voxel features.
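The adaptive fusion step above can be sketched as follows. The per-modality weight networks, the array shapes and the NumPy stand-ins are illustrative assumptions (the patent does not specify the weight-network architecture); only the Softmax normalization across modalities and the weighted sum follow the text directly.

```python
import numpy as np

def adaptive_voxel_fusion(voxel_feats, weight_nets):
    """Fuse per-modality voxel features (each C×X×Y×Z) with
    Softmax-normalized per-voxel fusion weights, yielding a fused
    feature of the same shape."""
    # each weight net maps a voxel feature to one scalar weight per voxel
    raw = np.stack([net(v) for net, v in zip(weight_nets, voxel_feats)])  # (M, X, Y, Z)
    e = np.exp(raw - raw.max(axis=0, keepdims=True))    # numerically stable Softmax
    w = e / e.sum(axis=0, keepdims=True)                # weights sum to 1 across modalities
    return sum(w[m][None] * voxel_feats[m] for m in range(len(voxel_feats)))

# stand-in "weight networks": channel mean (a real system would learn these)
mean_net = lambda v: v.mean(axis=0)
rng = np.random.default_rng(0)
feats = [rng.standard_normal((8, 4, 4, 2)) for _ in range(3)]  # camera, lidar, radar
fused = adaptive_voxel_fusion(feats, [mean_net] * 3)
print(fused.shape)  # (8, 4, 4, 2)
```

Because the weights are normalized over however many modalities are present, the same routine handles the tri-modal, bi-modal and single-modal inputs mentioned above without structural change.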
S3, acquiring an automatic driving perception task, inputting the first-type voxel features and the perception task into a pre-established and trained Transformer model for automatic driving, and completing the tasks of autonomous-vehicle perception, surrounding-object action prediction and driving-behavior planning.
The perception tasks of autopilot include, but are not limited to, three-dimensional object detection, three-dimensional object tracking, three-dimensional semantic segmentation, three-dimensional space occupation prediction, and online map generation.
Specifically, the unified large model is a Transformer model for automatic driving.
Specifically, as shown in fig. 1 and 2, the Transformer model comprises: an automatic-driving perception Transformer network, a surrounding-object action prediction Transformer network, and a driving-behavior planning Transformer network;
"S3" specifically includes:
acquiring an automatic driving perception task, inputting the first-type voxel features and the perception task into the automatic-driving perception Transformer network, completing the corresponding autonomous-vehicle perception task through a perception task output head, and acquiring a perception output result; the perception-related Key and Value are constructed from the perception output result.
Specifically, the perception output result R_perc is used to construct multiple types of perception-related Key and Value, denoted K_perc and V_perc.
The perception output result is input into a voxel feature filter to obtain sparse second-type voxel features of interest, and the first-type Key and Value related to the voxel environment are constructed from the second-type voxel features.
"Inputting the perception output result into a voxel feature filter to obtain sparse second-type voxel features of interest, and constructing the first-type Key and Value related to the voxel environment from the second-type voxel features" specifically includes:
inputting the perception output result into the voxel feature filter, which uses the perception output result to select the voxel features of interest from the first-type voxel features, yielding sparse second-type voxel features of interest, and constructing the first-type Key and Value related to the voxel environment from the second-type voxel features.
The perception-related Key and Value and the first-type Key and Value are input simultaneously into the surrounding-object action prediction Transformer network to obtain the second-type Key and Value related to action prediction; the corresponding surrounding-object action prediction task of the autonomous vehicle is then completed through the surrounding-object action prediction output head.
In this embodiment, the surrounding-object action prediction Transformer neural network uses the perception-related K_perc and V_perc together with the voxel-environment-related first-type Key and Value to learn and update the action-prediction Query Q_pred; the updated Q_pred is used to complete the surrounding-object action prediction task of the autonomous vehicle, and the action-prediction-related Key and Value are denoted K_pred and V_pred.
The method comprises the following steps:
Step 1: the network first uses the perception-related K_perc and V_perc to learn and update Q_pred, following the computation of the Transformer structure:
Q_pred = FFN(Norm(Q_pred · K_perc^T) · V_perc),
whose main calculation steps are: Q_pred · K_perc^T calculates the correlation matrix of the two; the Norm function normalizes the correlation matrix and is realized by a Softmax function; FFN is a feed-forward neural network, which can be arranged as a two-layer structure; the feature dimension may be set to 128.
Step 2: the network then uses the voxel-environment-related first-type Key and Value to further learn and update Q_pred, the process still following a similar Transformer computation.
Step 3: the Q_pred updated by the above steps is fed into the action prediction output head to output the action prediction result R_pred.
Step 4: the Q_pred updated by the above steps also serves as the action-prediction-related Key and Value, denoted K_pred and V_pred.
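Steps 1 and 2 above follow the standard Transformer cross-attention pattern: correlation matrix between Query and Key, Softmax normalization, aggregation of Value, then a feed-forward network. A minimal NumPy sketch is given below; single-head attention, the absence of residual connections and layer normalization, and all concrete shapes are simplifying assumptions made here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ffn(x, w1, w2):
    """Two-layer feed-forward network with ReLU, as the text suggests."""
    return np.maximum(x @ w1, 0.0) @ w2

def update_query(q, k, v, w1, w2):
    """One cross-attention update of a Query set: correlation matrix
    Q·K^T, Softmax normalization, Value aggregation, then the FFN."""
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))   # (num_q, num_kv) normalized correlation
    return ffn(attn @ v, w1, w2)           # updated Query, same shape as q

rng = np.random.default_rng(2)
d = 128                                    # feature dimension, set to 128 as in the text
q_pred = rng.standard_normal((6, d))       # 6 action-prediction queries
k_perc = rng.standard_normal((10, d))      # perception-related Key
v_perc = rng.standard_normal((10, d))      # perception-related Value
w1, w2 = rng.standard_normal((d, d)) * 0.05, rng.standard_normal((d, d)) * 0.05
q_pred = update_query(q_pred, k_perc, v_perc, w1, w2)  # Step 1 (perception K/V)
print(q_pred.shape)  # (6, 128)
```

Step 2 would call `update_query` a second time with the voxel-environment Key and Value, and the resulting `q_pred` would feed both the action-prediction output head (Step 3) and the next network as K_pred/V_pred (Step 4).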
After the second-type Key and Value are input into the driving-behavior planning Transformer network, the corresponding driving-behavior planning task of the autonomous vehicle is completed through the driving-behavior planning output head.
In this embodiment, the driving-behavior planning Transformer neural network uses the action-prediction-related K_pred and V_pred to learn and update the driving-behavior planning Query Q_plan; the updated Q_plan is used to complete the corresponding driving-behavior planning task of the autonomous vehicle.
Tasks of driving behavior planning include, but are not limited to, maintaining straight travel, turning left, turning right, accelerating, decelerating and stopping.
The specific implementation process is as follows:
step 5: the driving behavior planning transducer neural networkAction prediction related +.>Planning a Query for driving behavior (denoted +.>) Learning and updating is performed, the process still being based on +.>A similar calculation is as follows:
step 6: the +.2 updated by the above step>Send to driving behavior planning output head->In for outputting the driving behavior planning prediction result +.>:
The output result includes, but is not limited to, specific driving behaviors such as maintaining straight travel, turning left, turning right, accelerating, decelerating and parking.
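Such a planning output head can be sketched as a classifier over the discrete behavior set listed above; the linear projection and the label names are illustrative assumptions, since the patent does not specify the head's form.

```python
import numpy as np

BEHAVIORS = ["keep_straight", "turn_left", "turn_right",
             "accelerate", "decelerate", "stop"]

def behavior_planning_head(query, w, b):
    """Hypothetical output head: projects the updated driving behavior
    planning Query to logits over the behavior set and picks the argmax."""
    logits = query @ w + b
    return BEHAVIORS[int(np.argmax(logits))]

rng = np.random.default_rng(2)
d = 128
query = rng.normal(size=d)                  # updated driving behavior planning Query
w = rng.normal(size=(d, len(BEHAVIORS))) * 0.01
b = np.zeros(len(BEHAVIORS))
behavior = behavior_planning_head(query, w, b)
print(behavior in BEHAVIORS)  # True
```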
Referring to fig. 3, a multi-modality based autopilot task processing system includes:
the feature extraction module is used for acquiring the modal data acquired by the plurality of perception sensors and extracting voxel features of the modal data;
the feature fusion module is used for carrying out feature fusion after unifying the feature dimension and the resolution of the extracted voxel features to obtain first type voxel features which are adaptively fused with different modes;
the task processing module acquires an automatic driving perception task, inputs the first type voxel features and the perception task into a pre-established and trained automatic driving Transformer model, and completes the tasks of automatic driving vehicle perception, surrounding object motion prediction and driving behavior planning. The Transformer model specifically comprises: an automatic driving perception Transformer network, a surrounding object motion prediction Transformer network and a driving behavior planning Transformer network.
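The feature fusion module's two steps — unifying feature dimension and resolution, then adaptively fusing the modalities — could be sketched as follows. Nearest-neighbour resampling, the per-voxel softmax weighting, and all shapes here are assumptions standing in for the patent's unspecified fusion scheme.

```python
import numpy as np

def unify(feat, target_shape, proj):
    """Resample a voxel grid to `target_shape` (nearest neighbour) and
    project its channel dimension through `proj` to a common width."""
    x, y, z, _ = feat.shape
    xi = np.linspace(0, x - 1, target_shape[0]).astype(int)
    yi = np.linspace(0, y - 1, target_shape[1]).astype(int)
    zi = np.linspace(0, z - 1, target_shape[2]).astype(int)
    resampled = feat[np.ix_(xi, yi, zi)]     # unify resolution
    return resampled @ proj                  # unify feature dimension

def adaptive_fuse(feats):
    """Per-voxel softmax weights over modalities: a simple stand-in for
    the patent's adaptive fusion of different modes."""
    stacked = np.stack(feats)                # (num_modalities, X, Y, Z, C)
    scores = stacked.mean(axis=-1, keepdims=True)
    w = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return (w * stacked).sum(axis=0)

rng = np.random.default_rng(3)
target, d_model = (8, 8, 4), 32              # assumed common grid and width
cam = unify(rng.normal(size=(16, 16, 8, 64)), target,
            rng.normal(size=(64, d_model)) * 0.1)
lidar = unify(rng.normal(size=(32, 32, 16, 16)), target,
              rng.normal(size=(16, d_model)) * 0.1)
fused = adaptive_fuse([cam, lidar])          # first type voxel features
print(fused.shape)  # (8, 8, 4, 32)
```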
Specifically, as shown in fig. 3, the task processing module includes: the system comprises an automatic driving perception processing module, a surrounding object action prediction processing module, a driving behavior planning processing module and a voxel characteristic screening module;
the automatic driving perception processing module is used for acquiring an automatic driving perception task, inputting the first type voxel features and the perception task into the automatic driving perception Transformer network, completing the corresponding automatic driving vehicle perception task through the perception task output head, acquiring a perception output result, and constructing the perception-related Key and Value by utilizing the perception output result;
the voxel feature screening module is used for inputting the perception output result into the voxel feature screening device to obtain sparse interesting second type voxel features, and constructing a first type Key and Value related to voxel environments through the second type voxel features;
the peripheral object motion prediction processing module is used for inputting the perception-related Key and Value and the first type Key and Value into the peripheral object motion prediction Transformer network at the same time to obtain the second type Key and Value related to motion prediction, and then completing the corresponding task of peripheral object motion prediction of the automatic driving vehicle through the peripheral object motion prediction output head;
the driving behavior planning processing module is used for inputting the second type Key and Value into the driving behavior planning Transformer network and then completing the corresponding task of driving behavior planning of the automatic driving vehicle through the driving behavior planning output head.
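The voxel feature screening module described above could be sketched as a top-k selection over a perception-derived score (here an assumed occupancy score); both k and the scoring rule are illustrative assumptions, not the patent's specified screener.

```python
import numpy as np

def screen_voxels(voxel_feats, occupancy_scores, k=64):
    """Hypothetical screener: keep the k voxels the perception output marks
    as most likely occupied, yielding sparse second type voxel features."""
    flat_feats = voxel_feats.reshape(-1, voxel_feats.shape[-1])
    flat_scores = occupancy_scores.reshape(-1)
    idx = np.argsort(flat_scores)[-k:]       # indices of interesting voxels
    return flat_feats[idx]                   # basis for the first type Key/Value

rng = np.random.default_rng(4)
voxels = rng.normal(size=(8, 8, 4, 32))      # first type voxel features (assumed grid)
scores = rng.random(size=(8, 8, 4))          # e.g. a 3D occupancy prediction output
kv = screen_voxels(voxels, scores, k=64)
print(kv.shape)  # (64, 32): sparse interesting second type voxel features
```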
Specifically, as shown in fig. 2, the modal data collected by the plurality of sensing sensors specifically includes: an image acquired by a camera sensor, a point cloud acquired by a laser radar sensor and a point cloud acquired by a millimeter wave radar sensor;
the perception tasks of autopilot include, but are not limited to, three-dimensional object detection, three-dimensional object tracking, three-dimensional semantic segmentation, three-dimensional space occupation prediction, and online map generation.
Specifically, as shown in FIG. 2, tasks of driving behavior planning include, but are not limited to, maintaining straight travel, turning left, turning right, accelerating, decelerating and stopping.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art according to the technical scheme of the present invention and the inventive concept thereof, within the scope disclosed by the present invention, shall be covered by the protection scope of the present invention.
Claims (8)
1. The multi-mode-based automatic driving task processing method is characterized by comprising the following steps of:
s1, acquiring modal data acquired by a plurality of perception sensors, and extracting voxel characteristics of the modal data;
s2, after unifying the feature dimension and the resolution of the extracted voxel features, carrying out feature fusion to obtain first type voxel features which are adaptively fused with different modes;
s3, acquiring an automatic driving perception task, inputting the first type voxel features and the perception task into a pre-established and trained automatic driving Transformer model, and completing tasks of automatic driving vehicle perception, surrounding object motion prediction and driving behavior planning; the Transformer model specifically comprises: an automatic driving perception Transformer network, a surrounding object motion prediction Transformer network and a driving behavior planning Transformer network;
"S3" specifically includes:
acquiring an automatic driving perception task, inputting the first type voxel features and the perception task into the automatic driving perception Transformer network, completing the corresponding automatic driving vehicle perception task through the perception task output head, and acquiring a perception output result;
constructing a Key and a Value related to perception by using a perception output result;
inputting the perception output result into a voxel feature filter to obtain sparse interesting second type of voxel features, and constructing a first type Key and Value related to voxel environment through the second type of voxel features;
inputting the perception-related Key and Value and the first type Key and Value into the surrounding object motion prediction Transformer network at the same time to obtain the second type Key and Value related to motion prediction, and then completing the corresponding task of predicting the motion of surrounding objects of the automatic driving vehicle through the surrounding object motion prediction output head;
and after the second type Key and Value are input into the driving behavior planning Transformer network, completing the corresponding task of driving behavior planning of the automatic driving vehicle through the driving behavior planning output head.
2. The multi-mode-based automatic driving task processing method according to claim 1, wherein the modal data collected by the plurality of perception sensors specifically includes: an image collected by a camera sensor, a point cloud collected by a laser radar sensor, and a point cloud collected by a millimeter wave radar sensor.
3. The multi-mode-based automatic driving task processing method of claim 1, wherein the perception tasks of automatic driving include, but are not limited to, three-dimensional object detection, three-dimensional object tracking, three-dimensional semantic segmentation, three-dimensional space occupancy prediction and online map generation.
4. The method for processing the automatic driving task based on the multiple modes according to claim 1, wherein inputting the perception output result into the voxel feature filter to obtain sparse interesting second type of voxel features, and constructing the first type Key and Value related to the voxel environment through the second type of voxel features specifically comprises:
and inputting the perception output result into a voxel feature screening device, selecting interesting voxel features of the first type of voxel features through the perception output result by the voxel feature screening device, selecting sparse interesting second type of voxel features, and constructing first type Key and Value related to voxel environments through the second type of voxel features.
5. The method of claim 1, wherein the driving behavior planning tasks include, but are not limited to, keep straight, turn left, turn right, accelerate, decelerate, and park.
6. A multi-modal based autopilot task processing system comprising:
the feature extraction module is used for acquiring the modal data acquired by the plurality of perception sensors and extracting voxel features of the modal data;
the feature fusion module is used for carrying out feature fusion after unifying the feature dimension and the resolution of the extracted voxel features to obtain first type voxel features which are adaptively fused with different modes;
the task processing module acquires an automatic driving perception task, inputs the first type voxel features and the perception task into a pre-established and trained automatic driving Transformer model, and completes tasks of automatic driving vehicle perception, surrounding object motion prediction and driving behavior planning; the Transformer model specifically comprises: an automatic driving perception Transformer network, a surrounding object motion prediction Transformer network and a driving behavior planning Transformer network;
the task processing module comprises: the system comprises an automatic driving perception processing module, a surrounding object action prediction processing module, a driving behavior planning processing module and a voxel characteristic screening module;
the automatic driving perception processing module is used for acquiring an automatic driving perception task, inputting the first type voxel features and the perception task into the automatic driving perception Transformer network, completing the corresponding automatic driving vehicle perception task through the perception task output head, acquiring a perception output result, and constructing the perception-related Key and Value by utilizing the perception output result;
the voxel feature screening module is used for inputting the perception output result into the voxel feature screening device to obtain sparse interesting second type voxel features, and constructing a first type Key and Value related to voxel environments through the second type voxel features;
the peripheral object motion prediction processing module is used for inputting the perception-related Key and Value and the first type Key and Value into the peripheral object motion prediction Transformer network at the same time to obtain the second type Key and Value related to motion prediction, and then completing the corresponding task of peripheral object motion prediction of the automatic driving vehicle through the peripheral object motion prediction output head;
the driving behavior planning processing module is used for inputting the second type Key and Value into the driving behavior planning Transformer network and then completing the corresponding task of driving behavior planning of the automatic driving vehicle through the driving behavior planning output head.
7. The multi-modal based autopilot task processing system of claim 6 wherein the modal data collected by the plurality of perception sensors specifically includes: an image acquired by a camera sensor, a point cloud acquired by a laser radar sensor and a point cloud acquired by a millimeter wave radar sensor;
the perception tasks of the autopilot comprise, but are not limited to, three-dimensional target detection, three-dimensional target tracking, three-dimensional semantic segmentation, three-dimensional space occupation prediction and online map generation.
8. The multimodal automatic driving task processing system of claim 6 wherein the driving behavior planning tasks include, but are not limited to, holding straight, turning left, turning right, accelerating, decelerating, and stopping.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310945276.6A CN116665189B (en) | 2023-07-31 | 2023-07-31 | Multi-mode-based automatic driving task processing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116665189A CN116665189A (en) | 2023-08-29 |
CN116665189B true CN116665189B (en) | 2023-10-31 |
Family
ID=87710145
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310945276.6A Active CN116665189B (en) | 2023-07-31 | 2023-07-31 | Multi-mode-based automatic driving task processing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116665189B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115063539A (en) * | 2022-07-19 | 2022-09-16 | 上海人工智能创新中心 | Image dimension increasing method and three-dimensional target detection method |
CN115775378A (en) * | 2022-11-30 | 2023-03-10 | 北京航空航天大学 | Vehicle-road cooperative target detection method based on multi-sensor fusion |
CN116229224A (en) * | 2023-01-18 | 2023-06-06 | 重庆长安汽车股份有限公司 | Fusion perception method and device, electronic equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113674421B (en) * | 2021-08-25 | 2023-10-13 | 北京百度网讯科技有限公司 | 3D target detection method, model training method, related device and electronic equipment |
JP2023073231A (en) * | 2021-11-15 | 2023-05-25 | 三星電子株式会社 | Method and device for image processing |
Non-Patent Citations (5)
Title |
---|
"Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection"; Xin Li et al.; arXiv; entire document *
"ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning"; Shengchao Hu et al.; arXiv; Section 3, Fig. 2 *
"Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection in Autonomous Driving"; Zhenxun Yuan et al.; arXiv; entire document *
"Three-Dimensional Dynamic Object Detection Algorithm Based on Voxel-Point Cloud Fusion"; Zhou Feng et al.; Journal of Computer-Aided Design & Computer Graphics; Vol. 34, No. 6; entire document *
"End-to-End Image Generation Method for Autonomous Driving Based on Improved GAN"; Sun Xiongfeng et al.; Journal of Transport Information and Safety; Vol. 39, No. 5; entire document *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||