CN116665189B - Multi-mode-based automatic driving task processing method and system - Google Patents
- Publication number
- CN116665189B (application CN202310945276.6A)
- Authority
- CN
- China
- Prior art keywords
- perception
- voxel
- automatic driving
- type
- task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/00—Drive control systems specially adapted for autonomous road vehicles
- B60W60/001—Planning or execution of driving tasks
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W50/0097—Predicting future conditions
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W50/0098—Details of control systems ensuring comfort, safety or stability not otherwise provided for
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/00—Drive control systems specially adapted for autonomous road vehicles
- B60W60/001—Planning or execution of driving tasks
- B60W60/0027—Planning or execution of driving tasks using trajectory prediction for other traffic participants
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/771—Feature selection, e.g. selecting representative features from a multi-dimensional feature space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W2050/0001—Details of the control system
- B60W2050/0043—Signal treatments, identification of variables or parameters, parameter estimation or state estimation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a multi-mode-based automatic driving task processing method and system, wherein the method comprises the following steps: acquiring modal data collected by a plurality of perception sensors, and extracting voxel features of the modal data; unifying the feature dimension and resolution of the extracted voxel features, then performing feature fusion to obtain first-type voxel features that adaptively fuse the different modalities; and acquiring an automatic driving perception task, inputting the first-type voxel features and the perception task into a pre-established and trained Transformer model for automatic driving, and completing the tasks of autonomous-vehicle perception, surrounding-object action prediction and driving-behavior planning. The method can effectively reduce the training cost and the deep-learning-model deployment difficulty caused by using multiple independent models, and can fully utilize the correlation among the perception, prediction and planning tasks so that they mutually improve one another's performance.
Description
Technical Field
The invention relates to the technical field of automatic driving, in particular to a multi-mode-based automatic driving task processing method and system.
Background
Autonomous driving (Autonomous Driving) technology has triggered an industrial revolution in the automotive industry, and its development is inseparable from the continual innovation and advancement of autonomous-driving perception, prediction and planning technology. With the continuous improvement of perception-sensor technology and related artificial-intelligence algorithms, an autonomous vehicle can obtain more accurate and comprehensive scene information and complete the autonomous-driving Perception, Prediction and Planning tasks, thereby realizing safer and more efficient driving. Perception is the "visual system" of the autonomous vehicle, while prediction and planning are its "brain". Autonomous driving is a key technology for building intelligent transportation and smart cities, and provides important technical support for future smart-city construction in China.
Perception-sensor technology mainly involves lidar, millimeter-wave radar and cameras. Current mainstream autonomous-driving technology uses several independent deep learning models and, with multi-modal data from these three mainstream perception sensors, completes the perception, prediction and planning tasks separately. This approach has the following disadvantages: 1) the deep-learning network structure that extracts features from multi-modal data is common to every task and is one of the main components of each model's structure, so using multiple independent models increases the training cost; 2) the perception, prediction and planning tasks are correlated, but independent models cannot exploit this correlation to improve task accuracy; 3) multiple independent models increase the actual deployment cost of the deep learning models.
Disclosure of Invention
In order to solve the technical problems in the background technology, the invention provides a multi-mode-based automatic driving task processing method and a multi-mode-based automatic driving task processing system.
The invention provides a multi-mode-based automatic driving task processing method, which comprises the following steps:
s1, acquiring modal data acquired by a plurality of perception sensors, and extracting voxel characteristics of the modal data;
s2, unifying the feature dimension and resolution of the extracted voxel features, then performing feature fusion to obtain first-type voxel features that adaptively fuse the different modalities;
s3, acquiring an automatic driving perception task, inputting the first-type voxel features and the perception task into a pre-established and trained Transformer model for automatic driving, and completing the tasks of autonomous-vehicle perception, surrounding-object action prediction and driving-behavior planning; the Transformer model specifically comprises: an automatic-driving perception Transformer network, a surrounding-object action prediction Transformer network, and a driving-behavior planning Transformer network;
"S3" specifically includes:
acquiring an automatic driving perception task, inputting the first-type voxel features and the perception task into the automatic-driving perception Transformer network, completing the corresponding autonomous-vehicle perception task through a perception task output head, and acquiring a perception output result;
constructing the perception-related Key and Value from the perception output result;
inputting the perception output result into a voxel feature filter to obtain sparse second-type voxel features of interest, and constructing the first-type Key and Value related to the voxel environment from the second-type voxel features;
inputting the perception-related Key and Value and the first-type Key and Value simultaneously into the surrounding-object action prediction Transformer network to obtain the second-type Key and Value related to action prediction, and then completing the corresponding surrounding-object action prediction task of the autonomous vehicle through a surrounding-object action prediction output head;
and after the second-type Key and Value are input into the driving-behavior planning Transformer network, completing the corresponding driving-behavior planning task of the autonomous vehicle through a driving-behavior planning output head.
Preferably, the modal data collected by the plurality of perception sensors specifically includes: the image I_cam collected by the camera sensor, the point cloud P_LiDAR collected by the lidar sensor, and the point cloud P_Radar collected by the millimeter-wave radar sensor.
Preferably, the perception tasks of automatic driving include, but are not limited to, three-dimensional object detection, three-dimensional object tracking, three-dimensional semantic segmentation, three-dimensional space occupancy prediction, and online map generation.
Preferably, "inputting the perception output result into a voxel feature filter to obtain sparse second-type voxel features of interest, and constructing the first-type Key and Value related to the voxel environment from the second-type voxel features" specifically includes:
inputting the perception output result into the voxel feature filter, which uses the perception output result to select the voxel features of interest from the first-type voxel features, yielding sparse second-type voxel features of interest, and constructing the first-type Key and Value related to the voxel environment from the second-type voxel features.
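As a minimal illustration of the voxel feature filter described above, the sketch below selects a sparse subset of voxels using a perception-derived score map. The scoring rule, the top-k selection and every name here are assumptions for illustration only; the patent does not fix how the filter chooses the voxels of interest.

```python
import numpy as np

def screen_voxels(voxel_feats, perception_scores, k=4):
    """Keep the k voxels the perception stage scored highest
    (e.g. occupancy or objectness), returning their sparse features
    and grid coordinates."""
    c, x, y, z = voxel_feats.shape
    flat_scores = perception_scores.reshape(-1)                  # (X*Y*Z,)
    top = np.argsort(flat_scores)[-k:][::-1]                     # flat indices of the top-k scores
    coords = np.stack(np.unravel_index(top, (x, y, z)), axis=1)  # (k, 3) voxel coordinates
    sparse_feats = voxel_feats.reshape(c, -1)[:, top].T          # (k, C) features of interest
    return sparse_feats, coords

rng = np.random.default_rng(1)
feats = rng.standard_normal((8, 4, 4, 2))   # dense first-type voxel features (C=8)
scores = rng.random((4, 4, 2))              # hypothetical perception score per voxel
sparse, coords = screen_voxels(feats, scores, k=4)
print(sparse.shape, coords.shape)  # (4, 8) (4, 3)
```

The sparse `(k, C)` features would then serve as the basis for the first-type Key and Value fed to the action-prediction network.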
Preferably, the tasks of the driving behavior planning include, but are not limited to, keeping straight, turning left, turning right, accelerating, decelerating and stopping.
A multi-modal based autopilot task processing system comprising:
the feature extraction module is used for acquiring the modal data acquired by the plurality of perception sensors and extracting voxel features of the modal data;
the feature fusion module is used for carrying out feature fusion after unifying the feature dimension and the resolution of the extracted voxel features to obtain first type voxel features which are adaptively fused with different modes;
the task processing module, which acquires an automatic driving perception task, inputs the first-type voxel features and the perception task into a pre-established and trained Transformer model for automatic driving, and completes the tasks of autonomous-vehicle perception, surrounding-object action prediction and driving-behavior planning; the Transformer model specifically comprises: an automatic-driving perception Transformer network, a surrounding-object action prediction Transformer network, and a driving-behavior planning Transformer network; the task processing module comprises: an automatic-driving perception processing module, a surrounding-object action prediction processing module, a driving-behavior planning processing module and a voxel feature screening module;
the automatic-driving perception processing module is used for acquiring an automatic driving perception task, inputting the first-type voxel features and the perception task into the automatic-driving perception Transformer network, completing the corresponding autonomous-vehicle perception task through a perception task output head, acquiring a perception output result, and constructing the perception-related Key and Value from the perception output result;
the voxel feature screening module is used for inputting the perception output result into the voxel feature filter to obtain sparse second-type voxel features of interest, and constructing the first-type Key and Value related to the voxel environment from the second-type voxel features;
the surrounding-object action prediction processing module is used for simultaneously inputting the perception-related Key and Value and the first-type Key and Value into the surrounding-object action prediction Transformer network to obtain the second-type Key and Value related to action prediction, and then completing the corresponding surrounding-object action prediction task of the autonomous vehicle through a surrounding-object action prediction output head;
the driving-behavior planning processing module is used for inputting the second-type Key and Value into the driving-behavior planning Transformer network and then completing the corresponding driving-behavior planning task of the autonomous vehicle through the driving-behavior planning output head.
Preferably, the modal data collected by the plurality of perception sensors specifically includes: an image collected by a camera sensor, a point cloud collected by a lidar sensor, and a point cloud collected by a millimeter-wave radar sensor;
the perception tasks of the autopilot comprise, but are not limited to, three-dimensional target detection, three-dimensional target tracking, three-dimensional semantic segmentation, three-dimensional space occupation prediction and online map generation.
Preferably, the tasks of the driving behavior planning include, but are not limited to, keeping straight, turning left, turning right, accelerating, decelerating and stopping.
According to the multi-mode-based automatic driving task processing method and system, data from various sensors can be processed and fused into a unified voxel space in the multi-modal voxel-feature generation stage, so that adding or removing sensors is flexibly supported and the feature requirements of the multiple subsequent tasks are met. In the multi-task output stage, the multi-stage tasks of perception, prediction and planning are combined, which effectively reduces the increased training cost and deep-learning-model deployment difficulty caused by using multiple independent models, and fully utilizes the correlation among the perception, prediction and planning tasks so that they mutually improve one another's performance.
Drawings
FIG. 1 is a schematic diagram of a workflow of a multi-mode-based automatic driving task processing method according to the present invention;
FIG. 2 is a schematic diagram of the operation flow of the multi-mode-based automatic driving task processing method according to the present invention;
FIG. 3 is a schematic structural diagram of the multi-mode automatic driving algorithm system based on a unified large model according to the present invention;
FIG. 4 is a schematic structural diagram of the task processing module of the multi-mode automatic driving algorithm system based on the unified large model according to the present invention.
Detailed Description
The variable subscripts "cam", "LiDAR" and "Radar" are used to distinguish the camera, laser radar (LiDAR) and millimeter-wave radar (Radar) sensors, and the variable subscripts "perc", "pred" and "plan" are used to distinguish the Perception, Prediction and Planning tasks.
Referring to fig. 1 and 2, the multi-mode-based automatic driving task processing method provided by the invention comprises the following steps:
s1, acquiring modal data acquired by a plurality of perception sensors, and extracting voxel characteristics of the modal data.
In this embodiment, the perception sensors (camera, laser radar, millimeter-wave radar and the like) collect the modal data of the autonomous-driving application scene; correspondingly, the image I_cam collected by the camera sensor, the point cloud P_LiDAR collected by the lidar sensor and the point cloud P_Radar collected by the millimeter-wave radar sensor are each input into the corresponding voxel-feature generation network to obtain the corresponding voxel features.
And S2, after unifying the feature dimension and the resolution of the extracted voxel features, carrying out feature fusion to obtain first type voxel features which are adaptively fused with different modes.
In this embodiment, the voxel features corresponding to the image I_cam collected by the camera sensor, the point cloud P_LiDAR collected by the lidar sensor and the point cloud P_Radar collected by the millimeter-wave radar sensor are converted into a unified voxel feature space, forming voxel features V_cam, V_LiDAR and V_Radar that all share the same feature dimension C and spatial resolution X×Y×Z. Each voxel feature is input into the corresponding adaptive voxel-feature fusion network to obtain its adaptive fusion weight:
the voxel feature V_cam of the image modality generates the corresponding image-voxel-feature adaptive fusion weight W_cam;
the voxel feature V_LiDAR of the lidar point-cloud modality generates the corresponding lidar-point-cloud-voxel-feature adaptive fusion weight W_LiDAR;
the voxel feature V_Radar of the millimeter-wave-radar point-cloud modality generates the corresponding millimeter-wave-radar-point-cloud-voxel-feature adaptive fusion weight W_Radar.
The generated fusion weights W_cam, W_LiDAR and W_Radar are numerically normalized, where a Softmax function can be adopted as the normalization function.
The voxel features of the three modalities are multiplied by their corresponding normalized adaptive fusion weights and summed to obtain the adaptively fused voxel feature V_fused:
V_fused = W_cam · V_cam + W_LiDAR · V_LiDAR + W_Radar · V_Radar.
The fused voxel feature V_fused retains the same feature dimension C and resolution X×Y×Z, and the system can flexibly adapt to increases and decreases in the number of sensors: the input can be tri-modal (camera, lidar and millimeter-wave radar), bi-modal (camera and lidar; lidar and millimeter-wave radar; camera and millimeter-wave radar) or single-modal (camera; lidar; millimeter-wave radar), in every case yielding uniform voxel features.
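The adaptive fusion step above can be sketched as follows. The per-modality weight networks, the array shapes and the NumPy stand-ins are illustrative assumptions (the patent does not specify the weight-network architecture); only the Softmax normalization across modalities and the weighted sum follow the text directly.

```python
import numpy as np

def adaptive_voxel_fusion(voxel_feats, weight_nets):
    """Fuse per-modality voxel features (each C×X×Y×Z) with
    Softmax-normalized per-voxel fusion weights, yielding a fused
    feature of the same shape."""
    # each weight net maps a voxel feature to one scalar weight per voxel
    raw = np.stack([net(v) for net, v in zip(weight_nets, voxel_feats)])  # (M, X, Y, Z)
    e = np.exp(raw - raw.max(axis=0, keepdims=True))    # numerically stable Softmax
    w = e / e.sum(axis=0, keepdims=True)                # weights sum to 1 across modalities
    return sum(w[m][None] * voxel_feats[m] for m in range(len(voxel_feats)))

# stand-in "weight networks": channel mean (a real system would learn these)
mean_net = lambda v: v.mean(axis=0)
rng = np.random.default_rng(0)
feats = [rng.standard_normal((8, 4, 4, 2)) for _ in range(3)]  # camera, lidar, radar
fused = adaptive_voxel_fusion(feats, [mean_net] * 3)
print(fused.shape)  # (8, 4, 4, 2)
```

Because the weights are normalized over however many modalities are present, the same routine handles the tri-modal, bi-modal and single-modal inputs mentioned above without structural change.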
S3, acquiring an automatic driving perception task, inputting the first-type voxel features and the perception task into a pre-established and trained Transformer model for automatic driving, and completing the tasks of autonomous-vehicle perception, surrounding-object action prediction and driving-behavior planning.
The perception tasks of autopilot include, but are not limited to, three-dimensional object detection, three-dimensional object tracking, three-dimensional semantic segmentation, three-dimensional space occupation prediction, and online map generation.
Specifically, the unified large model is a Transformer model for automatic driving.
Specifically, as shown in fig. 1 and 2, the Transformer model comprises: an automatic-driving perception Transformer network, a surrounding-object action prediction Transformer network, and a driving-behavior planning Transformer network;
"S3" specifically includes:
acquiring an automatic driving perception task, inputting the first-type voxel features and the perception task into the automatic-driving perception Transformer network, completing the corresponding autonomous-vehicle perception task through a perception task output head, and acquiring a perception output result; the perception-related Key and Value are constructed from the perception output result.
Specifically, the perception output result R_perc is used to construct multiple types of perception-related Key and Value, denoted K_perc and V_perc.
The perception output result is input into a voxel feature filter to obtain sparse second-type voxel features of interest, and the first-type Key and Value related to the voxel environment are constructed from the second-type voxel features.
"Inputting the perception output result into a voxel feature filter to obtain sparse second-type voxel features of interest, and constructing the first-type Key and Value related to the voxel environment from the second-type voxel features" specifically includes:
inputting the perception output result into the voxel feature filter, which uses the perception output result to select the voxel features of interest from the first-type voxel features, yielding sparse second-type voxel features of interest, and constructing the first-type Key and Value related to the voxel environment from the second-type voxel features.
The perception-related Key and Value and the first-type Key and Value are input simultaneously into the surrounding-object action prediction Transformer network to obtain the second-type Key and Value related to action prediction; the corresponding surrounding-object action prediction task of the autonomous vehicle is then completed through the surrounding-object action prediction output head.
In this embodiment, the surrounding-object action prediction Transformer neural network uses the perception-related K_perc and V_perc together with the voxel-environment-related first-type Key and Value to learn and update the action-prediction Query Q_pred; the updated Q_pred is used to complete the surrounding-object action prediction task of the autonomous vehicle, and the action-prediction-related Key and Value are denoted K_pred and V_pred.
The method comprises the following steps:
Step 1: the network first uses the perception-related K_perc and V_perc to learn and update Q_pred, following the computation of the Transformer structure:
Q_pred = FFN(Norm(Q_pred · K_perc^T) · V_perc),
whose main calculation steps are: Q_pred · K_perc^T calculates the correlation matrix of the two; the Norm function normalizes the correlation matrix and is realized by a Softmax function; FFN is a feed-forward neural network, which can be arranged as a two-layer structure; the feature dimension may be set to 128.
Step 2: the network then uses the voxel-environment-related first-type Key and Value to further learn and update Q_pred, the process still following a similar Transformer computation.
Step 3: the Q_pred updated by the above steps is fed into the action prediction output head to output the action prediction result R_pred.
Step 4: the Q_pred updated by the above steps also serves as the action-prediction-related Key and Value, denoted K_pred and V_pred.
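Steps 1 and 2 above follow the standard Transformer cross-attention pattern: correlation matrix between Query and Key, Softmax normalization, aggregation of Value, then a feed-forward network. A minimal NumPy sketch is given below; single-head attention, the absence of residual connections and layer normalization, and all concrete shapes are simplifying assumptions made here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ffn(x, w1, w2):
    """Two-layer feed-forward network with ReLU, as the text suggests."""
    return np.maximum(x @ w1, 0.0) @ w2

def update_query(q, k, v, w1, w2):
    """One cross-attention update of a Query set: correlation matrix
    Q·K^T, Softmax normalization, Value aggregation, then the FFN."""
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))   # (num_q, num_kv) normalized correlation
    return ffn(attn @ v, w1, w2)           # updated Query, same shape as q

rng = np.random.default_rng(2)
d = 128                                    # feature dimension, set to 128 as in the text
q_pred = rng.standard_normal((6, d))       # 6 action-prediction queries
k_perc = rng.standard_normal((10, d))      # perception-related Key
v_perc = rng.standard_normal((10, d))      # perception-related Value
w1, w2 = rng.standard_normal((d, d)) * 0.05, rng.standard_normal((d, d)) * 0.05
q_pred = update_query(q_pred, k_perc, v_perc, w1, w2)  # Step 1 (perception K/V)
print(q_pred.shape)  # (6, 128)
```

Step 2 would call `update_query` a second time with the voxel-environment Key and Value, and the resulting `q_pred` would feed both the action-prediction output head (Step 3) and the next network as K_pred/V_pred (Step 4).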
After the second-type Key and Value are input into the driving-behavior planning Transformer network, the corresponding driving-behavior planning task of the autonomous vehicle is completed through the driving-behavior planning output head.
In this embodiment, the driving-behavior planning Transformer neural network uses the action-prediction-related K_pred and V_pred to learn and update the driving-behavior planning Query Q_plan; the updated Q_plan is used to complete the corresponding driving-behavior planning task of the autonomous vehicle.
Tasks of driving behavior planning include, but are not limited to, maintaining straight travel, turning left, turning right, accelerating, decelerating and stopping.
The specific implementation process is as follows:
step 5: the driving behavior planning transducer neural networkAction prediction related +.>Planning a Query for driving behavior (denoted +.>) Learning and updating is performed, the process still being based on +.>A similar calculation is as follows:
step 6: the +.2 updated by the above step>Send to driving behavior planning output head->In for outputting the driving behavior planning prediction result +.>:
The output result includes, but is not limited to, specific driving behaviors such as maintaining straight travel, turning left, turning right, accelerating, decelerating and parking.
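Such a planning output head can be sketched as a classifier over the discrete behavior set listed above; the linear projection and the label names are illustrative assumptions, since the patent does not specify the head's form.

```python
import numpy as np

BEHAVIORS = ["keep_straight", "turn_left", "turn_right",
             "accelerate", "decelerate", "stop"]

def behavior_planning_head(query, w, b):
    """Hypothetical output head: projects the updated driving behavior
    planning Query to logits over the behavior set and picks the argmax."""
    logits = query @ w + b
    return BEHAVIORS[int(np.argmax(logits))]

rng = np.random.default_rng(2)
d = 128
query = rng.normal(size=d)                  # updated driving behavior planning Query
w = rng.normal(size=(d, len(BEHAVIORS))) * 0.01
b = np.zeros(len(BEHAVIORS))
behavior = behavior_planning_head(query, w, b)
print(behavior in BEHAVIORS)  # True
```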
Referring to fig. 3, a multi-modality based autopilot task processing system includes:
the feature extraction module is used for acquiring the modal data acquired by the plurality of perception sensors and extracting voxel features of the modal data;
the feature fusion module is used for carrying out feature fusion after unifying the feature dimension and the resolution of the extracted voxel features to obtain first type voxel features which are adaptively fused with different modes;
the task processing module acquires an automatic driving perception task, inputs the first type voxel features and the perception task into a pre-established and trained automatic driving Transformer model, and completes the tasks of automatic driving vehicle perception, surrounding object motion prediction and driving behavior planning. The Transformer model specifically comprises: an automatic driving perception Transformer network, a surrounding object motion prediction Transformer network and a driving behavior planning Transformer network.
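The feature fusion module's two steps — unifying feature dimension and resolution, then adaptively fusing the modalities — could be sketched as follows. Nearest-neighbour resampling, the per-voxel softmax weighting, and all shapes here are assumptions standing in for the patent's unspecified fusion scheme.

```python
import numpy as np

def unify(feat, target_shape, proj):
    """Resample a voxel grid to `target_shape` (nearest neighbour) and
    project its channel dimension through `proj` to a common width."""
    x, y, z, _ = feat.shape
    xi = np.linspace(0, x - 1, target_shape[0]).astype(int)
    yi = np.linspace(0, y - 1, target_shape[1]).astype(int)
    zi = np.linspace(0, z - 1, target_shape[2]).astype(int)
    resampled = feat[np.ix_(xi, yi, zi)]     # unify resolution
    return resampled @ proj                  # unify feature dimension

def adaptive_fuse(feats):
    """Per-voxel softmax weights over modalities: a simple stand-in for
    the patent's adaptive fusion of different modes."""
    stacked = np.stack(feats)                # (num_modalities, X, Y, Z, C)
    scores = stacked.mean(axis=-1, keepdims=True)
    w = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return (w * stacked).sum(axis=0)

rng = np.random.default_rng(3)
target, d_model = (8, 8, 4), 32              # assumed common grid and width
cam = unify(rng.normal(size=(16, 16, 8, 64)), target,
            rng.normal(size=(64, d_model)) * 0.1)
lidar = unify(rng.normal(size=(32, 32, 16, 16)), target,
              rng.normal(size=(16, d_model)) * 0.1)
fused = adaptive_fuse([cam, lidar])          # first type voxel features
print(fused.shape)  # (8, 8, 4, 32)
```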
Specifically, as shown in fig. 3, the task processing module includes: the system comprises an automatic driving perception processing module, a surrounding object action prediction processing module, a driving behavior planning processing module and a voxel characteristic screening module;
the automatic driving perception processing module is used for acquiring an automatic driving perception task, inputting the first type voxel features and the perception task into the automatic driving perception Transformer network, completing the corresponding automatic driving vehicle perception task through the perception task output head, acquiring a perception output result, and constructing the perception-related Key and Value by utilizing the perception output result;
the voxel feature screening module is used for inputting the perception output result into the voxel feature screening device to obtain sparse interesting second type voxel features, and constructing a first type Key and Value related to voxel environments through the second type voxel features;
the peripheral object motion prediction processing module is used for inputting the perception-related Key and Value and the first type Key and Value into the peripheral object motion prediction Transformer network at the same time to obtain the second type Key and Value related to motion prediction, and then completing the corresponding task of peripheral object motion prediction of the automatic driving vehicle through the peripheral object motion prediction output head;
the driving behavior planning processing module is used for inputting the second type Key and Value into the driving behavior planning Transformer network and then completing the corresponding task of driving behavior planning of the automatic driving vehicle through the driving behavior planning output head.
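The voxel feature screening module described above could be sketched as a top-k selection over a perception-derived score (here an assumed occupancy score); both k and the scoring rule are illustrative assumptions, not the patent's specified screener.

```python
import numpy as np

def screen_voxels(voxel_feats, occupancy_scores, k=64):
    """Hypothetical screener: keep the k voxels the perception output marks
    as most likely occupied, yielding sparse second type voxel features."""
    flat_feats = voxel_feats.reshape(-1, voxel_feats.shape[-1])
    flat_scores = occupancy_scores.reshape(-1)
    idx = np.argsort(flat_scores)[-k:]       # indices of interesting voxels
    return flat_feats[idx]                   # basis for the first type Key/Value

rng = np.random.default_rng(4)
voxels = rng.normal(size=(8, 8, 4, 32))      # first type voxel features (assumed grid)
scores = rng.random(size=(8, 8, 4))          # e.g. a 3D occupancy prediction output
kv = screen_voxels(voxels, scores, k=64)
print(kv.shape)  # (64, 32): sparse interesting second type voxel features
```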
Specifically, as shown in fig. 2, the modal data collected by the plurality of sensing sensors specifically includes: an image acquired by a camera sensor, a point cloud acquired by a laser radar sensor and a point cloud acquired by a millimeter wave radar sensor;
the perception tasks of autopilot include, but are not limited to, three-dimensional object detection, three-dimensional object tracking, three-dimensional semantic segmentation, three-dimensional space occupation prediction, and online map generation.
Specifically, as shown in FIG. 2, tasks of driving behavior planning include, but are not limited to, maintaining straight travel, turning left, turning right, accelerating, decelerating and stopping.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art according to the technical scheme of the present invention and the inventive concept thereof, within the scope disclosed by the present invention, shall be covered by the protection scope of the present invention.
Claims (8)
1. The multi-mode-based automatic driving task processing method is characterized by comprising the following steps of:
s1, acquiring modal data acquired by a plurality of perception sensors, and extracting voxel characteristics of the modal data;
s2, after unifying the feature dimension and the resolution of the extracted voxel features, carrying out feature fusion to obtain first type voxel features which are adaptively fused with different modes;
s3, acquiring an automatic driving perception task, inputting the first type voxel features and the perception task into a pre-established and trained automatic driving Transformer model, and completing tasks of automatic driving vehicle perception, surrounding object motion prediction and driving behavior planning; the Transformer model specifically comprises: an automatic driving perception Transformer network, a surrounding object motion prediction Transformer network and a driving behavior planning Transformer network;
"S3" specifically includes:
acquiring an automatic driving perception task, inputting the first type voxel features and the perception task into the automatic driving perception Transformer network, completing the corresponding automatic driving vehicle perception task through the perception task output head, and acquiring a perception output result;
constructing a Key and a Value related to perception by using a perception output result;
inputting the perception output result into a voxel feature filter to obtain sparse interesting second type of voxel features, and constructing a first type Key and Value related to voxel environment through the second type of voxel features;
inputting the perception-related Key and Value and the first type Key and Value into the surrounding object motion prediction Transformer network at the same time to obtain the second type Key and Value related to motion prediction, and then completing the corresponding task of predicting the motion of surrounding objects of the automatic driving vehicle through the surrounding object motion prediction output head;
and after the second type Key and Value are input into the driving behavior planning Transformer network, completing the corresponding task of driving behavior planning of the automatic driving vehicle through the driving behavior planning output head.
2. The multi-mode-based automatic driving task processing method according to claim 1, wherein the modal data collected by the plurality of perception sensors specifically includes: an image collected by a camera sensor, a point cloud collected by a laser radar sensor, and a point cloud collected by a millimeter wave radar sensor.
3. The multi-mode-based automatic driving task processing method of claim 1, wherein the perception tasks of automatic driving include, but are not limited to, three-dimensional object detection, three-dimensional object tracking, three-dimensional semantic segmentation, three-dimensional space occupancy prediction and online map generation.
4. The method for processing the automatic driving task based on the multiple modes according to claim 1, wherein inputting the perception output result into the voxel feature filter to obtain sparse interesting second type of voxel features, and constructing the first type Key and Value related to the voxel environment through the second type of voxel features specifically comprises:
and inputting the perception output result into a voxel feature screening device, selecting interesting voxel features of the first type of voxel features through the perception output result by the voxel feature screening device, selecting sparse interesting second type of voxel features, and constructing first type Key and Value related to voxel environments through the second type of voxel features.
5. The method of claim 1, wherein the driving behavior planning tasks include, but are not limited to, keep straight, turn left, turn right, accelerate, decelerate, and park.
6. A multi-modal based autopilot task processing system comprising:
the feature extraction module is used for acquiring the modal data acquired by the plurality of perception sensors and extracting voxel features of the modal data;
the feature fusion module is used for carrying out feature fusion after unifying the feature dimension and the resolution of the extracted voxel features to obtain first type voxel features which are adaptively fused with different modes;
the task processing module acquires an automatic driving perception task, inputs the first type voxel features and the perception task into a pre-established and trained automatic driving Transformer model, and completes tasks of automatic driving vehicle perception, surrounding object motion prediction and driving behavior planning; the Transformer model specifically comprises: an automatic driving perception Transformer network, a surrounding object motion prediction Transformer network and a driving behavior planning Transformer network;
the task processing module comprises: the system comprises an automatic driving perception processing module, a surrounding object action prediction processing module, a driving behavior planning processing module and a voxel characteristic screening module;
the automatic driving perception processing module is used for acquiring an automatic driving perception task, inputting the first type voxel features and the perception task into the automatic driving perception Transformer network, completing the corresponding automatic driving vehicle perception task through the perception task output head, acquiring a perception output result, and constructing the perception-related Key and Value by utilizing the perception output result;
the voxel feature screening module is used for inputting the perception output result into the voxel feature screening device to obtain sparse interesting second type voxel features, and constructing a first type Key and Value related to voxel environments through the second type voxel features;
the peripheral object motion prediction processing module is used for inputting the perception-related Key and Value and the first type Key and Value into the peripheral object motion prediction Transformer network at the same time to obtain the second type Key and Value related to motion prediction, and then completing the corresponding task of peripheral object motion prediction of the automatic driving vehicle through the peripheral object motion prediction output head;
the driving behavior planning processing module is used for inputting the second type Key and Value into the driving behavior planning Transformer network and then completing the corresponding task of driving behavior planning of the automatic driving vehicle through the driving behavior planning output head.
7. The multi-modal based autopilot task processing system of claim 6 wherein the modal data collected by the plurality of perception sensors specifically includes: an image acquired by a camera sensor, a point cloud acquired by a laser radar sensor and a point cloud acquired by a millimeter wave radar sensor;
the perception tasks of the autopilot comprise, but are not limited to, three-dimensional target detection, three-dimensional target tracking, three-dimensional semantic segmentation, three-dimensional space occupation prediction and online map generation.
8. The multimodal automatic driving task processing system of claim 6 wherein the driving behavior planning tasks include, but are not limited to, holding straight, turning left, turning right, accelerating, decelerating, and stopping.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310945276.6A CN116665189B (en) | 2023-07-31 | 2023-07-31 | Multi-mode-based automatic driving task processing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116665189A CN116665189A (en) | 2023-08-29 |
CN116665189B true CN116665189B (en) | 2023-10-31 |
Family
ID=87710145
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310945276.6A Active CN116665189B (en) | 2023-07-31 | 2023-07-31 | Multi-mode-based automatic driving task processing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116665189B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115063539A (en) * | 2022-07-19 | 2022-09-16 | 上海人工智能创新中心 | Image dimension increasing method and three-dimensional target detection method |
CN115775378A (en) * | 2022-11-30 | 2023-03-10 | 北京航空航天大学 | Vehicle-road cooperative target detection method based on multi-sensor fusion |
CN116229224A (en) * | 2023-01-18 | 2023-06-06 | 重庆长安汽车股份有限公司 | Fusion perception method and device, electronic equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113674421B (en) * | 2021-08-25 | 2023-10-13 | 北京百度网讯科技有限公司 | 3D target detection method, model training method, related device and electronic equipment |
JP2023073231A (en) * | 2021-11-15 | 2023-05-25 | 三星電子株式会社 | Method and device for image processing |
Non-Patent Citations (5)
Title |
---|
"Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection"; Xin Li et al.; arXiv; entire document *
"ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning"; Shengchao Hu et al.; arXiv; Section 3, Fig. 2 *
"Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection in Autonomous Driving"; Zhenxun Yuan et al.; arXiv; entire document *
"Three-Dimensional Dynamic Object Detection Algorithm Based on Voxel-Point Cloud Fusion"; Zhou Feng et al.; Journal of Computer-Aided Design & Computer Graphics; Vol. 34, No. 6; entire document *
"End-to-End Image Generation Method for Autonomous Driving Based on Improved GAN"; Sun Xiongfeng et al.; Journal of Transport Information and Safety; Vol. 39, No. 5; entire document *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||