CN113139446B - End-to-end automatic driving behavior decision method, system and terminal equipment - Google Patents

End-to-end automatic driving behavior decision method, system and terminal equipment

Info

Publication number
CN113139446B
CN113139446B (granted from application CN202110391084.6A)
Authority
CN
China
Prior art keywords
attention
time
information
feature
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110391084.6A
Other languages
Chinese (zh)
Other versions
CN113139446A (en)
Inventor
刘占文
赵祥模
樊星
齐明远
范颂华
李超
张嘉颖
高涛
王润民
林杉
员惠莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changan University
Original Assignee
Changan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changan University filed Critical Changan University
Priority to CN202110391084.6A
Publication of CN113139446A
Application granted
Publication of CN113139446B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an end-to-end automatic driving behavior decision method, system and terminal device, belonging to the field of automatic driving. Scene spatial position features are extracted by a convolutional neural network embedded with an attention mechanism, a spatial feature extraction network is constructed, and the spatial features and semantic information of scene targets are accurately analyzed; scene temporal context features are captured by a long short-term memory (LSTM) encoder-decoder structure embedded with a time attention mechanism, a time feature extraction network is constructed, and the time-sequence information of the scene is understood and memorized. The method combines scene spatial information with time-sequence information and, through the attention mechanism, assigns higher weights to key visual regions and motion sequences, so that the prediction process better matches the driving habits of a human driver and the prediction result is more accurate.

Description

End-to-end automatic driving behavior decision method, system and terminal equipment
Technical Field
The invention belongs to the field of automatic driving and relates to an automatic driving behavior decision method, and in particular to an end-to-end automatic driving behavior decision method, system and terminal device.
Background
Automatic driving decision making is an important research direction in the fields of artificial intelligence and automatic driving, and the effectiveness of the decisions largely determines the performance of the whole automatic driving system. However, traditional rule-based decision methods do not match human driving behavior, so behavior decision making has become a classical problem in the automatic driving field. A driving decision depends not only on the driving scene in which the vehicle is currently located but also on the vehicle's historical motion, so the joint influence of the current driving scene and the historical motion state on the vehicle should be considered. The human visual system selectively attends to the primary content of an observed scene and ignores secondary content; likewise, a driver should notice things that strongly influence driving decisions, such as vehicles, pedestrians and traffic lights, while ignoring features that are unimportant for driving, such as the sky and trees. Therefore, an end-to-end automatic driving decision model based on an attention mechanism and spatio-temporal features has become a new research hotspot.
Automatic driving decision methods fall mainly into rule-based methods and end-to-end methods. Rule-based methods split the decision process into separate task modules, interpret and classify the vehicle state according to the traffic situation, and generate reasonable real-time driving actions by combining a manually constructed rule base with prior knowledge, thereby controlling the automatic driving vehicle. End-to-end learning models unify driving subtasks such as scene perception, target recognition, target tracking and planning/decision making within a single deep neural network and map perception information directly to control quantities such as throttle, steering wheel and braking. Cognition and decision making are thus unified without splitting into modules, the cumbersome steps of feature engineering are simplified, and the structure of the automatic driving system becomes simpler and more efficient. Existing end-to-end decision methods, however, do not consider the influence of the vehicle's historical motion state on the decision, and also suffer from low decision accuracy and low efficiency.
Disclosure of Invention
In order to overcome the defects of the prior art, namely the low decision accuracy and low efficiency that result from ignoring the influence of the vehicle's historical motion state on the decision, the invention aims to provide an end-to-end automatic driving behavior decision method, system and terminal device.
In order to achieve the above purpose, the invention is realized by adopting the following technical scheme:
an end-to-end automatic driving behavior decision method comprises the following steps:
based on the image information, the depth information and the semantic segmentation information, acquiring the spatial characteristics of the scene through a convolutional neural network embedded with an attention mechanism;
based on the historical motion state sequence information of the vehicle, acquiring time characteristics through a memory network coding-decoding structure embedded with an attention mechanism;
and connecting the spatial features with the time features, establishing an end-to-end automatic driving behavior decision model, and acquiring an end-to-end automatic driving behavior prediction result.
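For illustration only, the following minimal sketch (assuming a PyTorch implementation) shows one way the spatial feature branch and the time feature branch described in the three steps above could be connected and mapped to a behavior prediction; the layer sizes, module contents and the two-dimensional output head (e.g. speed and steering angle) are assumptions for illustration, not the configuration claimed by the patent.

```python
# Hypothetical sketch: a spatial CNN branch and a temporal LSTM branch are
# concatenated and mapped to driving control quantities by fully connected layers.
# The attention modules and multi-modal inputs of the patent are omitted here.
import torch
import torch.nn as nn

class DrivingDecisionModel(nn.Module):
    def __init__(self, spatial_dim=512, temporal_dim=128, hidden_dim=256):
        super().__init__()
        # Stand-in for the attention-augmented spatial feature extraction network.
        self.spatial_net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, spatial_dim))
        # Stand-in for the time feature extraction network over motion states.
        self.temporal_net = nn.LSTM(input_size=2, hidden_size=temporal_dim, batch_first=True)
        # Two fully connected layers map the fused features to the prediction.
        self.head = nn.Sequential(
            nn.Linear(spatial_dim + temporal_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2))

    def forward(self, image, motion_seq):
        s = self.spatial_net(image)                # scene spatial feature vector
        _, (h, _) = self.temporal_net(motion_seq)  # temporal context feature
        fused = torch.cat([s, h[-1]], dim=1)       # connect spatial and time features
        return self.head(fused)                    # predicted control quantities
```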
Preferably, the extracting process of the spatial feature includes:
inputting the packed image and depth image into a backbone network to obtain image information and depth information;
passing the image information through the backbone network and pyramid pooling to obtain semantic segmentation information;
inputting the image information, the depth information and the semantic segmentation information into a connecting layer, generating a space feature vector with a fixed length, and acquiring the space feature of a scene;
wherein the attention mechanism is embedded in the backbone network.
Further preferably, the image information and the depth information are acquired by:
capturing an input feature map by using three spatial attention branch networks, establishing interaction between a spatial dimension and a channel dimension, and acquiring a spatial attention map;
training a backbone network to obtain different sparse masks; pruning is carried out on the main network by utilizing different sparse masks, two different sub-networks are generated, and image information and depth information are acquired based on the different sub-networks.
Further preferably, the process of acquiring the spatial attention map is specifically:
establishing interaction between a spatial dimension and the channel dimension in each of the three spatial attention branch networks by rotating the input feature map; carrying out maximum pooling and average pooling on the rotated feature maps respectively; cascading the average-pooled feature map with the maximum-pooled feature map and inputting the result to two fully connected layers for encoding; generating attention weights through a sigmoid activation function and combining them with the original input feature map to obtain three attention maps; and averaging the three attention maps to obtain the spatial attention map.
Preferably, the specific operation of pruning is:
randomly initializing a base network and a mask matrix, and simultaneously setting a pruning threshold value;
training the base network and the mask matrix together in stages, and iteratively updating the mask matrix to obtain two different sparse mask matrices with shared parameters;
and obtaining different sparse masks of the two sub-networks through training, and pruning the main network by using the different sparse masks.
Preferably, the extracting process of the time feature includes:
the method comprises the steps of utilizing an encoder to understand, summarize and memorize historical motion state sequence information of a vehicle to obtain a historical motion state feature vector of the vehicle;
constructing a time attention mechanism by using the time feature extraction network;
and carrying out time sequence generation and feature extraction on the vehicle historical motion state feature vector processed by the time attention mechanism with a decoder, updating the hidden state at the current moment, and obtaining the time features.
Preferably, the time attention mechanism is constructed based on a time attention module, and the specific operations include:
the multi-layer perceptron in the time attention module obtains energy items according to the hidden state of the encoder and the hidden state of the decoder;
the Softmax function in the time attention module obtains, from the energy terms, the attention coefficients between the encoder abstract features and the decoder at each step;
the time attention module takes the attention coefficient as a weight, and performs weighted summation on the hidden states of all moments to obtain the context vector of the decoder at each moment.
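As an illustrative sketch only (a PyTorch implementation and specific dimensions are assumed; the patent does not prescribe them), the three operations above can be expressed as follows:

```python
# Hypothetical sketch of the time attention module: a multi-layer perceptron scores
# each encoder hidden state against the previous decoder hidden state (energy terms),
# a softmax turns the scores into attention coefficients, and the context vector is
# the attention-weighted sum of the encoder hidden states.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=64):
        super().__init__()
        self.W = nn.Linear(enc_dim + dec_dim, attn_dim)  # input-to-hidden weights and bias
        self.w = nn.Linear(attn_dim, 1, bias=False)      # hidden-to-output weights

    def forward(self, enc_hidden, dec_prev_state):
        # enc_hidden: (batch, t, enc_dim); dec_prev_state: (batch, dec_dim)
        t = enc_hidden.size(1)
        dec_rep = dec_prev_state.unsqueeze(1).expand(-1, t, -1)
        energy = self.w(torch.tanh(self.W(torch.cat([dec_rep, enc_hidden], dim=-1))))  # e_ji
        alpha = torch.softmax(energy.squeeze(-1), dim=1)                # coefficients a_ji
        context = torch.bmm(alpha.unsqueeze(1), enc_hidden).squeeze(1)  # context vector m_j
        return context, alpha
```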
An end-to-end automatic driving behavior decision system, comprising:
the spatial attention module is used for acquiring spatial features of the scene based on the image information, the depth information and the semantic segmentation information;
the time attention module is used for acquiring time characteristics based on the vehicle historical motion state sequence information and constructing a time attention mechanism based on the time characteristics;
the model building module is respectively interacted with the space attention module and the time attention module, and is used for building an end-to-end automatic driving behavior decision model based on the space characteristics and the time attention mechanism, and predicting end-to-end automatic driving behavior results through the end-to-end automatic driving behavior decision model.
Preferably, the spatial attention module comprises a ResNet feature extractor, a pyramid pooling unit, and three spatial attention branch networks;
the ResNet feature extractor is used for acquiring feature information in the image information;
the pyramid pooling unit is used for carrying out maximum pooling and average pooling on the feature information acquired by the ResNet feature extractor;
the spatial attention branch network is used for establishing interaction of a spatial dimension and a channel dimension based on the input feature diagram;
the time attention module comprises an encoder, a decoder and a multi-layer perceptron;
the encoder is used for understanding, summarizing and memorizing the historical motion state sequence of the vehicle;
the decoder is used for carrying out time sequence generation and feature extraction and updating the hidden state of the current moment;
the multi-layer perceptron is operative to derive energy terms based on hidden states of the encoder and decoder.
A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the end-to-end automatic driving behavior decision method when executing the computer program.
Compared with the prior art, the invention has the following beneficial effects:
the invention discloses an end-to-end automatic driving decision method, which is characterized in that a convolutional neural network embedded with an attention mechanism is used for extracting scene space position characteristics, a space characteristic extraction network is constructed, and scene target space characteristics and semantic information are accurately analyzed; capturing scene time context characteristics through a long-term memory network coding-decoding structure embedded with a time attention mechanism, constructing a time characteristic extraction network, and understanding memory scene time sequence information; the method combines scene space information and time sequence information, and simultaneously gives higher weight to the key visual area and the motion sequence by combining the attention mechanism, so that the prediction process is more in line with the driving habit of a human driver, and the prediction result is more accurate.
Furthermore, in the end-to-end automatic driving decision process, the visual information in the RGB image alone is not sufficient to fully perceive objects such as vehicles, pedestrians and obstacles in the scene; depth information contains more position and contour features of scene objects, and semantic segmentation information provides high-level semantic understanding of the driving scene.
Further, the spatial position features and the time context features of the scene are extracted by a convolutional network and an LSTM respectively and then fused; at the same time, semantic segmentation information is used during spatial feature extraction to improve the prediction precision of the model, and the decision quantities for vehicle speed and steering angle are output. Although multi-modal input improves the prediction effect of the model, it does not by itself attend to key objects in the scene, such as pedestrians, lane lines and traffic signs.
Further, in order to improve the ability to extract spatio-temporal saliency features of the driving scene, attention modules are introduced so that the end-to-end automatic driving behavior decision method focuses on the detailed information of the current task's target regions; the system can thus improve performance and efficiency under limited resources and reduce unnecessary waste of resources.
Further, because the multi-modal input and the embedded attention modules enlarge the network model, the network is pruned through sparse masks to reduce the complexity of the model.
The invention also discloses an end-to-end automatic driving behavior decision system and a terminal device that implement the above method.
drawings
FIG. 1 is an overall architecture of an end-to-end autopilot decision system of the present invention;
FIG. 2 is a sparse mask matrix pruning training process in the method of the present invention;
FIG. 3 is a spatial attention module in the present invention;
FIG. 4 is an LSTM encoding-decoding structure in the present invention;
FIG. 5 is a training loss comparison between the system of the present invention and the MM-STConv model;
FIG. 6 is a prediction accuracy comparison between the system of the present invention and the MM-STConv model;
FIG. 7 is a speed prediction curve of the system of the present invention over 100 s (1000 frames) of consecutive images in the dataset;
FIG. 8 is a steering angle prediction curve of the system of the present invention over 100 s (1000 frames) of consecutive images in the dataset.
Detailed Description
The invention is described in further detail below with reference to the attached drawing figures:
example 1
The end-to-end automatic driving behavior decision method based on the attention mechanism and spatio-temporal features comprises the following steps:
step 1: the spatial feature extraction network describes scene spatial location features using RGB image information, depth information, and semantic segmentation information.
Step 11: the packed RGB image and depth image are input into a backbone network embedded with an attention mechanism to extract RGB image features and depth features.
Step 12, the RGB image acquires semantic segmentation information and perception context information through a ResNet network and pyramid pooling module embedded with an attention mechanism;
and 13, inputting the fused three kinds of characteristic information into a full-connection layer to generate a space characteristic vector with a fixed length.
Step 2: the temporal feature extraction network extracts temporal context features using the vehicle historical motion state sequence information.
Step 3: the features obtained by the spatial feature extraction network are connected with the features obtained by the time feature extraction network, and the final prediction result is obtained through two fully connected layers.
In step 11, inputting the RGB image and the depth image into a backbone network embedded with an attention mechanism to extract RGB image features and depth features, which specifically includes:
step 111: by adopting three spatial attention branch networks to capture interaction between the spatial dimension and the channel dimension of the input feature map, important position features of the traffic scene are highlighted to suppress irrelevant scene features.
Step 112: and pruning the backbone network through different sparse masks obtained by training the backbone network, and generating two different sub-networks for extracting RGB features, depth features and semantic features.
The step 111 specifically includes steps A-D:
A. In the first branch of the attention module, interaction between the spatial dimension H and the channel dimension C is established by rotating the input feature map; maximum pooling and average pooling are applied to the rotated feature map respectively, the results are input into two fully connected (FC) layers for encoding, and finally attention weights are generated through a sigmoid activation function and multiplied with the original input feature map to obtain an attention map;
B. In the second branch of the attention module, interaction between the spatial dimension W and the channel dimension C is established by rotating the input feature map; maximum pooling and average pooling are applied to the rotated feature map respectively, the results are input into two fully connected (FC) layers for encoding, and finally attention weights are generated through a sigmoid activation function and multiplied with the original input feature map to obtain an attention map;
C. In the third branch of the attention module, maximum pooling and average pooling are applied to the input feature map; the average-pooled feature map and the maximum-pooled feature map are cascaded, attention weights are generated through a sigmoid activation function, and the weights are multiplied element-wise with the input feature map to obtain a weighted attention map;
D. The attention maps obtained by the three branches of the attention module are added element-wise and averaged to obtain the final spatial attention map.
The step 112 specifically includes steps A-C:
A. randomly initializing a base network and a mask matrix, and simultaneously setting a pruning threshold value;
B. the base network and the mask matrix are trained together in stages, and the mask matrix is iteratively updated to obtain two different sparse mask matrices with shared parameters;
C. the different sparse masks of the two sub-networks in the spatial feature extraction network are obtained through training and used to prune the base network.
In step 12, the RGB image passes through a ResNet network and pyramid pooling module embedded with an attention mechanism to obtain semantic segmentation information, specifically as follows:
The RGB image passes through a ResNet feature extractor embedded with the spatial attention module and a pyramid pooling module to obtain semantic segmentation features that fuse global information and multi-scale context information, thereby yielding a spatial attention map and, further, high-level semantic information.
The step 2 specifically comprises the following steps:
step 21, the LSTM encoder is used for understanding, summarizing and memorizing the historical motion state sequence of the vehicle;
and 22, constructing a time attention mechanism in the time feature extraction network, modeling the relation between the historical speed state sequences, and giving more weight to important time context features.
Step 23, the decoder performs time series generation and feature extraction to update the hidden state at the current time.
The LSTM encoder in step 21 understands, summarizes and memorizes the vehicle history continuous motion state sequence, specifically as follows:
the LSTM encoder carries out T times of recursion update on the vehicle history continuous motion state sequence with the length of T to obtain a time context coding vector c with a fixed length t
The step 22 builds a time attention mechanism in the time feature extraction network, specifically as follows:
step 221, the multi-layer sensor in the time attention module is based on the encoder hiding state,Hidden state of decoder, resulting in energy term e ji
Step 222, softmax function in the time attention module is based on e ji Obtaining abstract features of the encoder at the ith step and attention coefficient a of the decoder at the jth step ji
Step 223, the time attention module applies attention coefficient a ji As the weight, the hidden states of all moments are weighted and summed to obtain the context vector m of the decoder at the j-th step j
The decoder performs time sequence generation and feature extraction in step 23, and updates the hidden state at the current time, specifically as follows:
the decoder outputs the vector y according to the previous time j-1 Hidden state s j-1 And context vector m j Hidden state s for step j j Update according to s j 、y j-1 And m j Updating the historical motion output vector decoded in the j step.
Example 2
As shown in fig. 1, the end-to-end automatic driving behavior decision method based on the attention mechanism and the space-time characteristics specifically comprises the following steps:
step 1: the space feature extraction network utilizes RGB image information, depth information and semantic segmentation information to describe scene space position features, and utilizes a sparse mask pruning backbone network to generate two sub-networks with shared parameters for extracting the image space position features and the semantic features;
step 2: the temporal feature extraction network extracts temporal context features using the vehicle historical motion speed sequence information.
Step 3: the features obtained by the spatial network are connected with the features obtained by the time-sequence network to obtain the final prediction result.
The step 1 comprises the steps of 11, 12 and 13:
and step 11, inputting the packed RGB image and depth image into a ResNet network embedded with an attention mechanism to extract RGB image features and depth features.
Step 12, the RGB image acquires semantic segmentation information and perception context information through a ResNet network and pyramid pooling module embedded with an attention mechanism;
and 13, generating a feature map with the same size as the space feature extraction feature map by passing the semantic segmentation feature map through a convolution layer and two pooling layers, and connecting and inputting the feature map and the pooling layer to a full-connection layer to generate a space feature vector with a fixed length.
In the step 11, the RGB image and the depth image are input into a res net network embedded with an attention mechanism to extract RGB image features and depth features, which specifically comprises the steps 111 to 113:
step 111, extracting feature information useful for decision tasks by embedding a spatial attention module in each bottleneck block of the res net, as shown in fig. 3, the attention module suppresses irrelevant scene features by establishing dependency between spatial channels to highlight important position features of traffic scenes; the method comprises the following steps:
A. In the first branch of the attention module, the input feature map (of shape C×H×W) is rotated 90 degrees counterclockwise along the H dimension to establish interaction between the H dimension and the C dimension; the rotated feature map F_r1(x) has shape W×H×C. Maximum pooling and average pooling are then applied to the rotated feature map respectively, and the resulting vectors are input into two fully connected (FC) layers to encode the relationships between channels. Finally, the two encoded feature vectors are added element-wise, attention weights are generated through a sigmoid function and multiplied with the original input feature map, and the output is rotated 90 degrees clockwise along the H dimension so that its shape is consistent with the input.
B. In the second branch of the attention module, the input feature map is rotated 90 degrees counterclockwise along the W dimension to establish interaction between the W dimension and the C dimension; the rotated feature map F_r2(x) has shape H×W×C. Maximum pooling and average pooling are then applied to the rotated feature map respectively, and the resulting vectors are input into two fully connected (FC) layers to encode the relationships between channels. Finally, the two encoded feature vectors are added element-wise, attention weights are generated through a sigmoid function and multiplied with the original input feature map, and the output is rotated 90 degrees clockwise along the W dimension so that its shape is consistent with the input.
C. In the third branch of the attention module, maximum pooling and average pooling are applied to the input feature map, and the average-pooled feature map of shape 1×H×W and the maximum-pooled feature map of shape 1×H×W are cascaded into a 2×H×W feature tensor. This tensor first passes through a standard convolution layer with kernel size K×K and a batch normalization layer, then attention weights are generated through a sigmoid activation function and multiplied element-wise with the input feature map to obtain the weighted attention map.
D. The attention maps obtained by the three branches of the attention module are added element-wise and averaged to obtain the final spatial attention map.
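A minimal sketch of the three-branch spatial attention module is given below, assuming a PyTorch implementation; the FC reduction ratio, the kernel size K, and the choice of applying the weights to the rotated feature map before rotating back are illustrative assumptions rather than the exact patented configuration.

```python
# Hypothetical sketch: two branches rotate the feature map so that a spatial dimension
# (H or W) is coupled with the channel dimension C, the third branch pools over the
# channel dimension; the three attention-weighted maps are averaged element-wise.
import torch
import torch.nn as nn

class RotatedBranch(nn.Module):
    """Rotate x so that the given dims are exchanged, pool, encode with two FC layers, re-weight."""
    def __init__(self, size, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(size, size // reduction), nn.ReLU(),
                                nn.Linear(size // reduction, size))

    def forward(self, x, dims):
        xr = torch.rot90(x, 1, dims)                   # rotate 90 degrees counterclockwise
        avg = xr.mean(dim=(2, 3))                      # average pooling
        mx = xr.amax(dim=(2, 3))                       # maximum pooling
        w = torch.sigmoid(self.fc(avg) + self.fc(mx))  # attention weights via sigmoid
        xw = xr * w[:, :, None, None]                  # re-weight the rotated feature map
        return torch.rot90(xw, -1, dims)               # rotate back to the input shape

class SpatialAttention(nn.Module):
    def __init__(self, height, width, k=7):
        super().__init__()
        self.branch_h = RotatedBranch(height)          # couples H with C
        self.branch_w = RotatedBranch(width)           # couples W with C
        self.branch_c = nn.Sequential(                 # channel-pooled third branch
            nn.Conv2d(2, 1, k, padding=k // 2), nn.BatchNorm2d(1), nn.Sigmoid())

    def forward(self, x):                              # x: (batch, C, H, W)
        a1 = self.branch_h(x, dims=(1, 2))             # rotate along the H dimension
        a2 = self.branch_w(x, dims=(1, 3))             # rotate along the W dimension
        pooled = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        a3 = x * self.branch_c(pooled)                 # weighted attention map
        return (a1 + a2 + a3) / 3.0                    # element-wise average of the three
```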
In step 112, two different sub-networks are generated for extracting RGB features, depth features and semantic features by pruning the backbone network with different sparse masks obtained by training the backbone network. As shown in fig. 2, the specific steps are as follows:
A. The base network is randomly initialized, the mask matrix is set to all ones, and a pruning threshold is set.
B. The base network and the mask matrix are trained together in stages; the trained weights are compared with the threshold to iteratively update the mask matrix, yielding two different sparse mask matrices that share common parameters.
C. The different sparse masks of the two sub-networks in the spatial feature extraction network are obtained through training, and the elements of each ResNet sub-network that contribute little to the task are pruned.
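The following is a hedged sketch of threshold-based sparse-mask pruning consistent with steps A-C above; the threshold value, the magnitude-based selection rule and the update schedule are assumptions made for illustration only.

```python
# Hypothetical sketch: a binary mask per weight tensor is updated iteratively during
# training, zeroing weights whose magnitude falls below the pruning threshold. Two such
# masks over a shared backbone would yield the two task-specific sub-networks.
import torch
import torch.nn as nn

def update_masks(model: nn.Module, masks: dict, threshold: float = 1e-2) -> dict:
    """Recompute binary masks from current weight magnitudes and re-apply them."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if "weight" not in name:
                continue
            masks[name] = (param.abs() > threshold).float()  # keep only significant weights
            param.mul_(masks[name])                          # prune the network in place
    return masks

# Usage: masks start as all ones and are refined every few epochs during joint training.
backbone = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
masks = {n: torch.ones_like(p) for n, p in backbone.named_parameters() if "weight" in n}
masks = update_masks(backbone, masks, threshold=1e-2)
```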
In step 12, the RGB image passes through a ResNet network and pyramid pooling module embedded with an attention mechanism to obtain semantic segmentation information, specifically as follows:
The RGB image passes through a ResNet feature extractor embedded with the spatial attention module and a pyramid pooling module to obtain semantic segmentation features that fuse global information and multi-scale context information, thereby yielding a spatial attention map and, further, high-level semantic information.
The step 2 specifically comprises the following steps:
step 21, the LSTM encoder is used for understanding, summarizing and memorizing the historical continuous motion state sequence of the vehicle;
and 22, constructing a time attention mechanism in the time feature extraction network, modeling the relation between the historical speed state sequences, and giving more weight to important time context features.
Step 23, the decoder performs time sequence generation and feature extraction, and updates the hidden state at the current moment;
as shown in fig. 4, the LSTM encoder in step 21 understands, summarizes and memorizes the vehicle history continuous motion state sequence as follows:
The LSTM encoder performs t recursive updates on the vehicle's historical continuous motion state sequence s_1, ..., s_t of length t to obtain a fixed-length time context encoding vector c_t; this vector contains the encoder's understanding, summary and memory of the vehicle's historical continuous motion state sequence.
the step 22 builds a time attention mechanism in the time feature extraction network, specifically as follows:
step 221, the multi-layer perceptron in the time attention branching network based on the encoder hidden state h of step i i Hidden state s of the decoder of step j-1 j-1 Obtaining energy term e ji =w T tanh(W[s j-1 ,h i ]+b), where W and b are the input layer to hidden layer weights and bias vectors, and W is the hidden layer to output layer weight vector.
Step 222: the Softmax function in the time attention branch network obtains from e_ji the attention coefficient a_ji between the encoder abstract feature at step i and the decoder at step j, i.e. a_ji = exp(e_ji) / Σ_(k=1..t) exp(e_jk), where t is the length of the input sequence.
Step 223: the time attention branch network takes the attention coefficients a_ji as weights and performs a weighted sum over the hidden states at all moments to obtain the context vector m_j of the decoder at step j, i.e. m_j = Σ_(i=1..t) a_ji·h_i.
The decoder performs time sequence generation and feature extraction in step 23, and updates the hidden state at the current time, specifically as follows:
the decoder outputs the vector y according to the previous time j-1 Hidden state s j-1 And context vector m j Hidden state s for step j j Update according to s j 、y j-1 And m j Updating the historical motion output vector decoded in the j step.
Effect verification:
in order to verify the effectiveness of the method, a data set generated and marked by an automatic driving simulation test platform is adopted, 8112 images are selected for training, and the rest 3476 images are used for carrying out algorithm verification on the test images.
Training errors of the method of the invention and of the MM-STConv behavior decision model are compared, with the result shown in FIG. 5. The training loss curves of both models gradually decrease as the training period increases; the loss curve of the end-to-end automatic driving decision method based on the attention mechanism and spatio-temporal features lies below that of the MM-STConv behavior decision model overall and decreases faster. At the same time, the training loss curve of the model incorporating the attention mechanism shows less jitter than that of the MM-STConv behavior decision model. Compared with the MM-STConv behavior decision model, the end-to-end automatic driving behavior decision method based on the attention mechanism and spatio-temporal features therefore trains more stably and efficiently and converges faster.
The prediction accuracy of the method of the invention is compared with that of the MM-STConv behavior decision model, with the result shown in FIG. 6. The prediction accuracy curves of both methods gradually rise as the training period increases; the accuracy curve of the method based on the attention mechanism and spatio-temporal features lies above that of the MM-STConv behavior decision model overall and rises faster. Owing to the introduction of the attention modules, the end-to-end automatic driving behavior decision method based on the attention mechanism and spatio-temporal features achieves better model performance and more stable prediction results than the MM-STConv behavior decision model.
Speed prediction and steering angle prediction are performed with the system, and the results are shown in FIG. 7 and FIG. 8. The speed and steering angle prediction curves of the method are close to the ground-truth reference curves, fit them well, and show little jitter, so the prediction is stable.
Example 3
The method of the present invention, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment through a computer program that instructs related hardware; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, it implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, and so on. Computer-readable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. It should be noted that the content contained in the computer-readable medium may be adjusted according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunication signals. The computer storage medium may be any available medium or data storage device that can be accessed by a computer, including, but not limited to, magnetic storage (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MO), etc.), optical storage (e.g., CD, DVD, BD, HVD, etc.), and semiconductor storage (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), solid state disks (SSD)), etc.
Example 4
In an exemplary embodiment, a computer device is also provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the end-to-end automatic driving behavior decision method when executing the computer program. The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (4)

1. An end-to-end automatic driving behavior decision method is characterized by comprising the following steps:
based on the image information, the depth information and the semantic segmentation information, acquiring the spatial characteristics of the scene through a convolutional neural network embedded with an attention mechanism;
based on the historical motion state sequence information of the vehicle, acquiring time characteristics through a memory network coding-decoding structure embedded with an attention mechanism;
the spatial features are connected with the time features, an end-to-end automatic driving behavior decision model is established, and an end-to-end automatic driving behavior prediction result is obtained;
the extraction process of the spatial features comprises the following steps:
inputting the packed image and depth image into a backbone network to obtain image information and depth information;
passing the image information through the backbone network and pyramid pooling to obtain semantic segmentation information;
inputting the image information, the depth information and the semantic segmentation information into a connecting layer, generating a space feature vector with a fixed length, and acquiring the space feature of a scene;
wherein, the attention mechanism is embedded in the backbone network;
the image information and depth information are acquired by the following steps:
capturing an input feature map by using three spatial attention branch networks, establishing interaction between a spatial dimension and a channel dimension, and acquiring a spatial attention map;
training a backbone network to obtain different sparse masks; pruning a backbone network by using different sparse masks to generate two different sub-networks, and acquiring image information and depth information based on the different sub-networks;
the acquisition process of the spatial attention map specifically includes:
establishing interaction between a spatial dimension and the channel dimension in each of the three spatial attention branch networks by rotating the input feature map; carrying out maximum pooling and average pooling on the rotated feature maps respectively; cascading the average-pooled feature map with the maximum-pooled feature map and inputting the result to two fully connected layers for encoding; generating attention weights through a sigmoid activation function and combining them with the original input feature map to obtain three attention maps; and averaging the three attention maps to obtain the spatial attention map;
the specific operation of pruning is as follows:
randomly initializing a base network and a mask matrix, and simultaneously setting a pruning threshold value;
training the base network and the mask matrix together in stages, and iteratively updating the mask matrix to obtain two different sparse mask matrices with shared parameters;
obtaining different sparse masks of two sub-networks through training, and pruning a main network by using the different sparse masks;
the extraction process of the time features comprises the following steps:
the method comprises the steps of utilizing an encoder to understand, summarize and memorize historical motion state sequence information of a vehicle to obtain a historical motion state feature vector of the vehicle;
constructing a time attention mechanism by using the time feature extraction network;
the decoder is utilized to carry out time sequence generation and feature extraction on the vehicle historical motion state feature vector after the time attention mechanism, the hidden state at the current moment is updated, and the time feature is obtained;
the time attention mechanism is constructed based on a time attention module, and specific operations comprise:
the multi-layer perceptron in the time attention module obtains energy items according to the hidden state of the encoder and the hidden state of the decoder;
the Softmax function in the time attention module obtains, from the energy terms, the attention coefficients between the encoder abstract features and the decoder at each step;
the time attention module takes the attention coefficient as a weight, and performs weighted summation on the hidden states of all moments to obtain the context vector of the decoder at each moment.
2. An end-to-end automatic driving behavior decision system in accordance with the method of claim 1, comprising:
the spatial attention module is used for acquiring spatial features of the scene based on the image information, the depth information and the semantic segmentation information;
the time attention module is used for acquiring time characteristics based on the vehicle historical motion state sequence information and constructing a time attention mechanism based on the time characteristics;
the model building module is respectively interacted with the space attention module and the time attention module, and is used for building an end-to-end automatic driving behavior decision model based on the space characteristics and the time attention mechanism, and predicting end-to-end automatic driving behavior results through the end-to-end automatic driving behavior decision model.
3. The end-to-end automatic driving behavior decision system of claim 2, wherein the spatial attention module comprises a ResNet feature extractor, a pyramid pooling unit, and three spatial attention branch networks;
the ResNet feature extractor is used for acquiring feature information in the image information;
the pyramid pooling unit is used for carrying out maximum pooling and average pooling on the feature information acquired by the ResNet feature extractor;
the spatial attention branch network is used for establishing interaction of a spatial dimension and a channel dimension based on the input feature diagram;
the time attention module comprises an encoder, a decoder and a multi-layer perceptron;
the encoder is used for understanding, summarizing and memorizing the historical motion state sequence of the vehicle;
the decoder is used for carrying out time sequence generation and feature extraction and updating the hidden state of the current moment;
the multi-layer perceptron is operative to derive energy terms based on hidden states of the encoder and decoder.
4. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the end-to-end automatic driving behavior decision method of claim 1 when the computer program is executed by the processor.
CN202110391084.6A 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment Active CN113139446B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110391084.6A CN113139446B (en) 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110391084.6A CN113139446B (en) 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment

Publications (2)

Publication Number Publication Date
CN113139446A CN113139446A (en) 2021-07-20
CN113139446B true CN113139446B (en) 2024-02-06

Family

ID=76811192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110391084.6A Active CN113139446B (en) 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment

Country Status (1)

Country Link
CN (1) CN113139446B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673412B (en) * 2021-08-17 2023-09-26 驭势(上海)汽车科技有限公司 Method and device for identifying key target object, computer equipment and storage medium
CN114463670B (en) * 2021-12-29 2023-02-03 电子科技大学 Airport scene monitoring video change detection system and method
CN114423061B (en) * 2022-01-20 2024-05-07 重庆邮电大学 Wireless route optimization method based on attention mechanism and deep reinforcement learning
CN114777797B (en) * 2022-06-13 2022-09-30 长沙金维信息技术有限公司 High-precision map visual positioning method for automatic driving and automatic driving method
CN115049130B (en) * 2022-06-20 2024-06-04 重庆邮电大学 Automatic driving track prediction method based on space-time pyramid

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
WO2020253965A1 (en) * 2019-06-20 2020-12-24 Toyota Motor Europe Control device, system and method for determining perceptual load of a visual and dynamic driving scene in real time
CN112418409A (en) * 2020-12-14 2021-02-26 南京信息工程大学 Method for predicting time-space sequence of convolution long-short term memory network improved by using attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10940863B2 (en) * 2018-11-01 2021-03-09 GM Global Technology Operations LLC Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
WO2020253965A1 (en) * 2019-06-20 2020-12-24 Toyota Motor Europe Control device, system and method for determining perceptual load of a visual and dynamic driving scene in real time
CN112418409A (en) * 2020-12-14 2021-02-26 南京信息工程大学 Method for predicting time-space sequence of convolution long-short term memory network improved by using attention mechanism

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
杜圣东, 李天瑞, 杨燕, 王浩, 谢鹏, 洪西进. A traffic flow prediction model based on sequence-to-sequence spatiotemporal attention learning. 计算机研究与发展 (Journal of Computer Research and Development), 2020, No. 8, full text. *
王军, 鹿姝, 李云伟. Multimodal sign language recognition fusing an attention mechanism and connectionist temporal classification. 信号处理 (Journal of Signal Processing), 2020, No. 9, full text. *
胡学敏, 童秀迟, 郭琳, 张若晗, 孔力. An end-to-end autonomous driving model based on a deep visual attention neural network. 计算机应用 (Journal of Computer Applications), 2020, No. 7, full text. *
蔡英凤, 朱南楠, 邰康盛, 刘擎超, 王海. Vehicle behavior prediction based on an attention mechanism. 江苏大学学报(自然科学版) (Journal of Jiangsu University, Natural Science Edition), 2020, No. 2, full text. *
赵祥模, 连心雨, 刘占文, 沈超, 董鸣. An end-to-end autonomous driving behavior decision model based on MM-STConv. 中国公路学报 (China Journal of Highway and Transport), 2020, No. 3, full text. *

Also Published As

Publication number Publication date
CN113139446A (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN113139446B (en) End-to-end automatic driving behavior decision method, system and terminal equipment
Cai et al. Environment-attention network for vehicle trajectory prediction
WO2021180130A1 (en) Trajectory prediction
Akan et al. Stretchbev: Stretching future instance prediction spatially and temporally
US10325371B1 (en) Method and device for segmenting image to be used for surveillance using weighted convolution filters for respective grid cells by converting modes according to classes of areas to satisfy level 4 of autonomous vehicle, and testing method and testing device using the same
CN114372116B (en) Vehicle track prediction method based on LSTM and space-time attention mechanism
CN112686281A (en) Vehicle track prediction method based on space-time attention and multi-stage LSTM information expression
CN113468978B (en) Fine granularity car body color classification method, device and equipment based on deep learning
CN110570035B (en) People flow prediction system for simultaneously modeling space-time dependency and daily flow dependency
KR20170038622A (en) Device and method to segment object from image
CN113362491A (en) Vehicle track prediction and driving behavior analysis method
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN117157678A (en) Method and system for graph-based panorama segmentation
CN115298670A (en) Method for continuously learning classifier for classifying images of client by using continuous learning server and continuous learning server using same
Pavlitskaya et al. Using mixture of expert models to gain insights into semantic segmentation
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN113051983B (en) Method for training field crop disease recognition model and field crop disease recognition
Zhao et al. End‐to‐end autonomous driving decision model joined by attention mechanism and spatiotemporal features
CN115018039A (en) Neural network distillation method, target detection method and device
Oh et al. Hcnaf: Hyper-conditioned neural autoregressive flow and its application for probabilistic occupancy map forecasting
CN116432736A (en) Neural network model optimization method and device and computing equipment
CN116861262B (en) Perception model training method and device, electronic equipment and storage medium
Du et al. Adaptive visual interaction based multi-target future state prediction for autonomous driving vehicles
CN114626500A (en) Neural network computing method and related equipment
CN117116048A (en) Knowledge-driven traffic prediction method based on knowledge representation model and graph neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant