CN113139446B - End-to-end automatic driving behavior decision method, system and terminal equipment - Google Patents

End-to-end automatic driving behavior decision method, system and terminal equipment

Info

Publication number
CN113139446B
CN113139446B (granted from application CN202110391084.6A)
Authority
CN
China
Prior art keywords
attention
time
information
feature
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110391084.6A
Other languages
Chinese (zh)
Other versions
CN113139446A (en)
Inventor
刘占文
赵祥模
樊星
齐明远
范颂华
李超
张嘉颖
高涛
王润民
林杉
员惠莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changan University
Original Assignee
Changan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changan University filed Critical Changan University
Priority to CN202110391084.6A
Publication of CN113139446A
Application granted
Publication of CN113139446B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an end-to-end automatic driving behavior decision method, system and terminal device, belonging to the field of automatic driving. Scene spatial position features are extracted by a convolutional neural network embedded with an attention mechanism, a spatial feature extraction network is constructed, and the spatial features and semantic information of scene targets are accurately analyzed; scene temporal context features are captured by a long short-term memory (LSTM) encoder-decoder structure embedded with a time attention mechanism, a time feature extraction network is constructed, and the time-sequence information of the scene is understood and memorized. The method combines scene spatial information with time-sequence information and, through the attention mechanism, assigns higher weights to key visual regions and motion sequences, so that the prediction process better matches the driving habits of a human driver and the prediction result is more accurate.

Description

End-to-end automatic driving behavior decision method, system and terminal equipment
Technical Field
The invention belongs to the field of automatic driving and relates to an automatic driving behavior decision method, and in particular to an end-to-end automatic driving behavior decision method, system and terminal device.
Background
Automatic driving decision making is an important research direction in the fields of artificial intelligence and automatic driving, and the effectiveness of the decisions largely determines the performance of the whole automatic driving system. However, traditional rule-based decision methods do not match human driving behavior, so behavior decision making has become a classical problem in the automatic driving field. A driving decision depends not only on the driving scene in which the vehicle is currently located but also on the vehicle's historical motion, so the joint influence of the current driving scene and the historical motion state on the vehicle should be considered. The human visual system selectively attends to the primary content of an observed scene and ignores secondary content; likewise, a driver should notice things that strongly influence driving decisions, such as vehicles, pedestrians and traffic lights, while ignoring features that are unimportant for driving, such as the sky and trees. Therefore, an end-to-end automatic driving decision model based on an attention mechanism and spatio-temporal features has become a new research hotspot.
Automatic driving decision methods fall mainly into rule-based methods and end-to-end methods. Rule-based methods split the decision process into separate task modules, interpret and classify the vehicle state according to the traffic situation, and generate reasonable real-time driving actions by combining a manually constructed rule base with prior knowledge, thereby controlling the automatic driving vehicle. End-to-end learning models unify driving subtasks such as scene perception, target recognition, target tracking and planning/decision making within a single deep neural network and map perception information directly to control quantities such as throttle, steering wheel and braking. Cognition and decision making are thus unified without splitting into modules, the cumbersome steps of feature engineering are simplified, and the structure of the automatic driving system becomes simpler and more efficient. Existing end-to-end decision methods, however, do not consider the influence of the vehicle's historical motion state on the decision, and also suffer from low decision accuracy and low efficiency.
Disclosure of Invention
In order to overcome the defects of the prior art, namely the low decision accuracy and low efficiency that result from ignoring the influence of the vehicle's historical motion state on the decision, the invention aims to provide an end-to-end automatic driving behavior decision method, system and terminal device.
In order to achieve the above purpose, the invention is realized by adopting the following technical scheme:
an end-to-end automatic driving behavior decision method comprises the following steps:
based on the image information, the depth information and the semantic segmentation information, acquiring the spatial characteristics of the scene through a convolutional neural network embedded with an attention mechanism;
based on the historical motion state sequence information of the vehicle, acquiring time characteristics through a memory network coding-decoding structure embedded with an attention mechanism;
and connecting the spatial features with the time features, establishing an end-to-end automatic driving behavior decision model, and acquiring an end-to-end automatic driving behavior prediction result.
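For illustration only, the following minimal sketch (assuming a PyTorch implementation) shows one way the spatial feature branch and the time feature branch described in the three steps above could be connected and mapped to a behavior prediction; the layer sizes, module contents and the two-dimensional output head (e.g. speed and steering angle) are assumptions for illustration, not the configuration claimed by the patent.

```python
# Hypothetical sketch: a spatial CNN branch and a temporal LSTM branch are
# concatenated and mapped to driving control quantities by fully connected layers.
# The attention modules and multi-modal inputs of the patent are omitted here.
import torch
import torch.nn as nn

class DrivingDecisionModel(nn.Module):
    def __init__(self, spatial_dim=512, temporal_dim=128, hidden_dim=256):
        super().__init__()
        # Stand-in for the attention-augmented spatial feature extraction network.
        self.spatial_net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, spatial_dim))
        # Stand-in for the time feature extraction network over motion states.
        self.temporal_net = nn.LSTM(input_size=2, hidden_size=temporal_dim, batch_first=True)
        # Two fully connected layers map the fused features to the prediction.
        self.head = nn.Sequential(
            nn.Linear(spatial_dim + temporal_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2))

    def forward(self, image, motion_seq):
        s = self.spatial_net(image)                # scene spatial feature vector
        _, (h, _) = self.temporal_net(motion_seq)  # temporal context feature
        fused = torch.cat([s, h[-1]], dim=1)       # connect spatial and time features
        return self.head(fused)                    # predicted control quantities
```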
Preferably, the extracting process of the spatial feature includes:
inputting the packed image and depth image into a backbone network to obtain image information and depth information;
passing the image information through the backbone network and pyramid pooling to obtain semantic segmentation information;
inputting the image information, the depth information and the semantic segmentation information into a connecting layer, generating a space feature vector with a fixed length, and acquiring the space feature of a scene;
wherein the attention mechanism is embedded in the backbone network.
Further preferably, the image information and the depth information are acquired by:
capturing an input feature map by using three spatial attention branch networks, establishing interaction between a spatial dimension and a channel dimension, and acquiring a spatial attention map;
training a backbone network to obtain different sparse masks; pruning is carried out on the main network by utilizing different sparse masks, two different sub-networks are generated, and image information and depth information are acquired based on the different sub-networks.
Further preferably, the process of acquiring the spatial attention map is specifically:
establishing interaction between a spatial dimension and the channel dimension in each of the three spatial attention branch networks by rotating the input feature map; carrying out maximum pooling and average pooling on the rotated feature maps respectively; cascading the average-pooled feature map with the maximum-pooled feature map and inputting the result to two fully connected layers for encoding; generating attention weights through a sigmoid activation function and combining them with the original input feature map to obtain three attention maps; and averaging the three attention maps to obtain the spatial attention map.
Preferably, the specific operation of pruning is:
randomly initializing a base network and a mask matrix, and simultaneously setting a pruning threshold value;
training the base network and the mask matrix together in stages, and iteratively updating the mask matrix to obtain two different sparse mask matrices with shared parameters;
and obtaining different sparse masks of the two sub-networks through training, and pruning the main network by using the different sparse masks.
Preferably, the extracting process of the time feature includes:
the method comprises the steps of utilizing an encoder to understand, summarize and memorize historical motion state sequence information of a vehicle to obtain a historical motion state feature vector of the vehicle;
constructing a time attention mechanism by using the time feature extraction network;
and carrying out time sequence generation and feature extraction on the vehicle historical motion state feature vector processed by the time attention mechanism with a decoder, updating the hidden state at the current moment, and obtaining the time features.
Preferably, the time attention mechanism is constructed based on a time attention module, and the specific operations include:
the multi-layer perceptron in the time attention module obtains energy items according to the hidden state of the encoder and the hidden state of the decoder;
the Softmax function in the time attention module obtains, from the energy terms, the attention coefficients between the encoder abstract features and the decoder at each step;
the time attention module takes the attention coefficient as a weight, and performs weighted summation on the hidden states of all moments to obtain the context vector of the decoder at each moment.
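As an illustrative sketch only (a PyTorch implementation and specific dimensions are assumed; the patent does not prescribe them), the three operations above can be expressed as follows:

```python
# Hypothetical sketch of the time attention module: a multi-layer perceptron scores
# each encoder hidden state against the previous decoder hidden state (energy terms),
# a softmax turns the scores into attention coefficients, and the context vector is
# the attention-weighted sum of the encoder hidden states.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=64):
        super().__init__()
        self.W = nn.Linear(enc_dim + dec_dim, attn_dim)  # input-to-hidden weights and bias
        self.w = nn.Linear(attn_dim, 1, bias=False)      # hidden-to-output weights

    def forward(self, enc_hidden, dec_prev_state):
        # enc_hidden: (batch, t, enc_dim); dec_prev_state: (batch, dec_dim)
        t = enc_hidden.size(1)
        dec_rep = dec_prev_state.unsqueeze(1).expand(-1, t, -1)
        energy = self.w(torch.tanh(self.W(torch.cat([dec_rep, enc_hidden], dim=-1))))  # e_ji
        alpha = torch.softmax(energy.squeeze(-1), dim=1)                # coefficients a_ji
        context = torch.bmm(alpha.unsqueeze(1), enc_hidden).squeeze(1)  # context vector m_j
        return context, alpha
```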
An end-to-end automatic driving behavior decision system, comprising:
the spatial attention module is used for acquiring spatial features of the scene based on the image information, the depth information and the semantic segmentation information;
the time attention module is used for acquiring time characteristics based on the vehicle historical motion state sequence information and constructing a time attention mechanism based on the time characteristics;
the model building module is respectively interacted with the space attention module and the time attention module, and is used for building an end-to-end automatic driving behavior decision model based on the space characteristics and the time attention mechanism, and predicting end-to-end automatic driving behavior results through the end-to-end automatic driving behavior decision model.
Preferably, the spatial attention module comprises a ResNet feature extractor, a pyramid pooling unit, and three spatial attention branch networks;
the ResNet feature extractor is used for acquiring feature information in the image information;
the pyramid pooling unit is used for carrying out maximum pooling and average pooling on the feature information acquired by the ResNet feature extractor;
the spatial attention branch network is used for establishing interaction of a spatial dimension and a channel dimension based on the input feature diagram;
the time attention module comprises an encoder, a decoder and a multi-layer perceptron;
the encoder is used for understanding, summarizing and memorizing the historical motion state sequence of the vehicle;
the decoder is used for carrying out time sequence generation and feature extraction and updating the hidden state of the current moment;
the multi-layer perceptron is operative to derive energy terms based on hidden states of the encoder and decoder.
A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the end-to-end automatic driving behavior decision method when executing the computer program.
Compared with the prior art, the invention has the following beneficial effects:
the invention discloses an end-to-end automatic driving decision method, which is characterized in that a convolutional neural network embedded with an attention mechanism is used for extracting scene space position characteristics, a space characteristic extraction network is constructed, and scene target space characteristics and semantic information are accurately analyzed; capturing scene time context characteristics through a long-term memory network coding-decoding structure embedded with a time attention mechanism, constructing a time characteristic extraction network, and understanding memory scene time sequence information; the method combines scene space information and time sequence information, and simultaneously gives higher weight to the key visual area and the motion sequence by combining the attention mechanism, so that the prediction process is more in line with the driving habit of a human driver, and the prediction result is more accurate.
Furthermore, in the end-to-end automatic driving decision process, the visual information in the RGB image alone is not sufficient to fully perceive objects such as vehicles, pedestrians and obstacles in the scene; depth information contains more position and contour features of scene objects, and semantic segmentation information provides high-level semantic understanding of the driving scene.
Further, the spatial position features and the time context features of the scene are extracted by a convolutional network and an LSTM respectively and then fused; at the same time, semantic segmentation information is used during spatial feature extraction to improve the prediction precision of the model, and the decision quantities for vehicle speed and steering angle are output. Although multi-modal input improves the prediction effect of the model, it does not by itself attend to key objects in the scene, such as pedestrians, lane lines and traffic signs.
Further, in order to improve the ability to extract spatio-temporal saliency features of the driving scene, attention modules are introduced so that the end-to-end automatic driving behavior decision method focuses on the detailed information of the current task's target regions; the system can thus improve performance and efficiency under limited resources and reduce unnecessary waste of resources.
Further, because the multi-modal input and the embedded attention modules enlarge the network model, the network is pruned through sparse masks to reduce the complexity of the model.
The invention also discloses an end-to-end automatic driving behavior decision system and a terminal device that implement the above method.
drawings
FIG. 1 is an overall architecture of an end-to-end autopilot decision system of the present invention;
FIG. 2 is a sparse mask matrix pruning training process in the method of the present invention;
FIG. 3 is a spatial attention module in the present invention;
FIG. 4 is an LSTM encoding-decoding structure in the present invention;
FIG. 5 is a training loss comparison between the system of the present invention and the MM-STConv model;
FIG. 6 is a prediction accuracy comparison between the system of the present invention and the MM-STConv model;
FIG. 7 is a speed prediction curve of the system of the present invention over 100 s (1000 frames) of consecutive images in the dataset;
FIG. 8 is a steering angle prediction curve of the system of the present invention over 100 s (1000 frames) of consecutive images in the dataset.
Detailed Description
The invention is described in further detail below with reference to the attached drawing figures:
example 1
The end-to-end automatic driving behavior decision method based on the attention mechanism and spatio-temporal features comprises the following steps:
step 1: the spatial feature extraction network describes scene spatial location features using RGB image information, depth information, and semantic segmentation information.
Step 11: the packed RGB image and depth image are input into a backbone network embedded with an attention mechanism to extract RGB image features and depth features.
Step 12, the RGB image acquires semantic segmentation information and perception context information through a ResNet network and pyramid pooling module embedded with an attention mechanism;
and 13, inputting the fused three kinds of characteristic information into a full-connection layer to generate a space characteristic vector with a fixed length.
Step 2: the temporal feature extraction network extracts temporal context features using the vehicle historical motion state sequence information.
Step 3: the features obtained by the spatial feature extraction network are connected with the features obtained by the time feature extraction network, and the final prediction result is obtained through two fully connected layers.
In step 11, inputting the RGB image and the depth image into a backbone network embedded with an attention mechanism to extract RGB image features and depth features, which specifically includes:
step 111: by adopting three spatial attention branch networks to capture interaction between the spatial dimension and the channel dimension of the input feature map, important position features of the traffic scene are highlighted to suppress irrelevant scene features.
Step 112: and pruning the backbone network through different sparse masks obtained by training the backbone network, and generating two different sub-networks for extracting RGB features, depth features and semantic features.
The step 111 specifically includes steps A-D:
A. In the first branch of the attention module, interaction between the spatial dimension H and the channel dimension C is established by rotating the input feature map; maximum pooling and average pooling are applied to the rotated feature map respectively, the results are input into two fully connected (FC) layers for encoding, and finally attention weights are generated through a sigmoid activation function and multiplied with the original input feature map to obtain an attention map;
B. In the second branch of the attention module, interaction between the spatial dimension W and the channel dimension C is established by rotating the input feature map; maximum pooling and average pooling are applied to the rotated feature map respectively, the results are input into two fully connected (FC) layers for encoding, and finally attention weights are generated through a sigmoid activation function and multiplied with the original input feature map to obtain an attention map;
C. In the third branch of the attention module, maximum pooling and average pooling are applied to the input feature map; the average-pooled feature map and the maximum-pooled feature map are cascaded, attention weights are generated through a sigmoid activation function, and the weights are multiplied element-wise with the input feature map to obtain a weighted attention map;
D. The attention maps obtained by the three branches of the attention module are added element-wise and averaged to obtain the final spatial attention map.
The step 112 specifically includes steps A-C:
A. randomly initializing a base network and a mask matrix, and simultaneously setting a pruning threshold value;
B. the base network and the mask matrix are trained together in stages, and the mask matrix is iteratively updated to obtain two different sparse mask matrices with shared parameters;
C. the different sparse masks of the two sub-networks in the spatial feature extraction network are obtained through training and used to prune the base network.
In step 12, the RGB image passes through a ResNet network and pyramid pooling module embedded with an attention mechanism to obtain semantic segmentation information, specifically as follows:
The RGB image passes through a ResNet feature extractor embedded with the spatial attention module and a pyramid pooling module to obtain semantic segmentation features that fuse global information and multi-scale context information, thereby yielding a spatial attention map and, further, high-level semantic information.
The step 2 specifically comprises the following steps:
step 21, the LSTM encoder is used for understanding, summarizing and memorizing the historical motion state sequence of the vehicle;
and 22, constructing a time attention mechanism in the time feature extraction network, modeling the relation between the historical speed state sequences, and giving more weight to important time context features.
Step 23, the decoder performs time series generation and feature extraction to update the hidden state at the current time.
The LSTM encoder in step 21 understands, summarizes and memorizes the vehicle history continuous motion state sequence, specifically as follows:
the LSTM encoder carries out T times of recursion update on the vehicle history continuous motion state sequence with the length of T to obtain a time context coding vector c with a fixed length t
The step 22 builds a time attention mechanism in the time feature extraction network, specifically as follows:
step 221, the multi-layer sensor in the time attention module is based on the encoder hiding state,Hidden state of decoder, resulting in energy term e ji
Step 222, softmax function in the time attention module is based on e ji Obtaining abstract features of the encoder at the ith step and attention coefficient a of the decoder at the jth step ji
Step 223, the time attention module applies attention coefficient a ji As the weight, the hidden states of all moments are weighted and summed to obtain the context vector m of the decoder at the j-th step j
The decoder performs time sequence generation and feature extraction in step 23, and updates the hidden state at the current time, specifically as follows:
the decoder outputs the vector y according to the previous time j-1 Hidden state s j-1 And context vector m j Hidden state s for step j j Update according to s j 、y j-1 And m j Updating the historical motion output vector decoded in the j step.
Example 2
As shown in fig. 1, the end-to-end automatic driving behavior decision method based on the attention mechanism and the space-time characteristics specifically comprises the following steps:
step 1: the space feature extraction network utilizes RGB image information, depth information and semantic segmentation information to describe scene space position features, and utilizes a sparse mask pruning backbone network to generate two sub-networks with shared parameters for extracting the image space position features and the semantic features;
step 2: the temporal feature extraction network extracts temporal context features using the vehicle historical motion speed sequence information.
Step 3: the features obtained by the spatial network are connected with the features obtained by the time-sequence network to obtain the final prediction result.
The step 1 comprises the steps of 11, 12 and 13:
and step 11, inputting the packed RGB image and depth image into a ResNet network embedded with an attention mechanism to extract RGB image features and depth features.
Step 12, the RGB image acquires semantic segmentation information and perception context information through a ResNet network and pyramid pooling module embedded with an attention mechanism;
and 13, generating a feature map with the same size as the space feature extraction feature map by passing the semantic segmentation feature map through a convolution layer and two pooling layers, and connecting and inputting the feature map and the pooling layer to a full-connection layer to generate a space feature vector with a fixed length.
In the step 11, the RGB image and the depth image are input into a res net network embedded with an attention mechanism to extract RGB image features and depth features, which specifically comprises the steps 111 to 113:
step 111, extracting feature information useful for decision tasks by embedding a spatial attention module in each bottleneck block of the res net, as shown in fig. 3, the attention module suppresses irrelevant scene features by establishing dependency between spatial channels to highlight important position features of traffic scenes; the method comprises the following steps:
A. In the first branch of the attention module, the input feature map (of shape C×H×W) is rotated 90 degrees counterclockwise along the H dimension to establish interaction between the H dimension and the C dimension; the rotated feature map F_r1(x) has shape W×H×C. Maximum pooling and average pooling are then applied to the rotated feature map respectively, and the resulting vectors are input into two fully connected (FC) layers to encode the relationships between channels. Finally, the two encoded feature vectors are added element-wise, attention weights are generated through a sigmoid function and multiplied with the original input feature map, and the output is rotated 90 degrees clockwise along the H dimension so that its shape is consistent with the input.
B. In the second branch of the attention module, the input feature map is rotated 90 degrees counterclockwise along the W dimension to establish interaction between the W dimension and the C dimension; the rotated feature map F_r2(x) has shape H×W×C. Maximum pooling and average pooling are then applied to the rotated feature map respectively, and the resulting vectors are input into two fully connected (FC) layers to encode the relationships between channels. Finally, the two encoded feature vectors are added element-wise, attention weights are generated through a sigmoid function and multiplied with the original input feature map, and the output is rotated 90 degrees clockwise along the W dimension so that its shape is consistent with the input.
C. In the third branch of the attention module, maximum pooling and average pooling are applied to the input feature map, and the average-pooled feature map of shape 1×H×W and the maximum-pooled feature map of shape 1×H×W are cascaded into a 2×H×W feature tensor. This tensor first passes through a standard convolution layer with kernel size K×K and a batch normalization layer, then attention weights are generated through a sigmoid activation function and multiplied element-wise with the input feature map to obtain the weighted attention map.
D. The attention maps obtained by the three branches of the attention module are added element-wise and averaged to obtain the final spatial attention map.
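A minimal sketch of the three-branch spatial attention module is given below, assuming a PyTorch implementation; the FC reduction ratio, the kernel size K, and the choice of applying the weights to the rotated feature map before rotating back are illustrative assumptions rather than the exact patented configuration.

```python
# Hypothetical sketch: two branches rotate the feature map so that a spatial dimension
# (H or W) is coupled with the channel dimension C, the third branch pools over the
# channel dimension; the three attention-weighted maps are averaged element-wise.
import torch
import torch.nn as nn

class RotatedBranch(nn.Module):
    """Rotate x so that the given dims are exchanged, pool, encode with two FC layers, re-weight."""
    def __init__(self, size, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(size, size // reduction), nn.ReLU(),
                                nn.Linear(size // reduction, size))

    def forward(self, x, dims):
        xr = torch.rot90(x, 1, dims)                   # rotate 90 degrees counterclockwise
        avg = xr.mean(dim=(2, 3))                      # average pooling
        mx = xr.amax(dim=(2, 3))                       # maximum pooling
        w = torch.sigmoid(self.fc(avg) + self.fc(mx))  # attention weights via sigmoid
        xw = xr * w[:, :, None, None]                  # re-weight the rotated feature map
        return torch.rot90(xw, -1, dims)               # rotate back to the input shape

class SpatialAttention(nn.Module):
    def __init__(self, height, width, k=7):
        super().__init__()
        self.branch_h = RotatedBranch(height)          # couples H with C
        self.branch_w = RotatedBranch(width)           # couples W with C
        self.branch_c = nn.Sequential(                 # channel-pooled third branch
            nn.Conv2d(2, 1, k, padding=k // 2), nn.BatchNorm2d(1), nn.Sigmoid())

    def forward(self, x):                              # x: (batch, C, H, W)
        a1 = self.branch_h(x, dims=(1, 2))             # rotate along the H dimension
        a2 = self.branch_w(x, dims=(1, 3))             # rotate along the W dimension
        pooled = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        a3 = x * self.branch_c(pooled)                 # weighted attention map
        return (a1 + a2 + a3) / 3.0                    # element-wise average of the three
```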
In step 112, two different sub-networks are generated for extracting RGB features, depth features and semantic features by pruning the backbone network with different sparse masks obtained by training the backbone network. As shown in fig. 2, the specific steps are as follows:
A. The base network is randomly initialized, the mask matrix is set to all ones, and a pruning threshold is set.
B. The base network and the mask matrix are trained together in stages; the trained weights are compared with the threshold to iteratively update the mask matrix, yielding two different sparse mask matrices that share common parameters.
C. The different sparse masks of the two sub-networks in the spatial feature extraction network are obtained through training, and the elements of each ResNet sub-network that contribute little to the task are pruned.
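The following is a hedged sketch of threshold-based sparse-mask pruning consistent with steps A-C above; the threshold value, the magnitude-based selection rule and the update schedule are assumptions made for illustration only.

```python
# Hypothetical sketch: a binary mask per weight tensor is updated iteratively during
# training, zeroing weights whose magnitude falls below the pruning threshold. Two such
# masks over a shared backbone would yield the two task-specific sub-networks.
import torch
import torch.nn as nn

def update_masks(model: nn.Module, masks: dict, threshold: float = 1e-2) -> dict:
    """Recompute binary masks from current weight magnitudes and re-apply them."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if "weight" not in name:
                continue
            masks[name] = (param.abs() > threshold).float()  # keep only significant weights
            param.mul_(masks[name])                          # prune the network in place
    return masks

# Usage: masks start as all ones and are refined every few epochs during joint training.
backbone = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
masks = {n: torch.ones_like(p) for n, p in backbone.named_parameters() if "weight" in n}
masks = update_masks(backbone, masks, threshold=1e-2)
```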
In step 12, the RGB image passes through a ResNet network and pyramid pooling module embedded with an attention mechanism to obtain semantic segmentation information, specifically as follows:
The RGB image passes through a ResNet feature extractor embedded with the spatial attention module and a pyramid pooling module to obtain semantic segmentation features that fuse global information and multi-scale context information, thereby yielding a spatial attention map and, further, high-level semantic information.
The step 2 specifically comprises the following steps:
step 21, the LSTM encoder is used for understanding, summarizing and memorizing the historical continuous motion state sequence of the vehicle;
and 22, constructing a time attention mechanism in the time feature extraction network, modeling the relation between the historical speed state sequences, and giving more weight to important time context features.
Step 23, the decoder performs time sequence generation and feature extraction, and updates the hidden state at the current moment;
as shown in fig. 4, the LSTM encoder in step 21 understands, summarizes and memorizes the vehicle history continuous motion state sequence as follows:
The LSTM encoder performs t recursive updates on the vehicle's historical continuous motion state sequence s_1, ..., s_t of length t to obtain a fixed-length time context encoding vector c_t; this vector contains the encoder's understanding, summary and memory of the vehicle's historical continuous motion state sequence.
the step 22 builds a time attention mechanism in the time feature extraction network, specifically as follows:
step 221, the multi-layer perceptron in the time attention branching network based on the encoder hidden state h of step i i Hidden state s of the decoder of step j-1 j-1 Obtaining energy term e ji =w T tanh(W[s j-1 ,h i ]+b), where W and b are the input layer to hidden layer weights and bias vectors, and W is the hidden layer to output layer weight vector.
Step 222: the Softmax function in the time attention branch network obtains from e_ji the attention coefficient a_ji between the encoder abstract feature at step i and the decoder at step j, i.e. a_ji = exp(e_ji) / Σ_(k=1..t) exp(e_jk), where t is the length of the input sequence.
Step 223: the time attention branch network takes the attention coefficients a_ji as weights and performs a weighted sum over the hidden states at all moments to obtain the context vector m_j of the decoder at step j, i.e. m_j = Σ_(i=1..t) a_ji·h_i.
The decoder performs time sequence generation and feature extraction in step 23, and updates the hidden state at the current time, specifically as follows:
the decoder outputs the vector y according to the previous time j-1 Hidden state s j-1 And context vector m j Hidden state s for step j j Update according to s j 、y j-1 And m j Updating the historical motion output vector decoded in the j step.
Effect verification:
in order to verify the effectiveness of the method, a data set generated and marked by an automatic driving simulation test platform is adopted, 8112 images are selected for training, and the rest 3476 images are used for carrying out algorithm verification on the test images.
Training errors of the method of the invention and of the MM-STConv behavior decision model are compared, with the result shown in FIG. 5. The training loss curves of both models gradually decrease as the training period increases; the loss curve of the end-to-end automatic driving decision method based on the attention mechanism and spatio-temporal features lies below that of the MM-STConv behavior decision model overall and decreases faster. At the same time, the training loss curve of the model incorporating the attention mechanism shows less jitter than that of the MM-STConv behavior decision model. Compared with the MM-STConv behavior decision model, the end-to-end automatic driving behavior decision method based on the attention mechanism and spatio-temporal features therefore trains more stably and efficiently and converges faster.
The prediction accuracy of the method of the invention is compared with that of the MM-STConv behavior decision model, with the result shown in FIG. 6. The prediction accuracy curves of both methods gradually rise as the training period increases; the accuracy curve of the method based on the attention mechanism and spatio-temporal features lies above that of the MM-STConv behavior decision model overall and rises faster. Owing to the introduction of the attention modules, the end-to-end automatic driving behavior decision method based on the attention mechanism and spatio-temporal features achieves better model performance and more stable prediction results than the MM-STConv behavior decision model.
Speed prediction and steering angle prediction are performed with the system, and the results are shown in FIG. 7 and FIG. 8. The speed and steering angle prediction curves of the method are close to the ground-truth reference curves, fit them well, and show little jitter, so the prediction is stable.
Example 3
The method of the present invention, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment through a computer program that instructs related hardware; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, it implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, and so on. Computer-readable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. It should be noted that the content contained in the computer-readable medium may be adjusted according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunication signals. The computer storage medium may be any available medium or data storage device that can be accessed by a computer, including, but not limited to, magnetic storage (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MO), etc.), optical storage (e.g., CD, DVD, BD, HVD, etc.), and semiconductor storage (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), solid state disks (SSD)), etc.
Example 4
In an exemplary embodiment, a computer device is also provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the end-to-end automatic driving behavior decision method when executing the computer program. The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (4)

1. An end-to-end automatic driving behavior decision method is characterized by comprising the following steps:
based on the image information, the depth information and the semantic segmentation information, acquiring the spatial characteristics of the scene through a convolutional neural network embedded with an attention mechanism;
based on the historical motion state sequence information of the vehicle, acquiring time characteristics through a memory network coding-decoding structure embedded with an attention mechanism;
the spatial features are connected with the time features, an end-to-end automatic driving behavior decision model is established, and an end-to-end automatic driving behavior prediction result is obtained;
the extraction process of the spatial features comprises the following steps:
inputting the packed image and depth image into a backbone network to obtain image information and depth information;
passing the image information through the backbone network and pyramid pooling to obtain semantic segmentation information;
inputting the image information, the depth information and the semantic segmentation information into a connecting layer, generating a space feature vector with a fixed length, and acquiring the space feature of a scene;
wherein, the attention mechanism is embedded in the backbone network;
the image information and depth information are acquired by the following steps:
capturing an input feature map by using three spatial attention branch networks, establishing interaction between a spatial dimension and a channel dimension, and acquiring a spatial attention map;
training a backbone network to obtain different sparse masks; pruning a backbone network by using different sparse masks to generate two different sub-networks, and acquiring image information and depth information based on the different sub-networks;
the acquisition process of the spatial attention map specifically includes:
establishing interaction between a spatial dimension and the channel dimension in each of the three spatial attention branch networks by rotating the input feature map; carrying out maximum pooling and average pooling on the rotated feature maps respectively; cascading the average-pooled feature map with the maximum-pooled feature map and inputting the result to two fully connected layers for encoding; generating attention weights through a sigmoid activation function and combining them with the original input feature map to obtain three attention maps; and averaging the three attention maps to obtain the spatial attention map;
the specific operation of pruning is as follows:
randomly initializing a base network and a mask matrix, and simultaneously setting a pruning threshold value;
training the base network and the mask matrix together in stages, and iteratively updating the mask matrix to obtain two different sparse mask matrices with shared parameters;
obtaining different sparse masks of two sub-networks through training, and pruning a main network by using the different sparse masks;
the extraction process of the time features comprises the following steps:
the method comprises the steps of utilizing an encoder to understand, summarize and memorize historical motion state sequence information of a vehicle to obtain a historical motion state feature vector of the vehicle;
constructing a time attention mechanism by using the time feature extraction network;
the decoder is utilized to carry out time sequence generation and feature extraction on the vehicle historical motion state feature vector after the time attention mechanism, the hidden state at the current moment is updated, and the time feature is obtained;
the time attention mechanism is constructed based on a time attention module, and specific operations comprise:
the multi-layer perceptron in the time attention module obtains energy items according to the hidden state of the encoder and the hidden state of the decoder;
the Softmax function in the time attention module obtains, from the energy terms, the attention coefficients between the encoder abstract features and the decoder at each step;
the time attention module takes the attention coefficient as a weight, and performs weighted summation on the hidden states of all moments to obtain the context vector of the decoder at each moment.
2. An end-to-end automatic driving behavior decision system in accordance with the method of claim 1, comprising:
the spatial attention module is used for acquiring spatial features of the scene based on the image information, the depth information and the semantic segmentation information;
the time attention module is used for acquiring time characteristics based on the vehicle historical motion state sequence information and constructing a time attention mechanism based on the time characteristics;
the model building module is respectively interacted with the space attention module and the time attention module, and is used for building an end-to-end automatic driving behavior decision model based on the space characteristics and the time attention mechanism, and predicting end-to-end automatic driving behavior results through the end-to-end automatic driving behavior decision model.
3. The end-to-end automatic driving behavior decision system of claim 2, wherein the spatial attention module comprises a ResNet feature extractor, a pyramid pooling unit, and three spatial attention branch networks;
the ResNet feature extractor is used for acquiring feature information in the image information;
the pyramid pooling unit is used for carrying out maximum pooling and average pooling on the feature information acquired by the ResNet feature extractor;
the spatial attention branch network is used for establishing interaction of a spatial dimension and a channel dimension based on the input feature diagram;
the time attention module comprises an encoder, a decoder and a multi-layer perceptron;
the encoder is used for understanding, summarizing and memorizing the historical motion state sequence of the vehicle;
the decoder is used for carrying out time sequence generation and feature extraction and updating the hidden state of the current moment;
the multi-layer perceptron is operative to derive energy terms based on hidden states of the encoder and decoder.
4. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the end-to-end automatic driving behavior decision method of claim 1 when the computer program is executed by the processor.
CN202110391084.6A 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment Active CN113139446B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110391084.6A CN113139446B (en) 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110391084.6A CN113139446B (en) 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment

Publications (2)

Publication Number Publication Date
CN113139446A CN113139446A (en) 2021-07-20
CN113139446B true CN113139446B (en) 2024-02-06

Family

ID=76811192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110391084.6A Active CN113139446B (en) 2021-04-12 2021-04-12 End-to-end automatic driving behavior decision method, system and terminal equipment

Country Status (1)

Country Link
CN (1) CN113139446B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673412B (en) * 2021-08-17 2023-09-26 驭势(上海)汽车科技有限公司 Method and device for identifying key target object, computer equipment and storage medium
CN114463670B (en) * 2021-12-29 2023-02-03 电子科技大学 Airport scene monitoring video change detection system and method
CN114423061B (en) * 2022-01-20 2024-05-07 重庆邮电大学 Wireless route optimization method based on attention mechanism and deep reinforcement learning
CN114777797B (en) * 2022-06-13 2022-09-30 长沙金维信息技术有限公司 High-precision map visual positioning method for automatic driving and automatic driving method
CN115049130B (en) * 2022-06-20 2024-06-04 重庆邮电大学 Automatic driving track prediction method based on space-time pyramid

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
WO2020253965A1 (en) * 2019-06-20 2020-12-24 Toyota Motor Europe Control device, system and method for determining perceptual load of a visual and dynamic driving scene in real time
CN112418409A (en) * 2020-12-14 2021-02-26 南京信息工程大学 Method for predicting time-space sequence of convolution long-short term memory network improved by using attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10940863B2 (en) * 2018-11-01 2021-03-09 GM Global Technology Operations LLC Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
WO2020253965A1 (en) * 2019-06-20 2020-12-24 Toyota Motor Europe Control device, system and method for determining perceptual load of a visual and dynamic driving scene in real time
CN112418409A (en) * 2020-12-14 2021-02-26 南京信息工程大学 Method for predicting time-space sequence of convolution long-short term memory network improved by using attention mechanism

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
杜圣东, 李天瑞, 杨燕, 王浩, 谢鹏, 洪西进. A traffic flow prediction model based on sequence-to-sequence spatiotemporal attention learning. 计算机研究与发展 (Journal of Computer Research and Development), 2020, No. 8, full text. *
王军, 鹿姝, 李云伟. Multimodal sign language recognition fusing an attention mechanism and connectionist temporal classification. 信号处理 (Journal of Signal Processing), 2020, No. 9, full text. *
胡学敏, 童秀迟, 郭琳, 张若晗, 孔力. An end-to-end autonomous driving model based on a deep visual attention neural network. 计算机应用 (Journal of Computer Applications), 2020, No. 7, full text. *
蔡英凤, 朱南楠, 邰康盛, 刘擎超, 王海. Vehicle behavior prediction based on an attention mechanism. 江苏大学学报(自然科学版) (Journal of Jiangsu University, Natural Science Edition), 2020, No. 2, full text. *
赵祥模, 连心雨, 刘占文, 沈超, 董鸣. An end-to-end autonomous driving behavior decision model based on MM-STConv. 中国公路学报 (China Journal of Highway and Transport), 2020, No. 3, full text. *

Also Published As

Publication number Publication date
CN113139446A (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN113139446B (en) End-to-end automatic driving behavior decision method, system and terminal equipment
Cai et al. Environment-attention network for vehicle trajectory prediction
WO2021180130A1 (en) Trajectory prediction
Akan et al. Stretchbev: Stretching future instance prediction spatially and temporally
US10325371B1 (en) Method and device for segmenting image to be used for surveillance using weighted convolution filters for respective grid cells by converting modes according to classes of areas to satisfy level 4 of autonomous vehicle, and testing method and testing device using the same
CN114372116B (en) Vehicle track prediction method based on LSTM and space-time attention mechanism
CN112686281A (en) Vehicle track prediction method based on space-time attention and multi-stage LSTM information expression
CN113468978B (en) Fine granularity car body color classification method, device and equipment based on deep learning
CN110570035B (en) People flow prediction system for simultaneously modeling space-time dependency and daily flow dependency
KR20170038622A (en) Device and method to segment object from image
CN113362491A (en) Vehicle track prediction and driving behavior analysis method
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN117157678A (en) Method and system for graph-based panorama segmentation
CN115298670A (en) Method for continuously learning classifier for classifying images of client by using continuous learning server and continuous learning server using same
Pavlitskaya et al. Using mixture of expert models to gain insights into semantic segmentation
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN113051983B (en) Method for training field crop disease recognition model and field crop disease recognition
Zhao et al. End‐to‐end autonomous driving decision model joined by attention mechanism and spatiotemporal features
CN115018039A (en) Neural network distillation method, target detection method and device
Oh et al. Hcnaf: Hyper-conditioned neural autoregressive flow and its application for probabilistic occupancy map forecasting
CN116432736A (en) Neural network model optimization method and device and computing equipment
CN116861262B (en) Perception model training method and device, electronic equipment and storage medium
Du et al. Adaptive visual interaction based multi-target future state prediction for autonomous driving vehicles
CN114626500A (en) Neural network computing method and related equipment
CN117116048A (en) Knowledge-driven traffic prediction method based on knowledge representation model and graph neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant