CN114707630A - Multi-modal trajectory prediction method by paying attention to scenes and states - Google Patents

Multi-modal trajectory prediction method by paying attention to scenes and states

Info

Publication number
CN114707630A
CN114707630A (Application CN202210142283.8A)
Authority
CN
China
Prior art keywords
agent
information
attention
formula
mlp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210142283.8A
Other languages
Chinese (zh)
Inventor
李琳辉
王雪成
连静
丁荣琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202210142283.8A priority Critical patent/CN114707630A/en
Publication of CN114707630A publication Critical patent/CN114707630A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/02 - Affine transformations
    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G1/00 - Traffic control systems for road vehicles
    • G08G1/01 - Detecting movement of traffic to be counted or controlled
    • G08G1/0104 - Measuring and analyzing of parameters relative to traffic conditions

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal trajectory prediction method that attends to scenes and states, comprising the following steps: a fully convolutional neural network extracts scene information and an affine transformation focuses on the target agent; a Transformer encodes the agent's historical states; and the features are fused and decoded. The method jointly considers the agent's historical state information and its surrounding environment information, and combines three attention mechanisms to extract and fuse these two kinds of information: the Transformer's self-attention captures the latent relations among the historical states; the affine transformation crops key positions from the feature map, acting as a hard attention mechanism that focuses on a specific agent; and the multi-head attention mechanism effectively captures the interaction between state and scene. The method finally generates multiple socially acceptable trajectories and their associated probabilities.

Description

Multi-modal trajectory prediction method by paying attention to scenes and states
Technical Field
The invention belongs to the field of automatic driving, and particularly relates to a multi-modal trajectory prediction method for intelligent agents.
Background
In recent years, Artificial Intelligence (AI) has developed at an unprecedented pace, and autonomous driving has become one of its most promising applications; its impact on the production and lifestyle of human society is expected to be no less profound than the invention of the automobile more than a century ago. At present, autonomous driving technology is not yet mature, and intelligent vehicles must share traffic scenes with human drivers. To travel safely and efficiently, an intelligent vehicle needs to know not only which objects are around it, but also which trajectories the surrounding agents (vehicles, cyclists, pedestrians, etc.) are likely to follow. This is the trajectory prediction problem in the field of automatic driving.
Trajectory prediction links environment perception and decision planning in automatic driving. Accurately predicting the future trajectories of agents in dynamic, interactive and uncertain scenes is essential for an autonomous vehicle to understand its surroundings, make decisions, and drive stably, safely and economically. However, trajectory prediction is a very challenging task because of the randomness of agent motion, the complexity of interactions, and the multi-modal nature of future trajectory distributions. Early methods based on dynamics and kinematics have poor accuracy and generalization ability and are therefore suitable only for short-term trajectory prediction in simple driving scenes. In recent years, trajectory prediction methods based on deep learning have become popular. Some methods rely only on agents' historical trajectories and design various interaction mechanisms to capture inter-agent interaction information, but they ignore other important factors affecting agent motion, such as environmental conditions and traffic rules, which can make predictions unacceptable in complex traffic scenes; moreover, most of these methods produce only a single predicted trajectory, which does not reflect the multi-modal nature of future motion. End-to-end approaches use only convolutional neural networks (CNNs) to extract agent state information and scene information from image sequences; although CNNs handle spatial dependencies well, they lack a mechanism for modeling sequential data, whereas trajectory prediction requires modeling how the agent's state depends on time. Other methods use both historical trajectories and scene information but employ no attention mechanism, so the model cannot identify the most important factors influencing future motion; the resulting data redundancy and model complexity lead to inaccurate predictions with poor social acceptability.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a multi-modal trajectory prediction method capable of generating a plurality of socially acceptable trajectories and associated probabilities thereof by paying attention to scenes and states.
To achieve this purpose, the specific technical scheme of the invention is as follows: a multi-modal trajectory prediction method that attends to scenes and states, comprising the following steps:
A. A fully convolutional neural network extracts scene information and an affine transformation focuses on the target agent
A1. Construct a composite grid map. The environment information is first rendered on a high-definition map under a bird's-eye view to preserve the sizes and positions of the agents and the road geometry while ignoring their texture. To capture the dynamic interactions between agents and between agents and the environment, and to avoid occluded semantic annotations and data redundancy, several consecutive rasterized high-definition maps are split and recombined into a composite grid map $\mathcal{M} \in \mathbb{R}^{H \times W \times C}$, where H is the length of the composite grid map, W is the width, and C is the number of channels.
A2. Extract scene and interaction information with the fully convolutional network. A fully convolutional network (FCN) learns representative topological, semantic and interaction information from the composite grid map $\mathcal{M}$ obtained in step A1:

$$F_s = \mathrm{FCN}(\mathcal{M}; W_s) \qquad (1)$$

where FCN(·) is the fully convolutional neural network, $\mathcal{M}$ is the composite grid map, and $W_s$ denotes the weights of the fully convolutional neural network.

To reduce the number of model parameters, the convolutional neural network (CNN) inside the FCN uses MobileNetV2. Using an FCN not only yields scene and interaction information, it also keeps the output feature map the same size as the input image, so the target agent can be attended to according to its initial position.
A3. Attend to the target agent. An agent usually pays more attention to objects that are closer to it and interact with it more strongly, so after the feature map $F_s$ is obtained in step A2, a small feature map $F_c$ is cropped at the target agent's position to focus on that agent. After the small feature map $F_c$ is obtained, it is rotated by an affine transformation according to the agent's orientation so as to normalize the orientation of the agent. The affine transformation and the affine matrix are given by formulas (2) and (3):

$$\tilde{F}_c = \mathrm{Affine}(F_c, \theta) \qquad (2)$$

$$\theta = \begin{bmatrix} \cos h & -\sin h & 0 \\ \sin h & \cos h & 0 \end{bmatrix} \qquad (3)$$

where Affine(·) is the affine transformation function, θ is the affine matrix, and h is the orientation of the agent.
B. Encoding agent historical state using Transformer
Before the historical state information is encoded with a Transformer, the state variables are first concatenated, then a multi-layer perceptron (MLP) embeds them into a high-dimensional space to obtain $f_t$ of fixed size, and finally $f_t$ is embedded into a higher-dimensional space to obtain $F_a$. The MLP embedding and the embedding-matrix projection are given by:

$$f_t = \mathrm{MLP}(\mathrm{concat}(x_t, y_t, v_t, \sin(h), \cos(h))) \qquad (4)$$

$$F_a = f_t \cdot W_f \qquad (5)$$

where $(x_t, y_t)$ are the position coordinates of the agent, $v_t$ is its speed, $h$ is its orientation, concat(·) is the concatenation function, MLP(·) is the multi-layer perceptron, $W_f$ is the embedding matrix, and $t$ is the timestamp. The historical state information includes position, speed and orientation.
Each past time instant t is time-coded using the Transformer "position coding":

$$P_{(t,2d)} = \sin\!\left(t / 10000^{2d/D}\right) \qquad (6)$$

$$P_{(t,2d+1)} = \cos\!\left(t / 10000^{2d/D}\right) \qquad (7)$$

$$I_a = F_a + P_a \qquad (8)$$

where d is the data dimension index and D is the model embedding dimension.
To capture the inherent latent dependencies among the agent's historical states, $I_a$ first undergoes three linear projections to obtain three matrices, $Q_a = W_q I_a$, $K_a = W_k I_a$ and $V_a = W_v I_a$, where $W_q$, $W_k$ and $W_v$ are the projection matrices of $Q_a$, $K_a$ and $V_a$ respectively. A self-attention mechanism then extracts the internal relations among the historical states, and a feed-forward network produces the final representation $E_a$ of the agent's historical states. The self-attention mechanism (9) and the feed-forward network (10) are:

$$\mathrm{Att}(Q_a, K_a, V_a) = \mathrm{softmax}\!\left(\frac{Q_a K_a^{\top}}{\sqrt{d_{K_a}}}\right) V_a \qquad (9)$$

$$E_a = \mathrm{FFN}(\mathrm{Att}(Q_a, K_a, V_a)) \qquad (10)$$

where softmax(·) is the softmax function, Att(·) is multi-head self-attention, FFN(·) is the feed-forward network, and $d_{K_a}$ is the dimension of $K_a$.
C. Feature fusion and decoding
C1. Feature fusion. After the cropped feature map $\tilde{F}_c$ is obtained in step A3, the scene information is further extracted with a depthwise separable convolution, i.e. $Q' = \mathrm{DSC}(\tilde{F}_c)$, where DSC(·) is the depthwise separable convolution. An MLP then reduces the dimensionality of the agent's historical state $E_a$ to obtain $K'$ and $V'$, as in formula (11). Finally, a multi-head attention mechanism fuses the scene information $Q'$ extracted by the depthwise separable convolution with the reduced state information $K'$ and $V'$ to obtain the fused feature $I$, as in formulas (12) and (13), where formula (12) is the attention-head function and formula (13) concatenates 8 attention heads:

$$K' = V' = \mathrm{MLP}(E_a) \qquad (11)$$

$$\mathrm{head}_i = \mathrm{Softmax}\!\left(\frac{(Q' W_i^{Q'})(K' W_i^{K'})^{\top}}{\sqrt{d_{K'}}}\right) V' W_i^{V'}, \quad i = 1, \dots, 8 \qquad (12)$$

$$I = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_8) \qquad (13)$$

where MLP(·) is the multi-layer perceptron, $K'$ is the key of the reduced state information, $V'$ is the value of the reduced state information, $\mathrm{head}_i$ is the i-th attention head, Softmax(·) is the softmax function, $W_i^{Q'}$, $W_i^{K'}$ and $W_i^{V'}$ are the projection matrices of $Q'$, $K'$ and $V'$, $d_{K'}$ is the dimension of $K'$, and Concat(·) is the concatenation function.
C2. Decode the fused features. The fused feature $I$ is fed into a regression branch and a classification branch to obtain K predicted trajectories and their associated probabilities; the regression branch is given by formula (14) and the classification branch by formulas (15) and (16):

$$O_{K,\mathrm{reg}} = \{\hat{\tau}^k\}_{k=1}^{K} = \mathrm{MLP}_3(I) \qquad (14)$$

$$\mathrm{logit} = \mathrm{MLP}_2\{\mathrm{Concat}(O_{K,\mathrm{reg}}, \mathrm{MLP}_1(I))\} \qquad (15)$$

$$\{\hat{p}^k\}_{k=1}^{K} = \mathrm{softmax}(\mathrm{logit}) \qquad (16)$$

where $\mathrm{MLP}_1(\cdot)$, $\mathrm{MLP}_2(\cdot)$ and $\mathrm{MLP}_3(\cdot)$ are 1-layer, 2-layer and 3-layer multi-layer perceptrons, $\hat{\tau}^k$ is the k-th predicted trajectory, and $\hat{p}^k$ is the probability of the k-th predicted trajectory.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention performs multi-modal trajectory prediction by attending to scenes and states, jointly considering scene context information and the agent's historical state information to generate several accurate and socially acceptable future trajectories and their probabilities.
2. The invention avoids the problems of occluded semantic annotations and data redundancy through its scene representation method and the affine transformation.
3. The method combines three attention mechanisms (the affine transformation as a hard attention mechanism, the Transformer self-attention mechanism, and the multi-head attention mechanism) to fuse scene information and state information, improving prediction accuracy and social acceptability compared with traditional trajectory prediction methods.
4. In summary, the invention jointly considers the agent's historical state information and its surrounding environment information, and combines three attention mechanisms to extract and fuse the two kinds of information: the Transformer's self-attention captures the latent relations among the historical states; the affine transformation crops key positions from the feature map, acting as a hard attention mechanism that focuses on a specific agent; and the multi-head attention mechanism effectively captures the interaction between state and scene. The method finally produces multiple socially acceptable trajectories and their associated probabilities.
Drawings
The invention comprises 4 drawings, wherein:
FIG. 1 is a flow chart of a method for multi-modal trajectory prediction by noting the scene and state.
Fig. 2 is a model structure diagram of a multi-modal trajectory prediction method by paying attention to scenes and states.
Fig. 3 is a composite grid map.
Fig. 4 is a diagram of a full convolutional neural network FCN architecture.
Detailed Description
The following detailed description of the embodiments of the invention refers to the accompanying drawings. As shown in fig. 1, a multi-modal trajectory prediction method by paying attention to scenes and states includes the following steps:
A. A fully convolutional neural network extracts scene information and an affine transformation focuses on the target agent
A1. Construct a composite grid map. Accurately predicting a target's trajectory requires not only the historical state information of the target agent but also scene semantic information; therefore, as shown in Fig. 2, the environment information is rendered on a high-definition map under a bird's-eye view. This representation of the scene information is simple, preserves the sizes and positions of the agents and the road geometry, and ignores their texture. To capture the dynamic interactions between agents and between agents and the environment, and to avoid occluded semantic annotations and data redundancy, the original rasterized high-definition map is split into the form shown in Fig. 3, a single channel is taken from each picture, and all the resulting channels are stacked to obtain the composite grid map $\mathcal{M} \in \mathbb{R}^{H \times W \times C}$, where H = 512 is the length of the composite grid map, W = 512 is the width, and C = 6 is the number of channels.
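As a minimal illustration of step A1, the sketch below stacks one channel from each of C = 6 consecutive rasterized bird's-eye-view frames into an H × W × C composite grid map. The function name, the use of NumPy, and the rule of keeping the first channel of each frame are assumptions for illustration; the text only specifies that one channel is taken from each picture and the channels are stacked.

```python
import numpy as np

def build_composite_grid_map(raster_frames, h=512, w=512):
    """Stack one channel from each of C consecutive rasterized HD-map renderings
    into a composite grid map of shape (C, H, W). Channel choice is an assumption."""
    channels = []
    for frame in raster_frames:
        assert frame.shape[:2] == (h, w)
        channels.append(frame[..., 0])        # keep a single channel per frame
    return np.stack(channels, axis=0)          # (C, H, W) composite grid map

# toy usage: 6 consecutive rasterized frames
frames = [np.random.rand(512, 512, 3) for _ in range(6)]
grid_map = build_composite_grid_map(frames)
print(grid_map.shape)  # (6, 512, 512)
```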
A2. Extract scene and interaction information with the fully convolutional network. A fully convolutional network (FCN) combines shallow, fine appearance information with deep, coarse semantic information, and learns representative topological, semantic and interaction information from the composite grid map $\mathcal{M}$ obtained in step A1:

$$F_s = \mathrm{FCN}(\mathcal{M}; W_s) \qquad (1)$$

where FCN(·) is the fully convolutional network, $\mathcal{M}$ is the composite grid map, and $W_s$ denotes the weights of the fully convolutional network.

To reduce the number of model parameters, the CNN inside the FCN uses the lightweight network MobileNetV2, which offers good performance with few parameters. Skip connections link the 7th, 14th and 19th layers of MobileNetV2 with the corresponding deconvolution outputs to obtain the feature map $F_s$; see Fig. 4. Using an FCN not only captures the interaction between the agents and the environmental context; more importantly, it keeps the output feature map the same size as the input composite grid map, so the target agent can be attended to according to its initial position.
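The following PyTorch sketch shows one possible FCN of the kind described: a MobileNetV2 backbone adapted to the 6-channel composite grid map, with skip connections from three stages fused through transposed convolutions so the output feature map matches the 512 × 512 input. The stage split, lateral 1×1 convolutions and output channel count are assumptions; the text only states that the 7th, 14th and 19th layers of MobileNetV2 are connected to the deconvolution outputs.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class SceneFCN(nn.Module):
    """Hypothetical FCN-style scene encoder with a MobileNetV2 backbone and
    skip connections fused by transposed-convolution upsampling."""
    def __init__(self, in_ch=6, out_ch=32):
        super().__init__()
        backbone = mobilenet_v2().features
        # adapt the first conv to the 6-channel composite grid map
        backbone[0][0] = nn.Conv2d(in_ch, 32, 3, stride=2, padding=1, bias=False)
        self.stage1 = backbone[:7]     # up to the 7th layer,  1/8  res,   32 ch
        self.stage2 = backbone[7:14]   # up to the 14th layer, 1/16 res,   96 ch
        self.stage3 = backbone[14:]    # up to the 19th layer, 1/32 res, 1280 ch
        self.lat1 = nn.Conv2d(32, out_ch, 1)
        self.lat2 = nn.Conv2d(96, out_ch, 1)
        self.lat3 = nn.Conv2d(1280, out_ch, 1)
        self.up2 = nn.ConvTranspose2d(out_ch, out_ch, 4, stride=2, padding=1)
        self.up2b = nn.ConvTranspose2d(out_ch, out_ch, 4, stride=2, padding=1)
        self.up8 = nn.ConvTranspose2d(out_ch, out_ch, 16, stride=8, padding=4)

    def forward(self, x):
        s1 = self.stage1(x)                          # (B, 32, H/8, W/8)
        s2 = self.stage2(s1)                         # (B, 96, H/16, W/16)
        s3 = self.stage3(s2)                         # (B, 1280, H/32, W/32)
        f = self.up2(self.lat3(s3)) + self.lat2(s2)  # fuse at 1/16 resolution
        f = self.up2b(f) + self.lat1(s1)             # fuse at 1/8 resolution
        return self.up8(f)                           # (B, out_ch, H, W)

feat = SceneFCN()(torch.randn(1, 6, 512, 512))
print(feat.shape)  # torch.Size([1, 32, 512, 512]): same spatial size as the input
```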
A3. Attend to the target agent. An agent is generally more concerned with objects that are closer to it and interact with it more strongly, so most of the information in the whole map is not useful to a single agent. After the feature map $F_s$ is obtained in step A2, a small feature map $F_c$ is cropped at the target agent's position to focus on that agent. The agent's orientation is also valuable information: after the small feature map $F_c$ is obtained, it is rotated by an affine transformation according to the agent's orientation, so that the orientations of all agents are unified. The affine transformation and the affine matrix are given by formulas (2) and (3):

$$\tilde{F}_c = \mathrm{Affine}(F_c, \theta) \qquad (2)$$

$$\theta = \begin{bmatrix} \cos h & -\sin h & 0 \\ \sin h & \cos h & 0 \end{bmatrix} \qquad (3)$$

where Affine(·) is the affine transformation function, θ is the affine matrix, and h is the orientation of the agent.
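A hedged sketch of this hard-attention step: crop a window of the FCN feature map around the target agent's pixel position and rotate it by the agent's heading with an affine grid, so every cropped patch is orientation-normalized. The crop size, the pixel-coordinate convention and the use of torch.nn.functional.affine_grid / grid_sample are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def focus_on_agent(feature_map, cx, cy, heading, crop=64):
    """Crop a (crop x crop) window around pixel (cx, cy) and rotate it by
    `heading` (radians) so the agent's orientation is normalized."""
    b, c, h, w = feature_map.shape
    half = crop // 2
    patch = feature_map[:, :, cy - half:cy + half, cx - half:cx + half]
    # 2x3 affine matrix of a pure rotation by `heading`
    cos_h, sin_h = torch.cos(heading), torch.sin(heading)
    theta = torch.stack([
        torch.stack([cos_h, -sin_h, torch.zeros_like(cos_h)]),
        torch.stack([sin_h,  cos_h, torch.zeros_like(cos_h)]),
    ]).unsqueeze(0).repeat(b, 1, 1)                        # (B, 2, 3)
    grid = F.affine_grid(theta, patch.shape, align_corners=False)
    return F.grid_sample(patch, grid, align_corners=False)

patch = focus_on_agent(torch.randn(1, 32, 512, 512), cx=256, cy=256,
                       heading=torch.tensor(0.5))
print(patch.shape)  # torch.Size([1, 32, 64, 64])
```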
B. Encoding agent historical state using Transformer
Before the target agent's historical state information (position, speed, orientation, etc.) is encoded with a Transformer, all state variables are first concatenated, then a multi-layer perceptron (MLP) embeds them into a 16-dimensional space to obtain $f_t$ of fixed input size, and finally $f_t$ is embedded into a higher, 64-dimensional space to obtain $F_a$. The MLP embedding and the embedding-matrix projection are given by:

$$f_t = \mathrm{MLP}(\mathrm{concat}(x_t, y_t, v_t, \sin(h), \cos(h))) \qquad (4)$$

$$F_a = f_t \cdot W_f \qquad (5)$$

where $(x_t, y_t)$ are the position coordinates of the agent, $v_t$ is its speed, $h$ is its orientation, concat(·) is the concatenation function, MLP(·) is the multi-layer perceptron, $W_f$ is the embedding matrix, and $t$ is the timestamp.
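A minimal sketch of the state embedding of Eqs. (4) and (5), assuming a single hidden ReLU layer for the MLP and a bias-free linear layer for the embedding matrix $W_f$; the 5-dimensional input and the 16- and 64-dimensional targets follow the text, while the class name and layer details are assumptions.

```python
import torch
import torch.nn as nn

class StateEmbedding(nn.Module):
    """Concatenate (x_t, y_t, v_t, sin h, cos h), lift to 16-d with an MLP
    (Eq. 4), then to 64-d with an embedding matrix W_f (Eq. 5)."""
    def __init__(self, d_state=5, d_mid=16, d_model=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_state, d_mid), nn.ReLU())
        self.w_f = nn.Linear(d_mid, d_model, bias=False)   # embedding matrix W_f

    def forward(self, x, y, v, heading):
        s = torch.stack([x, y, v, torch.sin(heading), torch.cos(heading)], dim=-1)
        return self.w_f(self.mlp(s))                        # (T, 64) per agent

T = 8  # number of past timestamps (placeholder)
emb = StateEmbedding()(torch.randn(T), torch.randn(T), torch.rand(T), torch.rand(T))
print(emb.shape)  # torch.Size([8, 64])
```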
The state of each past time instant t is time-coded using the Transformer "position coding":

$$P_{(t,2d)} = \sin\!\left(t / 10000^{2d/D}\right) \qquad (6)$$

$$P_{(t,2d+1)} = \cos\!\left(t / 10000^{2d/D}\right) \qquad (7)$$

$$I_a = F_a + P_a \qquad (8)$$

where d = 64 is the data dimension and D = 512 is the model embedding dimension.
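The time coding of Eqs. (6)-(8) follows the standard Transformer sinusoidal position coding; the sketch below computes it over the past timestamps and adds it to the embedded states. The number of past steps (8) and the 64-dimensional width are placeholders.

```python
import torch

def time_encoding(num_steps, d_model=64):
    """Standard sinusoidal position coding over past timestamps (Eqs. 6-7)."""
    t = torch.arange(num_steps, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    dims = torch.arange(0, d_model, 2, dtype=torch.float32)         # even indices 2d
    freq = torch.pow(10000.0, dims / d_model)                       # 10000^(2d/D)
    pe = torch.zeros(num_steps, d_model)
    pe[:, 0::2] = torch.sin(t / freq)
    pe[:, 1::2] = torch.cos(t / freq)
    return pe

F_a = torch.randn(8, 64)         # embedded states from Eq. (5), placeholder values
I_a = F_a + time_encoding(8)     # Eq. (8)
print(I_a.shape)                 # torch.Size([8, 64])
```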
To capture the inherent latent dependencies among the agent's historical states, $I_a$ first undergoes three linear projections to obtain three matrices, $Q_a = W_q I_a$, $K_a = W_k I_a$ and $V_a = W_v I_a$, where $W_q$, $W_k$ and $W_v$ are the projection matrices of $Q_a$, $K_a$ and $V_a$ respectively. A multi-head self-attention mechanism then extracts the internal relations among the historical states, and a feed-forward network of dimension 2048 produces the representation $E_a$ of the history. The self-attention mechanism and the feed-forward network are:

$$\mathrm{Att}(Q_a, K_a, V_a) = \mathrm{softmax}\!\left(\frac{Q_a K_a^{\top}}{\sqrt{d_{K_a}}}\right) V_a \qquad (9)$$

$$E_a = \mathrm{FFN}(\mathrm{Att}(Q_a, K_a, V_a)) \qquad (10)$$

where softmax(·) is the softmax function, Att(·) is multi-head self-attention, FFN(·) is the feed-forward network, and $d_{K_a}$ is the dimension of $K_a$.
In encoding the agent history state, 3 encoder layers of the Transformer are used, each encoder layer having 64 hidden units and 8 attention heads.
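Under the figures quoted here, the state encoder can be sketched with PyTorch's built-in Transformer encoder; d_model = 64, 8 attention heads, 3 layers and a 2048-dimensional feed-forward network follow the text, while the batch-first layout and sequence length are assumptions.

```python
import torch
import torch.nn as nn

# 3 encoder layers, 64 hidden units, 8 heads, 2048-d feed-forward (per the text)
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=8,
                                           dim_feedforward=2048, batch_first=True)
state_encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

I_a = torch.randn(1, 8, 64)      # (batch, past steps, embedding), placeholder input
E_a = state_encoder(I_a)         # Eqs. (9)-(10): self-attention followed by FFN
print(E_a.shape)                 # torch.Size([1, 8, 64])
```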
C. Feature fusion and decoding
C1. Feature fusion. After the cropped feature map $\tilde{F}_c$ is obtained in step A3, the scene information is further extracted with a depthwise separable convolution, i.e. $Q' = \mathrm{DSC}(\tilde{F}_c)$, where DSC(·) is a depthwise separable convolution with 4 hidden layers. An MLP then reduces the dimensionality of the agent's historical state $E_a$ to 32 to obtain $K'$ and $V'$, as in formula (11). Finally, a multi-head attention mechanism fuses the scene information $Q'$ extracted by the depthwise separable convolution with the reduced state information $K'$ and $V'$ to obtain the fused feature $I$, as in formulas (12) and (13), where formula (12) is the attention-head function and formula (13) concatenates 8 attention heads:

$$K' = V' = \mathrm{MLP}(E_a) \qquad (11)$$

$$\mathrm{head}_i = \mathrm{Softmax}\!\left(\frac{(Q' W_i^{Q'})(K' W_i^{K'})^{\top}}{\sqrt{d_{K'}}}\right) V' W_i^{V'}, \quad i = 1, \dots, 8 \qquad (12)$$

$$I = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_8) \qquad (13)$$

where MLP(·) is the multi-layer perceptron, $K'$ is the key of the reduced state information, $V'$ is the value of the reduced state information, $\mathrm{head}_i$ is the i-th attention head, Softmax(·) is the softmax function, $W_i^{Q'}$, $W_i^{K'}$ and $W_i^{V'}$ are the projection matrices of $Q'$, $K'$ and $V'$, $d_{K'} = 32$ is the dimension of $K'$, and Concat(·) is the concatenation function.
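A hedged sketch of this fusion step: a depthwise separable convolution turns the cropped scene patch into query tokens $Q'$, an MLP reduces $E_a$ to 32-dimensional keys and values $K' = V'$ (Eq. 11), and an 8-head attention layer produces the fused feature $I$ (Eqs. 12-13). The single depthwise + pointwise pair (rather than the 4 hidden layers mentioned above), the token layout and the channel sizes are simplifying assumptions.

```python
import torch
import torch.nn as nn

class SceneStateFusion(nn.Module):
    """Fuse scene queries Q' with reduced state keys/values K' = V' via
    8-head attention (a simplified stand-in for Eqs. 11-13)."""
    def __init__(self, scene_ch=32, d_state=64, d_fuse=32, heads=8):
        super().__init__()
        self.dsc = nn.Sequential(                        # depthwise + pointwise conv
            nn.Conv2d(scene_ch, scene_ch, 3, padding=1, groups=scene_ch),
            nn.Conv2d(scene_ch, d_fuse, 1), nn.ReLU())
        self.reduce = nn.Linear(d_state, d_fuse)         # MLP dimension reduction
        self.attn = nn.MultiheadAttention(d_fuse, heads, batch_first=True)

    def forward(self, scene_patch, e_a):
        q = self.dsc(scene_patch).flatten(2).transpose(1, 2)   # (B, H'*W', 32) -> Q'
        kv = self.reduce(e_a)                                   # (B, T, 32) -> K' = V'
        fused, _ = self.attn(q, kv, kv)                         # multi-head fusion
        return fused                                            # fused feature I

I = SceneStateFusion()(torch.randn(1, 32, 64, 64), torch.randn(1, 8, 64))
print(I.shape)  # torch.Size([1, 4096, 32])
```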
C2. Decode the fused features. The fused feature $I$ is fed into a regression branch and a classification branch to obtain K predicted trajectories and their associated probabilities; the regression branch is given by formula (14) and the classification branch by formulas (15) and (16):

$$O_{K,\mathrm{reg}} = \{\hat{\tau}^k\}_{k=1}^{K} = \mathrm{MLP}_3(I) \qquad (14)$$

$$\mathrm{logit} = \mathrm{MLP}_2\{\mathrm{Concat}(O_{K,\mathrm{reg}}, \mathrm{MLP}_1(I))\} \qquad (15)$$

$$\{\hat{p}^k\}_{k=1}^{K} = \mathrm{softmax}(\mathrm{logit}) \qquad (16)$$

where $\mathrm{MLP}_1(\cdot)$, $\mathrm{MLP}_2(\cdot)$ and $\mathrm{MLP}_3(\cdot)$ are 1-layer, 2-layer and 3-layer MLPs, $\hat{\tau}^k$ is the k-th predicted trajectory, and $\hat{p}^k$ is the probability of the k-th predicted trajectory.
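Finally, a sketch of the decoding step under stated assumptions: the fused feature is mean-pooled, a 3-layer MLP regresses K trajectories (Eq. 14), and a 1-layer plus 2-layer MLP pair produces classification logits that a softmax turns into probabilities (Eqs. 15-16). The number of modes K = 6, the prediction horizon, the hidden widths and the pooling step are placeholders not given in the text.

```python
import torch
import torch.nn as nn

class MultiModalDecoder(nn.Module):
    """Regression branch (K trajectories) and classification branch (K probabilities)
    decoding the fused feature I; MLP depths of 1, 2 and 3 layers follow the text."""
    def __init__(self, d_fuse=32, num_modes=6, horizon=12):
        super().__init__()
        self.K, self.T = num_modes, horizon
        self.mlp3 = nn.Sequential(nn.Linear(d_fuse, 128), nn.ReLU(),
                                  nn.Linear(128, 128), nn.ReLU(),
                                  nn.Linear(128, num_modes * horizon * 2))
        self.mlp1 = nn.Linear(d_fuse, 64)
        self.mlp2 = nn.Sequential(nn.Linear(num_modes * horizon * 2 + 64, 64),
                                  nn.ReLU(), nn.Linear(64, num_modes))

    def forward(self, fused):
        pooled = fused.mean(dim=1)                               # (B, d_fuse)
        traj = self.mlp3(pooled)                                 # regression, Eq. (14)
        logit = self.mlp2(torch.cat([traj, self.mlp1(pooled)], dim=-1))  # Eq. (15)
        prob = torch.softmax(logit, dim=-1)                      # Eq. (16)
        return traj.view(-1, self.K, self.T, 2), prob

trajs, probs = MultiModalDecoder()(torch.randn(1, 4096, 32))
print(trajs.shape, probs.shape)  # torch.Size([1, 6, 12, 2]) torch.Size([1, 6])
```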
The present invention is not limited to the above embodiment; any equivalent idea or modification within the technical scope of the present invention falls within its scope of protection.

Claims (1)

1. A multi-modal trajectory prediction method by paying attention to scenes and states, characterized in that: the method comprises the following steps:
A. A fully convolutional neural network extracts scene information and an affine transformation focuses on the target agent
A1, constructing a composite grid map: the environment information is first rendered on a high-definition map under a bird's-eye view to preserve the sizes and positions of the agents and the road geometry while ignoring their texture; to capture the dynamic interactions between agents and between agents and the environment, and to avoid occluded semantic annotations and data redundancy, several consecutive rasterized high-definition maps are split and recombined into a composite grid map $\mathcal{M} \in \mathbb{R}^{H \times W \times C}$, where H is the length of the composite grid map, W is the width, and C is the number of channels;
A2, extracting scene and interaction information with a fully convolutional network: a fully convolutional network (FCN) learns representative topological, semantic and interaction information from the composite grid map $\mathcal{M}$ obtained in step A1:

$$F_s = \mathrm{FCN}(\mathcal{M}; W_s) \qquad (1)$$

where FCN(·) is the fully convolutional neural network, $\mathcal{M}$ is the composite grid map, and $W_s$ denotes the weights of the fully convolutional neural network;

to reduce the number of model parameters, the convolutional neural network (CNN) inside the FCN uses MobileNetV2; the FCN not only yields scene and interaction information but also keeps the output feature map the same size as the input image, so the target agent can be attended to according to its initial position;
A3, attending to the target agent: an agent usually pays more attention to objects that are closer to it and interact with it more strongly, so after the feature map $F_s$ is obtained in step A2, a small feature map $F_c$ is cropped at the target agent's position to focus on that agent; after the small feature map $F_c$ is obtained, it is rotated by an affine transformation according to the agent's orientation so as to normalize the orientation of the agent; the affine transformation and the affine matrix are given by formulas (2) and (3):

$$\tilde{F}_c = \mathrm{Affine}(F_c, \theta) \qquad (2)$$

$$\theta = \begin{bmatrix} \cos h & -\sin h & 0 \\ \sin h & \cos h & 0 \end{bmatrix} \qquad (3)$$

where Affine(·) is the affine transformation function, θ is the affine matrix, and h is the orientation of the agent;
B. encoding agent historical state using Transformer
before the historical state information is encoded with a Transformer, the state variables are first concatenated, then a multi-layer perceptron (MLP) embeds them into a high-dimensional space to obtain $f_t$ of fixed size, and finally $f_t$ is embedded into a higher-dimensional space to obtain $F_a$; the MLP embedding and the embedding-matrix projection are given by:

$$f_t = \mathrm{MLP}(\mathrm{concat}(x_t, y_t, v_t, \sin(h), \cos(h))) \qquad (4)$$

$$F_a = f_t \cdot W_f \qquad (5)$$

where $(x_t, y_t)$ are the position coordinates of the agent, $v_t$ is its speed, $h$ is its orientation, concat(·) is the concatenation function, MLP(·) is the multi-layer perceptron, $W_f$ is the embedding matrix, and $t$ is the timestamp; the historical state information includes position, speed and orientation;
each past time instant t is time-coded using the Transformer "position coding", which is formulated as follows:
Figure FDA0003507544480000022
Figure FDA0003507544480000023
Ia=Fa+Pa (8)
wherein D is the data dimension and D is the model embedding dimension;
to capture the inherent latent dependencies among the agent's historical states, $I_a$ first undergoes three linear projections to obtain three matrices, $Q_a = W_q I_a$, $K_a = W_k I_a$ and $V_a = W_v I_a$, where $W_q$, $W_k$ and $W_v$ are the projection matrices of $Q_a$, $K_a$ and $V_a$ respectively; a self-attention mechanism then extracts the internal relations among the agent's historical states; finally, a feed-forward network produces the final representation $E_a$ of the agent's historical states; the self-attention mechanism (9) and the feed-forward network (10) are:

$$\mathrm{Att}(Q_a, K_a, V_a) = \mathrm{softmax}\!\left(\frac{Q_a K_a^{\top}}{\sqrt{d_{K_a}}}\right) V_a \qquad (9)$$

$$E_a = \mathrm{FFN}(\mathrm{Att}(Q_a, K_a, V_a)) \qquad (10)$$

where softmax(·) is the softmax function, Att(·) is multi-head self-attention, FFN(·) is the feed-forward network, and $d_{K_a}$ is the dimension of $K_a$;
C. feature fusion and decoding
C1, feature fusion: after the cropped feature map $\tilde{F}_c$ is obtained in step A3, the scene information is further extracted with a depthwise separable convolution, i.e. $Q' = \mathrm{DSC}(\tilde{F}_c)$, where DSC(·) is the depthwise separable convolution; an MLP then reduces the dimensionality of the agent's historical state $E_a$ to obtain $K'$ and $V'$, as in formula (11); finally, a multi-head attention mechanism fuses the scene information $Q'$ extracted by the depthwise separable convolution with the reduced state information $K'$ and $V'$ to obtain the fused feature $I$, as in formulas (12) and (13), where formula (12) is the attention-head function and formula (13) concatenates 8 attention heads:

$$K' = V' = \mathrm{MLP}(E_a) \qquad (11)$$

$$\mathrm{head}_i = \mathrm{Softmax}\!\left(\frac{(Q' W_i^{Q'})(K' W_i^{K'})^{\top}}{\sqrt{d_{K'}}}\right) V' W_i^{V'}, \quad i = 1, \dots, 8 \qquad (12)$$

$$I = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_8) \qquad (13)$$

where MLP(·) is the multi-layer perceptron, $K'$ is the key of the reduced state information, $V'$ is the value of the reduced state information, $\mathrm{head}_i$ is the i-th attention head, Softmax(·) is the softmax function, $W_i^{Q'}$, $W_i^{K'}$ and $W_i^{V'}$ are the projection matrices of $Q'$, $K'$ and $V'$, $d_{K'}$ is the dimension of $K'$, and Concat(·) is the concatenation function;
C2, decoding the fused features: the fused feature $I$ is fed into a regression branch and a classification branch to obtain K predicted trajectories and their associated probabilities; the regression branch is given by formula (14) and the classification branch by formulas (15) and (16):

$$O_{K,\mathrm{reg}} = \{\hat{\tau}^k\}_{k=1}^{K} = \mathrm{MLP}_3(I) \qquad (14)$$

$$\mathrm{logit} = \mathrm{MLP}_2\{\mathrm{Concat}(O_{K,\mathrm{reg}}, \mathrm{MLP}_1(I))\} \qquad (15)$$

$$\{\hat{p}^k\}_{k=1}^{K} = \mathrm{softmax}(\mathrm{logit}) \qquad (16)$$

where $\mathrm{MLP}_1(\cdot)$, $\mathrm{MLP}_2(\cdot)$ and $\mathrm{MLP}_3(\cdot)$ are 1-layer, 2-layer and 3-layer multi-layer perceptrons, $\hat{\tau}^k$ is the k-th predicted trajectory, and $\hat{p}^k$ is the probability of the k-th predicted trajectory.
CN202210142283.8A 2022-02-16 2022-02-16 Multi-modal trajectory prediction method by paying attention to scenes and states Pending CN114707630A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210142283.8A CN114707630A (en) 2022-02-16 2022-02-16 Multi-modal trajectory prediction method by paying attention to scenes and states

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210142283.8A CN114707630A (en) 2022-02-16 2022-02-16 Multi-modal trajectory prediction method by paying attention to scenes and states

Publications (1)

Publication Number Publication Date
CN114707630A 2022-07-05

Family

ID=82166723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210142283.8A Pending CN114707630A (en) 2022-02-16 2022-02-16 Multi-modal trajectory prediction method by paying attention to scenes and states

Country Status (1)

Country Link
CN (1) CN114707630A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071925A (en) * 2023-02-13 2023-05-05 北京爱芯科技有限公司 Track prediction method and device and electronic processing device
CN116071925B (en) * 2023-02-13 2024-04-12 北京爱芯科技有限公司 Track prediction method and device and electronic processing device
CN116629462A (en) * 2023-07-25 2023-08-22 清华大学 Multi-agent unified interaction track prediction method, system, equipment and medium
CN116629462B (en) * 2023-07-25 2023-11-21 清华大学 Multi-agent unified interaction track prediction method, system, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination