CN114707630A - Multi-modal trajectory prediction method by paying attention to scenes and states - Google Patents

Multi-modal trajectory prediction method by paying attention to scenes and states

Info

Publication number
CN114707630A
CN114707630A (Application CN202210142283.8A)
Authority
CN
China
Prior art keywords
agent
information
attention
formula
mlp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210142283.8A
Other languages
Chinese (zh)
Inventor
李琳辉
王雪成
连静
丁荣琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202210142283.8A priority Critical patent/CN114707630A/en
Publication of CN114707630A publication Critical patent/CN114707630A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/02 - Affine transformations
    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G1/00 - Traffic control systems for road vehicles
    • G08G1/01 - Detecting movement of traffic to be counted or controlled
    • G08G1/0104 - Measuring and analyzing of parameters relative to traffic conditions

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal trajectory prediction method that attends to scenes and states, comprising the following steps: a fully convolutional neural network extracts scene information and an affine transformation focuses on the target agent; a Transformer encodes the agent's historical states; and the features are fused and decoded. The method jointly considers the agent's historical state information and its surrounding environment information, and combines three attention mechanisms to extract and fuse these two kinds of information: the Transformer's self-attention captures the latent relations among the historical states; the affine transformation crops key positions from the feature map, acting as a hard attention mechanism that focuses on a specific agent; and the multi-head attention mechanism effectively captures the interaction between state and scene. The method finally generates multiple socially acceptable trajectories and their associated probabilities.

Description

Multi-modal trajectory prediction method by paying attention to scenes and states
Technical Field
The invention belongs to the field of automatic driving, and particularly relates to a multi-modal trajectory prediction method for intelligent agents.
Background
In recent years, Artificial Intelligence (AI) has developed at an unprecedented pace, and autonomous driving has become one of its most promising applications; its impact on the production and lifestyle of human society is expected to be no less profound than the invention of the automobile more than a century ago. At present, autonomous driving technology is not yet mature, and intelligent vehicles must share traffic scenes with human drivers. To travel safely and efficiently, an intelligent vehicle needs to know not only which objects are around it, but also which trajectories the surrounding agents (vehicles, cyclists, pedestrians, etc.) are likely to follow. This is the trajectory prediction problem in the field of automatic driving.
Trajectory prediction links environment perception and decision planning in automatic driving. Accurately predicting the future trajectories of agents in dynamic, interactive and uncertain scenes is essential for an autonomous vehicle to understand its surroundings, make decisions, and drive stably, safely and economically. However, trajectory prediction is a very challenging task because of the randomness of agent motion, the complexity of interactions, and the multi-modal nature of future trajectory distributions. Early methods based on dynamics and kinematics have poor accuracy and generalization ability and are therefore suitable only for short-term trajectory prediction in simple driving scenes. In recent years, trajectory prediction methods based on deep learning have become popular. Some methods rely only on agents' historical trajectories and design various interaction mechanisms to capture inter-agent interaction information, but they ignore other important factors affecting agent motion, such as environmental conditions and traffic rules, which can make predictions unacceptable in complex traffic scenes; moreover, most of these methods produce only a single predicted trajectory, which does not reflect the multi-modal nature of future motion. End-to-end approaches use only convolutional neural networks (CNNs) to extract agent state information and scene information from image sequences; although CNNs handle spatial dependencies well, they lack a mechanism for modeling sequential data, whereas trajectory prediction requires modeling how the agent's state depends on time. Other methods use both historical trajectories and scene information but employ no attention mechanism, so the model cannot identify the most important factors influencing future motion; the resulting data redundancy and model complexity lead to inaccurate predictions with poor social acceptability.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a multi-modal trajectory prediction method capable of generating a plurality of socially acceptable trajectories and associated probabilities thereof by paying attention to scenes and states.
To achieve this purpose, the specific technical scheme of the invention is as follows: a multi-modal trajectory prediction method that attends to scenes and states, comprising the following steps:
A. A fully convolutional neural network extracts scene information and an affine transformation focuses on the target agent
A1. Construct a composite grid map. The environment information is first rendered on a high-definition map under a bird's-eye view to preserve the sizes and positions of the agents and the road geometry while ignoring their texture. To capture the dynamic interactions between agents and between agents and the environment, and to avoid occluded semantic annotations and data redundancy, several consecutive rasterized high-definition maps are split and recombined into a composite grid map $\mathcal{M} \in \mathbb{R}^{H \times W \times C}$, where H is the length of the composite grid map, W is the width, and C is the number of channels.
A2. Extract scene and interaction information with the fully convolutional network. A fully convolutional network (FCN) learns representative topological, semantic and interaction information from the composite grid map $\mathcal{M}$ obtained in step A1:

$$F_s = \mathrm{FCN}(\mathcal{M}; W_s) \qquad (1)$$

where FCN(·) is the fully convolutional neural network, $\mathcal{M}$ is the composite grid map, and $W_s$ denotes the weights of the fully convolutional neural network.

To reduce the number of model parameters, the convolutional neural network (CNN) inside the FCN uses MobileNetV2. Using an FCN not only yields scene and interaction information, it also keeps the output feature map the same size as the input image, so the target agent can be attended to according to its initial position.
A3. Attend to the target agent. An agent usually pays more attention to objects that are closer to it and interact with it more strongly, so after the feature map $F_s$ is obtained in step A2, a small feature map $F_c$ is cropped at the target agent's position to focus on that agent. After the small feature map $F_c$ is obtained, it is rotated by an affine transformation according to the agent's orientation so as to normalize the orientation of the agent. The affine transformation and the affine matrix are given by formulas (2) and (3):

$$\tilde{F}_c = \mathrm{Affine}(F_c, \theta) \qquad (2)$$

$$\theta = \begin{bmatrix} \cos h & -\sin h & 0 \\ \sin h & \cos h & 0 \end{bmatrix} \qquad (3)$$

where Affine(·) is the affine transformation function, θ is the affine matrix, and h is the orientation of the agent.
B. Encoding agent historical state using Transformer
Before the historical state information is encoded with a Transformer, the state variables are first concatenated, then a multi-layer perceptron (MLP) embeds them into a high-dimensional space to obtain $f_t$ of fixed size, and finally $f_t$ is embedded into a higher-dimensional space to obtain $F_a$. The MLP embedding and the embedding-matrix projection are given by:

$$f_t = \mathrm{MLP}(\mathrm{concat}(x_t, y_t, v_t, \sin(h), \cos(h))) \qquad (4)$$

$$F_a = f_t \cdot W_f \qquad (5)$$

where $(x_t, y_t)$ are the position coordinates of the agent, $v_t$ is its speed, $h$ is its orientation, concat(·) is the concatenation function, MLP(·) is the multi-layer perceptron, $W_f$ is the embedding matrix, and $t$ is the timestamp. The historical state information includes position, speed and orientation.
Each past time instant t is time-coded using the Transformer "position coding":

$$P_{(t,2d)} = \sin\!\left(t / 10000^{2d/D}\right) \qquad (6)$$

$$P_{(t,2d+1)} = \cos\!\left(t / 10000^{2d/D}\right) \qquad (7)$$

$$I_a = F_a + P_a \qquad (8)$$

where d is the data dimension index and D is the model embedding dimension.
To capture the inherent latent dependencies among the agent's historical states, $I_a$ first undergoes three linear projections to obtain three matrices, $Q_a = W_q I_a$, $K_a = W_k I_a$ and $V_a = W_v I_a$, where $W_q$, $W_k$ and $W_v$ are the projection matrices of $Q_a$, $K_a$ and $V_a$ respectively. A self-attention mechanism then extracts the internal relations among the historical states, and a feed-forward network produces the final representation $E_a$ of the agent's historical states. The self-attention mechanism (9) and the feed-forward network (10) are:

$$\mathrm{Att}(Q_a, K_a, V_a) = \mathrm{softmax}\!\left(\frac{Q_a K_a^{\top}}{\sqrt{d_{K_a}}}\right) V_a \qquad (9)$$

$$E_a = \mathrm{FFN}(\mathrm{Att}(Q_a, K_a, V_a)) \qquad (10)$$

where softmax(·) is the softmax function, Att(·) is multi-head self-attention, FFN(·) is the feed-forward network, and $d_{K_a}$ is the dimension of $K_a$.
C. Feature fusion and decoding
C1. Feature fusion. After the cropped feature map $\tilde{F}_c$ is obtained in step A3, the scene information is further extracted with a depthwise separable convolution, i.e. $Q' = \mathrm{DSC}(\tilde{F}_c)$, where DSC(·) is the depthwise separable convolution. An MLP then reduces the dimensionality of the agent's historical state $E_a$ to obtain $K'$ and $V'$, as in formula (11). Finally, a multi-head attention mechanism fuses the scene information $Q'$ extracted by the depthwise separable convolution with the reduced state information $K'$ and $V'$ to obtain the fused feature $I$, as in formulas (12) and (13), where formula (12) is the attention-head function and formula (13) concatenates 8 attention heads:

$$K' = V' = \mathrm{MLP}(E_a) \qquad (11)$$

$$\mathrm{head}_i = \mathrm{Softmax}\!\left(\frac{(Q' W_i^{Q'})(K' W_i^{K'})^{\top}}{\sqrt{d_{K'}}}\right) V' W_i^{V'}, \quad i = 1, \dots, 8 \qquad (12)$$

$$I = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_8) \qquad (13)$$

where MLP(·) is the multi-layer perceptron, $K'$ is the key of the reduced state information, $V'$ is the value of the reduced state information, $\mathrm{head}_i$ is the i-th attention head, Softmax(·) is the softmax function, $W_i^{Q'}$, $W_i^{K'}$ and $W_i^{V'}$ are the projection matrices of $Q'$, $K'$ and $V'$, $d_{K'}$ is the dimension of $K'$, and Concat(·) is the concatenation function.
C2. Decode the fused features. The fused feature $I$ is fed into a regression branch and a classification branch to obtain K predicted trajectories and their associated probabilities; the regression branch is given by formula (14) and the classification branch by formulas (15) and (16):

$$O_{K,\mathrm{reg}} = \{\hat{\tau}^k\}_{k=1}^{K} = \mathrm{MLP}_3(I) \qquad (14)$$

$$\mathrm{logit} = \mathrm{MLP}_2\{\mathrm{Concat}(O_{K,\mathrm{reg}}, \mathrm{MLP}_1(I))\} \qquad (15)$$

$$\{\hat{p}^k\}_{k=1}^{K} = \mathrm{softmax}(\mathrm{logit}) \qquad (16)$$

where $\mathrm{MLP}_1(\cdot)$, $\mathrm{MLP}_2(\cdot)$ and $\mathrm{MLP}_3(\cdot)$ are 1-layer, 2-layer and 3-layer multi-layer perceptrons, $\hat{\tau}^k$ is the k-th predicted trajectory, and $\hat{p}^k$ is the probability of the k-th predicted trajectory.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention performs multi-modal trajectory prediction by attending to scenes and states, jointly considering scene context information and the agent's historical state information to generate several accurate and socially acceptable future trajectories and their probabilities.
2. The invention avoids the problems of occluded semantic annotations and data redundancy through its scene representation method and the affine transformation.
3. The method combines three attention mechanisms (the affine transformation as a hard attention mechanism, the Transformer self-attention mechanism, and the multi-head attention mechanism) to fuse scene information and state information, improving prediction accuracy and social acceptability compared with traditional trajectory prediction methods.
4. In summary, the invention jointly considers the agent's historical state information and its surrounding environment information, and combines three attention mechanisms to extract and fuse the two kinds of information: the Transformer's self-attention captures the latent relations among the historical states; the affine transformation crops key positions from the feature map, acting as a hard attention mechanism that focuses on a specific agent; and the multi-head attention mechanism effectively captures the interaction between state and scene. The method finally produces multiple socially acceptable trajectories and their associated probabilities.
Drawings
The invention comprises 4 drawings, wherein:
FIG. 1 is a flow chart of a method for multi-modal trajectory prediction by noting the scene and state.
Fig. 2 is a model structure diagram of a multi-modal trajectory prediction method by paying attention to scenes and states.
Fig. 3 is a composite grid map.
Fig. 4 is a diagram of a full convolutional neural network FCN architecture.
Detailed Description
The following detailed description of the embodiments of the invention refers to the accompanying drawings. As shown in fig. 1, a multi-modal trajectory prediction method by paying attention to scenes and states includes the following steps:
A. A fully convolutional neural network extracts scene information and an affine transformation focuses on the target agent
A1. Construct a composite grid map. Accurately predicting a target's trajectory requires not only the historical state information of the target agent but also scene semantic information; therefore, as shown in Fig. 2, the environment information is rendered on a high-definition map under a bird's-eye view. This representation of the scene information is simple, preserves the sizes and positions of the agents and the road geometry, and ignores their texture. To capture the dynamic interactions between agents and between agents and the environment, and to avoid occluded semantic annotations and data redundancy, the original rasterized high-definition map is split into the form shown in Fig. 3, a single channel is taken from each picture, and all the resulting channels are stacked to obtain the composite grid map $\mathcal{M} \in \mathbb{R}^{H \times W \times C}$, where H = 512 is the length of the composite grid map, W = 512 is the width, and C = 6 is the number of channels.
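As a minimal illustration of step A1, the sketch below stacks one channel from each of C = 6 consecutive rasterized bird's-eye-view frames into an H × W × C composite grid map. The function name, the use of NumPy, and the rule of keeping the first channel of each frame are assumptions for illustration; the text only specifies that one channel is taken from each picture and the channels are stacked.

```python
import numpy as np

def build_composite_grid_map(raster_frames, h=512, w=512):
    """Stack one channel from each of C consecutive rasterized HD-map renderings
    into a composite grid map of shape (C, H, W). Channel choice is an assumption."""
    channels = []
    for frame in raster_frames:
        assert frame.shape[:2] == (h, w)
        channels.append(frame[..., 0])        # keep a single channel per frame
    return np.stack(channels, axis=0)          # (C, H, W) composite grid map

# toy usage: 6 consecutive rasterized frames
frames = [np.random.rand(512, 512, 3) for _ in range(6)]
grid_map = build_composite_grid_map(frames)
print(grid_map.shape)  # (6, 512, 512)
```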
A2. Extract scene and interaction information with the fully convolutional network. A fully convolutional network (FCN) combines shallow, fine appearance information with deep, coarse semantic information, and learns representative topological, semantic and interaction information from the composite grid map $\mathcal{M}$ obtained in step A1:

$$F_s = \mathrm{FCN}(\mathcal{M}; W_s) \qquad (1)$$

where FCN(·) is the fully convolutional network, $\mathcal{M}$ is the composite grid map, and $W_s$ denotes the weights of the fully convolutional network.

To reduce the number of model parameters, the CNN inside the FCN uses the lightweight network MobileNetV2, which offers good performance with few parameters. Skip connections link the 7th, 14th and 19th layers of MobileNetV2 with the corresponding deconvolution outputs to obtain the feature map $F_s$; see Fig. 4. Using an FCN not only captures the interaction between the agents and the environmental context; more importantly, it keeps the output feature map the same size as the input composite grid map, so the target agent can be attended to according to its initial position.
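The following PyTorch sketch shows one possible FCN of the kind described: a MobileNetV2 backbone adapted to the 6-channel composite grid map, with skip connections from three stages fused through transposed convolutions so the output feature map matches the 512 × 512 input. The stage split, lateral 1×1 convolutions and output channel count are assumptions; the text only states that the 7th, 14th and 19th layers of MobileNetV2 are connected to the deconvolution outputs.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class SceneFCN(nn.Module):
    """Hypothetical FCN-style scene encoder with a MobileNetV2 backbone and
    skip connections fused by transposed-convolution upsampling."""
    def __init__(self, in_ch=6, out_ch=32):
        super().__init__()
        backbone = mobilenet_v2().features
        # adapt the first conv to the 6-channel composite grid map
        backbone[0][0] = nn.Conv2d(in_ch, 32, 3, stride=2, padding=1, bias=False)
        self.stage1 = backbone[:7]     # up to the 7th layer,  1/8  res,   32 ch
        self.stage2 = backbone[7:14]   # up to the 14th layer, 1/16 res,   96 ch
        self.stage3 = backbone[14:]    # up to the 19th layer, 1/32 res, 1280 ch
        self.lat1 = nn.Conv2d(32, out_ch, 1)
        self.lat2 = nn.Conv2d(96, out_ch, 1)
        self.lat3 = nn.Conv2d(1280, out_ch, 1)
        self.up2 = nn.ConvTranspose2d(out_ch, out_ch, 4, stride=2, padding=1)
        self.up2b = nn.ConvTranspose2d(out_ch, out_ch, 4, stride=2, padding=1)
        self.up8 = nn.ConvTranspose2d(out_ch, out_ch, 16, stride=8, padding=4)

    def forward(self, x):
        s1 = self.stage1(x)                          # (B, 32, H/8, W/8)
        s2 = self.stage2(s1)                         # (B, 96, H/16, W/16)
        s3 = self.stage3(s2)                         # (B, 1280, H/32, W/32)
        f = self.up2(self.lat3(s3)) + self.lat2(s2)  # fuse at 1/16 resolution
        f = self.up2b(f) + self.lat1(s1)             # fuse at 1/8 resolution
        return self.up8(f)                           # (B, out_ch, H, W)

feat = SceneFCN()(torch.randn(1, 6, 512, 512))
print(feat.shape)  # torch.Size([1, 32, 512, 512]): same spatial size as the input
```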
A3. Attend to the target agent. An agent is generally more concerned with objects that are closer to it and interact with it more strongly, so most of the information in the whole map is not useful to a single agent. After the feature map $F_s$ is obtained in step A2, a small feature map $F_c$ is cropped at the target agent's position to focus on that agent. The agent's orientation is also valuable information: after the small feature map $F_c$ is obtained, it is rotated by an affine transformation according to the agent's orientation, so that the orientations of all agents are unified. The affine transformation and the affine matrix are given by formulas (2) and (3):

$$\tilde{F}_c = \mathrm{Affine}(F_c, \theta) \qquad (2)$$

$$\theta = \begin{bmatrix} \cos h & -\sin h & 0 \\ \sin h & \cos h & 0 \end{bmatrix} \qquad (3)$$

where Affine(·) is the affine transformation function, θ is the affine matrix, and h is the orientation of the agent.
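A hedged sketch of this hard-attention step: crop a window of the FCN feature map around the target agent's pixel position and rotate it by the agent's heading with an affine grid, so every cropped patch is orientation-normalized. The crop size, the pixel-coordinate convention and the use of torch.nn.functional.affine_grid / grid_sample are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def focus_on_agent(feature_map, cx, cy, heading, crop=64):
    """Crop a (crop x crop) window around pixel (cx, cy) and rotate it by
    `heading` (radians) so the agent's orientation is normalized."""
    b, c, h, w = feature_map.shape
    half = crop // 2
    patch = feature_map[:, :, cy - half:cy + half, cx - half:cx + half]
    # 2x3 affine matrix of a pure rotation by `heading`
    cos_h, sin_h = torch.cos(heading), torch.sin(heading)
    theta = torch.stack([
        torch.stack([cos_h, -sin_h, torch.zeros_like(cos_h)]),
        torch.stack([sin_h,  cos_h, torch.zeros_like(cos_h)]),
    ]).unsqueeze(0).repeat(b, 1, 1)                        # (B, 2, 3)
    grid = F.affine_grid(theta, patch.shape, align_corners=False)
    return F.grid_sample(patch, grid, align_corners=False)

patch = focus_on_agent(torch.randn(1, 32, 512, 512), cx=256, cy=256,
                       heading=torch.tensor(0.5))
print(patch.shape)  # torch.Size([1, 32, 64, 64])
```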
B. Encoding agent historical state using Transformer
Before the target agent's historical state information (position, speed, orientation, etc.) is encoded with a Transformer, all state variables are first concatenated, then a multi-layer perceptron (MLP) embeds them into a 16-dimensional space to obtain $f_t$ of fixed input size, and finally $f_t$ is embedded into a higher, 64-dimensional space to obtain $F_a$. The MLP embedding and the embedding-matrix projection are given by:

$$f_t = \mathrm{MLP}(\mathrm{concat}(x_t, y_t, v_t, \sin(h), \cos(h))) \qquad (4)$$

$$F_a = f_t \cdot W_f \qquad (5)$$

where $(x_t, y_t)$ are the position coordinates of the agent, $v_t$ is its speed, $h$ is its orientation, concat(·) is the concatenation function, MLP(·) is the multi-layer perceptron, $W_f$ is the embedding matrix, and $t$ is the timestamp.
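A minimal sketch of the state embedding of Eqs. (4) and (5), assuming a single hidden ReLU layer for the MLP and a bias-free linear layer for the embedding matrix $W_f$; the 5-dimensional input and the 16- and 64-dimensional targets follow the text, while the class name and layer details are assumptions.

```python
import torch
import torch.nn as nn

class StateEmbedding(nn.Module):
    """Concatenate (x_t, y_t, v_t, sin h, cos h), lift to 16-d with an MLP
    (Eq. 4), then to 64-d with an embedding matrix W_f (Eq. 5)."""
    def __init__(self, d_state=5, d_mid=16, d_model=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_state, d_mid), nn.ReLU())
        self.w_f = nn.Linear(d_mid, d_model, bias=False)   # embedding matrix W_f

    def forward(self, x, y, v, heading):
        s = torch.stack([x, y, v, torch.sin(heading), torch.cos(heading)], dim=-1)
        return self.w_f(self.mlp(s))                        # (T, 64) per agent

T = 8  # number of past timestamps (placeholder)
emb = StateEmbedding()(torch.randn(T), torch.randn(T), torch.rand(T), torch.rand(T))
print(emb.shape)  # torch.Size([8, 64])
```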
The state of each past time instant t is time-coded using the Transformer "position coding":

$$P_{(t,2d)} = \sin\!\left(t / 10000^{2d/D}\right) \qquad (6)$$

$$P_{(t,2d+1)} = \cos\!\left(t / 10000^{2d/D}\right) \qquad (7)$$

$$I_a = F_a + P_a \qquad (8)$$

where d = 64 is the data dimension and D = 512 is the model embedding dimension.
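The time coding of Eqs. (6)-(8) follows the standard Transformer sinusoidal position coding; the sketch below computes it over the past timestamps and adds it to the embedded states. The number of past steps (8) and the 64-dimensional width are placeholders.

```python
import torch

def time_encoding(num_steps, d_model=64):
    """Standard sinusoidal position coding over past timestamps (Eqs. 6-7)."""
    t = torch.arange(num_steps, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    dims = torch.arange(0, d_model, 2, dtype=torch.float32)         # even indices 2d
    freq = torch.pow(10000.0, dims / d_model)                       # 10000^(2d/D)
    pe = torch.zeros(num_steps, d_model)
    pe[:, 0::2] = torch.sin(t / freq)
    pe[:, 1::2] = torch.cos(t / freq)
    return pe

F_a = torch.randn(8, 64)         # embedded states from Eq. (5), placeholder values
I_a = F_a + time_encoding(8)     # Eq. (8)
print(I_a.shape)                 # torch.Size([8, 64])
```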
To capture the inherent latent dependencies among the agent's historical states, $I_a$ first undergoes three linear projections to obtain three matrices, $Q_a = W_q I_a$, $K_a = W_k I_a$ and $V_a = W_v I_a$, where $W_q$, $W_k$ and $W_v$ are the projection matrices of $Q_a$, $K_a$ and $V_a$ respectively. A multi-head self-attention mechanism then extracts the internal relations among the historical states, and a feed-forward network of dimension 2048 produces the representation $E_a$ of the history. The self-attention mechanism and the feed-forward network are:

$$\mathrm{Att}(Q_a, K_a, V_a) = \mathrm{softmax}\!\left(\frac{Q_a K_a^{\top}}{\sqrt{d_{K_a}}}\right) V_a \qquad (9)$$

$$E_a = \mathrm{FFN}(\mathrm{Att}(Q_a, K_a, V_a)) \qquad (10)$$

where softmax(·) is the softmax function, Att(·) is multi-head self-attention, FFN(·) is the feed-forward network, and $d_{K_a}$ is the dimension of $K_a$.
In encoding the agent history state, 3 encoder layers of the Transformer are used, each encoder layer having 64 hidden units and 8 attention heads.
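Under the figures quoted here, the state encoder can be sketched with PyTorch's built-in Transformer encoder; d_model = 64, 8 attention heads, 3 layers and a 2048-dimensional feed-forward network follow the text, while the batch-first layout and sequence length are assumptions.

```python
import torch
import torch.nn as nn

# 3 encoder layers, 64 hidden units, 8 heads, 2048-d feed-forward (per the text)
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=8,
                                           dim_feedforward=2048, batch_first=True)
state_encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

I_a = torch.randn(1, 8, 64)      # (batch, past steps, embedding), placeholder input
E_a = state_encoder(I_a)         # Eqs. (9)-(10): self-attention followed by FFN
print(E_a.shape)                 # torch.Size([1, 8, 64])
```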
C. Feature fusion and decoding
C1. Feature fusion. After the cropped feature map $\tilde{F}_c$ is obtained in step A3, the scene information is further extracted with a depthwise separable convolution, i.e. $Q' = \mathrm{DSC}(\tilde{F}_c)$, where DSC(·) is a depthwise separable convolution with 4 hidden layers. An MLP then reduces the dimensionality of the agent's historical state $E_a$ to 32 to obtain $K'$ and $V'$, as in formula (11). Finally, a multi-head attention mechanism fuses the scene information $Q'$ extracted by the depthwise separable convolution with the reduced state information $K'$ and $V'$ to obtain the fused feature $I$, as in formulas (12) and (13), where formula (12) is the attention-head function and formula (13) concatenates 8 attention heads:

$$K' = V' = \mathrm{MLP}(E_a) \qquad (11)$$

$$\mathrm{head}_i = \mathrm{Softmax}\!\left(\frac{(Q' W_i^{Q'})(K' W_i^{K'})^{\top}}{\sqrt{d_{K'}}}\right) V' W_i^{V'}, \quad i = 1, \dots, 8 \qquad (12)$$

$$I = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_8) \qquad (13)$$

where MLP(·) is the multi-layer perceptron, $K'$ is the key of the reduced state information, $V'$ is the value of the reduced state information, $\mathrm{head}_i$ is the i-th attention head, Softmax(·) is the softmax function, $W_i^{Q'}$, $W_i^{K'}$ and $W_i^{V'}$ are the projection matrices of $Q'$, $K'$ and $V'$, $d_{K'} = 32$ is the dimension of $K'$, and Concat(·) is the concatenation function.
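A hedged sketch of this fusion step: a depthwise separable convolution turns the cropped scene patch into query tokens $Q'$, an MLP reduces $E_a$ to 32-dimensional keys and values $K' = V'$ (Eq. 11), and an 8-head attention layer produces the fused feature $I$ (Eqs. 12-13). The single depthwise + pointwise pair (rather than the 4 hidden layers mentioned above), the token layout and the channel sizes are simplifying assumptions.

```python
import torch
import torch.nn as nn

class SceneStateFusion(nn.Module):
    """Fuse scene queries Q' with reduced state keys/values K' = V' via
    8-head attention (a simplified stand-in for Eqs. 11-13)."""
    def __init__(self, scene_ch=32, d_state=64, d_fuse=32, heads=8):
        super().__init__()
        self.dsc = nn.Sequential(                        # depthwise + pointwise conv
            nn.Conv2d(scene_ch, scene_ch, 3, padding=1, groups=scene_ch),
            nn.Conv2d(scene_ch, d_fuse, 1), nn.ReLU())
        self.reduce = nn.Linear(d_state, d_fuse)         # MLP dimension reduction
        self.attn = nn.MultiheadAttention(d_fuse, heads, batch_first=True)

    def forward(self, scene_patch, e_a):
        q = self.dsc(scene_patch).flatten(2).transpose(1, 2)   # (B, H'*W', 32) -> Q'
        kv = self.reduce(e_a)                                   # (B, T, 32) -> K' = V'
        fused, _ = self.attn(q, kv, kv)                         # multi-head fusion
        return fused                                            # fused feature I

I = SceneStateFusion()(torch.randn(1, 32, 64, 64), torch.randn(1, 8, 64))
print(I.shape)  # torch.Size([1, 4096, 32])
```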
C2. Decode the fused features. The fused feature $I$ is fed into a regression branch and a classification branch to obtain K predicted trajectories and their associated probabilities; the regression branch is given by formula (14) and the classification branch by formulas (15) and (16):

$$O_{K,\mathrm{reg}} = \{\hat{\tau}^k\}_{k=1}^{K} = \mathrm{MLP}_3(I) \qquad (14)$$

$$\mathrm{logit} = \mathrm{MLP}_2\{\mathrm{Concat}(O_{K,\mathrm{reg}}, \mathrm{MLP}_1(I))\} \qquad (15)$$

$$\{\hat{p}^k\}_{k=1}^{K} = \mathrm{softmax}(\mathrm{logit}) \qquad (16)$$

where $\mathrm{MLP}_1(\cdot)$, $\mathrm{MLP}_2(\cdot)$ and $\mathrm{MLP}_3(\cdot)$ are 1-layer, 2-layer and 3-layer MLPs, $\hat{\tau}^k$ is the k-th predicted trajectory, and $\hat{p}^k$ is the probability of the k-th predicted trajectory.
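Finally, a sketch of the decoding step under stated assumptions: the fused feature is mean-pooled, a 3-layer MLP regresses K trajectories (Eq. 14), and a 1-layer plus 2-layer MLP pair produces classification logits that a softmax turns into probabilities (Eqs. 15-16). The number of modes K = 6, the prediction horizon, the hidden widths and the pooling step are placeholders not given in the text.

```python
import torch
import torch.nn as nn

class MultiModalDecoder(nn.Module):
    """Regression branch (K trajectories) and classification branch (K probabilities)
    decoding the fused feature I; MLP depths of 1, 2 and 3 layers follow the text."""
    def __init__(self, d_fuse=32, num_modes=6, horizon=12):
        super().__init__()
        self.K, self.T = num_modes, horizon
        self.mlp3 = nn.Sequential(nn.Linear(d_fuse, 128), nn.ReLU(),
                                  nn.Linear(128, 128), nn.ReLU(),
                                  nn.Linear(128, num_modes * horizon * 2))
        self.mlp1 = nn.Linear(d_fuse, 64)
        self.mlp2 = nn.Sequential(nn.Linear(num_modes * horizon * 2 + 64, 64),
                                  nn.ReLU(), nn.Linear(64, num_modes))

    def forward(self, fused):
        pooled = fused.mean(dim=1)                               # (B, d_fuse)
        traj = self.mlp3(pooled)                                 # regression, Eq. (14)
        logit = self.mlp2(torch.cat([traj, self.mlp1(pooled)], dim=-1))  # Eq. (15)
        prob = torch.softmax(logit, dim=-1)                      # Eq. (16)
        return traj.view(-1, self.K, self.T, 2), prob

trajs, probs = MultiModalDecoder()(torch.randn(1, 4096, 32))
print(trajs.shape, probs.shape)  # torch.Size([1, 6, 12, 2]) torch.Size([1, 6])
```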
The present invention is not limited to the above embodiment; any equivalent idea or modification within the technical scope of the present invention falls within its scope of protection.

Claims (1)

1. A multi-modal trajectory prediction method by paying attention to scenes and states, characterized in that: the method comprises the following steps:
A. A fully convolutional neural network extracts scene information and an affine transformation focuses on the target agent
A1, constructing a composite grid map: the environment information is first rendered on a high-definition map under a bird's-eye view to preserve the sizes and positions of the agents and the road geometry while ignoring their texture; to capture the dynamic interactions between agents and between agents and the environment, and to avoid occluded semantic annotations and data redundancy, several consecutive rasterized high-definition maps are split and recombined into a composite grid map $\mathcal{M} \in \mathbb{R}^{H \times W \times C}$, where H is the length of the composite grid map, W is the width, and C is the number of channels;
A2, extracting scene and interaction information with a fully convolutional network: a fully convolutional network (FCN) learns representative topological, semantic and interaction information from the composite grid map $\mathcal{M}$ obtained in step A1:

$$F_s = \mathrm{FCN}(\mathcal{M}; W_s) \qquad (1)$$

where FCN(·) is the fully convolutional neural network, $\mathcal{M}$ is the composite grid map, and $W_s$ denotes the weights of the fully convolutional neural network;

to reduce the number of model parameters, the convolutional neural network (CNN) inside the FCN uses MobileNetV2; the FCN not only yields scene and interaction information but also keeps the output feature map the same size as the input image, so the target agent can be attended to according to its initial position;
A3, attending to the target agent: an agent usually pays more attention to objects that are closer to it and interact with it more strongly, so after the feature map $F_s$ is obtained in step A2, a small feature map $F_c$ is cropped at the target agent's position to focus on that agent; after the small feature map $F_c$ is obtained, it is rotated by an affine transformation according to the agent's orientation so as to normalize the orientation of the agent; the affine transformation and the affine matrix are given by formulas (2) and (3):

$$\tilde{F}_c = \mathrm{Affine}(F_c, \theta) \qquad (2)$$

$$\theta = \begin{bmatrix} \cos h & -\sin h & 0 \\ \sin h & \cos h & 0 \end{bmatrix} \qquad (3)$$

where Affine(·) is the affine transformation function, θ is the affine matrix, and h is the orientation of the agent;
B. encoding agent historical state using Transformer
before the historical state information is encoded with a Transformer, the state variables are first concatenated, then a multi-layer perceptron (MLP) embeds them into a high-dimensional space to obtain $f_t$ of fixed size, and finally $f_t$ is embedded into a higher-dimensional space to obtain $F_a$; the MLP embedding and the embedding-matrix projection are given by:

$$f_t = \mathrm{MLP}(\mathrm{concat}(x_t, y_t, v_t, \sin(h), \cos(h))) \qquad (4)$$

$$F_a = f_t \cdot W_f \qquad (5)$$

where $(x_t, y_t)$ are the position coordinates of the agent, $v_t$ is its speed, $h$ is its orientation, concat(·) is the concatenation function, MLP(·) is the multi-layer perceptron, $W_f$ is the embedding matrix, and $t$ is the timestamp; the historical state information includes position, speed and orientation;
each past time instant t is time-coded using the Transformer "position coding", which is formulated as follows:
Figure FDA0003507544480000022
Figure FDA0003507544480000023
Ia=Fa+Pa (8)
wherein D is the data dimension and D is the model embedding dimension;
to capture the inherent latent dependencies among the agent's historical states, $I_a$ first undergoes three linear projections to obtain three matrices, $Q_a = W_q I_a$, $K_a = W_k I_a$ and $V_a = W_v I_a$, where $W_q$, $W_k$ and $W_v$ are the projection matrices of $Q_a$, $K_a$ and $V_a$ respectively; a self-attention mechanism then extracts the internal relations among the agent's historical states; finally, a feed-forward network produces the final representation $E_a$ of the agent's historical states; the self-attention mechanism (9) and the feed-forward network (10) are:

$$\mathrm{Att}(Q_a, K_a, V_a) = \mathrm{softmax}\!\left(\frac{Q_a K_a^{\top}}{\sqrt{d_{K_a}}}\right) V_a \qquad (9)$$

$$E_a = \mathrm{FFN}(\mathrm{Att}(Q_a, K_a, V_a)) \qquad (10)$$

where softmax(·) is the softmax function, Att(·) is multi-head self-attention, FFN(·) is the feed-forward network, and $d_{K_a}$ is the dimension of $K_a$;
C. feature fusion and decoding
C1, feature fusion: after the cropped feature map $\tilde{F}_c$ is obtained in step A3, the scene information is further extracted with a depthwise separable convolution, i.e. $Q' = \mathrm{DSC}(\tilde{F}_c)$, where DSC(·) is the depthwise separable convolution; an MLP then reduces the dimensionality of the agent's historical state $E_a$ to obtain $K'$ and $V'$, as in formula (11); finally, a multi-head attention mechanism fuses the scene information $Q'$ extracted by the depthwise separable convolution with the reduced state information $K'$ and $V'$ to obtain the fused feature $I$, as in formulas (12) and (13), where formula (12) is the attention-head function and formula (13) concatenates 8 attention heads:

$$K' = V' = \mathrm{MLP}(E_a) \qquad (11)$$

$$\mathrm{head}_i = \mathrm{Softmax}\!\left(\frac{(Q' W_i^{Q'})(K' W_i^{K'})^{\top}}{\sqrt{d_{K'}}}\right) V' W_i^{V'}, \quad i = 1, \dots, 8 \qquad (12)$$

$$I = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_8) \qquad (13)$$

where MLP(·) is the multi-layer perceptron, $K'$ is the key of the reduced state information, $V'$ is the value of the reduced state information, $\mathrm{head}_i$ is the i-th attention head, Softmax(·) is the softmax function, $W_i^{Q'}$, $W_i^{K'}$ and $W_i^{V'}$ are the projection matrices of $Q'$, $K'$ and $V'$, $d_{K'}$ is the dimension of $K'$, and Concat(·) is the concatenation function;
C2, decoding the fused features: the fused feature $I$ is fed into a regression branch and a classification branch to obtain K predicted trajectories and their associated probabilities; the regression branch is given by formula (14) and the classification branch by formulas (15) and (16):

$$O_{K,\mathrm{reg}} = \{\hat{\tau}^k\}_{k=1}^{K} = \mathrm{MLP}_3(I) \qquad (14)$$

$$\mathrm{logit} = \mathrm{MLP}_2\{\mathrm{Concat}(O_{K,\mathrm{reg}}, \mathrm{MLP}_1(I))\} \qquad (15)$$

$$\{\hat{p}^k\}_{k=1}^{K} = \mathrm{softmax}(\mathrm{logit}) \qquad (16)$$

where $\mathrm{MLP}_1(\cdot)$, $\mathrm{MLP}_2(\cdot)$ and $\mathrm{MLP}_3(\cdot)$ are 1-layer, 2-layer and 3-layer multi-layer perceptrons, $\hat{\tau}^k$ is the k-th predicted trajectory, and $\hat{p}^k$ is the probability of the k-th predicted trajectory.
CN202210142283.8A 2022-02-16 2022-02-16 Multi-modal trajectory prediction method by paying attention to scenes and states Pending CN114707630A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210142283.8A CN114707630A (en) 2022-02-16 2022-02-16 Multi-modal trajectory prediction method by paying attention to scenes and states

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210142283.8A CN114707630A (en) 2022-02-16 2022-02-16 Multi-modal trajectory prediction method by paying attention to scenes and states

Publications (1)

Publication Number Publication Date
CN114707630A 2022-07-05

Family

ID=82166723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210142283.8A Pending CN114707630A (en) 2022-02-16 2022-02-16 Multi-modal trajectory prediction method by paying attention to scenes and states

Country Status (1)

Country Link
CN (1) CN114707630A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071925A (en) * 2023-02-13 2023-05-05 北京爱芯科技有限公司 Track prediction method and device and electronic processing device
CN116071925B (en) * 2023-02-13 2024-04-12 北京爱芯科技有限公司 Track prediction method and device and electronic processing device
CN116629462A (en) * 2023-07-25 2023-08-22 清华大学 Multi-agent unified interaction track prediction method, system, equipment and medium
CN116629462B (en) * 2023-07-25 2023-11-21 清华大学 Multi-agent unified interaction track prediction method, system, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination