CN114707630A - Multi-modal trajectory prediction method by paying attention to scenes and states - Google Patents
Multi-modal trajectory prediction method by paying attention to scenes and states
- Publication number
- CN114707630A CN114707630A CN202210142283.8A CN202210142283A CN114707630A CN 114707630 A CN114707630 A CN 114707630A CN 202210142283 A CN202210142283 A CN 202210142283A CN 114707630 A CN114707630 A CN 114707630A
- Authority
- CN
- China
- Prior art keywords
- agent
- information
- attention
- formula
- mlp
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/02—Affine transformations
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/01—Detecting movement of traffic to be counted or controlled
- G08G1/0104—Measuring and analyzing of parameters relative to traffic conditions
Abstract
The invention discloses a multi-modal trajectory prediction method that attends to scenes and states, comprising the following steps: a fully convolutional neural network extracts scene information and focuses on the target agent using an affine transformation; a Transformer encodes the agent's historical states; and the resulting features are fused and decoded. The method jointly considers the agent's historical state information and its surrounding environment, and combines three attention mechanisms to extract and fuse these two kinds of information: the Transformer's self-attention captures latent relations among historical states; the affine transformation crops key positions from the feature map, acting as a hard attention mechanism that focuses on a specific agent; and a multi-head attention mechanism effectively captures the interaction between state and scene. The method finally generates multiple socially acceptable trajectories together with their associated probabilities.
Description
Technical Field
The invention belongs to the field of automatic driving, and particularly relates to a multi-modal trajectory prediction method for intelligent agents.
Background
In recent years, Artificial Intelligence (AI) has developed at an unprecedented pace, and automated driving has become one of its most promising applications; its impact on the production and lifestyle of human society may well rival the invention of the automobile itself. At present, automated driving technology is not yet mature, and intelligent vehicles must share traffic scenes with human drivers. To travel safely and efficiently, an intelligent vehicle needs to know not only which objects are around it, but also what trajectories the surrounding agents (vehicles, cyclists, pedestrians, and so on) are likely to follow. This is the trajectory prediction problem in the field of automatic driving.
Trajectory prediction is the intermediate link between environment perception and decision planning in automated driving. Accurately predicting agents' future trajectories in dynamic, interactive, and uncertain scenes is essential for an automated vehicle to understand its surroundings, make decisions, and guarantee stable, safe, and economical driving. However, trajectory prediction is very challenging because of the randomness of agent motion, the complexity of interactions, and the multi-modal nature of the future trajectory distribution. Methods based on dynamics and kinematics were proposed first, but their accuracy and generalization are poor, so they are suitable only for short-term trajectory prediction in simple driving scenarios. In recent years, trajectory prediction methods based on deep learning have become popular.
Among existing deep-learning methods, some capture interaction information between agents by designing various interaction mechanisms over historical trajectories alone. These ignore other important factors that affect agent motion, such as environmental conditions and traffic rules, which can make predictions unacceptable in complex traffic scenes; moreover, most of them generate only a single predicted trajectory, which contradicts the multi-modal nature of future motion. End-to-end approaches use only convolutional neural networks (CNNs) to extract the agent's state and scene information from image sequences; although CNNs handle spatial dependencies well, they lack a mechanism for modeling sequential data, whereas trajectory prediction requires modeling how the agent's state evolves over time. Other methods use historical trajectories and scene information together but without any attention mechanism, so the model cannot identify the factors that matter most for future motion; the resulting data redundancy and overly complex model make the final predictions inaccurate and poorly socially acceptable.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a multi-modal trajectory prediction method that, by attending to scenes and states, generates multiple socially acceptable trajectories and their associated probabilities.
In order to achieve this purpose, the specific technical scheme of the invention is as follows: a multi-modal trajectory prediction method by attending to scenes and states, comprising the following steps:
A. full convolution neural network extracts scene information and focuses on target intelligent agent by using affine transformation
A1, constructing a composite grid map. The environment information is first rendered on a high-definition map under a bird's-eye view, which preserves the sizes and positions of the agents and the road geometry while ignoring their textures. To capture the dynamic interaction between agents and between the agents and the environment, and to avoid semantic-annotation occlusion and data redundancy, several consecutive original rasterized high-precision maps are split and recombined into a composite grid map M ∈ R^(H×W×C), where H is the length of the composite grid map, W is its width, and C is the number of channels.
A2, extracting scene information and interaction information with the fully convolutional neural network. A fully convolutional network (FCN) learns representative topological, semantic, and interactive information from the composite grid map M obtained in step A1; the information is extracted as

F_s = FCN(M; W_s) (1)

where FCN(·) is the fully convolutional neural network, M is the composite grid map, and W_s is the weight of the fully convolutional neural network.
To reduce the model parameters, the CNN backbone inside the FCN uses MobileNetV2. Using an FCN not only yields scene and interaction information, but also keeps the size of the input image consistent with the size of the output feature map, so the network can focus on the target agent according to its initial position.
A3, attending to the target agent. An agent usually pays more attention to objects that are closer and more interactive, so after the feature map F_s is obtained in step A2, a small feature map F_c is cropped around the position of the target agent to focus on it. After F_c is obtained, an affine transformation rotates it by the agent's heading so as to normalize the agent's orientation. The affine transformation and the affine matrix are given by formulas (2) and (3):

F'_c = Affine(F_c, θ) (2)

θ = [cos h, −sin h, 0; sin h, cos h, 0] (3)

In the formulas, Affine(·) is the affine transformation function, θ is the affine matrix, and h is the orientation of the agent.
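The rotation part of formulas (2)-(3) can be sketched as a minimal numpy example. This is an illustration only: it applies the 2×3 affine matrix θ to point coordinates rather than resampling a full feature map (which in practice would use a grid-sampling routine), and the function names are hypothetical.

```python
import numpy as np

def affine_matrix(h: float) -> np.ndarray:
    """Build the 2x3 affine matrix theta of formula (3) for heading angle h (radians)."""
    return np.array([
        [np.cos(h), -np.sin(h), 0.0],
        [np.sin(h),  np.cos(h), 0.0],
    ])

def affine_transform(points: np.ndarray, h: float) -> np.ndarray:
    """Apply formula (2): rotate 2-D points by the agent heading to normalize orientation.

    points: (N, 2) array of (x, y) coordinates relative to the crop center.
    """
    theta = affine_matrix(h)
    homogeneous = np.hstack([points, np.ones((len(points), 1))])  # (N, 3)
    return homogeneous @ theta.T                                   # (N, 2)

# Rotating the unit x-axis vector by 90 degrees maps it onto the y-axis.
rotated = affine_transform(np.array([[1.0, 0.0]]), np.pi / 2)
```

In a full implementation the same θ would drive an image-warping call over the cropped map F_c, so that every crop is heading-aligned before fusion.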
B. Encoding agent historical state using Transformer
Before the historical state information is encoded with the Transformer, the state components are first concatenated; a multi-layer perceptron (MLP) then embeds them into a high-dimensional space to obtain f_t of fixed size, and finally f_t is embedded into a still higher-dimensional space to obtain F_a. The MLP embedding and the embedding-matrix step are:

f_t = MLP(concat(x_t, y_t, v_t, sin(h), cos(h))) (4)

F_a = f_t · W_f (5)

where (x_t, y_t) are the position coordinates of the agent, v_t is its speed, h is its orientation, concat(·) is the concatenation function, MLP(·) is the multi-layer perceptron, W_f is the embedding matrix, and t is the timestamp. The historical state information includes position, speed, and heading.
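Formulas (4)-(5) can be sketched in numpy as follows. The weight shapes and the single-layer ReLU MLP are placeholders for illustration (the patent does not specify the MLP depth here); only the concat-then-embed structure follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_state(x, y, v, h, W1, b1, Wf):
    """Sketch of formulas (4)-(5): concatenate the raw state, lift it with a
    one-layer ReLU MLP to f_t, then embed with matrix W_f to get F_a."""
    state = np.array([x, y, v, np.sin(h), np.cos(h)])  # 5-dim raw state
    f_t = np.maximum(W1 @ state + b1, 0.0)             # formula (4): fixed-size f_t (16-dim here)
    return f_t @ Wf                                    # formula (5): higher-dim F_a (64-dim here)

W1, b1 = rng.normal(size=(16, 5)), np.zeros(16)  # hypothetical MLP weights
Wf = rng.normal(size=(16, 64))                   # hypothetical embedding matrix W_f
F_a = embed_state(3.0, 4.0, 1.5, 0.3, W1, b1, Wf)
```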
Each past time instant t is time-coded using the Transformer's positional encoding, formulated as follows:

P_a(t, 2i) = sin(t / 10000^(2i/D)) (6)

P_a(t, 2i+1) = cos(t / 10000^(2i/D)) (7)

I_a = F_a + P_a (8)

where d is the data dimension and D is the model embedding dimension.
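The sinusoidal position coding of formulas (6)-(7) is standard and can be written directly in numpy:

```python
import numpy as np

def positional_encoding(T: int, D: int) -> np.ndarray:
    """Formulas (6)-(7): each past timestamp t gets sin/cos features at
    geometrically spaced frequencies over the D embedding dimensions."""
    P = np.zeros((T, D))
    t = np.arange(T)[:, None]          # timestamps 0..T-1
    i = np.arange(0, D, 2)[None, :]    # even dimension indices
    P[:, 0::2] = np.sin(t / 10000 ** (i / D))
    P[:, 1::2] = np.cos(t / 10000 ** (i / D))
    return P

P_a = positional_encoding(T=8, D=64)
# Formula (8) then simply adds this to the embedded states: I_a = F_a + P_a.
```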
To capture the latent dependencies among the agent's historical states, I_a first undergoes three linear projections to obtain three matrices, namely Q_a = W_q I_a, K_a = W_k I_a, and V_a = W_v I_a, where W_q, W_k, and W_v are the projection matrices of Q_a, K_a, and V_a respectively. The internal relations among the historical states are then obtained with a self-attention mechanism, and finally a feed-forward network produces the final representation E_a of the historical states. The self-attention formula (9) and the feed-forward network formula (10) are as follows:

Att(Q_a, K_a, V_a) = softmax(Q_a K_a^T / sqrt(d_Ka)) V_a (9)

E_a = FFN(Att(Q_a, K_a, V_a)) (10)

where softmax(·) is the softmax function, Att(·) is multi-head self-attention, FFN(·) is the feed-forward network, and d_Ka is the dimension of K_a.
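Scaled dot-product self-attention over the historical states, as in formula (9), can be sketched with plain numpy (single head, right-multiplied projection convention for brevity; the projection weights here are random placeholders, not learned values):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(I_a, Wq, Wk, Wv):
    """Formula (9): project I_a three ways, then scaled dot-product attention
    so every past state attends to every other past state."""
    Q, K, V = I_a @ Wq, I_a @ Wk, I_a @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (T, T) pairwise state affinities
    return softmax(scores) @ V

rng = np.random.default_rng(1)
I_a = rng.normal(size=(8, 64))               # 8 past states, 64-dim each
Wq, Wk, Wv = (rng.normal(size=(64, 64)) for _ in range(3))
A = self_attention(I_a, Wq, Wk, Wv)
```

A feed-forward network applied to `A`, as in formula (10), would then yield E_a.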
C. Feature fusion and decoding
C1, feature fusion. After the cropped feature map F'_c is obtained in step A3, depth-separable convolution further extracts the scene information, i.e. Q' = DSC(F'_c), where DSC(·) is a depth-separable convolution. Then an MLP reduces the dimension of the agent's historical-state representation E_a to obtain K' and V', as in formula (11). Finally, a multi-head attention mechanism fuses the scene information Q' extracted by the depth-separable convolution with the dimension-reduced state information K' and V' to obtain the fused feature I, as shown in formulas (12) and (13), where formula (12) is one attention head and formula (13) concatenates 8 attention heads:

K' = V' = MLP(E_a) (11)

head_i = Softmax(Q' W_i^Q (K' W_i^K)^T / sqrt(d_K')) V' W_i^V, i = 1, …, 8 (12)

I = Concat(head_1, …, head_8) (13)

where MLP(·) is a multi-layer perceptron, K' is the key of the dimension-reduced state information, V' is its value, head_i is an attention head, Softmax(·) is the softmax function, W_i^Q, W_i^K, and W_i^V are the projection matrices of Q', K', and V', d_K' is the dimension of K', and Concat(·) is the concatenation function.
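The scene-to-state multi-head fusion of formulas (12)-(13) can be sketched as follows. The per-head projections W_i^Q/K/V are random placeholders (learned in the real model), and the token counts and dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_fuse(Qp, Kp, Vp, n_heads=8):
    """Formulas (12)-(13): 8 heads attend from scene queries Q' to reduced
    state keys/values K', V'; the heads are then concatenated into I."""
    rng = np.random.default_rng(2)
    d = Kp.shape[-1]
    heads = []
    for _ in range(n_heads):
        WQ, WK, WV = (rng.normal(size=(d, d)) for _ in range(3))
        scores = (Qp @ WQ) @ (Kp @ WK).T / np.sqrt(d)  # formula (12) scores
        heads.append(softmax(scores) @ (Vp @ WV))      # one head's output
    return np.concatenate(heads, axis=-1)              # formula (13)

rng = np.random.default_rng(3)
Qp = rng.normal(size=(10, 32))      # hypothetical scene tokens from DSC
Kp = Vp = rng.normal(size=(8, 32))  # reduced state tokens, d_K' = 32
I = multihead_fuse(Qp, Kp, Vp)
```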
C2, decoding the fused feature. The fused feature I is fed into a regression branch and a classification branch respectively, finally producing K predicted trajectories and their associated probabilities. The regression branch is formula (14), and the classification branch is formulas (15) and (16):

O_K,reg = MLP_3(I) (14)

logit = MLP_2{Concat(O_K,reg, MLP_1(I))} (15)

P_k = softmax(logit)_k (16)

where MLP_1(·), MLP_2(·), and MLP_3(·) are 1-layer, 2-layer, and 3-layer multi-layer perceptrons, O_K,reg contains the K predicted trajectories, and P_k is the probability of the k-th predicted trajectory.
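The two decoding branches can be sketched as below. Single linear layers stand in for MLP_1/MLP_2/MLP_3, and the feature size, trajectory count K, and horizon length are illustrative assumptions, not values from the patent.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode(I_flat, K=5, horizon=12):
    """Sketch of formulas (14)-(16): a regression head maps the fused feature
    to K trajectories of `horizon` (x, y) points, and a classification head
    scores the K modes with a softmax."""
    rng = np.random.default_rng(4)
    d = I_flat.shape[0]
    W_reg = rng.normal(size=(d, K * horizon * 2))
    O_reg = (I_flat @ W_reg).reshape(K, horizon, 2)           # formula (14)
    W_cls = rng.normal(size=(d + K * horizon * 2, K))
    logit = np.concatenate([O_reg.ravel(), I_flat]) @ W_cls   # formula (15), simplified
    probs = softmax(logit)                                    # formula (16)
    return O_reg, probs

trajs, probs = decode(np.random.default_rng(5).normal(size=(128,)))
```

The classification branch reuses the regression output, so each probability is conditioned on the trajectory it scores, matching the Concat in formula (15).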
Compared with the prior art, the invention has the following beneficial effects:
1. The invention performs multi-modal trajectory prediction by attending to scenes and states, jointly considering scene context information and the agent's historical state information to generate multiple accurate, socially acceptable future trajectories and their probabilities.
2. The invention avoids semantic-annotation occlusion and data redundancy through its unique scene representation and the affine transformation.
3. The method combines three attention mechanisms (the affine transformation as a hard attention mechanism, the Transformer self-attention mechanism, and a multi-head attention mechanism) to fuse scene and state information, improving prediction accuracy and social acceptability over traditional trajectory prediction methods.
4. In summary, the invention jointly considers the agent's historical state information and its surrounding environment, and combines three attention mechanisms to extract and fuse these two kinds of information: the Transformer's self-attention captures latent relations among historical states; the affine transformation crops key positions from the feature map, acting as a hard attention mechanism to focus on a specific agent; and the multi-head attention mechanism effectively captures the interaction between state and scene. The method finally produces multiple socially acceptable trajectories and their associated probabilities.
Drawings
The invention is illustrated with 4 figures, wherein:
FIG. 1 is a flow chart of a method for multi-modal trajectory prediction by noting the scene and state.
Fig. 2 is a model structure diagram of a multi-modal trajectory prediction method by paying attention to scenes and states.
Fig. 3 is a composite grid map.
Fig. 4 is a diagram of a full convolutional neural network FCN architecture.
Detailed Description
The following detailed description of the embodiments of the invention refers to the accompanying drawings. As shown in fig. 1, a multi-modal trajectory prediction method by paying attention to scenes and states includes the following steps:
A. full convolution neural network extracts scene information and focuses on target intelligent agent by using affine transformation
A1, constructing a composite grid map. Accurate trajectory prediction must consider not only the historical state information of the target agent but also scene semantic information; as shown in fig. 2, the environment information is therefore rendered on a high-definition map under a bird's-eye view. This representation is simple while preserving the sizes and positions of the agents and the road geometry and ignoring their textures. To capture the dynamic interaction between agents and between the agents and the environment, and to avoid semantic-annotation occlusion and data redundancy, the original rasterized high-precision map is split into the form shown in fig. 3; a single channel is then taken from each picture, and all the obtained channels are stacked together into a composite grid map M ∈ R^(H×W×C), where H = 512 is the length of the composite raster image, W = 512 is its width, and C = 6 is the number of channels.
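The channel-stacking step of A1 is easy to express in numpy. The 6 input frames below are zero-filled stand-ins for the real consecutive rasterized bird's-eye-view renderings:

```python
import numpy as np

# Hypothetical stand-ins for 6 consecutive rasterized BEV frames (512 x 512, 3 channels each).
frames = [np.zeros((512, 512, 3), dtype=np.uint8) for _ in range(6)]

# Take a single channel from each frame and stack them, yielding one
# composite grid map of shape (H, W, C) = (512, 512, 6) as described in step A1.
composite = np.stack([f[:, :, 0] for f in frames], axis=-1)
```

Stacking one channel per timestep keeps the temporal sequence visible to the convolutional encoder while avoiding redundant semantic layers.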
A2, extracting scene information and interaction information with the fully convolutional neural network. The fully convolutional network (FCN) can combine shallow, fine appearance information with deep, coarse semantic information, and learns representative topological, semantic, and interactive information from the composite rasterized map M obtained in A1:

F_s = FCN(M; W_s) (1)

where FCN(·) is the fully convolutional network, M is the composite grid map, and W_s is the weight of the fully convolutional network.
To reduce the model parameters, the CNN backbone inside the fully convolutional network uses the lightweight network MobileNetV2, which offers good performance with few parameters. Skip connections link the 7th, 14th, and 19th layers of MobileNetV2 with the corresponding deconvolution outputs to obtain the feature map F_s; see fig. 4. Using an FCN not only captures the interaction between agents and the environmental context, but, more importantly, keeps the sizes of the input composite grid map and the output feature map consistent, so that the network can focus on the target agent according to its initial position.
A3, attending to the target agent. Agents generally pay more attention to objects that are closer and more interactive, so most of the information in the full map is not useful for a single agent. After the feature map F_s is obtained in step A2, a small feature map F_c is cropped around the position of the target agent to focus on it. In addition, the agent's orientation is valuable information: after F_c is obtained, an affine transformation rotates it by the agent's heading so as to unify the agent's orientation. The affine transformation and the affine matrix are given by formulas (2) and (3):

F'_c = Affine(F_c, θ) (2)

θ = [cos h, −sin h, 0; sin h, cos h, 0] (3)

In the formulas, Affine(·) is the affine transformation function, θ is the affine matrix, and h is the orientation of the agent.
B. Encoding agent historical state using Transformer
Before the Transformer encodes the historical state information of the target agent (position, speed, orientation, etc.), all the state components are first concatenated; a multi-layer perceptron (MLP) then embeds them into a high-dimensional (16-dimensional) space to obtain f_t of fixed input size, and finally f_t is embedded into a higher-dimensional (64-dimensional) space to obtain F_a. The MLP embedding and the embedding-matrix step are:

f_t = MLP(concat(x_t, y_t, v_t, sin(h), cos(h))) (4)

F_a = f_t · W_f (5)

where (x_t, y_t) are the position coordinates of the agent, v_t is its speed, h is its orientation, concat(·) is the concatenation function, MLP(·) is the multi-layer perceptron, W_f is the embedding matrix, and t is the timestamp.
The state at each past time instant t is time-coded using the Transformer's positional encoding, formulated as follows:

P_a(t, 2i) = sin(t / 10000^(2i/D)) (6)

P_a(t, 2i+1) = cos(t / 10000^(2i/D)) (7)

I_a = F_a + P_a (8)

where d = 64 is the data dimension and D = 512 is the model embedding dimension.
To capture the latent dependencies among the agent's historical states, I_a first undergoes three linear projections to obtain three matrices, namely Q_a = W_q I_a, K_a = W_k I_a, and V_a = W_v I_a, where W_q, W_k, and W_v are the projection matrices of Q_a, K_a, and V_a respectively. The internal relations among the historical states are then obtained with a multi-head self-attention mechanism, and finally a feed-forward network of dimension 2048 produces the representation E_a of the history:

Att(Q_a, K_a, V_a) = softmax(Q_a K_a^T / sqrt(d_Ka)) V_a (9)

E_a = FFN(Att(Q_a, K_a, V_a)) (10)

where softmax(·) is the softmax function, Att(·) is multi-head self-attention, FFN(·) is the feed-forward network, and d_Ka is the dimension of K_a.
In encoding the agent history state, 3 encoder layers of the Transformer are used, each encoder layer having 64 hidden units and 8 attention heads.
C. Feature fusion and decoding
C1, feature fusion. After the cropped feature map F'_c is obtained in step A3, depth-separable convolution further extracts the scene information, i.e. Q' = DSC(F'_c), where DSC(·) is a depth-separable convolution with 4 hidden layers. Then an MLP reduces the dimension of the agent's historical-state representation E_a to 32, yielding K' and V' as in formula (11). Finally, a multi-head attention mechanism fuses the scene information Q' extracted by the depth-separable convolution with the dimension-reduced state information K' and V' to obtain the fused feature I, as in formulas (12) and (13), where (12) is one attention head and (13) concatenates 8 attention heads:

K' = V' = MLP(E_a) (11)

head_i = Softmax(Q' W_i^Q (K' W_i^K)^T / sqrt(d_K')) V' W_i^V, i = 1, …, 8 (12)

I = Concat(head_1, …, head_8) (13)

where MLP(·) is a multi-layer perceptron, K' is the key of the dimension-reduced state information, V' is its value, head_i is an attention head, Softmax(·) is the softmax function, W_i^Q, W_i^K, and W_i^V are the projection matrices of Q', K', and V', d_K' = 32 is the dimension of K', and Concat(·) is the concatenation function.
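Why a depth-separable convolution for the scene branch? A quick parameter count shows the saving over a standard convolution; the kernel size and channel counts below are illustrative, not values stated in the patent:

```python
# Parameter-count comparison motivating the depth-separable convolution in C1.
# For a k x k kernel mapping c_in -> c_out channels (biases ignored):
def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # depthwise: one k x k filter per input channel; pointwise: 1x1 channel mixing.
    return k * k * c_in + c_in * c_out

std = standard_conv_params(3, 64, 64)        # 3*3*64*64 = 36864
dsc = depthwise_separable_params(3, 64, 64)  # 3*3*64 + 64*64 = 4672
```

For this 3×3, 64-channel case the separable form uses roughly an eighth of the parameters, which is why MobileNetV2-style blocks are used throughout the model.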
C2, decoding the fused feature. The fused feature I is fed into a regression branch and a classification branch respectively, finally producing K predicted trajectories and their associated probabilities. The regression branch is formula (14), and the classification branch is formulas (15) and (16):

O_K,reg = MLP_3(I) (14)

logit = MLP_2{Concat(O_K,reg, MLP_1(I))} (15)

P_k = softmax(logit)_k (16)

where MLP_1(·), MLP_2(·), and MLP_3(·) are 1-layer, 2-layer, and 3-layer MLPs, O_K,reg contains the K predicted trajectories, and P_k is the probability of the k-th predicted trajectory.
The present invention is not limited to the embodiment, and any equivalent idea or change within the technical scope of the present invention is to be regarded as the protection scope of the present invention.
Claims (1)
1. A multi-modal trajectory prediction method by paying attention to scenes and states, characterized in that: the method comprises the following steps:
A. full convolution neural network extracts scene information and focuses on target intelligent agent by using affine transformation
A1, constructing a composite grid map; the environment information is first rendered on a high-definition map under a bird's-eye view to preserve the sizes and positions of the agents and the geometric characteristics of the road while neglecting their textures; in order to embody the dynamic interaction between agents and between the agents and the environment and to avoid semantic-annotation occlusion and data redundancy, a plurality of consecutive original rasterized high-precision maps are split and recombined into a composite grid map M ∈ R^(H×W×C), where H is the length of the composite grid map, W is its width, and C is the number of channels;
A2, extracting scene information and interaction information with the fully convolutional neural network; a fully convolutional network (FCN) learns representative topological, semantic, and interactive information from the composite grid map M obtained in step A1, extracting the information as

F_s = FCN(M; W_s) (1)

where FCN(·) is the fully convolutional neural network, M is the composite grid map, and W_s is the weight of the fully convolutional neural network;
in order to reduce the model parameters, the convolutional neural network (CNN) backbone inside the FCN uses MobileNetV2; the FCN not only obtains scene and interaction information but also keeps the size of the input image consistent with that of the output feature map, so that the target agent is focused on according to its initial position;
A3, attending to the target agent; an agent usually pays more attention to objects that are closer and more interactive, so after the feature map F_s is obtained in step A2, a small feature map F_c is cropped around the position of the target agent to focus on it; after F_c is obtained, an affine transformation rotates it by the agent's heading so as to normalize the agent's orientation, the affine transformation and the affine matrix being given by formulas (2) and (3):

F'_c = Affine(F_c, θ) (2)

θ = [cos h, −sin h, 0; sin h, cos h, 0] (3)

in the formulas, Affine(·) is the affine transformation function, θ is the affine matrix, and h is the orientation of the agent;
B. encoding agent historical state using Transformer
Before the historical state information is encoded with the Transformer, the state components are first concatenated; a multi-layer perceptron (MLP) then embeds them into a high-dimensional space to obtain f_t of fixed size, and finally f_t is embedded into a still higher-dimensional space to obtain F_a:

f_t = MLP(concat(x_t, y_t, v_t, sin(h), cos(h))) (4)

F_a = f_t · W_f (5)

where (x_t, y_t) are the position coordinates of the agent, v_t is its speed, h is its orientation, concat(·) is the concatenation function, MLP(·) is the multi-layer perceptron, W_f is the embedding matrix, and t is the timestamp; the historical state information includes position, speed, and heading;
each past time instant t is time-coded using the Transformer "position coding", which is formulated as follows:
Ia=Fa+Pa (8)
wherein D is the data dimension and D is the model embedding dimension;
to capture the latent dependencies among the agent's historical states, I_a first undergoes three linear projections to obtain three matrices, namely Q_a = W_q I_a, K_a = W_k I_a, and V_a = W_v I_a, wherein W_q, W_k, and W_v are the projection matrices of Q_a, K_a, and V_a respectively; then, the internal relations among the historical states of the agent are obtained with a self-attention mechanism; finally, a feed-forward network produces the final representation E_a of the historical states; the self-attention formula (9) and the feed-forward network formula (10) are as follows:

Att(Q_a, K_a, V_a) = softmax(Q_a K_a^T / sqrt(d_Ka)) V_a (9)

E_a = FFN(Att(Q_a, K_a, V_a)) (10)

where softmax(·) is the softmax function, Att(·) is multi-head self-attention, FFN(·) is the feed-forward network, and d_Ka is the dimension of K_a;
C. feature fusion and decoding
C1, feature fusion; after the cropped feature map F'_c is obtained in step A3, depth-separable convolution further extracts the scene information, i.e. Q' = DSC(F'_c), wherein DSC(·) is a depth-separable convolution; then an MLP reduces the dimension of the agent's historical-state representation E_a to obtain K' and V', as in formula (11); finally, a multi-head attention mechanism fuses the scene information Q' extracted by the depth-separable convolution with the dimension-reduced state information K' and V' to obtain the fused feature I, as in formulas (12) and (13), where formula (12) is one attention head and formula (13) concatenates 8 attention heads:

K' = V' = MLP(E_a) (11)

head_i = Softmax(Q' W_i^Q (K' W_i^K)^T / sqrt(d_K')) V' W_i^V, i = 1, …, 8 (12)

I = Concat(head_1, …, head_8) (13)

in the formulas, MLP(·) is a multi-layer perceptron, K' is the key of the dimension-reduced state information, V' is its value, head_i is an attention head, Softmax(·) is the softmax function, W_i^Q, W_i^K, and W_i^V are the projection matrices of Q', K', and V', d_K' is the dimension of K', and Concat(·) is the concatenation function;
C2, decoding the fused feature; the fused feature I is input into a regression branch and a classification branch respectively, finally obtaining K predicted trajectories and their associated probabilities, where the regression branch is given by formula (14) and the classification branch by formulas (15) and (16):
logit=MLP2{Concat(OK,reg,MLP1(I))} (15)
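A sketch of the two decoding branches follows. Heavy hedging is needed here: formulas (14) and (16) are not reproduced in this excerpt, so the regression branch (an MLP whose output is reshaped into K trajectories) and the final softmax over the logits are assumed forms; the mode count K = 6, horizon of 30 steps, and all layer widths are likewise illustrative, not the patent's values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mlp(x, W1, W2):
    # two-layer ReLU perceptron used as a stand-in for each MLP in the text
    return np.maximum(x @ W1, 0.0) @ W2

rng = np.random.default_rng(2)
K, T_f, d = 6, 30, 64                   # assumed: 6 modes, 30 future steps, 64-dim feature
I_flat = rng.standard_normal(d)         # fused feature I, flattened for this sketch

# regression branch (assumed form of formula (14)): K trajectories of T_f (x, y) points
W_r1 = rng.standard_normal((d, 128)) * 0.1
W_r2 = rng.standard_normal((128, K * T_f * 2)) * 0.1
O_reg = mlp(I_flat, W_r1, W_r2).reshape(K, T_f, 2)

# classification branch, formula (15): logits from regression output plus fused feature
W_a1, W_a2 = rng.standard_normal((d, 32)) * 0.1, rng.standard_normal((32, 32)) * 0.1
h = mlp(I_flat, W_a1, W_a2)                        # MLP1(I)
cat = np.concatenate([O_reg.reshape(-1), h])       # Concat(OK,reg, MLP1(I))
W_b1 = rng.standard_normal((cat.size, 64)) * 0.1
W_b2 = rng.standard_normal((64, K)) * 0.1
logit = mlp(cat, W_b1, W_b2)                       # MLP2{...}

# assumed form of formula (16): per-mode probabilities from the logits
p = softmax(logit)
print(O_reg.shape, p.shape)
```

Feeding the regressed trajectories back into the classification branch lets the probability of each mode depend on where that mode actually goes, not only on the shared fused feature.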
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210142283.8A CN114707630A (en) | 2022-02-16 | 2022-02-16 | Multi-modal trajectory prediction method by paying attention to scenes and states |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114707630A true CN114707630A (en) | 2022-07-05 |
Family
ID=82166723
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210142283.8A Pending CN114707630A (en) | 2022-02-16 | 2022-02-16 | Multi-modal trajectory prediction method by paying attention to scenes and states |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114707630A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116071925A (en) * | 2023-02-13 | 2023-05-05 | 北京爱芯科技有限公司 | Track prediction method and device and electronic processing device |
CN116071925B (en) * | 2023-02-13 | 2024-04-12 | 北京爱芯科技有限公司 | Track prediction method and device and electronic processing device |
CN116629462A (en) * | 2023-07-25 | 2023-08-22 | 清华大学 | Multi-agent unified interaction track prediction method, system, equipment and medium |
CN116629462B (en) * | 2023-07-25 | 2023-11-21 | 清华大学 | Multi-agent unified interaction track prediction method, system, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||