WO2022022721A1 - Trajectory prediction method, apparatus, device, storage medium and program - Google Patents

Trajectory prediction method, apparatus, device, storage medium and program

Info

Publication number
WO2022022721A1
WO2022022721A1 (PCT/CN2021/109871; CN2021109871W)
Authority
WO
WIPO (PCT)
Prior art keywords
time
information
series
position information
neural network
Prior art date
Application number
PCT/CN2021/109871
Other languages
English (en)
French (fr)
Inventor
张世权
李亦宁
蒋沁宏
石建萍
周博磊
Original Assignee
商汤集团有限公司
本田技研工业株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 商汤集团有限公司 and 本田技研工业株式会社
Priority to JP2022546580A priority Critical patent/JP7513726B2/ja
Publication of WO2022022721A1 publication Critical patent/WO2022022721A1/zh

Classifications

    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00 - Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001 - Planning or execution of driving tasks
    • B60W60/0027 - Planning or execution of driving tasks using trajectory prediction for other traffic participants
    • B60W60/00274 - Planning or execution of driving tasks using trajectory prediction for other traffic participants considering possible movement changes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2554/00 - Input parameters relating to objects
    • B60W2554/40 - Dynamic objects, e.g. animals, windblown objects
    • B60W2554/402 - Type
    • B60W2554/4023 - Type large-size vehicles, e.g. trucks
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2554/00 - Input parameters relating to objects
    • B60W2554/40 - Dynamic objects, e.g. animals, windblown objects
    • B60W2554/402 - Type
    • B60W2554/4029 - Pedestrians
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2554/00 - Input parameters relating to objects
    • B60W2554/40 - Dynamic objects, e.g. animals, windblown objects
    • B60W2554/404 - Characteristics
    • B60W2554/4041 - Position
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2554/00 - Input parameters relating to objects
    • B60W2554/40 - Dynamic objects, e.g. animals, windblown objects
    • B60W2554/404 - Characteristics
    • B60W2554/4045 - Intention, e.g. lane change or imminent movement

Definitions

  • the embodiments of the present disclosure relate to the technical field of intelligent driving, and relate to, but are not limited to, a trajectory prediction method, apparatus, device, storage medium, and program.
  • In the related art, mainly the intrinsic correlation within the historical motion of pedestrian or vehicle trajectories is considered, for example, using the historical trajectory position information of pedestrians or vehicles to predict their trajectories in the future.
  • Embodiments of the present disclosure provide a trajectory prediction method, apparatus, device, storage medium, and program.
  • An embodiment of the present disclosure provides a trajectory prediction method, the method is executed by an electronic device, and the method includes:
  • the time-series location information is the location information of the object at different time points within a preset time period
  • the time-series attitude information is the posture information of the object at different time points within a preset duration
  • the posture information at the different time points includes the orientation information of the multiple parts of the object at the different time points;
  • the future trajectory of the object is determined according to the time-series position information, the time-series posture information, and the motion intention.
  • In this way, the motion intention of the object can be determined more accurately; then, the future trajectory of the object is predicted with the estimated motion intention, the time-series position information and the time-series attitude information as input, and the orientation information of the object is used in the prediction process. Thus, by combining the time-series position information, the time-series pose information and the motion intention, and by considering the orientation information of the object, the accuracy of predicting the future trajectory of the object can be effectively improved.
  • An embodiment of the present disclosure provides a trajectory prediction device, the device includes:
  • an intention determination module configured to determine the motion intention of the object according to the time-series position information and time-series attitude information of the object; wherein the time-series position information is the position information of the object at different time points within a preset time period, and the time-series posture information is the posture information of the object at different time points within the preset duration; the posture information at different time points includes orientation information of multiple parts of the object at the different time points;
  • the future trajectory determination module is configured to determine the future trajectory of the object according to the time-series position information, the time-series attitude information and the motion intention.
  • An embodiment of the present disclosure provides a computer storage medium, where computer-executable instructions are stored thereon, and after the computer-executable instructions are executed, the above-mentioned trajectory prediction method can be implemented.
  • An embodiment of the present disclosure provides a computer device. The computer device includes a memory and a processor, where computer-executable instructions are stored in the memory, and the processor implements the above-mentioned trajectory prediction method when executing the computer-executable instructions in the memory.
  • Embodiments of the present disclosure further provide a computer program, the computer program includes computer-readable codes, and when the computer-readable codes are executed in an electronic device, a processor of the electronic device executes the codes to implement the trajectory prediction method described above.
  • Embodiments of the present disclosure provide a trajectory prediction method, apparatus, device, storage medium, and program, which use the time-series position information and time-series attitude information of the object as input to estimate the motion intention of the object.
  • Based on the estimated motion intention, the future trajectory of the object can be predicted, and the orientation information of the object is used in the prediction process.
  • the accuracy of predicting the future trajectory of the object can be effectively improved.
  • FIG. 1 is a schematic flowchart of an implementation of a trajectory prediction method according to an embodiment of the present disclosure;
  • FIG. 2 is a schematic diagram of a system architecture to which the trajectory prediction method according to an embodiment of the present disclosure can be applied;
  • FIG. 3A is a schematic flowchart of another implementation of the trajectory prediction method according to an embodiment of the present disclosure;
  • FIG. 3B is a schematic flowchart of yet another implementation of the trajectory prediction method according to an embodiment of the present disclosure;
  • FIG. 4A is a schematic diagram of the distribution of objects in the data set and the distribution of intents of each object type according to an embodiment of the present disclosure;
  • FIG. 4B is another schematic diagram of the distribution of objects in the data set and the intents of each object type according to an embodiment of the present disclosure;
  • FIG. 4C is another schematic diagram of the distribution of objects in the data set and the intents of each object type according to an embodiment of the present disclosure;
  • FIG. 4D is another schematic diagram of the distribution of objects in the data set and the intents of each object type according to an embodiment of the present disclosure;
  • FIG. 5 is a schematic framework diagram of a trajectory prediction system provided by an embodiment of the present disclosure;
  • FIG. 6 is a structural diagram of an implementation framework of a trajectory prediction method according to an embodiment of the present disclosure;
  • FIG. 7 is a schematic structural composition diagram of a trajectory prediction apparatus according to an embodiment of the present disclosure;
  • FIG. 8 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
  • a trajectory prediction method is applied to a computer device.
  • the computer device may include objects or non-objects.
  • the functions implemented by the method may be implemented by a processor in the computer device calling program code.
  • the program code may be stored in a computer storage medium; it can be seen that the computer device includes at least a processor and a storage medium.
  • FIG. 1 is a schematic flowchart of an implementation of a trajectory prediction method according to an embodiment of the present disclosure; the method is described below in conjunction with the steps shown in FIG. 1:
  • Step S101 Determine the motion intention of the object according to the time-series position information and the time-series posture information of the object.
  • the time-series position information is the position information of the object at different time points within a preset time period
  • the time-series attitude information is the attitude information of the object at different time points within the preset time period
  • The objects are movable objects in the traffic environment, including human objects, such as pedestrians or cyclists, and non-human objects, including but not limited to at least one of the following: vehicles with various functions (such as trucks, cars, motorcycles, bicycles, etc.), vehicles with various numbers of wheels (such as four-wheeled vehicles, two-wheeled vehicles, etc.), and any movable equipment, such as robots, aircraft, guide devices for the blind, smart toys, toy cars, etc.
  • the posture information at different time points includes orientation information of one or more parts of the human object at the different time points.
  • Step S102 Determine the future trajectory of the object according to the time-series position information, the time-series posture information, and the motion intention.
  • the movement intention is the movement tendency of the object in a future period; for example, if the object is a pedestrian, the movement intention is whether the pedestrian intends to cross at a traffic light or to go straight in the future period.
  • the time-series position information, the time-series pose information and the motion intention are combined and input into the neural network as a whole to predict the future trajectory of the object.
  • the time-series position information and the time-series attitude information are spliced together in a preset manner as a fusion feature, and the fusion feature and motion intention are jointly referenced to predict the future trajectory of the object.
  • In the embodiments of the present disclosure, the time-series position information and time-series pose information of the object are used to estimate the object's intention (for example, whether a pedestrian intends to cross the road); thus, by considering richer time-series position information and time-series pose information of the moving object, the motion intention of the moving object can be determined more accurately. Then, the future trajectory of the object is predicted based on the estimated intention and the output of the learning model, using the time-series orientation information of each of the object's multiple parts in the prediction process. In this way, by combining the time-series position and posture information with the motion intention, the future trajectory of the moving object can be predicted, which effectively improves the accuracy of future trajectory prediction.
  • FIG. 2 shows a schematic diagram of a system architecture to which a trajectory prediction method according to an embodiment of the present disclosure can be applied; as shown in FIG. 2 , the system architecture includes an acquisition terminal 201 , a network 202 and a trajectory prediction terminal 203 .
  • the acquisition terminal 201 reports the time series position information and time series attitude information of the object to the trajectory prediction terminal 203 through the network 202 .
  • In response to the received time-series position information and time-series attitude information of the object, the trajectory prediction terminal 203 first determines the motion intention of the object according to the time-series location information and time-series attitude information of the object; then, it determines the future trajectory of the object according to the time-series location information, the time-series attitude information and the motion intention. The trajectory prediction terminal 203 then uploads the future trajectory of the object to the network 202 and sends it to the acquisition terminal 201 through the network 202.
  • the acquisition terminal 201 may include an image acquisition device, and the trajectory prediction terminal 203 may include a visual processing device or a remote server with visual information processing capability.
  • Network 202 may employ wired or wireless connections.
  • If the trajectory prediction terminal 203 is a visual processing device, the acquisition terminal 201 can be connected to the visual processing device through a wired connection, for example, performing data communication through a bus.
  • If the trajectory prediction terminal 203 is a remote server, the acquisition terminal 201 can exchange data with the remote server through a wireless network.
  • the acquisition terminal 201 may be a vision processing device with a video capture module, or a host with a camera.
  • the trajectory prediction method of the embodiment of the present disclosure may be executed by the acquisition terminal 201 , and the above-mentioned system architecture may not include the network 202 and the trajectory prediction terminal 203 .
  • Step S101 can be implemented by the following steps, as shown in FIG. 3A, which are described below in conjunction with FIG. 3A:
  • Step S11 Obtain, according to the time-series position information and the time-series attitude information, the environment information of the environment where the object is located.
  • the environmental information includes at least one of the following: road information, pedestrian information, or traffic light information.
  • According to the time-series position information of the object and the orientation information in the time-series attitude information, the world map is intercepted to obtain the local map area of the environment where the object is located; in this way, the local map information of the object is obtained, and the local map information is determined as the environment information.
  • the time-series position information and time-series attitude information of an object at a historical moment can be obtained through the following process: first, determine at least two historical moments whose duration from the current moment is less than or equal to a preset duration; then, obtain the object at least two historical moments Time series position information and time series attitude information.
  • In this way, the time-series position information and time-series attitude information acquired at multiple historical moments whose duration from the current moment is less than the preset duration can improve the accuracy of the subsequent prediction.
  • For example, if the current time is 10:05:20 and the preset duration is 5 seconds, the time-series position information and time-series attitude information of the object less than 5 seconds away from the current time are obtained, that is, the time-series position information and time-series attitude information of the object between 10:05:15 and 10:05:20.
  • the time series position information and the time series attitude information are related to the attributes of the object.
  • If the object is a person, the time-series position information and the time-series attitude information include at least: the person's time-series location information, body orientation and face orientation. For example, if a set of time-series position information and time-series attitude information is acquired every 1 second within this historical period, that is, from 10:05:15 to 10:05:20, there are 5 time points, and the body orientation, the face orientation and the location of the object are determined at each of these time points.
  • the time-series position information and the time-series attitude information include at least: time-series location information of the moving device, head orientation of the device, and driving instruction information of the moving device.
  • For example, the time-series position information and the time-series attitude information include: the time-series position of the vehicle, the heading of the vehicle, and the driving instruction information of the vehicle; wherein the driving instruction information includes but is not limited to at least one of the following: driving direction, driving speed and the status of the vehicle lights (for example, the status of the turn signals), etc.
  • In the embodiments of the present disclosure, the obtained rich time-series position information and time-series attitude information are used as the basis for intercepting the world map, so as to obtain the environment information of the environment where the object is located. That is, the environment information can be intercepted from the world map by using the position information of the object and the orientation information of the object in the time-series position information and the time-series attitude information, to determine the road structure, sidewalk information and traffic light information of the local map where the object is currently located. In this way, by obtaining rich time-series location information and time-series attitude information of the object to determine environment information such as the road structure where the object is currently located, the accuracy of the map interception can be improved; even when there are few observation points (even only one frame of observation data), reasonable prediction results can still be given.
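  • As a rough illustration of how such a map interception could be implemented, the following sketch crops a square local map area of a rasterized world map around the object's position and coarsely rotates it according to the object's orientation; the grid resolution, crop size and use of NumPy are assumptions for illustration, not the implementation of this disclosure.

```python
import numpy as np

def crop_local_map(world_map, position, heading, crop_m=20.0, res=0.5):
    """Crop a square local map area centred on the object.

    world_map : 2-D array, one cell per (res x res) metres of the world map
    position  : (x, y) position of the object in metres
    heading   : unit vector (d_x, d_y) giving the object's orientation
    """
    half = int(crop_m / (2 * res))                        # half crop size in cells
    cx, cy = int(position[0] / res), int(position[1] / res)
    patch = world_map[cy - half:cy + half, cx - half:cx + half]
    # Coarsely align the patch with the heading; np.rot90 stands in for a
    # continuous rotation along the exact orientation angle.
    angle = np.degrees(np.arctan2(heading[1], heading[0]))
    return np.rot90(patch, k=int(round(angle / 90.0)) % 4)

local_area = crop_local_map(np.zeros((400, 400)), position=(100.0, 60.0), heading=(0.0, 1.0))
```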
  • Step S12 fuse the environment information, the time-series position information and the time-series attitude information to obtain a fusion feature.
  • In some embodiments, the time-series position information and the time-series pose information include: the body orientation, the face orientation and the position of the object. The body orientation, the face orientation and the position of the object are separated and input into three independent first neural networks, to respectively obtain time-series position information and time-series attitude information indicating how the body orientation, the face orientation and the position of the object change over time; the time-series location information and time-series attitude information are input into the second neural network to obtain adjusted time-series position information and adjusted time-series attitude information; multiple different distances are input into the third neural network (for example, a fully connected model) to obtain the weights corresponding to the body orientation, the face orientation and the position of the object at those distances; and the weights are multiplied by the adjusted time-series position information and the adjusted time-series attitude information to obtain a multiplication result.
  • In some embodiments, the time-series position information, the time-series attitude information and the environment information are acquired at the same time points; for example, all of them are for 5 time points in the historical period. Splicing the multiplication result with the environment information obtained after encoding the local map area can therefore be achieved in the following manner: the matrix representing the multiplication result and the matrix representing the environment information are spliced together by row or by column to form one matrix, that is, the fusion feature is obtained.
  • the matrix representing the multiplication result is a matrix with 3 rows and 5 columns
  • the matrix representing the environmental information is a matrix with 6 rows and 5 columns
  • the two matrices are spliced together with their columns aligned, and a matrix with 9 rows and 5 columns is obtained, that is, the fusion feature.
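  • The splicing in this example can be reproduced directly: a 3-row, 5-column multiplication result and a 6-row, 5-column environment matrix are stacked so that the five time points stay aligned, giving the 9-row, 5-column fusion feature; the random values below are placeholders.

```python
import numpy as np

multiplication_result = np.random.rand(3, 5)   # e.g. body orientation, face orientation, position over 5 time points
environment_info = np.random.rand(6, 5)        # encoded local map over the same 5 time points

fusion_feature = np.concatenate([multiplication_result, environment_info], axis=0)
print(fusion_feature.shape)                    # (9, 5)
```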
  • Step S13 Determine the motion intention of the object according to the fusion feature.
  • the movement intention can be understood as: the movement tendency of the object during the movement process.
  • For a human object, the intention classification includes but is not limited to one or more of the following: turn left, turn right, go straight, stand still, make a U-turn, accelerate, decelerate, cross the road, wait for a red light, and walk backwards, etc.
  • For a vehicle object, the intent classification includes but is not limited to one or more of the following: turn left, turn right, go straight, stand still, change lanes left, change lanes right, accelerate, decelerate, overtake, reverse, and wait for red lights, etc.
  • In some embodiments, the fusion feature is decoded by a fully connected network to obtain the probability that the fusion feature belongs to each category in a preset category library, and the category with the highest probability is taken as the most probable category of the fusion feature; predicting the motion intention of the object based on this most probable category can improve the accuracy of intention prediction.
  • step S102 may be implemented in the following manner:
  • Step S14 Determine the future trajectory of the object according to the fusion feature and the motion intention.
  • In some embodiments, the future trajectory of the object in the future period can be predicted from the fusion feature and the motion intention. It is also possible not to predict the motion intention of the object, and to use only the first neural network to perform multiple iterations on the fusion feature to predict the future trajectory of the object over the future period. For example, the future trajectory of the object can be obtained by decoding the second adjusted time-series position information and time-series attitude information. In this way, trajectory prediction can be performed using a variety of time-series location information and time-series attitude information, and even when there are few observation points (even only one frame of observation data), or in scenarios such as sudden acceleration, sudden deceleration or sudden turns of the object, the accuracy of future trajectory prediction can still be guaranteed.
  • In this way, the map information is integrated with the time-series position information and the time-series attitude information to predict the motion intention, which can improve the accuracy of the motion intention prediction; then, the future trajectory of the object is predicted based on the motion intention, which improves the accuracy of the trajectory prediction.
  • In some embodiments, the world map may be intercepted using the location information of the object and the orientation information of the object to determine the local map area of the current environment of the object; that is, step S11 may be achieved through the following process:
  • Step S111 intercepting the world map according to the position information and orientation information of the object at the historical moment to obtain a local map area of the environment where the object is located.
  • the orientation information and the position information in the time series attitude information appear in pairs, that is, the position information of the object and the orientation information at the position are determined at a certain historical moment.
  • the object is a human body (for example, a pedestrian or a cyclist)
  • the current road structure where the person is located is determined according to the person's position information and the person's body orientation, so that the world map is intercepted to determine the local map area in which the pedestrian is currently located.
  • the object is a moving device such as a vehicle
  • the current road where the vehicle is located is determined according to the location information of the vehicle and the heading of the vehicle, so as to intercept the world map to determine the local map area where the vehicle is currently located.
  • In some embodiments, the interception of the world map can be realized in the following manner: according to each position in the plurality of pieces of time-series position information and the orientation of the object at that position, the local map area of the environment where the object is located is determined, so as to obtain a plurality of local map areas.
  • the orientation of the object when it is in this position can be understood as the orientation of multiple parts when the object is in this position. In this way, by referring to the orientations of multiple parts of the object at one location, the local map area of the object can be delineated, which can improve the accuracy of the determined environmental information, thereby improving the accuracy of future trajectory prediction.
  • a local map area of the environment where the object is located is delineated in the world map. For example, taking this position as the center and along the facing direction, a rectangular area is delineated as a local map area of the environment where the object is located. In this way, multiple locations and multiple orientation information under each location can determine multiple local map areas.
  • the multiple local map areas are encoded to obtain multiple encoded maps, that is, environmental information. In this way, taking the location as the center and referring to the orientation information, the local map area is demarcated, so that the map information included in the demarcated local map area has a high correlation with the object, that is, the effectiveness of the environmental information can be improved.
  • Step S112 Encode the elements in the local map area to obtain the environment information.
  • each element represents map information of a corresponding area
  • the map information includes at least one of the following: road structure information, sidewalks, or road traffic lights.
  • the elements of this local map area are encoded as masks, and each codeword represents the map information of the corresponding area.
  • the environmental information is a matrix including 1 and 0, where 1 represents a sidewalk, 0 represents a road danger zone, and so on.
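  • A minimal sketch of such a mask encoding, assuming a small semantic patch and a hypothetical two-entry codebook that follows the example above (1 for sidewalk, 0 for road danger zone):

```python
import numpy as np

# Hypothetical semantic labels of a 4x4 local map area ('S' = sidewalk, 'D' = road danger zone).
semantic_patch = np.array([['S', 'S', 'D', 'D'],
                           ['S', 'S', 'D', 'D'],
                           ['S', 'D', 'D', 'D'],
                           ['S', 'D', 'D', 'D']])

codebook = {'S': 1, 'D': 0}                    # each codeword represents the map information of an area
environment_info = np.vectorize(codebook.get)(semantic_patch)
print(environment_info)
```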
  • the multiple environmental information and the corresponding time series position information and time series attitude information are fused to obtain multiple sets of fusion features, and the motion intention of the object is predicted by classifying the fusion features.
  • the structure of the first neural network is not limited, including but not limited to a convolutional neural network, a long short-term memory network (Long Short-Term Memory, LSTM), etc.
  • The time-series position information and time-series attitude information of multiple historical moments (for example, taking a pedestrian as the object, the multiple body orientations, multiple face orientations and multiple positions of the object) are respectively input into the bidirectional LSTM networks, to obtain time-series position information and time-series attitude information indicating how each of these features changes over time; the distance is then input into the fully connected model to obtain the weights corresponding to the body orientation, the face orientation and the position of the object at that distance, and the weights are multiplied by the adjusted time-series position information and time-series attitude information to obtain multiple multiplication results.
  • a local map area of the object is defined according to the position and orientation of the object; and mask coding is performed on the local map area to obtain environmental information, and each codeword represents the map information of the area.
  • the time-series position information and time-series attitude information of the object are combined with the encoded map to predict the intention of the object, and then predict the future trajectory of the object, which can improve the accuracy of the obtained future trajectory.
  • In some embodiments, time-series modeling is performed on each kind of time-series location information and time-series attitude information to obtain its change over time, and then each kind of time-series position information and time-series attitude information is fused with the environment information to obtain the fusion feature; that is, step S12 can be realized through the following process, as shown in FIG. 3B, which is another schematic flowchart of the implementation and is described below in conjunction with the steps shown in FIG. 3A and FIG. 3B:
  • Step S201 predicting time-series location information and time-series attitude information in a future period according to the time-series location information and time-series attitude information through a first neural network.
  • In some embodiments, step S201 can be implemented by the following process:
  • First, the time-series position information and time-series attitude information of each historical moment (that is, multiple pieces of time-series location information and time-series attitude information) are arranged in time sequence; then, the arranged multiple pieces of time-series location information and time-series attitude information are input into the first neural network to obtain multiple pieces of predicted time-series position information and time-series attitude information.
  • the first neural network may be a bidirectional LSTM network, and the number of the first neural networks matches the types contained in the time-series position information and the time-series attitude information.
  • the time-series position information and time-series pose information include: the object's body orientation, face orientation, and the location of the object; then, the first neural network is three independent bidirectional LSTM networks.
  • the time-series position information and the time-series attitude information include: the head orientation of the object, the state of the headlights, and the location of the object; then, the first neural network is three independent bidirectional LSTM networks.
  • A plurality of pieces of time-series position information and time-series attitude information are input into the bidirectional LSTM networks to obtain the corresponding encoded time-series position information and time-series attitude information. For example, if the object is a pedestrian, the pedestrian's body orientation, face orientation and position at different times are input into three independent bidirectional LSTM networks, respectively, to obtain: multiple pieces of time-series position information and time-series posture information corresponding to the body orientation at different times, indicating the change of the body orientation over time; multiple pieces of time-series position information and time-series posture information corresponding to the face orientation at different times, indicating the change of the face orientation over time; and multiple pieces of time-series position information and time-series attitude information corresponding to the position of the pedestrian at different times, indicating the change of the position over time.
  • If the object is a vehicle, the corresponding outputs are: multiple pieces of time-series position information and time-series attitude information corresponding to the heading of the vehicle at different times (indicating the change of the heading over time), multiple pieces of time-series location information and time-series attitude information corresponding to the state of the vehicle lights at different times (indicating the change of the light state over time), and multiple pieces of time-series position information and time-series attitude information corresponding to the position of the vehicle at different times (indicating the change of the position over time).
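  • The following sketch shows how three independent bidirectional LSTMs could separately encode the body-orientation, face-orientation and position sequences described above; the hidden size, batch layout and the use of PyTorch are illustrative assumptions rather than the prescribed implementation.

```python
import torch
import torch.nn as nn

T, hidden = 5, 32                                    # 5 historical time points
body_orientation = torch.randn(1, T, 2)              # unit vectors (d_x, d_y) per time point
face_orientation = torch.randn(1, T, 2)
positions = torch.randn(1, T, 2)                     # (x, y) per time point

# One independent bidirectional LSTM per kind of time-series information.
encoders = nn.ModuleList(
    [nn.LSTM(2, hidden, batch_first=True, bidirectional=True) for _ in range(3)])

encoded = [enc(seq)[0]                               # (1, T, 2 * hidden) per feature
           for enc, seq in zip(encoders, (body_orientation, face_orientation, positions))]
```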
  • the first neural network is a trained neural network, which can be obtained by training in the following manner:
  • The time-series location information and time-series attitude information of the object at historical moments are used as the input of the first neural network to be trained, and based on each set of time-series location information and time-series attitude information, the time-series position information and time-series attitude information of the object corresponding to the future time period are predicted, so as to obtain the predicted time-series location information and time-series attitude information.
  • the object here can be understood as a sample object. For example, pedestrians or animals in the sample images of the preset dataset.
  • the preset data set at least includes time-series position information and time-series attitude information of the sample object in the sample image.
  • the preset data set at least includes the body orientation, face orientation of the sample object or the position of the sample object in the sample image.
  • time-series position information and time-series attitude information in the future period are fused with the environment information of the environment where the object is located to obtain the fusion prediction feature.
  • the time-series position information and time-series attitude information predicted by the first neural network to be trained and the environment information are fused to obtain the fusion prediction feature.
  • future trajectories of objects in future time periods are predicted, at least according to the fused predicted features.
  • the first neural network is used to iterate the fusion prediction feature, thereby predicting the future trajectory of the object in the future time period.
  • the trained fully-connected network is used for classification to predict the motion intention of the object, and the motion intention and the fusion prediction feature are combined to predict the future trajectory of the object.
  • the first prediction loss of the first neural network to be trained with respect to the future trajectory is determined.
  • the first prediction loss is determined based on the first neural network, the future trajectory, and the ground truth trajectory of the object.
  • the first prediction loss includes at least one of the following: the average number of failed predictions of future trajectories whose length is greater than a preset threshold, the success rate of future trajectories under error thresholds corresponding to different distances, or the error between the end position of the future trajectory and the end position of the ground-truth trajectory.
  • The average number of failed predictions of future trajectories whose length is greater than the preset threshold can be understood as follows: for future trajectories whose length is greater than the preset threshold (for example, predicting the trajectory of the next 5 s), a prediction is made for every moment in the future trajectory, taking the historical trajectory of the previous 5 seconds as input and predicting the future trajectory of the next 5 seconds; the trajectory therefore has to be predicted multiple times, giving multiple prediction results. The number of failures among these results is counted, and the number of failures is divided by the length of the future trajectory to achieve normalization. Since there are multiple future trajectories whose length is greater than the preset threshold, the number of failed predictions of each trajectory is divided by its length to obtain multiple normalized values; finally, these normalized values are averaged to obtain the average number of failed predictions per trajectory.
  • the success rate of the predicted future trajectory under the error thresholds corresponding to different distances can be understood as different error thresholds are preset for different distances. For example, the larger the distance, the larger the set error threshold. If the error of the obtained future trajectory is less than the error threshold at a certain distance, it is determined that the prediction is successful. In this way, the performance of the predicted future trajectory under different error thresholds can be characterized, and based on this, the detail effect of the neural network can be improved.
  • the error between the end position of the future trajectory and the end position of the true value trajectory can be understood as the difference between the end point of the future trajectory and the end point of the true value trajectory.
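  • Assuming that both predicted and ground-truth trajectories are arrays of (x, y) points, the three quantities above could be computed roughly as follows; the failure criterion and the threshold function are illustrative assumptions.

```python
import numpy as np

def endpoint_error(pred, gt):
    """Error between the end position of the future trajectory and of the ground-truth trajectory."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))

def success_rate(preds, gts, threshold_for_distance):
    """Success rate under error thresholds that grow with the predicted distance."""
    hits = [np.linalg.norm(p[-1] - g[-1]) < threshold_for_distance(np.linalg.norm(g[-1] - g[0]))
            for p, g in zip(preds, gts)]
    return float(np.mean(hits))

def avg_failures_per_length(failures_per_traj, traj_lengths):
    """Average number of failed predictions, normalised by each trajectory's length."""
    return float(np.mean(np.asarray(failures_per_traj) / np.asarray(traj_lengths)))

pred = np.array([[0.0, 0.0], [1.0, 1.0]])
gt = np.array([[0.0, 0.0], [1.2, 1.1]])
print(endpoint_error(pred, gt))
print(success_rate([pred], [gt], threshold_for_distance=lambda d: 0.1 * d + 0.5))
print(avg_failures_per_length([2, 0], [10, 8]))
```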
  • the network parameters of the first neural network are adjusted to train the first neural network.
  • the first prediction loss may be directly used to adjust the network parameters. For example, at least one of the average number of failed predictions of future trajectories with a length greater than the preset threshold, the success rate of predicted future trajectories under error thresholds corresponding to different distances, or the error between the end position of the future trajectory and the end position of the ground-truth trajectory is used to adjust the network parameters.
  • the performance of the first neural network obtained by training is better.
  • The above-mentioned adjustment process can also be implemented in the following way. First, the success rate and the average number of failed predictions are compared, and when the success rate is less than the average number of failed predictions, the future trajectory predicted this time is determined to have failed; then, the network parameters of the neural network are adjusted using at least one of the average position error, the average number of failed predictions, the success rate or the end-position error. In this way, the future trajectories predicted during training are evaluated against multiple criteria, so that the network parameters can be adjusted more accurately and the future trajectory predicted by the adjusted first neural network is more accurate.
  • Step S202 splicing the time-series position information, time-series attitude information and the environment information in the future period according to a preset method to obtain the fusion feature.
  • Here, the time-series position information and time-series attitude information and the corresponding local map can be understood as follows: the time-series location information and time-series attitude information belong to one group, and the corresponding local map is intercepted according to the location information and orientation information in this group of time-series location information and time-series attitude information.
  • a plurality of time-series position information and time-series attitude information are spliced with the local map in a one-to-one correspondence according to a preset method to obtain a fusion feature; the preset method may be according to the sequence of inputting the time-series location information and the time-series attitude information into the neural network.
  • the three kinds of time-series position information and time-series attitude information are sequentially input into the neural network (for example, LSTM network); then in the order from the pedestrian's body orientation, face orientation to the pedestrian's position, the time-series position information and time-series pose information and the corresponding local map are spliced to obtain fusion features. Then, a fully connected network is used to decode the fusion features to predict the pedestrian's motion intention, that is, whether the pedestrian wants to turn left, turn right, go straight, stand still, or turn around.
  • If the object is a vehicle, the time-series position information and the time-series attitude information include: the time-series information of the vehicle heading, the time-series information of the vehicle position, and the time-series information of the vehicle light state. These three kinds of time-series information are sequentially input into the neural network (for example, an LSTM network) in the order of heading, position and light state; then, in the same order, from the heading time-series information and the position time-series information to the light-state time-series information, the time-series position information and time-series pose information are spliced with the corresponding local map to obtain the fusion features.
  • a fully connected network is used to decode the fused features to predict the motion intention of the vehicle, that is, whether the vehicle wants to turn left, turn right, go straight, stand still, change lanes left, change lanes right, overtake or reverse, etc.
  • the above-mentioned steps S201 and S202 provide a method for realizing "merging the environment information with the time-series position information and the time-series attitude information to obtain fusion features".
  • Step S203 the second neural network is used to determine the confidence level that the fusion feature is each intent category in the intent category library.
  • the second neural network may be a fully connected network for classifying fused features. For example, by using a fully connected network to predict the possibility that the fusion feature is each intent category in the intent category library, the confidence level of each intent category can be obtained.
  • the corresponding intent category library includes: turn left, turn right, go straight, stand still or turn around, etc.; a fully connected network is used to predict that the fusion features may be left turn, Confidence for each intent category in right turn, straight, stationary or U-turn, etc., eg, probability for each intent category.
  • the second neural network is a trained neural network, which can be obtained by training in the following manner:
  • the fusion feature is input into the second neural network to be trained, and the motion intention of the object is predicted as the confidence level of each intention category in the intention category library.
  • the second neural network to be trained may be a fully connected network to be trained, and the fusion feature is input into the second neural network to be trained to predict the motion intention of the object as the probability of each category in the category library.
  • the object may be a sample object, and the fusion feature of the sample object is input into the second neural network to be trained to classify the motion intention of the sample object.
  • a second prediction loss of the second neural network's confidence about each intent category is determined based on the object's ground-truth intent.
  • the second prediction loss may be a categorical cross-entropy loss function.
  • the network parameters of the second neural network to be trained are adjusted to train the second neural network to be trained to obtain the second neural network.
  • the network parameters of the second neural network to be trained are adjusted by using the classified cross-entropy loss function, so as to train the second neural network to be trained, and the trained second neural network is obtained.
  • the loss function is the sum of the first prediction loss and the second prediction loss.
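  • A minimal sketch of summing the two losses, assuming an L2 trajectory loss stands in for the first prediction loss and categorical cross-entropy is used for the second; all tensors are placeholders.

```python
import torch
import torch.nn.functional as F

pred_traj = torch.randn(4, 10, 2)                      # predicted future coordinates
gt_traj = torch.randn(4, 10, 2)                        # ground-truth trajectory
intent_logits = torch.randn(4, 5)                      # confidence for 5 intent categories
intent_labels = torch.tensor([0, 2, 4, 1])             # ground-truth intents

first_loss = F.mse_loss(pred_traj, gt_traj)            # stands in for the first prediction loss
second_loss = F.cross_entropy(intent_logits, intent_labels)  # second prediction loss (intent)
total_loss = first_loss + second_loss                  # sum of the two prediction losses
```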
  • Step S204 Determine the motion intention of the object according to the intention category with the highest confidence.
  • The category with the highest probability is selected and determined as the movement intention of the object. For example, a fully connected network predicts the probabilities that the fusion feature corresponds to turning left, turning right, going straight, standing still or making a U-turn; if the probabilities of these categories are 0.1, 0.2, 0.2, 0.1 and 0.4 respectively, the category with the highest probability is the U-turn, indicating that the most probable movement intention of the object is to turn around, so the movement intention of the object is finally determined to be a U-turn.
  • the neural network can accurately predict the most probable motion intention by classifying the intention category of the fusion feature.
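  • Using the example probabilities above, selecting the intention is a simple argmax over the classifier output; the category order follows the example intent library.

```python
import numpy as np

categories = ["turn left", "turn right", "go straight", "stand still", "U-turn"]
probabilities = np.array([0.1, 0.2, 0.2, 0.1, 0.4])    # output of the fully connected network

motion_intention = categories[int(np.argmax(probabilities))]
print(motion_intention)                                 # "U-turn"
```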
  • the above steps S203 and S204 provide a method for realizing "determining the motion intention of the object according to the fusion feature".
  • the fusion feature is classified by using a fully connected network, so that accurate prediction can be achieved.
  • Step S205 Determine the iterative step size according to the length of the future period.
  • the length of the future period is 3 seconds, and the iteration step is determined to be 0.3 seconds.
  • Step S206 use the first neural network to iterate the motion intention and the fusion feature to obtain the coordinates of the object under each iteration step size.
  • the number of iterations required is determined first according to the iteration step size and the length of the future period, and then the first neural network is used to iterate the motion intent and the fusion feature to obtain the coordinates of each iteration.
  • For example, if the length of the future period is 3 seconds and the iteration step is determined to be 0.3 seconds, the number of iterations required is 10, and the first neural network is used to perform 10 successive iterations on the motion intention and the fusion feature.
  • Step S207 Determine the future trajectory according to the coordinates of the object under each iteration step.
  • the intention prediction of the object and the trajectory prediction are combined into a system, and the coordinates under each step length are obtained through step-by-step iteration, and the future trajectory is predicted, so that the efficiency and prediction of the final predicted future trajectory can be improved. Effect.
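  • The step-by-step iteration can be sketched as follows, with a future period of 3 seconds and a step of 0.3 seconds (10 iterations); the LSTM cell stands in for the first neural network, and all sizes and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

feat_dim, num_intents = 64, 5
fusion_feature = torch.randn(1, feat_dim)
intent = torch.randn(1, num_intents)

horizon_s, step_s = 3.0, 0.3
num_steps = int(round(horizon_s / step_s))             # 10 iteration steps

cell = nn.LSTMCell(feat_dim + num_intents, feat_dim)   # stand-in for the first neural network
to_xy = nn.Linear(feat_dim, 2)
h = torch.zeros(1, feat_dim)
c = torch.zeros(1, feat_dim)
step_input = torch.cat([fusion_feature, intent], dim=-1)

coords = []
for _ in range(num_steps):                             # one coordinate per iteration step
    h, c = cell(step_input, (h, c))
    coords.append(to_xy(h))

future_trajectory = torch.stack(coords, dim=1)         # (1, 10, 2)
```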
  • the following processes are further included:
  • LSTM networks are used to adjust each time-series position information and time-series attitude information to obtain the first adjusted time-series location information and time-series attitude information.
  • a bidirectional LSTM network or a model of a fully connected layer may be used to adjust the time-series position information and the time-series attitude information; each time-series location information and The time series attitude information is input into the bidirectional LSTM network or the model of the fully connected layer, and a weight matrix is obtained. Then, the weight matrix is divided into parts of the same type as the time series position information and the time series attitude information, and each part corresponds to the time series position. Each time-series position information and time-series attitude information in the information and the time-series attitude information are multiplied to obtain a plurality of first adjusted time-series location information and time-series attitude information.
  • For example, the time-series position information and the time-series posture information include the object's body orientation, face orientation and position; these three features are input into three independent first neural networks, and three kinds of time-series position information and time-series attitude information corresponding to the three features are obtained. These are then input into the second neural network in turn, in the order of body orientation, face orientation and position, and a weight matrix is obtained. The weight matrix is divided into three parts: the first part is multiplied by the body-orientation time-series information at different times, the second part is multiplied by the face-orientation time-series information at different times, and the third part is multiplied by the position time-series information of the object at different times, to obtain the first adjusted time-series location information and time-series attitude information including the three features.
  • a weight vector is obtained, and the weight vector is used to adjust each first adjusted time-series location information and time-series attitude information , and obtain the second adjusted timing position information and timing attitude information.
  • a fully connected model is adopted, and for multiple input distances, a weight vector corresponding to each time-series position information and time-series attitude information at the multiple positions is output. And multiply the obtained weight vector corresponding to each time-series position information and time-series attitude information with the first adjusted time-series location information and time-series attitude information corresponding to this kind of time-series location information and time-series attitude information to obtain the second adjusted time-series location. information and time series attitude information, so as to obtain the second adjusted time series position information and time series attitude information.
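  • A rough sketch of this second adjustment: a fully connected layer produces one weight per kind of time-series information, and each weight scales the corresponding first-adjusted sequence; the dimensions and the softmax normalisation are assumptions for illustration.

```python
import torch
import torch.nn as nn

T, d = 5, 64
first_adjusted = [torch.randn(1, T, d) for _ in range(3)]   # body orientation, face orientation, position
distances = torch.randn(1, 3)                               # distances associated with the three features

weight_net = nn.Sequential(nn.Linear(3, 3), nn.Softmax(dim=-1))
weights = weight_net(distances)                             # one weight per kind of information

second_adjusted = [w * feat for w, feat in zip(weights.unbind(dim=-1), first_adjusted)]
# Each second-adjusted sequence keeps shape (1, T, d) and is later spliced with the environment information.
```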
  • the second adjusted time-series position information and time-series attitude information are spliced with the environment information to obtain the fusion feature.
  • For example, the second adjusted time-series position information and time-series attitude information are spliced with the plurality of coded maps according to a preset method to obtain the fusion features.
  • the three kinds of time-series position information and time-series posture information are sequentially input into the neural network (for example, LSTM network) in the order of the pedestrian's body orientation, face orientation and the position of the object.
  • The obtained second adjusted time-series position information and time-series attitude information also include these three features; in the order from the pedestrian's body orientation, face orientation and position to the local map, the second adjusted time-series position information and time-series pose information are spliced with the corresponding local maps to obtain the fusion features.
  • a fully connected network is used to decode the fusion features to predict the pedestrian's motion intention, that is, whether the pedestrian wants to turn left, turn right, go straight, stand still, or turn around.
  • the embodiments of the present disclosure provide a trajectory prediction method.
  • a vehicle, a pedestrian, or a non-motor vehicle may have complex behaviors, such as sudden turning, sudden left or right turn, or walking.
  • complex behavior cannot be easily predicted or expected from the historical trajectories of vehicles, pedestrians or non-motor vehicles alone.
  • autonomous systems with sensing capabilities can naturally extract richer information to make more informed decisions.
  • the embodiments of the present disclosure use the orientation of the object to describe the motion of the object and the local map area to describe the surrounding static environment.
  • the position is represented as a point (x, y) in the horizontal plane, while the body and face orientations are extracted from the corresponding Red Green Blue (RGB) image and then projected onto the horizontal plane, represented as a unit vector (d_x, d_y).
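  • For instance, if an orientation has already been expressed as an angle in the horizontal plane, it can be written as the unit vector (d_x, d_y); the angle value below is a placeholder, and the full projection from the RGB image would additionally involve the camera geometry.

```python
import math

angle_rad = math.radians(30.0)                         # orientation angle in the horizontal plane (placeholder)
d_x, d_y = math.cos(angle_rad), math.sin(angle_rad)    # unit orientation vector (d_x, d_y)
```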
  • the local map area is obtained from the high-definition map and contains multiple road information, such as crosswalks, lane lines, intersections or sidewalks.
  • Embodiments of the present disclosure use a data collection vehicle to collect object trajectory data in an urban driving scene.
  • the car is equipped with cameras, 64-line LiDAR, radar, Global Positioning System (GPS) or Inertial Measurement Unit (IMU).
  • the embodiments of the present disclosure utilize the annotated high-definition map and, through the perception function, detect, analyze and track objects to generate their trajectories.
  • the embodiments of the present disclosure provide future trajectories of pedestrians and raw data at 10 hertz (Hz), where the raw data includes raw images, point cloud points, the pose of the ego vehicle and a high-definition map.
  • the embodiments of the present disclosure use a first neural network and a second neural network (wherein the first neural network and the second neural network can be implemented by using a model of a deep neural network algorithm) to get the output.
  • the preset data set provided by the embodiments of the present disclosure includes: the face orientation, body orientation, and position of pedestrians, vehicle light information, vehicle heading information, and the like. Training the first neural network and the second neural network on a data set containing such rich information makes the trained first and second neural networks more generalizable.
  • Embodiments of the present disclosure collect raw sensor data at a frequency of 10 Hz, including front-view RGB images (800 ⁇ 1762), LiDAR point clouds, and ego vehicle pose and motion information.
  • the embodiments of the present disclosure provide semantic annotation of road categories (i.e., lane lines, intersections, crosswalks, sidewalks, etc.) for bird's-eye-view High Definition Maps (HDMap).
  • Road categories are represented as polygons or lines with no overlapping areas.
  • the HDMap is cropped and aligned with the ego car for each data frame. With the help of perception function, the running trajectories of objects can be generated through detection and tracking.
  • the trajectories are sampled to 0.3 seconds per frame.
  • Embodiments of the present disclosure collect over 12,000 minutes of raw data and sample over 300,000 different trajectories for vehicles, pedestrians, and cyclists.
  • embodiments of the present disclosure manually annotate the objects in the collected trajectories with semantic attributes and intents.
  • Embodiments of the present disclosure use different property settings for each object class to better capture its functionality.
  • for vulnerable road users (VRUs) such as pedestrians and cyclists, the embodiments of the present disclosure annotate the age group (adult/juvenile), gender (female/male), face orientation (angle), and body orientation; for vehicles, the embodiments of the present disclosure annotate the turn signal status (turn left/turn right/brake) and the heading direction.
  • Intent can be understood as the future action of the object a specific time (1 s in the setting of the embodiments of the present disclosure) after the observation point. As with the attributes, embodiments of the present disclosure define different intent spaces for vehicles, pedestrians, and cyclists, as shown in FIGS. 4A to 4D, where FIG. 4A shows the distribution of the different objects, namely vehicles 401, pedestrians 402, and cyclists 403: the number of vehicles 401 is 334,696 (58%), the number of pedestrians 402 is 178,343 (31%), and the number of cyclists 403 is 61,934 (11%).
  • FIG. 4B shows the results of the intent prediction for vehicles, in which going straight 421 accounts for 38.9% (that is, the intent of a vehicle to go straight accounts for 38.9%), turning left 422 for 2%, turning right 423 for 1%, changing to the left lane 424 for 1.6%, changing to the right lane 425 for 2%, overtaking on the left 426 for 0.1%, overtaking on the right 427 for 0.1%, being stationary 428 for 54%, and other 429 for 0.2%.
  • FIG. 4C shows the results of the intent prediction for pedestrians, in which going straight 431 accounts for 48.6%, turning left 432 for 16.8%, turning right 433 for 23.6%, standing still 434 for 6.8%, turning around 435 for 0.4%, and other 436 for 3.7%.
  • FIG. 4D shows the results of the intent prediction for cyclists, in which going straight 441 accounts for 37.5%, turning left 442 for 13.5%, turning right 443 for 17.9%, being stationary 444 for 24%, making a U-turn for 0.1%, and other 445 for 7%.
  • the datasets of the embodiments of the present disclosure cover more object categories and provide rich contextual annotations, including road information and attribute annotations.
  • the datasets of embodiments of the present disclosure use broader intent definitions and have larger data scales.
  • a unified framework is adopted to jointly predict the future trajectory and potential intention of objects.
  • At least one of the first neural network and the second neural network used in the embodiments of the present disclosure may include, but is not limited to, an LSTM-based encoder-decoder architecture; building at least one of the two networks on this architecture improves the immediacy and generality of the framework.
  • an encoder is used to extract object features from historical motion trajectories and rich contextual information of objects, including semantic object attributes and local road structures.
  • the decoder is used to estimate the intent distribution and regress to future positions.
  • FIG. 5 is a schematic diagram of the framework of a trajectory prediction system provided by an embodiment of the present disclosure, and the following description is made with reference to FIG. 5 :
  • a time-series model is established for each kind of time-series position information and time-series attitude information, that is, each kind of time-series position information and time-series attitude information is input into the first neural network (here the first neural network can be implemented by the LSTM network 506) to obtain the corresponding time-series features.
  • the location information 502 is input into the LSTM network 506 to obtain the location time sequence feature
  • the body orientation 503 is input into the LSTM network 506 to obtain the body orientation time sequence feature
  • the face orientation 504 is input into the LSTM network 506 to obtain the face orientation time sequence feature
  • the road structure 505 is input into the second neural network (here, the second neural network can be implemented by using the CNN network 507 ) to encode the road structure to obtain road time-series position information and time-series attitude information.
  • the road time-series position information and time-series attitude information are fused with the time-series features to obtain the fusion feature, and the fusion feature is input into the first neural network (here the first neural network can be implemented by the MLP network 508) for intention prediction; the obtained intention prediction result is crossing the road 509.
  • the result of the intention prediction, crossing the road 509, is combined with the fusion feature and input into the LSTM network 506, and multiple iterations are performed to predict the running trajectory of the pedestrian, yielding the predicted future trajectory 510. In FIG. 5, by comparing the historical trajectory 511 of the pedestrian 501, the predicted future trajectory 510, and the ground-truth trajectory 512, it can be seen that the accuracy of the predicted future trajectory 510 obtained with the trajectory prediction method provided by the embodiments of the present disclosure is very high.
  • a set of LSTM or CNN networks are used to encode the object’s motion history and multimodal contextual input, depending on the specific form of each data item. After the encoded features are concatenated into fused features, they are fed into the decoder to jointly predict future trajectories and latent intents.
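  • The following is a rough PyTorch-style sketch of this encode-and-concatenate stage, not the patent's actual implementation; the hidden sizes, the single-channel map raster, and the CNN layout are assumptions.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Sketch: one bidirectional LSTM per time-series stream plus a small CNN
    for the rasterized local map; the encoded vectors are concatenated into
    the fused feature (sizes are illustrative)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.pos_enc  = nn.LSTM(2, hidden, batch_first=True, bidirectional=True)
        self.body_enc = nn.LSTM(2, hidden, batch_first=True, bidirectional=True)
        self.face_enc = nn.LSTM(2, hidden, batch_first=True, bidirectional=True)
        self.map_enc  = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, pos, body, face, local_map):
        # Take each encoder's output at the last observed time step.
        feats = [enc(x)[0][:, -1] for enc, x in
                 [(self.pos_enc, pos), (self.body_enc, body), (self.face_enc, face)]]
        feats.append(self.map_enc(local_map))
        return torch.cat(feats, dim=-1)   # fused feature e_T
```

  • For example, with five observed frames, pos, body, and face would be (batch, 5, 2) tensors and local_map a (batch, 1, H, W) raster of the local map area.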
  • for each time step t (for example, t may take a value greater than 0 and less than T), the observation of the i-th object is expressed as o_t = (p_t, c_t), where p_t is the position information and c_t is the contextual information.
  • given observations over the discrete time interval t ∈ [T−n : T], the embodiments of the present disclosure predict the future positions of the object for t ∈ [T+1 : T+m] together with the intent I_T.
  • T is the last observation time (for example, the value of T can be greater than 0 and less than 5 minutes), and n, m are the observation duration and the prediction duration respectively (for example, n and m can be real numbers greater than 0 and less than 5 minutes).
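  • As a plain-Python illustration of this notation, the field names below are assumptions introduced only for readability:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Observation:
    position: Tuple[float, float]   # p_t: point (x, y) in the horizontal plane
    context: Dict[str, object]      # c_t: e.g. face/body orientation, local map patch

# Observations of one object over the discrete interval t in [T-n, T]; from these
# the model predicts the positions for t in [T+1, T+m] and the intent I_T.
history: List[Observation] = []
```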
  • Embodiments of the present disclosure use a set of bidirectional LSTM networks as the first neural network to encode multi-source input data.
  • the historical trajectory of the object, p_{T-m:T}, is directly fed into the LSTM to obtain the hidden state at time T, which serves as the motion history feature.
  • Contextual information is processed according to its specific form.
  • semantic attributes such as face orientation and headlight status are closely related to object intent and future motion, and reflect intrinsic properties of objects that cannot be obtained from motion history.
  • Local maps provide road structures to standardize trajectory predictions.
  • the sequences of orientations (i.e., face, body, and vehicle heading) and the sequence of lamp states are each directly input into independent bidirectional LSTMs.
  • the embodiment of the present disclosure uses the local map once within the observation time T to reduce redundancy.
  • the original map is first rasterized, and then the rasterized map is input into the CNN model to extract time-series position information and time-series attitude information of the map.
  • all encoded vectors are concatenated as the fused feature embedded at time T, as in Equation (1):
  • e_T = φ(p_{T-m:T}, c_{T-m:T})              Equation (1)
  • where φ represents the transform function of the entire encoder.
  • Embodiments of the present disclosure model intent prediction as a classification problem. Among them, the model predicts the posterior probability distribution over a finite set of intents based on the fused features e T of a given object.
  • the embodiments of the present disclosure use a multilayer perceptron (MLP) followed by a softmax layer as the intent classifier.
  • during training, the embodiments of the present disclosure minimize the cross-entropy loss, as shown in formula (2); the quantity involved is the predicted probability of the true intent at time T (whose index is denoted k_T).
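  • A minimal sketch of such an MLP-plus-softmax intent classifier trained with cross-entropy; the layer sizes, the six-way intent set, and the fused-feature dimension are assumptions rather than the patent's configuration:

```python
import torch
import torch.nn as nn

num_intents = 6                    # e.g. straight, left, right, still, U-turn, other (assumed)
fused_dim = 416                    # dimension of the fused feature e_T (assumed)

intent_head = nn.Sequential(
    nn.Linear(fused_dim, 128), nn.ReLU(),
    nn.Linear(128, num_intents))   # logits; the softmax is folded into the loss below

e_T = torch.randn(8, fused_dim)                       # a batch of fused features
true_intent = torch.randint(0, num_intents, (8,))     # index k_T of the true intent

# Cross-entropy over the predicted intent distribution, in the spirit of formula (2).
loss_int = nn.CrossEntropyLoss()(intent_head(e_T), true_intent)
```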
  • Embodiments of the present disclosure treat trajectory prediction as a sequence generation task and employ an LSTM decoder to predict object motion at each future time step.
  • the fused feature e_T is fed into the decoder at the beginning.
  • embodiments of the present disclosure obtain an intent embedding feature by passing the output of the intent classifier through another fully connected layer, and use the intent embedding feature as an auxiliary input of the trajectory decoder, which provides a good conditioning signal for trajectory prediction.
  • the embodiments of the present disclosure minimize a Gaussian-like loss function L_Traj during training, where (x_t, y_t) is the ground-truth position at time t and σ_t, μ_t, ρ_t are the predicted parameters of the Gaussian distribution representing the trajectory prediction; by optimizing the global loss function L = L_Traj + L_Int, the neural network of the embodiments of the present disclosure can be trained end to end in a multi-task manner, and in some embodiments the Gaussian mean may also be used as the predicted trajectory position.
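  • The sketch below shows one common form such a Gaussian-like loss can take, a bivariate Gaussian negative log-likelihood over the predicted positions; the patent does not spell out the exact expression, so this parameterization is an assumption offered only to illustrate the idea:

```python
import math
import torch

def gaussian_traj_loss(mu, sigma, rho, gt_xy):
    """Bivariate Gaussian negative log-likelihood (illustrative).
    mu, gt_xy: (B, T, 2); sigma: (B, T, 2) positive std devs; rho: (B, T) in (-1, 1)."""
    dx = (gt_xy[..., 0] - mu[..., 0]) / sigma[..., 0]
    dy = (gt_xy[..., 1] - mu[..., 1]) / sigma[..., 1]
    one_minus_rho2 = 1.0 - rho ** 2
    z = dx ** 2 + dy ** 2 - 2.0 * rho * dx * dy
    nll = (z / (2.0 * one_minus_rho2)
           + torch.log(sigma[..., 0] * sigma[..., 1])
           + 0.5 * torch.log(one_minus_rho2)
           + math.log(2.0 * math.pi))
    return nll.mean()

# Global multi-task loss as described above: L = L_Traj + L_Int
# total_loss = gaussian_traj_loss(mu, sigma, rho, gt_xy) + loss_int
```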
  • the following description takes the object as a pedestrian as an example:
  • Table 1 lists the accuracy of the body orientation and face orientation acquired at different acquisition distances. The pedestrian's position, body orientation, and face orientation are used to represent the dynamic situation of the pedestrian, while the local map area is used to represent the static surrounding environment.
  • the position, body orientation, and face orientation (that is, the time-series position information and time-series posture information of pedestrians) may be regarded as dynamic features, and the local map area may be regarded as a static feature.
  • as shown in Table 1, the accuracy of the face direction and the body direction is related to the distance from the pedestrian to the ego vehicle: the longer the distance, the lower the accuracy of these features. The weights of the different kinds of time-series position information and time-series attitude information are therefore adjusted according to distance.
  • Embodiments of the present disclosure use the embedding function φ to express this relationship, where the input is the distance between the i-th pedestrian and the ego vehicle at time step t.
  • W_dis represents the input-to-output transformation parameter in the second neural network; after different distances are input into the second neural network, the corresponding weight vectors for the position, face orientation, and body orientation are output.
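  • A minimal sketch of this distance-conditioned re-weighting; the feature dimension, the sigmoid squashing, and the single linear layer standing in for W_dis are assumptions:

```python
import torch
import torch.nn as nn

feat_dim = 128            # dimension of each first adjusted time-series feature (assumed)
num_streams = 3           # position, face orientation, body orientation

# Embedding function phi: maps the pedestrian-to-ego distance to one weight
# vector per feature stream (the linear layer's weights stand in for W_dis).
w_dis = nn.Sequential(nn.Linear(1, num_streams * feat_dim), nn.Sigmoid())

dist = torch.tensor([[12.5]])                       # distance at time step t (metres)
weights = w_dis(dist).view(1, num_streams, feat_dim)

first_adjusted = torch.randn(1, num_streams, feat_dim)
second_adjusted = weights * first_adjusted          # element-wise re-weighting
```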
  • Pedestrians follow basic traffic rules, which are related to their corresponding local road structures.
  • the local map area is the basic static environment for the future trajectory prediction of pedestrians.
  • FIG. 6 is a structural diagram of an implementation framework of a trajectory prediction method according to an embodiment of the present disclosure. As shown in FIG. 6, first, the time-series position information and time-series attitude information of the pedestrian 61, for example the face orientation, the body orientation, and the position of the pedestrian 61, are extracted from images 601 to 60n, together with a local map area determined from the body orientation and the position.
  • the position, body orientation, and face orientation of the pedestrian 61 are separately input into three independent first neural networks 62, 63, and 64 (for example, bidirectional LSTM networks) to obtain the time-series features (i.e., time-series position information and time-series pose information) indicating how the body orientation, the face orientation, and the position of the sample object change over time; these time-series features are then input into another second neural network 65 (for example, a bidirectional LSTM network) to obtain the first adjusted time-series features. Different distances are input into the fully connected model 68 to obtain the weights corresponding to the body orientation, face orientation, and position at those distances, and the weights are multiplied with the first adjusted time-series features to obtain the second adjusted time-series features.
  • the coded map 602 is expanded into a one-dimensional feature vector, which is encoded and input into another bidirectional LSTM network, namely the first neural network 66, to obtain the time-series feature corresponding to this feature vector. This time-series feature is used as an auxiliary feature for the time-series features corresponding to the time-series position information and time-series attitude information of the pedestrian 61, and these features are spliced to obtain the fusion feature. The fusion feature is then decoded by the decoding neural network 67 to obtain the predicted future trajectory of the pedestrian, namely the dotted line 69; the solid line 70 is the true future trajectory of the pedestrian 61. It can be seen that the prediction result of the network model adopted in the embodiment of the present disclosure is very accurate.
  • Embodiments of the present disclosure employ mask encoding for the local map region, resulting in the coded map 602, where each codeword is populated with a specific integer associated with its semantic road structure class.
  • for the i-th pedestrian at time step t, the local map region corresponding to the pedestrian is first determined according to the pedestrian's position and body orientation. The local map region is then uniformly discretized into a grid, where each cell is represented by the structure-specific number of its main semantic road structure class. For example, "crosswalk" and "sidewalk" are represented by the number "1", "dangerous place" by "-1", and others by "0", yielding the grid 603 that divides dangerous and safe areas.
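  • A small NumPy sketch of this mask encoding; the grid size is an assumption, the category-to-integer mapping follows the example values above, and the rasterization of HD-map polygons into per-cell labels is left abstract:

```python
import numpy as np

GRID = 32                                                 # discretization of the local map area
CLASS_TO_CODE = {"crosswalk": 1, "sidewalk": 1, "dangerous": -1}   # all other classes -> 0

def encode_local_map(cell_labels):
    """cell_labels: (GRID, GRID) array of semantic class names per grid cell,
    e.g. produced by rasterizing the HD-map polygons around the pedestrian."""
    coded = np.zeros((GRID, GRID), dtype=np.int8)
    for cls, code in CLASS_TO_CODE.items():
        coded[cell_labels == cls] = code
    return coded

# The coded map can then be flattened to a 1-D vector (as in FIG. 6) or fed to a CNN.
labels = np.full((GRID, GRID), "road", dtype=object)
labels[10:14, :] = "crosswalk"
coded_map = encode_local_map(labels)
```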
  • the coded dynamic features (i.e., the time-series position information and time-series attitude information of pedestrians) and the coded static features (i.e., the local map region) are concatenated for prediction, and a simple LSTM network is used to predict the future trajectory of the pedestrian.
  • the preset data set of historical data provided by the embodiments of the present disclosure is a large-scale, information-rich trajectory data set intended to facilitate the task of pedestrian trajectory prediction in automatic driving.
  • the data set provides multiple evaluation criteria, such as the average number of failed predictions for future trajectories whose length exceeds a preset threshold, the success rate of future trajectories under error thresholds corresponding to different distances, and the error between the end position of the future trajectory and the end position of the ground-truth trajectory;
  • these criteria are used to evaluate the accuracy and robustness of the prediction model; thus, even in very complex scenarios, the neural network can still accurately predict the future trajectories of pedestrians.
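  • As an illustration of such criteria, the sketch below computes two of them, the error between the end positions of the predicted and ground-truth trajectories and the success rate under distance-dependent error thresholds; the threshold values are assumptions:

```python
import numpy as np

def endpoint_error(pred, gt):
    """Error between the end position of the predicted and ground-truth trajectories.
    pred, gt: arrays of shape (T, 2)."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))

def success_rate(preds, gts, dists, base_tol=0.5, per_meter=0.05):
    """Fraction of trajectories whose endpoint error stays below a distance-dependent
    threshold (the threshold grows with the object-to-ego distance; values are assumed)."""
    ok = 0
    for pred, gt, d in zip(preds, gts, dists):
        if endpoint_error(pred, gt) <= base_tol + per_meter * d:
            ok += 1
    return ok / max(len(preds), 1)
```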
  • FIG. 7 is a schematic structural diagram of the trajectory prediction device according to an embodiment of the present disclosure.
  • the device 700 includes:
  • the intention determination module 701 is configured to determine the motion intention of the object according to the time-series position information and the time-series attitude information of the object, wherein the time-series position information is the position information of the object at different time points within a preset duration, and the time-series attitude information is the attitude information of the object at different time points within the preset duration; the attitude information at the different time points includes the orientation information of the object at the different time points;
  • the future trajectory determination module 702 is configured to determine the future trajectory of the object according to the time-series position information, the time-series posture information, and the motion intention.
  • the intent determination module 701 includes: a map interception sub-module, configured to obtain environmental information of the environment where the object is located according to the time-series position information and the time-series attitude information; a feature fusion sub-module, configured as The environmental information, the time-series position information and the time-series attitude information are fused to obtain a fusion feature; an intention prediction sub-module is configured to determine the motion intention of the object according to the fusion feature; the future trajectory determination module 702 , comprising: a trajectory prediction sub-module, configured to determine the future trajectory of the object according to the fusion feature and the motion intention.
  • the object includes at least one of a human object and a non-human object
  • in the case that the object includes the human object, the posture information at the different time points includes: the orientation information, at the different time points, of parts of the human object, where the parts include at least one of the following: limbs and face;
  • in the case that the object includes the non-human object, the non-human object includes at least one of the following: a vehicle, an animal, and a movable device;
  • the posture information at the different time points includes: orientation information and driving instruction information of the non-human object at the different time points.
  • the device further includes: a historical moment determination module, configured to determine at least two historical moments whose duration from the current moment is less than or equal to a specific duration; and a feature information acquisition module, configured to acquire the time-series position information and time-series attitude information of the object at the at least two historical moments.
  • the map interception sub-module includes: a map interception unit, configured to determine the environment information according to the position information and orientation information of the object at any historical moment, wherein the environment information includes at least one of the following: road information, pedestrian information, or traffic light information.
  • the map interception unit is further configured to: delimit, in the world map, a local map area of the environment where the object is located, taking the position information as the center and following the orientation information; and encode the elements in the local map area to obtain the environment information.
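  • A minimal sketch of delimiting such a local map area, by rotating map points into a frame centered on the object and aligned with its orientation and keeping the points inside a fixed window; the window size is an assumption:

```python
import numpy as np

def crop_local_map(points, center, heading, width=20.0, length=40.0):
    """points: (N, 2) map element coordinates in the world frame;
    center: (x, y) position of the object; heading: orientation angle in radians.
    Returns the points inside a length-by-width window ahead of the object,
    expressed in the object-centered, orientation-aligned frame."""
    c, s = np.cos(-heading), np.sin(-heading)
    rot = np.array([[c, -s], [s, c]])
    local = (points - np.asarray(center)) @ rot.T   # world frame -> object frame
    keep = (np.abs(local[:, 1]) <= width / 2) & (local[:, 0] >= 0) & (local[:, 0] <= length)
    return local[keep]
```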
  • the feature fusion sub-module includes: a time-series position and attitude information determination unit, configured to predict, through the first neural network, the time-series position information and time-series attitude information in a future period according to the time-series position information and time-series attitude information; and a feature splicing unit, configured to splice the time-series position information and time-series attitude information in the future period with the environment information in a preset manner to obtain the fusion feature.
  • the intent prediction sub-module includes: a confidence determination unit, configured to determine, through the second neural network, the confidence that the fusion feature belongs to each intent category in the intent category library; and an intent prediction unit, configured to determine the motion intent of the object from the intent category with the highest confidence.
  • the trajectory prediction sub-module includes: an iteration step unit, configured to determine an iteration step according to the length of the future period; a feature iteration unit, configured to iterate, with the iteration step and using the first neural network, over the motion intention and the fusion feature to obtain the coordinates of the object at each iteration step; and a future trajectory determination unit, configured to determine the future trajectory according to the coordinates of the object at each iteration step.
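  • A sketch of this step-wise decoding loop; the cell sizes are assumptions, and the 0.3 s step matches the frame sampling mentioned above:

```python
import torch
import torch.nn as nn

hidden, fused_dim, num_intents = 64, 416, 6           # illustrative sizes
decoder = nn.LSTMCell(fused_dim + num_intents, hidden)
to_xy = nn.Linear(hidden, 2)

def decode_trajectory(fused, intent, horizon_s=3.0, step_s=0.3):
    """Iterate once per step; the number of steps follows from the length of the
    future period divided by the iteration step."""
    n_steps = int(round(horizon_s / step_s))
    h = torch.zeros(fused.size(0), hidden)
    c = torch.zeros(fused.size(0), hidden)
    coords = []
    for _ in range(n_steps):
        h, c = decoder(torch.cat([fused, intent], dim=-1), (h, c))
        coords.append(to_xy(h))                        # coordinates at this iteration step
    return torch.stack(coords, dim=1)                  # (batch, n_steps, 2) future trajectory
```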
  • the device further includes a first training module configured to train the first neural network
  • the first training module includes: a prediction sub-module, configured to input the time-series position information and time-series attitude information of the object into the first neural network to be trained and predict the time-series position information and time-series attitude information of the object in the future period; a prediction feature fusion sub-module, configured to fuse the time-series position information and time-series attitude information in the future period with the environment information of the environment where the object is located to obtain a fused prediction feature; a future trajectory prediction sub-module, configured to predict the future trajectory of the object in the future period at least according to the fused prediction feature; a first prediction loss determination sub-module, configured to determine, based on the ground-truth trajectory of the object, the first prediction loss of the first neural network to be trained with respect to the future trajectory; and a first neural network parameter adjustment sub-module, configured to adjust the network parameters of the first neural network to be trained according to the first prediction loss to obtain the first neural network.
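  • A compressed, self-contained sketch of one such training step; the optimizer, the L2 stand-in for the first prediction loss, and the tiny stand-in network are all assumptions:

```python
import torch
import torch.nn as nn

class TinyTrajNet(nn.Module):
    """Stand-in for the first neural network together with its fusion and decoding stages."""
    def __init__(self, horizon=10):
        super().__init__()
        self.enc = nn.LSTM(2, 32, batch_first=True)
        self.head = nn.Linear(32, horizon * 2)
        self.horizon = horizon

    def forward(self, pos_hist):
        h = self.enc(pos_hist)[0][:, -1]               # feature of the observed sequence
        return self.head(h).view(-1, self.horizon, 2)  # predicted future positions

model = TinyTrajNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

pos_hist = torch.randn(8, 5, 2)                        # observed time-series positions
gt_future = torch.randn(8, 10, 2)                      # ground-truth future trajectory

# One training step: predict, compute the first prediction loss against the
# ground-truth trajectory (an L2 error as a stand-in), then adjust the parameters.
loss = ((model(pos_hist) - gt_future) ** 2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```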
  • the device further includes a second training module configured to train a second neural network
  • the second training module includes: a category confidence determination sub-module, configured to input the fusion feature into the second neural network to be trained and predict the confidence that the motion intention of the object is each intention category in the intention category library; a second prediction loss determination sub-module, configured to determine, according to the true intention of the object, a second prediction loss of the second neural network to be trained with respect to the confidence of each intention category; and a second neural network parameter adjustment sub-module, configured to adjust the network parameters of the second neural network to be trained according to the second prediction loss to obtain the second neural network.
  • an embodiment of the present disclosure further provides a computer program product, wherein the computer program product includes computer-executable instructions, and after the computer-executable instructions are executed, the trajectory prediction method provided by the embodiment of the present disclosure can be implemented.
  • an embodiment of the present disclosure further provides a computer storage medium, where computer-executable instructions are stored thereon, and when the computer-executable instructions are executed by a processor, the trajectory prediction method provided by the foregoing embodiments is implemented.
  • Embodiments of the present disclosure further provide a computer program, where the computer program includes computer-readable codes; when the computer-readable codes run in an electronic device, a processor of the electronic device executes the trajectory prediction method provided by the above embodiments.
  • FIG. 8 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
  • the device 800 includes: a processor 801, at least one communication bus, a communication interface 802, at least one external communication interface, and a memory 803.
  • the communication interface 802 is configured to realize the connection communication between these components.
  • the communication interface 802 may include a display screen, and the external communication interface may include a standard wired interface and a wireless interface.
  • the processor 801 is configured to execute an image processing program in the memory, so as to implement the trajectory prediction method provided by the above embodiment.
  • the above-mentioned memory may be a volatile memory, such as a random access memory (RAM); or a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above types of memory, and it provides instructions and data to the processor.
  • the above processor may be at least one of an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a central processing unit (CPU), a controller, a microcontroller, and a microprocessor.
  • the disclosed apparatus and method may be implemented in other manners.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical functional division; in actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical, or in other forms.
  • the units described above as separate components may or may not be physically separated, and a component displayed as a unit may or may not be a physical unit; it may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit can be implemented either in the form of hardware or in the form of hardware plus software functional units.
  • when the above-mentioned integrated units of the present disclosure are implemented in the form of software functional modules and sold or used as independent products, they may also be stored in a computer-readable storage medium.
  • the technical solutions of the embodiments of the present disclosure, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage medium includes various media that can store program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
  • Embodiments of the present disclosure provide a trajectory prediction method, apparatus, device, storage medium, and program, wherein the motion intention of an object is determined according to the time-series position information and time-series attitude information of the object; the time-series position information is the position information of the object at different time points within a preset duration, and the time-series attitude information is the attitude information of the object at different time points within the preset duration, where the attitude information at the different time points includes the orientation information of multiple parts of the object at the different time points; and the future trajectory of the object is determined according to the time-series position information, the time-series attitude information, and the motion intention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Traffic Control Systems (AREA)
  • Image Analysis (AREA)

Abstract

A trajectory prediction method, comprising: determining the motion intention of an object according to time-series position information and time-series attitude information of the object, wherein the time-series position information is position information of the object at different time points within a preset duration, and the time-series attitude information is attitude information of the object at different time points within the preset duration, the attitude information at the different time points comprising orientation information of multiple parts of the object at the different time points; and determining the future trajectory of the object according to the time-series position information, the time-series attitude information, and the motion intention. By combining the time-series position information, the time-series attitude information, and the motion intention, and by taking the orientation information of the object into account, the accuracy of predicting the future trajectory of the object can be effectively improved. A trajectory prediction apparatus, a computer storage medium, a computer device, and a computer program are also disclosed.

Description

轨迹预测方法、装置、设备、存储介质及程序
相关申请的交叉引用
本专利申请要求2020年7月31日提交的中国专利申请号为202010763409.4、申请人为商汤集团有限公司和本田技研工业株式会社,申请名称为“轨迹预测方法、装置、设备及存储介质”的优先权,该申请的全文以引用的方式并入本申请中。
技术领域
本公开实施例涉及智能驾驶技术领域,涉及但不限于一种轨迹预测方法、装置、设备、存储介质及程序。
背景技术
预测行人或车辆的运动轨迹的过程中,主要考虑行人或车辆轨迹的历史运动的内在关联,如利用行人或车辆的历史轨迹位置信息来做未来时刻的轨迹预测。
发明内容
本公开实施例提供一种轨迹预测方法、装置、设备、存储介质及程序。
本公开实施例提供一种轨迹预测方法,所述方法由电子设备执行,所述方法包括:
根据对象的时序位置信息和时序姿态信息,确定所述对象的运动意图;其中,所述时序位置信息为所述对象在预设时长内不同时间点的位置信息,所述时序姿态信息为所述对象在预设时长内不同时间点的姿态信息;所述不同时间点的姿态信息包括所述对象的多个部位在所述不同时间点的朝向信息;
根据所述时序位置信息、所述时序姿态信息以及所述运动意图,确定所述对象的未来轨迹。
通过考虑对象的更加丰富的输入信息,能够更加准确的确定出对象的运动意图;然后,基于估计的运动意图、时序位置信息和时序姿态信息作为输入,来预测对象的未来轨迹,而且在预测的过程中使用有关对象的朝向信息;如此,通过将时序位置信息、时序姿态信息和运动意图相结合,且考虑到对象的朝向信息,能够有效提高预测对象的未来轨迹的准确率。
本公开实施例提供一种轨迹预测装置,所述装置包括:
意图确定模块,配置为根据对象的时序位置信息和时序姿态信息,确定所述对象的运动意图;其中,所述时序位置信息为所述对象在预设时长内不同时间点的位置信息,所述时序姿态信息为所述对象在预设时长内不同时间点的姿态信息;所述不同时间点的姿态信息包括所述对象的多个部位在所述不同时间点的朝向信息;
未来轨迹确定模块,配置为根据所述时序位置信息、所述时序姿态信息以及所述运动意图,确定所述对象的未来轨迹。
本公开实施例提供一种计算机存储介质,所述计算机存储介质上存储有计算机可执行指令,该计算机可执行指令被执行后,能够实现上述所述的轨迹预测方法。
本公开实施例提供一种计算机设备,所述计算机设备包括存储器和处理器,所述存储器上存储有计算机可执行指令,所述处理器运行所述存储器上的计算机可执行指令时可实现上述所述的轨迹预测方法。
本公开实施例还提供一种计算机程序,所述计算机程序包括计算机可读代码,在所述计算机可读代码在电子设备中运行的情况下,所述电子设备的处理器执行用于实现上述所述的轨迹预测方法。
本公开实施例提供一种轨迹预测方法、装置、设备、存储介质及程序,使用对象的时序位置信息和时序姿态信息作为输入,来估计对象的运动意图,如此,通过考虑对象的更加丰富的输入信息,能够更加准确的确定出对象的运动意图;然后,基于估计的运 动意图、时序位置信息和时序姿态信息作为输入,来预测对象的未来轨迹,而且在预测的过程中使用有关对象的朝向信息;如此,通过将时序位置信息、时序姿态信息和运动意图相结合,且考虑到对象的朝向信息,能够有效提高预测对象的未来轨迹的准确率。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,而非限制本公开实施例。根据下面参考附图对示例性实施例的详细说明,本公开实施例的其它特征及方面将变得清楚。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,这些附图示出了符合本公开的实施例,并与说明书一起用于说明本公开的技术方案。
图1为本公开实施例轨迹预测方法的实现流程示意图;
图2为可以应用本公开实施例的轨迹预测方法的一种***架构示意图;
图3A为本公开实施例轨迹预测方法的另一实现流程示意图;
图3B为本公开实施例轨迹预测方法的另一实现流程示意图;
图4A为本公开实施例数据集中的对象分布以及每种对象类型的意图分布示意图;
图4B为本公开实施例数据集中的对象分布以及每种对象类型的意图另一分布示意图;
图4C为本公开实施例数据集中的对象分布以及每种对象类型的意图再一分布示意图;
图4D为本公开实施例数据集中的对象分布以及每种对象类型的意图又一分布示意图;
图5为本公开实施例提供的轨迹预测***的框架示意图;
图6为本公开实施例轨迹预测方法的实现框架结构图;
图7为本公开实施例轨迹预测装置结构组成示意图;
图8为本公开实施例计算机设备的组成结构示意图。
具体实施方式
为使本公开实施例的目的、技术方案和优点更加清楚,下面将结合本公开实施例中的附图,对发明的具体技术方案做进一步详细描述。以下实施例用于说明本公开,但不用来限制本公开的范围。
本实施例提出一种轨迹预测方法应用于计算机设备,所述计算机设备可包括对象或非对象,该方法所实现的功能可以通过计算机设备中的处理器调用程序代码来实现,当然程序代码可以保存在计算机存储介质中,可见,该计算机设备至少包括处理器和存储介质。
图1为本公开实施例轨迹预测方法的实现流程示意图,如图1所示,结合如图1所示方法进行说明:
步骤S101,根据对象的时序位置信息和时序姿态信息,确定对象的运动意图。
在本公开的一些实施例中,所述时序位置信息为所述对象在预设时长内不同时间点的位置信息,所述时序姿态信息为所述对象在预设时长内不同时间点的姿态信息。其中,对象为交通环境中的可运动的对象,包括人体对象,比如,行人或骑自行车的人等。还包括非人体对象,所述非人体对象包括但不限于以下至少之一:各种各样功能的车辆(如卡车、汽车、摩托车、自行车等)、各种轮数的车辆(如四轮车辆、两轮车辆等)和任意可移动设备,比如,机器人、飞行器、导盲器、智能玩具、玩具汽车等。如果对象包括人体对象,不同时间点的姿态信息包括所述人体对象的一个或多个部位在所述不同时间点的朝向信息。通过考虑在预设时长内不同时间点,对象的一个或多个不同部位的朝向信息和位置信息,来预估对象的运动意图,能够提供预测的运动意图的准确度。
步骤S102,根据所述时序位置信息、所述时序姿态信息以及所述运动意图,确定所述对象的未来轨迹。
在本公开的一些实施例中,运动意图为对象在未来时段内的运动倾向,比如,对象为行人,运动意图为在未来时段内是否打算过红绿灯,或是否打算直行等。将时序位置信息、所述时序姿态信息以及所述运动意图相结合,作为一个整体,输入到神经网络中,来预测对象的未来轨迹。比如,将时序位置信息和时序姿态信息按照预设方式拼接在一起,作为融合特征,共同参考该融合特征和运动意图,来预测对象的未来轨迹。
在本公开实施例中,使用对象的时序位置信息和时序姿态信息(作为学习模型的输入来估计行人的意图(比如,是否打算过马路等),这样,通过考虑运动对象的更加丰富的时序位置信息和时序姿态信息,能够更加准确的确定出运动对象的运动意图;然后,基于估计的对象意图和学习模型的输出来预测对象的未来轨迹,而且在估算对象的意图时使用有关对象的多个部分中每个部分的方向的时间序列信息;如此,通过将位置和姿势的时间序列信息和运动意图相结合,预测运动对象的未来轨迹,从而能够有效提高对于未来轨迹预测的准确率。
图2示出可以应用本公开实施例的轨迹预测方法的一种***架构示意图;如图2所示,该***架构中包括:获取终端201、网络202和轨迹预测终端203。为实现支撑一个示例性应用,当获取终端201和轨迹预测终端203通过网络202建立通信连接,获取终端201通过网络202向轨迹预测终端203上报对象的时序位置信息和时序姿态信息。轨迹预测终端203响应于对象的时序位置信息和时序姿态信息,首先,根据对象的时序位置信息和时序姿态信息,确定所述对象的运动意图;然后,根据所述时序位置信息、所述时序姿态信息以及所述运动意图,确定所述对象的未来轨迹。同时轨迹预测终端203将对象的未来轨迹上传至网络202,并通过网络202发送给获取终端201。
作为示例,获取终端201可以包括图像采集设备,轨迹预测终端203可以包括具有视觉信息处理能力的视觉处理设备或远程服务器。网络202可以采用有线或无线连接方式。其中,当轨迹预测终端203为视觉处理设备时,获取终端201可以通过有线连接的方式与视觉处理设备通信连接,例如通过总线进行数据通信;当轨迹预测终端203为远程服务器时,获取终端201可以通过无线网络与远程服务器进行数据交互。
或者,在一些场景中,当获取终端201可以是带有视频采集模组的视觉处理设备,可以是带有摄像头的主机。这时,本公开实施例的轨迹预测方法可以由获取终端201执行,上述***架构可以不包含网络202和轨迹预测终端203。
在本公开的一些实施例中,将地图信息融入到位置信息和姿态信息中,来预测运动意图,能够提高预测的准确度,步骤S101可以通过以下步骤实现,如图3A所示,结合图3A进行以下说明:
步骤S11,根据所述时序位置信息和所述时序姿态信息,获取所述对象所处环境的环境信息。
在本公开的一些实施例中,所述环境信息至少包括下列中的至少一个:道路信息、行人信息或交通灯信息。通过参考对象的时序位置信息和时序姿态信息中的朝向信息,对世界地图进行截取,以得到对象所在环境的局部地图区域,从而得到该对象的局部地图信息,将该局部地图信息确定为所述环境信息。对象在历史时刻的时序位置信息和时序姿态信息,可以通过以下过程得到:首先,确定距离当前时刻的时长小于等于预设时长的至少两个历史时刻;然后,获取所述对象在至少两个历史时刻的时序位置信息和时序姿态信息。可以理解为是,获取的距离当前时刻的时长小于预设时长的多个历史时刻的时序位置信息和时序姿态信息。这样,通过获取不同历史时刻下的时序位置信息和时序姿态信息,作为预测未来轨迹的输入信息,能够提高预测到的未来轨迹的准确度。
在本公开的一些实施例中,当前时刻为10:05:20,获取距离当前时刻小于5秒内的 对象的时序位置信息和时序姿态信息,即获取10:05:15至10:05:20之间的对象的时序位置信息和时序姿态信息。其中,时序位置信息和时序姿态信息与对象的属性相关。比如,对象为行人或骑自行车的人,时序位置信息和时序姿态信息至少包括:人的时序位置信息、身体朝向和面部朝向;假如在这一历史时段之间每间隔1秒获取一组时序位置信息和时序姿态信息,比如,如果所述时序位置信息和所述时序姿态信息包括对象的身体朝向、面部朝向和所述对象所处的位置,那么确定每一时刻点的对象的身体朝向、面部朝向和所述对象所处的位置。比如,时刻10:05:15至10:05:20,每间隔1秒获取一组时序位置信息和时序姿态信息,即有5个时刻点距离,那么确定5组对象的身体朝向、面部朝向和所述对象所处的位置。
在本公开的一些实施例中,如果对象为车辆等运动设备,时序位置信息和时序姿态信息至少包括:该运动设备的时序位置信息、设备头部朝向和所述运动设备的行驶指示信息。以车辆为例进行说明:时序位置信息和时序姿态信息包括:车辆的时序位置、车头朝向和车辆的行驶指示信息;其中,行驶指示信息包括但不限于以下至少之一:行驶方向、行驶速度和车灯状态(比如,转向灯的状态)等。如此,将获取的这些丰富的时序位置信息和时序姿态信息,作为截取世界地图的依据,得到对象所在环境的环境信息。也就是说,环境信息可以是通过时序位置信息和时序姿态信息中的对象的位置信息和对象的朝向信息,对世界地图进行截取,以确定出该对象当前所在的局部地图中的道路结构、人行道信息和道路中的交通灯信息等;这样,通过获取对象丰富的时序位置信息和时序姿态信息,来预测对象当前所在的道路结构等环境信息,能够提高地图划分的准确度。即使在观测点较少(甚至只有一帧观测数据)时,仍然能够给出合理的预测结果。
步骤S12,将所述环境信息、所述时序位置信息和时序姿态信息进行融合,得到融合特征。
在本公开的一些实施例中,获取到对象的时序位置信息和时序姿态信息之后,对时序位置信息和时序姿态信息中的每一特征进行独立的时间建模。比如,以人体为例进行说明,时序位置信息和时序姿态信息包括:身体朝向、面部朝向和对象所处的位置;分别将身体朝向、面部朝向和对象所处的位置;单独输入三个独立的第一神经网络中,分别得到用于表明身体朝向、面部朝向和对象所处的位置在时间上的变化情况的时序位置信息和时序姿态信息;将时序位置信息和时序姿态信息输入第二神经网络中,得到调整的时序位置信息和调整的时序姿态信息;将多个不同的距离输入第三神经网络(比如,全连接模型)中,得到该距离下身体朝向、面部朝向和对象所处的位置对应的权重;将该权重与调整的时序位置信息和调整的时序姿态信息相乘,得到相乘结果;将相乘结果与对局部地图区域进行编码后得到的环境信息进行拼接,得到融合特征。
在本公开的一些实施例中,由于时序位置信息、时序姿态信息和环境信息是在同一时间点下获取的。比如,都是针对历史时段内的5个时间点,所以将相乘结果与对局部地图区域进行编码后得到的环境信息进行拼接,可以通过以下方式实现:将表征相乘结果的矩阵与表征环境信息的矩阵按照行或列,拼接在一起,组成一个矩阵,即得到融合特征。假设表征相乘结果的矩阵为3行5列的矩阵,表征环境信息的矩阵为6行5列的矩阵,那么两个矩阵按照列拼接在一起,得到9行5列的矩阵,即得到融合特征。
步骤S13,根据融合特征,确定对象的运动意图。
在本公开的一些实施例中,运动意图可以理解为:对象在运动过程中的运动倾向,如果对象包括人体对象,意图分类包括但不限于以下一种或多种:左转、右转、直行、静止、掉头、加速、减速、横穿马路、等红灯以及倒着走等。如果对象包括非人体对象,意图分类包括但不限于以下一种或多种:左转、右转、直行、静止、左换道、右换道、加速、减速、超车、倒车以及等红灯等。
在本公开的一些实施例中,通过采用全连接层网络对融合特征进行解码,得到该融 合特征为预设类别库中每一种类别的概率,将概率最大的类别作为该融合特征最可能的类别,基于这样最可能的类别来预测对象的运动意图,能够提高预测意图的准确度。
对应地,在本公开实施例中,步骤S102可以如下方式实现:
步骤S14,根据所述融合特征和所述运动意图,确定所述对象的未来轨迹。
在本公开的一些实施例中,可以通过融合特征和运动意图,预测对象在未来时段内的未来轨迹;还可以不预测对象的运动意图,仅采用第一神经网络对融合特征进行多次迭代,预测对象在未来时段内的未来轨迹。比如,对第二调整时序位置信息和时序姿态信息进行解码,即可得到预测的对象的未来轨迹;这样通过多种时序位置信息和时序姿态信息进行轨迹预测,即使在观测点较少(甚至只有一帧观测数据),或者在对象突然加速、减速、突然转弯等场景下,依然能够保证未来轨迹预测的准确率。
在本公开实施例中,将地图信息融入到时序位置信息和时序姿态信息中,来预测运动意图,能够提高对于运动意图进行预测的准确度,然后,基于该运动意图预测对象的未来轨迹,能够提高轨迹预测的准确度。
在一些实施例中,为提高预测未来轨迹的输入信息的丰富性,可以通过对象的位置信息和对象的朝向信息对世界地图进行截取,以确定出对象当前环境的局部地图区域,即步骤S11可以通过以下过程实现:
步骤S111,根据对象在历史时刻的位置信息和朝向信息,对世界地图进行截取,以得到对象所在环境的局部地图区域。
在本公开的一些实施例中,时序姿态信息中的朝向信息和位置信息,是成对出现的,即在某一历史时刻确定对象的位置信息,以及在该位置的朝向信息。比如,对象为人体(比如,行人或者骑自行车的人),根据人的位置信息和人的身体朝向,对人所处的当前道路结构进行确定,从而对世界地图进行截取,以确定出行人当前所在的局部地图区域。如果对象为车辆等运动设备,根据车辆的位置信息和车头朝向,对车辆所处的当前道路进行确定,从而对世界地图进行截取,以确定出车辆当前所在的局部地图区域。
在本公开的一些实施例中,因为历史时刻是多个,那么获取到每一历史时刻的时序位置信息和时序姿态信息之后,也会得到多组时序位置信息和时序姿态信息,进而对于每一组时序位置信息和时序姿态信息都可以截取到对应的局部地图区域。对世界地图的截取,可以通过如下方式实现:根据所述多个时序位置信息中的每一位置和对象处于该位置时的朝向,确定所述对象所在环境的局部地图区域,得到多个局部地图区域。对象处于该位置时的朝向,可以理解为,在对象处于这一位置时的多个部位的朝向。这样,参考对象在一个位置时的多个部位的朝向,划定该对象的局部地图区域,能够提高确定的环境信息的准确度,从而提高未来轨迹预测的准确度。
在本公开的一些实施例中,以所述位置信息为中心,按照所述朝向信息,在世界地图中划定所述对象所在环境的局部地图区域。比如,以该位置为中心,沿着朝向方向,划定一个矩形区域,作为对象所在环境的局部地图区域。这样,多个位置和每一位置下的多个朝向信息,可以确定多个局部地图区域。对所述多个局部地图区域进行编码,得到多个编码地图,即环境信息。如此,以所处位置为中心,参考朝向信息,划定局部地图区域,使得划定的局部地图区域中包括的地图信息与对象的相关性较高,即能够提高环境信息的有效性。
步骤S112,对所述局部地图区域中的元素进行编码,得到所述环境信息。
在本公开的一些实施例中,每一元素表示对应区域的地图信息,所述地图信息至少包括下列中的至少一个:道路结构信息、人行道或道路交通灯。比如,将这个局部地图区域的元素编码为掩码,每一码字表示对应区域的地图信息。比如,环境信息为包括1和0的矩阵,其中,1表示人行道,0表示道路危险区域等。最后,将所述多个环境信息和对应的时序位置信息和时序姿态信息进行融合,得到多组融合特征,通过对融合特征 进行分类,预测出对象的运动意图。
在本公开的一些实施例中,第一神经网络的结构不受限定,包括但不限于卷积神经网络、长短期记忆网络(Long Short-Term Memory,LSTM)等,以下为LSTM为例进行介绍,将多个历史时刻的时序位置信息和时序姿态信息(比如,以对象为行人为例,分别将多个身体朝向、多个面部朝向和对象所处的多个位置)输入双向LSTM网络中,分别得到用于表明这些时序位置信息和时序姿态信息在时间上的变化情况的时序位置信息和时序姿态信息;将时序位置信息和时序姿态信息输入另一双向LSTM网络中,得到输出结果;将所述距离输入全连接模型中,得到该距离下身体朝向、面部朝向和对象所处的位置对应的权重;将该权重与调整后的时序位置信息和时序姿态信息相乘,得到多个相乘结果;然后,将多个相乘结果与多个编码地图拼接在一起,形成融合特征;最后,对融合特征进行解码,分类,预测对象的运动意图;或者,采用LSTM网络对融合特征进行多次迭代,通过对每一次迭代得到的坐标进行预测,以得到对象在未来时段内的未来轨迹。如此,通过对世界地图进行截取,得到局部地图区域,并对其中的道路信息进行编码,从而能够将地图信息用于后续的融合特征中,提高用于预测未来轨迹的输入信息的丰富性。
在本公开实施例中,根据对象的位置和朝向,划定对象的局部地图区域;并对该局部地图区域进行掩码编码,得到环境信息,每一码字表示该区域的地图信息。这样,将对象的时序位置信息和时序姿态信息结合编码地图,预测对象的意图,进而预测对象的未来轨迹,能够提高得到的未来轨迹的准确度。
在一些实施例中,对于提取到的对象的时序位置信息和时序姿态信息,各自进行时序建模,以得到每一个时序位置信息和时序姿态信息在时序上的变化情况,然后将每一时序位置信息和时序姿态信息的时序位置信息和时序姿态信息和环境信息进行融合,得到融合特征,即步骤S12,可以通过以下过程实现,如图3B所示,图3B为本公开实施例轨迹预测方法的另一实现流程示意图,结合图3A和图3B所示的步骤进行以下说明:
步骤S201,通过第一神经网络根据所述时序位置信息和时序姿态信息,预测在未来时段内的时序位置信息和时序姿态信息。
在本公开的一些实施例中,将历史时段内的时序位置信息和时序姿态信息,作为第一神经网络的输入,预测出未来时段内的序位置信息和所述时序姿态信息;步骤S201可以通过以下过程实现:
首先,对每一历史时刻的时序位置信息和时序姿态信息(即多个时序位置信息和时序姿态信息),按照时间顺序进行排列;然后,将排列好的多个时序位置信息和时序姿态信息,输入第一神经网络,得到多个时序位置信息和时序姿态信息。其中,第一神经网络可以是双向LSTM网络,第一神经网络的数量与时序位置信息和时序姿态信息包含的种类相匹配。比如,对象为行人,时序位置信息和时序姿态信息包括:对象的身体朝向、面部朝向和所述对象所处的位置;那么,第一神经网络为三个独立的双向LSTM网络。如果对象为车辆,时序位置信息和时序姿态信息包括:对象的车头朝向、车灯状态和对象所处的位置;那么,第一神经网络为三个独立的双向LSTM网络。
在本公开的一些实施例中,将多个时序位置信息和时序姿态信息输入该双向LSTM网络中,得到对应的时序位置信息和时序姿态信息。比如,对象为行人,将行人在不同时刻的身体朝向、面部朝向和所述行人所处的位置分别输入三个独立的双向LSTM网络,得到分别得到不同时刻的身体朝向对应的多个时序位置信息和时序姿态信息(表明身体朝向在时间上的变化情况)、不同时刻的面部朝向对应的多个时序位置信息和时序姿态信息(表明面部朝向在时间上的变化情况)和,不同时刻行人所处的位置对应的多个时序位置信息和时序姿态信息(表明对象所处的位置在时间上的变化情况)。
在本公开的一些实施例中,如果对象为车辆,将车辆在不同时刻的车头朝向、车灯 状态和车辆所处的位置分别输入三个独立的双向LSTM网络,得到分别得到不同时刻的车头朝向对应的多个时序位置信息和时序姿态信息(表明车头朝向在时间上的变化情况)、不同时刻的车灯状态对应的多个时序位置信息和时序姿态信息(表明车灯状态在时间上的变化情况)和,不同时刻车辆所处的位置对应的多个时序位置信息和时序姿态信息(表明车辆所处的位置在时间上的变化情况)。
在本公开的一些实施例中,该第一神经网络为训练好的神经网络,可以采用以下方式训练得到:
首先,将所述对象在历史时刻的时序位置信息和时序姿态信息输入待训练第一神经网络中,预测所述对象在所述未来时段内的时序位置信息和时序姿态信息。
在本公开的一些实施例中,将对象在历史时刻的时序位置信息和时序姿态信息作为第一神经网络的输入,基于每一组时序位置信息和时序姿态信息预测出该对象在未来时段内对应的预测时序位置信息和时序姿态信息,从而得到预测时序位置信息和时序姿态信息。在一些实施例中,这里的对象可以理解为是样本对象。比如,预设的数据集的样本图像中的行人或者动物等。所述预设的数据集中至少包含样本图像中的样本对象的时序位置信息和时序姿态信息。比如,以样本对象为行人为例进行说明,该预设的数据集至少包含样本图像中样本对象的身体朝向、面部朝向或所述样本对象所处的位置。从这样数据集规模较大,且包含更加丰富的时序位置信息和时序姿态信息的数据集中,获取对象在历史时刻的时序位置信息和时序姿态信息,能够提高获取到的样本数据的丰富性。
其次,将所述未来时段内的时序位置信息、时序姿态信息与所述对象所在环境的环境信息进行融合,得到融合预测特征。
在本公开的一些实施例中,将待训练的第一神经网络预测出的时序位置信息和时序姿态信息与环境信息进行融合,得到融合预测特征。
其次,至少根据融合预测特征,预测对象在未来时段内的未来轨迹。
在本公开的一些实施例中,采用该第一神经网络对融合预测特征进行迭代,从而预测对象在未来时段内的未来轨迹。或者是,对融合预测特征,采用训练好的全连接网络进行分类,以预测对象的运动意图,将运动意图和融合预测特征相结合,来预测对象的未来轨迹。
再次,根据对象的真值轨迹,确定待训练第一神经网络关于未来轨迹的第一预测损失。
在本公开的一些实施例中,根据第一神经网络、未来轨迹和对象的真值轨迹,确定第一预测损失。比如,第一预测损失至少包括下列中的至少一个:长度大于预设阈值的未来轨迹的平均失败预测次数、未来轨迹在不同距离对应的误差阈值下的成功率或未来轨迹的终点位置与真值轨迹的终点位置之间的误差。其中,长度大于预设阈值的未来轨迹的平均失败预测次数可以理解为:对于轨迹长度大于预设阈值的未来轨迹(比如,预测未来5s的未来轨迹);对该未来轨迹中的每一时刻点都进行预测,将该时刻的前5秒的历史轨迹作为输入,预测未来5秒的未来轨迹;那么,该运动预测轨迹需要进行多次预测,从而得到多次预测的结果;统计多次预测的结果中失败的次数;然后将该失败的次数除以该未来轨迹的长度,以实现归一化;由于有很多轨迹长度大于预设阈值的未来轨迹,将每一条轨迹中预测失败的次数除以该未来轨迹的长度,得到多个归一化值;最后,对这多个归一化值求平均得到每条轨迹的平均失败预测次数。
预测的未来轨迹在不同距离对应的误差阈值下的成功率,可以理解为,针对不同距离,预先设定不同的误差阈值。比如,距离越大设定的误差阈值越大,如果在某一距离下,得到的未来轨迹的误差小于误差阈值,确定本次预测成功。这样,可以刻画预测的未来轨迹在不同误差阈值下面的表现,从而基于此,提升神经网络的细节效果。
未来轨迹的终点位置与真值轨迹的终点位置之间的误差,可以理解为,未来轨迹的 终点与真值轨迹的终点之间的差值。
最后,根据第一预测损失,对第一神经网络的网络参数进行调整,以训练所述第一神经网络。
在本公开的一些实施例中,可直接采用第一预测损失对网络参数进行调整。比如,采用长度大于预设阈值的预测有的未来轨迹的平均失败预测次数、预测的未来轨迹在不同距离对应的误差阈值下的成功率或未来轨迹的终点位置与真值轨迹的终点位置之间的误差中的至少一个,对网络参数进行调整。在本公开实施例中,通过采用丰富的信息作为训练样本,使得训练得到的第一神经网络性能更优。
上述参考调整过程还可以通过以下方式实现,首先判断所述成功率与所述平均失败预测次数的大小情况,在所述成功率小于所述平均失败预测次数的情况下,确定本次预测的所未来轨迹失败;然后,采用所述平均位置误差、所述平均失败预测次数、所述成功率或所述误差中的至少一个,对所述神经网络的网络参数进行调整。这样通过多个评价标准对训练过程中的预测的未来轨迹进行评价,从而更准确的调整神经网络的网络参数,使得调整后的第一神经网络预测的未来轨迹准确度更高。
步骤S202,将所述未来时段内的时序位置信息、时序姿态信息和所述环境信息,按照预设方式进行拼接,得到所述融合特征。
在本公开的一些实施例中,时序位置信息和时序姿态信息和对应的局部地图,可以理解为是属于一组时序位置信息和时序姿态信息的时序位置信息和时序姿态信息和根据这一组时序位置信息和时序姿态信息中的位置信息和朝向信息截取的局部地图。将多个时序位置信息和时序姿态信息一一对应地与局部地图,按照预设方式进行拼接,得到融合特征;所述预设方式可以是按照将时序位置信息和时序姿态信息输入神经网络的顺序,对时序位置信息和时序姿态信息与对应的局部地图进行拼接。比如,以对象为行人或者非机动车骑行人为例,将这三种时序位置信息和时序姿态信息按照行人的身体朝向、面部朝向和所述对象所处的位置的顺序,依次输入神经网络(比如,LSTM网络)中;那么按照从行人的身体朝向、面部朝向到行人所处的位置的顺序,对时序位置信息和时序姿态信息和对应的局部地图进行拼接,得到融合特征。然后,采用全连接网络对所述融合特征进行解码,预测行人的运动意图,即行人是想要左转、右转、直行、静止或掉头等。
在本公开的一些实施例中,以对象为运动设备,如车辆,时序位置信息和时序姿态信息包括:车头时序位置信息和时序姿态信息、位置时序位置信息和时序姿态信息和车灯状态时序位置信息和时序姿态信息,将这三种时序位置信息和时序姿态信息按照车头时序位置信息和时序姿态信息、位置时序位置信息和时序姿态信息,以及车灯状态时序位置信息和时序姿态信息的顺序,依次输入神经网络(比如,LSTM网络)中;那么按照从车头时序位置信息和时序姿态信息、位置时序位置信息和时序姿态信息到车灯状态时序位置信息和时序姿态信息的顺序,对时序位置信息和时序姿态信息和对应的局部地图进行拼接,得到融合特征。然后,采用全连接网络对所述融合特征进行解码,预测车辆的运动意图,即车辆是想要左转、右转、直行、静止、左换道、右换道、超车或倒车等。
上述步骤S201和步骤S202提供了一种实现“将所述环境信息和所述时序位置信息和所述时序姿态信息进行融合,得到融合特征”的方式,在该方式中,通过按照时序位置信息和时序姿态信息输入神经网络的顺序,将时序位置信息和时序姿态信息与作为环境信息的局部地图进行融合,能够提高划分局部地图区域的准确度。
步骤S203,通过第二神经网络确定所述融合特征为意图类别库中每一意图类别的置信度。
在本公开的一些实施例中,第二神经网络可以是全连接网络,用于对融合特征进行分类。比如,采用全连接网络来预测融合特征为意图类别库中每一意图类别的可能性,即可得到每一意图类别的置信度。在本公开的一些实施例中,以对象为行人为例,对应 的意图类别库中包括:左转、右转、直行、静止或掉头等;采用全连接网络来预测融合特征可能是左转、右转、直行、静止或掉头等中每一意图类别的置信度,比如,每一意图类别的概率。
在本公开的一些实施例中,该第二神经网络为训练好的神经网络,可以采用以下方式训练得到:
首先,将所述融合特征输入待训练第二神经网络,预测所述对象的运动意图为意图类别库中每一意图类别的置信度。
比如,待训练第二神经网络可以是待训练全连接网络,将融合特征输入待训练的第二神经网络,以预测该对象的运动意图为类别库中每一类别的概率。这里,对象可以是样本对象,将样本对象的融合特征输入待训练第二神经网络,以对该样本对象的运动意图进行分类。
其次,根据对象的真值意图,确定第二神经网络关于每一意图类别的置信度的第二预测损失。
这里,第二预测损失可以是分类的交叉熵损失函数。
最后,根据第二预测损失,对待训练第二神经网络的网络参数进行调整,以训练待训练第二神经网络,得到第二神经网络。
比如,采用分类的交叉熵损失函数对待训练第二神经网络的网络参数进行调整,以训练待训练第二神经网络,得到已训练的第二神经网络。
对于整个未来轨迹预测***而言,损失函数为第一预测损失和第二预测损失之和。如此,通过将对象在所述未来时段内的时序位置信息和时序姿态信息进行融合,并将融合特征作为训练第二神经网络的样本,使得训练得到的第二神经网络的分类性能更优。
步骤S204,根据置信度最大的意图类别,确定对象的运动意图。
在本公开的一些实施例中,选择概率最大的类别,将概率最大的类别确定为对象的运动意图。比如,采用全连接网络来预测融合特征可能是左转、右转、直行、静止或掉头等中每一类别的概率分别为:0.1、0.2、0.2、0.1和0.4,那么概率最大的类别为掉头,说明该对象最可能的运动意图为掉头,从而最终确定对象的运动意图为掉头。如此,采用神经网络通过对融合特征进行意图类别的分类,能够准确的预测最有可能的运动意图。
上述步骤S203和步骤S204提供了一种实现“根据所述融合特征,确定所述对象的运动意图”的方式,在该方式中,通过采用全连接网络对融合特征进行分类,从而能够准确的预测对象在未来时刻内的运动意图。
步骤S205,根据未来时段的长度,确定迭代步长。
比如,未来时段的长度为3秒,确定迭代步长为0.3秒。
步骤S206,按照所述迭代步长,采用第一神经网络对运动意图和融合特征进行迭代,得到所述对象在每一迭代步长下的坐标。
在本公开的一些实施例中,首先按照该迭代步长和未来时段的长度,确定出需要迭代的次数,然后采用第一神经网络对运动意图和融合特征进行迭代,得到每一次迭代的坐标。在本公开的一些实施例中,如果未来时段的长度为3秒,确定迭代步长为0.3秒,那么需要迭代的次数为10次,采用第一神经网络对运动意图和融合特征进行逐次迭代,最后得到10个坐标值。
步骤S207,根据对象在每一迭代步长下的坐标,确定未来轨迹。
比如,基于上述例子,进行了10次迭代,得到10个坐标值,那么基于这10个坐标值,即可预估对象的未来轨迹。
在本公开实施例中,将对象的意图预测与轨迹预测结合到一个***中,通过一步步迭代得到每一步长下的坐标,预测出未来轨迹,从而能够提高最终预测的未来轨迹的效率和预测效果。
在其他实施例中,通过第一神经网络对时序位置信息和时序姿态信息进行提取时序位置信息和时序姿态信息之后,还包括以下过程:
首先,采用其他LSTM网络对每一时序位置信息和时序姿态信息进行调整,得到第一调整时序位置信息和时序姿态信息。
在本公开的一些实施例中,可以采用双向LSTM网络或全连接层的模型,用于对时序位置信息和时序姿态信息进行调整;将时序位置信息和时序姿态信息中的每一时序位置信息和时序姿态信息输入双向LSTM网络或全连接层的模型,得到一个权值矩阵,然后,将权值矩阵分为与时序位置信息和时序姿态信息种类相同的部分,将每一部分分别对应的与时序位置信息和时序姿态信息中的每一时序位置信息和时序姿态信息进行相乘,得到多个第一调整时序位置信息和时序姿态信息。比如,以对象为行人为例进行说明,时序位置信息和时序姿态信息包括:对象的身体朝向、面部朝向和所述对象所处的位置;将这将这三个特征一一对应的输入三个独立的双向LSTM网络之后,得到三个特征对应的三种时序位置信息和时序姿态信息;然后,将这三种时序位置信息和时序姿态信息按照对象的身体朝向、面部朝向和所述对象所处的位置的顺序,依次输入第二神经网络中,得到一个权值矩阵;将该权值矩阵分为三个部分,第一部分与不同时刻的时序位置信息和时序姿态信息相乘,第二部分与不同时刻的时序位置信息和时序姿态信息相乘,第三部分与不同时刻对象的时序位置信息和时序姿态信息相乘,得到包含三种特征的第一调整时序位置信息和时序姿态信息。
其次,通过将每一个时序位置信息和时序姿态信息中的位置信息输入第三神经网络,得到权值向量,并且,采用该权值向量对每一第一调整时序位置信息和时序姿态信息进行调整,得到第二调整时序位置信息和时序姿态信息。
在本公开的一些实施例中,采用全连接模型,针对输入的多个距离,输出在该多个位置下每一种时序位置信息和时序姿态信息对应的权值向量。并且将得到的每一种时序位置信息和时序姿态信息对应的权值向量与该种时序位置信息和时序姿态信息对应的第一调整时序位置信息和时序姿态信息相乘,得到第二调整时序位置信息和时序姿态信息,从而得到第二调整时序位置信息和时序姿态信息。
最后,将第二调整时序位置信息和时序姿态信息与环境信息进行拼接,得到该融合特征。
在本公开的一些实施例中,首先,将第二调整时序位置信息和时序姿态信息中的第二调整时序位置信息和时序姿态信息与所述多个编码地图,按照预设方式进行拼接,得到融合特征。比如,以对象为行人为例,将这三种时序位置信息和时序姿态信息按照行人的身体朝向、面部朝向和所述对象所处的位置的顺序,依次输入神经网络(比如,LSTM网络)中;那么得到的第二调整时序位置信息和时序姿态信息也是包含这三种特征,按照从行人的身体朝向、面部朝向、所述行人所处的位置到局部地图的顺序,对第二调整时序位置信息和时序姿态信息和对应的局部地图进行拼接,得到融合特征。然后,采用全连接网络对所述融合特征进行解码,预测行人的运动意图,即行人是想要左转、右转、直行、静止或掉头等。
本公开实施例提供一种轨迹预测方法,在驾驶场景中,车辆、行人或非机动车可能具有复杂的行为,例如突然转向,突然向左或向右转弯或者行走。仅通过车辆、行人或非机动车的历史轨迹不能容易地预测或预期这种复杂的行为。同时,具有感知功能的自主***可以自然地提取更丰富的信息,以做出更多信息决策。
本公开实施例利用对象的朝向来描述对象运动和局部地图区域来描述周围的静态环境。该位置在水平面中表示为点(x,y),而从相应的红绿蓝(Red Green Blue,RGB)图像中提取体方向和面方向,然后投影到水平面上,表示为单位矢量(d x,d y)。局部地图区域从高清地图中获得,包含多个道路信息,比如,人行横道、车道线、交叉点或人 行道等。
本公开实施例使用数据采集车在城市驾驶场景中收集对象轨迹数据。该车配备了摄像头,64线激光雷达、雷达、全球定位***(Global Positioning System,GPS)或惯性测量单元(Inertial measurement unit,IMU)。本公开实施例利用标注的高清地图,通过感知功能,检测,分析和跟踪生成对象的未来轨迹。本公开实施例在10赫兹(HZ)时提供行人的未来轨迹以及原始数据,其中,原始数据包括原始图像,点云点,自车的车辆辆姿势和高清地图。对于对象的时序位置信息和时序姿态信息,本公开实施例使用第一神经网络和第二神经网络(其中,第一神经网络和第二神经网网络可以采用深度神经网络算法的模型来实现)来获得输出。本公开实施例的提供的预设的数据集中包括:行人的面部朝向、身体朝向和行人所处的位置、车灯信息、车头朝向信息等。如此,采用包含这样丰富信息的数据集训练第一神经网络和第二神经网络,使得训练好的第一神经网络和第二神经网络的泛化性更强。
本公开实施例以10Hz的频率收集原始传感器数据,包括正视图RGB图像(800×1762),LiDAR点云以及自车的姿势和运动信息。为了更好地描述道路结构,本公开实施例为鸟瞰浏览高清晰度地图(High Definition Maps,HDMap)提供了道路类别(即车道线,交叉路口,人行横道,人行道等)的语义标注。道路类别表示为多边形或没有重叠区域的线。HDMap被裁剪并与每个数据帧的自车对齐。借助感知功能,通过检测和跟踪可以生成对象的运行轨迹。在本公开的一些实施例中,以具有更合适的密度,将轨迹采样到每帧0.3秒。本公开实施例收集了超过12000分钟的原始数据,并为车辆,行人和骑自行车的人采样了300000多种不同的轨迹。
为了构建对交通场景的全面描述,本公开实施例手动为收集的轨迹中的对象标注语义属性和意图。本公开实施例为每个对象类别使用不同的属性设置,以更好地捕获其功能。在本公开的一些实施例中,对于行人和骑自行车者等易受伤害的道路使用者(Vulnerable Road Users,VRU),本公开实施例会注明年龄段(成人/少年),性别(女性/男性),面部朝向(角度)和身体朝向;对于车辆,本公开实施例标注了转向灯状态(左转/右转/制动)和前进方向。意图可以理解为对象在观察点的特定时间(在本公开实施例的设置中为1s)之后的未来动作。类似于该属性,本公开实施例为车辆,行人和骑自行车者定义了不同的意图空间,如图4A至图4D所示,其中:图4A表示不同的对象,即车辆401、行人402和骑自行车的人403,其中,车辆401的数量为334696占据58%,行人402的数量为178343占据31%,骑自行车的人403的数量为61934占据11%。
图4B表示对车辆进行的意图预测的结果,其中,直行421占据38.9%(即该车辆进行直行的意图为38.9%),左转422占据2%,右转423占据1%,左换道424占据1.6%,右换道425占据2%,左超车426占据0.1%,右超车427占据0.1%,静止428占据54%,其他429占据0.2%。
图4C表示对行人进行的意图预测的结果,其中,直行431占据48.6%,左转432占据16.8%,右转433占据23.6%,静止434占据6.8%,掉头435占据0.4%,其他436占据3.7%。
图4D表示对骑自行车的人进行的意图预测的结果,其中,直行441占据37.5%,左转442占据13.5%,右转443占据17.9%,静止444占据24%,掉头占据0.1%,其他445占据7%。
与大多数轨迹预测数据集相比,本公开实施例的数据集涵盖了更多的对象类别,并提供了丰富的上下文标注,包括道路信息和属性标注。本公开实施例的数据集使用了更广泛的意图定义,并且数据规模较大。
在本公开实施例中,采用统一的框架来共同预测对象的未来轨迹和潜在意图。本公开实施例采用的第一神经网络和第二神经网络中的至少之一,可以包括但不限于基于 LSTM的编码器-解码器架构实现的,并且基于第一神经网网络和第二神经网络中的至少之一能够提高该框架的直接性和通用性。首先,采用编码器从对象的历史运动轨迹以及丰富的上下文信息中提取对象特征,对象特征包括语义对象属性和本地道路结构。然后,利用解码器估计意图分布并回归未来位置。如图5所示,图5为本公开实施例提供的轨迹预测***的框架示意图,结合图5进行以下说明:
首先,获取在历史时刻内采集的多个图像中,行人501的时序位置信息和时序姿态信息,包括:位置信息502、身体朝向503、面部朝向504和当前时刻的道路结构505。
然后,针对每一个时序位置信息和时序姿态信息建立时序模型,即将每一个时序位置信息和时序姿态信息输入第一神经网络(此处第一神经网络可以采用LSTM网络506实现)中,得到对应的时序特征。
比如,将位置信息502输入LSTM网络506得到位置时序特征,将身体朝向503输入LSTM网络506得到身体朝向时序特征,将面部朝向504输入LSTM网络506得到面部朝向时序特征;最后,将道路结构505输入到第二神经网络(此处第二神经网络可以采用CNN网络507实现)中以对道路结构进行编码,得到道路时序位置信息和时序姿态信息。
最后,将道路时序位置信息和时序姿态信息和时序特征进行融合,得到融合特征,将融合特征输入第一神经网络(此处第一神经网络可以采用MLP网络508实现)中,进行意图预测,得到意图预测的结果为横穿马路509。接下来,将意图预测的结果横穿马路509和融合特征相结合输入LSTM网络506中,进行多次迭代,预测行人的运行轨迹,得到预测的未来轨迹510;在图5中,通过对比行人501的历史轨迹511、预测的未来轨迹510和真值轨迹512,可以看出,采用本公开实施例提供的轨迹预测方法得到的预测的未来轨迹510的准确率是非常高的。
在图5中,根据每个数据项的特定形式,使用一组LSTM或CNN网络对对象的运动历史和多模式上下文输入进行编码。编码后的特征拼接为融合特征之后,馈入解码器以共同预测未来的轨迹和潜在意图。
在本公开实施例中,针对每个时间步长t(比如,t的取值可以为大于0小于T),第i个对象的观察结果表示为
Figure PCTCN2021109871-appb-000001
其中,
Figure PCTCN2021109871-appb-000002
是位置信息,
Figure PCTCN2021109871-appb-000003
是上下文信息。给定在离散时间间隔t∈[T-n:T]中的观察,本公开实施例能够实现预测对象在t∈[T+1:T+m]和意图IT的未来位置。其中,T是最后的观察时间(比如,T取值可以为大于0且小于5分钟),n,m分别是观察时长和预测时长(比如,n,m的取值可以为大于0且小于5分钟的实数)。
本公开实施例使用一组双向LSTM网络作为第一神经网络,对多源输入数据进行编码。将对象pT-m:T的历史轨迹直接输入LSTM,以获取时间T处的隐藏状态(表示为
Figure PCTCN2021109871-appb-000004
)作为运动历史特征。上下文信息根据其特定形式进行处理。对于VRU,本公开实施例设置c t=(f t,b t,r t),其中,ft/bt是以二维单位矢量表示的脸部/身体方向,r t是以自车为中心并旋转的局部道路结构图,以使y轴与自车的头部方向对齐。对于车辆,本公开实施例设置c t=(l t,h t,r t),其中l t是三维二进制矢量中的灯状态,h t是车头朝向,r t与VRU设置中的相同。在本公开实施例中,诸如面部朝向和车灯状态之类的语义属性与对象意图和未来运动密切相关,并反映了对象的固有特性,而这些特性是无法从运动历史中获得的。本地地图提供道路结构以规范轨迹预测。在本公开实施例的实现中,类似于运动历史编码的过程, 方向(即面部,身体和车辆前进方向)序列和灯状态序列分别直接输入到独立的双向LSTM中。本公开实施例在观察时间T内使用一次于本地地图,以减少冗余。本公开实施例首先栅格化原始地图,然后将栅格化的地图输入到CNN模型中以提取地图时序位置信息和时序姿态信息。最后,将所有编码的向量连接为在时间T嵌入的融合特征,如公式(1):
e T=φ(p T-m:T,c T-m:T)              公式(1);
其中,φ表示整个编码器的变换函数。
本公开实施例将意图预测建模为一个分类问题。其中,模型根据给定对象的融合特征e T来预测有限意图集上的后验概率分布。本公开实施例使用多层感知器(Multilayer perceptron,MLP),连接softmax层作为意图分类器。在训练过程中,本公开实施例将交叉熵损失降到最低,如公式(2)所示:
Figure PCTCN2021109871-appb-000005
其中,
Figure PCTCN2021109871-appb-000006
是在时间T的真实意图的预测概率(索引表示为k T)。
本公开实施例将轨迹预测视为序列生成任务,并采用LSTM解码器来预测每个未来时间步长上的对象运动。嵌入e T的特征一开始就被馈送到解码器中。特别地,本公开实施例通过将意图分类器的输出通过另一个全连接层来确定意图嵌入特征
Figure PCTCN2021109871-appb-000007
并将意图嵌入特征用作轨迹解码器的辅助输入,从而为轨迹预测提供良好的条件。本公开实施例在训练过程中最小化了高斯样损失函数:
Figure PCTCN2021109871-appb-000008
其中,(x t,y t)是时间t处的地面真相位置,σ ttt是代表轨迹预测的预测高斯分布参数。通过优化全局损失函数L=L Traj+L Int,本公开实施例的神经网络可以多任务方式进行端到端训练。在一些实施例中还可以使用高斯平均作为预测的轨迹位置。
在其他实施例中,以下针对对象为行人为例,进行说明:
表1为在不同的采集距离下采集到的身体朝向和面部朝向的精确度。从表1可以看出,行人所处的位置、身体朝向和面部朝向用于表示行人的动态情况,而局部地图区域用于表示静态周围环境。在本公开实施例中,位置、身体朝向、面部朝向即行人的时序位置信息和时序姿态信息可以看作是动态特征,而局部地图区域可以看作是静态特征。
表1对于行人,在不同距离下的身体朝向和面部朝向的精确度
Figure PCTCN2021109871-appb-000009
如表1所示,面部朝向(Face direction)和身体朝向(Body direction)的精确度与从行人到自车的距离有关。距离越长,特征的精确度越低。因此,在不同距离的不同时序位置信息和时序姿态信息上调整时序位置信息和时序姿态信息的权重。本公开实施例使用嵌入函数φ来表达这种关系:
Figure PCTCN2021109871-appb-000010
其中,
Figure PCTCN2021109871-appb-000011
表示在时间步长t处第i个行人与自车之间的距离,W dis表示第二神经网络中输入到输出的转换参数,
Figure PCTCN2021109871-appb-000012
在第二神经网络中输入不同的距离后,针对位置,面部朝向和身体朝向输出的对应的权值向量。
行人遵循基本的交通规则,这些规则与其相应的当地道路结构有关。局部地图区域是行人的未来轨迹预测的基本静态环境。
每条车道线内的区域被视为行人的“危险空间”。图6为本公开实施例轨迹预测方法的实现框架结构图,如图6所示,首先,从图像601至图像60n中提取行人61的时序位置信息和时序姿态信息,比如,面部朝向
Figure PCTCN2021109871-appb-000013
身体朝向
Figure PCTCN2021109871-appb-000014
和行人61所处的位置
Figure PCTCN2021109871-appb-000015
以及根据身体朝向和所处的位置确定的局部地图区域
Figure PCTCN2021109871-appb-000016
其次,将行人61所处的位置
Figure PCTCN2021109871-appb-000017
身体朝向
Figure PCTCN2021109871-appb-000018
和面部朝向
Figure PCTCN2021109871-appb-000019
单独输入三个独立的第一神经网络62、63和64(比如,双向LSTM网络)中,分别得到用于表明身体朝向的时序特征(即时序位置信息和时序姿态信息)、面部朝的时序特征和样本对象所处的位置在时间上的变化情况的时序特征;再将时序特征输入另一第二神经网络65(比如,双向LSTM网络)中,得到第一调整时序特征。将不同的距离输入全连接模型68中,得到该距离下身体朝向、面部朝向和运动对象所处的位置对应的权重;将该权重与第一调整时序特征相乘,得到第二调整时序特征。
再次,将编码地图602展开为一维特征向量,对该一维特征向量进行编码,输入另一双向LSTM网络,即第一神经网络66,得到该以为特征向量对应的时序特征;然后,将该时序特征作为行人61的时序位置信息和时序姿态信息对应的时序特征的辅助特征,将这些特征进行拼接,得到融合特征;然后通过解码的神经网络67,对融合特征进行解码,得到预测的行人的未来轨迹,即虚线69;实线70为该行人61的真值未来轨迹,由此可见,本公开实施例通过的网络模型的预测结果是非常准确的。
本公开实施例针对局部地图区域采用掩码编码,得到编码地图602,其中每个码字由与其语义道路结构类相关联的特定整数填充。对于在时间步长t的第i个行人,首先,根据该行人所处的位置和身体朝向,确定该行人对应的局部地图区域。然后将局部地图区域均匀地离散化为网格,其中每个网格由主要语义道路结构类的结构特定数量表示。比如,“人行横道”和“人行道”表示为数字“1”,“危险地点”表示为“-1”,其他表示为数字“0”,即得到用于划分危险或安全区域的网格603。
在本公开的一些实施例中,将编码的动态特征(即行人的时序位置信息和时序姿态信息)和编码的静态特征(即局部地图区域)连接起来预测。使用简单的LSTM网络对与行人的未来轨迹进行预测。
本公开实施例提供历史数据的预设数据集是大规模和信息化的轨迹数据集,以促进自动驾驶中的行人轨迹预测任务。同时,该数据集中具有多个评价标准,长度大于预设阈值的未来轨迹的平均失败预测次数、未来轨迹在不同距离对应的误差阈值下的成功率或未来轨迹的终点位置与真值轨迹的终点位置之间的误差,以评估预测模型的准确性和鲁棒性;从而,即使在非常复杂的场景下,使用该神经网络仍然能够较为准确的预测行人的未来轨迹。
本公开实施例提供一种轨迹预测装置,图7为本公开实施例轨迹预测装置结构组成 示意图,如图7所示,所述装置700包括:
意图确定模块701,配置为根据对象的时序位置信息和时序姿态信息,确定所述对象的运动意图;其中,所述时序位置信息为所述对象在预设时长内不同时间点的位置信息,所述时序姿态信息为所述对象在预设时长内不同时间点的姿态信息;所述不同时间点的姿态信息包括所述对象在所述不同时间点的朝向信息;
未来轨迹确定模块702,配置为根据所述时序位置信息、所述时序姿态信息以及所述运动意图,确定所述对象的未来轨迹。
在上述装置中,意图确定模块701,包括:地图截取子模块,配置为根据所述时序位置信息和所述时序姿态信息,获取所述对象所处环境的环境信息;特征融合子模块,配置为将所述环境信息、所述时序位置信息和时序姿态信息进行融合,得到融合特征;意图预测子模块,配置为根据所述融合特征,确定所述对象的运动意图;所述未来轨迹确定模块702,包括:轨迹预测子模块,配置为根据所述融合特征和所述运动意图,确定所述对象的未来轨迹。
在上述装置中,所述对象包括人体对象和非人体对象中的至少之一,在所述对象包括所述人体对象的情况下,所述不同时间点的姿态信息包括:所述人体对象的部位的在所述不同时间点的朝向信息,所述部位包括以下至少之一:肢体、面部;在所述对象包括所述非人体对象的情况下,所述非人体对象包括以下至少之一:车辆、动物、可移动设备;所述不同时间点的姿态信息包括:所述非人体对象在所述不同时间点的朝向信息和行驶指示信息。
在上述装置中,所述装置还包括:历史时刻确定模块,配置为确定距离当前时刻的时长小于等于特定时长的至少两个历史时刻;特征信息获取模块,配置为获取所述对象在至少两个历史时刻的时序位置信息和时序姿态信息。
在上述装置中,所述地图截取子模块,包括:地图截取单元,配置为根据所述对象在任一历史时刻的位置信息和朝向信息,确定所述环境信息;其中,所述环境信息至少包括下列中的至少一个:道路信息、行人信息或交通灯信息。
在上述装置中,所述地图截取单元,还配置为:以所述位置信息为中心,按照所述朝向信息,在世界地图中划定所述对象所在环境的局部地图区域;对所述局部地图区域中的元素进行编码,得到所述环境信息。
在上述装置中,所述特征融合子模块,包括:时序位置信息和时序姿态信息确定单元,配置为通过第一神经网络,根据所述时序位置信息和时序姿态信息,预测在未来时段内的时序位置信息和时序姿态信息;特征拼接单元,配置为将所述未来时段内的时序位置信息、时序姿态信息和所述环境信息,按照预设方式进行拼接,得到所述融合特征。
在上述装置中,所述意图预测子模块,包括:置信度确定单元,配置为通过第二神经网络确定所述融合特征为意图类别库中每一意图类别的置信度;意图预测单元,配置为将置信度最大的意图类别,确定所述对象的运动意图。
在上述装置中,所述轨迹预测子模块,包括:迭代步长单元,配置为根据所述未来时段的长度,确定迭代步长;特征迭代单元,配置为按照所述迭代步长,采用所述第一神经网络对所述运动意图和所述融合特征进行迭代,得到所述对象在每一迭代步长下的坐标;未来轨迹确定单元,配置为根据所述对象在每一迭代步长下的坐标,确定所述未来轨迹。
在上述装置中,所述装置还包括第一训练模块,配置为训练第一神经网络;
第一训练模块,包括:预测时序位置信息和时序姿态信息的预测子模块,配置为将所述对象的时序位置信息和时序姿态信息输入待训练第一神经网络中,预测所述对象在所述未来时段内的时序位置信息和时序姿态信息;预测特征融合子模块,配置为将所述未来时段内的时序位置信息、时序姿态信息与所述对象所在环境的环境信息进行融合, 得到融合预测特征;预测未来轨迹子模块,配置为至少根据所述融合预测特征,预测所述对象在所述未来时段内的未来轨迹;第一预测损失确定子模块,配置为根据所述对象的真值轨迹,确定所述待训练第一神经网络关于所述未来轨迹的第一预测损失;第一神经网络参数调整子模块,配置为根据所述第一预测损失,对所述待训练第一神经网络的网络参数进行调整,得到所述第一神经网络。
在上述装置中,所述装置还包括第二训练模块,配置为训练第二神经网络;
第二训练模块,包括:类别置信度确定子模块,配置为将所述融合特征输入待训练第二神经网络,预测所述对象的运动意图为意图类别库中每一意图类别的置信度;第二预测损失确定子模块,配置为根据所述对象的真值意图,确定所述待训练第二神经网络关于所述每一意图类别的置信度的第二预测损失;第二神经网络参数调整子模块,配置为根据所述第二预测损失,对所述待训练第二神经网络的网络参数进行调整,得到所述第二神经网络。
对应地,本公开实施例再提供一种计算机程序产品,所述计算机程序产品包括计算机可执行指令,该计算机可执行指令被执行后,能够实现本公开实施例提供的轨迹预测方法中。
相应的,本公开实施例再提供一种计算机存储介质,所述计算机存储介质上存储有计算机可执行指令,所述该计算机可执行指令被处理器执行时实现上述实施例提供的轨迹预测方法。
本公开实施例还提供一种计算机程序,所述计算机程序包括计算机可读代码,在所述计算机可读代码在电子设备中运行的情况下,所述电子设备的处理器执行用于实现上述实施例提供的轨迹预测方法。
相应的,本公开实施例提供一种计算机设备,图8为本公开实施例计算机设备的组成结构示意图,如图8所示,所述设备800包括:一个处理器801、至少一个通信总线、通信接口802、至少一个外部通信接口和存储器803。其中,通信接口802配置为实现这些组件之间的连接通信。其中,通信接口802可以包括显示屏,外部通信接口可以包括标准的有线接口和无线接口。其中所述处理器801,配置为执行存储器中图像处理程序,以实现上述实施例提供的轨迹预测方法。
在实际应用中,上述存储器可以是易失性存储器(volatile memory),例如随机存取存储器(Random Access Memory,RAM);或者非易失性存储器(non-volatile memory),例如只读存储器(Read-Only Memory,ROM),快闪存储器(flash memory),硬盘(Hard Disk Drive,HDD)或固态硬盘(Solid-State Drive,SSD);或者上述种类的存储器的组合,并向处理器提供指令和数据。
上述处理器可以为专用集成电路(Application Specific Integrated Circuit,ASIC)、数字信号处理器(Digital Signal Processor,DSP)、数字信号处理设备(Digital Signal Processor Device,DSPD)、可编程逻辑器件(Programmable Logic Device,PLD)、现场可编程门阵列(Field Programmable Gate Array,FPGA)、中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器中的至少一种。可以理解地,对于不同的设备,用于实现上述处理器功能的电子器件还可以为其它,本公开实施例不作限定。
以上轨迹预测装置、计算机设备和存储介质实施例的描述,与上述方法实施例的描述是类似的,具有同相应方法实施例相似的技术描述和有益效果,限于篇幅,可参考上述方法实施例的记载。对于本公开轨迹预测装置、计算机设备和存储介质实施例中未披露的技术细节,请参照本公开方法实施例的描述而理解。
应理解,说明书通篇中提到的“一个实施例”或“一实施例”意味着与实施例有关的特定特征、结构或特性包括在本公开的至少一个实施例中。因此,在整个说明书各处出现的“在一个实施例中”或“在一实施例中”未必一定指相同的实施例。此外,这些 特定的特征、结构或特性可以任意适合的方式结合在一个或多个实施例中。应理解,在本公开的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本公开实施例的实施过程构成任何限定。上述本公开实施例序号仅仅为了描述,不代表实施例的优劣。
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。
在本公开所提供的几个实施例中,应该理解到,所揭露的设备和方法,可以通过其它的方式实现。以上所描述的设备实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,如:多个单元或组件可以结合,或可以集成到另一个***,或一些特征可以忽略,或不执行。另外,所显示或讨论的各组成部分相互之间的耦合、或直接耦合、或通信连接可以是通过一些接口,设备或单元的间接耦合或通信连接,可以是电性的、机械的或其它形式的。
上述作为分离部件说明的单元可以是、或也可以不是物理上分开的,作为单元显示的部件可以是、或也可以不是物理单元;既可以位于一个地方,也可以分布到多个网络单元上;可以根据实际的需要选择其中的部分或全部单元来实现本实施例方案的目的。
另外,在本公开各实施例中的各功能单元可以全部集成在一个处理单元中,也可以是各单元分别单独作为一个单元,也可以两个或两个以上单元集成在一个单元中;上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能单元的形式实现。本领域普通技术人员可以理解:实现上述方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成,前述的程序可以存储于计算机可读取存储介质中,该程序在执行时,执行包括上述方法实施例的步骤;而前述的存储介质包括:移动存储设备、只读存储器(Read Only Memory,ROM)、磁碟或者光盘等各种可以存储程序代码的介质。
或者,本公开上述集成的单元如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本公开实施例的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机、服务器、或者网络设备等)执行本公开各个实施例所述方法的全部或部分。而前述的存储介质包括:移动存储设备、ROM、磁碟或者光盘等各种可以存储程序代码的介质。以上所述,仅为本公开的具体实施方式,但本公开的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本公开揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本公开的保护范围之内。因此,本公开的保护范围应以所述权利要求的保护范围为准。
工业实用性
本公开实施例提供一种轨迹预测方法、装置、设备、存储介质及程序,其中,根据对象的时序位置信息和时序姿态信息,确定所述对象的运动意图;其中,所述时序位置信息为所述对象在预设时长内不同时间点的位置信息,所述时序姿态信息为所述对象在预设时长内不同时间点的姿态信息;其中,所述不同时间点的姿态信息包括所述对象的多个部位在所述不同时间点的朝向信息;根据所述时序位置信息、所述时序姿态信息以及所述运动意图,确定所述对象的未来轨迹。

Claims (15)

  1. 一种轨迹预测方法,所述方法由电子设备执行,所述方法包括:
    根据对象的时序位置信息和时序姿态信息,确定所述对象的运动意图;其中,所述时序位置信息为所述对象在预设时长内不同时间点的位置信息,所述时序姿态信息为所述对象在预设时长内不同时间点的姿态信息;所述不同时间点的姿态信息包括所述对象在所述不同时间点的朝向信息;
    根据所述时序位置信息、所述时序姿态信息以及所述运动意图,确定所述对象的未来轨迹。
  2. 根据权利要求1所述的方法,其中,
    所述根据对象的时序位置信息和时序姿态信息,确定所述对象的运动意图,包括:
    根据所述时序位置信息和所述时序姿态信息,获取所述对象所处环境的环境信息;
    将所述环境信息、所述时序位置信息和时序姿态信息进行融合,得到融合特征;
    根据所述融合特征,确定所述对象的运动意图;
    所述根据所述时序位置信息、所述时序姿态信息以及所述运动意图,确定所述对象的未来轨迹,包括:
    根据所述融合特征和所述运动意图,确定所述对象的未来轨迹。
  3. 根据权利要求1或2所述的方法,其中,所述对象包括人体对象和非人体对象中的至少之一;
    在所述对象包括所述人体对象的情况下,所述不同时间点的姿态信息包括:所述人体对象的部位的在所述不同时间点的朝向信息,所述部位包括以下至少之一:肢体、面部;
    在所述对象包括所述非人体对象的情况下,所述非人体对象包括以下至少之一:车辆、可移动设备;
    所述不同时间点的姿态信息包括:所述非人体对象在所述不同时间点的朝向信息和行驶指示信息。
  4. 根据权利要求1或2所述的方法,其中,所述根据对象的时序位置信息和时序姿态信息,确定所述对象的运动意图之前,所述方法还包括:
    确定距离当前时刻的时长小于等于特定时长的至少两个历史时刻;
    获取所述对象在所述至少两个历史时刻的时序位置信息和时序姿态信息。
  5. 根据权利要求2至4任一所述的方法,其中,所述根据所述时序位置信息和所述时序姿态信息,获取所述对象所处环境的环境信息,包括:
    根据所述对象在任一历史时刻的位置信息和朝向信息,确定所述环境信息;其中,所述环境信息至少包括下列中的至少一个:道路信息、行人信息或交通灯信息。
  6. 根据权利要求5所述的方法,其中,所述根据所述对象在任一历史时刻的位置信息和朝向信息,确定所述环境信息,包括:
    以所述位置信息为中心,按照所述朝向信息,在世界地图中划定所述对象所在环境的局部地图区域;
    对所述局部地图区域中的元素进行编码,得到所述环境信息。
  7. The method according to any one of claims 2, 5 and 6, wherein fusing the environment information, the time-series position information and the time-series pose information to obtain the fused feature comprises:
    predicting, by a first neural network, time-series position information and time-series pose information in a future time period according to the time-series position information and the time-series pose information; and
    concatenating the time-series position information and the time-series pose information in the future time period with the environment information in a preset manner to obtain the fused feature.
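The following non-limiting sketch illustrates one way a "first neural network" as in claim 7 could map the observed position and pose series to a predicted future series, and one possible "preset manner" of concatenation with the environment information; the GRU/linear architecture and all dimensions are assumptions made for illustration.

# Illustrative only; layer choices and dimensions are assumptions.
import torch
import torch.nn as nn

class FirstNet(nn.Module):
    """A stand-in for the 'first neural network': it maps the observed time series of
    position+pose features to a predicted time series over the future time period."""
    def __init__(self, feat_dim=10, hidden=64, future_steps=12):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, future_steps * feat_dim)
        self.future_steps, self.feat_dim = future_steps, feat_dim

    def forward(self, history):                     # history: (B, T_obs, feat_dim)
        _, h = self.rnn(history)
        future = self.proj(h[-1])                   # predicted future position+pose series
        return future.view(-1, self.future_steps, self.feat_dim)

first_net = FirstNet()
future_seq = first_net(torch.randn(2, 10, 10))      # (B, 12, 10)
env_feat = torch.randn(2, 16)                        # encoded environment information
# One possible "preset manner": flatten the future series and append the environment code.
fused_feature = torch.cat([future_seq.flatten(1), env_feat], dim=1)
print(fused_feature.shape)  # torch.Size([2, 136])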
  8. The method according to any one of claims 2 and 5 to 7, wherein determining the motion intention of the object according to the fused feature comprises:
    determining, by a second neural network, a confidence of the fused feature for each intention category in an intention category library; and
    determining the intention category with the highest confidence as the motion intention of the object.
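A minimal, non-limiting sketch of the confidence-and-selection step described in claim 8 follows; the intention category library, classifier shape and fused-feature size are assumptions made for illustration only.

# Illustrative only; the intention category library and classifier are assumptions.
import torch
import torch.nn as nn

intent_library = ["go straight", "turn left", "turn right", "stop"]   # assumed categories
second_net = nn.Sequential(nn.Linear(136, 64), nn.ReLU(), nn.Linear(64, len(intent_library)))

fused_feature = torch.randn(2, 136)
confidences = torch.softmax(second_net(fused_feature), dim=-1)         # one confidence per category
motion_intent = [intent_library[i] for i in confidences.argmax(dim=-1).tolist()]
print(confidences, motion_intent)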
  9. The method according to any one of claims 2 and 5 to 8, wherein determining the future trajectory of the object according to the fused feature and the motion intention comprises:
    determining an iteration step size according to a length of the future time period;
    iterating on the motion intention and the fused feature by using the first neural network according to the iteration step size, to obtain coordinates of the object at each iteration step; and
    determining the future trajectory according to the coordinates of the object at each iteration step.
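The iterative roll-out of claim 9 can be pictured with the following non-limiting sketch, in which the number of iteration steps is derived from the length of the future time period and one network pass produces the coordinates for each step; the stand-in step network, feature sizes and the 0.5 s step size are assumptions.

# Illustrative only; a minimal roll-out loop, not the disclosed implementation.
import torch
import torch.nn as nn

step_net = nn.Linear(16 + 4 + 2, 2)   # stand-in step: fused feature + intention + last xy -> next xy

def rollout(fused_feature, intent, start_xy, horizon_s=6.0, dt=0.5):
    num_steps = int(horizon_s / dt)    # iteration step count derived from the future period length
    xy = start_xy
    coords = []
    for _ in range(num_steps):         # one network pass per iteration step
        xy = xy + step_net(torch.cat([fused_feature, intent, xy], dim=-1))
        coords.append(xy)
    return torch.stack(coords, dim=1)  # (B, num_steps, 2): the future trajectory

traj = rollout(torch.randn(2, 16), torch.randn(2, 4), torch.zeros(2, 2))
print(traj.shape)  # torch.Size([2, 12, 2])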
  10. The method according to any one of claims 7 to 9, wherein a training method of the first neural network comprises:
    inputting the time-series position information and the time-series pose information of the object into a first neural network to be trained, to predict time-series position information and time-series pose information of the object in the future time period;
    fusing the time-series position information and the time-series pose information in the future time period with the environment information to obtain a fused prediction feature;
    predicting a future trajectory of the object in the future time period according to at least the fused prediction feature;
    determining, according to a ground-truth trajectory of the object, a first prediction loss of the first neural network to be trained with respect to the future trajectory; and
    adjusting network parameters of the first neural network to be trained according to the first prediction loss, to obtain the first neural network.
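A single training step consistent with the procedure of claim 10 might look like the following non-limiting sketch, in which a stand-in "first neural network" is optimized against the ground-truth trajectory with a mean-squared-error "first prediction loss"; the model, loss and optimizer choices are assumptions, and the fusion with environment information is omitted for brevity.

# Illustrative training-step sketch only; model, loss and optimizer choices are assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(10 * 10, 64), nn.ReLU(), nn.Linear(64, 12 * 2))  # stand-in first network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

history = torch.randn(8, 10, 10)          # observed time-series position+pose information
gt_trajectory = torch.randn(8, 12, 2)     # ground-truth future trajectory

pred = model(history).view(8, 12, 2)      # predicted future trajectory
first_loss = nn.functional.mse_loss(pred, gt_trajectory)   # "first prediction loss"
optimizer.zero_grad()
first_loss.backward()
optimizer.step()                          # adjust the network parameters of the first network
print(float(first_loss))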
  11. The method according to any one of claims 8 to 10, wherein a training method of the second neural network comprises:
    inputting the fused feature into a second neural network to be trained, to predict a confidence that the motion intention of the object belongs to each intention category in the intention category library;
    determining, according to a ground-truth intention of the object, a second prediction loss of the second neural network to be trained with respect to the confidence of each intention category; and
    adjusting network parameters of the second neural network to be trained according to the second prediction loss, to obtain the second neural network.
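Similarly, a single training step consistent with claim 11 is sketched below: a stand-in "second neural network" predicts per-category confidences from the fused feature and is optimized with a cross-entropy "second prediction loss" against the ground-truth intention; the classifier, category count and optimizer are assumptions made for illustration.

# Illustrative training-step sketch only; the classifier and loss are assumptions.
import torch
import torch.nn as nn

second_net = nn.Sequential(nn.Linear(136, 64), nn.ReLU(), nn.Linear(64, 4))   # 4 assumed intention categories
optimizer = torch.optim.Adam(second_net.parameters(), lr=1e-3)

fused_feature = torch.randn(8, 136)
gt_intent = torch.randint(0, 4, (8,))            # ground-truth intention labels

logits = second_net(fused_feature)               # per-category confidences (as logits)
second_loss = nn.functional.cross_entropy(logits, gt_intent)   # "second prediction loss"
optimizer.zero_grad()
second_loss.backward()
optimizer.step()                                 # adjust the network parameters of the second network
print(float(second_loss))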
  12. A trajectory prediction apparatus, comprising:
    an intention determination module, configured to determine a motion intention of an object according to time-series position information and time-series pose information of the object, wherein the time-series position information is position information of the object at different time points within a preset duration, the time-series pose information is pose information of the object at the different time points within the preset duration, and the pose information at the different time points comprises orientation information of the object at the different time points; and
    a future trajectory determination module, configured to determine a future trajectory of the object according to the time-series position information, the time-series pose information and the motion intention.
  13. A computer storage medium having computer-executable instructions stored thereon, wherein the computer-executable instructions, when executed, implement the trajectory prediction method according to any one of claims 1 to 11.
  14. A computer device, comprising a memory and a processor, wherein the memory stores computer-executable instructions, and the processor, when running the computer-executable instructions on the memory, implements the trajectory prediction method according to any one of claims 1 to 11.
  15. A computer program, comprising computer-readable code, wherein when the computer-readable code runs in an electronic device, a processor of the electronic device performs the trajectory prediction method according to any one of claims 1 to 11.
PCT/CN2021/109871 2020-07-31 2021-07-30 Trajectory prediction method, apparatus, device, storage medium and program WO2022022721A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2022546580A JP7513726B2 (ja) 2020-07-31 2021-07-30 Trajectory prediction method, apparatus, device, storage medium and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010763409.4A CN111942407B (zh) 2020-07-31 2020-07-31 Trajectory prediction method, apparatus, device and storage medium
CN202010763409.4 2020-07-31

Publications (1)

Publication Number Publication Date
WO2022022721A1 true WO2022022721A1 (zh) 2022-02-03

Family

ID=73337954

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109871 WO2022022721A1 (zh) 2020-07-31 2021-07-30 Trajectory prediction method, apparatus, device, storage medium and program

Country Status (3)

Country Link
JP (1) JP7513726B2 (zh)
CN (1) CN111942407B (zh)
WO (1) WO2022022721A1 (zh)


Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111942407B (zh) * 2020-07-31 2022-09-23 商汤集团有限公司 Trajectory prediction method, apparatus, device and storage medium
CN112364997B (zh) * 2020-12-08 2021-06-04 北京三快在线科技有限公司 Obstacle trajectory prediction method and apparatus
CN113033364B (zh) * 2021-03-15 2024-06-14 商汤集团有限公司 Trajectory prediction method, driving control method, apparatus, electronic device and storage medium
CN113029154B (zh) * 2021-04-01 2022-07-12 北京深睿博联科技有限责任公司 Navigation method and apparatus for blind people
CN113316788A (zh) * 2021-04-20 2021-08-27 深圳市锐明技术股份有限公司 Pedestrian motion trajectory prediction method and apparatus, electronic device and storage medium
CN112859883B (zh) * 2021-04-25 2021-09-07 北京三快在线科技有限公司 Control method and control apparatus for an unmanned driving device
CN113157846A (zh) * 2021-04-27 2021-07-23 商汤集团有限公司 Intention and trajectory prediction method and apparatus, computing device and storage medium
CN113382304B (zh) * 2021-06-07 2023-07-18 北博(厦门)智能科技有限公司 Video stitching method based on artificial intelligence technology
CN113658214B (zh) * 2021-08-16 2022-08-09 北京百度网讯科技有限公司 Trajectory prediction method, collision detection method, apparatus, electronic device and medium
CN114312831B (zh) * 2021-12-16 2023-10-03 浙江零跑科技股份有限公司 Vehicle trajectory prediction method based on a spatial attention mechanism
CN114663982B (zh) * 2022-04-21 2024-06-25 湖南大学 Human hand trajectory prediction and intention recognition method based on multi-feature fusion
CN114872735B (zh) * 2022-07-10 2022-10-04 成都工业职业技术学院 Decision-making method and apparatus for autonomous-driving logistics vehicles based on a neural network algorithm
CN115628736A (zh) * 2022-09-23 2023-01-20 北京智行者科技股份有限公司 Pedestrian trajectory prediction method, device, mobile apparatus and storage medium
CN115790606B (zh) * 2023-01-09 2023-06-27 深圳鹏行智能研究有限公司 Trajectory prediction method, apparatus, robot and storage medium


Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10345815B2 (en) 2016-09-14 2019-07-09 Qualcomm Incorporated Motion planning and intention prediction for autonomous driving in highway scenarios via graphical model-based factorization
DE102016217770A1 (de) * 2016-09-16 2018-03-22 Audi Ag Method for operating a motor vehicle
KR101946940B1 (ko) * 2016-11-09 2019-02-12 엘지전자 주식회사 Vehicle control device provided in a vehicle and method for controlling the vehicle
CN107423679A (zh) * 2017-05-31 2017-12-01 深圳市鸿逸达科技有限公司 Pedestrian intention detection method and system
JP6833630B2 (ja) * 2017-06-22 2021-02-24 株式会社東芝 Object detection device, object detection method and program
US11367354B2 (en) * 2017-06-22 2022-06-21 Apollo Intelligent Driving Technology (Beijing) Co., Ltd. Traffic prediction based on map images for autonomous driving
US11104334B2 (en) * 2018-05-31 2021-08-31 Tusimple, Inc. System and method for proximate vehicle intention prediction for autonomous vehicles
JP7125286B2 (ja) * 2018-06-22 2022-08-24 本田技研工業株式会社 Behavior prediction device and automatic driving device
US11302197B2 (en) 2018-09-17 2022-04-12 Nissan Motor Co., Ltd. Vehicle behavior prediction method and vehicle behavior prediction device
CN110059598B (zh) * 2019-04-08 2021-07-09 南京邮电大学 Behavior recognition method based on long-time-range fast-and-slow network fusion of pose joint points
CN110147743B (zh) * 2019-05-08 2021-08-06 中国石油大学(华东) Real-time online pedestrian analysis and counting system and method in complex scenes
CN110796856B (zh) * 2019-10-16 2022-03-25 腾讯科技(深圳)有限公司 Vehicle lane-change intention prediction method and training method for a lane-change intention prediction network
CN111401233A (zh) * 2020-03-13 2020-07-10 商汤集团有限公司 Trajectory prediction method and apparatus, electronic device and medium
CN111402632B (zh) * 2020-03-18 2022-06-07 五邑大学 Risk prediction method for pedestrian motion trajectories at intersections

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1223567A1 (en) * 2001-01-11 2002-07-17 Siemens Aktiengesellschaft A method for inter-vehicle communication of individualized vehicle data
CN1847817A (zh) * 2005-04-15 2006-10-18 ***通信集团公司 System and method for providing services for vehicles by means of mobile communication
US20180126951A1 (en) * 2016-11-07 2018-05-10 Nio Usa, Inc. Method and system for authentication in autonomous vehicles
CN111095380A (zh) * 2017-09-20 2020-05-01 本田技研工业株式会社 Vehicle control device, vehicle control method, and program
CN109801508A (zh) * 2019-02-26 2019-05-24 百度在线网络技术(北京)有限公司 Method and apparatus for predicting motion trajectories of obstacles at intersections
CN110210417A (zh) * 2019-06-05 2019-09-06 深圳前海达闼云端智能科技有限公司 Pedestrian motion trajectory prediction method, terminal and readable storage medium
CN111942407A (zh) * 2020-07-31 2020-11-17 商汤集团有限公司 Trajectory prediction method, apparatus, device and storage medium

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114516336B (zh) * 2022-02-24 2023-09-26 重庆长安汽车股份有限公司 Vehicle trajectory prediction method considering road constraint conditions
CN114516336A (zh) * 2022-02-24 2022-05-20 重庆长安汽车股份有限公司 Vehicle trajectory prediction method considering road constraint conditions
CN114550297A (zh) * 2022-02-25 2022-05-27 北京拙河科技有限公司 Pedestrian intention analysis method and system
CN114550297B (zh) * 2022-02-25 2022-09-27 北京拙河科技有限公司 Pedestrian intention analysis method and system
CN114463687A (zh) * 2022-04-12 2022-05-10 北京云恒科技研究院有限公司 Movement trajectory prediction method based on big data
CN114997297A (zh) * 2022-05-26 2022-09-02 哈尔滨工业大学 Target motion intention inference method and system based on multi-level region division
CN114997297B (zh) * 2022-05-26 2024-05-03 哈尔滨工业大学 Target motion intention inference method and system based on multi-level region division
CN115562332A (zh) * 2022-09-01 2023-01-03 北京普利永华科技发展有限公司 Efficient processing method and system for airborne recorded data of unmanned aerial vehicles
CN115562332B (zh) * 2022-09-01 2023-05-16 北京普利永华科技发展有限公司 Efficient processing method and system for airborne recorded data of unmanned aerial vehicles
CN115690160A (zh) * 2022-11-16 2023-02-03 南京航空航天大学 Pedestrian trajectory prediction method and system for low-frame-rate video
CN115690160B (zh) * 2022-11-16 2023-12-15 南京航空航天大学 Pedestrian trajectory prediction method and system for low-frame-rate video
CN115564803A (zh) * 2022-12-06 2023-01-03 腾讯科技(深圳)有限公司 Animation processing method, apparatus, device, storage medium and product
CN116602663B (zh) * 2023-06-02 2023-12-15 深圳市震有智联科技有限公司 Intelligent monitoring method and system based on millimeter-wave radar
CN116602663A (zh) * 2023-06-02 2023-08-18 深圳市震有智联科技有限公司 Intelligent monitoring method and system based on millimeter-wave radar
CN116778101A (zh) * 2023-06-26 2023-09-19 北京道仪数慧科技有限公司 Map generation method and system based on operating vehicles
CN116778101B (zh) * 2023-06-26 2024-04-09 北京道仪数慧科技有限公司 Map generation method and system based on operating vehicles
CN116723616B (zh) * 2023-08-08 2023-11-07 杭州依森匠能数字科技有限公司 Light brightness control method and system
CN116723616A (zh) * 2023-08-08 2023-09-08 杭州依森匠能数字科技有限公司 Light brightness control method and system
CN117560638A (zh) * 2024-01-10 2024-02-13 山东派蒙机电技术有限公司 Converged communication method, apparatus and device applied to mobile terminal communication systems
CN117560638B (zh) * 2024-01-10 2024-03-22 山东派蒙机电技术有限公司 Converged communication method, apparatus and device applied to mobile terminal communication systems
CN117799641A (zh) * 2024-02-08 2024-04-02 北京科技大学 Optimal control method and apparatus for efficient and energy-saving driving of intelligent logistics vehicles under multi-vehicle interference

Also Published As

Publication number Publication date
CN111942407A (zh) 2020-11-17
JP2023511765A (ja) 2023-03-22
CN111942407B (zh) 2022-09-23
JP7513726B2 (ja) 2024-07-09

Similar Documents

Publication Publication Date Title
WO2022022721A1 (zh) Trajectory prediction method, apparatus, device, storage medium and program
Chen et al. A review of vision-based traffic semantic understanding in ITSs
US11682137B2 (en) Refining depth from an image
US11814039B2 (en) Vehicle operation using a dynamic occupancy grid
Yang et al. Crossing or not? Context-based recognition of pedestrian crossing intention in the urban environment
US20210216077A1 (en) Method, apparatus and computer storage medium for training trajectory planning model
US20230142676A1 (en) Trajectory prediction method and apparatus, device, storage medium and program
JP2022516288A (ja) 階層型機械学習ネットワークアーキテクチャ
CN111316286A (zh) 轨迹预测方法及装置、存储介质、驾驶***与车辆
US20190374151A1 (en) Focus-Based Tagging Of Sensor Data
US11887324B2 (en) Cross-modality active learning for object detection
Rezaei et al. Traffic-Net: 3D traffic monitoring using a single camera
Zhang et al. Gc-net: Gridding and clustering for traffic object detection with roadside lidar
CN115393677A (zh) 使用融合图像的端到端***训练
Ghahremannezhad et al. Object detection in traffic videos: A survey
Azfar et al. Deep learning-based computer vision methods for complex traffic environments perception: A review
CN117372991A (zh) 基于多视角多模态融合的自动驾驶方法及***
US20230105331A1 (en) Methods and systems for semantic scene completion for sparse 3d data
Mehtab Deep neural networks for road scene perception in autonomous vehicles using LiDARs and vision sensors
Ke Real-time video analytics empowered by machine learning and edge computing for smart transportation applications
Nishida et al. Environment Recognition from A Spherical Camera Image Based on DeepLab v3+
Foster Object detection and sensor data processing for off-road autonomous vehicles
Sarkar et al. Real-Time Risk Prediction at Signalized Intersection Using Graph Neural Network [Webinar]
US20240220787A1 (en) Neuromorphic computing system for edge computing
Sirmacek et al. Sequential image processing methods for improving semantic video segmentation algorithms

Legal Events

Code Title Description
121   Ep: the epo has been informed by wipo that ep was designated in this application
      Ref document number: 21848661; Country of ref document: EP; Kind code of ref document: A1
ENP   Entry into the national phase
      Ref document number: 2022546580; Country of ref document: JP; Kind code of ref document: A
NENP  Non-entry into the national phase
      Ref country code: DE
32PN  Ep: public notification in the ep bulletin as address of the addressee cannot be established
      Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28/04/2023)
122   Ep: pct application non-entry in european phase
      Ref document number: 21848661; Country of ref document: EP; Kind code of ref document: A1