CN117666559A - Autonomous vehicle transverse and longitudinal decision path planning method, system, equipment and medium - Google Patents


Info

Publication number
CN117666559A
Authority
CN
China
Prior art keywords
decision model, longitudinal, decision, autonomous vehicle, transverse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311468384.5A
Other languages
Chinese (zh)
Other versions
CN117666559B (en)
Inventor
陈雪梅
徐书缘
朱宇臻
肖龙
薛杨武
沈晓旭
赵小萱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Advanced Technology Research Institute of Beijing Institute of Technology
Original Assignee
Beijing Institute of Technology BIT
Advanced Technology Research Institute of Beijing Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT and Advanced Technology Research Institute of Beijing Institute of Technology
Priority to CN202311468384.5A
Publication of CN117666559A
Application granted
Publication of CN117666559B
Legal status: Active
Anticipated expiration

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems


Abstract

The invention discloses a method, a system, equipment and a medium for transverse and longitudinal decision path planning of an autonomous vehicle, relating to the technical field of vehicle driving decision-making. The method comprises: under global path navigation, obtaining candidate position points for each step by sampling offsets from the road center line; taking the positions and speeds of the autonomous vehicle and the environmental vehicles as state observables and the position point selected at each step as the action quantity to construct a transverse decision model, taking the accelerator-pedal opening and brake-pedal opening as the action quantities to construct a longitudinal decision model, designing reward functions, and training the transverse and longitudinal decision models; selecting the optimal position point for each step according to the trained transverse decision model, and obtaining a local path trajectory by polynomial fitting of the optimal position points; and, based on the local path trajectory, obtaining the speed control quantity according to the trained longitudinal decision model. The decision-planning effect under perceptual occlusion is thereby improved.

Description

Autonomous vehicle transverse and longitudinal decision path planning method, system, equipment and medium
Technical Field
The invention relates to the technical field of vehicle driving decision making, in particular to a method, a system, equipment and a medium for planning a transverse and longitudinal decision making path of an autonomous vehicle.
Background
In terms of safety and efficiency, unmanned vehicles have great advantages over manned vehicles. With the development of deep learning, learning-based methods, particularly reinforcement learning algorithms, have attracted wide attention in autonomous vehicle decision-planning research. However, relying entirely on conventional reinforcement learning algorithms cannot fully guarantee the safety and feasibility of trajectories; in addition, many reinforcement learning algorithms do not achieve high traffic efficiency.
Therefore, some scholars have integrated reinforcement learning into traditional decision planning, selecting local path points on unstructured roads based on reinforcement learning to guide local planning. However, this approach does not take perceptual occlusion into account and cannot adapt to scenes with perceptual uncertainty, such as a pedestrian suddenly emerging, which can easily cause an accident in a narrow field of view.
Other work builds a layered structure that decomposes the global task into several local subtasks and reaches the destination by accomplishing each subtask in turn; however, this architecture is relatively complex, and a fusion of traditional path planning with reinforcement learning algorithms is currently lacking.
Disclosure of Invention
In order to solve the above problems, the invention provides a method, a system, equipment and a medium for transverse and longitudinal decision path planning of an autonomous vehicle, which decouple the transverse and longitudinal decision problems, make decisions with a value-distributional reinforcement learning algorithm, and improve the decision-planning effect under perceptual occlusion.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
In a first aspect, the present invention provides an autonomous vehicle transverse and longitudinal decision path planning method, comprising:
under global path navigation, obtaining candidate position points for each step by sampling offsets from the road center line;
taking the positions and speeds of the autonomous vehicle and the environmental vehicles as state observables and the position point selected at each step as the action quantity to construct a transverse decision model; taking the accelerator-pedal opening and brake-pedal opening as the action quantities to construct a longitudinal decision model; and designing reward functions for the interaction with the environment after the autonomous vehicle executes an action, so as to train the transverse decision model and the longitudinal decision model;
selecting the optimal position point for each step according to the trained transverse decision model, and obtaining a local path trajectory by polynomial fitting of the optimal position points;
and, based on the local path trajectory, obtaining the speed control quantity according to the trained longitudinal decision model.
As an alternative embodiment, the reward function of the longitudinal decision model comprises: a safety reward function, an efficiency reward function, a destination-arrival reward function, and a comfort reward.
As an alternative embodiment, the safety reward function assigns a reward value in the event of a collision;
the destination-arrival reward function assigns a reward value if the destination is reached, and another reward value if the destination is not reached but no collision occurs;
the efficiency reward function controls the vehicle speed toward a desired vehicle speed;
the comfort reward r_comf is expressed as a function of the vehicle acceleration a.
As an alternative embodiment, the reward function of the transverse decision model comprises: the reward function of the longitudinal decision model, a reference-line reward function, and a lane-change reward function.
As an alternative embodiment, the reference-line reward function assigns a reward value if the vehicle is on the reference line;
the lane-change reward function assigns a reward value if a lane change occurs.
As an alternative implementation, when training the transverse decision model and the longitudinal decision model, the fully parameterized quantile function of value-distributional reinforcement learning is adopted; the longitudinal decision model is trained first, then the transverse decision model, after which iterative optimization is performed; during iterative optimization, the policy in one direction is fixed while the policy in the other direction is learned.
In the transverse decision model and the longitudinal decision model, the positions and speeds of the autonomous vehicle and the environmental vehicles are taken as state observables to form an observation space, which comprises the position differences and speed differences between the autonomous vehicle and the environmental vehicles; the state observables at different moments are stacked to form the state space.
In a second aspect, the present invention provides an autonomous vehicle transverse and longitudinal decision path planning system, comprising:
the sampling module is configured to obtain a position point of each step based on the sampling offset of the central line of the road under the global path navigation;
the model training module, configured to take the positions and speeds of the autonomous vehicle and the environmental vehicles as state observables, take the position point selected at each step as the action quantity to construct a transverse decision model, take the accelerator-pedal opening and brake-pedal opening as the action quantities to construct a longitudinal decision model, and design reward functions for the interaction with the environment after the autonomous vehicle executes an action, so as to train the transverse decision model and the longitudinal decision model;
the transverse decision module is configured to select the optimal position point of each step length according to the trained transverse decision model, and obtain a local path track after polynomial fitting of the optimal position point of each step length;
and the longitudinal decision module is configured to obtain the speed control quantity according to the trained longitudinal decision model based on the local path track.
In a third aspect, the invention provides an electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method of the first aspect.
In a fourth aspect, the present invention provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a method for planning a transverse and longitudinal decision path of an autonomous vehicle, which provides a decoupled layered framework, decouples a transverse and longitudinal decision problem, wherein the transverse decision problem is the planning of a local path track, firstly, the position point of each step length is obtained based on a sampling method under global path navigation, then the position point of each step length is decided by using a completely parameterized quantile function of value distributed reinforcement learning, the optimal position point of each step length is selected, and finally, the local path track is generated by using polynomial fitting of traditional path planning; the longitudinal decision-making problem is that the speed control quantity is subjected to reinforcement learning based on the local path track, the decision is made by utilizing a completely parameterized quantile function algorithm of value distributed reinforcement learning, a traditional path planning and value distributed reinforcement learning algorithm are fused, the reinforced learning neural network can fit the total rewarding value (namely Q value) under risk distribution, the safety action under perceived occlusion is potentially learned, the risk in an uncertainty environment is learned, and the decision planning effect under perceived occlusion is improved.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
Fig. 1 is a flow chart of an autonomous vehicle transverse and longitudinal decision path planning method provided in embodiment 1 of the present invention;
fig. 2 is a schematic diagram of transverse track sampling provided in embodiment 1 of the present invention;
FIG. 3 is a diagram of a framework of a transversal decision problem provided in embodiment 1 of the present invention;
FIG. 4 is a schematic view of a transversal and longitudinal iterative optimization provided in embodiment 1 of the present invention;
fig. 5 is a longitudinal strategy training convergence graph based on real vehicle data according to embodiment 1 of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms also are intended to include the plural forms, and furthermore, it is to be understood that the terms "comprises" and "comprising" and any variations thereof are intended to cover non-exclusive inclusions, such as, for example, processes, methods, systems, products or devices that comprise a series of steps or units, are not necessarily limited to those steps or units that are expressly listed, but may include other steps or units that are not expressly listed or inherent to such processes, methods, products or devices.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Example 1
This embodiment provides an autonomous vehicle transverse and longitudinal decision path planning method suitable for perceptually occluded environments. It is an autonomous vehicle decision-planning technique that realizes a decoupled layered structure by combining value-distributional reinforcement learning with a traditional path planning method, and it effectively improves traffic safety and traffic efficiency by accounting for potentially occluded obstacles.
As shown in fig. 1, the method specifically includes:
under global path navigation, obtaining candidate position points for each step by sampling offsets from the road center line;
taking the positions and speeds of the autonomous vehicle and the environmental vehicles as state observables and the position point selected at each step as the action quantity to construct a transverse decision model; taking the accelerator-pedal opening and brake-pedal opening as the action quantities to construct a longitudinal decision model; and designing reward functions for the interaction with the environment after the autonomous vehicle executes an action, so as to train the transverse decision model and the longitudinal decision model;
selecting the optimal position point for each step according to the trained transverse decision model, and obtaining a local path trajectory by polynomial fitting of the optimal position points;
and obtaining the speed control quantity according to the trained longitudinal decision model, based on the local path trajectory.
In this embodiment, in the Frenet coordinate system (also referred to as the S-L coordinate system) in an urban environment, offsets are sampled based on the road center line; as shown in fig. 2, Δs denotes the difference in longitudinal displacement and Δl denotes the difference in lateral displacement.
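The offset sampling described above can be sketched as follows; the function name, parameter names and the choice of evenly spaced offsets are illustrative assumptions, not the patent's exact implementation.

```python
import numpy as np

def sample_candidate_points(s0, delta_s, max_offset, n_offsets=5):
    """Candidate position points one decision step ahead in Frenet (S-L) coordinates.

    s0: current longitudinal position along the road center line.
    delta_s: longitudinal advance per step (the displacement difference Δs).
    max_offset: largest lateral offset Δl sampled on either side of the center line.
    """
    s_next = s0 + delta_s
    # Evenly spaced lateral offsets about the center line (l = 0); one of these
    # points is later chosen by the transverse decision model as the action.
    offsets = np.linspace(-max_offset, max_offset, n_offsets)
    return [(s_next, float(l)) for l in offsets]

candidates = sample_candidate_points(s0=10.0, delta_s=5.0, max_offset=1.5)
```

Each tuple is an (S, L) candidate; the transverse decision model's action is the index of the chosen candidate.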
The Frenet coordinate system represents road position more intuitively than traditional Cartesian coordinates; in Frenet coordinates, the position of the vehicle on the road is described by the variables S and L, where S represents the distance along the road (also referred to as the longitudinal displacement) and L represents the left-right position on the road (also referred to as the lateral displacement).
Because the vehicle kinematic model cannot directly follow a polyline, the local start point and local target point of the trajectory cannot simply be connected by straight lines; this embodiment therefore fits the trajectory between path points with a quintic (fifth-degree) polynomial.
It should be noted that the quintic polynomial is fitted to the S-L path points in the Frenet coordinate system and is independent of time information.
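A minimal sketch of such a quintic fit between two S-L path points, assuming boundary conditions on the lateral offset l and its first two derivatives with respect to s (the function names and boundary-condition choice are illustrative):

```python
import numpy as np

def quintic_coeffs(s0, bc0, s1, bc1):
    """Coefficients c_0..c_5 of l(s) = sum c_k * s**k.

    bc0 and bc1 are (l, dl/ds, d2l/ds2) at the start point s0 and end point s1:
    matching position, heading and a curvature-related term at both ends gives
    a 6x6 linear system with a unique quintic solution.
    """
    def rows(s):
        return [[1, s, s**2, s**3, s**4, s**5],        # l(s)
                [0, 1, 2*s, 3*s**2, 4*s**3, 5*s**4],   # l'(s)
                [0, 0, 2, 6*s, 12*s**2, 20*s**3]]      # l''(s)
    A = np.array(rows(s0) + rows(s1), dtype=float)
    b = np.array(list(bc0) + list(bc1), dtype=float)
    return np.linalg.solve(A, b)

def eval_quintic(c, s):
    return float(sum(ck * s**k for k, ck in enumerate(c)))

# Smooth segment: start on the center line, end 1.5 m to one side,
# with zero slope and zero second derivative at both ends.
c = quintic_coeffs(0.0, (0.0, 0.0, 0.0), 5.0, (1.5, 0.0, 0.0))
```

With zero first and second derivatives at both ends, the resulting curve is the familiar smooth-step profile used for lane-change-like segments.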
In this embodiment, the autonomous vehicle transverse and longitudinal decision path planning problem is decoupled into a transverse decision problem and a longitudinal decision problem, used respectively for planning the local path trajectory and planning the speed.
The transverse decision problem is decomposed into two parts, as shown in fig. 3: first, candidate position points for each step are obtained by sampling under the global path navigation, and the position point for each step is then decided with the fully parameterized quantile function (Fully parameterized Quantile Function, FQF) of value-distributional reinforcement learning, selecting the optimal position point for each step; second, the optimal position points are fitted with a quintic polynomial to generate the local path trajectory.
The longitudinal decision problem is to learn the speed control quantity based on the local path trajectory.
Therefore, this embodiment formulates the transverse and longitudinal decision problems as two separate Markov decision processes and makes decisions with the FQF algorithm.
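In FQF, the network outputs both the quantile fractions and the quantile values of the return distribution, and the Q value used for greedy action selection is the expectation of that distribution. The sketch below shows only this final expectation-and-argmax step over given quantiles; it is a simplification for illustration, not the full FQF training procedure used in the patent.

```python
import numpy as np

def q_from_quantiles(taus, quantile_values):
    """Q(s, a) as the expectation of the learned return distribution.

    taus: increasing quantile fractions with taus[0] = 0 and taus[-1] = 1, shape (N+1,).
    quantile_values: return quantiles for each interval, shape (N, n_actions).
    """
    weights = np.diff(taus)           # probability mass of each quantile interval
    return weights @ quantile_values  # expected return per action, shape (n_actions,)

def greedy_action(taus, quantile_values):
    """Pick the candidate position point (action) with the highest expected return."""
    return int(np.argmax(q_from_quantiles(taus, quantile_values)))

# Two actions, two quantile intervals of mass 0.5 each.
taus = np.array([0.0, 0.5, 1.0])
z = np.array([[1.0, 0.0],    # lower-interval quantile values for actions 0 and 1
              [3.0, 10.0]])  # upper-interval quantile values
q_values = q_from_quantiles(taus, z)
best = greedy_action(taus, z)
```

Because the whole distribution is represented rather than a single mean, risk in uncertain (e.g. occluded) situations is reflected in the quantile spread.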
In this embodiment, the transverse decision model is constructed by taking the positions and speeds of the autonomous vehicle and the environmental vehicles as state observables and the position point selected at each step as the action quantity; the longitudinal decision model is constructed by taking the positions and speeds of the autonomous vehicle and the environmental vehicles as state observables and the accelerator-pedal opening and brake-pedal opening as the action quantities;
wherein:
(1) The positions and speeds of the autonomous vehicle and the environmental vehicles are taken as state observables, forming the observation space o_t, which comprises the position differences and speed differences between the autonomous vehicle and the environmental vehicles; the state observables at different moments are stacked to form the state space s_t:
o_t = [p_1 - p_e, v_1 - v_e, p_2 - p_e, v_2 - v_e] (1)
s_t = [o_t, o_{t-1}, o_{t-2}] (2)
where o_t, o_{t-1} and o_{t-2} are the observation spaces at times t, t-1 and t-2; s_t is the state space at time t; subscripts 1 and 2 denote the first and second environmental vehicles and e denotes the autonomous vehicle; p_1, p_2 and p_e are the position coordinates of the first environmental vehicle, the second environmental vehicle and the autonomous vehicle; and v_1, v_2 and v_e are the speeds of the first environmental vehicle, the second environmental vehicle and the autonomous vehicle, respectively.
(2) The action quantity of the transverse decision model is the choice of one position point from the several position points available at each step;
the action quantity of the longitudinal decision model is the accelerator-pedal opening and the brake-pedal opening, which adjust the speed.
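The observation and state construction of equations (1) and (2) can be sketched as follows; the dictionary layout, the two-dimensional positions, and the zero-padding of the initial frames are illustrative assumptions:

```python
from collections import deque
import numpy as np

def make_observation(ego, vehicles):
    """o_t: position and speed differences of each environmental vehicle w.r.t. the ego."""
    obs = []
    for v in vehicles:
        obs.extend([v["pos"][0] - ego["pos"][0],
                    v["pos"][1] - ego["pos"][1],
                    v["speed"] - ego["speed"]])
    return np.array(obs, dtype=float)

class StateStacker:
    """s_t = [o_t, o_{t-1}, o_{t-2}]: stack of the last k observations."""
    def __init__(self, k=3, obs_dim=6):
        self.frames = deque([np.zeros(obs_dim)] * k, maxlen=k)

    def push(self, obs):
        self.frames.append(obs)
        # Newest observation first, matching s_t = [o_t, o_{t-1}, o_{t-2}].
        return np.concatenate(list(self.frames)[::-1])

ego = {"pos": (0.0, 0.0), "speed": 5.0}
vehicles = [{"pos": (10.0, 0.0), "speed": 4.0},
            {"pos": (-3.0, 1.0), "speed": 6.0}]
stacker = StateStacker()
s_t = stacker.push(make_observation(ego, vehicles))
```

At the first step, the earlier frames o_{t-1} and o_{t-2} are still zero; after three pushes the stack is fully populated.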
In this embodiment, designing a reward function for interaction with the environment after the autonomous vehicle performs the action specifically includes:
(1) The reward function r_lon of the longitudinal decision model comprises: a safety reward function r_safe, an efficiency reward function r_eff, a destination-arrival reward function r_dest, and a comfort reward r_comf, as given by equations (3)-(7):
r_safe = p_7 if a collision occurs, and 0 otherwise (3)
r_eff penalizes the deviation of the current vehicle speed v from the desired vehicle speed v_des, so as to keep the speed close to the desired speed (4)
r_dest = p_8 if the destination is reached; p_9 if the destination is not reached but no collision occurs; and 0 otherwise (5)
r_comf is a function of the vehicle acceleration a (6)
r_lon = r_safe + r_eff + r_dest + r_comf (7)
where v_des is the desired vehicle speed, v is the current vehicle speed, and a is the vehicle acceleration; equation (7) is the overall reward function of the longitudinal decision model.
(2) In the transverse decision model, the lateral offset from the reference line and the lateral displacement variation are additionally considered; that is, the reward function r_lat of the transverse decision model comprises: the safety reward function r_safe, the efficiency reward function r_eff, the destination-arrival reward function r_dest, the comfort reward r_comf, a reference-line reward function r_ref, and a lane-change reward function r_lane, as given by equations (8)-(10):
r_ref = p_10 if the vehicle is on the reference line, and p_11 otherwise (8)
r_lane = p_12 if the current action quantity a_t produces a lane change, and 0 otherwise (9)
r_lat = r_safe + r_eff + r_dest + r_comf + r_ref + r_lane (10)
where a_t is the current action quantity; equation (10) is the overall reward function of the transverse decision model. p_7 to p_12 are hyperparameters, with the specific values p_7 = -1000, p_8 = 500, p_9 = -200, p_10 = -1000, p_11 = 500, p_12 = -200.
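Under the hyperparameter values above, the longitudinal reward of equations (3)-(7) can be sketched as follows; the exact closed forms of the efficiency and comfort terms are not given in the text, so the absolute-deviation penalties and the weights w_eff and w_comf are illustrative assumptions:

```python
def longitudinal_reward(collided, reached, v, v_des, a,
                        p7=-1000.0, p8=500.0, p9=-200.0,
                        w_eff=1.0, w_comf=0.1):
    """r_lon = r_safe + r_eff + r_dest + r_comf, following equations (3)-(7)."""
    r_safe = p7 if collided else 0.0               # (3) collision penalty
    r_eff = -w_eff * abs(v_des - v)                # (4) push speed toward v_des (assumed form)
    if reached:                                    # (5) destination-arrival reward
        r_dest = p8
    elif not collided:
        r_dest = p9
    else:
        r_dest = 0.0
    r_comf = -w_comf * abs(a)                      # (6) comfort penalty (assumed form)
    return r_safe + r_eff + r_dest + r_comf        # (7)
```

The transverse reward of equation (10) would add the reference-line and lane-change terms to the same sum.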
In this embodiment, when training the transverse decision model and the longitudinal decision model, the longitudinal decision model is trained first, then the transverse decision model, after which iterative optimization is performed; during iterative optimization, the policy in one direction is fixed while the policy in the other direction is learned, as shown in fig. 4.
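The alternating scheme described above and in fig. 4 can be sketched as follows; the trainer-callback interface and the stub trainers are illustrative assumptions:

```python
def alternating_training(train_longitudinal, train_transverse, n_rounds=2):
    """Train the longitudinal policy first, then the transverse policy, then refine
    the two in alternation; while one policy is being trained, the other is passed
    in as a fixed (non-updated) policy."""
    lon = train_longitudinal(fixed_transverse=None)     # initial longitudinal training
    lat = train_transverse(fixed_longitudinal=lon)      # transverse with longitudinal fixed
    for _ in range(n_rounds):
        lon = train_longitudinal(fixed_transverse=lat)  # longitudinal with transverse fixed
        lat = train_transverse(fixed_longitudinal=lon)  # transverse with longitudinal fixed
    return lon, lat

# Minimal demonstration with stub trainers that only count invocations.
calls = {"lon": 0, "lat": 0}

def _stub_lon(fixed_transverse=None):
    calls["lon"] += 1
    return ("lon", calls["lon"])

def _stub_lat(fixed_longitudinal=None):
    calls["lat"] += 1
    return ("lat", calls["lat"])

final_lon, final_lat = alternating_training(_stub_lon, _stub_lat, n_rounds=2)
```

Holding one policy fixed keeps each training phase a stationary single-agent problem for the other policy.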
Experiment verification
This embodiment provides a training mode based on real-vehicle data, used mainly to verify the intersection pedestrian-emergence scene. The real-vehicle data introduced are real environment data collected at an urban T-shaped intersection with an autonomous driving platform; the data rely mainly on a camera and a lidar, where the lidar obtains the distance between a pedestrian and the platform and the camera assists the lidar in acquiring information. From the lidar point cloud and the camera data, pedestrians can be accurately identified and their relative pose obtained.
Pedestrian-emergence data were acquired at the T-junction, with 100 sets of pedestrian data collected in this scenario; this amount of data is sufficient for pedestrians but insufficient for vehicles. The initial position of the vehicle was therefore expanded by data augmentation, generating an initial vehicle position for each episode together with the corresponding real pedestrian data, and a longitudinal policy training convergence curve was obtained by replaying the real-vehicle data over multiple training runs, as shown in fig. 5, where the gray shaded part of the curve is the confidence interval over the multiple training runs.
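The initial-position augmentation can be sketched as follows; the uniform perturbation range, the seed, and the function name are illustrative assumptions, not the patent's exact augmentation scheme:

```python
import random

def augment_initial_positions(base_s, n_episodes, max_offset=5.0, seed=0):
    """Expand the limited real-vehicle data by perturbing the ego vehicle's initial
    longitudinal position; each generated episode replays the recorded pedestrian
    data from a different vehicle start position."""
    rng = random.Random(seed)
    return [base_s + rng.uniform(-max_offset, max_offset) for _ in range(n_episodes)]

starts = augment_initial_positions(base_s=20.0, n_episodes=100)
```

Seeding the generator makes each augmented training set reproducible across runs.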
Example 2
The embodiment provides an autonomous vehicle transverse and longitudinal decision path planning system, comprising:
the sampling module is configured to obtain a position point of each step based on the sampling offset of the central line of the road under the global path navigation;
the model training module, configured to take the positions and speeds of the autonomous vehicle and the environmental vehicles as state observables, take the position point selected at each step as the action quantity to construct a transverse decision model, take the accelerator-pedal opening and brake-pedal opening as the action quantities to construct a longitudinal decision model, and design reward functions for the interaction with the environment after the autonomous vehicle executes an action, so as to train the transverse decision model and the longitudinal decision model;
the transverse decision module is configured to select the optimal position point of each step length according to the trained transverse decision model, and obtain a local path track after polynomial fitting of the optimal position point of each step length;
and the longitudinal decision module is configured to obtain the speed control quantity according to the trained longitudinal decision model based on the local path track.
It should be noted that the above modules correspond to the steps described in embodiment 1; the modules and the corresponding steps share the same examples and application scenarios but are not limited to the content disclosed in embodiment 1. The modules may be implemented as part of a system in a computer system, for example as a set of computer-executable instructions.
In further embodiments, there is also provided:
an electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method described in embodiment 1. For brevity, the description is omitted here.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate array FPGA or other programmable logic device, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method described in embodiment 1.
The method in embodiment 1 may be directly embodied as a hardware processor executing or executed with a combination of hardware and software modules in the processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided herein.
Those of ordinary skill in the art will appreciate that the elements of the various examples described in connection with the present embodiments, i.e., the algorithm steps, can be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (10)

1. An autonomous vehicle transverse and longitudinal decision path planning method, comprising:
under global path navigation, obtaining a position point of each step based on the sampling offset of the central line of the road;
the method comprises the steps of taking the positions and the speeds of an autonomous vehicle and an environmental vehicle as state observables, taking a position point selected under each step as an action quantity to construct a transverse decision model, taking the opening of an accelerator pedal and the opening of a brake pedal as action quantities to construct a longitudinal decision model, and designing a reward function interacted with the environment after the autonomous vehicle executes the action to train the transverse decision model and the longitudinal decision model;
selecting an optimal position point of each step according to the trained transverse decision model, and obtaining a local path track after polynomial fitting of the optimal position point of each step;
and obtaining the speed control quantity according to the trained longitudinal decision model based on the local path track.
2. The autonomous vehicle transverse and longitudinal decision path planning method of claim 1, wherein the reward function of the longitudinal decision model comprises: a safety reward function, an efficiency reward function, a destination-arrival reward function, and a comfort reward.
3. The autonomous vehicle transverse and longitudinal decision path planning method of claim 2, wherein the safety reward function assigns a reward value in the event of a collision;
the destination-arrival reward function assigns a reward value if the destination is reached, and another reward value if the destination is not reached but no collision occurs;
the efficiency reward function controls the vehicle speed toward a desired vehicle speed; and
the comfort reward r_comf is expressed as a function of the vehicle acceleration a.
4. The autonomous vehicle transverse and longitudinal decision path planning method of claim 2, wherein the reward function of the transverse decision model comprises: the reward function of the longitudinal decision model, a reference-line reward function, and a lane-change reward function.
5. The autonomous vehicle transverse and longitudinal decision path planning method of claim 4, wherein the reference line reward function represents a reward value given if the vehicle is on the reference line;
and the lane change reward function means that a reward value is given if a lane change occurs.
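The reward structure of claims 2 through 5 can be sketched as follows. All numeric weights and magnitudes here are illustrative placeholders (the patent does not disclose them), and the function names are hypothetical; only the decomposition into safety, destination, efficiency, comfort, reference-line, and lane-change terms follows the claims.

```python
def longitudinal_reward(collided, reached_goal, v, v_des, a):
    """Sketch of the longitudinal reward: safety + destination-arrival
    + efficiency-pass + comfort terms.  Weights are illustrative."""
    r_safe = -10.0 if collided else 0.0                 # collision penalty
    r_goal = 10.0 if reached_goal else (1.0 if not collided else 0.0)
    r_eff = -abs(v - v_des) / max(v_des, 1e-6)          # track the desired speed
    r_comfort = -abs(a)                                 # penalize harsh acceleration
    return r_safe + r_goal + r_eff + r_comfort

def transverse_reward(collided, reached_goal, v, v_des, a,
                      on_ref_line, lane_changed):
    """Per claim 4: longitudinal terms plus reference-line and
    lane-change terms."""
    r = longitudinal_reward(collided, reached_goal, v, v_des, a)
    r += 0.5 if on_ref_line else 0.0     # reward staying on the reference line
    r += -0.5 if lane_changed else 0.0   # discourage unnecessary lane changes
    return r
```

A safe arrival at the desired speed with zero acceleration thus scores highest, while a collision dominates every other term.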
6. The autonomous vehicle transverse and longitudinal decision path planning method of claim 1, wherein, when training the transverse decision model and the longitudinal decision model, a fully parameterized quantile function of value-distribution reinforcement learning is adopted; the longitudinal decision model is trained first, then the transverse decision model, after which the two are optimized iteratively; during iterative optimization, the strategy in one direction is fixed while the strategy in the other direction is learned.
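The alternating scheme of claim 6 is a coordinate-ascent-style loop: one policy is frozen while the other learns. The sketch below stubs out the fully-parameterized-quantile-function (FQF) update with a counter, since the patent does not disclose the network details; `Agent`, `train_one`, and all episode counts are hypothetical.

```python
class Agent:
    """Minimal stand-in for an FQF-based policy; the real agent would
    learn a fully parameterized quantile function of the return."""
    def __init__(self, name):
        self.name = name
        self.updates = 0

    def act(self, obs):
        return 0.0            # a frozen policy simply emits its current action

    def update(self, transition):
        self.updates += 1     # placeholder for the FQF gradient step

def train_one(learner, frozen, episodes):
    """Only `learner` receives updates; `frozen` supplies the
    other control axis (environment interaction elided)."""
    for _ in range(episodes):
        obs = None
        _ = frozen.act(obs)
        learner.update(transition=None)

def alternating_training(long_agent, lat_agent, rounds=2, episodes=10):
    """Longitudinal first, then transverse, then alternate with one
    side fixed, as described in claim 6."""
    train_one(long_agent, lat_agent, episodes)
    train_one(lat_agent, long_agent, episodes)
    for _ in range(rounds):
        train_one(long_agent, lat_agent, episodes)
        train_one(lat_agent, long_agent, episodes)

long_agent, lat_agent = Agent("longitudinal"), Agent("transverse")
alternating_training(long_agent, lat_agent)
```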
7. The autonomous vehicle transverse and longitudinal decision path planning method of claim 1, wherein, in the transverse decision model and the longitudinal decision model, the positions and speeds of the autonomous vehicle and the environmental vehicle are taken as state observables to form an observation space; the observation space comprises the position difference and the speed difference between the autonomous vehicle and the environmental vehicle, and the state observables at different moments are stacked to form the state space.
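The frame-stacking of claim 7 can be sketched as a fixed-length buffer of relative observables. The frame layout (position difference in x and y plus speed difference) and the stack depth of four are illustrative assumptions; the claims specify only that differences are observed and that observables from different moments are stacked.

```python
from collections import deque
import numpy as np

class StackedObservation:
    """State builder: each frame holds the ego/environment position and
    speed differences; the last `n_frames` frames form the state."""
    def __init__(self, n_frames=4, frame_dim=3):
        self.frames = deque([np.zeros(frame_dim) for _ in range(n_frames)],
                            maxlen=n_frames)

    def step(self, ego, env_veh):
        # ego / env_veh: (x, y, v) tuples; the frame stores their differences
        frame = np.array([env_veh[0] - ego[0],
                          env_veh[1] - ego[1],
                          env_veh[2] - ego[2]])
        self.frames.append(frame)               # oldest frame drops out
        return np.concatenate(self.frames)      # stacked state vector

stacker = StackedObservation()
state = stacker.step((0.0, 0.0, 10.0), (20.0, 3.5, 8.0))
```

With a depth of four and three observables per frame, the state vector has twelve components, the newest frame occupying the final three.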
8. An autonomous vehicle transverse and longitudinal decision path planning system, comprising:
a sampling module configured to obtain the position points of each step based on sampled offsets from the road center line under global path navigation;
a model training module configured to take the positions and speeds of the autonomous vehicle and the environmental vehicle as state observables, take the position point selected at each step as the action quantity to construct a transverse decision model, take the accelerator pedal opening and the brake pedal opening as the action quantities to construct a longitudinal decision model, and design a reward function for the interaction with the environment after the autonomous vehicle executes an action, so as to train the transverse decision model and the longitudinal decision model;
a transverse decision module configured to select the optimal position point for each step according to the trained transverse decision model, and obtain a local path trajectory after polynomial fitting of the optimal position points;
and a longitudinal decision module configured to obtain the speed control quantity according to the trained longitudinal decision model based on the local path trajectory.
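The sampling module of claim 8 can be sketched as generating, at each look-ahead step along the global path, a set of candidate position points as lateral offsets from the road centreline; the transverse decision model then picks one candidate per step as its action. The step length, number of steps, and offset grid below are hypothetical, as is the straight centreline.

```python
def sample_candidate_points(centerline_y=0.0, step_x=5.0, n_steps=4,
                            offsets=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    """For each look-ahead step, return candidate (x, y) position points
    as lateral offsets from the road centreline (assumed straight here)."""
    candidates = []
    for i in range(1, n_steps + 1):
        x = i * step_x                                    # longitudinal station
        candidates.append([(x, centerline_y + d) for d in offsets])
    return candidates

cands = sample_candidate_points()
```

In a real implementation the offsets would be applied perpendicular to a curved centreline in a Frenet-style frame rather than along a fixed y axis.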
9. An electronic device, comprising a memory, a processor, and computer instructions stored on the memory and runnable on the processor, wherein the computer instructions, when executed by the processor, perform the method of any one of claims 1-7.
10. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of any of claims 1-7.
CN202311468384.5A 2023-11-07 2023-11-07 Autonomous vehicle transverse and longitudinal decision path planning method, system, equipment and medium Active CN117666559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311468384.5A CN117666559B (en) 2023-11-07 2023-11-07 Autonomous vehicle transverse and longitudinal decision path planning method, system, equipment and medium

Publications (2)

Publication Number Publication Date
CN117666559A true CN117666559A (en) 2024-03-08
CN117666559B CN117666559B (en) 2024-07-02

Family

ID=90072318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311468384.5A Active CN117666559B (en) 2023-11-07 2023-11-07 Autonomous vehicle transverse and longitudinal decision path planning method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN117666559B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200043324A1 (en) * 2017-11-01 2020-02-06 Tencent Technology (Shenzhen) Company Limited Method for obtaining road condition information, apparatus thereof, and storage medium
CN111383474A (en) * 2018-12-29 2020-07-07 长城汽车股份有限公司 Decision making system and method for automatically driving vehicle
CN114435396A (en) * 2022-01-07 2022-05-06 北京理工大学前沿技术研究院 Intelligent vehicle intersection behavior decision method
CN114919578A (en) * 2022-07-20 2022-08-19 北京理工大学前沿技术研究院 Intelligent vehicle behavior decision method, planning method, system and storage medium
CN115257746A (en) * 2022-07-21 2022-11-01 同济大学 Uncertainty-considered decision control method for lane change of automatic driving automobile
US11748664B1 (en) * 2023-03-31 2023-09-05 Geotab Inc. Systems for creating training data for determining vehicle following distance

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Su Weixing et al., "Local path planning algorithm for autonomous driving based on environmental risk", Information and Control, vol. 52, no. 3, 3 November 2022 (2022-11-03), pages 369-381 *
Chen Xuemei et al., "Driving rule acquisition and decision algorithm for unmanned vehicles in urban environments", Transactions of Beijing Institute of Technology, vol. 37, no. 5, 15 May 2017 (2017-05-15), pages 491-496 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant