CN115257819A - Decision-making method for safe driving of large-scale commercial vehicle in urban low-speed environment - Google Patents
- Publication number
- CN115257819A (application CN202211070514.5A)
- Authority
- CN
- China
- Prior art keywords
- vehicle
- driving
- network
- safe driving
- substep
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/001—Planning or execution of driving tasks
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W2050/0001—Details of the control system
- B60W2050/0028—Mathematical models, e.g. for simulation
- B60W2300/125—Heavy duty trucks
- B60W2420/403—Image sensing, e.g. optical camera
- B60W2420/408—Radar; Laser, e.g. lidar
- B60W2556/10—Historical data
- G06F17/15—Correlation function computation including computation of convolution operations
Abstract
The invention discloses a decision-making method for safe driving of large commercial vehicles in an urban low-speed environment. First, the safe driving behaviors of human drivers in the urban traffic environment are collected to construct a safe driving behavior data set. Second, a multi-head-attention-based safe driving decision model for commercial vehicles is constructed. The model contains two sub-networks: a deep double-Q network and a generative adversarial imitation learning network. The deep double-Q network learns safe driving strategies for edge scenarios such as dangerous scenes and conflict scenes in an unsupervised manner; the generative adversarial imitation learning sub-network imitates the safe driving behavior of a human driver under different driving conditions and operating regimes. Finally, the safe driving decision model is trained to obtain driving strategies for the different driving conditions. The proposed method can imitate the safe driving behavior of a human driver while accounting for the influence of factors such as visual blind zones and sudden obstacles on driving safety.
Description
Technical Field
The invention relates to driving decision-making for commercial vehicles, in particular to a decision-making method for safe driving of large commercial vehicles in an urban low-speed environment, and belongs to the technical field of automobile safety.
Background
In urban traffic environments, road traffic accidents caused by drivers' blind zones account for the highest percentage, and the vehicles chiefly involved are large commercial vehicles such as heavy trucks and large buses. Unlike passenger cars, large commercial vehicles have a large body, long wheelbase, and high driving position, and many static and dynamic blind zones exist around the body, such as in front of the cab, near the right front wheel, and below the right rear-view mirror. When a commercial vehicle turns, especially to the right, pedestrians and non-motor vehicles in the blind zone are extremely likely to be struck or even run over; this is the main source of serious safety accidents. In addition, compared with a closed expressway scenario, an urban traffic environment mixes many types of traffic participants in large numbers, and commercial vehicles frequently encounter sudden obstacles, so the danger is higher. Therefore, improving the driving safety of commercial vehicles in an open urban traffic environment with multi-traffic-target interference is a key problem to be solved urgently and a key to guaranteeing urban road traffic safety.
At present, the development of automatic driving technology has become a widely accepted means, at home and abroad, of guaranteeing vehicle driving safety. As a key part of achieving high-quality automatic driving, driving decisions determine the rationality and safety of automatic driving of commercial vehicles. If the driver can be warned of danger 1.5 seconds before a traffic accident happens and given a reliable and effective safe driving strategy, the frequency of traffic accidents caused by factors such as blind zones and sudden obstacles can be greatly reduced. Therefore, research on safe driving decision methods for large commercial vehicles plays an important role in guaranteeing the driving safety of commercial vehicles.
Many patents and publications have addressed collision-avoidance driving decisions, but they are mainly directed at passenger vehicles. Compared with a passenger vehicle, a commercial vehicle has larger blind zones and a longer braking distance and braking time, so collision-avoidance decision methods for passenger vehicles cannot be applied directly to commercial vehicles. On the other hand, some patents have studied safe driving decisions for commercial vehicles, such as a decision method for highly human-like automatic driving of commercial vehicles (application No. 202210158758.2) and a deep-learning-based lane-change decision method for large commercial vehicles (publication No. CN 113954837A), but these methods are all oriented towards highway scenarios.
Unlike a highway scenario with few types of traffic participants, the urban traffic environment is characterized by openness, multi-traffic-target interference, and mixed traffic flows. In particular, vehicle blind zones and sudden obstacles pose higher challenges for safe driving of commercial vehicles in an urban traffic environment. Therefore, decision-making methods for safe driving of commercial vehicles oriented towards highway scenarios cannot be applied directly to the open, interference-prone urban traffic environment.
In general, for an open urban traffic environment with multi-traffic-target interference, existing methods struggle to meet the safe-driving-decision needs of commercial vehicles, and there is a lack of methods that can provide concrete driving suggestions such as driving actions and driving paths; in particular, research on safe driving decisions for large commercial vehicles that considers the influence of blind zones and sudden obstacles is lacking.
Disclosure of Invention
The invention aims to: provide a decision-making method for safe driving of large commercial vehicles in an urban low-speed environment, aimed at automatically driven commercial vehicles such as heavy trucks, in order to realize safe driving decisions for large commercial vehicles in the urban low-speed environment and guarantee vehicle driving safety. The method comprehensively considers the influence of factors such as blind zones, sudden obstacles, and different driving conditions on driving safety, can imitate the safe driving behavior of human drivers, provides a more reasonable and safe driving strategy for automatically driven commercial vehicles, and can effectively guarantee their driving safety. Moreover, the method does not require complex vehicle dynamics equations or body parameters, the calculation is simple and clear, the safe driving strategy can be output in real time, and the sensors used are low-cost, which facilitates large-scale deployment.
The technical scheme is as follows: to realize the purpose of the invention, the technical scheme adopted is a decision-making method for safe driving of large commercial vehicles in an urban low-speed environment. First, the safe driving behaviors of human drivers in the urban traffic environment are collected to construct a safe driving behavior data set. Second, a multi-head-attention-based decision model for safe driving of commercial vehicles is constructed. The model contains two sub-networks: a deep double-Q network and a generative adversarial imitation learning network. The deep double-Q network learns safe driving strategies for edge scenarios such as dangerous scenes and conflict scenes in an unsupervised manner; the generative adversarial imitation learning sub-network imitates safe driving behavior under different driving conditions and operating regimes. Finally, the safe driving decision model is trained to obtain driving strategies for the different driving conditions, realizing high-level decision output of safe driving behaviors for commercial vehicles. The method specifically comprises the following steps:
Step one: collect the safe driving behaviors of human drivers in the urban traffic environment
To realize driving decisions comparable to those of a human driver, safe driving behaviors under different driving conditions and operating regimes are collected through real road tests and driving simulation, and a data set representing the safe driving behavior of human drivers is constructed. This comprises the following 4 sub-steps:
Substep 1: construct a multi-dimensional target-information synchronous acquisition system using a millimeter-wave radar, a 128-line lidar, a vision sensor, a BeiDou sensor, and an inertial sensor.
Substep 2: in a real urban environment, several drivers in turn drive a commercial vehicle carrying the multi-dimensional target-information synchronous acquisition system. Data on driving behaviors such as lane changing, lane keeping, car following, and acceleration and deceleration are acquired and processed to obtain multi-source heterogeneous descriptions of the driving behaviors, such as the distances to obstacles in several directions measured by the radar or vision sensor; the position, velocity, acceleration, and yaw rate measured by the BeiDou and inertial sensors; and the steering-wheel angle measured by the on-board sensor.
Substep 3: to simulate safe driving behaviors in edge scenarios such as dangerous scenes and conflict scenes, a virtual urban scene based on hardware-in-the-loop simulation is built. The constructed urban traffic scenes comprise the following three types:
(1) During the driving process of the vehicle, a transversely approaching traffic participant (namely, a sudden obstacle) appears in front of the vehicle;
(2) In the steering process of the vehicle, static traffic participants exist in a visual blind area of the vehicle;
(3) During the turning of the vehicle, moving traffic participants are present in the blind vision zone of the vehicle.
In the above traffic scenarios, there are a variety of road network structures (straight roads, curves, and intersections) and a variety of traffic participants (commercial vehicles, passenger vehicles, non-motor vehicles, and pedestrians).
Several drivers drive the commercial vehicle in the virtual scene through real controls (steering wheel, accelerator, and brake pedal), and information such as the vehicle's lateral and longitudinal position, lateral and longitudinal velocity, lateral and longitudinal acceleration, and the relative distance and relative speed to surrounding traffic participants is collected.
Substep 4: based on the data collected in the real urban environment and the driving simulation environment, a driving behavior data set for safe-driving-decision learning is constructed, which can be expressed as:

X = {(s_1, a_1), (s_2, a_2), ..., (s_n, a_n)}  (1)

where X denotes the set of state-action pairs, i.e. the constructed data set representing the safe driving behavior of human drivers; (s_j, a_j) denotes the "state-action" pair at time j, with s_j the state at time j and a_j the action the human driver takes based on state s_j; and n denotes the number of "state-action" pairs in the data set.
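The data set of Equation (1) can be sketched as a plain container of (s_j, a_j) pairs; the class and field names below are illustrative assumptions, not from the patent:

```python
# Illustrative sketch of the data set X in Equation (1); names are assumptions.
class DrivingBehaviorDataset:
    """Container for human-driver "state-action" pairs collected from
    real road tests and hardware-in-the-loop driving simulation."""

    def __init__(self):
        self.pairs = []  # list of (s_j, a_j) tuples

    def add(self, state, action):
        self.pairs.append((state, action))

    def __len__(self):
        return len(self.pairs)  # n, the number of state-action pairs

# Each state could bundle ego motion and relative motion to other
# participants; actions come from the discrete action space defined
# later in the description (e.g. a_straight, a_decel).
ds = DrivingBehaviorDataset()
ds.add({"ego": [0.0, 5.2], "relative": [[12.3, -1.1]]}, "a_straight")
ds.add({"ego": [0.5, 5.4], "relative": [[11.8, -1.0]]}, "a_decel")
```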
Step two: construction of multi-head attention-based decision model for safe driving of commercial vehicle
To realize safe driving decisions for large commercial vehicles in an urban low-speed environment, the invention comprehensively considers the influence of factors such as blind zones, sudden obstacles, and driving conditions on driving safety, and establishes a safe driving decision model for commercial vehicles. Considering that deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning and explores the traffic environment in an unsupervised manner, it is used to learn safe driving strategies in edge scenarios such as dangerous scenes and conflict scenes. In addition, considering the ability of imitation learning to reproduce expert behavior, the invention uses it to imitate the safe driving behavior of human drivers under different driving conditions and operating regimes. The constructed safe driving decision model therefore consists of two parts, described as follows:
substep 1: defining basic parameters of a safe driving decision model
First, the safe driving decision problem in the urban low-speed environment is converted into a finite Markov decision process. Second, the basic parameters of the safe driving decision model are defined.
(1) Defining a state space
To describe the motion state of the own vehicle and nearby traffic participants, the present invention constructs a state space using time series data and an occupancy grid map. The specific description is as follows:
S_t = [S_1(t), S_2(t), S_3(t)]  (2)

where S_t denotes the state space at time t, S_1(t) and S_2(t) denote the state spaces related to time-series data at time t, and S_3(t) denotes the state space associated with the occupancy grid map at time t.
First, the motion state of the ego vehicle is described by its continuous position, velocity, acceleration, and heading-angle information:

S_1(t) = [p_x, p_y, v_x, v_y, a_x, a_y, θ_s]  (3)

where p_x, p_y denote the lateral and longitudinal position of the ego vehicle in metres; v_x, v_y denote its lateral and longitudinal velocity in metres per second; a_x, a_y denote its lateral and longitudinal acceleration in metres per second squared; and θ_s denotes its heading angle in degrees.
Second, the motion state of the surrounding traffic participants is described by the relative motion state of the ego vehicle and each participant:

S_2(t) = [d_rel_i, v_rel_i, a_rel_i], i = 1, ..., N  (4)

where d_rel_i, v_rel_i, a_rel_i denote the relative distance, relative velocity, and acceleration between the ego vehicle and the i-th traffic participant, in metres, metres per second, and metres per second squared, respectively.
Existing state-space definitions often use fixed coding, i.e. the number of surrounding traffic participants considered is fixed. In an actual urban traffic scenario, however, the number and positions of traffic participants around a commercial vehicle change constantly, and side collisions caused by sudden obstacles and blind zones need special consideration. Although fixed coding yields an effective state representation, the number of traffic participants considered is limited (only the minimum information needed to represent a scene is used), so the influence of all surrounding traffic participants on the driving safety of a commercial vehicle cannot be described accurately and comprehensively.
Finally, for a more intuitive description of the relative position relationship between the ego vehicle and the surrounding traffic participants, the invention rasterizes the road area, divides it into a number of a × b grid cells, and abstracts the road area and vehicle targets into a grid map, namely the "presence" grid map S_3(t) describing the relative position relationship, where a denotes the length of a grid cell and b its width.
The "presence" grid map contains four attributes: the grid coordinates, whether a vehicle is present, the category of that vehicle, and the distance to the left and right lane lines. A cell with no traffic participant is set to '0' and a cell containing a traffic participant is set to '1'; the positions of occupied cells relative to the cell containing the ego vehicle describe the relative distance between the two vehicles.
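A minimal sketch of the "presence" grid map follows. It encodes only the 0/1 occupancy attribute from relative positions; the vehicle category and lane-line-distance attributes described above are omitted, and the function name, cell sizes, and grid resolution are illustrative assumptions:

```python
def build_presence_grid(ego_xy, participants, a=2.0, b=2.0, rows=10, cols=10):
    """Rasterize the road area into rows x cols cells of size a x b metres.
    Cells containing a traffic participant are marked 1 ("presence"),
    empty cells stay 0. The ego vehicle sits at the grid centre; all
    positions are (x lateral, y longitudinal) in metres."""
    grid = [[0] * cols for _ in range(rows)]
    cr, cc = rows // 2, cols // 2  # ego vehicle cell
    for px, py in participants:
        r = cr + round((py - ego_xy[1]) / b)
        c = cc + round((px - ego_xy[0]) / a)
        if 0 <= r < rows and 0 <= c < cols:
            grid[r][c] = 1
    return grid

# Two participants: one ahead and to the right, one behind and to the left.
grid = build_presence_grid((0.0, 0.0), [(2.0, 4.0), (-4.0, -2.0)])
```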
(2) Defining an action space
Defining an action space with lateral and longitudinal driving actions:
A_t = [a_left, a_straight, a_right, a_accel, a_cons, a_decel]  (5)

where A_t denotes the action space at time t; a_left, a_straight, a_right denote turning left, going straight, and turning right; and a_accel, a_cons, a_decel denote accelerating, holding a constant speed, and decelerating.
(3) Defining a reward function
R_t = r_1 + r_2 + r_3  (6)

where R_t denotes the reward function at time t, and r_1, r_2, r_3 denote the forward, backward, and lateral collision-avoidance reward functions, obtained from Equations (7), (8), and (9).

In those equations, TTC denotes the time to collision between the ego vehicle and the obstacle ahead, obtained by dividing their distance by their relative velocity, and TTC_thr denotes the forward time-to-collision threshold; RTTC denotes the rearward time to collision and RTTC_thr the rearward time-to-collision threshold, both in seconds; x_lat denotes the lateral distance between the ego vehicle and the traffic participants on either side, and x_min denotes the minimum lateral safety distance, both in metres; β_1, β_2, β_3 denote the weight coefficients of the forward, backward, and lateral collision-avoidance reward functions.
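Since the piecewise forms of Equations (7)-(9) are not reproduced in the text above, the sketch below assumes a simple form in which each term contributes a weighted penalty only when its safety threshold is violated; the function name and default values are illustrative, not from the patent:

```python
def safety_reward(ttc, rttc, x_lat, ttc_thr=1.5, rttc_thr=1.5, x_min=1.0,
                  beta1=1.0, beta2=0.5, beta3=1.0):
    """Sketch of Equation (6), R_t = r1 + r2 + r3. Each term penalises
    the ego vehicle when the corresponding safety threshold is violated
    (an assumed form: the patent's Equations (7)-(9) are not given)."""
    r1 = -beta1 if ttc < ttc_thr else 0.0    # forward collision avoidance
    r2 = -beta2 if rttc < rttc_thr else 0.0  # backward collision avoidance
    r3 = -beta3 if x_lat < x_min else 0.0    # lateral collision avoidance
    return r1 + r2 + r3
```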
Substep 2: construction of deep dual-Q network-based decision sub-network
In consideration of the fact that a Deep Double-Q Network (DDQN) uses an experience multiplexing pool, the utilization efficiency of data is improved, parameter oscillation or divergence can be avoided, and negative learning effects caused by over-estimation in a Q learning Network can be reduced. Therefore, the safe driving strategy under the marginal scene is learned by utilizing the deep double-Q network.
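The over-estimation reduction comes from decoupling action selection (online network) from action evaluation (target network), which can be sketched as follows; names are illustrative, not the patent's implementation:

```python
def ddqn_target(reward, q_next_online, q_next_target, gamma=0.99, done=False):
    """Double-DQN update target: the online network picks the best next
    action, while the periodically synchronised target network evaluates
    it. Decoupling selection from evaluation is what reduces the
    over-estimation of plain Q-learning (illustrative sketch)."""
    if done:
        return reward
    best = max(range(len(q_next_online)), key=lambda i: q_next_online[i])
    return reward + gamma * q_next_target[best]
```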
Unlike processing a state space of fixed dimension, processing feature information covering all surrounding traffic participants requires a stronger feature-extraction capability. Considering that an attention mechanism can capture richer feature information (the dependency relationships between the ego vehicle and the surrounding traffic participants), the invention designs a policy network based on a multi-head attention mechanism. In addition, since the driving decision depends only on the motion states of the ego vehicle and the surrounding traffic participants, and not on the order of the traffic participants in the state space, the invention uses a positional encoding method (Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.) to build permutation invariance into the decision sub-network.
The attention layer can be expressed as:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O  (10)

where MultiHead(Q, K, V) denotes the multi-head attention value; Q denotes the query vector and K the key vector, both of dimension d_k; V denotes the value vector of dimension d_v; W_O denotes a parameter matrix to be learned; and head_i denotes the i-th head in multi-head attention (h = 2 in the invention), computed as

head_i = Attention(Q W_i_Q, K W_i_K, V W_i_V), with Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where Attention(Q, K, V) denotes the output attention matrix, and W_i_Q, W_i_K, W_i_V denote the parameter matrices to be learned.
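The attention computation above can be sketched in plain NumPy; the shapes, random test data, and variable names are illustrative assumptions (h = 2 heads, as in the patent):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, W_q, W_k, W_v, W_o, h=2):
    """Equation (10): Concat(head_1..head_h) W_O, where each head applies
    attention to its own learned projections of Q, K, V."""
    heads = [attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy dimensions: n tokens of width d, projected to d_k per head.
rng = np.random.default_rng(0)
d, d_k, h, n = 8, 4, 2, 5
Q = K = V = rng.standard_normal((n, d))
W_q = [rng.standard_normal((d, d_k)) for _ in range(h)]
W_k = [rng.standard_normal((d, d_k)) for _ in range(h)]
W_v = [rng.standard_normal((d, d_k)) for _ in range(h)]
W_o = rng.standard_normal((h * d_k, d))
out = multi_head(Q, K, V, W_q, W_k, W_v, W_o)
```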
The decision sub-network based on the deep double-Q network is constructed as follows.
First, the state space S_t is fed to encoder 1, encoder 2, and encoder 3. Encoder 1 consists of two fully connected layers and outputs the encoding of the ego vehicle's motion state. Encoder 2 has the same structure as encoder 1 and outputs the relative-motion-state encoding. Encoder 3 consists of two convolutional layers and outputs the occupancy-grid-map encoding.
Each fully connected layer has 64 neurons with the Tanh activation function. The convolutional layers all use 3 × 3 kernels with a stride of 2.
Second, the dependency relationships between the ego vehicle and the surrounding traffic participants are analysed with a multi-head attention mechanism, so that the decision sub-network can attend to traffic participants that suddenly approach the ego vehicle or conflict with its driving path, and so that variable input sizes and permutation invariance are built into the decision sub-network. The outputs of encoder 1, encoder 2, and encoder 3 are all connected to the multi-head attention module, which outputs an attention matrix. The attention matrix is then connected to decoder 1, which consists of one fully connected layer.
This fully connected layer has 64 neurons with the Sigmoid activation function.
Substep 3: building a decision sub-network based on generation of confrontation mimic learning
In a complex urban traffic environment that is open and subject to multi-traffic-target interference, it is difficult to construct an accurate and comprehensive reward function; in particular, the influence of uncertainty (e.g. sudden obstacles and traffic participants in blind zones) on driving safety is hard to describe quantitatively. To reduce the influence of uncertainty in the traffic environment and driving conditions on safe driving decisions and to improve the effectiveness and reliability of the decisions, the invention uses a generative adversarial imitation learning sub-network to learn the driving strategy contained in the driving behavior data set and generalize it, thereby imitating safe driving behaviors under different driving conditions and operating regimes. The generative adversarial imitation learning sub-network consists of a generator and a discriminator, each constructed as a deep neural network, described as follows:
(1) Construct the generator
A generator network is constructed. The input of the generator is the state space, and the output is the probability f = π(·|s; θ) of each action in the action space, where θ denotes the parameters of the generator network. First, the state space is passed through fully connected layers FC_1 and FC_2 in sequence to obtain feature F_1, and through fully connected layers FC_3 and FC_4 to obtain feature F_2. Meanwhile, the state space is passed through convolutional layers C_1 and C_2 in sequence to obtain feature F_3. Features F_1, F_2, and F_3 are then combined through a merging layer, a fully connected layer FC_5, and a Softmax activation function to obtain the output f = π(·|s; θ).

Fully connected layers FC_1, FC_2, FC_3, FC_4, and FC_5 each have 64 neurons; convolutional layer C_1 has a 3 × 3 kernel with a stride of 2, and convolutional layer C_2 has a 3 × 3 kernel with a stride of 1.
(2) Constructing the discriminator

A discriminator network is constructed. The input of the discriminator is the state space, and the output is a vector D_ω(s, a) of dimension 6, where ω denotes the parameters of the discriminator network. First, the state space is passed through fully connected layers FC_6 and FC_7 in sequence to obtain feature F_4, and through fully connected layers FC_8 and FC_9 to obtain feature F_5. In parallel, the state space is passed through convolutional layers C_3 and C_4 in sequence to obtain feature F_6. Features F_4, F_5 and F_6 are then fed through a merging layer, fully connected layer FC_10 and a Sigmoid activation function to obtain the output D_ω(s, a).

Fully connected layers FC_6, FC_7, FC_8, FC_9 and FC_10 each contain 64 neurons; convolutional layer C_3 has a 3 × 3 kernel with stride 2, and convolutional layer C_4 has a 3 × 3 kernel with stride 1.
Step three: training the safe driving decision model for the commercial vehicle

First, the decision sub-network based on generative adversarial imitation learning is trained. The goal is to learn a generator network such that the discriminator cannot distinguish driving actions produced by the generator from actions in the driving behavior data set. The training comprises the following substeps:
substep 1: in a driving behavior data setIn (1), initializing the generator network parameter θ 0 Sum discriminator network parameter omega 0 ;
And substep 2: l iterative solutions are performed, each iteration comprising sub-steps 2.1 to 2.2, in particular:
Substep 2.1: update the discriminator parameters ω_i → ω_{i+1} by ascending the gradient described by equation (13):

$$\nabla_{\omega}\Big(\hat{\mathbb{E}}_{\tau_i}\big[\log D_{\omega}(s,a)\big] + \hat{\mathbb{E}}_{\tau_E}\big[\log\big(1 - D_{\omega}(s,a)\big)\big]\Big) \quad (13)$$

where ∇_ω denotes the gradient of the neural network loss function with parameters ω, τ_i denotes trajectories sampled from the current generator policy, and τ_E denotes trajectories from the driving behavior data set;
Substep 2.2: set the reward function r̂(s, a) = −log D_{ω_{i+1}}(s, a) and update the generator parameters θ_i → θ_{i+1} using the trust region policy optimization algorithm.
Then, on the basis of this training result, training continues to construct the DDQN-based decision sub-network, specifically comprising the following substeps:
substep 3: initializing the capacity of an experience multiplexing pool D to be N;
Substep 4: initialize the Q value corresponding to each action to a random value;
substep 5: performing M iterative solutions, each iteration comprising substeps 5.1 to 5.2, in particular:
Substep 5.1: initialize the state s_0 and the preprocessed state φ_0 = φ(s_0);
Substep 5.2: performing T iterative solutions, each iteration comprising sub-steps 5.21 to 5.27, in particular:
Substep 5.21: with probability ε, randomly select a driving action;

Substep 5.22: otherwise, select a_t = argmax_a Q*(φ(s_t), a; θ);

where Q*(·) denotes the optimal action-value function and a_t denotes the action at time t;
substep 5.23: performing action a t Obtaining the reward value r at time t t And state s at time t +1 t+1 ;
Substep 5.24: store samples (phi) in an empirical multiplexing pool D t ,a t ,r t ,φ t+1 );
Substep 5.25: randomly draw a small number of samples (phi) from the empirical multiplexing pool D j ,a j ,r j ,φ j+1 );
Substep 5.26: the iteration target is calculated using the formula:
in the formula (I), the compound is shown in the specification,representing the weight of the target network at time t; γ represents a discount factor; argmax (·) denotes a variable that causes the objective function to have a maximum value, y i Representing an iteration target at time i, and p (s, a) representing a motion distribution;
Substep 5.27: perform gradient descent on (y_j − Q(φ_j, a_j; θ))² using the following equation:

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}\big[\big(y_j - Q(\phi_j, a_j; \theta_i)\big)\, \nabla_{\theta_i} Q(\phi_j, a_j; \theta_i)\big] \quad (15)$$

where ∇_{θ_i} denotes the gradient of the neural network loss function with respect to the parameters θ_i; ε denotes the probability of randomly selecting an action under the ε-greedy exploration strategy; θ_i denotes the parameters at iteration i; L_i(θ_i) denotes the loss function at iteration i; Q(s, a; θ_i) denotes the action-value function of the target network; and a′ ranges over all possible actions in state s′.
After the training of the safe driving decision model for the commercial vehicle is completed, the state space information acquired by the sensors is input into the model, which outputs high-level driving decisions such as steering, going straight, and accelerating or decelerating in real time, effectively guaranteeing the driving safety of the commercial vehicle in the urban low-speed environment.
Beneficial effects: compared with general driving decision methods, the method provided by the invention is more effective and reliable, which is embodied as follows:

(1) The method can imitate the safe driving behavior of human drivers, provides a more reasonable and safer driving strategy for commercial vehicles in the urban low-speed environment, realizes safe driving decisions for large commercial vehicles at a high human-like level, and can effectively guarantee driving safety.

(2) The method comprehensively considers the influence of factors such as visual blind areas, sudden obstacles and different driving conditions on driving safety, and performs strategy learning and training in both normal driving scenarios and edge scenarios, further improving the effectiveness and reliability of driving decisions.

(3) The method introduces a multi-head attention mechanism, considers the dynamic interaction between the ego vehicle and all surrounding traffic participants, and can handle variable-length inputs (the number of surrounding traffic participants changes dynamically) in safe driving decisions.

(4) The method does not require complex vehicle dynamics equations or vehicle body parameters; the calculation is simple and clear, the safe driving decision strategy of a large commercial vehicle can be output in real time, and the sensors used are low-cost and convenient for large-scale deployment.
Drawings
FIG. 1 is a technical roadmap for the present invention;
FIG. 2 is a schematic diagram of a policy network structure based on a multi-head attention mechanism designed by the present invention;
FIG. 3 is a schematic diagram of a generator network structure designed by the present invention;
fig. 4 is a schematic diagram of a network structure of the discriminator designed by the invention.
Detailed Description
The technical scheme of the invention is further explained below with reference to the drawings and embodiments.
The invention provides a decision-making method for the safe driving of large commercial vehicles at a high human-like level in an open urban traffic environment disturbed by many traffic targets. First, the safe driving behaviors of human drivers in the urban traffic environment are collected, and a safe driving behavior data set is constructed. Second, a multi-head-attention-based decision model for the safe driving of commercial vehicles is constructed. The model contains two sub-networks: a deep double-Q network and a generative adversarial imitation learning network. The deep double-Q network learns safe driving strategies in edge scenarios such as dangerous and conflict scenarios in an unsupervised manner; the generative adversarial imitation learning sub-network imitates safe driving behaviors under different driving conditions. Finally, the safe driving decision model is trained to obtain driving strategies under different driving conditions, realizing high-level decision output of safe driving behaviors for commercial vehicles. The proposed method can imitate the safe driving behavior of human drivers, considers the influence of factors such as visual blind areas and sudden obstacles on driving safety, provides a more reasonable and safer driving strategy for large commercial vehicles, and realizes safe driving decisions for commercial vehicles in the urban traffic environment. The technical route of the invention is shown in figure 1, and the specific steps are as follows:
the method comprises the following steps: collecting safe driving behaviors of human drivers in urban traffic environment
In order to realize a driving decision comparable to that of a human driver, safe driving behaviors under different driving conditions and driving conditions are collected in a mode of actual road testing and driving simulation, and a data set representing the safe driving behaviors of the human driver is constructed. The method specifically comprises the following 5 sub-steps:
substep 1: and constructing a multi-dimensional target information synchronous acquisition system by using a millimeter wave radar, a 128-line laser radar, a vision sensor, a Beidou sensor and an inertial sensor.
Substep 2: under the real urban environment, a plurality of drivers drive operating vehicles carrying the multi-dimensional target information synchronous acquisition system in sequence.
Substep 3: the method comprises the steps of collecting and processing relevant data of various driving behaviors such as lane change, lane keeping, car following, acceleration and deceleration and the like of a driver, and obtaining multi-source heterogeneous description data of the driving behaviors, such as the distances of obstacles in different directions measured by a radar or a vision sensor, the positions, the speeds, the accelerations, the yaw velocities and the like measured by a Beidou sensor and an inertial sensor, the steering wheel turning angles measured by a vehicle-mounted sensor and the like.
Substep 4: in order to simulate safe driving behaviors under marginal scenes such as dangerous scenes, conflict scenes and the like, a virtual city scene based on hardware-in-the-loop simulation is built, and the built city traffic scenes comprise the following three types:
(1) During the driving process of the vehicle, a transversely approaching traffic participant (namely, a sudden obstacle) appears in front of the vehicle;
(2) In the steering process of the vehicle, static traffic participants exist in the vision blind area of the vehicle;
(3) During the turning process of the vehicle, moving traffic participants exist in the vision blind area of the vehicle.
In the above traffic scenarios, there are a variety of road network structures (straight roads, curves, and intersections) and a variety of traffic participants (commercial vehicles, passenger vehicles, non-motor vehicles, and pedestrians).
A plurality of drivers drive the operating vehicles in the virtual scene through real controllers (a steering wheel, an accelerator and a brake pedal), and information such as the transverse and longitudinal position, the transverse and longitudinal speed, the transverse and longitudinal acceleration, the relative distance and the relative speed with surrounding traffic participants and the like of the vehicles is collected.
Substep 5: based on the data collected in the real urban environment and the driving simulation environment, a driving behavior data set for safe driving decision learning is constructed, which can be expressed as:

$$X = \{(s_1, a_1), (s_2, a_2), \ldots, (s_n, a_n)\} \quad (1)$$

where X denotes the set of "state-action" pairs, i.e., the constructed data set representing the safe driving behavior of human drivers; (s_j, a_j) denotes the "state-action" pair at time j, where s_j denotes the state at time j and a_j denotes the action at time j, i.e., the action taken by the human driver based on state s_j; and n denotes the number of "state-action" pairs in the data set.
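The "state-action" data set described above can be sketched as a simple container; the class and function names below are illustrative stand-ins, not part of the invention:

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical container for the "state-action" pairs of equation (1).
@dataclass
class DrivingSample:
    state: Tuple[float, ...]   # s_j: ego motion + relative motion + grid features
    action: int                # a_j: index into the 6-element action space

def build_dataset(records: List[Tuple[Tuple[float, ...], int]]) -> List[DrivingSample]:
    """Assemble the safe-driving-behavior data set X = {(s_j, a_j)}."""
    return [DrivingSample(state=s, action=a) for s, a in records]

samples = build_dataset([((0.0, 1.2, 5.0), 1), ((0.5, 1.0, 4.8), 3)])
n = len(samples)  # n: number of "state-action" pairs
```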
Step two: construction of multi-head attention-based decision model for safe driving of commercial vehicle
In order to realize safe driving decisions for large commercial vehicles in the urban low-speed environment, the invention comprehensively considers the influence of factors such as visual blind areas, sudden obstacles and driving conditions on driving safety, and establishes a safe driving decision model for commercial vehicles. Since deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning and explores the traffic environment in an unsupervised manner, it is used to learn safe driving strategies in edge scenarios such as dangerous and conflict scenarios. In addition, since imitation learning has the ability to imitate expert behavior, it is used to imitate the safe driving behavior of human drivers under different driving conditions. Therefore, the constructed safe driving decision model consists of two parts, described in detail as follows:
substep 1: defining basic parameters of a safe driving decision model
First, the safe driving decision problem in the urban low-speed environment is converted into a finite Markov decision process. Second, the basic parameters of the safe driving decision model are defined.
(1) Defining a state space
In order to describe the motion state of the own vehicle and the nearby traffic participants, the invention constructs a state space by using time series data and an occupancy grid map. The specific description is as follows:
S_t = [S_1(t), S_2(t), S_3(t)]  (2)
where S_t denotes the state space at time t; S_1(t) and S_2(t) denote the state spaces associated with time-series data at time t; and S_3(t) denotes the state space associated with the occupancy grid map at time t.
Firstly, the motion state of the self-vehicle is described by using continuous position, speed, acceleration and course angle information:
S_1(t) = [p_x, p_y, v_x, v_y, a_x, a_y, θ_s]  (3)

where p_x and p_y denote the lateral and longitudinal positions of the ego vehicle (m); v_x and v_y denote the lateral and longitudinal velocities (m/s); a_x and a_y denote the lateral and longitudinal accelerations (m/s²); and θ_s denotes the heading angle of the ego vehicle (deg).
Second, the motion states of the surrounding traffic participants are described by the relative motion state information of the ego vehicle and the surrounding traffic participants:

$$S_2(t) = [\Delta d_i, \Delta v_i, a_i],\quad i = 1, 2, \ldots \quad (4)$$

where Δd_i, Δv_i and a_i denote the relative distance (m), relative velocity (m/s) and acceleration (m/s²) of the i-th traffic participant with respect to the ego vehicle.
Existing state-space definitions often use fixed encoding, i.e., the number of surrounding traffic participants considered is fixed. In an actual urban traffic scenario, however, the number and positions of traffic participants around a commercial vehicle change constantly, and side collisions caused by sudden obstacles and visual blind areas require special consideration. Although fixed encoding yields a valid state representation, the number of traffic participants considered is limited (only the minimum amount of information needed to represent a scene is used), so the influence of all surrounding traffic participants on the driving safety of the commercial vehicle cannot be described accurately and comprehensively.
Finally, in order to describe the relative positions of the ego vehicle and the surrounding traffic participants more intuitively and to improve the reliability and effectiveness of decisions, the invention rasterizes the road area into a number of a × b grid cells, abstracting the road area and vehicle targets into a grid map, namely the "presence" grid map S_3(t) used to describe relative position relationships, where a denotes the length of a grid cell and b denotes its width.
The "presence" grid map contains four attributes including grid coordinates, whether a vehicle is present, the category of the corresponding vehicle, and the distance to the left and right lane lines. The grid where no traffic participant exists is set as '0', the grid where the traffic participant exists is set as '1', and the position distribution of the grid and the grid where the own vehicle is located is used for describing the relative distance between the two vehicles.
(2) Defining an action space
Defining an action space with lateral and longitudinal driving actions:
A_t = [a_left, a_straight, a_right, a_accel, a_cons, a_decel]  (5)

where A_t denotes the action space at time t; a_left, a_straight and a_right denote turning left, going straight and turning right, respectively; and a_accel, a_cons and a_decel denote accelerating, maintaining speed and decelerating, respectively.
(3) Defining a reward function
R t =r 1 +r 2 +r 3 (6)
where R_t denotes the reward function at time t, and r_1, r_2 and r_3 denote the forward, rearward and lateral collision avoidance reward functions, obtained from equations (7), (8) and (9), respectively.

Here TTC denotes the time to collision with the forward obstacle, obtained by dividing the distance between the ego vehicle and the forward obstacle by their relative velocity; TTC_thr denotes the forward time-to-collision threshold; RTTC denotes the rearward time to collision and RTTC_thr its threshold (s); x_lat denotes the distance from the ego vehicle to the traffic participants on either side and x_min denotes the minimum lateral safety distance (m); and β_1, β_2 and β_3 denote the weight coefficients of the forward, rearward and lateral collision avoidance reward functions.
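The TTC computation and the composite reward can be sketched as follows. The exact piecewise forms of equations (7)-(9) are not reproduced in this text, so each term below is an assumed linear penalty for falling under its safety threshold; all thresholds and weights are illustrative:

```python
def time_to_collision(gap, closing_speed):
    """TTC: distance to the forward obstacle divided by the relative
    (closing) speed; infinite when the gap is not closing."""
    return gap / closing_speed if closing_speed > 0.0 else float("inf")

def safety_reward(ttc, rttc, x_lat,
                  ttc_thr=4.0, rttc_thr=3.0, x_min=1.0,
                  beta1=1.0, beta2=0.5, beta3=0.8):
    """Composite reward R_t = r1 + r2 + r3 from equation (6).

    Assumption: each of r1, r2, r3 penalizes, linearly, how far the
    measured quantity falls below its safety threshold, weighted by
    beta1, beta2, beta3; the patent's actual equations (7)-(9) may differ.
    """
    r1 = -beta1 * (1.0 - ttc / ttc_thr) if ttc < ttc_thr else 0.0      # forward
    r2 = -beta2 * (1.0 - rttc / rttc_thr) if rttc < rttc_thr else 0.0  # rearward
    r3 = -beta3 * (1.0 - x_lat / x_min) if x_lat < x_min else 0.0      # lateral
    return r1 + r2 + r3

safe = safety_reward(ttc=time_to_collision(50.0, 2.0), rttc=10.0, x_lat=2.0)
risky = safety_reward(ttc=time_to_collision(4.0, 2.0), rttc=10.0, x_lat=2.0)
```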
And substep 2: constructing DDQN-based decision sub-networks
In consideration of the fact that a Deep Double-Q Network (DDQN) uses an experience multiplexing pool, the utilization efficiency of data is improved, parameter oscillation or divergence can be avoided, and negative learning effects caused by over-estimation in a Q learning Network can be reduced. Therefore, the safe driving strategy under the marginal scene is learned by utilizing the deep double-Q network.
Unlike processing a state space with fixed dimensions, processing feature information covering all surrounding traffic participants requires stronger feature extraction capability. Since the attention mechanism can capture richer feature information (the dependency relationships between the ego vehicle and surrounding traffic participants), the invention designs a policy network based on the multi-head attention mechanism. In addition, since the driving decision depends only on the motion states of the ego vehicle and the surrounding traffic participants, and not on the order of the traffic participants in the state space, the invention uses the attention-based encoding of (Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.) to build permutation invariance into the decision sub-network.
The attention layer may be expressed as:
MultiHead(Q,K,V)=Concat(head 1 ,...,head h )W O (10)
where MultiHead(Q, K, V) denotes the multi-head attention value; Q denotes the query vector; K denotes the key vector, of dimension d_k; V denotes the value vector, of dimension d_v; W^O denotes a parameter matrix to be learned; and head_i denotes the i-th head in the multi-head attention (h = 2 in the present invention), computed as:

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \quad (11)$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V \quad (12)$$

where Attention(Q, K, V) denotes the output attention matrix, and W_i^Q, W_i^K and W_i^V denote parameter matrices to be learned.
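The multi-head attention computation described above, with h = 2 heads and scaled dot-product attention per head, can be sketched in NumPy; all dimensions here are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    """Multi-head attention: project per head, attend, concatenate, mix by W_O."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
n, d = 5, 8            # n traffic-participant tokens, model width d (illustrative)
h, d_k = 2, 4          # h = 2 heads, as in the patent
Q = K = V = rng.normal(size=(n, d))
W_Q = [rng.normal(size=(d, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d))
out = multi_head(Q, K, V, W_Q, W_K, W_V, W_O)  # one output row per participant
```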
A decision sub-network based on a deep double-Q network is constructed, as shown in fig. 2, and described in detail below.
First, the state space S_t is fed to encoder 1, encoder 2 and encoder 3. Encoder 1 consists of two fully connected layers and outputs the motion-state encoding of the ego vehicle. Encoder 2 has the same structure as encoder 1 and outputs the relative-motion-state encoding. Encoder 3 consists of two convolutional layers and outputs the grid-map encoding.
The fully connected layers each contain 64 neurons, and the activation functions are all Tanh. The convolution kernels of the convolutional layers are all 3 × 3 with stride 2.
Second, the dependency relationships between the ego vehicle and the surrounding traffic participants are analyzed using the multi-head attention mechanism, so that the decision sub-network attends to traffic participants that suddenly approach the ego vehicle or conflict with its driving path, while support for different input sizes and permutation invariance is built into the sub-network. The outputs of encoder 1, encoder 2 and encoder 3 are all connected to the multi-head attention module, which outputs the attention matrix. The attention matrix is then connected to decoder 1, which consists of one fully connected layer.
The fully connected layer contains 64 neurons and uses a Sigmoid activation function.
Substep 3: building a decision sub-network based on generative adversarial imitation learning
In an open, complex urban traffic environment disturbed by many traffic targets, an accurate and comprehensive reward function is difficult to construct; in particular, the influence of various uncertainties (such as sudden obstacles or traffic participants in visual blind areas) on driving safety is difficult to describe quantitatively. In order to reduce the influence of the uncertainty of the traffic environment and driving conditions on safe driving decisions and to improve their effectiveness and reliability, the invention uses a generative adversarial imitation learning sub-network to learn and generalize the driving strategies contained in the driving behavior data set, and thereby imitate safe driving behaviors under different driving conditions. The generative adversarial imitation learning sub-network consists of a generator and a discriminator, each constructed as a deep neural network. The specific description is as follows:
(1) Constructing the generator

A generator network as shown in figure 3 is constructed. The input of the generator is the state space and the output is the probability f = π(·|s; θ) of each action in the action space, where θ denotes the parameters of the generator network. First, the state space is passed through fully connected layers FC_1 and FC_2 in sequence to obtain feature F_1, and through fully connected layers FC_3 and FC_4 to obtain feature F_2. In parallel, the state space is passed through convolutional layers C_1 and C_2 in sequence to obtain feature F_3. Features F_1, F_2 and F_3 are then fed through a merging layer, fully connected layer FC_5 and a Softmax activation function to obtain the output f = π(·|s; θ).

Fully connected layers FC_1, FC_2, FC_3, FC_4 and FC_5 each contain 64 neurons; convolutional layer C_1 has a 3 × 3 kernel with stride 2, and convolutional layer C_2 has a 3 × 3 kernel with stride 1.
(2) Constructing the discriminator

A discriminator network as shown in figure 4 is constructed. The input of the discriminator is the state space, and the output is a vector D_ω(s, a) of dimension 6, where ω denotes the parameters of the discriminator network. First, the state space is passed through fully connected layers FC_6 and FC_7 in sequence to obtain feature F_4, and through fully connected layers FC_8 and FC_9 to obtain feature F_5. In parallel, the state space is passed through convolutional layers C_3 and C_4 in sequence to obtain feature F_6. Features F_4, F_5 and F_6 are then fed through a merging layer, fully connected layer FC_10 and a Sigmoid activation function to obtain the output D_ω(s, a).

Fully connected layers FC_6, FC_7, FC_8, FC_9 and FC_10 each contain 64 neurons; convolutional layer C_3 has a 3 × 3 kernel with stride 2, and convolutional layer C_4 has a 3 × 3 kernel with stride 1.
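The generator's forward pass described above can be sketched as follows. All layer sizes and the input dimensions are illustrative, and for brevity the convolutional branch C_1-C_2 over the grid is replaced by a flatten-plus-fully-connected stand-in:

```python
import numpy as np

def fc(x, w, b, act):
    """One fully connected layer with activation."""
    return act(x @ w + b)

relu = lambda x: np.maximum(x, 0.0)
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def generator_forward(state_vec, grid, params):
    """Generator sketch: two FC branches (FC1-FC2, FC3-FC4) over the state
    vector, a grid branch (stand-in for conv layers C1-C2), then a merging
    layer, FC5, and softmax over the 6 actions, yielding pi(.|s; theta)."""
    f1 = fc(fc(state_vec, *params['fc1'], relu), *params['fc2'], relu)
    f2 = fc(fc(state_vec, *params['fc3'], relu), *params['fc4'], relu)
    f3 = fc(grid.ravel(), *params['conv_as_fc'], relu)    # stand-in for C1-C2
    merged = np.concatenate([f1, f2, f3])                 # merging layer
    logits = merged @ params['fc5'][0] + params['fc5'][1]
    return softmax(logits)

rng = np.random.default_rng(1)
def layer(n_in, n_out):
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

params = {'fc1': layer(13, 64), 'fc2': layer(64, 64),
          'fc3': layer(13, 64), 'fc4': layer(64, 64),
          'conv_as_fc': layer(100, 64),
          'fc5': layer(192, 6)}
probs = generator_forward(rng.normal(size=13), rng.normal(size=(10, 10)), params)
```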
Step three: training the safe driving decision model for the commercial vehicle

First, the decision sub-network based on generative adversarial imitation learning is trained. The goal is to learn a generator network such that the discriminator cannot distinguish driving actions produced by the generator from actions in the driving behavior data set. The training comprises the following substeps:
substep 1: in a driving behavior data setIn (1), initializing the generator network parameter θ 0 Sum arbiter network parameter ω 0 ;
Substep 2: l iterative solutions are performed, each iteration comprising sub-steps 2.1 to 2.2, in particular:
Substep 2.1: update the discriminator parameters ω_i → ω_{i+1} by ascending the gradient described by equation (13):

$$\nabla_{\omega}\Big(\hat{\mathbb{E}}_{\tau_i}\big[\log D_{\omega}(s,a)\big] + \hat{\mathbb{E}}_{\tau_E}\big[\log\big(1 - D_{\omega}(s,a)\big)\big]\Big) \quad (13)$$

where ∇_ω denotes the gradient of the neural network loss function with parameters ω, τ_i denotes trajectories sampled from the current generator policy, and τ_E denotes trajectories from the driving behavior data set;
Substep 2.2: set the reward function r̂(s, a) = −log D_{ω_{i+1}}(s, a) and update the generator parameters θ_i → θ_{i+1} using the trust region policy optimization algorithm.
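A minimal sketch of substep 2.1's discriminator update and substep 2.2's surrogate reward follows, assuming a simple logistic discriminator over joint state-action features in place of the patent's deep discriminator network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_step(w, policy_sa, expert_sa, lr=0.1):
    """One ascent step for a logistic discriminator D_w(x) = sigmoid(x . w),
    playing the role of the equation-(13) update: increase
    E_policy[log D_w] + E_expert[log(1 - D_w)], so that D_w -> 1 on
    generator samples and -> 0 on expert samples."""
    d_pol = sigmoid(policy_sa @ w)
    d_exp = sigmoid(expert_sa @ w)
    grad = policy_sa.T @ (1.0 - d_pol) / len(policy_sa) \
         - expert_sa.T @ d_exp / len(expert_sa)
    return w + lr * grad

def gail_reward(w, sa):
    """Surrogate reward for the policy update (substep 2.2):
    r_hat(s, a) = -log D_w(s, a), large when the discriminator mistakes a
    policy sample for an expert one."""
    return -np.log(sigmoid(sa @ w) + 1e-8)

# illustrative, separable feature clouds for expert and policy samples
rng = np.random.default_rng(2)
expert_sa = rng.normal(loc=-1.0, size=(64, 4))
policy_sa = rng.normal(loc=+1.0, size=(64, 4))
w = np.zeros(4)
for _ in range(200):
    w = discriminator_step(w, policy_sa, expert_sa)
# the generator would now be updated (TRPO in the patent) to maximize gail_reward
```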
Then, on the basis of this training result, training continues to construct the DDQN-based decision sub-network, specifically comprising the following substeps:
substep 3: initializing the capacity of an experience multiplexing pool D to be N;
Substep 4: initialize the Q value corresponding to each action to a random value;
substep 5: performing M iterative solutions, each iteration comprising substeps 5.1 to 5.2, in particular:
Substep 5.1: initialize the state s_0 and the preprocessed state φ_0 = φ(s_0);
Substep 5.2: performing T iterative solutions, each iteration comprising sub-steps 5.21 to 5.27, in particular:
Substep 5.21: with probability ε, randomly select a driving action;

Substep 5.22: otherwise, select a_t = argmax_a Q*(φ(s_t), a; θ);

where Q*(·) denotes the optimal action-value function and a_t denotes the action at time t;
substep 5.23: performing action a t Obtaining the reward value r at time t t And state s at time t +1 t+1 ;
Substep 5.24: store samples (phi) in the empirical multiplexing pool D t ,a t ,r t ,φ t+1 );
Substep 5.25: randomly draw a small number of samples (phi) from the empirical multiplexing pool D j ,a j ,r j ,φ j+1 );
Substep 5.26: the iteration target is calculated using the following equation:
in the formula (I), the compound is shown in the specification,representing the weight of the target network at the time t; gamma represents a discount factor; argmax (·) denotes a variable that maximizes the objective function, y i Represents the iteration target at time i, and p (s, a) represents the motion distribution;
Substep 5.27: perform gradient descent on (y_j − Q(φ_j, a_j; θ))² using the following equation:

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}\big[\big(y_j - Q(\phi_j, a_j; \theta_i)\big)\, \nabla_{\theta_i} Q(\phi_j, a_j; \theta_i)\big] \quad (15)$$

where ∇_{θ_i} denotes the gradient of the neural network loss function with respect to the parameters θ_i; ε denotes the probability of randomly selecting an action under the ε-greedy exploration strategy; θ_i denotes the parameters at iteration i; L_i(θ_i) denotes the loss function at iteration i; Q(s, a; θ_i) denotes the action-value function of the target network; and a′ ranges over all possible actions in state s′.
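The DDQN training loop of substeps 3-5 can be sketched on a toy problem. The deep network is replaced by a Q table, and the environment is a stand-in in which one action is always rewarded; the double-Q target structure (online table selects, target table evaluates) is the point of the sketch:

```python
import random
from collections import deque
import numpy as np

n_states, n_actions = 4, 6          # 6 actions, as in the action space A_t
gamma, alpha, eps = 0.9, 0.5, 0.1   # discount factor, step size, epsilon
rng = random.Random(0)

Q_online = np.zeros((n_states, n_actions))
Q_target = np.zeros((n_states, n_actions))
replay = deque(maxlen=1000)         # experience reuse pool D (substep 3)

def select_action(s):
    """Epsilon-greedy exploration (substeps 5.21 and 5.22)."""
    if rng.random() < eps:
        return rng.randrange(n_actions)
    return int(np.argmax(Q_online[s]))

def ddqn_update(batch):
    """Double-Q target: the online table picks argmax a', the target table
    evaluates it -- the mechanism that reduces over-estimation in DDQN."""
    for s, a, r, s2 in batch:
        a_star = int(np.argmax(Q_online[s2]))
        y = r + gamma * Q_target[s2, a_star]
        Q_online[s, a] += alpha * (y - Q_online[s, a])  # descent on (y - Q)^2

# stand-in environment: action 0 always earns reward 1, all others earn 0
for t in range(500):
    s = rng.randrange(n_states)
    a = select_action(s)
    r = 1.0 if a == 0 else 0.0
    s2 = rng.randrange(n_states)
    replay.append((s, a, r, s2))                           # substep 5.24
    batch = rng.sample(list(replay), min(8, len(replay)))  # substep 5.25
    ddqn_update(batch)
    if t % 50 == 0:
        Q_target = Q_online.copy()                         # target sync
```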
After the training of the safe driving decision model for the commercial vehicle is completed, the state space information acquired by the sensors is input into the model, which outputs high-level driving decisions such as steering, going straight, and accelerating or decelerating in real time, effectively guaranteeing the driving safety of the commercial vehicle in the urban low-speed environment.
Claims (1)
1. A decision-making method for the safe driving of large commercial vehicles in the urban low-speed environment, comprising: first, collecting the safe driving behaviors of human drivers in the urban traffic environment and constructing a safe driving behavior data set; second, constructing a multi-head-attention-based decision model for the safe driving of commercial vehicles, the model comprising two sub-networks, namely a deep double-Q network and a generative adversarial imitation learning network, wherein the deep double-Q network learns safe driving strategies in edge scenarios such as dangerous and conflict scenarios in an unsupervised manner, and the generative adversarial imitation learning sub-network imitates the safe driving behavior of human drivers under different driving conditions; and finally, training the safe driving decision model to obtain driving strategies under different driving conditions, realizing high-level decision output of safe driving behaviors for commercial vehicles; characterized in that:
the method comprises the following steps: collecting safe driving behaviors of human drivers in urban traffic environment
Collecting safe driving behaviors under different driving conditions and driving conditions in a mode of actual road testing and driving simulation, and further constructing a data set representing the safe driving behaviors of human drivers; the method specifically comprises the following 4 sub-steps:
substep 1: constructing a multi-dimensional target information synchronous acquisition system by using a millimeter wave radar, a 128-line laser radar, a vision sensor, a Beidou sensor and an inertial sensor;
substep 2: under a real urban environment, a plurality of drivers sequentially drive operating vehicles carrying a multi-dimensional target information synchronous acquisition system, acquire and process relevant data of various driving behaviors of the drivers such as lane change, lane keeping, vehicle following and acceleration and deceleration, and acquire multi-source heterogeneous description data of the driving behaviors, wherein the data comprises a plurality of obstacle distances in different directions measured by a radar or a vision sensor, positions, speeds, accelerations and yaw angular velocities measured by a Beidou sensor and an inertial sensor, and steering wheel corners measured by a vehicle-mounted sensor;
substep 3: to simulate safe driving behaviors in edge scenarios such as dangerous and conflict scenarios, a virtual urban scene based on hardware-in-the-loop simulation is built; the constructed urban traffic scenes fall into the following three categories:
(1) while the vehicle is driving, a laterally approaching traffic participant suddenly appears in front of the vehicle, i.e., a suddenly encountered obstacle;
(2) while the vehicle is turning, a static traffic participant is present in the vehicle's blind zone;
(3) while the vehicle is turning, a moving traffic participant is present in the vehicle's blind zone;
the traffic scenes contain various road-network structures, including straight roads, curves, and intersections, and various types of traffic participants, including commercial vehicles, passenger vehicles, non-motorized vehicles, and pedestrians;
several drivers drive a commercial vehicle in the virtual scene through a physical rig with a steering wheel, an accelerator pedal, and a brake pedal, and the lateral and longitudinal position, lateral and longitudinal speed, and lateral and longitudinal acceleration of the ego vehicle, together with its relative distance and relative speed to surrounding traffic participants, are collected;
substep 4: based on the data collected in the real urban environment and the driving-simulation environment, a driving-behavior data set for safe-driving decision learning is constructed, specifically represented as:

X = {(s_1, a_1), (s_2, a_2), ..., (s_n, a_n)}  (1)

where X denotes the constructed data set of state-action pairs characterizing the safe driving behavior of human drivers; (s_j, a_j) denotes the "state-action" pair at time j, in which s_j is the state at time j and a_j is the action at time j, i.e., the action taken by the human driver given state s_j; and n is the number of "state-action" pairs in the data set;
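The data set of step one can be sketched in a few lines; this is a minimal pure-Python illustration in which the field layout, the sample values, and the `build_dataset` helper are assumptions for exposition, not part of the patent:

```python
# Sketch of the data set X = {(s_1, a_1), ..., (s_n, a_n)} of equation (1).
# Field layout and sample values are illustrative assumptions.
from typing import List, Tuple

State = Tuple[float, ...]   # e.g. (p_x, p_y, v_x, v_y, a_x, a_y, heading)
Action = int                # index into the 6-element action space of equation (5)

def build_dataset(states: List[State],
                  actions: List[Action]) -> List[Tuple[State, Action]]:
    """Pair each recorded state s_j with the human driver's action a_j."""
    if len(states) != len(actions):
        raise ValueError("each state needs exactly one action")
    return list(zip(states, actions))

# Two recorded instants: driving straight, then decelerating.
X = build_dataset(
    states=[(0.0, 0.0, 0.0, 8.3, 0.0, 0.1, 90.0),
            (0.0, 8.3, 0.0, 7.9, 0.0, -0.4, 90.0)],
    actions=[1, 5],  # a_straight, a_decel (assumed index order)
)
n = len(X)  # n: number of "state-action" pairs
```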
Step two: construct the multi-head-attention-based safe driving decision model for the commercial vehicle
Deep reinforcement learning is used to learn safe driving strategies in edge scenarios such as dangerous and conflict scenarios; in addition, given the sample-imitation capability of imitation learning, imitation learning is used to mimic the safe driving behaviors of human drivers under different driving conditions and operating conditions; the constructed safe driving decision model therefore consists of two parts, described as follows:
substep 1: define the basic parameters of the safe driving decision model
First, the safe-driving decision problem in the urban low-speed environment is converted into a finite Markov decision process; second, the basic parameters of the safe driving decision model are defined;
(1) Define the state space
To describe the motion states of the ego vehicle and nearby traffic participants, the state space is constructed from time-series data and an occupancy grid map, as follows:
S_t = [S_1(t), S_2(t), S_3(t)]  (2)

where S_t denotes the state space at time t, S_1(t) denotes the time-series state space of the ego vehicle at time t, S_2(t) denotes the time-series state space of the surrounding traffic participants at time t, and S_3(t) denotes the state space associated with the occupancy grid map at time t;
first, the motion state of the ego vehicle is described by continuous position, speed, acceleration, and heading-angle information:

S_1(t) = [p_x, p_y, v_x, v_y, a_x, a_y, θ_s]  (3)

where p_x and p_y denote the lateral and longitudinal position of the ego vehicle, in meters; v_x and v_y denote its lateral and longitudinal speed, in meters per second; a_x and a_y denote its lateral and longitudinal acceleration, in meters per second squared; and θ_s denotes its heading angle, in degrees;
second, the motion states of the surrounding traffic participants are described by the relative motion states of the ego vehicle and each participant:

S_2(t) = [Δd_i, Δv_i, a_i]  (4)

where Δd_i, Δv_i, and a_i denote the relative distance, relative speed, and acceleration between the ego vehicle and the i-th traffic participant, in meters, meters per second, and meters per second squared, respectively;
finally, the road area is rasterized and divided into a number of grid cells of size p × q, and the road area and vehicle objects are abstracted into a grid map, namely the "existence" grid map S_3(t) describing relative position relationships, where p denotes the length of a grid cell and q denotes its width;
the "existence" grid map carries four attributes: the grid coordinates, whether a vehicle is present, the category of that vehicle, and the distance to the left and right lane lines; a cell containing no traffic participant is set to '0' and a cell containing one is set to '1', and the positions of occupied cells relative to the cell containing the ego vehicle describe the relative distance between the two vehicles;
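The "existence" grid map above can be sketched as follows; the cell sizes, participant positions, and helper name are illustrative assumptions, and the sketch keeps only the binary occupancy attribute, omitting vehicle category and lane-line distances:

```python
# Minimal sketch of the "existence" occupancy grid map S3(t): the road area is
# divided into grid cells; a cell holds 1 if a traffic participant occupies it
# and 0 otherwise. Grid dimensions and positions are illustrative assumptions.
def build_occupancy_grid(rows, cols, cell_len, cell_wid, participants):
    """participants: iterable of (x, y) positions relative to the grid origin,
    x along the road (meters), y across it (meters)."""
    grid = [[0] * cols for _ in range(rows)]
    for x, y in participants:
        row = int(x // cell_len)
        col = int(y // cell_wid)
        if 0 <= row < rows and 0 <= col < cols:  # ignore vehicles off the map
            grid[row][col] = 1
    return grid

# 4 x 3 cells of 5.0 m x 3.5 m; two surrounding traffic participants.
grid = build_occupancy_grid(rows=4, cols=3, cell_len=5.0, cell_wid=3.5,
                            participants=[(2.0, 1.0), (12.0, 8.0)])
```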
(2) Define the action space
The action space is defined with lateral and longitudinal driving actions:

A_t = [a_left, a_straight, a_right, a_accel, a_cons, a_decel]  (5)

where A_t denotes the action space at time t; a_left, a_straight, and a_right denote turning left, going straight, and turning right; and a_accel, a_cons, and a_decel denote accelerating, maintaining speed, and decelerating;
(3) Define the reward function

R_t = r_1 + r_2 + r_3  (6)

where R_t denotes the reward function at time t, and r_1, r_2, and r_3 denote the forward, rearward, and lateral collision-avoidance reward functions, obtained from equations (7), (8), and (9), respectively;
where TTC denotes the time to collision between the ego vehicle and the obstacle ahead, obtained by dividing the distance to the obstacle ahead by their relative speed, and TTC_thr denotes the forward collision-time threshold; RTTC denotes the rearward time to collision and RTTC_thr the rearward collision-time threshold, both in seconds; x_lat denotes the lateral distance from the ego vehicle to traffic participants on either side and x_min the minimum lateral safety distance, both in meters; and β_1, β_2, and β_3 denote the weight coefficients of the forward, rearward, and lateral collision-avoidance reward functions;
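Equations (7)-(9) are not reproduced in this text, so the following pure-Python sketch assumes a simple piecewise form consistent with the description above: each term contributes a penalty weighted by its β coefficient whenever its threshold is violated; all threshold and weight values are illustrative:

```python
# Assumed piecewise collision-avoidance reward consistent with equation (6);
# the patent's exact formulas (7)-(9) are not reproduced here.
def collision_avoidance_reward(d_front, v_rel_front, rttc, x_lat,
                               ttc_thr=3.0, rttc_thr=2.0, x_min=1.0,
                               b1=1.0, b2=0.5, b3=0.5):
    # TTC: distance to the obstacle ahead divided by the closing speed;
    # no closing speed means no forward-collision risk.
    ttc = d_front / v_rel_front if v_rel_front > 0 else float("inf")
    r1 = -b1 if ttc < ttc_thr else 0.0      # forward collision avoidance
    r2 = -b2 if rttc < rttc_thr else 0.0    # rearward collision avoidance
    r3 = -b3 if x_lat < x_min else 0.0      # lateral collision avoidance
    return r1 + r2 + r3                     # R_t = r1 + r2 + r3, equation (6)

# Closing fast on a lead vehicle 10 m ahead at 5 m/s -> TTC = 2 s < 3 s.
r = collision_avoidance_reward(d_front=10.0, v_rel_front=5.0,
                               rttc=4.0, x_lat=2.0)
```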
substep 2: construct the decision sub-network based on the deep double-Q network
The deep double-Q network is used to learn safe driving strategies in edge scenarios;
a policy network based on a multi-head attention mechanism is designed; in addition, permutation invariance is built into the decision sub-network using a positional-encoding method;
the attention layer can be expressed as:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O  (10)

where MultiHead(Q, K, V) denotes the multi-head attention value; Q denotes the query vector and K the key vector, both of dimension d_k; V denotes the value vector, of dimension d_v; W^O denotes a parameter matrix to be learned; and head_i denotes the i-th head of the multi-head attention (in the present invention, h = 2), calculated as:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)  (11)

Attention(Q, K, V) = softmax(Q K^T / √d_k) V  (12)

where Attention(Q, K, V) denotes the output attention matrix, and W_i^Q, W_i^K, and W_i^V denote parameter matrices to be learned;
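The attention computation can be illustrated with a small pure-Python sketch of scaled dot-product attention and h = 2 heads; the toy dimensions and hand-picked weight matrices are assumptions (a real implementation would learn W_i^Q, W_i^K, W_i^V, and W^O):

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)

def multi_head(Q, K, V, WQ, WK, WV, WO):
    """MultiHead(Q,K,V) = Concat(head_1..head_h) W_O, h = len(WQ) heads."""
    heads = [attention(matmul(Q, wq), matmul(K, wk), matmul(V, wv))
             for wq, wk, wv in zip(WQ, WK, WV)]
    concat = [sum((h[i] for h in heads), []) for i in range(len(Q))]
    return matmul(concat, WO)

# Toy example: two tokens of dimension 2, h = 2 heads projecting to dimension 1.
I2 = [[1.0, 0.0], [0.0, 1.0]]
out = multi_head(Q=I2, K=I2, V=I2,
                 WQ=[[[1.0], [0.0]], [[0.0], [1.0]]],
                 WK=[[[1.0], [0.0]], [[0.0], [1.0]]],
                 WV=[[[1.0], [0.0]], [[0.0], [1.0]]],
                 WO=I2)
```

Each output row mixes the two value vectors with softmax weights, so each token attends most strongly to itself here.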
the decision sub-network is constructed as follows:
first, the state space S_t is connected to encoder 1, encoder 2, and encoder 3, respectively; encoder 1 consists of two fully connected layers and outputs the ego-vehicle motion-state encoding; encoder 2 has the same structure as encoder 1 and outputs the relative-motion-state encoding; encoder 3 consists of two convolutional layers and outputs the occupancy-grid-map encoding;
the fully connected layers each have 64 neurons with Tanh activation functions; the convolutional layers all use 3 × 3 kernels with stride 2;
second, a multi-head attention mechanism is used to analyze the dependencies between the ego vehicle and the surrounding traffic participants, so that the decision sub-network attends to traffic participants that suddenly approach the ego vehicle or conflict with its driving path, and support for variable input sizes and permutation invariance is built into the decision sub-network; the outputs of encoder 1, encoder 2, and encoder 3 are all connected to the multi-head attention module, which outputs an attention matrix; third, the output attention matrix is connected to decoder 1, which consists of one fully connected layer;
the fully connected layer has 64 neurons with a Sigmoid activation function;
substep 3: construct the decision sub-network based on generative adversarial imitation learning
The generative adversarial imitation learning sub-network learns the driving strategies in the driving-behavior data set and in generalized sample data derived from it, and thereby imitates the safe driving behaviors of human drivers under different driving conditions and operating conditions; the sub-network consists of a generator and a discriminator, each constructed as a deep neural network, described as follows:
(1) Construct the generator
The generator network is constructed as shown in FIG. 3; its input is the state space and its output is the probability f = π(·|s; θ) of each action in the action space, where θ denotes the parameters of the generator network; first, the state space is passed through fully connected layers FC_1 and FC_2 in sequence to obtain feature F_1; the state space is likewise passed through fully connected layers FC_3 and FC_4 to obtain feature F_2; at the same time, the state space is passed through convolutional layers C_1 and C_2 in sequence to obtain feature F_3; then features F_1, F_2, and F_3 are passed through a merge layer, fully connected layer FC_5, and a Softmax activation function in sequence to obtain the output f = π(·|s; θ);
fully connected layers FC_1, FC_2, FC_3, FC_4, and FC_5 each have 64 neurons; convolutional layer C_1 has a 3 × 3 kernel with stride 2, and convolutional layer C_2 has a 3 × 3 kernel with stride 1;
(2) Construct the discriminator
The discriminator network is constructed as follows; its input is the state space and its output is a vector of dimension 6, where φ denotes the parameters of the discriminator network; first, the state space is passed through fully connected layers FC_6 and FC_7 in sequence to obtain feature F_4; the state space is likewise passed through fully connected layers FC_8 and FC_9 to obtain feature F_5; at the same time, the state space is passed through convolutional layers C_3 and C_4 in sequence to obtain feature F_6; then features F_4, F_5, and F_6 are passed through a merge layer, fully connected layer FC_10, and a Sigmoid activation function in sequence to obtain the output;
fully connected layers FC_6, FC_7, FC_8, FC_9, and FC_10 each have 64 neurons; convolutional layer C_3 has a 3 × 3 kernel with stride 2, and convolutional layer C_4 has a 3 × 3 kernel with stride 1;
Step three: train the safe driving decision model for the commercial vehicle
First, the decision sub-network based on generative adversarial imitation learning is trained; its goal is to learn a generator network such that the discriminator cannot distinguish driving actions produced by the generator from actions in the driving-behavior data set; this comprises the following sub-steps:
substep 1: in the driving-behavior data set X, initialize the generator network parameters θ_0 and the discriminator network parameters ω_0;
Substep 2: perform L iterations, each comprising substeps 2.1 to 2.2, specifically:
substep 2.1: update the discriminator parameters ω_i → ω_{i+1} using the gradient formula described by equation (13),
where ∇_ω denotes the gradient of the neural-network loss function with respect to the parameters ω;
Substep 2.2: set the reward function and update the generator parameters θ_i → θ_{i+1} using the trust-region policy optimization algorithm;
Then, on the basis of the above training results, training is continued to construct the DDQN-based decision sub-network, comprising the following sub-steps:
substep 3: initialize the experience replay pool D with capacity N;
substep 4: initialize the Q value of each action to a random value;
substep 5: perform M iterations, each comprising substeps 5.1 to 5.2, specifically:
substep 5.1: initialize the state s_0 and the policy parameters φ_0;
Substep 5.2: perform T iterations, each comprising substeps 5.21 to 5.27, specifically:
substep 5.21: with probability ε, randomly select a driving action;
substep 5.22: otherwise, select a_t = argmax_a Q*(φ(s_t), a; θ),
where Q*(·) denotes the optimal action-value function and a_t denotes the action at time t;
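Substeps 5.21 and 5.22 together form an ε-greedy selection rule, which can be sketched as follows (the function name and Q values are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action index,
    otherwise the action maximizing Q (substeps 5.21-5.22)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

greedy_action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0)   # always greedy
random_action = epsilon_greedy([0.0] * 6, epsilon=1.0)         # always random
```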
substep 5.23: execute action a_t to obtain the reward r_t at time t and the state s_{t+1} at time t+1;
Substep 5.24: store the sample (φ_t, a_t, r_t, φ_{t+1}) in the experience replay pool D;
Substep 5.25: randomly draw a minibatch of samples (φ_j, a_j, r_j, φ_{j+1}) from the experience replay pool D;
Substep 5.26: the iteration target is calculated using the following equation:
in the formula (I), the compound is shown in the specification,representing the weight of the target network at time t; gamma represents a discount factor; arg max (·) denotes a variable that maximizes the objective function, y i Represents the iteration target at time i, and p (s, a) represents the motion distribution;
substep 5.27: perform gradient descent on (y_j - Q(φ_j, a_j; θ))^2 using:

∇_{θ_i} L_i(θ_i) = E_{s,a ~ p(·)}[(y_i - Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i)]  (15)

where ∇_{θ_i} denotes the gradient of the neural-network loss function with respect to the parameters θ_i; ε denotes the probability of randomly selecting an action under the ε-greedy exploration strategy; θ_i denotes the parameters at iteration i; L_i(θ_i) denotes the loss function at iteration i; Q(s, a; θ_i) denotes the action-value function of the network; and a' denotes all possible actions in state s';
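The double-Q structure of substep 5.26, in which the online network selects the next action and the target network evaluates it, can be sketched as follows (function name and values are illustrative):

```python
def ddqn_target(r, q_online_next, q_target_next, gamma=0.99, terminal=False):
    """Double-DQN iteration target: the online network (theta_i) selects the
    best next action, the target network (theta^-) evaluates it."""
    if terminal:
        return r
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[best]

# The online net prefers action 2; the target net values that action at 1.0,
# even though it rates action 0 higher itself.
y = ddqn_target(r=0.5,
                q_online_next=[0.2, 0.4, 0.9],
                q_target_next=[1.5, 0.3, 1.0])
```

A plain DQN target would instead take the target network's own maximum (1.5 here), which is the overestimation the double-Q decomposition is designed to curb.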
after training of the commercial-vehicle safe driving decision model is completed, the state-space information acquired by the sensors is input into the model, which outputs high-level driving decisions such as turning, going straight, and accelerating or decelerating in real time, effectively guaranteeing the driving safety of commercial vehicles in urban low-speed environments.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211070514.5A CN115257819A (en) | 2022-09-02 | 2022-09-02 | Decision-making method for safe driving of large-scale commercial vehicle in urban low-speed environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115257819A true CN115257819A (en) | 2022-11-01 |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115731690A (en) * | 2022-11-18 | 2023-03-03 | 北京理工大学 | Unmanned public transportation cluster decision method based on graph neural network reinforcement learning |
CN117048365A (en) * | 2023-10-12 | 2023-11-14 | 江西五十铃汽车有限公司 | Automobile torque control method, system, storage medium and equipment |
CN117246345A (en) * | 2023-11-06 | 2023-12-19 | 镁佳(武汉)科技有限公司 | Method, device, equipment and medium for controlling generating type vehicle |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||