CN113561995A - Automatic driving decision method based on multi-dimensional reward architecture deep Q learning - Google Patents


Info

Publication number
CN113561995A
Authority
CN
China
Prior art keywords
reward
driving
value
network
comfort
Prior art date
Legal status
Granted
Application number
CN202110956262.5A
Other languages
Chinese (zh)
Other versions
CN113561995B (en)
Inventor
崔建勋
张瞫
刘昕
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Priority to CN202110956262.5A
Publication of CN113561995A
Application granted
Publication of CN113561995B
Legal status: Active


Classifications

    • B60W60/001: Planning or execution of driving tasks (under B60W60/00, drive control systems specially adapted for autonomous road vehicles)
    • B60W50/00: Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0019: Control system elements or transfer functions (under B60W2050/0001, details of the control system)
    • G06N3/044: Recurrent networks, e.g. Hopfield networks (under G06N3/04, neural network architecture)
    • G06N3/08: Learning methods (under G06N3/02, neural networks)


Abstract

An automatic driving decision method based on multi-dimensional reward architecture deep Q learning, belonging to the technical field of automatic driving, solves the problem that existing driving decision methods cannot optimize multiple performance dimensions simultaneously. The method collects environmental information of an automatic driving vehicle in real time with a vision sensor and a LIDAR sensor to acquire image information and/or point cloud information; inputs the image information and/or point cloud information into a deep Q-value network with a multi-reward architecture to obtain reward evaluation values of the driving decision under three dimensions: safety, efficiency and comfort; sums the reward evaluation values under the three dimensions to obtain a total driving strategy reward evaluation value; and analyzes the driving strategy reward evaluation value with an epsilon-greedy algorithm to obtain the optimal decision action. The method is suitable for generating control strategies for automatic driving vehicles.

Description

Automatic driving decision method based on multi-dimensional reward architecture deep Q learning
Technical Field
The invention belongs to the technical field of automatic driving.
Background
The automatic driving decision is a crucial link in the overall chain of automatic driving technology. It takes state perception as input and outputs a driving decision for a specific traffic scene, serving the subsequent motion planning and vehicle control of the automatic driving vehicle. The intelligence of the automatic driving decision directly determines the level and quality of driving automation.
Traditionally, there are two general implementations of automatic driving decision-making: rule-based and learning-based methods. Rule-based methods manually enumerate the possible traffic driving states, specify a corresponding driving decision for each state, and store these as a rule set; when the automatic driving vehicle encounters a driving state contained in the rule set, the corresponding decision behavior is triggered.
The greatest benefit of this approach is safety and controllability, since everything stays within the scope of human design and understanding. Its biggest problem, however, is that it is impractical to enumerate all the traffic states an autonomous vehicle may encounter; for states not defined in the rule set, the vehicle will not know how to decide. In other words, the approach does not generalize.
Learning-based methods overcome exactly this generalization difficulty of rule-based methods: a decision model can be trained from state and action samples of some scenes, so that reasonable decision actions are generated even when unknown scenes are encountered. Among learning-based methods, decision-making based on reinforcement learning shows particularly strong potential. Reinforcement learning allows the autonomous vehicle to interact with the environment continuously, so that its decision level keeps improving through autonomous exploration. However, automatic driving decision-making is a complex behavior involving many target dimensions, such as safety, comfort, efficiency and economy, whereas conventional reinforcement-learning-based decision methods generally use a single accumulated benefit estimation function to estimate the combined benefit across dimensions. Because the combined benefit of multiple dimensions is considered jointly, it cannot be guaranteed that each dimension achieves its best benefit.
Disclosure of Invention
The invention aims to solve the problem that multi-dimensional performance cannot be optimized simultaneously in the existing driving decision method, and provides an automatic driving decision method based on multi-dimensional reward architecture deep Q learning.
The invention relates to an automatic driving decision method based on multidimensional reward architecture deep Q learning, which comprises the following steps:
Step one: collecting environmental information of the automatic driving vehicle in real time using a vision sensor and a LIDAR sensor, and acquiring image information and/or point cloud information;
Step two: inputting the image information and/or point cloud information into a deep Q-value network of a multi-reward architecture to obtain reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort;
Step three: summing the reward evaluation values of the driving decision under the three dimensions to obtain a total driving strategy reward evaluation value;
Step four: analyzing the driving strategy reward evaluation value with an epsilon-greedy algorithm to obtain the optimal decision action.
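As an illustration of steps three and four, the following is a minimal Python/PyTorch sketch of how the three per-dimension reward evaluation values could be summed and fed to epsilon-greedy action selection. The network object `mra_dqn`, the tuple `state`, and the action ordering are assumptions introduced for illustration only.

```python
import random
import torch

# Assumed discrete action set (cf. the five strategy actions of claim 3).
ACTIONS = ["accelerate", "decelerate", "turn_right", "turn_left", "do_nothing"]

def select_action(mra_dqn, state, epsilon=0.1):
    """Steps three and four: sum per-dimension Q estimates, then epsilon-greedy.

    `mra_dqn` is assumed to map the perceived state (a tuple of sensor
    tensors, e.g. image and/or point cloud) to three tensors of shape (5,):
    the reward evaluation values of the five actions under the safety,
    efficiency and comfort dimensions obtained in step two.
    """
    with torch.no_grad():
        q_safety, q_efficiency, q_comfort = mra_dqn(*state)
        # Step three: total driving-strategy reward evaluation
        # Q_MRA(s, a) = Q_s(s, a) + Q_o(s, a) + Q_l(s, a)
        q_total = q_safety + q_efficiency + q_comfort
    # Step four: explore with probability epsilon, otherwise pick the action
    # with the largest total reward evaluation value.
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    return int(torch.argmax(q_total).item())
```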
Further, in step two of the present invention, the specific method for obtaining the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort is as follows:
when the acquired data is only image information:
the image information is subjected to regularization processing, and the regularized data is sequentially input to the convolutional layers and the fully connected layers of the deep Q-value network of the multi-reward architecture to obtain the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving;
when the acquired data is only point cloud data:
the point cloud data is sequentially input into a recurrent neural network (LSTM) and a fully connected layer to obtain the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving;
when the received data contains both point cloud data and image information:
the point cloud data is input into the recurrent neural network, the image information is regularized and then input into the convolutional layers, the outputs of the recurrent neural network and the convolutional layers are concatenated and input into the fully connected layer, and the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving are obtained.
Further, in the present invention, the driving strategy in step two consists of five strategy actions: accelerate, decelerate, turn right, turn left, and do nothing.
Further, in the present invention, the deep Q-value network of the multi-reward architecture includes three reward functions: a safety reward function, an efficiency reward function, and a comfort reward function.
Further, in the present invention, the safety reward function is:
(Equation image BDA0003220391990000021: piecewise definition of the safety reward R_s(s, a).)
where r_s is a preset safety reward constant with a positive value, and R_s(s, a) is the safety reward function value obtained by taking action a in the current environment s.
Further, in the present invention, the efficiency reward function is:
(Equation image BDA0003220391990000022: piecewise definition of the efficiency reward R_o(s, a).)
where r_o is a preset efficiency reward constant with a positive value, and R_o(s, a) is the efficiency reward function value obtained by taking action a in the current environment s.
Further, in the present invention, the comfort reward function is:
(Equation image BDA0003220391990000031: piecewise definition of the comfort reward R_l(s, a).)
where r_l is a preset comfort reward constant with a positive value, and R_l(s, a) is the comfort reward function value obtained by taking action a in the current environment s.
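Since the piecewise conditions of R_s, R_o and R_l appear only as equation images in this text, the following Python sketch is an illustrative assumption rather than the patent's exact definitions. It ties safety to a collision flag, efficiency to closeness to a desired speed, and comfort to lane-change occurrence, consistent with the qualitative hints given later in the description, and uses the preset positive constants r_s, r_o and r_l.

```python
def safety_reward(collided: bool, r_s: float = 1.0) -> float:
    # Illustrative assumption: a collision forfeits the preset positive
    # safety constant r_s, otherwise the constant is granted.
    return -r_s if collided else r_s

def efficiency_reward(speed: float, desired_speed: float, r_o: float = 1.0) -> float:
    # Illustrative assumption: the closer the vehicle drives to the desired
    # speed, the larger the efficiency reward (scaled by the constant r_o).
    return r_o * (1.0 - abs(speed - desired_speed) / max(desired_speed, 1e-6))

def comfort_reward(lane_changed: bool, r_l: float = 1.0) -> float:
    # Illustrative assumption: fewer lane changes mean higher comfort, so a
    # lane change forfeits the preset positive comfort constant r_l.
    return -r_l if lane_changed else r_l
```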
Further, in the present invention, the deep Q-value network of the multi-reward architecture in step one is trained, and the specific training method is as follows:
s1, selecting an input environmental state sample S, inputting the sample S into a depth Q value network of a multi-reward architecture to be trained, and obtaining reward evaluation values of driving decisions in three dimensions of comprehensiveness, efficiency and comfort;
step S2, calculating a sum Q of reward valuations of driving decisions in three dimensionsRAMObtaining Q of any one of actions aRAMObtaining a parameter set to be trained of the deep network;
step S3, determining a loss objective function according to a parameter set to be trained of the deep network, minimizing the objective function, and obtaining a driving strategy of the next step;
and S4, acquiring the driving strategy in the step S3, updating the environmental state sample S of the arranged S1, and returning to execute the step S1 until the minimized value of the objective function converges.
Further, in the present invention, the loss objective function in step S3 is:
(Equation image BDA0003220391990000032: definition of the loss objective function L(θ_k).)
where θ_k, k ∈ {s, o, l}, is the parameter set to be trained of the deep network under the safety, efficiency or comfort dimension; s, a, s', a' respectively denote the environmental state at the current moment, the action taken in the current environmental state, the environmental state at the next moment, and the action taken in the environmental state at the next moment; and Q_k(s, a; θ_k), k ∈ {s, o, l}, denotes the cumulative expected total reward that can be obtained in the future over reward dimension k when action a is taken in environmental state s with parameter set θ_k.
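The loss equation itself is given only as an image in this text. A common choice for the per-dimension loss in this setting is the squared temporal-difference error, L(θ_k) = E[(R_k(s, a) + γ max_{a'} Q_k(s', a'; θ_k) - Q_k(s, a; θ_k))²] with discount factor γ; the PyTorch-style sketch of steps S2 and S3 below uses that assumption, and the batch layout, optimizer handling and value of γ are likewise assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(mra_dqn, optimizer, batch, gamma=0.99):
    """One update of the multi-reward deep Q-value network (steps S2-S3).

    `batch` is assumed to hold: `state` and `next_state` (tuples of sensor
    tensors), `action` (a batch of action indices) and `rewards` (a list of
    three per-dimension reward tensors [R_s, R_o, R_l]).
    """
    state, action, rewards, next_state = batch
    q_dims = mra_dqn(*state)            # three (batch, 5) tensors: Q_s, Q_o, Q_l
    q_dims_next = mra_dqn(*next_state)
    loss = 0.0
    for r_k, q_k, q_k_next in zip(rewards, q_dims, q_dims_next):
        # Q_k(s, a; theta_k) for the actions actually taken
        q_sa = q_k.gather(1, action.unsqueeze(1)).squeeze(1)
        # Assumed TD target: R_k(s, a) + gamma * max_a' Q_k(s', a'; theta_k)
        target = r_k + gamma * q_k_next.max(dim=1).values.detach()
        loss = loss + F.mse_loss(q_sa, target)   # per-dimension losses summed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```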
Further, in the present invention, in step three, the reward evaluation values for driving decisions in three dimensions are summed as:
Q_MRA(s, a; θ) = Σ_{k ∈ {s, o, l}} Q_k(s, a; θ_k)
where Q_MRA(s, a; θ) is the sum of the reward evaluation values of the driving decision under the three dimensions, θ is the parameter set of the deep network to be trained, and θ_k is the parameter set to be trained of the deep network under the safety, efficiency or comfort dimension.
The core of the method is a deep Q evaluation network with a multi-reward architecture (MRA-DQN) that trains an accumulated benefit evaluation function independently for each dimension; that is, instead of a single coupled estimate, the multi-reward architecture evaluates the benefit of each dimension separately, further improving the performance of the reinforcement-learning-based automatic driving decision model. The invention also designs a dedicated evaluation network structure for the benefit of each of the three dimensions of safety, efficiency and comfort. The three dimensions share a bottom deep network, and each then has its own unique deep network. The network of each dimension estimates the accumulated reward of its own dimension; these estimates are then aggregated into the total benefit Q_MRA, which finally determines the decision behavior to adopt, so that the optimal strategy is obtained through multi-dimensional consideration.
Drawings
FIG. 1 is a diagram of the deep Q network architecture of the multi-reward architecture of the present invention;
FIG. 2 is the deep Q network architecture with only image data collected by the camera sensor as input;
FIG. 3 is the deep Q network architecture with only 3D point cloud data collected by the LIDAR sensor as input;
FIG. 4 is the deep Q network architecture with both image and 3D point cloud perception data as input.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The first embodiment is as follows: this embodiment is described with reference to FIGS. 1 to 4. The automatic driving decision method based on multi-dimensional reward architecture deep Q learning according to this embodiment comprises:
Step one: collecting environmental information of the automatic driving vehicle in real time using a vision sensor and a LIDAR sensor, and acquiring image information and/or point cloud information;
Step two: inputting the image information and/or point cloud information into a deep Q-value network of a multi-reward architecture to obtain reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort;
Step three: summing the reward evaluation values of the driving decision under the three dimensions to obtain a total driving strategy reward evaluation value;
Step four: analyzing the driving strategy reward evaluation value with an epsilon-greedy algorithm to obtain the optimal decision action.
The invention provides an automatic driving decision method based on multi-reward architecture deep Q learning, whose core is a deep Q evaluation network (MRA-DQN), as shown in FIG. 1. The network receives perception inputs from the automatic driving sensors, including images from the vision sensor and point clouds from the LIDAR sensor, and outputs the cumulative reward values of the 5 automatic driving decisions (accelerate, decelerate, turn right, turn left, do nothing). The finally executed action is then determined from the magnitudes of the cumulative reward values using the epsilon-greedy strategy.
Further, in the second embodiment, in the second step, a specific method for obtaining the reward evaluation value of the driving decision in the three dimensions of safety, efficiency and comfort is as follows:
when the acquired data is only image information:
the image information is subjected to regularization processing, and the regularized data is sequentially input to the convolutional layers and the fully connected layers of the deep Q-value network of the multi-reward architecture to obtain the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving;
when the acquired data is only point cloud data:
the point cloud data is sequentially input into a recurrent neural network (LSTM) and a fully connected layer to obtain the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving;
when the received data contains both point cloud data and image information:
the point cloud data is input into the recurrent neural network, the image information is regularized and then input into the convolutional layers, the outputs of the recurrent neural network and the convolutional layers are concatenated and input into the fully connected layer, and the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving are obtained.
The Q network described in this embodiment is used to estimate the future cumulative benefit of a state-action pair, and is the key to automated driving behavior decision-making. The invention adopts a deep neural network to approximate the Q function. In order to accommodate the two types of autopilot sensor inputs of camera and LIDAR, three types of Q-deep neural networks are designed in the present invention, as shown in fig. 2, 3 and 4, respectively.
The input module provides the state perception information necessary for automatic driving decisions. A forward-looking camera, a 360-degree LIDAR, or a combination of the two may be employed here. The camera provides image data and the LIDAR provides 3D point cloud data.
(1) The deep Q network architecture with only image data collected by the camera sensor as input (FIG. 2). The input image is 80x80 pixels; it is first regularized by a Normalization layer and then passes through a series of convolutional layers, as shown in FIG. 2, which are shared by the Q-value functions of the three dimensions. After the shared convolutional layers, the data enters the respective fully connected layers of each dimension, which then output the Q values of the 5 discrete actions (accelerate, decelerate, turn right, turn left, do nothing) corresponding to that dimension. The Q values of the three dimensions are denoted Q_s, Q_o and Q_l, associated with safety, efficiency (driving closer to the desired speed means higher efficiency) and comfort (fewer lane changes mean higher comfort), each dimension having its own reward function (R_s, R_o, R_l).
(2) The deep Q network architecture with only 3D point cloud data collected by the LIDAR sensor as input (FIG. 3). The input is point cloud data collected by the LIDAR; because this data is time-sequential, the underlying network shared by the Q-value networks of the different dimensions adopts a recurrent neural network (LSTM) with 512 internal neurons. Similarly, the input point cloud data passes through the shared LSTM network, enters the own network of each dimension, and the accumulated benefit estimates of each dimension for the 5 discrete actions are output.
(3) The deep Q network architecture with both image and 3D point cloud perception data as input (FIG. 4). The inputs are image data collected by the camera and point cloud data collected by the LIDAR. The image data is processed by the convolutional network and the 3D point cloud data by the recurrent neural network (LSTM); a splicing (concatenation) layer then joins the feature-extraction results of the two, the joined features enter the network of each dimension, and the accumulated benefit estimates of each dimension for the 5 discrete actions are finally output.
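The following is a minimal PyTorch sketch of the fused architecture of FIG. 4, using the sizes stated in the description (80x80 input images, a 512-unit LSTM) where given; the convolution sizes, hidden widths, image channel count and point-cloud feature dimension are assumptions, since the figures are not reproduced here.

```python
import torch
import torch.nn as nn

class MRADQN(nn.Module):
    """Sketch of the multi-reward-architecture deep Q network (cf. FIG. 4).

    A shared convolutional trunk handles normalized 80x80 camera images and a
    shared 512-unit LSTM handles the time-sequenced LIDAR point-cloud data.
    Their features are concatenated (the splicing layer) and fed to three
    dimension-specific fully connected heads, each emitting Q values for the
    5 discrete actions.
    """
    def __init__(self, point_dim=3, n_actions=5):
        super().__init__()
        # Shared convolutional layers (sizes are assumptions).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        conv_out = 64 * 8 * 8  # 80 -> 19 -> 8 with the strides above
        # Shared 512-unit LSTM for the point-cloud sequence.
        self.lstm = nn.LSTM(input_size=point_dim, hidden_size=512, batch_first=True)
        # One head per reward dimension: safety (s), efficiency (o), comfort (l).
        def head():
            return nn.Sequential(nn.Linear(conv_out + 512, 256), nn.ReLU(),
                                 nn.Linear(256, n_actions))
        self.head_safety = head()
        self.head_efficiency = head()
        self.head_comfort = head()

    def forward(self, image, points):
        img_feat = self.conv(image)               # (batch, conv_out)
        _, (h_n, _) = self.lstm(points)           # h_n: (1, batch, 512)
        fused = torch.cat([img_feat, h_n.squeeze(0)], dim=1)  # splicing layer
        return (self.head_safety(fused),
                self.head_efficiency(fused),
                self.head_comfort(fused))
```

An instance of this class could serve as the `mra_dqn` assumed in the earlier selection and training sketches, with the state given as the (image, points) pair; the image-only and LIDAR-only variants of FIGS. 2 and 3 correspond to keeping only the convolutional or only the LSTM branch.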
Further, in the present embodiment, the driving strategy in step two consists of five strategy actions: accelerate, decelerate, turn right, turn left, and do nothing.
Further, in this embodiment, the deep Q-value network of the multi-reward architecture includes three reward functions: a safety reward function, an efficiency reward function, and a comfort reward function.
Further, in the present embodiment, the safety reward function is:
(Equation image BDA0003220391990000061: piecewise definition of the safety reward R_s(s, a).)
where r_s is a preset safety reward constant with a positive value, and R_s(s, a) is the safety reward function value obtained by taking action a in the current environment s.
Further, in the present embodiment, the efficiency reward function is:
(Equation image BDA0003220391990000062: piecewise definition of the efficiency reward R_o(s, a).)
where r_o is a preset efficiency reward constant with a positive value, and R_o(s, a) is the efficiency reward function value obtained by taking action a in the current environment s.
Further, in the present embodiment, the comfort reward function is:
(Equation image BDA0003220391990000063: piecewise definition of the comfort reward R_l(s, a).)
where r_l is a preset comfort reward constant with a positive value, and R_l(s, a) is the comfort reward function value obtained by taking action a in the current environment s.
The input module provides the state perception information necessary for automatic driving decisions. The present invention may employ a forward-looking camera, a 360-degree LIDAR, or a combination of the two. The camera provides image data and the LIDAR provides 3D point cloud data.
Further, in this embodiment, the deep Q-value network of the multi-reward architecture in step one is trained, and the specific training method is as follows:
s1, selecting an input environmental state sample S, inputting the sample S into a depth Q value network of a multi-reward architecture to be trained, and obtaining reward evaluation values of driving decisions in three dimensions of comprehensiveness, efficiency and comfort;
step S2, calculating a sum Q of reward valuations of driving decisions in three dimensionsRAMObtaining Q of any one of actions aRAMValue, obtainTaking a parameter set to be trained of the deep network;
step S3, determining a loss objective function according to a parameter set to be trained of the deep network, minimizing the objective function, and obtaining a driving strategy of the next step;
and S4, acquiring the driving strategy in the step S3, updating the environmental state sample S of the arranged S1, and returning to execute the step S1 until the minimized value of the objective function converges.
Further, in the present embodiment, the loss objective function in step S3 is:
(Equation image BDA0003220391990000071: definition of the loss objective function L(θ_k).)
where θ_k, k ∈ {s, o, l}, is the parameter set to be trained of the deep network under the safety, efficiency or comfort dimension; s, a, s', a' respectively denote the environmental state at the current moment, the action taken in the current environmental state, the environmental state at the next moment, and the action taken in the environmental state at the next moment; and Q_k(s, a; θ_k), k ∈ {s, o, l}, denotes the cumulative expected total reward that can be obtained in the future over reward dimension k when action a is taken in environmental state s with parameter set θ_k.
Further, in the third step, the reward evaluation values for driving decisions in three dimensions are summed as:
Q_MRA(s, a; θ) = Σ_{k ∈ {s, o, l}} Q_k(s, a; θ_k)
where Q_MRA(s, a; θ) is the sum of the reward evaluation values of the driving decision under the three dimensions, θ is the parameter set of the deep network to be trained, and θ_k is the parameter set to be trained of the deep network under the safety, efficiency or comfort dimension.
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that features described in different dependent claims and herein may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with individual embodiments may be used in other described embodiments.

Claims (10)

1. An automatic driving decision method based on multi-dimensional reward architecture deep Q learning is characterized by comprising the following steps:
Step one: collecting environmental information of the automatic driving vehicle in real time using a vision sensor and a LIDAR sensor, and acquiring image information and/or point cloud information;
Step two: inputting the image information and/or point cloud information into a deep Q-value network of a multi-reward architecture to obtain reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort;
Step three: summing the reward evaluation values of the driving decision under the three dimensions to obtain a total driving strategy reward evaluation value;
Step four: analyzing the driving strategy reward evaluation value with an epsilon-greedy algorithm to obtain the optimal decision action.
2. The automatic driving decision method based on the multidimensional reward architecture deep Q learning as claimed in claim 1, wherein in the second step, the specific method for obtaining the reward evaluation value of the driving decision in three dimensions of safety, efficiency and comfort is as follows:
when the acquired data is only image information:
the image information is subjected to regularization processing, and the regularized data is sequentially input to the convolutional layers and the fully connected layers of the deep Q-value network of the multi-reward architecture to obtain the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving;
when the acquired data is only point cloud data:
the point cloud data is sequentially input into a recurrent neural network and a fully connected layer of the deep Q-value network of the multi-reward architecture to obtain the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving;
when the received data contains both point cloud data and image information:
the point cloud data is input into the recurrent neural network of the deep Q-value network of the multi-reward architecture, the image information is regularized and then input into the convolutional layers of the deep Q-value network of the multi-reward architecture, the outputs of the recurrent neural network and the convolutional layers are concatenated and input into the fully connected layer, and the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving are obtained.
3. The method as claimed in claim 1 or 2, wherein the driving strategy in step two consists of five strategy actions: accelerate, decelerate, turn right, turn left, and do nothing.
4. The method as claimed in claim 3, wherein the deep Q-value network of the multi-reward architecture comprises three reward functions: a safety reward function, an efficiency reward function, and a comfort reward function.
5. The method according to claim 4, wherein the safety reward function is:
(Equation image FDA0003220391980000021: piecewise definition of the safety reward R_s(s, a).)
where r_s is a preset safety reward constant with a positive value, and R_s(s, a) is the safety reward function value obtained by taking action a in the current environment s.
6. The method of claim 4, wherein the efficiency reward function is:
(Equation image FDA0003220391980000022: piecewise definition of the efficiency reward R_o(s, a).)
where r_o is a preset efficiency reward constant with a positive value, and R_o(s, a) is the efficiency reward function value obtained by taking action a in the current environment s.
7. The method of claim 4, wherein the comfort reward function is:
(Equation image FDA0003220391980000023: piecewise definition of the comfort reward R_l(s, a).)
where r_l is a preset comfort reward constant with a positive value, and R_l(s, a) is the comfort reward function value obtained by taking action a in the current environment s.
8. The automatic driving decision method based on the multidimensional reward architecture deep Q learning as claimed in claim 1, wherein the deep Q value network of the multidimensional reward architecture in the first step is trained, and the specific method for training is as follows:
s1, selecting an input environmental state sample S, inputting the sample S into a depth Q value network of a multi-reward architecture to be trained, and obtaining reward evaluation values of driving decisions in three dimensions of comprehensiveness, efficiency and comfort;
step S2, calculating a sum Q of reward valuations of driving decisions in three dimensionsRAMTo obtain any one ofQ of action aRAMObtaining a parameter set to be trained of the deep network;
step S3, determining a loss objective function according to a parameter set to be trained of the deep network, minimizing the objective function, and obtaining a driving strategy of the next step;
and step S4, executing the driving strategy described in step S3, updating the environmental state sample S after arrangement of S1, and returning to execute step S1 until the minimized value of the objective function converges.
9. The method for automatic driving decision based on multidimensional reward architecture deep Q learning as claimed in claim 8, wherein the loss objective function in step S3 is:
(Equation image FDA0003220391980000031: definition of the loss objective function L(θ_k).)
where L(θ_k) is the value of the objective function; θ_k, k ∈ {s, o, l}, is the parameter set to be trained of the deep network under the safety, efficiency or comfort dimension; s, a, s', a' respectively denote the environmental state at the current moment, the action taken in the current environmental state, the environmental state at the next moment, and the action taken in the environmental state at the next moment; and Q_k(s, a; θ_k), k ∈ {s, o, l}, denotes the cumulative expected total reward that can be obtained in the future over reward dimension k when action a is taken in environmental state s with parameter set θ_k.
10. The method according to claim 9, wherein in step three, the reward evaluation values for the driving decision in three dimensions are summed as:
Q_MRA(s, a; θ) = Σ_{k ∈ {s, o, l}} Q_k(s, a; θ_k)
where Q_MRA(s, a; θ) is the sum of the reward evaluation values of the driving decision under the three dimensions, θ is the parameter set of the deep network to be trained, and θ_k is the parameter set to be trained of the deep network under the safety, efficiency or comfort dimension.
CN202110956262.5A 2021-08-19 2021-08-19 Automatic driving decision method based on multi-dimensional reward architecture deep Q learning Active CN113561995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110956262.5A CN113561995B (en) 2021-08-19 2021-08-19 Automatic driving decision method based on multi-dimensional reward architecture deep Q learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110956262.5A CN113561995B (en) 2021-08-19 2021-08-19 Automatic driving decision method based on multi-dimensional reward architecture deep Q learning

Publications (2)

Publication Number Publication Date
CN113561995A 2021-10-29
CN113561995B CN113561995B (en) 2022-06-21

Family

ID=78172192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110956262.5A Active CN113561995B (en) 2021-08-19 2021-08-19 Automatic driving decision method based on multi-dimensional reward architecture deep Q learning

Country Status (1)

Country Link
CN (1) CN113561995B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151564A1 (en) * 2018-11-12 2020-05-14 Honda Motor Co., Ltd. System and method for multi-agent reinforcement learning with periodic parameter sharing
CN114802307A (en) * 2022-05-23 2022-07-29 哈尔滨工业大学 Intelligent vehicle transverse control method under automatic and manual hybrid driving scene

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110850854A (en) * 2018-07-27 2020-02-28 通用汽车环球科技运作有限责任公司 Autonomous driver agent and policy server for providing policies to autonomous driver agents
CN111098852A (en) * 2019-12-02 2020-05-05 北京交通大学 Parking path planning method based on reinforcement learning
US20200175364A1 (en) * 2017-05-19 2020-06-04 Deepmind Technologies Limited Training action selection neural networks using a differentiable credit function
CN111275249A (en) * 2020-01-15 2020-06-12 吉利汽车研究院(宁波)有限公司 Driving behavior optimization method based on DQN neural network and high-precision positioning
CN111898211A (en) * 2020-08-07 2020-11-06 吉林大学 Intelligent vehicle speed decision method based on deep reinforcement learning and simulation method thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200175364A1 (en) * 2017-05-19 2020-06-04 Deepmind Technologies Limited Training action selection neural networks using a differentiable credit function
CN110850854A (en) * 2018-07-27 2020-02-28 通用汽车环球科技运作有限责任公司 Autonomous driver agent and policy server for providing policies to autonomous driver agents
CN111098852A (en) * 2019-12-02 2020-05-05 北京交通大学 Parking path planning method based on reinforcement learning
CN111275249A (en) * 2020-01-15 2020-06-12 吉利汽车研究院(宁波)有限公司 Driving behavior optimization method based on DQN neural network and high-precision positioning
CN111898211A (en) * 2020-08-07 2020-11-06 吉林大学 Intelligent vehicle speed decision method based on deep reinforcement learning and simulation method thereof

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151564A1 (en) * 2018-11-12 2020-05-14 Honda Motor Co., Ltd. System and method for multi-agent reinforcement learning with periodic parameter sharing
US11657251B2 (en) * 2018-11-12 2023-05-23 Honda Motor Co., Ltd. System and method for multi-agent reinforcement learning with periodic parameter sharing
CN114802307A (en) * 2022-05-23 2022-07-29 哈尔滨工业大学 Intelligent vehicle transverse control method under automatic and manual hybrid driving scene
CN114802307B (en) * 2022-05-23 2023-05-05 哈尔滨工业大学 Intelligent vehicle transverse control method under automatic and manual mixed driving scene

Also Published As

Publication number Publication date
CN113561995B (en) 2022-06-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant