CN112861269A - Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction

Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction

Info

Publication number
CN112861269A
Authority
CN
China
Prior art keywords
vehicle
state
value
neural network
action
Prior art date
Legal status
Granted
Application number
CN202110267799.0A
Other languages
Chinese (zh)
Other versions
CN112861269B (en)
Inventor
黄鹤
吴润晨
张峰
王博文
于海涛
汤德江
张炳力
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202110267799.0A priority Critical patent/CN112861269B/en
Publication of CN112861269A publication Critical patent/CN112861269A/en
Application granted granted Critical
Publication of CN112861269B publication Critical patent/CN112861269B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/10Geometric CAD
    • G06F30/15Vehicle, aircraft or watercraft design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/10Geometric CAD
    • G06F30/17Mechanical parametric or variational design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2119/00Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/14Force analysis or force optimisation, e.g. static or dynamic forces
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Geometry (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Automation & Control Theory (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Artificial Intelligence (AREA)
  • Traffic Control Systems (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction, which comprises the following steps: 1, defining a state parameter set s and a control parameter set a for driving the automobile; 2, initializing the deep reinforcement learning parameters and constructing a deep neural network; 3, defining a deep reinforcement learning reward function and a priority extraction rule; 4, training the deep neural network and obtaining an optimal network model; 5, obtaining the state parameter s_t of the automobile at time t and inputting it into the optimal network model to obtain an output a_t, which is then executed by the automobile. By combining a priority extraction algorithm with a deep reinforcement learning control method, the invention accomplishes longitudinal multi-state driving of the automobile, so that the automobile drives with higher safety and traffic accidents are reduced.

Description

Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction
Technical Field
The invention relates to the technical field of intelligent automobile longitudinal multi-state control, in particular to an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction.
Background
With the rapid development of the urban economy and the continuous improvement of people's living standards, the number of motor vehicles in cities has increased greatly. Automobiles have become an indispensable means of transportation, but the speed and convenience they bring are accompanied by a series of safety problems. Because of drivers' limited skill, uncontrollable external factors and other causes, traffic accidents such as collisions between two or more vehicles occur frequently on the road, causing loss of life and property and seriously obstructing traffic. With the continuous development of automotive technology, many automobile manufacturers have introduced adaptive cruise control systems, emergency braking systems and the like. An adaptive cruise control system acquires data on the road ahead with sensors such as radar and, according to a corresponding algorithm, keeps a certain distance from the preceding vehicle while maintaining a certain speed; however, it usually activates only at higher speeds, for example above 25 km/h, and below that speed the driver must take over manual control. An emergency braking system is a technology that brakes actively to avoid an accident when the vehicle is driving outside the adaptive cruise state and encounters an emergency ahead, such as the preceding vehicle stopping suddenly or a pedestrian appearing suddenly; however, because of sensor misjudgment, environmental errors and related causes, it cannot be applied to all driving environments and may still lead to dangerous accidents.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction, which completes longitudinal multi-state driving of the automobile by combining a priority extraction algorithm with a deep reinforcement learning control method, so that the automobile drives with higher safety and traffic accidents are reduced.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention relates to an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction, which is characterized by comprising the following steps of:
step 1: establishing a vehicle dynamic model and a vehicle running environment model;
step 2: acquiring automobile running data in a real driving scene as initialization data, wherein the automobile running data is initial state information of a vehicle and initial control parameter information of the vehicle;
step 3: defining the state information set s = {s_0, s_1, …, s_t, …, s_n} of the vehicle, where s_0 denotes the initial state information of the vehicle and s_t denotes the state reached after the vehicle, in state s_(t-1), executes control action a_(t-1) at time t-1; s_t = {Ax_t, e_t, Ve_t}, where Ax_t denotes the longitudinal acceleration of the vehicle at time t, e_t denotes the deviation of the relative distance between the ego vehicle and the preceding vehicle at time t, and Ve_t denotes the difference between the ego vehicle speed and the preceding vehicle speed;
Defining the control parameter set a = {a_0, a_1, …, a_t, …, a_n} of the vehicle, where a_0 denotes the initial control parameter information of the vehicle and a_t denotes the action executed by the vehicle in state s_t, i.e. at time t; a_t = {T_t, B_t}, where T_t denotes the throttle opening of the vehicle at time t and B_t denotes the brake master cylinder pressure of the vehicle at time t; t = 1, 2, …, c, where c denotes the total training duration;
step 4: initializing parameters, including the time t, the greedy probability epsilon-greedy, the experience pool size ms, the target network update frequency rt, the number bs of preferentially extracted data, and the reward attenuation factor γ;
step 5: constructing a deep neural network and randomly initializing its parameters: the weights w and the biases b;
The deep neural network comprises an input layer, a hidden layer and an output layer. The input layer comprises m neurons and receives the state s_t of the vehicle at time t; the hidden layer comprises n neurons, processes the state information from the input layer with the Relu activation function and passes it to the output layer; the output layer comprises k neurons and outputs the action value function:
Q_e = Relu(Relu(s_t × w_1 + b_1) × w_2 + b_2)    (1)
In formula (1), w_1 and b_1 are the weights and biases of the hidden layer, w_2 and b_2 are the weights and biases of the output layer, and Q_e, the output value of the output layer, gives the current Q values of all actions obtained by the deep neural network;
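For illustration only, a minimal Python sketch of the forward pass in formula (1) follows; the layer sizes m = 3, n = 64 and k (the number of discrete throttle/brake actions), and the use of PyTorch, are assumptions made for the example rather than values fixed by the invention.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Sketch of formula (1): Q_e = Relu(Relu(s_t x w_1 + b_1) x w_2 + b_2).
    # m = 3 inputs (Ax_t, e_t, Ve_t), n hidden neurons, k output actions (assumed sizes).
    def __init__(self, m: int = 3, n: int = 64, k: int = 9):
        super().__init__()
        self.hidden = nn.Linear(m, n)   # weights w_1, biases b_1
        self.out = nn.Linear(n, k)      # weights w_2, biases b_2

    def forward(self, s_t: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.hidden(s_t))   # hidden layer with Relu activation
        return torch.relu(self.out(h))     # output layer: current Q values of all actions

# usage: q_values = QNetwork()(torch.tensor([[0.3, 1.2, -0.5]]))  # one state s_t = {Ax_t, e_t, Ve_t}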
step 6: defining a reward function for deep reinforcement learning:
r_h = … (piecewise reward function for the high-speed state; given as an image in the original)    (2)
r_l = … (piecewise reward function for the low-speed state; given as an image in the original)    (3)
In formulae (2) and (3), r_h is the reward value in the high-speed state of the vehicle, r_l is the reward value in the low-speed state of the vehicle, dis is the relative distance between the ego vehicle and the preceding vehicle, Vf is the speed of the preceding vehicle, x is the lower limit of the relative distance, y is the upper limit of the relative distance, mid is the switching threshold of the reward function with respect to the relative distance, lim is the switching threshold of the reward function with respect to the difference between the ego vehicle speed and the preceding vehicle speed, z is the switching threshold of the reward function with respect to the preceding vehicle speed, and u is the lower limit of the preceding vehicle speed;
step 7: defining an experience pool priority extraction rule;
The current Q value Q_e and the target Q value Q_t stored in the experience pool are differenced, and according to the SumTree algorithm the difference values are used to rank all parameter tuples stored in the experience pool by priority; the ranked parameter tuples are obtained and the first bs parameter tuples are extracted from them;
The weight ISW of the extracted first bs parameter tuples is obtained using equation (4):
ISW = (p_k / min(p))^(-β)    (4)
In formula (4), p_k is the priority value of the k-th parameter tuple, min(p) is the minimum priority value among the extracted first bs parameter tuples, and β is a weight growth coefficient whose value gradually converges from 0 to 1 as the number of extractions increases;
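A minimal Python sketch of this extraction rule is given below. It follows the text literally: priorities are the absolute differences |Q_t - Q_e|, the tuples are ranked and the first bs are taken (a SumTree only makes this ranking and drawing more efficient), and the weights follow the ISW expression reconstructed above; the small constant added to the priorities is an assumption to avoid division by zero.

import numpy as np

def extract_priority_batch(pool, q_e, q_t, bs, beta):
    # Sketch of step 7: rank stored tuples (s_t, a_t, r_t, s_t+1) by |Q_t - Q_e|,
    # take the bs highest-priority tuples, and weight them by ISW = (p_k / min(p))**(-beta).
    priorities = np.abs(np.asarray(q_t) - np.asarray(q_e)) + 1e-6  # one priority per stored tuple
    order = np.argsort(-priorities)             # descending priority (plain sort instead of a SumTree)
    idx = order[:bs]                            # the first bs parameter tuples
    p_batch = priorities[idx]
    isw = (p_batch / p_batch.min()) ** (-beta)  # importance weights; beta is annealed from 0 to 1
    return [pool[i] for i in idx], isw, idx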
step 8: defining a greedy strategy;
A random number η between 0 and 1 is generated and compared with epsilon-greedy: if η ≤ epsilon-greedy, the action corresponding to the maximum Q value in Q_e is selected as the action executed by the vehicle; otherwise, an action is selected at random as the action executed by the vehicle;
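A short Python sketch of this selection rule, written exactly as stated (here epsilon-greedy is the probability of taking the greedy action, not the exploration probability), might look as follows; the list-based representation of Q_e is an assumption for the example.

import random

def select_action(q_e, epsilon_greedy):
    # Step 8 as written: with probability epsilon_greedy pick the action with the
    # largest value in Q_e, otherwise pick a random action.
    eta = random.random()                                      # random number between 0 and 1
    if eta <= epsilon_greedy:
        return max(range(len(q_e)), key=lambda i: q_e[i])      # greedy action
    return random.randrange(len(q_e))                          # random exploratory action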
step 9: creating an experience pool D for storing the state, action and reward information of the vehicle at each moment;
For the state s_t at time t, all action value functions are obtained through the deep neural network and the action a_t is selected with the greedy strategy and then executed by the vehicle;
Executing action a_t in state s_t yields the state parameter s_(t+1) at time t+1 and the reward value r_t at time t; these are stored in the experience pool D as the parameter tuple (s_t, a_t, r_t, s_(t+1));
step 10: constructing a target neural network with the same structure as the deep neural network;
Using the preferential extraction rule, bs parameter tuples are obtained from the experience pool D, and the state s_(t+1) at time t+1 is input into the target neural network, giving:
Q_ne = Relu(Relu(s_(t+1) × w_1′ + b_1′) × w_2′ + b_2′)    (5)
In formula (5), Q_ne, the output value of the output layer of the target neural network, gives the Q values of all actions obtained by the target neural network; w_1′ and w_2′ are the weights of the hidden layer and the output layer of the target neural network, respectively, and b_1′ and b_2′ are the biases of the hidden layer and the output layer of the target neural network, respectively;
step 11: establishing the target Q value Q_t;
The probability distribution π(a|s) of action a executed in state s is defined by equation (6):
π(a|s) = P(a_t = a | s_t = s)    (6)
In formula (6), P denotes the conditional probability;
The state value function v_π(s) is obtained using equation (7):
v_π(s) = E_π(r_t + γ r_(t+1) + γ^2 r_(t+2) + … | s_t = s)    (7)
In formula (7), γ is the reward attenuation factor and E_π denotes the expectation;
The probability P_(ss′)^a of moving to the next state s′ when action a_t is executed at time t is obtained by equation (8):
P_(ss′)^a = P(s_(t+1) = s′ | s_t = s, a_t = a)    (8)
The action value function q_π(s, a) is obtained using formula (9):
q_π(s, a) = R_s^a + γ Σ_(s′∈S) P_(ss′)^a v_π(s′)    (9)
In formula (9), R_s^a denotes the reward value obtained by the vehicle after executing action a in state s, and v_π(s′) denotes the state value function of the vehicle in state s′;
The target Q value Q_t is obtained by formula (10):
Q_t = r_t + γ max(Q_ne)    (10)
Step 12: the loss function loss is constructed using equation (11):
loss = ISW × (Q_t - Q_e)^2    (11)
Gradient descent is applied to the loss function loss so as to update the deep neural network parameters w_1, w_2, b_1, b_2;
The parameters w_1′, w_2′, b_1′, b_2′ of the target neural network are updated at the update frequency rt, and the updated values are taken from the deep neural network;
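The update in steps 10 to 12 can be sketched in Python/PyTorch as below; averaging the weighted loss over the batch, the tensor layout of the batch, and the use of an external optimizer are assumptions of the example rather than details fixed by the invention.

import torch

def update_step(q_net, target_net, optimizer, batch, isw, gamma, step, rt):
    # Sketch of steps 10-12: target Q from formula (10), weighted loss from formula (11),
    # one gradient-descent step on the deep network, and a periodic copy into the target network.
    s_t, a_t, r_t, s_t1 = batch                                   # stacked tensors (assumed layout)
    q_e = q_net(s_t).gather(1, a_t.unsqueeze(1)).squeeze(1)       # current Q of the executed actions
    with torch.no_grad():
        q_t = r_t + gamma * target_net(s_t1).max(dim=1).values    # Q_t = r_t + gamma * max(Q_ne)
    weights = torch.as_tensor(isw, dtype=torch.float32)
    loss = (weights * (q_t - q_e) ** 2).mean()                    # loss = ISW * (Q_t - Q_e)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % rt == 0:                                            # update the target network every rt steps
        target_net.load_state_dict(q_net.state_dict())
    return loss.item()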
step 13: assigning t+1 to t and judging whether t ≤ c; if so, returning to step 9 to continue training; otherwise, judging whether the loss value gradually decreases and tends to converge; if so, a trained deep neural network has been obtained; otherwise, making t equal to c+1, increasing the number of network iterations, and returning to step 9;
step 14: inputting the real-time state parameter information of the vehicle into the trained deep neural network to obtain the output action, so that the corresponding action is executed by the vehicle to complete longitudinal multi-state control.
Compared with the prior art, the invention has the beneficial effects that:
1. Compared with traditional automobile longitudinal control methods, the control method has better control smoothness under different working conditions and better control stability under extreme working conditions, and is suitable for multi-state control of the automobile at high, medium and low speed;
2. The deep reinforcement learning of the invention uses a trained deep neural network, and the corresponding action can be executed simply by inputting the state information of the automobile; compared with complex traditional automobile control, it is simpler and faster and has a relatively better control effect;
3. Compared with common reinforcement learning, the deep reinforcement learning of the invention processes the input state parameter information with a neural network instead of storing a large amount of tabular data, which greatly saves memory, and the neural network training is more efficient and converges better than common iterative methods;
4. Compared with the harsh switching of traditional automobile multi-state control methods, the data priority extraction method adopted by the invention can rank the data in the experience pool by priority and integrate the parameter information of the automobile in various states, which greatly shortens the training time, unifies the multi-state control of the automobile, removes the need for complicated switching between control methods, and gives a better control effect.
Detailed Description
In this embodiment, an automobile longitudinal multi-state control method based on deep reinforcement learning and preferential extraction decides the throttle opening and the master cylinder pressure of the automobile at each moment according to the real-time state parameters of the automobile, so as to complete the multi-state control of following and adaptive cruise in the high-speed state, emergency braking in the medium-speed state, and start-stop in the low-speed state, specifically according to the following steps:
step 1: establishing a vehicle dynamic model and a vehicle running environment model by utilizing carsim software;
step 2: acquiring automobile driving data in a real driving scene and taking the automobile driving data as initialization data, wherein the automobile driving data is initial state information of a vehicle and initial control parameter information of the vehicle;
step 3: defining the state information set s = {s_0, s_1, …, s_t, …, s_n} of the vehicle, where s_0 denotes the initial state information of the vehicle and s_t denotes the state reached after the vehicle, in state s_(t-1), executes control action a_(t-1) at time t-1; s_t = {Ax_t, e_t, Ve_t}, where Ax_t denotes the longitudinal acceleration of the vehicle at time t, in m/s^2, e_t denotes the deviation of the relative distance between the ego vehicle and the preceding vehicle at time t, and Ve_t denotes the difference between the ego vehicle speed and the preceding vehicle speed;
Defining the control parameter set a = {a_0, a_1, …, a_t, …, a_n} of the vehicle, where a_0 denotes the initial control parameter information of the vehicle and a_t denotes the action executed by the vehicle in state s_t, i.e. at time t; a_t = {T_t, B_t}, where T_t denotes the throttle opening of the vehicle at time t and B_t denotes the brake master cylinder pressure of the vehicle at time t, in MPa; t = 1, 2, …, c, where c denotes the total training duration;
step 4: initializing parameters, including the time t, the greedy probability epsilon-greedy, the experience pool size ms, the target network update frequency rt, the number bs of preferentially extracted data, and the reward attenuation factor γ;
step 5: constructing a deep neural network and randomly initializing its parameters: the weights w and the biases b;
The deep neural network comprises an input layer, a hidden layer and an output layer. The input layer comprises m neurons and receives the state s_t of the vehicle at time t; the hidden layer comprises n neurons, processes the state information from the input layer with the Relu activation function and passes it to the output layer; the output layer comprises k neurons and outputs the action value function;
For the hidden layer:
l = Relu((s_t × w_1) + b_1)    (1)
In formula (1), w_1 and b_1 are the weights and biases of the hidden layer;
For the output layer:
out = Relu((l × w_2) + b_2)    (2)
In formula (2), w_2 and b_2 are the weights and biases of the output layer;
Combining formula (1) and formula (2) gives:
Q_e = Relu(Relu(s_t × w_1 + b_1) × w_2 + b_2)    (3)
In formula (3), Q_e, the output value of the output layer, gives the current Q values of all actions obtained through the deep neural network;
step 6: defining a deep reinforcement learning reward function, wherein the design of the reward function is an important component of a deep reinforcement learning algorithm, the updating and convergence of the neural network weight and bias depend on the quality of the design of the reward function, and the reward function is defined as follows:
r_h = … (piecewise reward function for the high-speed state; given as an image in the original)    (4)
r_l = … (piecewise reward function for the low-speed state; given as an image in the original)    (5)
In formulae (4) and (5), r_h is the reward value in the high-speed state of the vehicle and r_l is the reward value in the low-speed state of the vehicle; the condition for switching between them is whether the vehicle speed reaches 25 km/h: if the vehicle speed reaches or exceeds 25 km/h, high-speed control of the vehicle is performed to complete the corresponding following and adaptive cruise, and if the vehicle speed is below 25 km/h, medium- and low-speed control of the vehicle is performed to complete the corresponding emergency braking and start-stop operations; dis is the relative distance between the ego vehicle and the preceding vehicle, in m; Vf is the speed of the preceding vehicle, in km/h; x is the lower limit of the relative distance, in m; y is the upper limit of the relative distance, in m; mid is the switching threshold of the reward function with respect to the relative distance, in m; lim is the switching threshold of the reward function with respect to the difference between the ego vehicle speed and the preceding vehicle speed, in km/h; z is the switching threshold of the reward function with respect to the preceding vehicle speed, in km/h; and u is the lower limit of the preceding vehicle speed, in km/h;
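The high/low-speed switching described above can be illustrated with the minimal Python sketch below. Only the 25 km/h mode switch is taken from the text; the two piecewise reward functions r_h and r_l themselves are given as images in the original, so they appear here only as placeholder callables supplied by the caller.

def reward(v_ego, v_front, dis, high_speed_reward, low_speed_reward):
    # Mode switch for formulae (4)/(5): at or above 25 km/h the high-speed reward r_h
    # (following / adaptive cruise) applies; below it the low-speed reward r_l
    # (emergency braking / start-stop) applies. The reward bodies are placeholders.
    if v_ego >= 25.0:                                   # vehicle speed in km/h
        return high_speed_reward(dis, v_ego, v_front)   # r_h, formula (4)
    return low_speed_reward(dis, v_ego, v_front)        # r_l, formula (5)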
step 7: defining an experience pool priority extraction rule;
Under normal conditions, the vehicle rarely encounters states in the environment that yield a large reward value; the reward values of the other states are very small, are not worth learning and contribute little to iterating the neural network parameters, so in an environment with only a small number of large-reward states the learning time is greatly increased and the effect is poor;
By using the experience pool priority extraction method, the small number of state samples that are worth learning can be given due weight;
Specifically, when the current and target state parameters are stored in the experience pool, the current Q value Q_e and the target Q value Q_t stored in the experience pool are differenced, and according to the SumTree algorithm the difference values are used to rank all parameter tuples stored in the experience pool by priority; the ranked parameter tuples are obtained and the first bs parameter tuples are extracted from them;
The weight ISW of the extracted first bs parameter tuples is obtained using equation (6):
ISW = (p_k / min(p))^(-β)    (6)
In formula (6), p_k is the priority value of the k-th parameter tuple, min(p) is the minimum priority value among the extracted first bs parameter tuples, and β is a weight growth coefficient whose value gradually converges from 0 to 1 as the number of extractions increases;
Using the experience pool priority extraction method effectively avoids ineffective training, greatly shortens the training time, and gives a better training effect;
step 8: defining a greedy strategy;
A random number η between 0 and 1 is generated and compared with epsilon-greedy: if η ≤ epsilon-greedy, the action corresponding to the maximum Q value in Q_e is selected as the action executed by the vehicle; otherwise, an action is selected at random as the action executed by the vehicle;
step 9: creating an experience pool D for storing the state, action and reward information of the vehicle at each moment; by means of experience replay, deep reinforcement learning handles the problems of data correlation and non-stationary distributions;
For the state s_t at time t, all action value functions are obtained through the deep neural network and the action a_t is selected with the greedy strategy and then executed by the vehicle;
Executing action a_t in state s_t yields the state parameter s_(t+1) at time t+1 and the reward value r_t at time t; these are stored in the experience pool D as the parameter tuple (s_t, a_t, r_t, s_(t+1));
step 10: constructing a target neural network with the same structure as the deep neural network;
Using the preferential extraction rule, bs parameter tuples are obtained from the experience pool D, and the state s_(t+1) at time t+1 is input into the target neural network, giving:
Q_ne = Relu(Relu(s_(t+1) × w_1′ + b_1′) × w_2′ + b_2′)    (7)
In formula (7), Q_ne, the output value of the output layer of the target neural network, gives the Q values of all actions obtained by the target neural network; w_1′ and w_2′ are the weights of the hidden layer and the output layer of the target neural network, respectively, and b_1′ and b_2′ are the biases of the hidden layer and the output layer of the target neural network, respectively;
step 11: establishing the target Q value Q_t;
The action of the vehicle in a given state is not deterministic, and a conditional probability is needed to select the action to be executed; this conditional probability is defined as follows:
π(a|s) = P(a_t = a | s_t = s)    (8)
In equation (8), π(a|s) denotes the probability distribution of action a executed by the vehicle in state s, and P denotes the conditional probability;
The state value function v_π(s) is obtained using equation (9):
v_π(s) = E_π(r_t + γ r_(t+1) + γ^2 r_(t+2) + … | s_t = s)    (9)
In formula (9), E_π denotes the expectation and γ denotes the reward attenuation factor, which takes a value between 0 and 1. When γ = 0, v_π(s) = E_π(r_t | s_t = s), and the state value function is determined only by the reward value of the current state, independent of subsequent states; when γ = 1, v_π(s) = E_π(r_t + r_(t+1) + r_(t+2) + … | s_t = s), and the state value function is determined by the reward values of the current state and all subsequent states. As γ tends to 0 the current reward is emphasized more, and as γ tends to 1 the subsequent rewards are given more consideration;
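A small worked illustration of formula (9) for a single sampled reward sequence follows (the expectation E_π is over many such rollouts); the three-step reward sequence used in the comments is a made-up example.

def discounted_return(rewards, gamma):
    # Formula (9) for one rollout: r_t + gamma*r_(t+1) + gamma^2*r_(t+2) + ...
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# For rewards = [1.0, 1.0, 1.0]:
#   gamma = 0   -> 1.0                     (only the current reward counts)
#   gamma = 1   -> 3.0                     (all rewards count equally)
#   gamma = 0.9 -> 1.0 + 0.9 + 0.81 = 2.71 (later rewards are attenuated)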
The probability P_(ss′)^a of moving to the next state s′ when action a_t is executed at time t is obtained by equation (10):
P_(ss′)^a = P(s_(t+1) = s′ | s_t = s, a_t = a)    (10)
The action value function q_π(s, a) is obtained using equation (11):
q_π(s, a) = R_s^a + γ Σ_(s′∈S) P_(ss′)^a v_π(s′)    (11)
In formula (11), R_s^a denotes the reward value obtained by the vehicle after executing action a in state s, and v_π(s′) denotes the state value function of the vehicle in state s′;
The target Q value Q_t is obtained by formula (12):
Q_t = r_t + γ max(Q_ne)    (12)
Step 12: the loss function loss is constructed using equation (13):
loss = ISW × (Q_t - Q_e)^2    (13)
Gradient descent is applied to the loss function loss so as to update the parameters w_1, w_2, b_1, b_2 of the deep neural network;
The parameters w_1′, w_2′, b_1′, b_2′ of the target neural network are updated at the update frequency rt, and the updated values are taken from the deep neural network;
step 13: assigning t+1 to t and judging whether t ≤ c; if so, returning to step 9 to continue training; otherwise, judging whether the loss value gradually decreases and tends to converge; if so, a trained deep neural network has been obtained; otherwise, making t equal to c+1, increasing the number of network iterations, and returning to step 9;
step 14: the real-time state parameter information of the vehicle is input into the trained deep neural network and the output action is obtained, so that the corresponding action is executed by the vehicle to complete longitudinal high-speed, medium-speed and low-speed multi-state control.
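At deployment, step 14 reduces to a single forward pass per control cycle. A minimal Python sketch follows; the discrete action_table mapping an output index to a (T_t, B_t) pair is an assumption made for illustration, since the patent does not list the concrete action set.

import torch

def control_step(q_net, ax_t, e_t, ve_t, action_table):
    # Step 14: feed the real-time state s_t = {Ax_t, e_t, Ve_t} to the trained network,
    # take the action with the largest Q value, and map it to throttle opening T_t
    # and master cylinder pressure B_t.
    s_t = torch.tensor([[ax_t, e_t, ve_t]], dtype=torch.float32)
    with torch.no_grad():
        a_t = int(q_net(s_t).argmax(dim=1).item())
    throttle_T_t, brake_B_t = action_table[a_t]
    return throttle_T_t, brake_B_t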

Claims (1)

1. A longitudinal multi-state control method of an automobile based on deep reinforcement learning preferential extraction is characterized by comprising the following steps:
step 1: establishing a vehicle dynamic model and a vehicle running environment model;
step 2: acquiring automobile running data in a real driving scene as initialization data, wherein the automobile running data is initial state information of a vehicle and initial control parameter information of the vehicle;
step 3: defining the state information set s = {s_0, s_1, …, s_t, …, s_n} of the vehicle, where s_0 denotes the initial state information of the vehicle and s_t denotes the state reached after the vehicle, in state s_(t-1), executes control action a_(t-1) at time t-1; s_t = {Ax_t, e_t, Ve_t}, where Ax_t denotes the longitudinal acceleration of the vehicle at time t, e_t denotes the deviation of the relative distance between the ego vehicle and the preceding vehicle at time t, and Ve_t denotes the difference between the ego vehicle speed and the preceding vehicle speed;
Defining the control parameter set a = {a_0, a_1, …, a_t, …, a_n} of the vehicle, where a_0 denotes the initial control parameter information of the vehicle and a_t denotes the action executed by the vehicle in state s_t, i.e. at time t; a_t = {T_t, B_t}, where T_t denotes the throttle opening of the vehicle at time t and B_t denotes the brake master cylinder pressure of the vehicle at time t; t = 1, 2, …, c, where c denotes the total training duration;
step 4: initializing parameters, including the time t, the greedy probability epsilon-greedy, the experience pool size ms, the target network update frequency rt, the number bs of preferentially extracted data, and the reward attenuation factor γ;
step 5: constructing a deep neural network and randomly initializing its parameters: the weights w and the biases b;
The deep neural network comprises an input layer, a hidden layer and an output layer. The input layer comprises m neurons and receives the state s_t of the vehicle at time t; the hidden layer comprises n neurons, processes the state information from the input layer with the Relu activation function and passes it to the output layer; the output layer comprises k neurons and outputs the action value function:
Q_e = Relu(Relu(s_t × w_1 + b_1) × w_2 + b_2)    (1)
In formula (1), w_1 and b_1 are the weights and biases of the hidden layer, w_2 and b_2 are the weights and biases of the output layer, and Q_e, the output value of the output layer, gives the current Q values of all actions obtained by the deep neural network;
step 6: defining a reward function for deep reinforcement learning:
r_h = … (piecewise reward function for the high-speed state; given as an image in the original)    (2)
r_l = … (piecewise reward function for the low-speed state; given as an image in the original)    (3)
In formulae (2) and (3), r_h is the reward value in the high-speed state of the vehicle, r_l is the reward value in the low-speed state of the vehicle, dis is the relative distance between the ego vehicle and the preceding vehicle, Vf is the speed of the preceding vehicle, x is the lower limit of the relative distance, y is the upper limit of the relative distance, mid is the switching threshold of the reward function with respect to the relative distance, lim is the switching threshold of the reward function with respect to the difference between the ego vehicle speed and the preceding vehicle speed, z is the switching threshold of the reward function with respect to the preceding vehicle speed, and u is the lower limit of the preceding vehicle speed;
step 7: defining an experience pool priority extraction rule;
The current Q value Q_e and the target Q value Q_t stored in the experience pool are differenced, and according to the SumTree algorithm the difference values are used to rank all parameter tuples stored in the experience pool by priority; the ranked parameter tuples are obtained and the first bs parameter tuples are extracted from them;
The weight ISW of the extracted first bs parameter tuples is obtained using equation (4):
ISW = (p_k / min(p))^(-β)    (4)
In formula (4), p_k is the priority value of the k-th parameter tuple, min(p) is the minimum priority value among the extracted first bs parameter tuples, and β is a weight growth coefficient whose value gradually converges from 0 to 1 as the number of extractions increases;
step 8: defining a greedy strategy;
A random number η between 0 and 1 is generated and compared with epsilon-greedy: if η ≤ epsilon-greedy, the action corresponding to the maximum Q value in Q_e is selected as the action executed by the vehicle; otherwise, an action is selected at random as the action executed by the vehicle;
step 9: creating an experience pool D for storing the state, action and reward information of the vehicle at each moment;
For the state s_t at time t, all action value functions are obtained through the deep neural network and the action a_t is selected with the greedy strategy and then executed by the vehicle;
Executing action a_t in state s_t yields the state parameter s_(t+1) at time t+1 and the reward value r_t at time t; these are stored in the experience pool D as the parameter tuple (s_t, a_t, r_t, s_(t+1));
step 10: constructing a target neural network with the same structure as the deep neural network;
Using the preferential extraction rule, bs parameter tuples are obtained from the experience pool D, and the state s_(t+1) at time t+1 is input into the target neural network, giving:
Q_ne = Relu(Relu(s_(t+1) × w_1′ + b_1′) × w_2′ + b_2′)    (5)
In formula (5), Q_ne, the output value of the output layer of the target neural network, gives the Q values of all actions obtained by the target neural network; w_1′ and w_2′ are the weights of the hidden layer and the output layer of the target neural network, respectively, and b_1′ and b_2′ are the biases of the hidden layer and the output layer of the target neural network, respectively;
step 11: establishing the target Q value Q_t;
The probability distribution π(a|s) of action a executed in state s is defined by equation (6):
π(a|s) = P(a_t = a | s_t = s)    (6)
In formula (6), P denotes the conditional probability;
The state value function v_π(s) is obtained using equation (7):
v_π(s) = E_π(r_t + γ r_(t+1) + γ^2 r_(t+2) + … | s_t = s)    (7)
In formula (7), γ is the reward attenuation factor and E_π denotes the expectation;
The probability P_(ss′)^a of moving to the next state s′ when action a_t is executed at time t is obtained by equation (8):
P_(ss′)^a = P(s_(t+1) = s′ | s_t = s, a_t = a)    (8)
The action value function q_π(s, a) is obtained using formula (9):
q_π(s, a) = R_s^a + γ Σ_(s′∈S) P_(ss′)^a v_π(s′)    (9)
In formula (9), R_s^a denotes the reward value obtained by the vehicle after executing action a in state s, and v_π(s′) denotes the state value function of the vehicle in state s′;
The target Q value Q_t is obtained by formula (10):
Q_t = r_t + γ max(Q_ne)    (10)
Step 12: the loss function loss is constructed using equation (11):
loss = ISW × (Q_t - Q_e)^2    (11)
Gradient descent is applied to the loss function loss so as to update the deep neural network parameters w_1, w_2, b_1, b_2;
The parameters w_1′, w_2′, b_1′, b_2′ of the target neural network are updated at the update frequency rt, and the updated values are taken from the deep neural network;
step 13: assigning t+1 to t and judging whether t ≤ c; if so, returning to step 9 to continue training; otherwise, judging whether the loss value gradually decreases and tends to converge; if so, a trained deep neural network has been obtained; otherwise, making t equal to c+1, increasing the number of network iterations, and returning to step 9;
step 14: inputting the real-time state parameter information of the vehicle into the trained deep neural network to obtain the output action, so that the corresponding action is executed by the vehicle to complete longitudinal multi-state control.
CN202110267799.0A 2021-03-11 2021-03-11 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction Active CN112861269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110267799.0A CN112861269B (en) 2021-03-11 2021-03-11 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110267799.0A CN112861269B (en) 2021-03-11 2021-03-11 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction

Publications (2)

Publication Number Publication Date
CN112861269A true CN112861269A (en) 2021-05-28
CN112861269B CN112861269B (en) 2022-08-30

Family

ID=75994127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110267799.0A Active CN112861269B (en) 2021-03-11 2021-03-11 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction

Country Status (1)

Country Link
CN (1) CN112861269B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113715842A (en) * 2021-08-24 2021-11-30 华中科技大学 High-speed moving vehicle control method based on simulation learning and reinforcement learning
CN113734170A (en) * 2021-08-19 2021-12-03 崔建勋 Automatic driving lane change decision-making method based on deep Q learning
CN114527642A (en) * 2022-03-03 2022-05-24 东北大学 AGV automatic PID parameter adjusting method based on deep reinforcement learning
CN115303290A (en) * 2022-10-09 2022-11-08 北京理工大学 System key level switching method and system of vehicle hybrid key level system

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190220744A1 (en) * 2018-01-17 2019-07-18 Hengshuai Yao Method of generating training data for training a neural network, method of training a neural network and using neural network for autonomous operations
CN110450771A (en) * 2019-08-29 2019-11-15 合肥工业大学 A kind of intelligent automobile stability control method based on deeply study
CN110716550A (en) * 2019-11-06 2020-01-21 南京理工大学 Gear shifting strategy dynamic optimization method based on deep reinforcement learning
CN110716562A (en) * 2019-09-25 2020-01-21 南京航空航天大学 Decision-making method for multi-lane driving of unmanned vehicle based on reinforcement learning
US20200033868A1 (en) * 2018-07-27 2020-01-30 GM Global Technology Operations LLC Systems, methods and controllers for an autonomous vehicle that implement autonomous driver agents and driving policy learners for generating and improving policies based on collective driving experiences of the autonomous driver agents
CN110850720A (en) * 2019-11-26 2020-02-28 国网山东省电力公司电力科学研究院 DQN algorithm-based area automatic power generation dynamic control method
CN110969848A (en) * 2019-11-26 2020-04-07 武汉理工大学 Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
US20200265305A1 (en) * 2017-10-27 2020-08-20 Deepmind Technologies Limited Reinforcement learning using distributed prioritized replay
CN111605565A (en) * 2020-05-08 2020-09-01 昆山小眼探索信息科技有限公司 Automatic driving behavior decision method based on deep reinforcement learning
CN111985614A (en) * 2020-07-23 2020-11-24 中国科学院计算技术研究所 Method, system and medium for constructing automatic driving decision system
CN112162555A (en) * 2020-09-23 2021-01-01 燕山大学 Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet
CN112406867A (en) * 2020-11-19 2021-02-26 清华大学 Emergency vehicle hybrid lane change decision method based on reinforcement learning and avoidance strategy

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200265305A1 (en) * 2017-10-27 2020-08-20 Deepmind Technologies Limited Reinforcement learning using distributed prioritized replay
US20190220744A1 (en) * 2018-01-17 2019-07-18 Hengshuai Yao Method of generating training data for training a neural network, method of training a neural network and using neural network for autonomous operations
US20200033868A1 (en) * 2018-07-27 2020-01-30 GM Global Technology Operations LLC Systems, methods and controllers for an autonomous vehicle that implement autonomous driver agents and driving policy learners for generating and improving policies based on collective driving experiences of the autonomous driver agents
CN110450771A (en) * 2019-08-29 2019-11-15 合肥工业大学 A kind of intelligent automobile stability control method based on deeply study
CN110716562A (en) * 2019-09-25 2020-01-21 南京航空航天大学 Decision-making method for multi-lane driving of unmanned vehicle based on reinforcement learning
CN110716550A (en) * 2019-11-06 2020-01-21 南京理工大学 Gear shifting strategy dynamic optimization method based on deep reinforcement learning
CN110850720A (en) * 2019-11-26 2020-02-28 国网山东省电力公司电力科学研究院 DQN algorithm-based area automatic power generation dynamic control method
CN110969848A (en) * 2019-11-26 2020-04-07 武汉理工大学 Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
CN111605565A (en) * 2020-05-08 2020-09-01 昆山小眼探索信息科技有限公司 Automatic driving behavior decision method based on deep reinforcement learning
CN111985614A (en) * 2020-07-23 2020-11-24 中国科学院计算技术研究所 Method, system and medium for constructing automatic driving decision system
CN112162555A (en) * 2020-09-23 2021-01-01 燕山大学 Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet
CN112406867A (en) * 2020-11-19 2021-02-26 清华大学 Emergency vehicle hybrid lane change decision method based on reinforcement learning and avoidance strategy

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DEY, K.C.;LI YAN: "A review of communication, driver characteristics, and controls aspects of cooperative adaptive cruise control", 《IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS》 *
王文飒, 梁军, 陈龙, 陈小波, 朱宁, 华国栋: "Cooperative adaptive cruise control based on deep reinforcement learning", 《Journal of Transport Information and Safety》 *
黄鹤, 郭伟锋, 梅炜炜, 张润, 程进, 张炳力: "Automatic parking control strategy based on deep reinforcement learning", 《Proceedings of the 2020 Annual Conference of the China Society of Automotive Engineers》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113734170A (en) * 2021-08-19 2021-12-03 崔建勋 Automatic driving lane change decision-making method based on deep Q learning
CN113734170B (en) * 2021-08-19 2023-10-24 崔建勋 Automatic driving lane change decision method based on deep Q learning
CN113715842A (en) * 2021-08-24 2021-11-30 华中科技大学 High-speed moving vehicle control method based on simulation learning and reinforcement learning
CN114527642A (en) * 2022-03-03 2022-05-24 东北大学 AGV automatic PID parameter adjusting method based on deep reinforcement learning
CN114527642B (en) * 2022-03-03 2024-04-02 东北大学 Method for automatically adjusting PID parameters by AGV based on deep reinforcement learning
CN115303290A (en) * 2022-10-09 2022-11-08 北京理工大学 System key level switching method and system of vehicle hybrid key level system
CN115303290B (en) * 2022-10-09 2022-12-06 北京理工大学 System key level switching method and system of vehicle hybrid key level system

Also Published As

Publication number Publication date
CN112861269B (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN112861269B (en) Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction
CN111898211B (en) Intelligent vehicle speed decision method based on deep reinforcement learning and simulation method thereof
CN107229973B (en) Method and device for generating strategy network model for automatic vehicle driving
CN109213148B (en) Vehicle low-speed following decision method based on deep reinforcement learning
CN111845741B (en) Automatic driving decision control method and system based on hierarchical reinforcement learning
CN112249032B (en) Automatic driving decision method, system, equipment and computer storage medium
EP3725627B1 (en) Method for generating vehicle control command, and vehicle controller and storage medium
CN106740846A (en) A kind of electric automobile self-adapting cruise control method of double mode switching
CN107168303A (en) A kind of automatic Pilot method and device of automobile
US12005922B2 (en) Toward simulation of driver behavior in driving automation
CN113954837B (en) Deep learning-based lane change decision-making method for large-scale commercial vehicle
CN113276884B (en) Intelligent vehicle interactive decision passing method and system with variable game mode
CN112668779A (en) Preceding vehicle motion state prediction method based on self-adaptive Gaussian process
Yu et al. Autonomous overtaking decision making of driverless bus based on deep Q-learning method
CN112201070B (en) Deep learning-based automatic driving expressway bottleneck section behavior decision method
JP7415471B2 (en) Driving evaluation device, driving evaluation system, in-vehicle device, external evaluation device, and driving evaluation program
CN113160585A (en) Traffic light timing optimization method, system and storage medium
CN113722835A (en) Modeling method for anthropomorphic random lane change driving behavior
CN113954855B (en) Self-adaptive matching method for automobile driving mode
CN108569268A (en) Vehicle collision avoidance parameter calibration method and device, vehicle control device, storage medium
CN112542061B (en) Lane borrowing and overtaking control method, device and system based on Internet of vehicles and storage medium
CN113033902A (en) Automatic driving track-changing planning method based on improved deep learning
WO2023004698A1 (en) Method for intelligent driving decision-making, vehicle movement control method, apparatus, and vehicle
CN114148349B (en) Vehicle personalized following control method based on generation of countermeasure imitation study
CN116306800A (en) Intelligent driving decision learning method based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant