CN112861269A - Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction

Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction

Info

Publication number
CN112861269A
Authority
CN
China
Prior art keywords
vehicle
state
value
neural network
action
Prior art date
Legal status
Granted
Application number
CN202110267799.0A
Other languages
Chinese (zh)
Other versions
CN112861269B (en)
Inventor
黄鹤
吴润晨
张峰
王博文
于海涛
汤德江
张炳力
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202110267799.0A priority Critical patent/CN112861269B/en
Publication of CN112861269A publication Critical patent/CN112861269A/en
Application granted granted Critical
Publication of CN112861269B publication Critical patent/CN112861269B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/10Geometric CAD
    • G06F30/15Vehicle, aircraft or watercraft design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/10Geometric CAD
    • G06F30/17Mechanical parametric or variational design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2119/00Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/14Force analysis or force optimisation, e.g. static or dynamic forces
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Geometry (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Automation & Control Theory (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Artificial Intelligence (AREA)
  • Traffic Control Systems (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction, which comprises the following steps: 1, defining a state parameter set s and a control parameter set a for driving the automobile; 2, initializing the deep reinforcement learning parameters and constructing a deep neural network; 3, defining a deep reinforcement learning reward function and a priority extraction rule; 4, training the deep neural network and obtaining an optimal network model; 5, obtaining the state parameter s_t of the automobile at time t and inputting it into the optimal network model to obtain an output a_t, which is then executed by the automobile. By combining a priority extraction algorithm with a deep reinforcement learning control method, the invention accomplishes longitudinal multi-state driving of the automobile, so that the automobile drives with higher safety and traffic accidents are reduced.

Description

Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction
Technical Field
The invention relates to the technical field of intelligent automobile longitudinal multi-state control, in particular to an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction.
Background
With the rapid development of the urban economy and the continuous improvement of people's living standards, the number of motor vehicles in cities has increased greatly. Automobiles have become an indispensable means of transportation, but the speed and convenience they bring are accompanied by a series of safety problems. Because of drivers' limited skill, uncontrollable external factors and other causes, traffic accidents such as collisions between two or more vehicles occur frequently on the road, causing loss of life and property and seriously obstructing traffic. With the continuous development of automotive technology, many automobile manufacturers have introduced adaptive cruise control systems, emergency braking systems and the like. An adaptive cruise control system acquires data on the road ahead with sensors such as radar and, according to a corresponding algorithm, keeps a certain distance from the preceding vehicle while maintaining a certain speed; however, it usually activates only at higher speeds, for example above 25 km/h, and below that speed the driver must take over manual control. An emergency braking system is a technology that brakes actively to avoid an accident when the vehicle is driving outside the adaptive cruise state and encounters an emergency ahead, such as the preceding vehicle stopping suddenly or a pedestrian appearing suddenly; however, because of sensor misjudgment, environmental errors and related causes, it cannot be applied to all driving environments and may still lead to dangerous accidents.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction, which completes longitudinal multi-state driving of the automobile by combining a priority extraction algorithm with a deep reinforcement learning control method, so that the automobile drives with higher safety and traffic accidents are reduced.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention relates to an automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction, which is characterized by comprising the following steps of:
step 1: establishing a vehicle dynamic model and a vehicle running environment model;
step 2: acquiring automobile running data in a real driving scene as initialization data, wherein the automobile running data is initial state information of a vehicle and initial control parameter information of the vehicle;
step 3: defining the state information set s = {s_0, s_1, …, s_t, …, s_n} of the vehicle, where s_0 denotes the initial state information of the vehicle and s_t denotes the state reached after the vehicle, in state s_(t-1), executes control action a_(t-1) at time t-1; s_t = {Ax_t, e_t, Ve_t}, where Ax_t denotes the longitudinal acceleration of the vehicle at time t, e_t denotes the deviation of the relative distance between the ego vehicle and the preceding vehicle at time t, and Ve_t denotes the difference between the ego vehicle speed and the preceding vehicle speed;
Defining the control parameter set a = {a_0, a_1, …, a_t, …, a_n} of the vehicle, where a_0 denotes the initial control parameter information of the vehicle and a_t denotes the action executed by the vehicle in state s_t, i.e. at time t; a_t = {T_t, B_t}, where T_t denotes the throttle opening of the vehicle at time t and B_t denotes the brake master cylinder pressure of the vehicle at time t; t = 1, 2, …, c, where c denotes the total training duration;
step 4: initializing parameters, including the time t, the greedy probability epsilon-greedy, the experience pool size ms, the target network update frequency rt, the number bs of preferentially extracted data, and the reward attenuation factor γ;
step 5: constructing a deep neural network and randomly initializing its parameters: the weights w and the biases b;
The deep neural network comprises an input layer, a hidden layer and an output layer. The input layer comprises m neurons and receives the state s_t of the vehicle at time t; the hidden layer comprises n neurons, processes the state information from the input layer with the Relu activation function and passes it to the output layer; the output layer comprises k neurons and outputs the action value function:
Q_e = Relu(Relu(s_t × w_1 + b_1) × w_2 + b_2)    (1)
In formula (1), w_1 and b_1 are the weights and biases of the hidden layer, w_2 and b_2 are the weights and biases of the output layer, and Q_e, the output value of the output layer, gives the current Q values of all actions obtained by the deep neural network;
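For illustration only, a minimal Python sketch of the forward pass in formula (1) follows; the layer sizes m = 3, n = 64 and k (the number of discrete throttle/brake actions), and the use of PyTorch, are assumptions made for the example rather than values fixed by the invention.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Sketch of formula (1): Q_e = Relu(Relu(s_t x w_1 + b_1) x w_2 + b_2).
    # m = 3 inputs (Ax_t, e_t, Ve_t), n hidden neurons, k output actions (assumed sizes).
    def __init__(self, m: int = 3, n: int = 64, k: int = 9):
        super().__init__()
        self.hidden = nn.Linear(m, n)   # weights w_1, biases b_1
        self.out = nn.Linear(n, k)      # weights w_2, biases b_2

    def forward(self, s_t: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.hidden(s_t))   # hidden layer with Relu activation
        return torch.relu(self.out(h))     # output layer: current Q values of all actions

# usage: q_values = QNetwork()(torch.tensor([[0.3, 1.2, -0.5]]))  # one state s_t = {Ax_t, e_t, Ve_t}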
step 6: defining a reward function for deep reinforcement learning:
r_h = … (piecewise reward function for the high-speed state; given as an image in the original)    (2)
r_l = … (piecewise reward function for the low-speed state; given as an image in the original)    (3)
In formulae (2) and (3), r_h is the reward value in the high-speed state of the vehicle, r_l is the reward value in the low-speed state of the vehicle, dis is the relative distance between the ego vehicle and the preceding vehicle, Vf is the speed of the preceding vehicle, x is the lower limit of the relative distance, y is the upper limit of the relative distance, mid is the switching threshold of the reward function with respect to the relative distance, lim is the switching threshold of the reward function with respect to the difference between the ego vehicle speed and the preceding vehicle speed, z is the switching threshold of the reward function with respect to the preceding vehicle speed, and u is the lower limit of the preceding vehicle speed;
step 7: defining an experience pool priority extraction rule;
The current Q value Q_e and the target Q value Q_t stored in the experience pool are differenced, and according to the SumTree algorithm the difference values are used to rank all parameter tuples stored in the experience pool by priority; the ranked parameter tuples are obtained and the first bs parameter tuples are extracted from them;
The weight ISW of the extracted first bs parameter tuples is obtained using equation (4):
ISW = (p_k / min(p))^(-β)    (4)
In formula (4), p_k is the priority value of the k-th parameter tuple, min(p) is the minimum priority value among the extracted first bs parameter tuples, and β is a weight growth coefficient whose value gradually converges from 0 to 1 as the number of extractions increases;
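A minimal Python sketch of this extraction rule is given below. It follows the text literally: priorities are the absolute differences |Q_t - Q_e|, the tuples are ranked and the first bs are taken (a SumTree only makes this ranking and drawing more efficient), and the weights follow the ISW expression reconstructed above; the small constant added to the priorities is an assumption to avoid division by zero.

import numpy as np

def extract_priority_batch(pool, q_e, q_t, bs, beta):
    # Sketch of step 7: rank stored tuples (s_t, a_t, r_t, s_t+1) by |Q_t - Q_e|,
    # take the bs highest-priority tuples, and weight them by ISW = (p_k / min(p))**(-beta).
    priorities = np.abs(np.asarray(q_t) - np.asarray(q_e)) + 1e-6  # one priority per stored tuple
    order = np.argsort(-priorities)             # descending priority (plain sort instead of a SumTree)
    idx = order[:bs]                            # the first bs parameter tuples
    p_batch = priorities[idx]
    isw = (p_batch / p_batch.min()) ** (-beta)  # importance weights; beta is annealed from 0 to 1
    return [pool[i] for i in idx], isw, idx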
step 8: defining a greedy strategy;
A random number η between 0 and 1 is generated and compared with epsilon-greedy: if η ≤ epsilon-greedy, the action corresponding to the maximum Q value in Q_e is selected as the action executed by the vehicle; otherwise, an action is selected at random as the action executed by the vehicle;
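A short Python sketch of this selection rule, written exactly as stated (here epsilon-greedy is the probability of taking the greedy action, not the exploration probability), might look as follows; the list-based representation of Q_e is an assumption for the example.

import random

def select_action(q_e, epsilon_greedy):
    # Step 8 as written: with probability epsilon_greedy pick the action with the
    # largest value in Q_e, otherwise pick a random action.
    eta = random.random()                                      # random number between 0 and 1
    if eta <= epsilon_greedy:
        return max(range(len(q_e)), key=lambda i: q_e[i])      # greedy action
    return random.randrange(len(q_e))                          # random exploratory action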
step 9: creating an experience pool D for storing the state, action and reward information of the vehicle at each moment;
For the state s_t at time t, all action value functions are obtained through the deep neural network and the action a_t is selected with the greedy strategy and then executed by the vehicle;
Executing action a_t in state s_t yields the state parameter s_(t+1) at time t+1 and the reward value r_t at time t; these are stored in the experience pool D as the parameter tuple (s_t, a_t, r_t, s_(t+1));
step 10: constructing a target neural network with the same structure as the deep neural network;
Using the preferential extraction rule, bs parameter tuples are obtained from the experience pool D, and the state s_(t+1) at time t+1 is input into the target neural network, giving:
Q_ne = Relu(Relu(s_(t+1) × w_1′ + b_1′) × w_2′ + b_2′)    (5)
In formula (5), Q_ne, the output value of the output layer of the target neural network, gives the Q values of all actions obtained by the target neural network; w_1′ and w_2′ are the weights of the hidden layer and the output layer of the target neural network, respectively, and b_1′ and b_2′ are the biases of the hidden layer and the output layer of the target neural network, respectively;
step 11: establishing the target Q value Q_t;
The probability distribution π(a|s) of action a executed in state s is defined by equation (6):
π(a|s) = P(a_t = a | s_t = s)    (6)
In formula (6), P denotes the conditional probability;
The state value function v_π(s) is obtained using equation (7):
v_π(s) = E_π(r_t + γ r_(t+1) + γ^2 r_(t+2) + … | s_t = s)    (7)
In formula (7), γ is the reward attenuation factor and E_π denotes the expectation;
The probability P_(ss′)^a of moving to the next state s′ when action a_t is executed at time t is obtained by equation (8):
P_(ss′)^a = P(s_(t+1) = s′ | s_t = s, a_t = a)    (8)
The action value function q_π(s, a) is obtained using formula (9):
q_π(s, a) = R_s^a + γ Σ_(s′∈S) P_(ss′)^a v_π(s′)    (9)
In formula (9), R_s^a denotes the reward value obtained by the vehicle after executing action a in state s, and v_π(s′) denotes the state value function of the vehicle in state s′;
The target Q value Q_t is obtained by formula (10):
Q_t = r_t + γ max(Q_ne)    (10)
Step 12: the loss function loss is constructed using equation (11):
loss = ISW × (Q_t - Q_e)^2    (11)
Gradient descent is applied to the loss function loss so as to update the deep neural network parameters w_1, w_2, b_1, b_2;
The parameters w_1′, w_2′, b_1′, b_2′ of the target neural network are updated at the update frequency rt, and the updated values are taken from the deep neural network;
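The update in steps 10 to 12 can be sketched in Python/PyTorch as below; averaging the weighted loss over the batch, the tensor layout of the batch, and the use of an external optimizer are assumptions of the example rather than details fixed by the invention.

import torch

def update_step(q_net, target_net, optimizer, batch, isw, gamma, step, rt):
    # Sketch of steps 10-12: target Q from formula (10), weighted loss from formula (11),
    # one gradient-descent step on the deep network, and a periodic copy into the target network.
    s_t, a_t, r_t, s_t1 = batch                                   # stacked tensors (assumed layout)
    q_e = q_net(s_t).gather(1, a_t.unsqueeze(1)).squeeze(1)       # current Q of the executed actions
    with torch.no_grad():
        q_t = r_t + gamma * target_net(s_t1).max(dim=1).values    # Q_t = r_t + gamma * max(Q_ne)
    weights = torch.as_tensor(isw, dtype=torch.float32)
    loss = (weights * (q_t - q_e) ** 2).mean()                    # loss = ISW * (Q_t - Q_e)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % rt == 0:                                            # update the target network every rt steps
        target_net.load_state_dict(q_net.state_dict())
    return loss.item()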
step 13: assigning t+1 to t and judging whether t ≤ c; if so, returning to step 9 to continue training; otherwise, judging whether the loss value gradually decreases and tends to converge; if so, a trained deep neural network has been obtained; otherwise, making t equal to c+1, increasing the number of network iterations, and returning to step 9;
step 14: inputting the real-time state parameter information of the vehicle into the trained deep neural network to obtain the output action, so that the corresponding action is executed by the vehicle to complete longitudinal multi-state control.
Compared with the prior art, the invention has the beneficial effects that:
1. Compared with traditional automobile longitudinal control methods, the control method has better control smoothness under different working conditions and better control stability under extreme working conditions, and is suitable for multi-state control of the automobile at high, medium and low speed;
2. The deep reinforcement learning of the invention uses a trained deep neural network, and the corresponding action can be executed simply by inputting the state information of the automobile; compared with complex traditional automobile control, it is simpler and faster and has a relatively better control effect;
3. Compared with common reinforcement learning, the deep reinforcement learning of the invention processes the input state parameter information with a neural network instead of storing a large amount of tabular data, which greatly saves memory, and the neural network training is more efficient and converges better than common iterative methods;
4. Compared with the harsh switching of traditional automobile multi-state control methods, the data priority extraction method adopted by the invention can rank the data in the experience pool by priority and integrate the parameter information of the automobile in various states, which greatly shortens the training time, unifies the multi-state control of the automobile, removes the need for complicated switching between control methods, and gives a better control effect.
Detailed Description
In this embodiment, an automobile longitudinal multi-state control method based on deep reinforcement learning and preferential extraction decides the throttle opening and the master cylinder pressure of the automobile at each moment according to the real-time state parameters of the automobile, so as to complete the multi-state control of following and adaptive cruise in the high-speed state, emergency braking in the medium-speed state, and start-stop in the low-speed state, specifically according to the following steps:
step 1: establishing a vehicle dynamic model and a vehicle running environment model by utilizing carsim software;
step 2: acquiring automobile driving data in a real driving scene and taking the automobile driving data as initialization data, wherein the automobile driving data is initial state information of a vehicle and initial control parameter information of the vehicle;
step 3: defining the state information set s = {s_0, s_1, …, s_t, …, s_n} of the vehicle, where s_0 denotes the initial state information of the vehicle and s_t denotes the state reached after the vehicle, in state s_(t-1), executes control action a_(t-1) at time t-1; s_t = {Ax_t, e_t, Ve_t}, where Ax_t denotes the longitudinal acceleration of the vehicle at time t, in m/s^2, e_t denotes the deviation of the relative distance between the ego vehicle and the preceding vehicle at time t, and Ve_t denotes the difference between the ego vehicle speed and the preceding vehicle speed;
Defining the control parameter set a = {a_0, a_1, …, a_t, …, a_n} of the vehicle, where a_0 denotes the initial control parameter information of the vehicle and a_t denotes the action executed by the vehicle in state s_t, i.e. at time t; a_t = {T_t, B_t}, where T_t denotes the throttle opening of the vehicle at time t and B_t denotes the brake master cylinder pressure of the vehicle at time t, in MPa; t = 1, 2, …, c, where c denotes the total training duration;
step 4: initializing parameters, including the time t, the greedy probability epsilon-greedy, the experience pool size ms, the target network update frequency rt, the number bs of preferentially extracted data, and the reward attenuation factor γ;
step 5: constructing a deep neural network and randomly initializing its parameters: the weights w and the biases b;
The deep neural network comprises an input layer, a hidden layer and an output layer. The input layer comprises m neurons and receives the state s_t of the vehicle at time t; the hidden layer comprises n neurons, processes the state information from the input layer with the Relu activation function and passes it to the output layer; the output layer comprises k neurons and outputs the action value function;
For the hidden layer:
l = Relu((s_t × w_1) + b_1)    (1)
In formula (1), w_1 and b_1 are the weights and biases of the hidden layer;
For the output layer:
out = Relu((l × w_2) + b_2)    (2)
In formula (2), w_2 and b_2 are the weights and biases of the output layer;
Combining formula (1) and formula (2) gives:
Q_e = Relu(Relu(s_t × w_1 + b_1) × w_2 + b_2)    (3)
In formula (3), Q_e, the output value of the output layer, gives the current Q values of all actions obtained through the deep neural network;
step 6: defining a deep reinforcement learning reward function, wherein the design of the reward function is an important component of a deep reinforcement learning algorithm, the updating and convergence of the neural network weight and bias depend on the quality of the design of the reward function, and the reward function is defined as follows:
r_h = … (piecewise reward function for the high-speed state; given as an image in the original)    (4)
r_l = … (piecewise reward function for the low-speed state; given as an image in the original)    (5)
In formulae (4) and (5), r_h is the reward value in the high-speed state of the vehicle and r_l is the reward value in the low-speed state of the vehicle; the condition for switching between them is whether the vehicle speed reaches 25 km/h: if the vehicle speed reaches or exceeds 25 km/h, high-speed control of the vehicle is performed to complete the corresponding following and adaptive cruise, and if the vehicle speed is below 25 km/h, medium- and low-speed control of the vehicle is performed to complete the corresponding emergency braking and start-stop operations; dis is the relative distance between the ego vehicle and the preceding vehicle, in m; Vf is the speed of the preceding vehicle, in km/h; x is the lower limit of the relative distance, in m; y is the upper limit of the relative distance, in m; mid is the switching threshold of the reward function with respect to the relative distance, in m; lim is the switching threshold of the reward function with respect to the difference between the ego vehicle speed and the preceding vehicle speed, in km/h; z is the switching threshold of the reward function with respect to the preceding vehicle speed, in km/h; and u is the lower limit of the preceding vehicle speed, in km/h;
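The high/low-speed switching described above can be illustrated with the minimal Python sketch below. Only the 25 km/h mode switch is taken from the text; the two piecewise reward functions r_h and r_l themselves are given as images in the original, so they appear here only as placeholder callables supplied by the caller.

def reward(v_ego, v_front, dis, high_speed_reward, low_speed_reward):
    # Mode switch for formulae (4)/(5): at or above 25 km/h the high-speed reward r_h
    # (following / adaptive cruise) applies; below it the low-speed reward r_l
    # (emergency braking / start-stop) applies. The reward bodies are placeholders.
    if v_ego >= 25.0:                                   # vehicle speed in km/h
        return high_speed_reward(dis, v_ego, v_front)   # r_h, formula (4)
    return low_speed_reward(dis, v_ego, v_front)        # r_l, formula (5)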
step 7: defining an experience pool priority extraction rule;
Under normal conditions, the vehicle rarely encounters states in the environment that yield a large reward value; the reward values of the other states are very small, are not worth learning and contribute little to iterating the neural network parameters, so in an environment with only a small number of large-reward states the learning time is greatly increased and the effect is poor;
By using the experience pool priority extraction method, the small number of state samples that are worth learning can be given due weight;
Specifically, when the current and target state parameters are stored in the experience pool, the current Q value Q_e and the target Q value Q_t stored in the experience pool are differenced, and according to the SumTree algorithm the difference values are used to rank all parameter tuples stored in the experience pool by priority; the ranked parameter tuples are obtained and the first bs parameter tuples are extracted from them;
The weight ISW of the extracted first bs parameter tuples is obtained using equation (6):
ISW = (p_k / min(p))^(-β)    (6)
In formula (6), p_k is the priority value of the k-th parameter tuple, min(p) is the minimum priority value among the extracted first bs parameter tuples, and β is a weight growth coefficient whose value gradually converges from 0 to 1 as the number of extractions increases;
Using the experience pool priority extraction method effectively avoids ineffective training, greatly shortens the training time, and gives a better training effect;
step 8: defining a greedy strategy;
A random number η between 0 and 1 is generated and compared with epsilon-greedy: if η ≤ epsilon-greedy, the action corresponding to the maximum Q value in Q_e is selected as the action executed by the vehicle; otherwise, an action is selected at random as the action executed by the vehicle;
step 9: creating an experience pool D for storing the state, action and reward information of the vehicle at each moment; by means of experience replay, deep reinforcement learning handles the problems of data correlation and non-stationary distributions;
For the state s_t at time t, all action value functions are obtained through the deep neural network and the action a_t is selected with the greedy strategy and then executed by the vehicle;
Executing action a_t in state s_t yields the state parameter s_(t+1) at time t+1 and the reward value r_t at time t; these are stored in the experience pool D as the parameter tuple (s_t, a_t, r_t, s_(t+1));
step 10: constructing a target neural network with the same structure as the deep neural network;
Using the preferential extraction rule, bs parameter tuples are obtained from the experience pool D, and the state s_(t+1) at time t+1 is input into the target neural network, giving:
Q_ne = Relu(Relu(s_(t+1) × w_1′ + b_1′) × w_2′ + b_2′)    (7)
In formula (7), Q_ne, the output value of the output layer of the target neural network, gives the Q values of all actions obtained by the target neural network; w_1′ and w_2′ are the weights of the hidden layer and the output layer of the target neural network, respectively, and b_1′ and b_2′ are the biases of the hidden layer and the output layer of the target neural network, respectively;
step 11: establishing the target Q value Q_t;
The action of the vehicle in a given state is not deterministic, and a conditional probability is needed to select the action to be executed; this conditional probability is defined as follows:
π(a|s) = P(a_t = a | s_t = s)    (8)
In equation (8), π(a|s) denotes the probability distribution of action a executed by the vehicle in state s, and P denotes the conditional probability;
The state value function v_π(s) is obtained using equation (9):
v_π(s) = E_π(r_t + γ r_(t+1) + γ^2 r_(t+2) + … | s_t = s)    (9)
In formula (9), E_π denotes the expectation and γ denotes the reward attenuation factor, which takes a value between 0 and 1. When γ = 0, v_π(s) = E_π(r_t | s_t = s), and the state value function is determined only by the reward value of the current state, independent of subsequent states; when γ = 1, v_π(s) = E_π(r_t + r_(t+1) + r_(t+2) + … | s_t = s), and the state value function is determined by the reward values of the current state and all subsequent states. As γ tends to 0 the current reward is emphasized more, and as γ tends to 1 the subsequent rewards are given more consideration;
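A small worked illustration of formula (9) for a single sampled reward sequence follows (the expectation E_π is over many such rollouts); the three-step reward sequence used in the comments is a made-up example.

def discounted_return(rewards, gamma):
    # Formula (9) for one rollout: r_t + gamma*r_(t+1) + gamma^2*r_(t+2) + ...
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# For rewards = [1.0, 1.0, 1.0]:
#   gamma = 0   -> 1.0                     (only the current reward counts)
#   gamma = 1   -> 3.0                     (all rewards count equally)
#   gamma = 0.9 -> 1.0 + 0.9 + 0.81 = 2.71 (later rewards are attenuated)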
The probability P_(ss′)^a of moving to the next state s′ when action a_t is executed at time t is obtained by equation (10):
P_(ss′)^a = P(s_(t+1) = s′ | s_t = s, a_t = a)    (10)
The action value function q_π(s, a) is obtained using equation (11):
q_π(s, a) = R_s^a + γ Σ_(s′∈S) P_(ss′)^a v_π(s′)    (11)
In formula (11), R_s^a denotes the reward value obtained by the vehicle after executing action a in state s, and v_π(s′) denotes the state value function of the vehicle in state s′;
The target Q value Q_t is obtained by formula (12):
Q_t = r_t + γ max(Q_ne)    (12)
Step 12: the loss function loss is constructed using equation (13):
loss = ISW × (Q_t - Q_e)^2    (13)
Gradient descent is applied to the loss function loss so as to update the parameters w_1, w_2, b_1, b_2 of the deep neural network;
The parameters w_1′, w_2′, b_1′, b_2′ of the target neural network are updated at the update frequency rt, and the updated values are taken from the deep neural network;
step 13: assigning t+1 to t and judging whether t ≤ c; if so, returning to step 9 to continue training; otherwise, judging whether the loss value gradually decreases and tends to converge; if so, a trained deep neural network has been obtained; otherwise, making t equal to c+1, increasing the number of network iterations, and returning to step 9;
step 14: the real-time state parameter information of the vehicle is input into the trained deep neural network and the output action is obtained, so that the corresponding action is executed by the vehicle to complete longitudinal high-speed, medium-speed and low-speed multi-state control.
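At deployment, step 14 reduces to a single forward pass per control cycle. A minimal Python sketch follows; the discrete action_table mapping an output index to a (T_t, B_t) pair is an assumption made for illustration, since the patent does not list the concrete action set.

import torch

def control_step(q_net, ax_t, e_t, ve_t, action_table):
    # Step 14: feed the real-time state s_t = {Ax_t, e_t, Ve_t} to the trained network,
    # take the action with the largest Q value, and map it to throttle opening T_t
    # and master cylinder pressure B_t.
    s_t = torch.tensor([[ax_t, e_t, ve_t]], dtype=torch.float32)
    with torch.no_grad():
        a_t = int(q_net(s_t).argmax(dim=1).item())
    throttle_T_t, brake_B_t = action_table[a_t]
    return throttle_T_t, brake_B_t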

Claims (1)

1. A longitudinal multi-state control method of an automobile based on deep reinforcement learning preferential extraction is characterized by comprising the following steps:
step 1: establishing a vehicle dynamic model and a vehicle running environment model;
step 2: acquiring automobile running data in a real driving scene as initialization data, wherein the automobile running data is initial state information of a vehicle and initial control parameter information of the vehicle;
step 3: defining the state information set s = {s_0, s_1, …, s_t, …, s_n} of the vehicle, where s_0 denotes the initial state information of the vehicle and s_t denotes the state reached after the vehicle, in state s_(t-1), executes control action a_(t-1) at time t-1; s_t = {Ax_t, e_t, Ve_t}, where Ax_t denotes the longitudinal acceleration of the vehicle at time t, e_t denotes the deviation of the relative distance between the ego vehicle and the preceding vehicle at time t, and Ve_t denotes the difference between the ego vehicle speed and the preceding vehicle speed;
Defining the control parameter set a = {a_0, a_1, …, a_t, …, a_n} of the vehicle, where a_0 denotes the initial control parameter information of the vehicle and a_t denotes the action executed by the vehicle in state s_t, i.e. at time t; a_t = {T_t, B_t}, where T_t denotes the throttle opening of the vehicle at time t and B_t denotes the brake master cylinder pressure of the vehicle at time t; t = 1, 2, …, c, where c denotes the total training duration;
step 4: initializing parameters, including the time t, the greedy probability epsilon-greedy, the experience pool size ms, the target network update frequency rt, the number bs of preferentially extracted data, and the reward attenuation factor γ;
step 5: constructing a deep neural network and randomly initializing its parameters: the weights w and the biases b;
The deep neural network comprises an input layer, a hidden layer and an output layer. The input layer comprises m neurons and receives the state s_t of the vehicle at time t; the hidden layer comprises n neurons, processes the state information from the input layer with the Relu activation function and passes it to the output layer; the output layer comprises k neurons and outputs the action value function:
Q_e = Relu(Relu(s_t × w_1 + b_1) × w_2 + b_2)    (1)
In formula (1), w_1 and b_1 are the weights and biases of the hidden layer, w_2 and b_2 are the weights and biases of the output layer, and Q_e, the output value of the output layer, gives the current Q values of all actions obtained by the deep neural network;
step 6: defining a reward function for deep reinforcement learning:
r_h = … (piecewise reward function for the high-speed state; given as an image in the original)    (2)
r_l = … (piecewise reward function for the low-speed state; given as an image in the original)    (3)
In formulae (2) and (3), r_h is the reward value in the high-speed state of the vehicle, r_l is the reward value in the low-speed state of the vehicle, dis is the relative distance between the ego vehicle and the preceding vehicle, Vf is the speed of the preceding vehicle, x is the lower limit of the relative distance, y is the upper limit of the relative distance, mid is the switching threshold of the reward function with respect to the relative distance, lim is the switching threshold of the reward function with respect to the difference between the ego vehicle speed and the preceding vehicle speed, z is the switching threshold of the reward function with respect to the preceding vehicle speed, and u is the lower limit of the preceding vehicle speed;
step 7: defining an experience pool priority extraction rule;
The current Q value Q_e and the target Q value Q_t stored in the experience pool are differenced, and according to the SumTree algorithm the difference values are used to rank all parameter tuples stored in the experience pool by priority; the ranked parameter tuples are obtained and the first bs parameter tuples are extracted from them;
The weight ISW of the extracted first bs parameter tuples is obtained using equation (4):
ISW = (p_k / min(p))^(-β)    (4)
In formula (4), p_k is the priority value of the k-th parameter tuple, min(p) is the minimum priority value among the extracted first bs parameter tuples, and β is a weight growth coefficient whose value gradually converges from 0 to 1 as the number of extractions increases;
step 8: defining a greedy strategy;
A random number η between 0 and 1 is generated and compared with epsilon-greedy: if η ≤ epsilon-greedy, the action corresponding to the maximum Q value in Q_e is selected as the action executed by the vehicle; otherwise, an action is selected at random as the action executed by the vehicle;
step 9: creating an experience pool D for storing the state, action and reward information of the vehicle at each moment;
For the state s_t at time t, all action value functions are obtained through the deep neural network and the action a_t is selected with the greedy strategy and then executed by the vehicle;
Executing action a_t in state s_t yields the state parameter s_(t+1) at time t+1 and the reward value r_t at time t; these are stored in the experience pool D as the parameter tuple (s_t, a_t, r_t, s_(t+1));
step 10: constructing a target neural network with the same structure as the deep neural network;
Using the preferential extraction rule, bs parameter tuples are obtained from the experience pool D, and the state s_(t+1) at time t+1 is input into the target neural network, giving:
Q_ne = Relu(Relu(s_(t+1) × w_1′ + b_1′) × w_2′ + b_2′)    (5)
In formula (5), Q_ne, the output value of the output layer of the target neural network, gives the Q values of all actions obtained by the target neural network; w_1′ and w_2′ are the weights of the hidden layer and the output layer of the target neural network, respectively, and b_1′ and b_2′ are the biases of the hidden layer and the output layer of the target neural network, respectively;
step 11: establishing the target Q value Q_t;
The probability distribution π(a|s) of action a executed in state s is defined by equation (6):
π(a|s) = P(a_t = a | s_t = s)    (6)
In formula (6), P denotes the conditional probability;
The state value function v_π(s) is obtained using equation (7):
v_π(s) = E_π(r_t + γ r_(t+1) + γ^2 r_(t+2) + … | s_t = s)    (7)
In formula (7), γ is the reward attenuation factor and E_π denotes the expectation;
The probability P_(ss′)^a of moving to the next state s′ when action a_t is executed at time t is obtained by equation (8):
P_(ss′)^a = P(s_(t+1) = s′ | s_t = s, a_t = a)    (8)
The action value function q_π(s, a) is obtained using formula (9):
q_π(s, a) = R_s^a + γ Σ_(s′∈S) P_(ss′)^a v_π(s′)    (9)
In formula (9), R_s^a denotes the reward value obtained by the vehicle after executing action a in state s, and v_π(s′) denotes the state value function of the vehicle in state s′;
The target Q value Q_t is obtained by formula (10):
Q_t = r_t + γ max(Q_ne)    (10)
Step 12: the loss function loss is constructed using equation (11):
loss = ISW × (Q_t - Q_e)^2    (11)
Gradient descent is applied to the loss function loss so as to update the deep neural network parameters w_1, w_2, b_1, b_2;
The parameters w_1′, w_2′, b_1′, b_2′ of the target neural network are updated at the update frequency rt, and the updated values are taken from the deep neural network;
step 13: assigning t+1 to t and judging whether t ≤ c; if so, returning to step 9 to continue training; otherwise, judging whether the loss value gradually decreases and tends to converge; if so, a trained deep neural network has been obtained; otherwise, making t equal to c+1, increasing the number of network iterations, and returning to step 9;
step 14: inputting the real-time state parameter information of the vehicle into the trained deep neural network to obtain the output action, so that the corresponding action is executed by the vehicle to complete longitudinal multi-state control.
CN202110267799.0A 2021-03-11 2021-03-11 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction Active CN112861269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110267799.0A CN112861269B (en) 2021-03-11 2021-03-11 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110267799.0A CN112861269B (en) 2021-03-11 2021-03-11 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction

Publications (2)

Publication Number Publication Date
CN112861269A true CN112861269A (en) 2021-05-28
CN112861269B CN112861269B (en) 2022-08-30

Family

ID=75994127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110267799.0A Active CN112861269B (en) 2021-03-11 2021-03-11 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction

Country Status (1)

Country Link
CN (1) CN112861269B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113715842A (en) * 2021-08-24 2021-11-30 华中科技大学 High-speed moving vehicle control method based on simulation learning and reinforcement learning
CN113734170A (en) * 2021-08-19 2021-12-03 崔建勋 Automatic driving lane change decision-making method based on deep Q learning
CN114527642A (en) * 2022-03-03 2022-05-24 东北大学 AGV automatic PID parameter adjusting method based on deep reinforcement learning
CN115303290A (en) * 2022-10-09 2022-11-08 北京理工大学 System key level switching method and system of vehicle hybrid key level system

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190220744A1 (en) * 2018-01-17 2019-07-18 Hengshuai Yao Method of generating training data for training a neural network, method of training a neural network and using neural network for autonomous operations
CN110450771A (en) * 2019-08-29 2019-11-15 合肥工业大学 A kind of intelligent automobile stability control method based on deeply study
CN110716550A (en) * 2019-11-06 2020-01-21 南京理工大学 Gear shifting strategy dynamic optimization method based on deep reinforcement learning
CN110716562A (en) * 2019-09-25 2020-01-21 南京航空航天大学 Decision-making method for multi-lane driving of unmanned vehicle based on reinforcement learning
US20200033868A1 (en) * 2018-07-27 2020-01-30 GM Global Technology Operations LLC Systems, methods and controllers for an autonomous vehicle that implement autonomous driver agents and driving policy learners for generating and improving policies based on collective driving experiences of the autonomous driver agents
CN110850720A (en) * 2019-11-26 2020-02-28 国网山东省电力公司电力科学研究院 DQN algorithm-based area automatic power generation dynamic control method
CN110969848A (en) * 2019-11-26 2020-04-07 武汉理工大学 Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
US20200265305A1 (en) * 2017-10-27 2020-08-20 Deepmind Technologies Limited Reinforcement learning using distributed prioritized replay
CN111605565A (en) * 2020-05-08 2020-09-01 昆山小眼探索信息科技有限公司 Automatic driving behavior decision method based on deep reinforcement learning
CN111985614A (en) * 2020-07-23 2020-11-24 中国科学院计算技术研究所 Method, system and medium for constructing automatic driving decision system
CN112162555A (en) * 2020-09-23 2021-01-01 燕山大学 Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet
CN112406867A (en) * 2020-11-19 2021-02-26 清华大学 Emergency vehicle hybrid lane change decision method based on reinforcement learning and avoidance strategy

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200265305A1 (en) * 2017-10-27 2020-08-20 Deepmind Technologies Limited Reinforcement learning using distributed prioritized replay
US20190220744A1 (en) * 2018-01-17 2019-07-18 Hengshuai Yao Method of generating training data for training a neural network, method of training a neural network and using neural network for autonomous operations
US20200033868A1 (en) * 2018-07-27 2020-01-30 GM Global Technology Operations LLC Systems, methods and controllers for an autonomous vehicle that implement autonomous driver agents and driving policy learners for generating and improving policies based on collective driving experiences of the autonomous driver agents
CN110450771A (en) * 2019-08-29 2019-11-15 合肥工业大学 A kind of intelligent automobile stability control method based on deeply study
CN110716562A (en) * 2019-09-25 2020-01-21 南京航空航天大学 Decision-making method for multi-lane driving of unmanned vehicle based on reinforcement learning
CN110716550A (en) * 2019-11-06 2020-01-21 南京理工大学 Gear shifting strategy dynamic optimization method based on deep reinforcement learning
CN110850720A (en) * 2019-11-26 2020-02-28 国网山东省电力公司电力科学研究院 DQN algorithm-based area automatic power generation dynamic control method
CN110969848A (en) * 2019-11-26 2020-04-07 武汉理工大学 Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
CN111605565A (en) * 2020-05-08 2020-09-01 昆山小眼探索信息科技有限公司 Automatic driving behavior decision method based on deep reinforcement learning
CN111985614A (en) * 2020-07-23 2020-11-24 中国科学院计算技术研究所 Method, system and medium for constructing automatic driving decision system
CN112162555A (en) * 2020-09-23 2021-01-01 燕山大学 Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet
CN112406867A (en) * 2020-11-19 2021-02-26 清华大学 Emergency vehicle hybrid lane change decision method based on reinforcement learning and avoidance strategy

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DEY, K.C.;LI YAN: "A review of communication, driver characteristics, and controls aspects of cooperative adaptive cruise control", 《IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS》 *
王文飒, 梁军, 陈龙, 陈小波, 朱宁, 华国栋: "Cooperative adaptive cruise control based on deep reinforcement learning", 《Journal of Transport Information and Safety》 *
黄鹤, 郭伟锋, 梅炜炜, 张润, 程进, 张炳力: "Automatic parking control strategy based on deep reinforcement learning", 《Proceedings of the 2020 Annual Conference of the China Society of Automotive Engineers》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113734170A (en) * 2021-08-19 2021-12-03 崔建勋 Automatic driving lane change decision-making method based on deep Q learning
CN113734170B (en) * 2021-08-19 2023-10-24 崔建勋 Automatic driving lane change decision method based on deep Q learning
CN113715842A (en) * 2021-08-24 2021-11-30 华中科技大学 High-speed moving vehicle control method based on simulation learning and reinforcement learning
CN114527642A (en) * 2022-03-03 2022-05-24 东北大学 AGV automatic PID parameter adjusting method based on deep reinforcement learning
CN114527642B (en) * 2022-03-03 2024-04-02 东北大学 Method for automatically adjusting PID parameters by AGV based on deep reinforcement learning
CN115303290A (en) * 2022-10-09 2022-11-08 北京理工大学 System key level switching method and system of vehicle hybrid key level system
CN115303290B (en) * 2022-10-09 2022-12-06 北京理工大学 System key level switching method and system of vehicle hybrid key level system

Also Published As

Publication number Publication date
CN112861269B (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN112861269B (en) Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction
CN111898211B (en) Intelligent vehicle speed decision method based on deep reinforcement learning and simulation method thereof
CN107229973B (en) Method and device for generating strategy network model for automatic vehicle driving
CN109213148B (en) Vehicle low-speed following decision method based on deep reinforcement learning
CN111845741B (en) Automatic driving decision control method and system based on hierarchical reinforcement learning
CN112249032B (en) Automatic driving decision method, system, equipment and computer storage medium
EP3725627B1 (en) Method for generating vehicle control command, and vehicle controller and storage medium
CN106740846A (en) A kind of electric automobile self-adapting cruise control method of double mode switching
CN107168303A (en) A kind of automatic Pilot method and device of automobile
US12005922B2 (en) Toward simulation of driver behavior in driving automation
CN113954837B (en) Deep learning-based lane change decision-making method for large-scale commercial vehicle
CN113276884B (en) Intelligent vehicle interactive decision passing method and system with variable game mode
CN112668779A (en) Preceding vehicle motion state prediction method based on self-adaptive Gaussian process
Yu et al. Autonomous overtaking decision making of driverless bus based on deep Q-learning method
CN112201070B (en) Deep learning-based automatic driving expressway bottleneck section behavior decision method
JP7415471B2 (en) Driving evaluation device, driving evaluation system, in-vehicle device, external evaluation device, and driving evaluation program
CN113160585A (en) Traffic light timing optimization method, system and storage medium
CN113722835A (en) Modeling method for anthropomorphic random lane change driving behavior
CN113954855B (en) Self-adaptive matching method for automobile driving mode
CN108569268A (en) Vehicle collision avoidance parameter calibration method and device, vehicle control device, storage medium
CN112542061B (en) Lane borrowing and overtaking control method, device and system based on Internet of vehicles and storage medium
CN113033902A (en) Automatic driving track-changing planning method based on improved deep learning
WO2023004698A1 (en) Method for intelligent driving decision-making, vehicle movement control method, apparatus, and vehicle
CN114148349B (en) Vehicle personalized following control method based on generation of countermeasure imitation study
CN116306800A (en) Intelligent driving decision learning method based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant