CN113104050B - Unmanned end-to-end decision method based on deep reinforcement learning - Google Patents
- Publication number
- CN113104050B (application CN202110372793.XA)
- Authority
- CN
- China
- Prior art keywords
- network
- eval
- target
- critic1
- reinforcement learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/00—Drive control systems specially adapted for autonomous road vehicles
- B60W60/001—Planning or execution of driving tasks
- B60W2520/00—Input parameters relating to overall vehicle dynamics
- B60W2520/10—Longitudinal speed
- B60W2540/00—Input parameters relating to occupants
- B60W2540/10—Accelerator pedal position
- B60W2540/12—Brake pedal position
- B60W2540/18—Steering angle
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The invention discloses an unmanned end-to-end decision method based on deep reinforcement learning, comprising the following steps: 1) acquiring the front road feature code of the unmanned vehicle; 2) inputting the front road environment state and the vehicle's own state, as the current-time environment state, into a trained deep reinforcement learning network to output the action of the unmanned vehicle. The deep reinforcement learning network comprises an Actor network group consisting of an Eval Actor network and a Target Actor network, a Critic1 network group consisting of an Eval Critic1 network and a Target Critic1 network, a Critic2 network group consisting of an Eval Critic2 network and a Target Critic2 network, and a Q network group consisting of an Eval Q network and a Target Q network. The method requires less environmental data and can effectively reduce cost, while improving the exploration efficiency of the agent by constructing a deep reinforcement learning network with high learning efficiency and fast training speed.
Description
Technical Field
The invention relates to the technical field of unmanned driving, and in particular to an unmanned end-to-end decision method based on deep reinforcement learning.
Background
Unmanned driving technology has developed rapidly in recent years. It integrates environmental perception, decision planning and control, and enables a vehicle, by means of artificial intelligence, to run safely on the road without a driver. The perception module fuses sensor information from cameras, lidar and the like, and senses the vehicle's surroundings in real time. The decision module outputs the optimal decision plan according to the perception information and the vehicle state information. The control module, according to the decision information, drives the vehicle along the planned trajectory at the specified speed. As the hub connecting perception and control, the decision module is the focus of unmanned driving research.
There are currently three directions of decision-making research: 1) rule-based decision methods, 2) imitation-learning-based decision methods, and 3) reinforcement-learning-based decision methods. Rule-based methods cannot cover all possible scenes and struggle to adapt to complex environments; imitation-learning-based methods find it difficult to make optimal decisions in complex and changeable urban traffic. The learning mode of deep reinforcement learning is closer to human thinking: it combines the advantages of deep learning and reinforcement learning, improving both performance and generalization.
Decision methods based on deep reinforcement learning have achieved certain results in the field of unmanned driving, but a large amount of random exploration is carried out during algorithm training, so the agent easily accumulates too many low-return experiences early in training, leading to low learning efficiency and long training time. Solving these problems is therefore of crucial significance for vehicle decision making.
Disclosure of Invention
The invention aims to provide an unmanned end-to-end decision method based on deep reinforcement learning, which solves the above technical problems.
Therefore, the technical scheme of the invention is as follows:
an unmanned end-to-end decision method based on deep reinforcement learning, comprising the steps of constructing and training a deep reinforcement learning network; wherein,
1) constructing a deep reinforcement learning network:
the deep reinforcement learning network comprises an Actor network group consisting of an Eval Actor network and a Target Actor network, a Critic1 network group consisting of an Eval Critic1 network and a Target Critic1 network, a Critic2 network group consisting of an Eval Critic2 network and a Target Critic2 network, and a Q network group consisting of an Eval Q network and a Target Q network; wherein the Eval Actor network receives the current-time environment state s_t and outputs a continuous-space action a_Ct; the Target Actor network outputs an action a' for training the Eval Critic1 and Eval Critic2 networks; the Eval Critic1 and Eval Critic2 networks output an action value Q for training the Eval Actor network, and the Target Critic1 and Target Critic2 networks output action values Q_j' for training the Eval Critic1 and Eval Critic2 networks; the Eval Q network receives the current-time environment state s_t, outputs the action values Q_D and selects the discrete-space action a_Dt with the highest value; and the Target Q network outputs action values for training the Eval Q network.
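The four Eval/Target network groups described above can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the 128-dimensional state code, the two hidden layers of 64 units, the two continuous action dimensions and the five discrete actions are all assumptions chosen for illustration.

```python
import numpy as np

def mlp(sizes, rng):
    """Randomly initialised multi-layer perceptron weights (W, b) per layer."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x, out_act=None):
    """tanh hidden layers; optional activation on the output layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
        elif out_act is not None:
            x = out_act(x)
    return x

rng = np.random.default_rng(0)
STATE, ACT_C, ACT_D = 128, 2, 5              # assumed dimensions

# Eval/Target pairs: each Target network starts as a copy of its Eval network.
eval_actor   = mlp([STATE, 64, 64, ACT_C], rng)
target_actor = [(W.copy(), b.copy()) for W, b in eval_actor]
eval_critic1 = mlp([STATE + ACT_C, 64, 64, 1], rng)   # Q(s, a) for continuous a
eval_critic2 = mlp([STATE + ACT_C, 64, 64, 1], rng)
eval_q       = mlp([STATE, 64, 64, ACT_D], rng)       # Q_D(s, .) over discrete actions

s_t = rng.standard_normal(STATE)
a_c = forward(eval_actor, s_t, out_act=np.tanh)       # continuous-space action a_Ct
q_d = forward(eval_q, s_t)                            # discrete action values Q_D
a_d_index = int(np.argmax(q_d))                       # highest-value discrete action a_Dt
```

The tanh output squashes the continuous action into [-1, 1], matching the normalised steering/pedal ranges seen in Table 1.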
2) Training a deep reinforcement learning network, which comprises the following specific steps:
S1, initializing each network parameter of the deep reinforcement learning network and interacting with the interactive environment; the Eval Actor network receives the current-time environment state s_t and outputs a continuous-space action a_Ct; the Eval Q network receives the current-time environment state s_t and outputs a discrete-space action a_Dt; the continuous-space action a_Ct and the discrete-space action a_Dt are weighted-fused to obtain the executed action a_t; the executed action a_t and the next-time environment state s_{t+1} are input into the reward-and-punishment function to obtain the reward-and-punishment value r_t of the executed action a_t;
S2, repeating step S1 until the training of the deep reinforcement learning network is completed, and continuously storing the historical experience information at each moment, including s_t, a_Ct, a_Dt, r_t and s_{t+1}, as a group of samples in an experience replay pool;
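The experience replay pool of step S2 can be sketched with a bounded deque; the capacity and batch size below are arbitrary assumptions, not values from the patent.

```python
import random
from collections import deque

class ReplayPool:
    """Stores (s_t, a_Ct, a_Dt, r_t, s_{t+1}) tuples and samples N groups of them."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences drop out first

    def store(self, s, a_c, a_d, r, s_next):
        self.buffer.append((s, a_c, a_d, r, s_next))

    def ready(self, n):
        # "the number of collected samples meets the calling requirement"
        return len(self.buffer) >= n

    def sample(self, n):
        return random.sample(self.buffer, n)

pool = ReplayPool(capacity=1000)
for t in range(64):
    pool.store(t, 0.1, 1, 1.0, t + 1)
batch = pool.sample(32) if pool.ready(32) else []
```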
S3, when the number of collected samples in the experience replay pool meets the calling requirement, calling N groups of samples from the experience replay pool and training the Eval Actor network through the Actor loss function J:

J = −(1/N) Σ_i Q(s_i, a_i | θ^{Q1}), with a_i = μ(s_i | θ^μ),

wherein θ^μ is the Eval Actor network parameter; Q is the action value output after the environment state s_i of sample i and the executed action a_i are input into the Eval Critic1 network; θ^{Q1} is the Critic1 network parameter; μ denotes the Eval Actor;
S4, while step S3 is performed, calling N groups of samples from the experience replay pool and synchronously training the Eval Critic1 network and the Eval Critic2 network through the first loss function L1:

L1 = (1/N) Σ_i ( y_i − Q(s_i, a_i | θ^{Qk}) )², k = 1, 2, with y_i = r_i + γ_C · min_{j=1,2} Q_j'(s', a' | θ^{Qj'}),

wherein Q is the action value output after the environment state s_i and the executed action a_i of sample i are input into the Eval Critic1 or Eval Critic2 network; θ^{Qk} is the Eval Critic1 or Eval Critic2 network parameter; y_i is the first action estimation value; r_i is the reward-and-punishment value of executing a_i; γ_C is the first discount rate; Q_j' is the action value output after the next-time environment state s' and the Target action a' are input into the Target Critic1 or Target Critic2 network; the Target action a' is the action output by the Target Actor network after the next-time environment state s' is input into it; θ^{Qj'} is the Target Critic1 or Target Critic2 network parameter; in this process, the minimum of the action values output by the Target Critic1 and Target Critic2 networks is taken to synchronously train the Eval Critic1 and Eval Critic2 networks;
S5, while step S3 is performed, calling N groups of samples from the experience replay pool and training the Eval Q network through the second loss function L2:

L2 = (1/N) Σ_i ( y_i' − Q_D(s_i, a_Di | θ) )², with y_i' = r_i + γ_D · max_a Q'(s', a | θ'),

wherein Q_D is the action value output after the environment state s_i of sample i is input into the Eval Q network; θ is the Eval Q network parameter; y_i' is the second action estimation value; r_i is the reward-and-punishment value of executing action a_i; γ_D is the second discount rate; max_a Q'(s', a | θ') is the maximum Target action value output after the next-time environment state s' is input into the Target Q network; θ' denotes the Target Q network parameter; in this process, the Target Q network selects the maximum action value via s' to train the Eval Q network;
S6, after the Eval Actor network completes updating, the Target Actor network is updated in a soft-update mode by adopting the following formula:

θ^{μ'} ← τ θ^μ + (1 − τ) θ^{μ'},

wherein τ is the soft-update parameter, θ^{μ'} is the Target Actor network parameter, and θ^μ is the Eval Actor network parameter;
after the Eval Critic1 and Eval Critic2 networks complete updating, the Target Critic1 and Target Critic2 networks are updated in the same soft-update mode:

θ^{Qj'} ← τ θ^{Qj} + (1 − τ) θ^{Qj'}, j = 1, 2,

wherein τ is the soft-update parameter, θ^{Qj'} is the Target Critic1 or Target Critic2 network parameter, and θ^{Qj} is the Eval Critic1 or Eval Critic2 network parameter;
after the Eval Q network completes updating, the Target Q network is updated in a hard-update mode by directly copying the parameters: θ' ← θ;
S7, repeating steps S3 to S6 until the loss values of the Actor loss function J, the first loss function L1 and the second loss function L2 all converge; the deep reinforcement learning network training is then completed.
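The soft and hard update rules of step S6 can be sketched as follows; τ = 0.005 is an assumed value, as the patent does not state it.

```python
import numpy as np

def soft_update(target, source, tau=0.005):
    """theta' <- tau*theta + (1-tau)*theta' — blends Eval parameters into the Target network."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]

def hard_update(source):
    """theta' <- theta — the Target Q network directly copies the Eval Q parameters."""
    return [s.copy() for s in source]

eval_params   = [np.ones(3)]
target_params = [np.zeros(3)]
target_params = soft_update(target_params, eval_params, tau=0.1)  # each entry becomes 0.1
target_q      = hard_update(eval_params)                          # exact copy
```

The slow soft update keeps the critic targets nearly stationary between steps, which is what suppresses training oscillation.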
Further, the current-time environment state s_t comprises the front road environment state and the vehicle's own state; the front road environment state is the front road feature code; the vehicle's own state includes the vehicle running speed, the steering wheel angle, the accelerator pedal opening and the brake pedal opening.
Further, the method for acquiring the environmental state of the front road comprises the following steps:
1) acquiring a front road picture in real time through an RGB (red, green and blue) camera arranged in front of a vehicle;
2) inputting the front road picture into a pre-trained network, and taking the obtained feature information code as the front road environment state; wherein the pre-trained network is obtained by: sequentially inputting a plurality of front road pictures from an end-to-end decision picture data set and performing end-to-end imitation-learning decision training on the unmanned vehicle, so as to extract the front road feature code of the unmanned vehicle.
Further, the vehicle running speed, the steering wheel steering angle, the degree of opening of the accelerator pedal, and the degree of opening of the brake pedal in the state of the vehicle itself are obtained by four sensors provided at the vehicle transmission, the steering wheel, the accelerator pedal, and the brake pedal, respectively.
Further, in step S1, the weighted fusion formula is: a_t = α × a_Ct + (1 − α) × a_Dt; wherein a_t is the executed action value, a_Dt is the discrete-space action value, a_Ct is the continuous-space action value, and α is the ratio of the continuous action.
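The weighted fusion is a one-line blend of the two action sources; the value α = 0.7 below is an assumption for illustration — the patent does not fix it.

```python
def fuse_actions(a_c, a_d, alpha=0.7):
    """a_t = alpha * a_Ct + (1 - alpha) * a_Dt — blends continuous and discrete action values."""
    return alpha * a_c + (1.0 - alpha) * a_d

# e.g. continuous steering 0.5, discretised steering 1.0:
a_t = fuse_actions(0.5, 1.0, alpha=0.7)   # 0.35 + 0.30 = 0.65
```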
Further, in step S1, the reward and punishment function is:
r_t = v × [1 − ω_t² − (|ω_t| − |ω_{t−1}|)²] − 12 × (l_ol + l_or) − r_c
wherein v is the forward speed of the vehicle; ω_t and ω_{t−1} are the steering wheel angles at the current and previous moments, respectively; l_ol and l_or are the intrusion ratios of the vehicle on the two sides of the road, respectively; and r_c is the penalty in the event of a collision.
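The reward-and-punishment function above can be transcribed directly; the numeric inputs below are illustrative only.

```python
def reward(v, omega_t, omega_prev, l_ol, l_or, r_c):
    """r_t = v*[1 - w_t^2 - (|w_t| - |w_{t-1}|)^2] - 12*(l_ol + l_or) - r_c"""
    return (v * (1.0 - omega_t**2 - (abs(omega_t) - abs(omega_prev))**2)
            - 12.0 * (l_ol + l_or) - r_c)

# Straight driving, no lane intrusion, no collision: reward equals the speed term.
r_straight = reward(v=10.0, omega_t=0.0, omega_prev=0.0, l_ol=0.0, l_or=0.0, r_c=0.0)
# A small steering change is penalised through the (|w_t| - |w_{t-1}|)^2 term.
r_steer = reward(v=10.0, omega_t=0.1, omega_prev=0.0, l_ol=0.0, l_or=0.0, r_c=0.0)
```

The squared angle-change term rewards smooth steering, which is consistent with the low-oscillation action values reported in Table 1.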
Further, in the training process of the deep reinforcement learning network, the training frequency of the Eval Actor, Eval Critic1, Eval Critic2 and Eval Q networks is: training once for each interaction between the deep reinforcement learning network and the interactive environment; the update frequency of the Target Actor network and the Target Q network is: updating once for every two interactions between the deep reinforcement learning network and the interactive environment; and the update frequency of the Target Critic1 and Target Critic2 networks is: updating once for every eight interactions between the deep reinforcement learning network and the interactive environment.
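The staggered frequencies above can be sketched as a loop counter — train the Eval networks every interaction, update the Target Actor and Target Q every two interactions, and update the Target Critics every eight; the step count is arbitrary.

```python
def schedule(total_steps):
    """Returns, per interaction step, which networks would be trained or updated."""
    log = []
    for step in range(1, total_steps + 1):
        ops = ["train_eval"]                         # Eval Actor/Critic1/Critic2/Q: every step
        if step % 2 == 0:
            ops += ["update_target_actor", "update_target_q"]
        if step % 8 == 0:
            ops += ["update_target_critics"]         # slowest: every eighth interaction
        log.append(ops)
    return log

log = schedule(8)
```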
Compared with the prior art, the unmanned end-to-end decision method based on deep reinforcement learning can meet the early-stage data requirement with only an RGB camera sensing the environment in front of the vehicle, and the real-time vehicle running speed, steering wheel angle, accelerator pedal opening and brake pedal opening are acquired with only conventional sensors, effectively reducing cost. Meanwhile, the method addresses the problem that a reinforcement learning algorithm, which requires a large amount of exploration, easily accumulates too many low-return experiences early in training and thus learns inefficiently; by constructing a deep reinforcement learning network with high learning efficiency and fast training speed, the exploration efficiency of the agent is improved.
Drawings
FIG. 1 is a flow chart of an end-to-end decision method for unmanned driving based on deep reinforcement learning according to the present invention;
FIG. 2 is a schematic structural diagram of a deep reinforcement learning network in the unmanned end-to-end decision method based on deep reinforcement learning according to the present invention;
FIG. 3(a) is a screenshot of state 1 in the process of the simulated automobile completing a turn in the simulated environment test scenario in an embodiment of the invention;
FIG. 3(b) is a screenshot of state 2 in the process of the simulated automobile completing a turn in the simulated environment test scenario in an embodiment of the invention;
FIG. 3(c) is a screenshot of state 3 in the process of the simulated automobile completing a turn in the simulated environment test scenario in an embodiment of the invention;
FIG. 3(d) is a screenshot of state 4 in the process of the simulated automobile completing a turn in the simulated environment test scenario in an embodiment of the invention;
FIG. 3(e) is a screenshot of state 5 in the process of the simulated automobile completing a turn in the simulated environment test scenario in an embodiment of the invention.
Detailed Description
The invention will be further described with reference to the following figures and specific examples, which are not intended to limit the invention in any way.
As shown in fig. 1, the deep reinforcement learning-based unmanned end-to-end decision method includes the following steps:
S1, acquiring the road feature code in front of the unmanned vehicle; specifically,
s101, acquiring a front road picture in real time through an RGB (red, green and blue) camera arranged in front of a vehicle;
s102, inputting a front road picture into a pre-trained network, and acquiring a characteristic information code as a front road environment state;
wherein the pre-trained network is obtained by: sequentially inputting a plurality of front road pictures from an end-to-end decision picture data set and performing end-to-end imitation-learning decision training on the unmanned vehicle, so as to extract the front road feature code of the unmanned vehicle;
S2, inputting the front road environment state and the vehicle's own state, as the current-time environment state, into the trained deep reinforcement learning network to output the action of the unmanned vehicle; specifically, the vehicle's own state comprises the vehicle running speed, the steering wheel angle, the accelerator pedal opening and the brake pedal opening; these four real-time measurements are obtained by sensors arranged at the vehicle transmission, the steering wheel, the accelerator pedal and the brake pedal, respectively; the actions of the unmanned vehicle comprise the steering wheel angle, the accelerator pedal opening and the brake pedal opening;
the specific implementation steps for constructing and training the deep reinforcement learning network are as follows:
1) as shown in fig. 2, a deep reinforcement learning network is constructed:
the deep reinforcement learning network comprises an Actor network group consisting of an Eval Actor network and a Target Actor network, a Critic1 network group consisting of an Eval Critic1 network and a Target Critic1 network, a Critic2 network group consisting of an Eval Critic2 network and a Target Critic2 network, and a Q network group consisting of an Eval Q network and a Target Q network; wherein the Eval Actor network receives the current-time environment state s_t and outputs a continuous-space action a_Ct; the Target Actor network outputs an action a' for training the Eval Critic1 and Eval Critic2 networks; the Eval Critic1 and Eval Critic2 networks output an action value Q for training the Eval Actor network, and the Target Critic1 and Target Critic2 networks output action values Q_j' for training the Eval Critic1 and Eval Critic2 networks; the Eval Q network receives the current-time environment state s_t and outputs the discrete-space action a_Dt with the highest action value; and the Target Q network outputs action values for training the Eval Q network.
2) Training a deep reinforcement learning network, which comprises the following specific steps:
S1, initializing each network parameter of the deep reinforcement learning network and interacting with the interactive environment; the Eval Actor network receives the current-time environment state s_t and outputs a continuous-space action a_Ct; the Eval Q network receives the current-time environment state s_t and outputs a discrete-space action a_Dt; the continuous-space action value a_Ct and the discrete-space action value a_Dt are weighted-fused to obtain the executed action value a_t; the executed action value a_t and the next-time environment state s_{t+1} are input into the reward-and-punishment function to obtain the reward-and-punishment value r_t of the executed action a_t;
S2, repeating step S1 until the training of the deep reinforcement learning network is completed, and continuously storing the historical experience information at each moment, including s_t, a_Ct, a_Dt, r_t and s_{t+1}, as a group of samples in an experience replay pool;
S3, when the number of collected samples in the experience replay pool meets the calling requirement, calling N groups of samples from the experience replay pool and training the Eval Actor network through the Actor loss function J:

J = −(1/N) Σ_i Q(s_i, a_i | θ^{Q1}), with a_i = μ(s_i | θ^μ),

wherein θ^μ is the Eval Actor network parameter; Q is the action value output after the environment state s_i of sample i and the executed action value a_i are input into the Eval Critic1 network; θ^{Q1} is the Critic1 network parameter; μ denotes the Eval Actor;
S4, while step S3 is performed, calling N groups of samples from the experience replay pool and synchronously training the Eval Critic1 network and the Eval Critic2 network through the first loss function L1:

L1 = (1/N) Σ_i ( y_i − Q(s_i, a_i | θ^{Qk}) )², k = 1, 2, with y_i = r_i + γ_C · min_{j=1,2} Q_j'(s', a' | θ^{Qj'}),

wherein Q is the action value output after the environment state s_i and the executed action value a_i of sample i are input into the Eval Critic1 or Eval Critic2 network; θ^{Qk} is the Eval Critic1 or Eval Critic2 network parameter; y_i is the first action estimation value; r_i is the reward-and-punishment value of executing a_i; γ_C is the first discount rate; Q_j' is the action value output after the next-time environment state s' and the Target action a' are input into the Target Critic1 or Target Critic2 network; the Target action a' is the action output by the Target Actor network after the next-time environment state s' is input into it; θ^{Qj'} is the Target Critic1 or Target Critic2 network parameter; in this process, the minimum of the action values output by the Target Critic1 and Target Critic2 networks is taken to synchronously train the Eval Critic1 and Eval Critic2 networks;
S5, while step S3 is performed, calling N groups of samples from the experience replay pool and training the Eval Q network through the second loss function L2:

L2 = (1/N) Σ_i ( y_i' − Q_D(s_i, a_Di | θ) )², with y_i' = r_i + γ_D · max_a Q'(s', a | θ'),

wherein Q_D is the action value output after the environment state s_i of sample i is input into the Eval Q network; θ is the Eval Q network parameter; y_i' is the second action estimation value; r_i is the reward-and-punishment value of executing action a_i; γ_D is the second discount rate; max_a Q'(s', a | θ') is the maximum Target action value output after the next-time environment state s' is input into the Target Q network; θ' denotes the Target Q network parameter; in this process, the Target Q network selects the maximum action value via s' to train the Eval Q network;
S6, after the Eval Actor network completes updating, the Target Actor network is updated in a soft-update mode by adopting the following formula:

θ^{μ'} ← τ θ^μ + (1 − τ) θ^{μ'},

wherein τ is the soft-update parameter, θ^{μ'} is the Target Actor network parameter, and θ^μ is the Eval Actor network parameter;
after the Eval Critic1 and Eval Critic2 networks complete updating, the Target Critic1 and Target Critic2 networks are updated in the same soft-update mode:

θ^{Qj'} ← τ θ^{Qj} + (1 − τ) θ^{Qj'}, j = 1, 2,

wherein τ is the soft-update parameter, θ^{Qj'} is the Target Critic1 or Target Critic2 network parameter, and θ^{Qj} is the Eval Critic1 or Eval Critic2 network parameter;
after the Eval Q network completes updating, the Target Q network is updated in a hard-update mode by directly copying the parameters: θ' ← θ;
S7, repeating steps S3 to S6 until the loss values of the Actor loss function J, the first loss function L1 and the second loss function L2 all converge; the deep reinforcement learning network training is then completed;
in the training process of the deep reinforcement learning network, the training frequencies of the Eval Actor network, the Eval Critic1 network, the Eval Critic2 network and the Eval Q are as follows: the deep reinforcement learning network performs training once when interacting with the interaction environment; the update frequency of the Target Actor network and the Target Q network is as follows: the deep reinforcement learning network and the interactive environment are updated once every two times; the update frequency of the Target critical 1 network and the Target critical 2 network is as follows: the deep reinforcement learning network and the interactive environment are updated once every eight times of interaction.
In order to validate the unmanned end-to-end decision method based on deep reinforcement learning, a simulation test was further adopted to verify the accuracy of the method. Specifically, using the CARLA open-source simulator, an official town map was selected for training and testing.
Fig. 3(a) to 3(e) show screenshots of five consecutive states of the simulated automobile while completing a turn in the simulated environment test scene. It can be seen from the figures that the simulated automobile turns smoothly using the unmanned end-to-end decision method. Some of the test results are shown in Table 1 below.
Table 1:
Steering wheel angle | Accelerator opening | Brake opening |
0.08717041 | 0.00000000 | 0.35341561 |
0.15307951 | 0.00000000 | 0.52670780 |
0.09271482 | 0.00000000 | 0.71335390 |
0.09728579 | 0.00000000 | 0.80667695 |
0.13101204 | 0.00000000 | 0.29040372 |
0.17254703 | 0.18484006 | 0.00000000 |
0.28552827 | 0.45002510 | 0.00000000 |
0.35098195 | 0.68417078 | 0.00000000 |
0.32560289 | 0.60003130 | 0.00000000 |
0.27327946 | 0.51601205 | 0.00000000 |
0.49871063 | 0.52256383 | 0.00000000 |
0.55013824 | 0.55902611 | 0.00000000 |
0.29818016 | 0.57373144 | 0.00000000 |
0.43982193 | 0.53180445 | 0.00000000 |
0.27434003 | 0.58508871 | 0.00000000 |
0.41293526 | 0.60961419 | 0.00000000 |
0.53498042 | 0.77192284 | 0.00000000 |
0.54406005 | 0.83376914 | 0.00000000 |
0.36400995 | 0.93965306 | 0.00000000 |
0.41749707 | 0.99218750 | 0.00000000 |
0.70428425 | 0.99804688 | 0.00000000 |
0.50889337 | 0.99951172 | 0.00000000 |
0.50394905 | 0.99987793 | 0.00000000 |
0.34516448 | 0.99969482 | 0.00000000 |
0.40782961 | 0.99992371 | 0.00000000 |
0.48604059 | 0.99999809 | 0.00000000 |
0.34041631 | 0.99999523 | 0.00000000 |
-0.28926253 | 0.99999881 | 0.00000000 |
-0.38473549 | 1.00000000 | 0.00000000 |
-0.27320272 | 1.00000000 | 0.00000000 |
-0.14927354 | 1.00000000 | 0.00000000 |
-0.08594432 | 1.00000000 | 0.00000000 |
-0.03263356 | 1.00000000 | 0.00000000 |
Table 1 shows a series of action values output by the vehicle during the curve-turning process in the simulation test, including the steering wheel angle, accelerator and brake values. It is evident from the numerical results that the values change smoothly with little oscillation, enabling smooth turning and meeting the driving requirements of the unmanned vehicle on actual roads.
Claims (7)
1. An unmanned end-to-end decision method based on deep reinforcement learning is characterized by comprising the steps of constructing and training a deep reinforcement learning network; wherein,
1) constructing a deep reinforcement learning network:
the deep reinforcement learning network comprises an Actor network group consisting of an Eval Actor network and a Target Actor network, a Critic1 network group consisting of an Eval Critic1 network and a Target Critic1 network, a Critic2 network group consisting of an Eval Critic2 network and a Target Critic2 network, and a Q network group consisting of an Eval Q network and a Target Q network; wherein the Eval Actor network receives the current-time environment state s_t and outputs a continuous-space action a_Ct; the Target Actor network outputs an action a' for training the Eval Critic1 and Eval Critic2 networks; the Eval Critic1 and Eval Critic2 networks output an action value Q for training the Eval Actor network, and the Target Critic1 and Target Critic2 networks output action values Q_j' for training the Eval Critic1 and Eval Critic2 networks; the Eval Q network receives the current-time environment state s_t, outputs the action values Q_D and selects the discrete-space action a_Dt with the highest value; and the Target Q network outputs action values for training the Eval Q network.
2) Training a deep reinforcement learning network, which comprises the following specific steps:
S1, initializing the parameters of each network in the deep reinforcement learning network and interacting with the interactive environment: the Eval Actor network receives the current-time environment state s_t and outputs a continuous-space action a_Ct; the Eval Q network receives the current-time environment state s_t and outputs a discrete-space action a_Dt; the continuous-space action a_Ct and the discrete-space action a_Dt are weighted and fused to obtain the executed action a_t; the executed action a_t and the next-time environment state s_{t+1} are input into the reward-and-punishment function to obtain the reward-and-punishment value r_t of the executed action a_t;
S2, repeating step S1 until the deep reinforcement learning network training is completed, and continuously storing the historical experience information (s_t, a_Ct, a_Dt, r_t, s_{t+1}) at each moment as a group of samples in an experience replay pool;
S3, when the number of samples collected in the experience replay pool meets the calling requirement, calling N groups of samples from the experience replay pool, and training the Eval Actor network through the Actor loss function J:

J = −(1/N) · Σ_i Q(s_i, a_i | θ^Q1), with a_i = μ(s_i | θ^μ),

wherein θ^μ is the Eval Actor network parameter; Q is the action value output after the environment state s_i of sample i and the executed action a_i are input into the Eval Critic1 network; θ^Q1 is the Eval Critic1 network parameter; and μ denotes the Eval Actor network;
S4, while step S3 is performed, calling N groups of samples from the experience replay pool, and synchronously training the Eval Critic1 network and the Eval Critic2 network through the first loss function L1:

L1 = (1/N) · Σ_i (y_i − Q(s_i, a_i | θ^Q))², with y_i = r_i + γ_C · min_{j=1,2} Q_j'(s', a' | θ^Q'),

wherein Q is the action value output after the environment state s_i of sample i and the executed action a_i are input into the Eval Critic1 network or the Eval Critic2 network; θ^Q is the Eval Critic1 network parameter or the Eval Critic2 network parameter; y_i is the first action estimation value; r_i is the reward-and-punishment value of executing a_i; γ_C is the first discount rate; Q_j' is the action value output after the next-time environment state s' and the Target action a' are input into the Target Critic1 network or the Target Critic2 network; the Target action a' is the action output after the next-time environment state s' is input into the Target Actor network; θ^Q' is the Target Critic1 network parameter or the Target Critic2 network parameter; in this process, the minimum of the action values output by the Target Critic1 network and the Target Critic2 network is used to synchronously train the Eval Critic1 network and the Eval Critic2 network;
S5, while step S3 is performed, calling N groups of samples from the experience replay pool, and training the Eval Q network through the second loss function L2:

L2 = (1/N) · Σ_i (y_i' − Q_D(s_i, a_Di | θ))², with y_i' = r_i + γ_D · max_a' Q'(s', a' | θ'),

wherein Q_D is the action value output after the environment state s_i of sample i is input into the Eval Q network; θ is the Eval Q network parameter; y_i' is the second action estimation value; r_i is the reward-and-punishment value of executing action a_i; γ_D is the second discount rate; max_a' Q'(s', a' | θ') is the maximum Target action value output after the next-time environment state s' is input into the Target Q network; θ' is the Target Q network parameter; in this process, the Target Q network selects the maximum action value via s' to train the Eval Q network;
S6, after the Eval Actor network completes updating, the Target Actor network is updated by soft update using the following formula:

θ^μ' ← τθ^μ + (1 − τ)θ^μ',

wherein τ is the soft-update parameter, θ^μ' is the Target Actor network parameter, and θ^μ is the Eval Actor network parameter;
after the Eval Critic1 network and the Eval Critic2 network complete updating, the Target Critic1 network and the Target Critic2 network are updated by soft update:

θ^Q' ← τθ^Q + (1 − τ)θ^Q',

wherein τ is the soft-update parameter, θ^Q' is the Target Critic1 network parameter or the Target Critic2 network parameter, and θ^Q is the Eval Critic1 network parameter or the Eval Critic2 network parameter;
after the Eval Q network completes updating, the Target Q network is updated by hard update using the following formula:

θ' ← θ,

wherein θ' is the Target Q network parameter and θ is the Eval Q network parameter;
S7, repeating steps S3-S6 until the loss values of the Actor loss function J, the first loss function L1, and the second loss function L2 all converge, at which point the deep reinforcement learning network training is completed.
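The target-value computations and parameter updates in steps S4-S6 can be sketched as follows. This is an illustrative sketch only, not the patent's implementation: the neural networks are stood in by plain floats and lists, and γ_C, γ_D, and τ are passed as ordinary arguments.

```python
# Sketch of the per-sample target values and update rules in claim 1, S4-S6.

def critic_target(r_i, q1_next, q2_next, gamma_c):
    """y_i = r_i + gamma_C * min(Q1', Q2') -- clipped double-Q target (S4)."""
    return r_i + gamma_c * min(q1_next, q2_next)

def dqn_target(r_i, q_next_values, gamma_d):
    """y_i' = r_i + gamma_D * max_a Q'(s', a) -- DQN-style target (S5)."""
    return r_i + gamma_d * max(q_next_values)

def soft_update(theta_target, theta_eval, tau):
    """theta' <- tau*theta + (1 - tau)*theta', applied element-wise (S6)."""
    return [tau * e + (1.0 - tau) * t for t, e in zip(theta_target, theta_eval)]

def hard_update(theta_eval):
    """theta' <- theta (S6, Target Q network)."""
    return list(theta_eval)
```

For example, with r_i = 1.0, Q_1' = 2.0, Q_2' = 3.0, and γ_C = 0.9, the clipped double-Q target is 1.0 + 0.9 × 2.0 = 2.8; taking the minimum of the two Target Critic outputs is what counteracts overestimation when training the two Eval Critic networks.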
2. The unmanned end-to-end decision method based on deep reinforcement learning according to claim 1, wherein the current-time environment state s_t comprises a front road environment state and the vehicle's own state; the front road environment state is a front road feature code; the vehicle's own state includes the vehicle running speed, the steering wheel angle, the accelerator pedal opening, and the brake pedal opening.
3. The deep reinforcement learning-based unmanned end-to-end decision method according to claim 2, characterized in that the method for acquiring the environmental state of the road ahead is as follows:
1) acquiring a front road picture in real time through an RGB (red, green and blue) camera arranged in front of a vehicle;
2) inputting the front road picture into a pre-trained network and obtaining a feature information code as the front road environment state; wherein the pre-trained network is obtained by sequentially inputting a plurality of front road pictures from an end-to-end decision picture data set and performing end-to-end imitation-learning decision training for the unmanned vehicle, so that the network extracts the front road feature codes.
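As an illustration of the feature-code idea in this claim, the sketch below stands in for the pre-trained network with a frozen random projection; the image size, feature dimension, and function names are all assumptions, and a real implementation would use a convolutional encoder trained by imitation learning.

```python
import random

FEATURE_DIM = 16  # illustrative feature-code length, not specified in the patent

def make_encoder(h, w, dim=FEATURE_DIM, seed=0):
    """Build a frozen encoder mapping an h*w RGB image to a feature code."""
    rng = random.Random(seed)
    # One weight row per output feature; frozen after "pre-training".
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(h * w * 3)]
               for _ in range(dim)]

    def encode(image_flat):
        # image_flat: h*w*3 pixel values in [0, 255], flattened row-major.
        x = [p / 255.0 for p in image_flat]          # normalize pixels
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in weights]

    return encode

encode = make_encoder(h=4, w=4)
code = encode([128] * (4 * 4 * 3))  # feature code used as front road state
```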
4. The unmanned end-to-end decision method based on deep reinforcement learning of claim 2, characterized in that the vehicle running speed, steering wheel steering angle, opening degree of accelerator pedal and opening degree of brake pedal in the vehicle's own state are obtained by four sensors disposed at the vehicle transmission, steering wheel, accelerator pedal and brake pedal, respectively.
5. The unmanned end-to-end decision method based on deep reinforcement learning according to claim 1, wherein in step S1 the weighted fusion formula is: a_t = α × a_Ct + (1 − α) × a_Dt; wherein a_t is the executed action value, a_Dt is the discrete-space action value, a_Ct is the continuous-space action value, and α is the proportion of the continuous action.
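The weighted fusion formula of this claim is a one-liner; the function name below is an assumption.

```python
def fuse_actions(a_ct, a_dt, alpha):
    """Weighted fusion from claim 5: a_t = alpha * a_Ct + (1 - alpha) * a_Dt."""
    return alpha * a_ct + (1.0 - alpha) * a_dt
```

With α = 0.7, a continuous action of 1.0 and a discrete action of 0.0 fuse to an executed action of 0.7.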
6. The unmanned end-to-end decision method based on deep reinforcement learning according to claim 1, wherein in step S1 the reward-and-punishment function is: r_t = v × [1 − ω_t² − (|ω_t| − |ω_{t−1}|)²] − 12 × (l_ol + l_or) − r_c, where v is the forward speed of the vehicle, ω_t and ω_{t−1} are the steering wheel angles at the current and previous moments respectively, l_ol and l_or are the intrusion ratios of the vehicle over the left and right sides of the road respectively, and r_c is the penalty incurred in the event of a collision.
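A direct transcription of this reward-and-punishment function, keeping the coefficient 12 exactly as it appears in the claim; the function and argument names are assumptions.

```python
def reward(v, omega_t, omega_prev, l_ol, l_or, r_c):
    """Claim 6 reward:
    r_t = v * [1 - w_t^2 - (|w_t| - |w_{t-1}|)^2] - 12 * (l_ol + l_or) - r_c
    """
    return (v * (1.0 - omega_t ** 2 - (abs(omega_t) - abs(omega_prev)) ** 2)
            - 12.0 * (l_ol + l_or) - r_c)
```

Driving straight within the lane (zero steering angle, no intrusion, no collision) yields r_t = v, so the reward directly encourages forward speed while the quadratic steering terms penalize sharp and oscillating steering.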
7. The unmanned end-to-end decision method based on deep reinforcement learning according to claim 1, wherein in the training process of the deep reinforcement learning network, the training frequency of the Eval Actor network, the Eval Critic1 network, the Eval Critic2 network, and the Eval Q network is: training once per interaction between the deep reinforcement learning network and the interactive environment; the update frequency of the Target Actor network and the Target Q network is: updating once every two interactions; and the update frequency of the Target Critic1 network and the Target Critic2 network is: updating once every eight interactions.
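The three update frequencies of this claim can be expressed as a simple schedule; counting interaction steps from 1 is an assumption.

```python
def networks_to_update(step):
    """Claim 7 schedule: eval networks train every interaction, Target Actor
    and Target Q update every 2 interactions, Target Critic1/2 every 8."""
    updates = ["eval_actor", "eval_critic1", "eval_critic2", "eval_q"]
    if step % 2 == 0:
        updates += ["target_actor", "target_q"]
    if step % 8 == 0:
        updates += ["target_critic1", "target_critic2"]
    return updates
```

Every eighth interaction all seven updates coincide; on odd-numbered interactions only the four eval networks train.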
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110372793.XA CN113104050B (en) | 2021-04-07 | 2021-04-07 | Unmanned end-to-end decision method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113104050A CN113104050A (en) | 2021-07-13 |
CN113104050B true CN113104050B (en) | 2022-04-12 |
Family
ID=76714310
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110372793.XA Active CN113104050B (en) | 2021-04-07 | 2021-04-07 | Unmanned end-to-end decision method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113104050B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113715842B (en) * | 2021-08-24 | 2023-02-03 | 华中科技大学 | High-speed moving vehicle control method based on imitation learning and reinforcement learning |
CN114179835B (en) * | 2021-12-30 | 2024-01-05 | 清华大学苏州汽车研究院(吴江) | Automatic driving vehicle decision training method based on reinforcement learning in real scene |
CN114397817A (en) * | 2021-12-31 | 2022-04-26 | 上海商汤科技开发有限公司 | Network training method, robot control method, network training device, robot control device, equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109213148A (en) * | 2018-08-03 | 2019-01-15 | 东南大学 | Vehicle low-speed car-following decision-making method based on deep reinforcement learning
CN111605565A (en) * | 2020-05-08 | 2020-09-01 | 昆山小眼探索信息科技有限公司 | Automatic driving behavior decision method based on deep reinforcement learning |
CN111885137A (en) * | 2020-07-15 | 2020-11-03 | 国网河南省电力公司信息通信公司 | Edge container resource allocation method based on deep reinforcement learning |
WO2021038781A1 (en) * | 2019-08-29 | 2021-03-04 | 日本電気株式会社 | Learning device, learning method, and learning program |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2016297852C1 (en) * | 2015-07-24 | 2019-12-05 | Deepmind Technologies Limited | Continuous control with deep reinforcement learning |
WO2020062911A1 (en) * | 2018-09-26 | 2020-04-02 | Huawei Technologies Co., Ltd. | Actor ensemble for continuous control |
US10940863B2 (en) * | 2018-11-01 | 2021-03-09 | GM Global Technology Operations LLC | Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle |
- 2021-04-07: application CN202110372793.XA filed; granted as patent CN113104050B (status: Active)
Non-Patent Citations (1)
Title |
---|
Research on unmanned autonomous driving strategy based on deep recurrent reinforcement learning; Li Zhihang; Industrial Control Computer (《工业控制计算机》); 2020-04-25 (Issue 04); pp. 49-57 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113104050B (en) | Unmanned end-to-end decision method based on deep reinforcement learning | |
CN110745136B (en) | Driving self-adaptive control method | |
CN111061277B (en) | Unmanned vehicle global path planning method and device | |
Codevilla et al. | End-to-end driving via conditional imitation learning | |
CN110136481B (en) | Parking strategy based on deep reinforcement learning | |
CN112232490B (en) | Visual-based depth simulation reinforcement learning driving strategy training method | |
Min et al. | Deep Q learning based high level driving policy determination | |
CN112433525A (en) | Mobile robot navigation method based on simulation learning and deep reinforcement learning | |
Bai et al. | Deep reinforcement learning based high-level driving behavior decision-making model in heterogeneous traffic | |
Ronecker et al. | Deep Q-network based decision making for autonomous driving | |
CN109726804A (en) | A kind of intelligent vehicle driving behavior based on driving prediction field and BP neural network personalizes decision-making technique | |
CN114153213A (en) | Deep reinforcement learning intelligent vehicle behavior decision method based on path planning | |
Yuan et al. | Multi-reward architecture based reinforcement learning for highway driving policies | |
CN114282433A (en) | Automatic driving training method and system based on combination of simulation learning and reinforcement learning | |
CN113255054A (en) | Reinforcement learning automatic driving method based on heterogeneous fusion characteristics | |
Capasso et al. | End-to-end intersection handling using multi-agent deep reinforcement learning | |
CN113276883A (en) | Unmanned vehicle driving strategy planning method based on dynamic generation environment and implementation device | |
Capasso et al. | Intelligent roundabout insertion using deep reinforcement learning | |
CN113657433A (en) | Multi-mode prediction method for vehicle track | |
Shen et al. | Inverse reinforcement learning with hybrid-weight trust-region optimization and curriculum learning for autonomous maneuvering | |
Maramotti et al. | Tackling real-world autonomous driving using deep reinforcement learning | |
CN116639124A (en) | Automatic driving vehicle lane changing method based on double-layer deep reinforcement learning | |
Arbabi et al. | Planning for autonomous driving via interaction-aware probabilistic action policies | |
Jaladi et al. | End-to-end training and testing gamification framework to learn human highway driving | |
Gutiérrez-Moreno et al. | Hybrid decision making for autonomous driving in complex urban scenarios |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||