CN113104050B - Unmanned end-to-end decision method based on deep reinforcement learning - Google Patents

Unmanned end-to-end decision method based on deep reinforcement learning

Info

Publication number
CN113104050B
CN113104050B (application CN202110372793.XA)
Authority
CN
China
Prior art keywords
network
eval
target
critic1
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110372793.XA
Other languages
Chinese (zh)
Other versions
CN113104050A (en)
Inventor
杨璐
王一权
任凤雷
刘佳琦
王龙志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology filed Critical Tianjin University of Technology
Priority to CN202110372793.XA priority Critical patent/CN113104050B/en
Publication of CN113104050A publication Critical patent/CN113104050A/en
Application granted granted Critical
Publication of CN113104050B publication Critical patent/CN113104050B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 60/00 Drive control systems specially adapted for autonomous road vehicles
    • B60W 60/001 Planning or execution of driving tasks
    • B60W 2520/00 Input parameters relating to overall vehicle dynamics
    • B60W 2520/10 Longitudinal speed
    • B60W 2540/00 Input parameters relating to occupants
    • B60W 2540/10 Accelerator pedal position
    • B60W 2540/12 Brake pedal position
    • B60W 2540/18 Steering angle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Transportation (AREA)
  • Medical Informatics (AREA)
  • Mechanical Engineering (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Automation & Control Theory (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an unmanned end-to-end decision method based on deep reinforcement learning, which comprises the following steps: 1) acquiring the front road feature code of the unmanned vehicle; 2) inputting the front road environment state and the vehicle's own state, as the current-time environment state, into a trained deep reinforcement learning structure to output the action of the unmanned vehicle. The deep reinforcement learning network comprises an Actor network consisting of an Eval Actor network and a Target Actor network, a Critic1 network consisting of an Eval Critic1 network and a Target Critic1 network, a Critic2 network consisting of an Eval Critic2 network and a Target Critic2 network, and a Q network group consisting of an Eval Q network and a Target Q network. The method requires less environmental data and can effectively reduce cost, and, by constructing a deep reinforcement learning network with high learning efficiency and fast training speed, it improves the exploration efficiency of the agent.

Description

Unmanned end-to-end decision method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of unmanned driving, in particular to an unmanned driving end-to-end decision method based on deep reinforcement learning.
Background
Unmanned driving technology has developed rapidly in recent years. It integrates environmental perception, decision planning, and control, and relies on artificial intelligence to let a vehicle drive safely on the road without a driver. The perception module fuses sensor information from cameras, lidar, and other sensors to sense the vehicle's surroundings in real time. The decision module outputs the optimal decision plan according to the perception information and the vehicle state information. The control module drives the vehicle along the planned trajectory at the specified speed according to the decision information. As the hub connecting perception and control, the decision module is the key focus of unmanned driving research.
There are currently three directions of decision-making research: 1) rule-based decision methods, 2) imitation-learning-based decision methods, and 3) reinforcement-learning-based decision methods. Rule-based methods cannot cover all possible scenarios and are difficult to adapt to complex environments; imitation-learning-based methods struggle to make optimal decisions in complex and changeable urban traffic environments. The learning mode of deep reinforcement learning is closer to human thinking; it combines the advantages of deep learning and reinforcement learning and offers better performance and generalization.
Decision methods based on deep reinforcement learning have achieved some results in the field of unmanned driving, but a large amount of random exploration is carried out during algorithm training, so the agent easily accumulates too many low-return experiences in the early training period, leading to low learning efficiency and long training time. Solving these problems is therefore of crucial significance for vehicle decision making.
Disclosure of Invention
The invention aims to provide an unmanned end-to-end decision method based on deep reinforcement learning that solves the above technical problems.
Therefore, the technical scheme of the invention is as follows:
An unmanned end-to-end decision method based on deep reinforcement learning, comprising the steps of constructing and training a deep reinforcement learning network; wherein,
1) constructing a deep reinforcement learning network:
the deep reinforcement learning network comprises an Actor network consisting of an Eval Actor network and a Target Actor network, a Critic1 network consisting of an Eval Critic1 network and a Target Critic1 network, a Critic2 network consisting of an Eval Critic2 network and a Target Critic2 network, and a Q network group consisting of an Eval Q network and a Target Q network; wherein the Eval Actor network is used for receiving the current-time environment state s_t and outputting a continuous-space action a_Ct; the Target Actor network is used for outputting an action a' for training the Eval Critic1 network and the Eval Critic2 network; the Eval Critic1 network and the Eval Critic2 network are used for outputting an action value Q for training the Eval Actor network, and the Target Critic1 network and the Target Critic2 network are used for outputting action values Q_j' for training the Eval Critic1 network and the Eval Critic2 network; the Eval Q network is used for receiving the current-time environment state s_t, outputting the action values Q_D, and selecting the discrete-space action a_Dt with the highest value; the Target Q network is used for outputting the action value max_a' Q'(s', a' | θ^Q') for training the Eval Q network.
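For illustration, a minimal PyTorch sketch of the three kinds of networks described above (Actor, Critic, and Q network) is given below; the hidden-layer sizes, activation choices, and class names are assumptions of this sketch and are not specified by the patent.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Eval/Target Actor: maps a state s_t to a continuous action a_Ct in [-1, 1]."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Eval/Target Critic1 or Critic2: maps (state, action) to a scalar action value Q."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class QNet(nn.Module):
    """Eval/Target Q network: maps a state s_t to one action value Q_D per discrete action."""
    def __init__(self, state_dim, n_discrete_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_discrete_actions),
        )
    def forward(self, s):
        return self.net(s)
```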
2) Training a deep reinforcement learning network, which comprises the following specific steps:
S1, initializing each network parameter of the deep reinforcement learning network and interacting with the interactive environment; the Eval Actor network receives the current-time environment state s_t and outputs a continuous-space action a_Ct; the Eval Q network receives the current-time environment state s_t and outputs a discrete-space action a_Dt; the continuous-space action a_Ct and the discrete-space action a_Dt are weighted and fused to obtain the executed action a_t; the executed action a_t and the next-time environment state s_t+1 are input into the reward and punishment function to obtain the reward and punishment value r_t of the executed action a_t;
S2, repeating step S1 until the training of the deep reinforcement learning network is completed, and continuously storing the historical experience information (s_t, a_Ct, a_Dt, r_t, s_t+1) at each moment as a group of samples in an experience replay pool;
S3, when the number of samples collected in the experience replay pool meets the sampling requirement, drawing N groups of samples from the experience replay pool and training the Eval Actor network through the Actor loss function J:
J = -(1/N)·Σ_i Q(s_i, a_i | θ^Q1)
where θ^μ is the Eval Actor network parameter; Q is the action value output after the environment state s_i and the executed action a_i of sample i are input into the Eval Critic1 network; θ^Q1 is the Eval Critic1 network parameter; μ denotes the Eval Actor network;
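As a hedged illustration of step S3, the Actor loss J above could be computed as follows; the tensor names, batch layout, and optimizer are assumptions of this sketch.

```python
def actor_update(eval_actor, eval_critic1, optimizer_actor, states):
    """One Eval Actor update on a batch of N sampled states (step S3)."""
    actions = eval_actor(states)                          # a_i produced by the Eval Actor for s_i
    actor_loss = -eval_critic1(states, actions).mean()    # J = -(1/N) * sum_i Q(s_i, a_i | theta^Q1)
    optimizer_actor.zero_grad()
    actor_loss.backward()
    optimizer_actor.step()
    return actor_loss.item()
```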
S4, while step S3 is performed, drawing N groups of samples from the experience replay pool and synchronously training the Eval Critic1 network and the Eval Critic2 network through the first loss function L1:
L1 = (1/N)·Σ_i (y_i - Q(s_i, a_i | θ^Qj))², j = 1, 2
where Q is the action value output after the environment state s_i and the executed action a_i of sample i are input into the Eval Critic1 network or the Eval Critic2 network; θ^Qj is the Eval Critic1 network parameter or the Eval Critic2 network parameter; y_i is the first action estimation value,
y_i = r_i + γ_C·min( Q_1'(s', a' | θ^Q1'), Q_2'(s', a' | θ^Q2') )
where r_i is the reward and punishment value of executing a_i, γ_C is the first discount rate, and Q_j' is the action value output after the next-moment environment state s' and the Target action a' are input into the Target Critic1 network or the Target Critic2 network; the Target action a' is the action output after the next-moment environment state s' is input into the Target Actor network; θ^Qj' is the Target Critic1 network parameter or the Target Critic2 network parameter; in this process, the Eval Critic1 network and the Eval Critic2 network are trained synchronously by taking the minimum of the action values output by the Target Critic1 network and the Target Critic2 network;
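A sketch of one possible realization of step S4 under the same assumed names; the reward tensor is assumed to have shape (N, 1), and no terminal-state mask is added so that the code matches the formula for y_i exactly.

```python
import torch

def critic_update(eval_critic1, eval_critic2, target_critic1, target_critic2,
                  target_actor, optimizer_critics,
                  states, actions, rewards, next_states, gamma_c):
    """One synchronous update of the Eval Critic1 and Eval Critic2 networks on N samples (step S4)."""
    with torch.no_grad():
        next_actions = target_actor(next_states)               # Target action a'
        q1_next = target_critic1(next_states, next_actions)
        q2_next = target_critic2(next_states, next_actions)
        y = rewards + gamma_c * torch.min(q1_next, q2_next)    # y_i = r_i + gamma_C * min_j Q_j'
    q1 = eval_critic1(states, actions)
    q2 = eval_critic2(states, actions)
    loss = ((y - q1) ** 2).mean() + ((y - q2) ** 2).mean()     # L1 applied to both critics
    optimizer_critics.zero_grad()
    loss.backward()
    optimizer_critics.step()
    return loss.item()
```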
S5, while step S3 is performed, drawing N groups of samples from the experience replay pool and training the Eval Q network through the second loss function L2:
L2 = (1/N)·Σ_i (y_i' - Q_D(s_i, a_Di | θ))²
where Q_D is the action value output after the environment state s_i of sample i is input into the Eval Q network; θ is the Eval Q network parameter; y_i' is the second action estimation value,
y_i' = r_i + γ_D·max_a' Q'(s', a' | θ^Q')
where r_i is the reward and punishment value of executing action a_i, γ_D is the second discount rate, max_a' Q'(s', a' | θ^Q') is the maximum Target action value output after the next-moment environment state s' is input into the Target Q network, and θ^Q' denotes the Target Q network parameter; in this process, the Target Q network selects the maximum action value through s' to train the Eval Q network;
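Step S5 can be sketched in the same style; discrete actions are assumed to be stored as indices into the Eval Q network's output, which is an assumption of this sketch rather than a detail given by the patent.

```python
import torch

def q_update(eval_qnet, target_qnet, optimizer_q,
             states, discrete_action_idx, rewards, next_states, gamma_d):
    """One Eval Q network update on N samples (step S5).
    discrete_action_idx: LongTensor of shape (N, 1) with the index of a_Di."""
    with torch.no_grad():
        q_next_max = target_qnet(next_states).max(dim=1, keepdim=True).values  # max_a' Q'(s', a')
        y2 = rewards + gamma_d * q_next_max                                    # y_i'
    q_d = eval_qnet(states).gather(1, discrete_action_idx)                     # Q_D(s_i, a_Di | theta)
    loss = ((y2 - q_d) ** 2).mean()                                            # L2
    optimizer_q.zero_grad()
    loss.backward()
    optimizer_q.step()
    return loss.item()
```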
S6, after the Eval Actor network completes updating, the Target Actor network is updated in a soft-update mode using the following formula:
θ^μ' ← τ·θ^μ + (1 - τ)·θ^μ'
where τ is the soft-update parameter, θ^μ' is the Target Actor network parameter, and θ^μ is the Eval Actor network parameter;
after the Eval Critic1 network and the Eval Critic2 network complete updating, the Target Critic1 network and the Target Critic2 network are updated:
θ^Qj' ← τ·θ^Qj + (1 - τ)·θ^Qj', j = 1, 2
where τ is the soft-update parameter, θ^Qj' is the Target Critic1 network parameter or the Target Critic2 network parameter, and θ^Qj is the Eval Critic1 network parameter or the Eval Critic2 network parameter;
after the Eval Q network completes updating, the Target Q network is updated in a hard-update mode using the following formula:
θ^Q' ← θ
where θ^Q' is the Target Q network parameter and θ is the Eval Q network parameter;
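The soft and hard updates of step S6 reduce to two small helper functions, sketched here for the assumed PyTorch modules above.

```python
def soft_update(target_net, eval_net, tau):
    """theta' <- tau * theta + (1 - tau) * theta' (Target Actor, Target Critic1, Target Critic2)."""
    for tp, ep in zip(target_net.parameters(), eval_net.parameters()):
        tp.data.copy_(tau * ep.data + (1.0 - tau) * tp.data)

def hard_update(target_net, eval_net):
    """theta' <- theta (Target Q network)."""
    target_net.load_state_dict(eval_net.state_dict())
```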
S7, repeating steps S3 to S6 until the loss values of the Actor loss function J, the first loss function L1 and the second loss function L2 all converge; the deep reinforcement learning network training is then completed.
Further, the current-time environment state s_t comprises the front road environment state and the vehicle's own state; the front road environment state is the front road feature code; the vehicle's own state includes the vehicle running speed, the steering wheel angle, the accelerator pedal opening and the brake pedal opening.
Further, the method for acquiring the environmental state of the front road comprises the following steps:
1) acquiring a front road picture in real time through an RGB (red, green and blue) camera arranged in front of a vehicle;
2) inputting the front road picture into a pre-trained network and acquiring the feature information code as the front road environment state; wherein the pre-trained network is obtained by sequentially inputting a plurality of front road pictures of an end-to-end decision picture data set and performing end-to-end imitation learning decision training for the unmanned vehicle, so that the network extracts the front road feature code of the unmanned vehicle.
Further, the vehicle running speed, the steering wheel steering angle, the degree of opening of the accelerator pedal, and the degree of opening of the brake pedal in the state of the vehicle itself are obtained by four sensors provided at the vehicle transmission, the steering wheel, the accelerator pedal, and the brake pedal, respectively.
Further, in step S1, the weighted fusion formula is: a_t = α × a_Ct + (1 - α) × a_Dt; where a_t is the executed action value, a_Dt is the discrete-space action value, a_Ct is the continuous-space action value, and α is the proportion of the continuous action.
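A one-line sketch of the weighted fusion; it assumes the discrete action a_Dt has already been mapped to the same continuous range as a_Ct, which the patent does not spell out.

```python
def fuse_actions(a_c, a_d, alpha):
    """Weighted fusion of step S1: a_t = alpha * a_Ct + (1 - alpha) * a_Dt."""
    return alpha * a_c + (1.0 - alpha) * a_d
```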
Further, in step S1, the reward and punishment function is:
r_t = v × [1 - ω_t² - (|ω_t| - |ω_t-1|)²] - 12 × (l_ol + l_or) - r_c
where v is the forward speed of the vehicle, ω_t and ω_t-1 are the steering wheel angles at the current moment and the previous moment respectively, l_ol and l_or are the intrusion ratios of the vehicle on the two sides of the road respectively, and r_c is the penalty in the event of a collision.
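A hedged sketch of the reward and punishment function; how the collision penalty r_c is triggered and its magnitude are assumptions of this sketch, since the patent only states that r_c is the penalty in the event of a collision.

```python
def reward(v, omega_t, omega_prev, l_ol, l_or, collided, r_c=10.0):
    """r_t = v * [1 - omega_t^2 - (|omega_t| - |omega_prev|)^2] - 12 * (l_ol + l_or) - r_c.
    v: forward speed; omega_t, omega_prev: steering wheel angles at the current and previous
    moment; l_ol, l_or: intrusion ratios on the two road sides; r_c: collision penalty
    (assumed value), applied only when a collision occurs."""
    r = v * (1.0 - omega_t ** 2 - (abs(omega_t) - abs(omega_prev)) ** 2)
    r -= 12.0 * (l_ol + l_or)
    if collided:
        r -= r_c
    return r
```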
Further, in the training process of the deep reinforcement learning network, the training frequency of the Eval Actor network, the Eval Critic1 network, the Eval Critic2 network and the Eval Q network is once per interaction between the deep reinforcement learning network and the interactive environment; the Target Actor network and the Target Q network are updated once every two interactions; the Target Critic1 network and the Target Critic2 network are updated once every eight interactions.
Compared with the prior art, the unmanned end-to-end decision method based on deep reinforcement learning needs only an RGB camera to sense the environment in front of the vehicle and conventional sensors to acquire the real-time running speed, steering wheel angle, accelerator pedal opening and brake pedal opening of the unmanned vehicle, which meets the early-stage data requirement and effectively reduces cost. At the same time, it addresses the problem that a reinforcement learning algorithm requires a large amount of exploration and therefore easily accumulates too many low-return experiences in the early stage of training, resulting in low learning efficiency, by constructing a deep reinforcement learning network with high learning efficiency and fast training speed, thereby improving the exploration efficiency of the agent.
Drawings
FIG. 1 is a flow chart of an end-to-end decision method for unmanned driving based on deep reinforcement learning according to the present invention;
FIG. 2 is a schematic structural diagram of a deep reinforcement learning network in the unmanned end-to-end decision method based on deep reinforcement learning according to the present invention;
FIG. 3(a) is a screenshot of state 1 in the process of the simulated automobile completing a turn in the simulated environment test scenario in an embodiment of the invention;
FIG. 3(b) is a screenshot of state 2 in the process of the simulated automobile completing a turn in the simulated environment test scenario in an embodiment of the invention;
FIG. 3(c) is a screenshot of state 3 in the process of the simulated automobile completing a turn in the simulated environment test scenario in an embodiment of the invention;
FIG. 3(d) is a screenshot of state 4 in the process of the simulated automobile completing a turn in the simulated environment test scenario in an embodiment of the invention;
FIG. 3(e) is a screenshot of state 5 in the process of the simulated automobile completing a turn in the simulated environment test scenario in an embodiment of the invention.
Detailed Description
The invention will be further described with reference to the following figures and specific examples, which are not intended to limit the invention in any way.
As shown in fig. 1, the deep reinforcement learning-based unmanned end-to-end decision method includes the following steps:
S1, acquiring the road feature code in front of the unmanned vehicle; specifically,
s101, acquiring a front road picture in real time through an RGB (red, green and blue) camera arranged in front of a vehicle;
s102, inputting a front road picture into a pre-trained network, and acquiring a characteristic information code as a front road environment state;
wherein the pre-trained network is obtained by sequentially inputting a plurality of front road pictures of an end-to-end decision picture data set and performing end-to-end imitation learning decision training for the unmanned vehicle, so that the network extracts the front road feature code of the unmanned vehicle;
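For illustration, a possible form of the pre-trained front-road encoder is sketched below; the backbone layout, input resolution and feature dimension are assumptions, since the patent only specifies that the encoder is obtained by end-to-end imitation-learning pre-training and then used to extract the feature code.

```python
import torch
import torch.nn as nn

class RoadEncoder(nn.Module):
    """Assumed CNN encoder; in the patent it is pre-trained by imitation learning on an
    end-to-end decision picture data set and then used to extract the front-road feature code."""
    def __init__(self, feature_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feature_dim)

    def forward(self, rgb):              # rgb: (N, 3, H, W) front-road picture
        h = self.conv(rgb).flatten(1)
        return self.fc(h)                # front-road feature code
```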
S2, inputting the front road environment state and the vehicle's own state, as the current-time environment state, into the trained deep reinforcement learning structure to output the action of the unmanned vehicle; specifically, the vehicle's own state comprises the vehicle running speed, the steering wheel angle, the accelerator pedal opening and the brake pedal opening; these four real-time measurements are obtained by sensors arranged at the vehicle transmission, the steering wheel, the accelerator pedal and the brake pedal respectively; the actions of the unmanned vehicle comprise the steering wheel angle, the accelerator pedal opening and the brake pedal opening;
the specific implementation steps for constructing and training the deep reinforcement learning network are as follows:
1) as shown in fig. 2, a deep reinforcement learning network is constructed:
the deep reinforcement learning network comprises an Actor network consisting of an Eval Actor network and a Target Actor network, a Critic1 network consisting of an Eval Critic1 network and a Target Critic1 network, a Critic2 network consisting of an Eval Critic2 network and a Target Critic2 network, and a Q network group consisting of an Eval Q network and a Target Q network; wherein the Eval Actor network is used for receiving the current-time environment state s_t and outputting a continuous-space action a_Ct; the Target Actor network is used for outputting an action a' for training the Eval Critic1 network and the Eval Critic2 network; the Eval Critic1 network and the Eval Critic2 network are used for outputting an action value Q for training the Eval Actor network, and the Target Critic1 network and the Target Critic2 network are used for outputting action values Q_j' for training the Eval Critic1 network and the Eval Critic2 network; the Eval Q network is used for receiving the current-time environment state s_t and outputting the discrete-space action a_Dt with the highest action value; the Target Q network is used for outputting the action value max_a' Q'(s', a' | θ^Q') for training the Eval Q network.
2) Training a deep reinforcement learning network, which comprises the following specific steps:
S1, initializing each network parameter of the deep reinforcement learning network and interacting with the interactive environment; the Eval Actor network receives the current-time environment state s_t and outputs a continuous-space action a_Ct; the Eval Q network receives the current-time environment state s_t and outputs a discrete-space action a_Dt; the continuous-space action value a_Ct and the discrete-space action value a_Dt are weighted and fused to obtain the executed action value a_t; the executed action value a_t and the next-time environment state s_t+1 are input into the reward and punishment function to obtain the reward and punishment value r_t of the executed action a_t;
S2, repeating step S1 until the training of the deep reinforcement learning network is completed, and continuously storing the historical experience information (s_t, a_Ct, a_Dt, r_t, s_t+1) at each moment as a group of samples in an experience replay pool;
S3, when the number of samples collected in the experience replay pool meets the sampling requirement, drawing N groups of samples from the experience replay pool and training the Eval Actor network through the Actor loss function J:
J = -(1/N)·Σ_i Q(s_i, a_i | θ^Q1)
where θ^μ is the Eval Actor network parameter; Q is the action value output after the environment state s_i and the executed action value a_i of sample i are input into the Eval Critic1 network; θ^Q1 is the Eval Critic1 network parameter; μ denotes the Eval Actor network;
S4, while step S3 is performed, drawing N groups of samples from the experience replay pool and synchronously training the Eval Critic1 network and the Eval Critic2 network through the first loss function L1:
L1 = (1/N)·Σ_i (y_i - Q(s_i, a_i | θ^Qj))², j = 1, 2
where Q is the action value output after the environment state s_i and the executed action value a_i of sample i are input into the Eval Critic1 network or the Eval Critic2 network; θ^Qj is the Eval Critic1 network parameter or the Eval Critic2 network parameter; y_i is the first action estimation value,
y_i = r_i + γ_C·min( Q_1'(s', a' | θ^Q1'), Q_2'(s', a' | θ^Q2') )
where r_i is the reward and punishment value of executing a_i, γ_C is the first discount rate, and Q_j' is the action value output after the next-moment environment state s' and the Target action a' are input into the Target Critic1 network or the Target Critic2 network; the Target action a' is the action output after the next-moment environment state s' is input into the Target Actor network; θ^Qj' is the Target Critic1 network parameter or the Target Critic2 network parameter; in this process, the Eval Critic1 network and the Eval Critic2 network are trained synchronously by taking the minimum of the action values output by the Target Critic1 network and the Target Critic2 network;
S5, while step S3 is performed, drawing N groups of samples from the experience replay pool and training the Eval Q network through the second loss function L2:
L2 = (1/N)·Σ_i (y_i' - Q_D(s_i, a_Di | θ))²
where Q_D is the action value output after the environment state s_i of sample i is input into the Eval Q network; θ is the Eval Q network parameter; y_i' is the second action estimation value,
y_i' = r_i + γ_D·max_a' Q'(s', a' | θ^Q')
where r_i is the reward and punishment value of executing action a_i, γ_D is the second discount rate, max_a' Q'(s', a' | θ^Q') is the maximum Target action value output after the next-moment environment state s' is input into the Target Q network, and θ^Q' denotes the Target Q network parameter; in this process, the Target Q network selects the maximum action value through s' to train the Eval Q network;
S6, after the Eval Actor network completes updating, the Target Actor network is updated in a soft-update mode using the following formula:
θ^μ' ← τ·θ^μ + (1 - τ)·θ^μ'
where τ is the soft-update parameter, θ^μ' is the Target Actor network parameter, and θ^μ is the Eval Actor network parameter;
after the Eval Critic1 network and the Eval Critic2 network complete updating, the Target Critic1 network and the Target Critic2 network are updated:
θ^Qj' ← τ·θ^Qj + (1 - τ)·θ^Qj', j = 1, 2
where τ is the soft-update parameter, θ^Qj' is the Target Critic1 network parameter or the Target Critic2 network parameter, and θ^Qj is the Eval Critic1 network parameter or the Eval Critic2 network parameter;
after the Eval Q network completes updating, the Target Q network is updated in a hard-update mode using the following formula:
θ^Q' ← θ
where θ^Q' is the Target Q network parameter and θ is the Eval Q network parameter;
S7, repeating steps S3 to S6 until the loss values of the Actor loss function J, the first loss function L1 and the second loss function L2 all converge; the deep reinforcement learning network training is then completed;
in the training process of the deep reinforcement learning network, the training frequency of the Eval Actor network, the Eval Critic1 network, the Eval Critic2 network and the Eval Q network is once per interaction between the deep reinforcement learning network and the interactive environment; the Target Actor network and the Target Q network are updated once every two interactions; the Target Critic1 network and the Target Critic2 network are updated once every eight interactions.
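To make the interaction and update schedule concrete, a schematic training loop is sketched below; env, replay_pool, discrete_actions, the optimizers and the batch layout are assumed stand-ins built around the helper functions sketched earlier, not interfaces defined by the patent or by the CARLA simulator.

```python
# Schematic only: all names below are assumed stand-ins.
state = env.reset()
for step in range(1, max_steps + 1):
    # S1: hybrid action = weighted fusion of continuous and discrete actions
    a_c = eval_actor(state)                                # continuous action a_Ct
    a_d_idx = eval_qnet(state).argmax()                    # index of discrete action a_Dt
    a_t = alpha * a_c + (1.0 - alpha) * discrete_actions[a_d_idx]
    next_state, r_t, done = env.step(a_t)                  # r_t from the reward/punishment function
    replay_pool.add(state, a_c, a_d_idx, r_t, next_state)  # S2: store the experience
    state = env.reset() if done else next_state

    if len(replay_pool) >= batch_size:
        states, a_cs, a_d_idxs, rewards, next_states = replay_pool.sample(batch_size)
        # S3-S5: Eval networks train once per interaction
        actor_update(eval_actor, eval_critic1, opt_actor, states)
        critic_update(eval_critic1, eval_critic2, target_critic1, target_critic2,
                      target_actor, opt_critics, states, a_cs, rewards, next_states, gamma_c)
        q_update(eval_qnet, target_qnet, opt_q, states, a_d_idxs, rewards, next_states, gamma_d)
        # S6: Target Actor / Target Q every two interactions
        if step % 2 == 0:
            soft_update(target_actor, eval_actor, tau)
            hard_update(target_qnet, eval_qnet)
        # Target Critic1 / Critic2 every eight interactions
        if step % 8 == 0:
            soft_update(target_critic1, eval_critic1, tau)
            soft_update(target_critic2, eval_critic2, tau)
```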
To verify the unmanned end-to-end decision method based on deep reinforcement learning, a simulation test was further conducted. Specifically, the CARLA open-source simulator was used, and an official town map was selected for training and testing.
Fig. 3(a) to 3(e) show screenshots of five consecutive states of the simulated automobile in the process of completing a turn in the simulated environment test scene. It can be seen from the figures that the simulated automobile turns smoothly using the proposed unmanned end-to-end decision method. Some of the test results are shown in Table 1 below.
Table 1:
Steering wheel angle Throttle opening degree Brake opening degree
0.08717041 0.00000000 0.35341561
0.15307951 0.00000000 0.52670780
0.09271482 0.00000000 0.71335390
0.09728579 0.00000000 0.80667695
0.13101204 0.00000000 0.29040372
0.17254703 0.18484006 0.00000000
0.28552827 0.45002510 0.00000000
0.35098195 0.68417078 0.00000000
0.32560289 0.60003130 0.00000000
0.27327946 0.51601205 0.00000000
0.49871063 0.52256383 0.00000000
0.55013824 0.55902611 0.00000000
0.29818016 0.57373144 0.00000000
0.43982193 0.53180445 0.00000000
0.27434003 0.58508871 0.00000000
0.41293526 0.60961419 0.00000000
0.53498042 0.77192284 0.00000000
0.54406005 0.83376914 0.00000000
0.36400995 0.93965306 0.00000000
0.41749707 0.99218750 0.00000000
0.70428425 0.99804688 0.00000000
0.50889337 0.99951172 0.00000000
0.50394905 0.99987793 0.00000000
0.34516448 0.99969482 0.00000000
0.40782961 0.99992371 0.00000000
0.48604059 0.99999809 0.00000000
0.34041631 0.99999523 0.00000000
-0.28926253 0.99999881 0.00000000
-0.38473549 1.00000000 0.00000000
-0.27320272 1.00000000 0.00000000
-0.14927354 1.00000000 0.00000000
-0.08594432 1.00000000 0.00000000
-0.03263356 1.00000000 0.00000000
Table 1 shows a series of action values output by the vehicle during the turning process in the simulation test, including the steering wheel angle, throttle and brake values. It is evident from the numerical results that the values change smoothly with little oscillation, enabling a smooth turn and meeting the driving requirements of an unmanned vehicle on an actual road.

Claims (7)

1. An unmanned end-to-end decision method based on deep reinforcement learning is characterized by comprising the steps of constructing and training a deep reinforcement learning network; wherein,
1) constructing a deep reinforcement learning network:
the deep reinforcement learning network comprises an Actor network consisting of an Eval Actor network and a Target Actor network, a Critic1 network consisting of an Eval Critic1 network and a Target Critic1 network, a Critic2 network consisting of an Eval Critic2 network and a Target Critic2 network, and a Q network group consisting of an Eval Q network and a Target Q network; wherein the Eval Actor network is used for receiving the current-time environment state s_t and outputting a continuous-space action a_Ct; the Target Actor network is used for outputting an action a' for training the Eval Critic1 network and the Eval Critic2 network; the Eval Critic1 network and the Eval Critic2 network are used for outputting an action value Q for training the Eval Actor network, and the Target Critic1 network and the Target Critic2 network are used for outputting action values Q_j' for training the Eval Critic1 network and the Eval Critic2 network; the Eval Q network is used for receiving the current-time environment state s_t, outputting the action values Q_D, and selecting the discrete-space action a_Dt with the highest value; the Target Q network is used for outputting the action value max_a' Q'(s', a' | θ^Q') for training the Eval Q network;
2) Training a deep reinforcement learning network, which comprises the following specific steps:
S1, initializing each network parameter of the deep reinforcement learning network and interacting with the interactive environment; the Eval Actor network receives the current-time environment state s_t and outputs a continuous-space action a_Ct; the Eval Q network receives the current-time environment state s_t and outputs a discrete-space action a_Dt; the continuous-space action a_Ct and the discrete-space action a_Dt are weighted and fused to obtain the executed action a_t; the executed action a_t and the next-time environment state s_t+1 are input into the reward and punishment function to obtain the reward and punishment value r_t of the executed action a_t;
S2, repeating step S1 until the training of the deep reinforcement learning network is completed, and continuously storing the historical experience information (s_t, a_Ct, a_Dt, r_t, s_t+1) at each moment as a group of samples in an experience replay pool;
S3, when the number of samples collected in the experience replay pool meets the sampling requirement, drawing N groups of samples from the experience replay pool and training the Eval Actor network through the Actor loss function J:
J = -(1/N)·Σ_i Q(s_i, a_i | θ^Q1)
where θ^μ is the Eval Actor network parameter; Q is the action value output after the environment state s_i and the executed action a_i of sample i are input into the Eval Critic1 network; θ^Q1 is the Eval Critic1 network parameter; μ denotes the Eval Actor network;
S4, while step S3 is performed, drawing N groups of samples from the experience replay pool and synchronously training the Eval Critic1 network and the Eval Critic2 network through the first loss function L1:
L1 = (1/N)·Σ_i (y_i - Q(s_i, a_i | θ^Qj))², j = 1, 2
where Q is the action value output after the environment state s_i and the executed action a_i of sample i are input into the Eval Critic1 network or the Eval Critic2 network; θ^Qj is the Eval Critic1 network parameter or the Eval Critic2 network parameter; y_i is the first action estimation value,
y_i = r_i + γ_C·min( Q_1'(s', a' | θ^Q1'), Q_2'(s', a' | θ^Q2') )
where r_i is the reward and punishment value of executing a_i, γ_C is the first discount rate, and Q_j' is the action value output after the next-moment environment state s' and the Target action a' are input into the Target Critic1 network or the Target Critic2 network; the Target action a' is the action output after the next-moment environment state s' is input into the Target Actor network; θ^Qj' is the Target Critic1 network parameter or the Target Critic2 network parameter; in this process, the Eval Critic1 network and the Eval Critic2 network are trained synchronously by taking the minimum of the action values output by the Target Critic1 network and the Target Critic2 network;
S5, while step S3 is performed, drawing N groups of samples from the experience replay pool and training the Eval Q network through the second loss function L2:
L2 = (1/N)·Σ_i (y_i' - Q_D(s_i, a_Di | θ))²
where Q_D is the action value output after the environment state s_i of sample i is input into the Eval Q network; θ is the Eval Q network parameter; y_i' is the second action estimation value,
y_i' = r_i + γ_D·max_a' Q'(s', a' | θ^Q')
where r_i is the reward and punishment value of executing action a_i, γ_D is the second discount rate, max_a' Q'(s', a' | θ^Q') is the maximum Target action value output after the next-moment environment state s' is input into the Target Q network, and θ^Q' denotes the Target Q network parameter; in this process, the Target Q network selects the maximum action value through s' to train the Eval Q network;
S6, after the Eval Actor network completes updating, the Target Actor network is updated in a soft-update mode using the following formula:
θ^μ' ← τ·θ^μ + (1 - τ)·θ^μ'
where τ is the soft-update parameter, θ^μ' is the Target Actor network parameter, and θ^μ is the Eval Actor network parameter;
after the Eval Critic1 network and the Eval Critic2 network complete updating, the Target Critic1 network and the Target Critic2 network are updated:
θ^Qj' ← τ·θ^Qj + (1 - τ)·θ^Qj', j = 1, 2
where τ is the soft-update parameter, θ^Qj' is the Target Critic1 network parameter or the Target Critic2 network parameter, and θ^Qj is the Eval Critic1 network parameter or the Eval Critic2 network parameter;
after the Eval Q network completes updating, the Target Q network is updated in a hard-update mode using the following formula:
θ^Q' ← θ
where θ^Q' is the Target Q network parameter and θ is the Eval Q network parameter;
S7, repeating steps S3 to S6 until the loss values of the Actor loss function J, the first loss function L1 and the second loss function L2 all converge; the deep reinforcement learning network training is then completed.
2. The unmanned end-to-end decision method based on deep reinforcement learning according to claim 1, wherein the current-time environment state s_t comprises the front road environment state and the vehicle's own state; the front road environment state is the front road feature code; the vehicle's own state includes the vehicle running speed, the steering wheel angle, the accelerator pedal opening and the brake pedal opening.
3. The deep reinforcement learning-based unmanned end-to-end decision method according to claim 2, characterized in that the method for acquiring the environmental state of the road ahead is as follows:
1) acquiring a front road picture in real time through an RGB (red, green and blue) camera arranged in front of a vehicle;
2) inputting the front road picture into a pre-trained network and acquiring the feature information code as the front road environment state; wherein the pre-trained network is obtained by sequentially inputting a plurality of front road pictures of an end-to-end decision picture data set and performing end-to-end imitation learning decision training for the unmanned vehicle, so that the network extracts the front road feature code of the unmanned vehicle.
4. The unmanned end-to-end decision method based on deep reinforcement learning of claim 2, characterized in that the vehicle running speed, steering wheel steering angle, opening degree of accelerator pedal and opening degree of brake pedal in the vehicle's own state are obtained by four sensors disposed at the vehicle transmission, steering wheel, accelerator pedal and brake pedal, respectively.
5. The unmanned end-to-end decision method based on deep reinforcement learning according to claim 1, wherein in step S1 the weighted fusion formula is: a_t = α × a_Ct + (1 - α) × a_Dt; where a_t is the executed action value, a_Dt is the discrete-space action value, a_Ct is the continuous-space action value, and α is the proportion of the continuous action.
6. The unmanned end-to-end decision method based on deep reinforcement learning according to claim 1, wherein in step S1 the reward and punishment function is: r_t = v × [1 - ω_t² - (|ω_t| - |ω_t-1|)²] - 12 × (l_ol + l_or) - r_c,
where v is the forward speed of the vehicle, ω_t and ω_t-1 are the steering wheel angles at the current moment and the previous moment respectively, l_ol and l_or are the intrusion ratios of the vehicle on the two sides of the road respectively, and r_c is the penalty in the event of a collision.
7. The unmanned end-to-end decision method based on deep reinforcement learning according to claim 1, wherein in the training process of the deep reinforcement learning network, the training frequency of the Eval Actor network, the Eval Critic1 network, the Eval Critic2 network and the Eval Q network is once per interaction between the deep reinforcement learning network and the interactive environment; the Target Actor network and the Target Q network are updated once every two interactions; the Target Critic1 network and the Target Critic2 network are updated once every eight interactions.
CN202110372793.XA 2021-04-07 2021-04-07 Unmanned end-to-end decision method based on deep reinforcement learning Active CN113104050B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110372793.XA CN113104050B (en) 2021-04-07 2021-04-07 Unmanned end-to-end decision method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110372793.XA CN113104050B (en) 2021-04-07 2021-04-07 Unmanned end-to-end decision method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN113104050A CN113104050A (en) 2021-07-13
CN113104050B true CN113104050B (en) 2022-04-12

Family

ID=76714310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110372793.XA Active CN113104050B (en) 2021-04-07 2021-04-07 Unmanned end-to-end decision method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113104050B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113715842B (en) * 2021-08-24 2023-02-03 华中科技大学 High-speed moving vehicle control method based on imitation learning and reinforcement learning
CN114179835B (en) * 2021-12-30 2024-01-05 清华大学苏州汽车研究院(吴江) Automatic driving vehicle decision training method based on reinforcement learning in real scene
CN114397817A (en) * 2021-12-31 2022-04-26 上海商汤科技开发有限公司 Network training method, robot control method, network training device, robot control device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213148A (en) * 2018-08-03 2019-01-15 东南大学 It is a kind of based on deeply study vehicle low speed with decision-making technique of speeding
CN111605565A (en) * 2020-05-08 2020-09-01 昆山小眼探索信息科技有限公司 Automatic driving behavior decision method based on deep reinforcement learning
CN111885137A (en) * 2020-07-15 2020-11-03 国网河南省电力公司信息通信公司 Edge container resource allocation method based on deep reinforcement learning
WO2021038781A1 (en) * 2019-08-29 2021-03-04 日本電気株式会社 Learning device, learning method, and learning program

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2016297852C1 (en) * 2015-07-24 2019-12-05 Deepmind Technologies Limited Continuous control with deep reinforcement learning
WO2020062911A1 (en) * 2018-09-26 2020-04-02 Huawei Technologies Co., Ltd. Actor ensemble for continuous control
US10940863B2 (en) * 2018-11-01 2021-03-09 GM Global Technology Operations LLC Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213148A (en) * 2018-08-03 2019-01-15 东南大学 It is a kind of based on deeply study vehicle low speed with decision-making technique of speeding
WO2021038781A1 (en) * 2019-08-29 2021-03-04 日本電気株式会社 Learning device, learning method, and learning program
CN111605565A (en) * 2020-05-08 2020-09-01 昆山小眼探索信息科技有限公司 Automatic driving behavior decision method based on deep reinforcement learning
CN111885137A (en) * 2020-07-15 2020-11-03 国网河南省电力公司信息通信公司 Edge container resource allocation method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on unmanned autonomous driving strategy based on deep recurrent reinforcement learning; Li Zhihang; Industrial Control Computer; 2020-04-25 (No. 04); pp. 49-57 *

Also Published As

Publication number Publication date
CN113104050A (en) 2021-07-13

Similar Documents

Publication Publication Date Title
CN113104050B (en) Unmanned end-to-end decision method based on deep reinforcement learning
CN110745136B (en) Driving self-adaptive control method
CN111061277B (en) Unmanned vehicle global path planning method and device
Codevilla et al. End-to-end driving via conditional imitation learning
CN110136481B (en) Parking strategy based on deep reinforcement learning
CN112232490B (en) Visual-based depth simulation reinforcement learning driving strategy training method
Min et al. Deep Q learning based high level driving policy determination
CN112433525A (en) Mobile robot navigation method based on simulation learning and deep reinforcement learning
Bai et al. Deep reinforcement learning based high-level driving behavior decision-making model in heterogeneous traffic
Ronecker et al. Deep Q-network based decision making for autonomous driving
CN109726804A (en) A kind of intelligent vehicle driving behavior based on driving prediction field and BP neural network personalizes decision-making technique
CN114153213A (en) Deep reinforcement learning intelligent vehicle behavior decision method based on path planning
Yuan et al. Multi-reward architecture based reinforcement learning for highway driving policies
CN114282433A (en) Automatic driving training method and system based on combination of simulation learning and reinforcement learning
CN113255054A (en) Reinforcement learning automatic driving method based on heterogeneous fusion characteristics
Capasso et al. End-to-end intersection handling using multi-agent deep reinforcement learning
CN113276883A (en) Unmanned vehicle driving strategy planning method based on dynamic generation environment and implementation device
Capasso et al. Intelligent roundabout insertion using deep reinforcement learning
CN113657433A (en) Multi-mode prediction method for vehicle track
Shen et al. Inverse reinforcement learning with hybrid-weight trust-region optimization and curriculum learning for autonomous maneuvering
Maramotti et al. Tackling real-world autonomous driving using deep reinforcement learning
CN116639124A (en) Automatic driving vehicle lane changing method based on double-layer deep reinforcement learning
Arbabi et al. Planning for autonomous driving via interaction-aware probabilistic action policies
Jaladi et al. End-to-end training and testing gamification framework to learn human highway driving
Gutiérrez-Moreno et al. Hybrid decision making for autonomous driving in complex urban scenarios

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant