CN113104050B - Unmanned end-to-end decision method based on deep reinforcement learning - Google Patents

Unmanned end-to-end decision method based on deep reinforcement learning

Info

Publication number
CN113104050B
CN113104050B (application CN202110372793.XA)
Authority
CN
China
Prior art keywords
network
eval
target
critic1
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110372793.XA
Other languages
Chinese (zh)
Other versions
CN113104050A (en)
Inventor
杨璐
王一权
任凤雷
刘佳琦
王龙志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology filed Critical Tianjin University of Technology
Priority to CN202110372793.XA priority Critical patent/CN113104050B/en
Publication of CN113104050A publication Critical patent/CN113104050A/en
Application granted granted Critical
Publication of CN113104050B publication Critical patent/CN113104050B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 60/00 Drive control systems specially adapted for autonomous road vehicles
    • B60W 60/001 Planning or execution of driving tasks
    • B60W 2520/00 Input parameters relating to overall vehicle dynamics
    • B60W 2520/10 Longitudinal speed
    • B60W 2540/00 Input parameters relating to occupants
    • B60W 2540/10 Accelerator pedal position
    • B60W 2540/12 Brake pedal position
    • B60W 2540/18 Steering angle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Transportation (AREA)
  • Medical Informatics (AREA)
  • Mechanical Engineering (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Automation & Control Theory (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an unmanned end-to-end decision method based on deep reinforcement learning, which comprises the following steps: 1) acquiring the front road feature code of the unmanned vehicle; 2) inputting the front road environment state and the vehicle's own state, as the current-time environment state, into a trained deep reinforcement learning structure to output the action of the unmanned vehicle. The deep reinforcement learning network comprises an Actor network consisting of an Eval Actor network and a Target Actor network, a Critic1 network consisting of an Eval Critic1 network and a Target Critic1 network, a Critic2 network consisting of an Eval Critic2 network and a Target Critic2 network, and a Q network group consisting of an Eval Q network and a Target Q network. The method requires less environmental data and can effectively reduce cost, and, by constructing a deep reinforcement learning network with high learning efficiency and fast training speed, it improves the exploration efficiency of the agent.

Description

Unmanned end-to-end decision method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of unmanned driving, in particular to an unmanned driving end-to-end decision method based on deep reinforcement learning.
Background
Unmanned driving technology has developed rapidly in recent years. It integrates environmental perception, decision planning, and control, and relies on artificial intelligence to let a vehicle drive safely on the road without a driver. The perception module fuses sensor information from cameras, lidar, and other sensors to sense the vehicle's surroundings in real time. The decision module outputs the optimal decision plan according to the perception information and the vehicle state information. The control module drives the vehicle along the planned trajectory at the specified speed according to the decision information. As the hub connecting perception and control, the decision module is the key focus of unmanned driving research.
There are currently three directions of decision-making research: 1) rule-based decision methods, 2) imitation-learning-based decision methods, and 3) reinforcement-learning-based decision methods. Rule-based methods cannot cover all possible scenarios and are difficult to adapt to complex environments; imitation-learning-based methods struggle to make optimal decisions in complex and changeable urban traffic environments. The learning mode of deep reinforcement learning is closer to human thinking; it combines the advantages of deep learning and reinforcement learning and offers better performance and generalization.
Decision methods based on deep reinforcement learning have achieved some results in the field of unmanned driving, but a large amount of random exploration is carried out during algorithm training, so the agent easily accumulates too many low-return experiences in the early training period, leading to low learning efficiency and long training time. Solving these problems is therefore of crucial significance for vehicle decision making.
Disclosure of Invention
The invention aims to provide an unmanned end-to-end decision method based on deep reinforcement learning that solves the above technical problems.
Therefore, the technical scheme of the invention is as follows:
An unmanned end-to-end decision method based on deep reinforcement learning, comprising the steps of constructing and training a deep reinforcement learning network; wherein,
1) constructing a deep reinforcement learning network:
the deep reinforcement learning network comprises an Actor network consisting of an Eval Actor network and a Target Actor network, a Critic1 network consisting of an Eval Critic1 network and a Target Critic1 network, a Critic2 network consisting of an Eval Critic2 network and a Target Critic2 network, and a Q network group consisting of an Eval Q network and a Target Q network; wherein the Eval Actor network is used for receiving the current-time environment state s_t and outputting a continuous-space action a_Ct; the Target Actor network is used for outputting an action a' for training the Eval Critic1 network and the Eval Critic2 network; the Eval Critic1 network and the Eval Critic2 network are used for outputting an action value Q for training the Eval Actor network, and the Target Critic1 network and the Target Critic2 network are used for outputting action values Q_j' for training the Eval Critic1 network and the Eval Critic2 network; the Eval Q network is used for receiving the current-time environment state s_t, outputting the action values Q_D, and selecting the discrete-space action a_Dt with the highest value; the Target Q network is used for outputting the action value max_a' Q'(s', a' | θ^Q') for training the Eval Q network.
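For illustration, a minimal PyTorch sketch of the three kinds of networks described above (Actor, Critic, and Q network) is given below; the hidden-layer sizes, activation choices, and class names are assumptions of this sketch and are not specified by the patent.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Eval/Target Actor: maps a state s_t to a continuous action a_Ct in [-1, 1]."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Eval/Target Critic1 or Critic2: maps (state, action) to a scalar action value Q."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class QNet(nn.Module):
    """Eval/Target Q network: maps a state s_t to one action value Q_D per discrete action."""
    def __init__(self, state_dim, n_discrete_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_discrete_actions),
        )
    def forward(self, s):
        return self.net(s)
```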
2) Training a deep reinforcement learning network, which comprises the following specific steps:
S1, initializing each network parameter of the deep reinforcement learning network and interacting with the interactive environment; the Eval Actor network receives the current-time environment state s_t and outputs a continuous-space action a_Ct; the Eval Q network receives the current-time environment state s_t and outputs a discrete-space action a_Dt; the continuous-space action a_Ct and the discrete-space action a_Dt are weighted and fused to obtain the executed action a_t; the executed action a_t and the next-time environment state s_t+1 are input into the reward and punishment function to obtain the reward and punishment value r_t of the executed action a_t;
S2, repeating step S1 until the training of the deep reinforcement learning network is completed, and continuously storing the historical experience information (s_t, a_Ct, a_Dt, r_t, s_t+1) at each moment as a group of samples in an experience replay pool;
S3, when the number of samples collected in the experience replay pool meets the sampling requirement, drawing N groups of samples from the experience replay pool and training the Eval Actor network through the Actor loss function J:
J = -(1/N)·Σ_i Q(s_i, a_i | θ^Q1)
where θ^μ is the Eval Actor network parameter; Q is the action value output after the environment state s_i and the executed action a_i of sample i are input into the Eval Critic1 network; θ^Q1 is the Eval Critic1 network parameter; μ denotes the Eval Actor network;
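As a hedged illustration of step S3, the Actor loss J above could be computed as follows; the tensor names, batch layout, and optimizer are assumptions of this sketch.

```python
def actor_update(eval_actor, eval_critic1, optimizer_actor, states):
    """One Eval Actor update on a batch of N sampled states (step S3)."""
    actions = eval_actor(states)                          # a_i produced by the Eval Actor for s_i
    actor_loss = -eval_critic1(states, actions).mean()    # J = -(1/N) * sum_i Q(s_i, a_i | theta^Q1)
    optimizer_actor.zero_grad()
    actor_loss.backward()
    optimizer_actor.step()
    return actor_loss.item()
```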
S4, while step S3 is performed, drawing N groups of samples from the experience replay pool and synchronously training the Eval Critic1 network and the Eval Critic2 network through the first loss function L1:
L1 = (1/N)·Σ_i (y_i - Q(s_i, a_i | θ^Qj))², j = 1, 2
where Q is the action value output after the environment state s_i and the executed action a_i of sample i are input into the Eval Critic1 network or the Eval Critic2 network; θ^Qj is the Eval Critic1 network parameter or the Eval Critic2 network parameter; y_i is the first action estimation value,
y_i = r_i + γ_C·min( Q_1'(s', a' | θ^Q1'), Q_2'(s', a' | θ^Q2') )
where r_i is the reward and punishment value of executing a_i, γ_C is the first discount rate, and Q_j' is the action value output after the next-moment environment state s' and the Target action a' are input into the Target Critic1 network or the Target Critic2 network; the Target action a' is the action output after the next-moment environment state s' is input into the Target Actor network; θ^Qj' is the Target Critic1 network parameter or the Target Critic2 network parameter; in this process, the Eval Critic1 network and the Eval Critic2 network are trained synchronously by taking the minimum of the action values output by the Target Critic1 network and the Target Critic2 network;
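A sketch of one possible realization of step S4 under the same assumed names; the reward tensor is assumed to have shape (N, 1), and no terminal-state mask is added so that the code matches the formula for y_i exactly.

```python
import torch

def critic_update(eval_critic1, eval_critic2, target_critic1, target_critic2,
                  target_actor, optimizer_critics,
                  states, actions, rewards, next_states, gamma_c):
    """One synchronous update of the Eval Critic1 and Eval Critic2 networks on N samples (step S4)."""
    with torch.no_grad():
        next_actions = target_actor(next_states)               # Target action a'
        q1_next = target_critic1(next_states, next_actions)
        q2_next = target_critic2(next_states, next_actions)
        y = rewards + gamma_c * torch.min(q1_next, q2_next)    # y_i = r_i + gamma_C * min_j Q_j'
    q1 = eval_critic1(states, actions)
    q2 = eval_critic2(states, actions)
    loss = ((y - q1) ** 2).mean() + ((y - q2) ** 2).mean()     # L1 applied to both critics
    optimizer_critics.zero_grad()
    loss.backward()
    optimizer_critics.step()
    return loss.item()
```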
S5, while step S3 is performed, drawing N groups of samples from the experience replay pool and training the Eval Q network through the second loss function L2:
L2 = (1/N)·Σ_i (y_i' - Q_D(s_i, a_Di | θ))²
where Q_D is the action value output after the environment state s_i of sample i is input into the Eval Q network; θ is the Eval Q network parameter; y_i' is the second action estimation value,
y_i' = r_i + γ_D·max_a' Q'(s', a' | θ^Q')
where r_i is the reward and punishment value of executing action a_i, γ_D is the second discount rate, max_a' Q'(s', a' | θ^Q') is the maximum Target action value output after the next-moment environment state s' is input into the Target Q network, and θ^Q' denotes the Target Q network parameter; in this process, the Target Q network selects the maximum action value through s' to train the Eval Q network;
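Step S5 can be sketched in the same style; discrete actions are assumed to be stored as indices into the Eval Q network's output, which is an assumption of this sketch rather than a detail given by the patent.

```python
import torch

def q_update(eval_qnet, target_qnet, optimizer_q,
             states, discrete_action_idx, rewards, next_states, gamma_d):
    """One Eval Q network update on N samples (step S5).
    discrete_action_idx: LongTensor of shape (N, 1) with the index of a_Di."""
    with torch.no_grad():
        q_next_max = target_qnet(next_states).max(dim=1, keepdim=True).values  # max_a' Q'(s', a')
        y2 = rewards + gamma_d * q_next_max                                    # y_i'
    q_d = eval_qnet(states).gather(1, discrete_action_idx)                     # Q_D(s_i, a_Di | theta)
    loss = ((y2 - q_d) ** 2).mean()                                            # L2
    optimizer_q.zero_grad()
    loss.backward()
    optimizer_q.step()
    return loss.item()
```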
S6, after the Eval Actor network completes updating, the Target Actor network is updated in a soft-update mode using the following formula:
θ^μ' ← τ·θ^μ + (1 - τ)·θ^μ'
where τ is the soft-update parameter, θ^μ' is the Target Actor network parameter, and θ^μ is the Eval Actor network parameter;
after the Eval Critic1 network and the Eval Critic2 network complete updating, the Target Critic1 network and the Target Critic2 network are updated:
θ^Qj' ← τ·θ^Qj + (1 - τ)·θ^Qj', j = 1, 2
where τ is the soft-update parameter, θ^Qj' is the Target Critic1 network parameter or the Target Critic2 network parameter, and θ^Qj is the Eval Critic1 network parameter or the Eval Critic2 network parameter;
after the Eval Q network completes updating, the Target Q network is updated in a hard-update mode using the following formula:
θ^Q' ← θ
where θ^Q' is the Target Q network parameter and θ is the Eval Q network parameter;
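The soft and hard updates of step S6 reduce to two small helper functions, sketched here for the assumed PyTorch modules above.

```python
def soft_update(target_net, eval_net, tau):
    """theta' <- tau * theta + (1 - tau) * theta' (Target Actor, Target Critic1, Target Critic2)."""
    for tp, ep in zip(target_net.parameters(), eval_net.parameters()):
        tp.data.copy_(tau * ep.data + (1.0 - tau) * tp.data)

def hard_update(target_net, eval_net):
    """theta' <- theta (Target Q network)."""
    target_net.load_state_dict(eval_net.state_dict())
```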
S7, repeating steps S3 to S6 until the loss values of the Actor loss function J, the first loss function L1 and the second loss function L2 all converge; the deep reinforcement learning network training is then completed.
Further, the current-time environment state s_t comprises the front road environment state and the vehicle's own state; the front road environment state is the front road feature code; the vehicle's own state includes the vehicle running speed, the steering wheel angle, the accelerator pedal opening and the brake pedal opening.
Further, the method for acquiring the environmental state of the front road comprises the following steps:
1) acquiring a front road picture in real time through an RGB (red, green and blue) camera arranged in front of a vehicle;
2) inputting the front road picture into a pre-trained network and acquiring the feature information code as the front road environment state; wherein the pre-trained network is obtained by sequentially inputting a plurality of front road pictures of an end-to-end decision picture data set and performing end-to-end imitation learning decision training for the unmanned vehicle, so that the network extracts the front road feature code of the unmanned vehicle.
Further, the vehicle running speed, the steering wheel steering angle, the degree of opening of the accelerator pedal, and the degree of opening of the brake pedal in the state of the vehicle itself are obtained by four sensors provided at the vehicle transmission, the steering wheel, the accelerator pedal, and the brake pedal, respectively.
Further, in step S1, the weighted fusion formula is: a_t = α × a_Ct + (1 - α) × a_Dt; where a_t is the executed action value, a_Dt is the discrete-space action value, a_Ct is the continuous-space action value, and α is the proportion of the continuous action.
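A one-line sketch of the weighted fusion; it assumes the discrete action a_Dt has already been mapped to the same continuous range as a_Ct, which the patent does not spell out.

```python
def fuse_actions(a_c, a_d, alpha):
    """Weighted fusion of step S1: a_t = alpha * a_Ct + (1 - alpha) * a_Dt."""
    return alpha * a_c + (1.0 - alpha) * a_d
```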
Further, in step S1, the reward and punishment function is:
r_t = v × [1 - ω_t² - (|ω_t| - |ω_t-1|)²] - 12 × (l_ol + l_or) - r_c
where v is the forward speed of the vehicle, ω_t and ω_t-1 are the steering wheel angles at the current moment and the previous moment respectively, l_ol and l_or are the intrusion ratios of the vehicle on the two sides of the road respectively, and r_c is the penalty in the event of a collision.
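A hedged sketch of the reward and punishment function; how the collision penalty r_c is triggered and its magnitude are assumptions of this sketch, since the patent only states that r_c is the penalty in the event of a collision.

```python
def reward(v, omega_t, omega_prev, l_ol, l_or, collided, r_c=10.0):
    """r_t = v * [1 - omega_t^2 - (|omega_t| - |omega_prev|)^2] - 12 * (l_ol + l_or) - r_c.
    v: forward speed; omega_t, omega_prev: steering wheel angles at the current and previous
    moment; l_ol, l_or: intrusion ratios on the two road sides; r_c: collision penalty
    (assumed value), applied only when a collision occurs."""
    r = v * (1.0 - omega_t ** 2 - (abs(omega_t) - abs(omega_prev)) ** 2)
    r -= 12.0 * (l_ol + l_or)
    if collided:
        r -= r_c
    return r
```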
Further, in the training process of the deep reinforcement learning network, the training frequency of the Eval Actor network, the Eval Critic1 network, the Eval Critic2 network and the Eval Q network is once per interaction between the deep reinforcement learning network and the interactive environment; the Target Actor network and the Target Q network are updated once every two interactions; the Target Critic1 network and the Target Critic2 network are updated once every eight interactions.
Compared with the prior art, the unmanned end-to-end decision method based on deep reinforcement learning needs only an RGB camera to sense the environment in front of the vehicle and conventional sensors to acquire the real-time running speed, steering wheel angle, accelerator pedal opening and brake pedal opening of the unmanned vehicle, which meets the early-stage data requirement and effectively reduces cost. At the same time, it addresses the problem that a reinforcement learning algorithm requires a large amount of exploration and therefore easily accumulates too many low-return experiences in the early stage of training, resulting in low learning efficiency, by constructing a deep reinforcement learning network with high learning efficiency and fast training speed, thereby improving the exploration efficiency of the agent.
Drawings
FIG. 1 is a flow chart of an end-to-end decision method for unmanned driving based on deep reinforcement learning according to the present invention;
FIG. 2 is a schematic structural diagram of a deep reinforcement learning network in the unmanned end-to-end decision method based on deep reinforcement learning according to the present invention;
FIG. 3(a) is a screenshot of state 1 in the process of the simulated automobile completing a turn in the simulated environment test scenario in an embodiment of the invention;
FIG. 3(b) is a screenshot of state 2 in the process of the simulated automobile completing a turn in the simulated environment test scenario in an embodiment of the invention;
FIG. 3(c) is a screenshot of state 3 in the process of the simulated automobile completing a turn in the simulated environment test scenario in an embodiment of the invention;
FIG. 3(d) is a screenshot of state 4 in the process of the simulated automobile completing a turn in the simulated environment test scenario in an embodiment of the invention;
FIG. 3(e) is a screenshot of state 5 in the process of the simulated automobile completing a turn in the simulated environment test scenario in an embodiment of the invention.
Detailed Description
The invention will be further described with reference to the following figures and specific examples, which are not intended to limit the invention in any way.
As shown in fig. 1, the deep reinforcement learning-based unmanned end-to-end decision method includes the following steps:
S1, acquiring the road feature code in front of the unmanned vehicle; specifically,
s101, acquiring a front road picture in real time through an RGB (red, green and blue) camera arranged in front of a vehicle;
s102, inputting a front road picture into a pre-trained network, and acquiring a characteristic information code as a front road environment state;
wherein the pre-trained network is obtained by sequentially inputting a plurality of front road pictures of an end-to-end decision picture data set and performing end-to-end imitation learning decision training for the unmanned vehicle, so that the network extracts the front road feature code of the unmanned vehicle;
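For illustration, a possible form of the pre-trained front-road encoder is sketched below; the backbone layout, input resolution and feature dimension are assumptions, since the patent only specifies that the encoder is obtained by end-to-end imitation-learning pre-training and then used to extract the feature code.

```python
import torch
import torch.nn as nn

class RoadEncoder(nn.Module):
    """Assumed CNN encoder; in the patent it is pre-trained by imitation learning on an
    end-to-end decision picture data set and then used to extract the front-road feature code."""
    def __init__(self, feature_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feature_dim)

    def forward(self, rgb):              # rgb: (N, 3, H, W) front-road picture
        h = self.conv(rgb).flatten(1)
        return self.fc(h)                # front-road feature code
```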
S2, inputting the front road environment state and the vehicle's own state, as the current-time environment state, into the trained deep reinforcement learning structure to output the action of the unmanned vehicle; specifically, the vehicle's own state comprises the vehicle running speed, the steering wheel angle, the accelerator pedal opening and the brake pedal opening; these four real-time measurements are obtained by sensors arranged at the vehicle transmission, the steering wheel, the accelerator pedal and the brake pedal respectively; the actions of the unmanned vehicle comprise the steering wheel angle, the accelerator pedal opening and the brake pedal opening;
the specific implementation steps for constructing and training the deep reinforcement learning network are as follows:
1) as shown in fig. 2, a deep reinforcement learning network is constructed:
the deep reinforcement learning network comprises an Actor network consisting of an Eval Actor network and a Target Actor network, a Critic1 network consisting of an Eval Critic1 network and a Target Critic1 network, a Critic2 network consisting of an Eval Critic2 network and a Target Critic2 network, and a Q network group consisting of an Eval Q network and a Target Q network; wherein the Eval Actor network is used for receiving the current-time environment state s_t and outputting a continuous-space action a_Ct; the Target Actor network is used for outputting an action a' for training the Eval Critic1 network and the Eval Critic2 network; the Eval Critic1 network and the Eval Critic2 network are used for outputting an action value Q for training the Eval Actor network, and the Target Critic1 network and the Target Critic2 network are used for outputting action values Q_j' for training the Eval Critic1 network and the Eval Critic2 network; the Eval Q network is used for receiving the current-time environment state s_t and outputting the discrete-space action a_Dt with the highest action value; the Target Q network is used for outputting the action value max_a' Q'(s', a' | θ^Q') for training the Eval Q network.
2) Training a deep reinforcement learning network, which comprises the following specific steps:
S1, initializing each network parameter of the deep reinforcement learning network and interacting with the interactive environment; the Eval Actor network receives the current-time environment state s_t and outputs a continuous-space action a_Ct; the Eval Q network receives the current-time environment state s_t and outputs a discrete-space action a_Dt; the continuous-space action value a_Ct and the discrete-space action value a_Dt are weighted and fused to obtain the executed action value a_t; the executed action value a_t and the next-time environment state s_t+1 are input into the reward and punishment function to obtain the reward and punishment value r_t of the executed action a_t;
S2, repeating step S1 until the training of the deep reinforcement learning network is completed, and continuously storing the historical experience information (s_t, a_Ct, a_Dt, r_t, s_t+1) at each moment as a group of samples in an experience replay pool;
S3, when the number of samples collected in the experience replay pool meets the sampling requirement, drawing N groups of samples from the experience replay pool and training the Eval Actor network through the Actor loss function J:
J = -(1/N)·Σ_i Q(s_i, a_i | θ^Q1)
where θ^μ is the Eval Actor network parameter; Q is the action value output after the environment state s_i and the executed action value a_i of sample i are input into the Eval Critic1 network; θ^Q1 is the Eval Critic1 network parameter; μ denotes the Eval Actor network;
S4, while step S3 is performed, drawing N groups of samples from the experience replay pool and synchronously training the Eval Critic1 network and the Eval Critic2 network through the first loss function L1:
L1 = (1/N)·Σ_i (y_i - Q(s_i, a_i | θ^Qj))², j = 1, 2
where Q is the action value output after the environment state s_i and the executed action value a_i of sample i are input into the Eval Critic1 network or the Eval Critic2 network; θ^Qj is the Eval Critic1 network parameter or the Eval Critic2 network parameter; y_i is the first action estimation value,
y_i = r_i + γ_C·min( Q_1'(s', a' | θ^Q1'), Q_2'(s', a' | θ^Q2') )
where r_i is the reward and punishment value of executing a_i, γ_C is the first discount rate, and Q_j' is the action value output after the next-moment environment state s' and the Target action a' are input into the Target Critic1 network or the Target Critic2 network; the Target action a' is the action output after the next-moment environment state s' is input into the Target Actor network; θ^Qj' is the Target Critic1 network parameter or the Target Critic2 network parameter; in this process, the Eval Critic1 network and the Eval Critic2 network are trained synchronously by taking the minimum of the action values output by the Target Critic1 network and the Target Critic2 network;
S5, while step S3 is performed, drawing N groups of samples from the experience replay pool and training the Eval Q network through the second loss function L2:
L2 = (1/N)·Σ_i (y_i' - Q_D(s_i, a_Di | θ))²
where Q_D is the action value output after the environment state s_i of sample i is input into the Eval Q network; θ is the Eval Q network parameter; y_i' is the second action estimation value,
y_i' = r_i + γ_D·max_a' Q'(s', a' | θ^Q')
where r_i is the reward and punishment value of executing action a_i, γ_D is the second discount rate, max_a' Q'(s', a' | θ^Q') is the maximum Target action value output after the next-moment environment state s' is input into the Target Q network, and θ^Q' denotes the Target Q network parameter; in this process, the Target Q network selects the maximum action value through s' to train the Eval Q network;
S6, after the Eval Actor network completes updating, the Target Actor network is updated in a soft-update mode using the following formula:
θ^μ' ← τ·θ^μ + (1 - τ)·θ^μ'
where τ is the soft-update parameter, θ^μ' is the Target Actor network parameter, and θ^μ is the Eval Actor network parameter;
after the Eval Critic1 network and the Eval Critic2 network complete updating, the Target Critic1 network and the Target Critic2 network are updated:
θ^Qj' ← τ·θ^Qj + (1 - τ)·θ^Qj', j = 1, 2
where τ is the soft-update parameter, θ^Qj' is the Target Critic1 network parameter or the Target Critic2 network parameter, and θ^Qj is the Eval Critic1 network parameter or the Eval Critic2 network parameter;
after the Eval Q network completes updating, the Target Q network is updated in a hard-update mode using the following formula:
θ^Q' ← θ
where θ^Q' is the Target Q network parameter and θ is the Eval Q network parameter;
S7, repeating steps S3 to S6 until the loss values of the Actor loss function J, the first loss function L1 and the second loss function L2 all converge; the deep reinforcement learning network training is then completed;
in the training process of the deep reinforcement learning network, the training frequency of the Eval Actor network, the Eval Critic1 network, the Eval Critic2 network and the Eval Q network is once per interaction between the deep reinforcement learning network and the interactive environment; the Target Actor network and the Target Q network are updated once every two interactions; the Target Critic1 network and the Target Critic2 network are updated once every eight interactions.
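To make the interaction and update schedule concrete, a schematic training loop is sketched below; env, replay_pool, discrete_actions, the optimizers and the batch layout are assumed stand-ins built around the helper functions sketched earlier, not interfaces defined by the patent or by the CARLA simulator.

```python
# Schematic only: all names below are assumed stand-ins.
state = env.reset()
for step in range(1, max_steps + 1):
    # S1: hybrid action = weighted fusion of continuous and discrete actions
    a_c = eval_actor(state)                                # continuous action a_Ct
    a_d_idx = eval_qnet(state).argmax()                    # index of discrete action a_Dt
    a_t = alpha * a_c + (1.0 - alpha) * discrete_actions[a_d_idx]
    next_state, r_t, done = env.step(a_t)                  # r_t from the reward/punishment function
    replay_pool.add(state, a_c, a_d_idx, r_t, next_state)  # S2: store the experience
    state = env.reset() if done else next_state

    if len(replay_pool) >= batch_size:
        states, a_cs, a_d_idxs, rewards, next_states = replay_pool.sample(batch_size)
        # S3-S5: Eval networks train once per interaction
        actor_update(eval_actor, eval_critic1, opt_actor, states)
        critic_update(eval_critic1, eval_critic2, target_critic1, target_critic2,
                      target_actor, opt_critics, states, a_cs, rewards, next_states, gamma_c)
        q_update(eval_qnet, target_qnet, opt_q, states, a_d_idxs, rewards, next_states, gamma_d)
        # S6: Target Actor / Target Q every two interactions
        if step % 2 == 0:
            soft_update(target_actor, eval_actor, tau)
            hard_update(target_qnet, eval_qnet)
        # Target Critic1 / Critic2 every eight interactions
        if step % 8 == 0:
            soft_update(target_critic1, eval_critic1, tau)
            soft_update(target_critic2, eval_critic2, tau)
```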
To verify the unmanned end-to-end decision method based on deep reinforcement learning, a simulation test was further conducted. Specifically, the CARLA open-source simulator was used, and an official town map was selected for training and testing.
Fig. 3(a) to 3(e) show screenshots of five consecutive states of the simulated automobile in the process of completing a turn in the simulated environment test scene. It can be seen from the figures that the simulated automobile turns smoothly using the proposed unmanned end-to-end decision method. Some of the test results are shown in Table 1 below.
Table 1:
Steering wheel angle Throttle opening degree Brake opening degree
0.08717041 0.00000000 0.35341561
0.15307951 0.00000000 0.52670780
0.09271482 0.00000000 0.71335390
0.09728579 0.00000000 0.80667695
0.13101204 0.00000000 0.29040372
0.17254703 0.18484006 0.00000000
0.28552827 0.45002510 0.00000000
0.35098195 0.68417078 0.00000000
0.32560289 0.60003130 0.00000000
0.27327946 0.51601205 0.00000000
0.49871063 0.52256383 0.00000000
0.55013824 0.55902611 0.00000000
0.29818016 0.57373144 0.00000000
0.43982193 0.53180445 0.00000000
0.27434003 0.58508871 0.00000000
0.41293526 0.60961419 0.00000000
0.53498042 0.77192284 0.00000000
0.54406005 0.83376914 0.00000000
0.36400995 0.93965306 0.00000000
0.41749707 0.99218750 0.00000000
0.70428425 0.99804688 0.00000000
0.50889337 0.99951172 0.00000000
0.50394905 0.99987793 0.00000000
0.34516448 0.99969482 0.00000000
0.40782961 0.99992371 0.00000000
0.48604059 0.99999809 0.00000000
0.34041631 0.99999523 0.00000000
-0.28926253 0.99999881 0.00000000
-0.38473549 1.00000000 0.00000000
-0.27320272 1.00000000 0.00000000
-0.14927354 1.00000000 0.00000000
-0.08594432 1.00000000 0.00000000
-0.03263356 1.00000000 0.00000000
Table 1 shows a series of action values output by the vehicle during the turning process in the simulation test, including the steering wheel angle, throttle and brake values. It is evident from the numerical results that the values change smoothly with little oscillation, enabling a smooth turn and meeting the driving requirements of an unmanned vehicle on an actual road.

Claims (7)

1. An unmanned end-to-end decision method based on deep reinforcement learning is characterized by comprising the steps of constructing and training a deep reinforcement learning network; wherein,
1) constructing a deep reinforcement learning network:
the deep reinforcement learning network comprises an Actor network consisting of an Eval Actor network and a Target Actor network, a Critic1 network consisting of an Eval Critic1 network and a Target Critic1 network, a Critic2 network consisting of an Eval Critic2 network and a Target Critic2 network, and a Q network group consisting of an Eval Q network and a Target Q network; wherein the Eval Actor network is used for receiving the current-time environment state s_t and outputting a continuous-space action a_Ct; the Target Actor network is used for outputting an action a' for training the Eval Critic1 network and the Eval Critic2 network; the Eval Critic1 network and the Eval Critic2 network are used for outputting an action value Q for training the Eval Actor network, and the Target Critic1 network and the Target Critic2 network are used for outputting action values Q_j' for training the Eval Critic1 network and the Eval Critic2 network; the Eval Q network is used for receiving the current-time environment state s_t, outputting the action values Q_D, and selecting the discrete-space action a_Dt with the highest value; the Target Q network is used for outputting the action value max_a' Q'(s', a' | θ^Q') for training the Eval Q network;
2) Training a deep reinforcement learning network, which comprises the following specific steps:
S1, initializing each network parameter of the deep reinforcement learning network and interacting with the interactive environment; the Eval Actor network receives the current-time environment state s_t and outputs a continuous-space action a_Ct; the Eval Q network receives the current-time environment state s_t and outputs a discrete-space action a_Dt; the continuous-space action a_Ct and the discrete-space action a_Dt are weighted and fused to obtain the executed action a_t; the executed action a_t and the next-time environment state s_t+1 are input into the reward and punishment function to obtain the reward and punishment value r_t of the executed action a_t;
S2, repeating step S1 until the training of the deep reinforcement learning network is completed, and continuously storing the historical experience information (s_t, a_Ct, a_Dt, r_t, s_t+1) at each moment as a group of samples in an experience replay pool;
S3, when the number of samples collected in the experience replay pool meets the sampling requirement, drawing N groups of samples from the experience replay pool and training the Eval Actor network through the Actor loss function J:
J = -(1/N)·Σ_i Q(s_i, a_i | θ^Q1)
where θ^μ is the Eval Actor network parameter; Q is the action value output after the environment state s_i and the executed action a_i of sample i are input into the Eval Critic1 network; θ^Q1 is the Eval Critic1 network parameter; μ denotes the Eval Actor network;
S4, while step S3 is performed, drawing N groups of samples from the experience replay pool and synchronously training the Eval Critic1 network and the Eval Critic2 network through the first loss function L1:
L1 = (1/N)·Σ_i (y_i - Q(s_i, a_i | θ^Qj))², j = 1, 2
where Q is the action value output after the environment state s_i and the executed action a_i of sample i are input into the Eval Critic1 network or the Eval Critic2 network; θ^Qj is the Eval Critic1 network parameter or the Eval Critic2 network parameter; y_i is the first action estimation value,
y_i = r_i + γ_C·min( Q_1'(s', a' | θ^Q1'), Q_2'(s', a' | θ^Q2') )
where r_i is the reward and punishment value of executing a_i, γ_C is the first discount rate, and Q_j' is the action value output after the next-moment environment state s' and the Target action a' are input into the Target Critic1 network or the Target Critic2 network; the Target action a' is the action output after the next-moment environment state s' is input into the Target Actor network; θ^Qj' is the Target Critic1 network parameter or the Target Critic2 network parameter; in this process, the Eval Critic1 network and the Eval Critic2 network are trained synchronously by taking the minimum of the action values output by the Target Critic1 network and the Target Critic2 network;
S5, while step S3 is performed, drawing N groups of samples from the experience replay pool and training the Eval Q network through the second loss function L2:
L2 = (1/N)·Σ_i (y_i' - Q_D(s_i, a_Di | θ))²
where Q_D is the action value output after the environment state s_i of sample i is input into the Eval Q network; θ is the Eval Q network parameter; y_i' is the second action estimation value,
y_i' = r_i + γ_D·max_a' Q'(s', a' | θ^Q')
where r_i is the reward and punishment value of executing action a_i, γ_D is the second discount rate, max_a' Q'(s', a' | θ^Q') is the maximum Target action value output after the next-moment environment state s' is input into the Target Q network, and θ^Q' denotes the Target Q network parameter; in this process, the Target Q network selects the maximum action value through s' to train the Eval Q network;
S6, after the Eval Actor network completes updating, the Target Actor network is updated in a soft-update mode using the following formula:
θ^μ' ← τ·θ^μ + (1 - τ)·θ^μ'
where τ is the soft-update parameter, θ^μ' is the Target Actor network parameter, and θ^μ is the Eval Actor network parameter;
after the Eval Critic1 network and the Eval Critic2 network complete updating, the Target Critic1 network and the Target Critic2 network are updated:
θ^Qj' ← τ·θ^Qj + (1 - τ)·θ^Qj', j = 1, 2
where τ is the soft-update parameter, θ^Qj' is the Target Critic1 network parameter or the Target Critic2 network parameter, and θ^Qj is the Eval Critic1 network parameter or the Eval Critic2 network parameter;
after the Eval Q network completes updating, the Target Q network is updated in a hard-update mode using the following formula:
θ^Q' ← θ
where θ^Q' is the Target Q network parameter and θ is the Eval Q network parameter;
S7, repeating steps S3 to S6 until the loss values of the Actor loss function J, the first loss function L1 and the second loss function L2 all converge; the deep reinforcement learning network training is then completed.
2. The unmanned end-to-end decision method based on deep reinforcement learning according to claim 1, wherein the current-time environment state s_t comprises the front road environment state and the vehicle's own state; the front road environment state is the front road feature code; the vehicle's own state includes the vehicle running speed, the steering wheel angle, the accelerator pedal opening and the brake pedal opening.
3. The deep reinforcement learning-based unmanned end-to-end decision method according to claim 2, characterized in that the method for acquiring the environmental state of the road ahead is as follows:
1) acquiring a front road picture in real time through an RGB (red, green and blue) camera arranged in front of a vehicle;
2) inputting the front road picture into a pre-trained network and acquiring the feature information code as the front road environment state; wherein the pre-trained network is obtained by sequentially inputting a plurality of front road pictures of an end-to-end decision picture data set and performing end-to-end imitation learning decision training for the unmanned vehicle, so that the network extracts the front road feature code of the unmanned vehicle.
4. The unmanned end-to-end decision method based on deep reinforcement learning of claim 2, characterized in that the vehicle running speed, steering wheel steering angle, opening degree of accelerator pedal and opening degree of brake pedal in the vehicle's own state are obtained by four sensors disposed at the vehicle transmission, steering wheel, accelerator pedal and brake pedal, respectively.
5. The unmanned end-to-end decision method based on deep reinforcement learning according to claim 1, wherein in step S1 the weighted fusion formula is: a_t = α × a_Ct + (1 - α) × a_Dt; where a_t is the executed action value, a_Dt is the discrete-space action value, a_Ct is the continuous-space action value, and α is the proportion of the continuous action.
6. The unmanned end-to-end decision method based on deep reinforcement learning according to claim 1, wherein in step S1 the reward and punishment function is: r_t = v × [1 - ω_t² - (|ω_t| - |ω_t-1|)²] - 12 × (l_ol + l_or) - r_c,
where v is the forward speed of the vehicle, ω_t and ω_t-1 are the steering wheel angles at the current moment and the previous moment respectively, l_ol and l_or are the intrusion ratios of the vehicle on the two sides of the road respectively, and r_c is the penalty in the event of a collision.
7. The unmanned end-to-end decision method based on deep reinforcement learning according to claim 1, wherein in the training process of the deep reinforcement learning network, the training frequency of the Eval Actor network, the Eval Critic1 network, the Eval Critic2 network and the Eval Q network is once per interaction between the deep reinforcement learning network and the interactive environment; the Target Actor network and the Target Q network are updated once every two interactions; the Target Critic1 network and the Target Critic2 network are updated once every eight interactions.
CN202110372793.XA 2021-04-07 2021-04-07 Unmanned end-to-end decision method based on deep reinforcement learning Active CN113104050B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110372793.XA CN113104050B (en) 2021-04-07 2021-04-07 Unmanned end-to-end decision method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110372793.XA CN113104050B (en) 2021-04-07 2021-04-07 Unmanned end-to-end decision method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN113104050A CN113104050A (en) 2021-07-13
CN113104050B true CN113104050B (en) 2022-04-12

Family

ID=76714310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110372793.XA Active CN113104050B (en) 2021-04-07 2021-04-07 Unmanned end-to-end decision method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113104050B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113715842B (en) * 2021-08-24 2023-02-03 华中科技大学 High-speed moving vehicle control method based on imitation learning and reinforcement learning
CN114179835B (en) * 2021-12-30 2024-01-05 清华大学苏州汽车研究院(吴江) Automatic driving vehicle decision training method based on reinforcement learning in real scene
CN114397817A (en) * 2021-12-31 2022-04-26 上海商汤科技开发有限公司 Network training method, robot control method, network training device, robot control device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213148A (en) * 2018-08-03 2019-01-15 东南大学 It is a kind of based on deeply study vehicle low speed with decision-making technique of speeding
CN111605565A (en) * 2020-05-08 2020-09-01 昆山小眼探索信息科技有限公司 Automatic driving behavior decision method based on deep reinforcement learning
CN111885137A (en) * 2020-07-15 2020-11-03 国网河南省电力公司信息通信公司 Edge container resource allocation method based on deep reinforcement learning
WO2021038781A1 (en) * 2019-08-29 2021-03-04 日本電気株式会社 Learning device, learning method, and learning program

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2016297852C1 (en) * 2015-07-24 2019-12-05 Deepmind Technologies Limited Continuous control with deep reinforcement learning
WO2020062911A1 (en) * 2018-09-26 2020-04-02 Huawei Technologies Co., Ltd. Actor ensemble for continuous control
US10940863B2 (en) * 2018-11-01 2021-03-09 GM Global Technology Operations LLC Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213148A (en) * 2018-08-03 2019-01-15 东南大学 It is a kind of based on deeply study vehicle low speed with decision-making technique of speeding
WO2021038781A1 (en) * 2019-08-29 2021-03-04 日本電気株式会社 Learning device, learning method, and learning program
CN111605565A (en) * 2020-05-08 2020-09-01 昆山小眼探索信息科技有限公司 Automatic driving behavior decision method based on deep reinforcement learning
CN111885137A (en) * 2020-07-15 2020-11-03 国网河南省电力公司信息通信公司 Edge container resource allocation method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on unmanned autonomous driving strategy based on deep recurrent reinforcement learning; Li Zhihang; Industrial Control Computer; 2020-04-25 (No. 04); pp. 49-57 *

Also Published As

Publication number Publication date
CN113104050A (en) 2021-07-13

Similar Documents

Publication Publication Date Title
CN113104050B (en) Unmanned end-to-end decision method based on deep reinforcement learning
CN110745136B (en) Driving self-adaptive control method
CN111061277B (en) Unmanned vehicle global path planning method and device
Codevilla et al. End-to-end driving via conditional imitation learning
CN110136481B (en) Parking strategy based on deep reinforcement learning
CN112232490B (en) Visual-based depth simulation reinforcement learning driving strategy training method
Min et al. Deep Q learning based high level driving policy determination
CN112433525A (en) Mobile robot navigation method based on simulation learning and deep reinforcement learning
Bai et al. Deep reinforcement learning based high-level driving behavior decision-making model in heterogeneous traffic
Ronecker et al. Deep Q-network based decision making for autonomous driving
CN109726804A (en) A kind of intelligent vehicle driving behavior based on driving prediction field and BP neural network personalizes decision-making technique
CN114153213A (en) Deep reinforcement learning intelligent vehicle behavior decision method based on path planning
Yuan et al. Multi-reward architecture based reinforcement learning for highway driving policies
CN114282433A (en) Automatic driving training method and system based on combination of simulation learning and reinforcement learning
CN113255054A (en) Reinforcement learning automatic driving method based on heterogeneous fusion characteristics
Capasso et al. End-to-end intersection handling using multi-agent deep reinforcement learning
CN113276883A (en) Unmanned vehicle driving strategy planning method based on dynamic generation environment and implementation device
Capasso et al. Intelligent roundabout insertion using deep reinforcement learning
CN113657433A (en) Multi-mode prediction method for vehicle track
Shen et al. Inverse reinforcement learning with hybrid-weight trust-region optimization and curriculum learning for autonomous maneuvering
Maramotti et al. Tackling real-world autonomous driving using deep reinforcement learning
CN116639124A (en) Automatic driving vehicle lane changing method based on double-layer deep reinforcement learning
Arbabi et al. Planning for autonomous driving via interaction-aware probabilistic action policies
Jaladi et al. End-to-end training and testing gamification framework to learn human highway driving
Gutiérrez-Moreno et al. Hybrid decision making for autonomous driving in complex urban scenarios

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant