CN113276883A - Unmanned vehicle driving strategy planning method based on dynamic generation environment and implementation device - Google Patents

Unmanned vehicle driving strategy planning method based on dynamic generation environment and implementation device

Info

Publication number
CN113276883A
CN113276883A
Authority
CN
China
Prior art keywords
vehicle
environment
strategy
road
unmanned vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110464610.7A
Other languages
Chinese (zh)
Other versions
CN113276883B (en)
Inventor
俞扬
詹德川
周志华
史正昕
罗凡明
袁雷
秦熔均
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202110464610.7A priority Critical patent/CN113276883B/en
Publication of CN113276883A publication Critical patent/CN113276883A/en
Application granted granted Critical
Publication of CN113276883B publication Critical patent/CN113276883B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00 Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001 Planning or execution of driving tasks
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0001 Details of the control system
    • B60W2050/0043 Signal treatments, identification of variables or parameters, parameter estimation or state estimation
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2552/00 Input parameters relating to infrastructure
    • B60W2552/50 Barriers
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Mechanical Engineering (AREA)
  • Transportation (AREA)
  • Human Computer Interaction (AREA)
  • Automation & Control Theory (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an unmanned vehicle driving strategy planning method based on a dynamically generated environment, together with an implementation device. (1) An unmanned driving environment is constructed in a simulator. (2) The reinforcement learning parameters and the network strategy model are initialized. (3) The vehicle interacts with the environment: the current state of the unmanned vehicle is collected, the strategy network samples an action, and the action is executed in the simulator, leading to a new state. (4) The cumulative reward of the vehicle in a section of the generated environment and whether the task was completed successfully are collected, and a new section of generated environment is created. (5) For the driving strategy of the vehicle, after a certain number of reinforcement learning samples have been collected by repeating step (3), a strategy iteration is performed. (6) The environments faced by the vehicle are continuously generated as in (4); based on the collected success flags and cumulative rewards, the environment parameters of failed roads and low-cumulative-reward roads are extracted, and those road sections are trained repeatedly. (7) The training of the above steps continues until the strategy converges.

Description

Unmanned vehicle driving strategy planning method based on dynamic generation environment and implementation device
Technical Field
The invention relates to an unmanned vehicle driving strategy planning method based on a dynamically generated environment and an implementation device, and belongs to the technical field of unmanned vehicles.
Background
Unmanned vehicles have been one of the most active research fields in recent years: unmanned driving can change how people travel and improve traffic safety, and therefore has great application prospects. With the help of existing artificial intelligence techniques, it can effectively promote the development of the automobile industry, reduce manpower and material costs, and achieve efficient resource utilization. With the gradual maturation and application of deep learning in recent years, deep reinforcement learning has achieved a degree of success in games, recommendation systems, unmanned driving and other fields. Applied to unmanned driving, reinforcement learning can, to a certain extent, achieve efficient driving strategy training through a suitably designed reward function.
Because trial and error with a real unmanned vehicle is too costly, reinforcement-learning-based algorithms generally build a driving scene in a simulator for sample collection and training, and then migrate the strategy to the real environment. The current mainstream approach to simulator-based vehicle training selects a closed campus and builds a 1:1 model of it in the simulator. Such methods have several limitations. First, 1:1 modeling requires fine manual work on road surfaces, road edges and curves, consuming many repetitive operations. Second, because the environment is modeled 1:1, properties such as road width and style are hard to modify afterwards. Third, once the closed-campus model is built, the unmanned vehicle strategy trained in the simulator is usually suitable only for that selected campus and generalizes poorly: when the previously trained strategy faces similar but altered scenes, such as changed road widths or turning amplitudes, it cannot achieve the expected effect, and if the vehicle strategy is to be applied to a new campus, the simulator scene must be rebuilt and the strategy model retrained from scratch.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the defects of the prior art, namely that the environment scenes used for simulator training are single and lack diversity, and that driving strategies trained in traditional simulators generalize poorly, the invention provides an unmanned vehicle driving strategy planning method based on a dynamically generated environment, together with an implementation device, which achieves efficient training from the simulator to the real environment by dynamically adjusting the environment parameters inside the simulator.
The technical scheme is as follows: a method for planning the driving strategy of an unmanned vehicle based on a dynamically generated environment, which achieves efficient training from the simulator to the real environment by dynamically adjusting environment parameters inside the simulator, comprises the following steps:
(1) Construct the unmanned driving environment in the UE4 simulator using a parameter-based approach.
(2) Initialize the reinforcement learning parameters and the network strategy model, and begin training the unmanned vehicle; through a designed reward function, the vehicle is trained to reach a specified target point safely and efficiently.
(3) Interact with the environment: collect the current state of the unmanned vehicle in the dynamically generated environment, let the strategy network sample an action, execute the sampled action in the simulator, and enter a new state.
(4) One task of the unmanned vehicle in the dynamically generated environment begins when it enters a road section and ends when it collides or reaches the target point of that section. After a task ends, collect the cumulative reward of the unmanned vehicle in that section of the generated environment and whether it completed the task successfully, and create a new section of generated environment with a new set of parameters.
(5) For the driving strategy of the unmanned vehicle, after repeating step (3) until a certain number of reinforcement learning samples have been collected, perform a strategy iteration.
(6) Continue generating the environments faced by the unmanned vehicle as in step (4); based on the collected success flags and cumulative rewards, extract the environment parameters of failed roads and low-cumulative-reward roads and train those sections repeatedly, so that the unmanned vehicle receives more targeted training.
(7) Continue the training of steps (3) to (6) until the strategy converges, yielding a trained strategy model.
In step (1), the environment parameters include road parameters for modeling a section of road. A second-order Bezier curve serves as the road main line, which is expanded into a drivable road model for the unmanned vehicle using a mesh method; the shape of the road, the size of its corners and the road width are determined by controlling the environment parameters. The environment parameters also control dynamic and static obstacles, specifically the placement positions of static obstacles and the movement patterns of dynamic obstacles, so that the static obstacles and moving pedestrians of real scenes can be simulated, together yielding a rich training environment.
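As a concrete illustration, the following is a minimal sketch of what such an environment parameter set could look like; all names and value ranges (RoadSectionParams, road_width, and so on) are illustrative assumptions, not the patent's actual data layout.

```python
import random
from dataclasses import dataclass, field

@dataclass
class RoadSectionParams:
    """Hypothetical parameter set for one generated road section."""
    d1: float                # distance from P0 to P1 along the previous tangent
    d2: float                # distance from P1 to P2
    theta: float             # turning angle of the new section, in radians
    road_width: float        # lateral width of the drivable surface
    static_obstacles: list = field(default_factory=list)   # (curve_t, lateral_offset) pairs
    dynamic_obstacles: list = field(default_factory=list)  # (start_t, end_t, speed) triples

def sample_section_params():
    """Randomly sample one section; the randomness supplies the scene diversity."""
    return RoadSectionParams(
        d1=random.uniform(5.0, 20.0),
        d2=random.uniform(5.0, 20.0),
        theta=random.uniform(-1.0, 1.0),
        road_width=random.uniform(4.0, 10.0),
        static_obstacles=[(random.random(), random.uniform(-1.5, 1.5))
                          for _ in range(random.randint(0, 3))],
        dynamic_obstacles=[(random.random(), random.random(), random.uniform(0.5, 2.0))
                           for _ in range(random.randint(0, 2))],
    )
```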
Splicing roads for the unmanned vehicle to drive on: the parameter-generated roads are combined with special road scenes, whose characteristic is that they connect seamlessly to the parameter-generated roads, together forming a continuous dynamically generated environment in which the unmanned vehicle trains. The special road scenes include challenging roads such as narrow roads, crossroads and T-junctions; this combined approach makes the training samples richer, so that compared with traditional methods the strategy copes better with complex environments.
The simulator approximately simulates a Markov decision process in quadruple form (O, A, P, R), provides observation information consistent with the unmanned vehicle in the real scene, and accepts throttle, steering and other commands consistent with the real unmanned vehicle; sampling is performed in the Markov process provided by the simulator to train the unmanned vehicle. O, A, P and R denote vehicle state information, vehicle actions, state transition probabilities and rewards, respectively. The state includes the radar information received by the unmanned vehicle; its current speed, throttle and steering-angle state; and information describing the preview points of the road ahead. A vehicle action consists of a throttle command and a steering-angle command. The reward is a combination of several reward functions: a progressive positive reward for advancing toward the target point, a negative reward for colliding with an obstacle, and a positive reward for completing the task.
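To make the composition of the reward concrete, a minimal sketch follows; the weights and magnitudes below are assumptions for illustration, not values disclosed by the patent.

```python
def compute_reward(progress_gain, collided, reached_goal,
                   w_progress=1.0, collision_penalty=-100.0, goal_bonus=100.0):
    """Combine the three reward terms described above.

    progress_gain: reduction in distance to the section target since the last step
    collided:      whether the vehicle hit an obstacle or wall this step
    reached_goal:  whether the section's target point was reached this step
    All weights are illustrative assumptions.
    """
    reward = w_progress * progress_gain   # progressive positive reward toward the target
    if collided:
        reward += collision_penalty       # negative reward for collision
    if reached_goal:
        reward += goal_bonus              # positive reward for completing the task
    return reward
```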
A neural network is used as the strategy model for driving the unmanned vehicle. Based on observation information collected in the simulator, the strategy outputs the action to be executed, controlling the unmanned vehicle in the simulator, while the simulator environment returns a reward. Samples are collected during this process, and the current strategy is optimized with the proximal policy optimization algorithm PPO until it converges.
The radar information refers to the distances between the vehicle body and surrounding obstacles, represented by radar rays cast around the vehicle body. The preview-point information of the road ahead is the waypoint information planned ahead of the moving vehicle; it informs the vehicle of its future direction of travel and is especially valuable at crossroads and T-junctions, or when a static obstacle masks the shape of the road.
The trained strategy model can be migrated to a real scene: a campus for the unmanned vehicle to drive in is selected, the start and end points of the vehicle are chosen, and obstacles blocking the vehicle may be placed in the scene. The driving strategy trained in the simulator is then invoked to control the vehicle, realizing automatic driving of the vehicle.
An apparatus for implementing a dynamically generated environment based unmanned vehicle driving strategy planning method, comprising:
(1) radar mounted on unmanned vehicle body: for obtaining obstacle distance information in the vicinity of the unmanned vehicle.
(2) CAN equipment: the system is used for transmitting the current state of the vehicle to the trained strategy model and sending the control information given by the strategy model to the vehicle chassis.
(3) Differential GPS: the method is used for acquiring the current longitude and latitude position of the vehicle.
(4) A memory: for storing the trained strategy model.
(5) A processor: used for computing the control information with the strategy model after receiving the observation information, and for sending the control information out through the CAN equipment.
Compared with the prior art, the invention has the following advantages:
1) Compared with traditional simulator-based methods, the unmanned vehicle driving strategy trained in the dynamically generated environment achieves a more stable effect when facing new scenes, and by introducing dynamic and static obstacles the vehicle strategy acquires the ability to avoid obstacles and pedestrians.
2) Dynamically modeling roads from a second-order Bezier curve main line requires far less repetitive manual work than 1:1 modeling, makes the road easy to control through environment parameters, and keeps multi-section generated roads continuous and smooth. Connecting components such as crossroads and T-junctions are added at the joints; this processing brings the dynamically generated environment in the simulator closer to real roads and reduces the gap between the real environment and the simulator environment.
3) When roads are modeled through environment parameters, the nature of reinforcement learning makes it possible, after the unmanned vehicle finishes training on a road section, to record the cumulative reward of that section and whether the task succeeded. Combining the parameter description of the road with the vehicle's performance on it, scenes where the vehicle performs poorly can be trained repeatedly, improving the robustness of the strategy.
Drawings
FIG. 1 is a schematic diagram of a method for modeling a section of road based on a second-order Bezier curve according to the present invention;
FIG. 2 is a schematic diagram of obstacle avoidance in a dynamically generated environment according to the present invention;
FIG. 3 is a schematic illustration of training an unmanned vehicle driving strategy in a dynamically generated environment;
FIG. 4 is a schematic diagram of a hardware facility in a real scene according to the present invention;
FIG. 5 is a flow chart of policy usage in the migration to a real scenario of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
An unmanned vehicle driving strategy planning method based on a dynamically generated environment comprises the following steps:
Step one:
A training environment is constructed in the simulator with code. First, road parameters are used to construct a road section, as shown in fig. 1(a). To keep P_0 of the new curve section coincident with P'_2 of the previous section, and to keep the curve smooth, the direction P_0P_1 of the new section is set equal to the direction P'_1P'_2 of the previous section. The points P_1 and P_2 of the new section are then determined from the distances d1, d2 and the angle θ. From the three points P_0, P_1, P_2 the new curve is obtained via the second-order Bezier curve function B(t):

B(t) = (1 - t)^2 P_0 + 2t(1 - t) P_1 + t^2 P_2,  t ∈ [0, 1]

By sampling t, a set of intermediate points on the curve is obtained; these are used in subsequent steps to build the road model from the curve, and serve as preview-point information representing the vehicle's future driving plan.
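A minimal sketch of how the curve sampling and the smooth splicing described above could be computed (Python with numpy; the function names and the 2D setting are illustrative assumptions):

```python
import numpy as np

def bezier_points(p0, p1, p2, n=50):
    """Sample n intermediate points of the second-order Bezier curve
    B(t) = (1-t)^2 P0 + 2t(1-t) P1 + t^2 P2, t in [0, 1]."""
    p0, p1, p2 = map(np.asarray, (p0, p1, p2))
    t = np.linspace(0.0, 1.0, n)[:, None]
    return (1 - t) ** 2 * p0 + 2 * t * (1 - t) * p1 + t ** 2 * p2

def next_control_points(prev_p1, prev_p2, d1, d2, theta):
    """Place P0, P1, P2 of the next section so the spliced curve stays smooth:
    P0 coincides with the previous P'2 and P0->P1 continues the direction
    P'1->P'2; P2 is then offset by angle theta (2D points assumed)."""
    direction = (prev_p2 - prev_p1) / np.linalg.norm(prev_p2 - prev_p1)
    p0 = prev_p2
    p1 = p0 + d1 * direction
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    p2 = p1 + d2 * (rot @ direction)
    return p0, p1, p2
```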
Step two:
After a set of second-order Bezier curve intermediate points has been obtained with the above method, the intermediate points are expanded into a road using a mesh component (as shown in fig. 1(b)). During expansion, finite differences of consecutive intermediate points give the forward direction at each point. Direction vectors perpendicular to this forward direction and lying in the horizontal plane are then computed; for each intermediate point, moving along this direction vector by half the road width in the positive and negative directions yields the left and right boundary points of the road, which determine its shape. These boundary points form the vertices of the mesh and are connected in order to form its triangular faces (as shown in fig. 1(b)). In the UE4 simulator, after the road plane has been drawn this way, a collision body is added to make it a drivable surface, and walls are generated on both sides of the road by a similar method; the road surface and the walls together form the body of a road section.
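The boundary-point construction described above could be sketched as follows; this illustrates the technique under a horizontal 2D assumption and is not the UE4 implementation itself:

```python
import numpy as np

def road_boundaries(midpoints, road_width):
    """Expand curve midpoints into left/right road boundary points.

    Forward directions come from finite differences of consecutive midpoints;
    each boundary point is the midpoint shifted +/- half the road width along
    the in-plane perpendicular. Consecutive boundary pairs then form the
    triangles of the mesh, as in fig. 1(b).
    """
    midpoints = np.asarray(midpoints, dtype=float)              # shape (n, 2)
    forward = np.gradient(midpoints, axis=0)                    # finite differences
    forward /= np.linalg.norm(forward, axis=1, keepdims=True)
    normal = np.stack([-forward[:, 1], forward[:, 0]], axis=1)  # perpendicular
    left = midpoints + 0.5 * road_width * normal
    right = midpoints - 0.5 * road_width * normal
    return left, right
```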
Step three:
After a road body has been obtained, the difficulty of the road is increased by adding dynamic and static obstacles, whose generation is also controlled by parameters. Specifically: (1) the placement of a static obstacle is controlled by the position of a Bezier-curve intermediate point and an offset from that point; (2) the movement of a dynamic obstacle is rule-driven: it shuttles back and forth at a certain speed between a start point and a target point, and carries its own radar rays so that it can stop or accelerate when it sees a vehicle, thereby simulating pedestrian behavior in real situations. The start position, target position and movement speed of a dynamic obstacle are all determined by parameters. The generated environment after adding obstacles is shown in fig. 2.
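A minimal sketch of the rule-driven dynamic obstacle described above; the see-vehicle test is abstracted into a boolean argument, and the class interface is an assumption for illustration:

```python
import numpy as np

class DynamicObstacle:
    """Moving obstacle shuttling between a start and a target point.

    (start, target, speed) are the generation parameters mentioned above.
    `vehicle_visible` stands in for the obstacle's own radar rays; here the
    obstacle simply stops when it sees a vehicle, one of the two reactions
    (stop or accelerate) described in the text.
    """
    def __init__(self, start, target, speed):
        self.start = np.asarray(start, dtype=float)
        self.target = np.asarray(target, dtype=float)
        self.speed = speed
        self.pos = self.start.copy()
        self.heading = 1.0   # +1 toward target, -1 back toward start

    def step(self, dt, vehicle_visible):
        if vehicle_visible:              # pedestrian-like: stop for the car
            return self.pos
        goal = self.target if self.heading > 0 else self.start
        delta = goal - self.pos
        dist = np.linalg.norm(delta)
        if dist < self.speed * dt:       # reached one end: turn around
            self.pos = goal.copy()
            self.heading = -self.heading
        else:
            self.pos = self.pos + (self.speed * dt / dist) * delta
        return self.pos
```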
The second-order Bezier road generation method smoothly connects the roads of two consecutive generated environments, so that the unmanned vehicle can drive continuously through multiple sections of the generated environment.
Step four:
The road scenes composing the unmanned dynamically generated environment are not limited to roads formed from second-order Bezier curves. As shown in the left half of fig. 3, various roads such as crossroads, T-junctions and straight roads can also be built with the mesh method during road generation, giving the unmanned vehicle richer and more diverse training. In addition, exploiting the nature of reinforcement learning, from the moment the vehicle starts driving on a road section until it collides or completes the section, the cumulative reward of that section and whether the driving task was completed successfully are recorded. These records feed two parameter sets: a failure set storing the parameters of failed roads, and a low-cumulative-reward set maintained as a max-heap, holding the batch of successful road parameter sets with the lowest cumulative rewards. When parameters for a new generated environment are selected, a parameter set is drawn from these two sets with a set probability. In this way, targeted training can be carried out on the scenes where the vehicle is currently weakest, as in the sketch below.
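The bookkeeping of the two parameter sets and the probabilistic replay could look like the following sketch; the capacity and replay probability are assumed values, and the class name is illustrative:

```python
import heapq
import random

class HardCaseBuffer:
    """Failure set plus a max-heap of the k lowest-reward successful roads.

    Python's heapq is a min-heap, so rewards are stored negated: popping the
    smallest negated value evicts the HIGHEST-reward entry, keeping the lowest.
    """
    def __init__(self, capacity=64, replay_prob=0.3):
        self.failed = []        # parameter sets of failed road sections
        self.low_reward = []    # heap of (-cum_reward, tie, params)
        self.capacity = capacity
        self.replay_prob = replay_prob
        self._count = 0         # tie-breaker so heapq never compares params

    def record(self, params, cum_reward, success):
        """Called once per finished road section."""
        if not success:
            self.failed.append(params)
            return
        self._count += 1
        heapq.heappush(self.low_reward, (-cum_reward, self._count, params))
        if len(self.low_reward) > self.capacity:
            heapq.heappop(self.low_reward)   # evict the highest-reward entry

    def next_params(self, fresh_sampler):
        """With probability replay_prob replay a hard case, else sample fresh."""
        pool = self.failed + [p for _, _, p in self.low_reward]
        if pool and random.random() < self.replay_prob:
            return random.choice(pool)
        return fresh_sampler()
```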
Step five:
The method of training the unmanned vehicle in the dynamically generated environment is shown in the right half of fig. 3. The invention optimizes the driving strategy of the unmanned vehicle with the proximal policy optimization algorithm PPO (Proximal Policy Optimization). In the network structure, a combination of fully connected layers and relu activation functions is used, and an lstm layer is added to incorporate temporal information, allowing the vehicle to perform the obstacle-avoidance task better.
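A sketch of such a network in PyTorch, under the assumption of a Gaussian action head for throttle and steering; the layer sizes are illustrative, not taken from the patent:

```python
import torch
import torch.nn as nn

class DrivingPolicy(nn.Module):
    """Actor-critic network matching the structure described above: fully
    connected layers with ReLU, plus an LSTM layer for temporal context."""

    def __init__(self, obs_dim, act_dim=2, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, act_dim)       # throttle and steering means
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.value = nn.Linear(hidden, 1)          # critic head V(s)

    def forward(self, obs_seq, state=None):
        # obs_seq: (batch, time, obs_dim)
        h, state = self.lstm(self.encoder(obs_seq), state)
        dist = torch.distributions.Normal(self.mu(h), self.log_std.exp())
        return dist, self.value(h).squeeze(-1), state
```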
Step six:
The unmanned vehicle interacts with the environment generated as in steps three and four, which can be defined as a quadruple (O, A, P, R) of a Markov decision process (MDP). In the UE4 simulator, observation information O is collected once every 100 ms; it includes the radar information simulated in the simulator, vehicle state information such as the current speed, and preview-point information. The preview points are obtained and used as follows: with the Bezier-curve intermediate points from step one and the current position of the vehicle, a set of forward preview points is computed, and their positions relative to the vehicle body are used as the preview-point information. This information indicates the direction the vehicle should take next and guides it to drive better; in particular at crossroads, the preview points indicate which branch of the crossroads the vehicle should take.
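The preview-point computation could be sketched as follows, assuming 2D positions and a known vehicle yaw; the nearest-curve-point selection is an illustrative simplification:

```python
import numpy as np

def preview_points(curve_points, vehicle_pos, vehicle_yaw, n_ahead=5):
    """Pick the next n_ahead curve midpoints in front of the vehicle and
    express them in the vehicle's body frame, as described above."""
    curve_points = np.asarray(curve_points, dtype=float)
    vehicle_pos = np.asarray(vehicle_pos, dtype=float)
    nearest = int(np.argmin(np.linalg.norm(curve_points - vehicle_pos, axis=1)))
    ahead = curve_points[nearest + 1: nearest + 1 + n_ahead]
    c, s = np.cos(-vehicle_yaw), np.sin(-vehicle_yaw)
    rot = np.array([[c, -s], [s, c]])      # world -> body rotation
    return (ahead - vehicle_pos) @ rot.T   # positions relative to the body
```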
Step seven:
The observation information O is passed to the neural network; the policy network samples an action A to be executed, consisting of the throttle and steering commands of the unmanned vehicle, and the action is passed to the simulator for execution, yielding a reward R.
Step eight:
Step six is executed repeatedly, and the observation, action and reward information collected during interaction is stored as samples; when enough samples have been gathered, the strategy is updated. The optimization objective used in the proximal policy optimization algorithm is:

L(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 - ε, 1 + ε) Â_t ) ]

where θ is the parameter to be optimized in the algorithm and Â_t is the advantage function at time t, i.e. Q(s_t, a_t) - V(s_t). Q(s_t, a_t), called the state-action value function, represents the cumulative reward the vehicle accrues when it starts from state s_t, performs action a_t, and then follows the current strategy; V(s_t), called the state value function, represents the cumulative reward the vehicle accrues when it starts from state s_t and follows the current strategy. Here Q(s_t, a_t) is taken as the discounted cumulative reward from the current state to the end of a trajectory, and V(s_t) is computed by the critic network, so Â_t can be written as:

Â_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... + γ^{T-t} r_T - V(s_t)

where γ is the discount factor, set to γ = 0.99 in this experiment. To compare the current strategy π_θ(a_t|s_t) with the old strategy π_θold(a_t|s_t), the clipping idea of PPO2 is used: the probability ratio r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t) replaces the KL penalty of PPO1, and its value is clipped to lie between 1 - ε and 1 + ε, where ε is a hyper-parameter that determines the clipping range and prevents the parameters from updating too fast.
In this way, samples are acquired and network parameters updated repeatedly, and training ends when the maximum number of iterations set for training is reached.
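The clipped objective above corresponds to the following loss sketch (negated so that minimizing it maximizes the objective); eps plays the role of ε:

```python
import torch

def ppo_clip_loss(new_log_prob, old_log_prob, advantage, eps=0.2):
    """Clipped surrogate objective of PPO (the 'PPO2' clipping described above).

    new_log_prob / old_log_prob: log pi_theta(a_t|s_t) and log pi_theta_old(a_t|s_t)
    advantage: the advantage estimate A_t = Q(s_t, a_t) - V(s_t)
    eps: the clipping hyper-parameter. Returns a loss to minimize.
    """
    ratio = torch.exp(new_log_prob - old_log_prob)      # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # clip to [1-eps, 1+eps]
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```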
Step nine:
After the unmanned vehicle driving strategy trained in the dynamically generated environment has been obtained, the strategy can be migrated to the real environment.
The steps for migrating the unmanned vehicle strategy to the real environment are shown in fig. 5:
step 01: and selecting a loaded and trained strategy model.
Step 02: and (4) with the help of map information, after a starting point and a target point are selected, planning a path by using an A-star algorithm, simulating the middle point of the Bezier curve mentioned in the step one by using a planned path point, and calculating the information of the pre-aiming point by using the method in the step six.
Step 03: as shown in fig. 5, vehicle state information such as radar data and vehicle speed and the preview point information are combined and transmitted to the policy model to obtain an action command.
Step 04: after the unmanned vehicle executes the action command in Step03 for a while, next state information is acquired.
Step 05: and repeatedly executing Step04-Step05 until the unmanned vehicle finishes the driving task.
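Steps 01-05 amount to the following control loop; `planner`, `radar`, `can_bus` and `policy_model.act` are hypothetical interfaces standing in for the A* planner, the on-board radar, the CAN device and the trained strategy model, so only the loop structure follows the text:

```python
import time
import numpy as np

def drive(policy_model, planner, radar, can_bus, dt=0.1):
    """Real-environment control loop covering Steps 01-05 (interfaces assumed)."""
    lstm_state = None
    while not planner.task_finished():
        # Step 03: combine radar data, vehicle state and preview points
        obs = np.concatenate([
            np.ravel(radar.scan()),                  # obstacle distances
            np.ravel(can_bus.vehicle_state()),       # speed, throttle, steering angle
            np.ravel(planner.preview(can_bus.position(), can_bus.yaw())),
        ]).astype(np.float32)
        action, lstm_state = policy_model.act(obs, lstm_state)
        # Step 04: execute the command on the chassis, then wait one interval
        can_bus.send(throttle=float(action[0]), steer=float(action[1]))
        time.sleep(dt)
```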
As shown in fig. 4, the apparatus for implementing the unmanned vehicle driving strategy planning method based on the dynamically generated environment includes:
(1) radar mounted on unmanned vehicle body: for obtaining obstacle distance information in the vicinity of the unmanned vehicle.
(2) CAN equipment: the system is used for transmitting the current state of the vehicle to the trained strategy model and sending the control information given by the strategy model to the vehicle chassis.
(3) Differential GPS: used for acquiring the current longitude and latitude of the vehicle; together with the radar information and the collected vehicle body state, this forms the observation information sent to the processor.
(4) A memory: for storing the trained strategy model.
(5) A processor: used for computing the control information with the strategy model after receiving the observation information, and for sending the control information out through the CAN equipment.
The method parameterizes the road and uses parameters to introduce the dynamic and static obstacles that block the vehicle's progress in the road scene, giving the vehicle obstacle-avoidance capability. Since the generated environment formed jointly by road and obstacles is fully determined by its parameters, the vehicle can be trained many times on the parameter combinations where it performs poorly, making the training more targeted. The combined connection of multiple road sections allows special sections such as crossroads, narrow roads and T-junctions to be added during splicing while keeping the road coherent, so the vehicle can learn additional special scenes as desired; this splicing mode also accommodates new road scenes, further enriching the training scenarios. The vehicle strategy model trained in the generated environment generalizes strongly: it can be applied directly to a closed campus environment built with UE4, and the strategy can further be migrated to a vehicle in the real environment.

Claims (10)

1. A method for planning the driving strategy of an unmanned vehicle based on a dynamically generated environment, characterized in that efficient training from the simulator to the real environment is achieved by dynamically adjusting environment parameters inside the simulator, the method comprising the following steps:
(1) constructing an unmanned driving environment in a simulator using a parameter-based approach;
(2) initializing reinforcement learning parameters and a network strategy model, starting to train the unmanned vehicle, and training the vehicle through a designed reward function to reach a specified target point;
(3) interacting with the environment: collecting the current state of the unmanned vehicle in the dynamically generated environment, the strategy network sampling an action, executing the sampled action in the simulator and entering a new state;
(4) a task of the unmanned vehicle in the dynamically generated environment beginning when the vehicle enters a road section and ending when it collides or reaches the target point of the section, and a new section of generated environment being established after the task ends;
(5) for the driving strategy of the unmanned vehicle, after repeating the operation of step (3) and collecting a preset number of reinforcement learning samples, performing a strategy iteration;
(6) for the environments faced by the unmanned vehicle, training a plurality of times on set parameter combinations;
(7) continuing the training of the above steps until the strategy converges, obtaining a trained strategy model.
2. The unmanned vehicle driving strategy planning method based on a dynamically generated environment according to claim 1, characterized in that in step (4), a task of the unmanned vehicle in the dynamically generated environment begins when it enters a road section and ends when it collides or reaches the target point of the section; after the task ends, the cumulative reward of the unmanned vehicle in that section of the generated environment and whether it completed the task successfully are collected, and a new section of generated environment is established with a new set of parameters.
3. The unmanned vehicle driving strategy planning method based on a dynamically generated environment according to claim 2, characterized in that in step (6), for the environments faced by the unmanned vehicle, environments are continuously generated in the manner of step (4); based on the collected success flags and cumulative rewards, the environment parameters of failed roads and low-cumulative-reward roads are extracted, and the unmanned vehicle is trained repeatedly on those road sections, so that it receives more targeted training.
4. The unmanned vehicle driving strategy planning method based on a dynamically generated environment according to claim 1, characterized in that in step (1) the environment parameters include road parameters for modeling a section of road; a second-order Bezier curve is used as the road main line, which is expanded into a drivable road model for the unmanned vehicle using a mesh method; the shape of the road, the size of its corners and the road width are determined by controlling the environment parameters; the environment parameters also control dynamic and static obstacles, in particular the placement positions of static obstacles and the movement patterns of dynamic obstacles, so as to simulate the static obstacles and moving pedestrians of real scenes.
5. The unmanned vehicle driving strategy planning method based on a dynamically generated environment according to claim 4, characterized in that roads for the unmanned vehicle to drive on are spliced: the parameter-generated roads are combined with special road scenes to jointly form a continuous dynamically generated environment for training the unmanned vehicle; the special road scenes include crossroads, T-junctions and narrow roads.
6. The unmanned vehicle driving strategy planning method based on a dynamically generated environment according to claim 1, characterized in that the simulator is configured to approximately simulate a Markov decision process in quadruple form (O, A, P, R), to provide observation information consistent with the unmanned vehicle in the real scene, and to accept throttle and steering commands consistent with the real unmanned vehicle; sampling is performed in the Markov process provided by the simulator to train the unmanned vehicle; O, A, P, R denote vehicle state information, vehicle actions, state transition probabilities and rewards, respectively; the state includes the radar information received by the unmanned vehicle, the current speed, throttle and steering-angle state of the unmanned vehicle, and information describing the preview points of the road ahead; a vehicle action comprises a throttle command and a steering-angle command; the reward is a combination of a plurality of reward functions, including a progressive positive reward for advancing toward the target point, a negative reward for colliding with an obstacle, and a positive reward for completing the task.
7. The unmanned vehicle driving strategy planning method based on a dynamically generated environment according to claim 1, characterized in that a neural network is used as the strategy model for driving the unmanned vehicle; according to observation information collected in the simulator, the strategy outputs the action to be executed to control the unmanned vehicle in the simulator, while the simulator environment returns a reward; samples are collected during this process, and the current strategy is optimized with the proximal policy optimization algorithm PPO until the strategy converges.
8. The unmanned vehicle driving strategy planning method based on a dynamically generated environment according to claim 6, characterized in that the radar information refers to the distances between the vehicle body and surrounding obstacles, represented by radar rays cast around the vehicle body; the preview-point information of the road ahead is the waypoint information planned ahead of the moving vehicle, which informs the vehicle of its future direction of travel.
9. The unmanned vehicle driving strategy planning method based on a dynamically generated environment according to claim 1, characterized in that the trained strategy model is migrated to a real scene: a campus for the unmanned vehicle to drive in is selected, the start and end points of the vehicle are chosen, and obstacles blocking the vehicle may be placed in the scene; the driving strategy trained in the simulator is invoked to control the vehicle, realizing automatic driving of the vehicle.
10. An apparatus for implementing a dynamically generated environment based unmanned vehicle driving strategy planning method, comprising:
(1) radar mounted on the unmanned vehicle body: for obtaining obstacle distance information in the vicinity of the unmanned vehicle;
(2) CAN equipment: the system is used for transmitting the current state of the vehicle to the trained strategy model and sending the control information given by the strategy model to the vehicle chassis;
(3) differential GPS: the system is used for acquiring the current longitude and latitude position of the vehicle;
(4) a memory: the strategy model is used for storing the trained strategy model;
(5) a processor: for computing the control information with the strategy model after receiving the observation information, and for sending the control information out through the CAN equipment.
CN202110464610.7A 2021-04-28 2021-04-28 Unmanned vehicle driving strategy planning method and implementation device based on dynamic generation environment Active CN113276883B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110464610.7A CN113276883B (en) 2021-04-28 2021-04-28 Unmanned vehicle driving strategy planning method and implementation device based on dynamic generation environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110464610.7A CN113276883B (en) 2021-04-28 2021-04-28 Unmanned vehicle driving strategy planning method and implementation device based on dynamic generation environment

Publications (2)

Publication Number Publication Date
CN113276883A true CN113276883A (en) 2021-08-20
CN113276883B CN113276883B (en) 2023-04-21

Family

ID=77277555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110464610.7A Active CN113276883B (en) 2021-04-28 2021-04-28 Unmanned vehicle driving strategy planning method and implementation device based on dynamic generation environment

Country Status (1)

Country Link
CN (1) CN113276883B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114117829A (en) * 2022-01-24 2022-03-01 清华大学 Dynamic modeling method and system for man-vehicle-road closed loop system under limit working condition
CN114491739A (en) * 2021-12-30 2022-05-13 深圳市优必选科技股份有限公司 Construction method and device of road traffic system, terminal equipment and storage medium
CN114596553A (en) * 2022-03-11 2022-06-07 阿波罗智能技术(北京)有限公司 Model training method, trajectory prediction method and device and automatic driving vehicle
CN114770497A (en) * 2022-03-31 2022-07-22 中国人民解放军陆军工程大学 Search and rescue method and device of search and rescue robot and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110647839A (en) * 2019-09-18 2020-01-03 深圳信息职业技术学院 Method and device for generating automatic driving strategy and computer readable storage medium
CN110745136A (en) * 2019-09-20 2020-02-04 中国科学技术大学 Driving self-adaptive control method
CN111026127A (en) * 2019-12-27 2020-04-17 南京大学 Automatic driving decision method and system based on partially observable transfer reinforcement learning
CN111473794A (en) * 2020-04-01 2020-07-31 北京理工大学 Structural road unmanned decision planning method based on reinforcement learning
CN111483468A (en) * 2020-04-24 2020-08-04 广州大学 Unmanned vehicle lane change decision-making method and system based on confrontation and imitation learning
US20200372822A1 (en) * 2019-01-14 2020-11-26 Polixir Technologies Limited Training system for autonomous driving control policy
US20200372410A1 (en) * 2019-05-23 2020-11-26 Uber Technologies, Inc. Model based reinforcement learning based on generalized hidden parameter markov decision processes

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200372822A1 (en) * 2019-01-14 2020-11-26 Polixir Technologies Limited Training system for autonomous driving control policy
US20200372410A1 (en) * 2019-05-23 2020-11-26 Uber Technologies, Inc. Model based reinforcement learning based on generalized hidden parameter markov decision processes
CN110647839A (en) * 2019-09-18 2020-01-03 深圳信息职业技术学院 Method and device for generating automatic driving strategy and computer readable storage medium
CN110745136A (en) * 2019-09-20 2020-02-04 中国科学技术大学 Driving self-adaptive control method
CN111026127A (en) * 2019-12-27 2020-04-17 南京大学 Automatic driving decision method and system based on partially observable transfer reinforcement learning
CN111473794A (en) * 2020-04-01 2020-07-31 北京理工大学 Structural road unmanned decision planning method based on reinforcement learning
CN111483468A (en) * 2020-04-24 2020-08-04 广州大学 Unmanned vehicle lane change decision-making method and system based on confrontation and imitation learning

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114491739A (en) * 2021-12-30 2022-05-13 深圳市优必选科技股份有限公司 Construction method and device of road traffic system, terminal equipment and storage medium
CN114117829A (en) * 2022-01-24 2022-03-01 清华大学 Dynamic modeling method and system for man-vehicle-road closed loop system under limit working condition
CN114117829B (en) * 2022-01-24 2022-04-22 清华大学 Dynamic modeling method and system for man-vehicle-road closed loop system under limit working condition
CN114596553A (en) * 2022-03-11 2022-06-07 阿波罗智能技术(北京)有限公司 Model training method, trajectory prediction method and device and automatic driving vehicle
CN114596553B (en) * 2022-03-11 2023-01-24 阿波罗智能技术(北京)有限公司 Model training method, trajectory prediction method and device and automatic driving vehicle
CN114770497A (en) * 2022-03-31 2022-07-22 中国人民解放军陆军工程大学 Search and rescue method and device of search and rescue robot and storage medium
CN114770497B (en) * 2022-03-31 2024-02-02 中国人民解放军陆军工程大学 Search and rescue method and device of search and rescue robot and storage medium

Also Published As

Publication number Publication date
CN113276883B (en) 2023-04-21

Similar Documents

Publication Publication Date Title
CN113276883A (en) Unmanned vehicle driving strategy planning method based on dynamic generation environment and implementation device
WO2022052406A1 (en) Automatic driving training method, apparatus and device, and medium
CN111506058B (en) Method and device for planning a short-term path for autopilot by means of information fusion
CN112882469B (en) Deep reinforcement learning obstacle avoidance navigation method integrating global training
CN112888612A (en) Autonomous vehicle planning
CN113405558B (en) Automatic driving map construction method and related device
CN114846425A (en) Prediction and planning of mobile robots
CN114358128A (en) Method for training end-to-end automatic driving strategy
CN111460879B (en) Neural network operation method using grid generator and device using the same
CN113104050B (en) Unmanned end-to-end decision method based on deep reinforcement learning
CN112508164A (en) End-to-end automatic driving model pre-training method based on asynchronous supervised learning
CN112114592B (en) Method for realizing autonomous crossing of movable frame-shaped barrier by unmanned aerial vehicle
Guvenc et al. Simulation Environment for Safety Assessment of CEAV Deployment in Linden
EP4302165A1 (en) Instantiating objects in a simulated environment based on log data
Hubmann Belief state planning for autonomous driving: Planning with interaction, uncertain prediction and uncertain perception
Batkovic Enabling Safe Autonomous Driving in Uncertain Environments
Li et al. Decision making for autonomous vehicles
CN114120653A (en) Centralized vehicle group decision control method and device and electronic equipment
Aflaki Teaching a Virtual Duckietown Agent to Stop
Zheng et al. The Navigation Based on Hybrid A Star and TEB Algorithm Implemented in Obstacles Avoidance
US11808582B1 (en) System processing scenario objects during simulation
Moudhgalya Language Conditioned Self-Driving Cars Using Environmental Object Descriptions For Controlling Cars
CN111434550B (en) Simulation-based parking strategy generation method and system
JPH10268749A (en) Method for simulating autonomous traveling body
Peiss et al. Graph-Based Autonomous Driving with Traffic-Rule-Enhanced Curriculum Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant