CN109143852B - Intelligent driving vehicle environment self-adaptive importing method under urban environment - Google Patents


Info

Publication number
CN109143852B
Authority
CN
China
Prior art keywords
vehicle
value
action
reward
import
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810780413.4A
Other languages
Chinese (zh)
Other versions
CN109143852A (en
Inventor
Chen Xuemei
Liu Gemeng
Du Mingming
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201810780413.4A priority Critical patent/CN109143852B/en
Publication of CN109143852A publication Critical patent/CN109143852A/en
Application granted granted Critical
Publication of CN109143852B publication Critical patent/CN109143852B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B 13/04 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B 13/042 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Traffic Control Systems (AREA)
  • Control Of Driving Devices And Active Controlling Of Vehicle (AREA)

Abstract

The invention discloses an environment self-adaptive importing method for an intelligent driving vehicle in an urban environment, which comprises the steps of: extracting an initial state vector; calculating action variables according to a greedy strategy, executing the import action and simultaneously updating the import scene, selecting an import gap and an import action according to uniform probability if the action variables take a random action, or, if the agent selects the action, comparing the maximum action value functions of all candidate gaps, selecting the gap and action corresponding to the maximum value, and returning the target import gap and the agent import action; sensing the state vector at the next moment; calculating a reward value according to the environment feedback information; storing the initial state vector, the action variable, the state vector at the next moment and the reward value into a sample set, and evaluating and improving the strategy according to an LSQ method after enough samples are obtained; and repeating the steps until the merging is successful. The sample set and the learning time of the invention are smaller than those of the Q learning algorithm, and the success rate is high.

Description

Intelligent driving vehicle environment self-adaptive importing method under urban environment
Technical Field
The invention relates to an environment self-adaptive import method which comprehensively considers target gap selection and the expected import opportunity in a complex urban environment.
Background
The unmanned vehicle, as a future traffic development trend, has great potential in solving problems such as traffic safety and road congestion management. As the "brain" of the unmanned vehicle, the decision-making system embodies its level of intelligence; improving the generalization and adaptability of the decision-making system in a complex urban environment is very important for developing unmanned vehicles that can actually drive on the road. However, a traditional rule-based unmanned vehicle can only adapt to a single driving environment and cannot cope with complex and changeable real scenes, and its decisions may not meet the requirements of robustness and flexibility. Merging onto an expressway in an urban environment requires safe and effective decisions under multiple constraints such as short time and limited space, which places higher requirements on the decision-making system of an unmanned vehicle.
In terms of research on merging behavior strategies, Yang proposed a longitudinal control algorithm that guides the unmanned vehicle to merge into the main line and provides a speed strategy according to the distance from the target gap. Liu et al. used an improved game-theory framework to model the merging behavior on freeway ramps. Ran et al. focused on how merging vehicles travel along freeway ramps to the desired merging location and quantified the interaction between vehicles by modeling their acceleration and deceleration. The above studies all focus on the highway environment; high-density urban environments are rarely involved, most studies consider the tactical lane-change decision, and few describe the continuous lane-change process.
In terms of the application of reinforcement learning to driving behavior decisions, Abbeel and Ng considered the interaction between the vehicle and the surrounding environment, used inverse reinforcement learning to learn vehicle operation, reflected the influence of the environment on behavior through the reward function, and established a functional mapping between environmental influence factors and vehicle motion. The Xu Xin research team at the National University of Defense Technology addressed the obstacle avoidance and navigation problems of intelligent vehicles in the continuous state space of expressways based on the approximate policy iteration KLSPI algorithm. Shalev-Shwartz discussed a safe reinforcement learning method that divides the policy network into two parts and learns driving safety and comfort separately, but model validity was only verified in a simple simulation environment.
The above methods consider only a few merging indexes and cannot reproduce the merging experience of a human driver.
Disclosure of Invention
1. The invention aims to provide a novel environment self-adaptive import method for intelligent driving vehicles in an urban environment.
The method comprehensively considers evaluation indexes such as safety, comfort and timeliness, establishes a linearly weighted comprehensive reward value model, and sets an action space containing two-dimensional variables, namely a longitudinal speed decision variable and a transverse speed decision variable, which decouples the transverse and longitudinal motions of the unmanned vehicle, realizes continuous control of the import process, and improves the adaptability of the unmanned vehicle to the dynamic environment during import.
2. The technical scheme adopted by the invention is as follows.
The invention provides an environment self-adaptive importing method of an intelligent driving vehicle in an urban environment, which is characterized by comprising the following steps of:
extracting an initial state vector;
calculating action variables according to a greedy strategy, executing the import action and simultaneously updating the import scene; if the action variables take a random action, selecting the import gap and the import action according to uniform probability,
if the agent selects the action, each candidate gap comprises a front vehicle, a rear vehicle and the merging vehicle; the maximum action value functions of all candidate gaps are compared, the gap and the action corresponding to the maximum value are selected, and the target import gap and the agent import action are returned;
sensing a state vector at the next moment;
calculating an award value according to the environment feedback information;
storing the initial state vector, the action variable, the state vector at the next moment and the reward value into a sample set, and evaluating and improving the strategy according to an LSQ method after enough samples are obtained;
and repeating the steps until the merging is successful.
Further, the state space is described as a seven-dimensional vector space, wherein the first three dimensions are the position coordinates and speed information of the merging vehicle, and the last four dimensions are the longitudinal position coordinates and speed information of the front vehicle and the rear vehicle of the target lane in the simulation process.
Furthermore, the basis functions adopted for the initial state space include the time to collision, headway, relative distance and relative speed of the two vehicles and the motion state.
Further, the action variables include a longitudinal speed decision variable and a transverse speed decision variable.
Furthermore, the acceleration of the longitudinal speed decision variable in the action variables is discretized into five action values of rapid deceleration, deceleration, uniform speed, acceleration and rapid acceleration, and the transverse speed decision variable takes two action values, so that the action space of the longitudinal and transverse speed decision variables contains 10 actions.
Furthermore, the reward function for calculating the reward value is a linear weighting of the safety reward value, the reward value for success or failure of the task, the import efficiency reward value, the speed limit reward value and the comfort reward value.
Furthermore, the safety reward function is specifically as follows:
when a collision occurs or is imminent, a large negative reward (penalty) is given, and when the safety condition is met, the reward value is 0, so the weight of the safety reward value is a large negative value;
[Formula not reproduced: Rsafety(s, a) is a penalty term when min(dx10, dx02) < dis and 0 when the safety condition is met.]
dx10 and dx02 are the relative distances from the merging vehicle to the front vehicle and the rear vehicle of the target lane, respectively, and dis is the safety threshold for the relative distance between the merging vehicle and the front and rear vehicles of the target lane.
Further, the task success reward function:
[Formula not reproduced: Rtask(s, a) gives a positive reward once the vehicle has merged into the target lane with both dx10 and dx02 above the safe distance threshold dis1.]
dis1 is the safe distance threshold, and dx10, dx02 are the relative distances from the merging vehicle to the front vehicle and the rear vehicle of the target lane, respectively; when the unmanned vehicle merges successfully, a larger positive reward is given, and the weight is a large positive value.
Further, the import efficiency reward value function is:
[Formula not reproduced: Rtime(s, a) is positive when the merge is completed within the preset number of periods and negative otherwise.]
step represents the current period; when the unmanned vehicle merges successfully within the preset value, a positive reward is given, otherwise a negative reward is given, so the weight is positive.
Furthermore, the speed limit reward function is as follows:
[Formula not reproduced: Rrule(s, a) is 0 when the vehicle speed is within the road speed limit vlimit and negative when speeding.]
vlimit denotes the road speed limit. When the unmanned vehicle is within the speed limit range, the speed limit reward value is 0; if it speeds, a negative reward value is given, so the weight is positive.
Further, the comfort reward function:
the comfort degree in the driving process comprises characteristic indexes of acceleration and impact degree in the longitudinal and transverse directions, the impact degree refers to the change rate of the acceleration along with time, the comfort reward value considers the change of the longitudinal acceleration and is normalized as follows:
Figure BDA0001732416040000041
where | △ a | represents the longitudinal acceleration motion difference of two cycles, amaxRepresents the maximum acceleration, aminRepresents the maximum deceleration, when the acceleration difference is 0, the reward value is zero; in other cases, the acceleration changes constantly, the driving comfort decreases, and a negative reward is given, so the weight is a negative value.
3. The technical effect produced by the invention.
(1) Compared with the Q learning algorithm, the environment self-adaptive import method (LSPI algorithm) for intelligent driving vehicles in an urban environment has the following advantages: the discrete state space of Q learning restricts its generalization to the environment, samples cannot be fully learned, and information is lost; if the state space is discretized more finely, the amount of computation grows exponentially with the state space dimension and the storage requirement becomes larger, so the convergence time and the required sample set of Q learning are far larger than those of the LSPI algorithm, and its learning time is also longer.
(2) Comparison of the LSPI algorithm with the Q learning algorithm shows that the import success rate based on the LSPI algorithm gradually improves as training proceeds and finally reaches 86%, which shows that the method can autonomously learn the import strategy. The success rate of Q learning fluctuates around 25%; its import success rate is low and the applicability of the algorithm is not high.
Drawings
FIG. 1 is a graph comparing success rate of Q learning and LSPI algorithm.
Fig. 2 shows the comparison result between the import strategy gap selection and the real data.
Fig. 3 is a graph comparing the data of imported vehicle No. 2745 with the simulation experiment.
Fig. 4 is a graph comparing the data of imported vehicle No. 63 with the simulation experiment.
FIG. 5 is a flow chart of multi-target candidate gap selection.
Fig. 6 is a flowchart of the import strategy training based on the LSPI algorithm.
Detailed Description
Examples
The invention comprehensively considers target gap selection and the expected merging opportunity, and provides an environment self-adaptive import method based on a least squares policy iteration (LSPI) algorithm.
The method regards the front vehicle and rear vehicle of a candidate gap, together with the merging vehicle, as a unit merging system and performs reinforcement learning modeling on it. In the strategy optimization process, the maximum action value functions of all candidate gaps are compared, and the strategy corresponding to the maximum value is selected as the output strategy; a minimal sketch of this candidate-gap comparison is given below. In the reinforcement learning modeling of the unit system, evaluation indexes such as safety, comfort and timeliness are comprehensively considered, a linearly weighted comprehensive reward value model is established, and the action space is set to contain two-dimensional variables, namely a longitudinal speed decision variable and a transverse speed decision variable, so that the transverse and longitudinal motions of the unmanned vehicle are decoupled and continuous control of the merging process is realized.
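The sketch below illustrates the candidate-gap comparison of Fig. 5, assuming a linear action value Q(s, a) = ω·φ(s, a). The names phi, omega, actions and EPSILON are placeholder assumptions for illustration, not values defined by the patent.

```python
# Sketch of the multi-target candidate gap selection (Fig. 5), assuming a linear
# action value Q(s, a) = omega . phi(s, a). The names phi, omega, actions and
# EPSILON are placeholders, not values taken from the patent.
import random
import numpy as np

EPSILON = 0.1  # assumed exploration probability of the greedy strategy

def q_value(omega, phi, gap_state, action):
    """Approximate action value of one unit merging system (one candidate gap)."""
    return float(np.dot(omega, phi(gap_state, action)))

def select_gap_and_action(candidate_gaps, omega, phi, actions):
    """Compare the maximum action value functions of all candidate gaps and
    return the gap/action pair with the overall maximum; with probability
    EPSILON, pick a gap and an action uniformly at random instead."""
    if random.random() < EPSILON:
        return random.choice(candidate_gaps), random.choice(actions)
    best = None
    for gap_state in candidate_gaps:          # each candidate gap is a unit system state
        for action in actions:
            q = q_value(omega, phi, gap_state, action)
            if best is None or q > best[0]:
                best = (q, gap_state, action)
    return best[1], best[2]                   # target import gap and agent import action
```

Because every candidate gap contributes its own unit-system state, adding or removing candidate gaps does not change the learned value function, only the set of states compared at selection time.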
Import strategy modeling based on the LSPI algorithm:
(1) state space
The unit system state space of the LSPI algorithm is described as a seven-dimensional vector space (x0, y0, v0, x1, v1, x2, v2), wherein (x0, y0, v0) are the position coordinates and speed information of the merging vehicle, and (x1, v1, x2, v2) represent the longitudinal position coordinates and speed information of the front vehicle and the rear vehicle of the target lane during simulation.
(2) Basis function establishment
The basis functions, which are also referred to as features in some cases, are generally selected based on empirical knowledge; commonly used basis functions include Gaussian radial basis functions, polynomial basis functions, and the like. The invention includes the time to collision (TTC) and headway (gti) of the two vehicles in the unit system, the relative distance (dxi) and relative speed (dvi), and part of the unmanned vehicle's own state information (y0, v0) in the basis functions, which comprise 14 dimensions, as shown in Table 1; a sketch of the feature construction follows the table.
TABLE 1 basis function establishment
[Table 1 (image not reproduced): the 14 basis-function dimensions built from TTC, headway, relative distance, relative speed and the merging vehicle's own state]
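Since Table 1 itself is not reproduced, the exact composition of the 14 dimensions is unknown here; the sketch below only builds the features named in the text (TTC, headway, relative distance and relative speed for the front and rear vehicles of the candidate gap, plus the merging vehicle's own y0 and v0) from the seven-dimensional unit-system state, using standard definitions of TTC and headway as an assumption.

```python
# Sketch of the gap-related features named in the text, computed from the
# seven-dimensional unit-system state (x0, y0, v0, x1, v1, x2, v2). The TTC and
# headway definitions below are common ones assumed for illustration; the full
# 14-dimensional basis of Table 1 is not reproduced in this text.
import numpy as np

def gap_features(x0, y0, v0, x1, v1, x2, v2, eps=1e-6):
    dx1, dv1 = x1 - x0, v0 - v1      # relative distance / closing speed to the front vehicle
    dx2, dv2 = x0 - x2, v2 - v0      # relative distance / closing speed to the rear vehicle
    ttc1 = dx1 / (dv1 + eps)         # time to collision with the front vehicle
    ttc2 = dx2 / (dv2 + eps)         # time to collision with the rear vehicle
    gt1 = dx1 / (v0 + eps)           # time headway to the front vehicle
    gt2 = dx2 / (v2 + eps)           # time headway of the rear vehicle
    # gap features plus the merging vehicle's own lateral position and speed
    return np.array([ttc1, gt1, dx1, dv1, ttc2, gt2, dx2, dv2, y0, v0])
```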
(3) Action space
In order to simplify the action space of the model and ensure the comfort requirement, the invention discretizes the longitudinal acceleration into five action values: rapid deceleration, deceleration, uniform speed, acceleration and rapid acceleration, corresponding respectively to (-4, -2, 0, 2, 4). Combined with the two transverse speed decision values, the action space thus contains 10 actions, as shown in Table 2 and the sketch after it.
TABLE 2 action space settings
[Table 2 (image not reproduced): the 10 actions formed by combining the five longitudinal acceleration values with the two transverse speed decision values]
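A sketch of the 10-action space follows. The five longitudinal acceleration values are taken from the text; the encoding of the two transverse speed decision values is an assumption, since Table 2 is not reproduced.

```python
# Sketch of the action space: 5 longitudinal acceleration values x 2 transverse
# speed decisions = 10 actions. The lateral encoding is assumed (Table 2 is not
# reproduced in this text).
from itertools import product

LONGITUDINAL_ACCEL = [-4, -2, 0, 2, 4]   # rapid deceleration ... rapid acceleration
LATERAL_DECISION = [0, 1]                # assumed: 0 = hold lateral position, 1 = move toward target lane

ACTIONS = list(product(LONGITUDINAL_ACCEL, LATERAL_DECISION))
assert len(ACTIONS) == 10
```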
(4) Reward function
In the merging process, the first condition the unmanned vehicle must satisfy is safety. Secondly, the merging process requires that the lane change be completed within limited time and space constraints, so the efficiency of the merging process is also one of the evaluation indexes. The method refers to real-world merging behavior on urban expressway ramps, aims to comply with traffic rules, and takes the comfort during driving into account, so the speed limit and comfort are also included in the evaluation indexes. In view of the above, the invention establishes a linearly weighted comprehensive reward value model, as shown in formula (1); a sketch combining all five terms is given after the comfort reward function below.
R(s, a) = μ1·Rsafety(s, a) + μ2·Rtask(s, a) + μ3·Rtime(s, a) + μ4·Rrule(s, a) + μ5·Rcomfort(s, a)    (1)
Rsafety(s, a) represents the safety reward value and μ1 is the safety reward value weight; Rtask(s, a) represents the reward value for success or failure of the task and μ2 is the task success reward value weight; Rtime(s, a) represents the import efficiency reward value and μ3 is the import efficiency reward value weight; Rrule(s, a) represents the speed limit reward value and μ4 is the speed limit reward value weight; Rcomfort(s, a) represents the comfort reward value and μ5 is the comfort reward value weight.
In the formulas below, dx10 and dx02 denote the relative distances from the merging vehicle to the front vehicle and the rear vehicle of the target lane, respectively.
1) Security reward function
During driving, safety is the most important evaluation index. When a collision occurs or is imminent, a large negative reward (penalty) is given; when the safety condition is met, the reward value is 0. Therefore, the safety reward value weight μ1 is a large negative value.
[Formula not reproduced: Rsafety(s, a) is a penalty term when min(dx10, dx02) < dis and 0 when the safety condition is met.]
where dis is the safety threshold for the relative distance between the merging vehicle and the front and rear vehicles of the target lane.
2) Task success reward function
The task success reward value is the reward fed back when the import task is completed safely and efficiently. For the unit import system of the invention:
[Formula not reproduced: Rtask(s, a) gives a positive reward once the vehicle has merged into the target lane with both dx10 and dx02 above the safe distance threshold dis1.]
dis1 is the safe distance threshold. When the unmanned vehicle merges successfully, a larger positive reward is given, and the weight μ2 is a large positive value.
3) Import efficiency reward function
The merging behavior is required to complete the lane change task efficiently within certain space constraints. Therefore, the invention designs the import efficiency reward value according to the timeliness of completing the import task:
[Formula not reproduced: Rtime(s, a) is positive when the merge is completed within the preset number of periods and negative otherwise.]
step denotes the current period. When the unmanned vehicle merges successfully within 6.5 seconds, a positive reward is given; otherwise, a negative reward is given. Therefore, the weight μ3 is positive.
4) Speed limit reward function
During driving, traffic laws and regulations must be complied with; the speed limit reward value is introduced to keep the speed of the unmanned vehicle within a reasonable range.
[Formula not reproduced: Rrule(s, a) is 0 when the vehicle speed v0 is within the road speed limit vlimit and negative when speeding.]
vlimit denotes the road speed limit. When the unmanned vehicle is within the speed limit range, the speed limit reward value is 0; if it speeds, a negative reward value is given. Therefore, the weight μ4 is positive.
5) Comfort reward function
The comfort during driving involves characteristic indexes of acceleration and jerk in the longitudinal and transverse directions; jerk is the rate of change of acceleration over time. In the study of the merging process only a simple two-degree-of-freedom kinematic model is considered, so the comfort reward value mainly considers the change of longitudinal acceleration and is normalized as follows:
[Formula not reproduced: Rcomfort(s, a) is 0 when the acceleration difference |Δa| between two cycles is 0 and otherwise takes a value normalized by the acceleration range from amin to amax.]
where |Δa| denotes the difference in longitudinal acceleration between two cycles, amax denotes the maximum acceleration and amin denotes the maximum deceleration. When the acceleration difference is 0, the reward value is zero; in other cases the acceleration changes constantly, driving comfort decreases and a negative reward is given, so the weight μ5 is negative.
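The sketch below combines the five reward terms according to formula (1) and the piecewise descriptions above. The indicator values (1 or -1), the weights mu and the period limit step_limit are placeholder assumptions consistent with the stated signs, not numbers taken from the patent.

```python
# Sketch of the linearly weighted reward of formula (1). The indicator values,
# the weights mu and step_limit are assumptions consistent with the signs
# described above, not values taken from the patent.
def reward(dx10, dx02, dis, dis1, merged, step, step_limit,
           v0, v_limit, delta_a, a_max, a_min,
           mu=(-100.0, 50.0, 1.0, 1.0, -1.0)):       # (mu1..mu5), assumed magnitudes
    mu1, mu2, mu3, mu4, mu5 = mu
    # safety term: flags collision / near-collision; zero when the safety condition holds
    r_safety = 1.0 if min(dx10, dx02) < dis else 0.0
    # task-success term: positive once the vehicle has merged with both gaps above dis1
    r_task = 1.0 if merged and min(dx10, dx02) > dis1 else 0.0
    # import efficiency term: positive within the preset number of periods, negative after
    r_time = 1.0 if step <= step_limit else -1.0
    # speed limit term: zero inside the limit, negative when speeding
    r_rule = 0.0 if v0 <= v_limit else -1.0
    # comfort term: change of longitudinal acceleration normalized by the acceleration range
    r_comfort = abs(delta_a) / (a_max - a_min)
    return (mu1 * r_safety + mu2 * r_task + mu3 * r_time
            + mu4 * r_rule + mu5 * r_comfort)
```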
Example 2
The specific process of the optimization training of the environment self-adaptive import strategy method based on the LSPI algorithm is as follows; a sketch of the corresponding training loop is given after the steps:
(1) Initialize the strategy π0 and the sample set D0;
(2) Run the Vissim + PreScan combined traffic simulation platform;
(3) Obtain the import environment information from the simulation environment and extract the state vector st;
(4) Compute the action variable at according to the greedy strategy; if a random action is taken, select an import gap and an import action with uniform probability; if the agent selects the action, compare the maximum action value functions of all candidate gaps, select the gap and action corresponding to the maximum value, and return the target import gap and the agent import action. The simulation platform executes the import action at and updates the import scene.
(5) The unmanned vehicle senses the state vector st+1 at the next moment through its sensors;
(6) Calculate the reward value Rt based on the environmental feedback information;
(7) Store (st, at, st+1, Rt) into the sample set; when enough samples have been collected, evaluate and improve the strategy according to the LSQ method.
(8) Repeat steps 3-7 until the merge succeeds;
(9) Repeat steps 2-8 until the strategy converges or the maximum number of iterations is reached.
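A condensed sketch of the training loop in steps (1)-(9) is given below. The simulation platform is abstracted into a generic env object with reset/step methods (the patent uses the Vissim + PreScan co-simulation), phi is the state-action basis-function map, actions is the 10-action set, and the discount factor gamma is assumed; the LSQ evaluation step is written as a standard LSTDQ least-squares solve, and the termination test uses the sum-of-squared-differences criterion on ω (< 0.01) mentioned in the experimental section.

```python
# Sketch of LSPI-based import strategy training, steps (1)-(9). env, phi and
# actions are placeholder interfaces; the LSQ policy evaluation is written as
# a standard LSTDQ least-squares solve.
import numpy as np

def lstdq(samples, phi, actions, omega, gamma=0.95):
    """Least-squares policy evaluation over the collected (s, a, r, s') samples."""
    k = len(phi(*samples[0][:2]))                       # feature dimension
    A, b = np.zeros((k, k)), np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        # greedy action of the current policy in the next state
        a_next = max(actions, key=lambda u: np.dot(omega, phi(s_next, u)))
        A += np.outer(f, f - gamma * phi(s_next, a_next))
        b += f * r
    return np.linalg.solve(A + 1e-6 * np.eye(k), b)     # small ridge term for stability

def train_lspi(env, phi, actions, epsilon=0.1, max_iter=5000, tol=0.01):
    omega = np.zeros(len(phi(env.reset(), actions[0])))  # (1) initial strategy and empty sample set
    samples = []
    for _ in range(max_iter):                            # (9) outer loop until convergence
        s = env.reset()                                  # (2)-(3) run the platform, read the state
        done = False
        while not done:                                  # (8) one import episode
            if np.random.rand() < epsilon:               # (4) greedy strategy with random exploration
                a = actions[np.random.randint(len(actions))]
            else:
                a = max(actions, key=lambda u: np.dot(omega, phi(s, u)))
            s_next, r, done = env.step(a)                # (5)-(6) next state and reward value
            samples.append((s, a, r, s_next))            # (7) store the sample
            s = s_next
        omega_new = lstdq(samples, phi, actions, omega)  # (7) LSQ evaluation and improvement
        if np.sum((omega_new - omega) ** 2) < tol:       # termination: sum of squared differences < 0.01
            return omega_new
        omega = omega_new
    return omega
```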
Experimental verification
(1) Simulation comparison experiment of the Q learning and LSPI algorithms:
1) A comparison experiment was designed in the unit import system strategy learning stage; the table below shows the number of samples and the convergence time required for the unmanned vehicle unit system strategy to converge. Comparing the LSPI algorithm with the Q learning algorithm, the number of samples required by the LSPI algorithm during optimization training is lower than that of the Q learning algorithm, and the number of iterations on the strategy learning problem is relatively small. The specific comparison is shown in the following table:
TABLE 3 comparison of simulation results for LSPI and Q learning algorithms
[Table 3 (image not reproduced): number of samples and convergence time required by the LSPI and Q learning algorithms]
The termination condition of the Q learning algorithm is that the sum of squared differences of the Q value table between two iterations is less than 1, and the termination condition of the LSPI algorithm is that the sum of squared differences of the parameter vector ω between two iterations is less than 0.01. The data in the table are mean values obtained from multiple experiments. Analysis shows that the number of samples required for the Q learning algorithm to converge is far larger than that of the LSPI algorithm, and its learning time is also longer. The discrete state space restricts the generalization of Q learning to the environment, and the samples cannot be fully learned, so information is lost. If the state space is discretized more finely, the amount of computation grows exponentially with the state space dimension and the storage requirement becomes larger, so the convergence time and the required sample set of the Q learning algorithm are far larger than those of the LSPI algorithm.
2) A typical 3-gap import scene is designed on the combined simulation platform for strategy optimization training; through interactive feedback with the simulation environment, the algorithm agent continuously explores and optimizes the import strategy.
As shown in fig. 1, the maximum number of iterations of each set of experiments is 5000; the strategy is evaluated once every 500 iterations and the success rate of the import behavior is recorded. The import success rate based on the LSPI algorithm gradually increases with the number of training iterations and finally reaches 86%, which shows that the method can autonomously learn the import strategy. The success rate of Q learning fluctuates around 25%; its import success rate is low and the applicability of the algorithm is not high.
(2) Verification against real import data.
As shown in fig. 2, 30 sets of real data from the NGSIM (Next Generation Simulation) US101 data set are selected to verify the import strategy. Under the same conditions, the gap selection after training is slightly conservative: the original gap is selected in 73% of cases, and the proportion of selecting the gap ahead is lower than in the real import data.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (9)

1. An environment self-adaptive importing method of an intelligent driving vehicle in an urban environment is characterized by comprising the following steps:
extracting an initial state vector;
calculating action variables according to a greedy strategy, executing the import action and simultaneously updating the import scene, if the action variables take a random action, selecting the import gap and the import action according to uniform probability,
if the agent selects the action, each candidate gap comprises a front vehicle, a rear vehicle and the merging vehicle; the maximum action value functions of all the candidate gaps are compared, the gap and the action corresponding to the maximum value are selected, and the target import gap and the agent import action are returned;
sensing a state vector at the next moment;
calculating an award value according to the environment feedback information;
storing the initial state vector, the action variable, the state vector at the next moment and the reward value into a sample set, and after enough samples are obtained, evaluating and improving the import strategy model based on the LSPI algorithm;
repeatedly executing the steps until the merging is successful;
the LSPI algorithm is imported into strategy modeling:
(1) state space
The unit system state space of the LSPI algorithm is described as a seven-dimensional vector space (x0, y0, v0, x1, v1, x2, v2), wherein (x0, y0, v0) are the position coordinates and speed information of the merging vehicle, and (x1, v1, x2, v2) represent the longitudinal position coordinates and speed information of the front vehicle and the rear vehicle of the target lane in the simulation process;
(2) basis function establishment
The time to collision TTC and headway gti of the two vehicles in the unit system, the relative distance dxi and relative speed dvi, and part of the unmanned vehicle's own state information (y0, v0) are included in the basis functions, and the basis functions comprise 14 dimensions;
(3) Action space
The longitudinal acceleration is discretized into five action values of rapid deceleration, deceleration, uniform speed, acceleration and rapid acceleration, corresponding respectively to (-4, -2, 0, 2, 4);
(4) reward function
The speed limit and the comfort are included in the evaluation indexes, and a linearly weighted comprehensive reward value model is established, as shown in formula (1),
R(s, a) = μ1·Rsafety(s, a) + μ2·Rtask(s, a) + μ3·Rtime(s, a) + μ4·Rrule(s, a) + μ5·Rcomfort(s, a)    (1)
Rsafety(s, a) represents the safety reward value and μ1 is the safety reward value weight, Rtask(s, a) represents the reward value for success or failure of the task and μ2 is the task success reward value weight, Rtime(s, a) represents the import efficiency reward value and μ3 is the import efficiency reward value weight, Rrule(s, a) represents the speed limit reward value and μ4 is the speed limit reward value weight, Rcomfort(s, a) represents the comfort reward value and μ5 is the comfort reward value weight.
2. The adaptive importing method for the environment of the intelligent driving vehicle in the urban environment according to claim 1, wherein: the state space is described as a seven-dimensional vector space, wherein the first three dimensions are the position coordinates and speed information of the merging vehicle, and the last four dimensions are the longitudinal position coordinates and speed information of the front vehicle and the rear vehicle of the target lane in the simulation process.
3. The adaptive importing method for the environment of the intelligent driving vehicle in the urban environment according to claim 1, wherein: the basis functions adopted for the initial state space include the time to collision, headway, relative distance, relative speed and motion state of the two vehicles.
4. The adaptive importing method for the environment of the intelligent driving vehicle in the urban environment according to claim 1, wherein: the action variables comprise a longitudinal speed decision variable and a transverse speed decision variable.
5. The adaptive importing method for the environment of the intelligent driving vehicle in the urban environment according to claim 4, wherein: the acceleration of the longitudinal speed decision variable in the action variables is discretized into five action values of rapid deceleration, deceleration, uniform speed, acceleration and rapid acceleration, and the transverse speed decision variable takes two action values, so that the action space of the longitudinal and transverse speed decision variables contains 10 actions.
6. The adaptive importing method for the environment of the intelligent driving vehicle in the urban environment according to claim 1, wherein the safety reward value is specifically:
when a collision occurs or is imminent, a large negative reward is given, and when the safety condition is met, the reward value is 0, so that the weight of the safety reward value is a large negative value;
[Formula not reproduced: Rsafety(s, a) is a penalty term when min(dx10, dx02) < dis and 0 when the safety condition is met.]
dx10 and dx02 are the relative distances from the merging vehicle to the front vehicle and the rear vehicle of the target lane, respectively, and min(dx10, dx02) is the minimum of these relative distances; dis is the safety threshold for the relative distance between the merging vehicle and the front and rear vehicles of the target lane.
7. The adaptive importing method for intelligent driving vehicle environment in urban environment according to claim 1, wherein the reward value for success or failure of task is:
[Formula not reproduced: Rtask(s, a) gives a positive reward once the vehicle has merged into the target lane with both dx10 and dx02 above the safe distance threshold dis1.]
dis1 is the safe distance threshold, y0 is the vehicle ordinate, and dx10, dx02 are the relative distances from the merging vehicle to the front vehicle and the rear vehicle of the target lane, respectively; when the unmanned vehicle merges successfully, a larger positive reward is given, and the weight is a large positive value.
8. The adaptive import method for the environment of the intelligent driving vehicle under the urban environment according to claim 1, wherein the import efficiency reward value is as follows:
[Formula not reproduced: Rtime(s, a) is positive when the merge is completed within the preset number of periods and negative otherwise.]
step represents the current period; when the unmanned vehicle merges successfully within the preset value, a positive reward is given, otherwise a negative reward is given, so the weight is positive.
9. The adaptive importing method for the environment of the intelligent driving vehicle in the urban environment according to claim 1, wherein the speed limit reward value is as follows:
[Formula not reproduced: Rrule(s, a) is 0 when the vehicle speed v0 is within the road speed limit vlimit and negative when speeding.]
vlimit denotes the road speed limit; when the unmanned vehicle is within the speed limit range, the speed limit reward value is 0, and if it speeds, a negative reward value is given, so the weight is positive; v0 represents the vehicle speed;
the comfort reward value is as follows:
the comfort during driving involves characteristic indexes of acceleration and jerk in the longitudinal and transverse directions, jerk being the rate of change of acceleration over time; the comfort reward value considers the change of the longitudinal acceleration and is normalized as follows:
[Formula not reproduced: Rcomfort(s, a) is 0 when the acceleration difference |Δa| between two cycles is 0 and otherwise takes a value normalized by the acceleration range from amin to amax.]
where |Δa| denotes the difference in longitudinal acceleration between two cycles, amax denotes the maximum acceleration and amin denotes the maximum deceleration; when the acceleration difference is 0, the reward value is zero; in other cases, the acceleration changes constantly, driving comfort decreases, and a negative reward is given, so the weight is a negative value.
CN201810780413.4A 2018-07-17 2018-07-17 Intelligent driving vehicle environment self-adaptive importing method under urban environment Active CN109143852B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810780413.4A CN109143852B (en) 2018-07-17 2018-07-17 Intelligent driving vehicle environment self-adaptive importing method under urban environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810780413.4A CN109143852B (en) 2018-07-17 2018-07-17 Intelligent driving vehicle environment self-adaptive importing method under urban environment

Publications (2)

Publication Number Publication Date
CN109143852A CN109143852A (en) 2019-01-04
CN109143852B true CN109143852B (en) 2020-09-18

Family

ID=64800630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810780413.4A Active CN109143852B (en) 2018-07-17 2018-07-17 Intelligent driving vehicle environment self-adaptive importing method under urban environment

Country Status (1)

Country Link
CN (1) CN109143852B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI705377B (en) * 2019-02-01 2020-09-21 緯創資通股份有限公司 Hardware boost method and hardware boost system
CN111243296B (en) * 2020-01-15 2020-11-27 清华大学 Ramp confluence cooperative control method and system based on confluence time optimization
CN111625989B (en) * 2020-03-18 2024-02-13 北京联合大学 Intelligent vehicle incoming flow method and system based on A3C-SRU
CN112590792B (en) * 2020-12-18 2024-05-10 的卢技术有限公司 Vehicle convergence control method based on deep reinforcement learning algorithm
CN113110043B (en) * 2021-03-25 2022-04-08 南京航空航天大学 Vehicle convergence control method considering workshop interaction
CN115909780B (en) * 2022-11-09 2023-07-21 江苏大学 Expressway import control system and method based on intelligent networking and RBF neural network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101789183A (en) * 2010-02-10 2010-07-28 北方工业大学 Self-adaptive control system and method for entrance ramp
CN105912814A (en) * 2016-05-05 2016-08-31 苏州京坤达汽车电子科技有限公司 Lane change decision model of intelligent drive vehicle
CN106601002A (en) * 2016-11-23 2017-04-26 苏州大学 City expressway access ramp vehicle pass guiding system in car networking environment and guiding method thereof
CN107700293A (en) * 2017-08-31 2018-02-16 Jin Yong Automatic running transit system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9108652B2 (en) * 2012-07-09 2015-08-18 General Electric Company Method and system for timetable optimization utilizing energy consumption factors

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101789183A (en) * 2010-02-10 2010-07-28 北方工业大学 Self-adaptive control system and method for entrance ramp
CN105912814A (en) * 2016-05-05 2016-08-31 苏州京坤达汽车电子科技有限公司 Lane change decision model of intelligent drive vehicle
CN106601002A (en) * 2016-11-23 2017-04-26 苏州大学 City expressway access ramp vehicle pass guiding system in car networking environment and guiding method thereof
CN107700293A (en) * 2017-08-31 2018-02-16 Jin Yong Automatic running transit system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lane change and merging control method for unmanned vehicles under vehicle-vehicle cooperation; Zhang Ronghui et al.; China Journal of Highway and Transport; 2018-04-30; Vol. 31, No. 4; pp. 180-191 *

Also Published As

Publication number Publication date
CN109143852A (en) 2019-01-04

Similar Documents

Publication Publication Date Title
CN109143852B (en) Intelligent driving vehicle environment self-adaptive importing method under urban environment
CA3065617C (en) Method for predicting car-following behavior under apollo platform
CN110969848B (en) Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
CN109733415B (en) Anthropomorphic automatic driving and following model based on deep reinforcement learning
Xin et al. Intention-aware long horizon trajectory prediction of surrounding vehicles using dual LSTM networks
Wang et al. A novel pure pursuit algorithm for autonomous vehicles based on salp swarm algorithm and velocity controller
CN113110592A (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN111931902A (en) Countermeasure network generation model and vehicle track prediction method using the same
CN115257745A (en) Automatic driving lane change decision control method based on rule fusion reinforcement learning
Yao et al. Optimizing traffic flow efficiency by controlling lane changes: Collective, group, and user optima
Ye et al. Meta reinforcement learning-based lane change strategy for autonomous vehicles
CN114781072A (en) Decision-making method and system for unmanned vehicle
Venkatesh et al. Connected and automated vehicles in mixed-traffic: Learning human driver behavior for effective on-ramp merging
Ulfsjöö et al. On integrating POMDP and scenario MPC for planning under uncertainty–with applications to highway driving
CN110390398B (en) Online learning method
US20240202393A1 (en) Motion planning
Yuan et al. From Naturalistic Traffic Data to Learning-Based Driving Policy: A Sim-to-Real Study
Ma et al. Evolving testing scenario generation method and intelligence evaluation framework for automated vehicles
CN114701517A (en) Multi-target complex traffic scene automatic driving solution based on reinforcement learning
Cho A hierarchical learning approach to autonomous driving using rule specifications
WO2021148113A1 (en) Computing system and method for training a traffic agent in a simulation environment
CN114627640B (en) Dynamic evolution method of intelligent network-connected automobile driving strategy
Yang et al. Deep Reinforcement Learning Lane-Changing Decision Algorithm for Intelligent Vehicles Combining LSTM Trajectory Prediction
Ramkumar Realistic Speed Control of Agents in Traffic Simulation
Bhattacharyya Modeling Human Driving from Demonstrations

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Chen Xuemei

Inventor after: Liu Gemeng

Inventor after: Du Mingming

Inventor before: Chen Xuemei

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant