CN110568760A - Parameterized learning decision control system and method suitable for lane changing and lane keeping


Info

Publication number
CN110568760A
Authority
CN
China
Prior art keywords
vehicle
lane
state
module
learning
Prior art date
Legal status
Granted
Application number
CN201910952119.1A
Other languages
Chinese (zh)
Other versions
CN110568760B (en)
Inventor
高炳钊
张羽翔
吕吉东
陈虹
Current Assignee
Jilin University
Original Assignee
Jilin University
Priority date
Filing date
Publication date
Application filed by Jilin University
Priority to CN201910952119.1A
Publication of CN110568760A
Application granted
Publication of CN110568760B
Legal status: Active

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion; electric
    • G05B13/04 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion; electric, involving the use of models or simulators
    • G05B13/042 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion; electric, involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Driving Devices And Active Controlling Of Vehicle (AREA)

Abstract

The invention belongs to the technical field of the design of advanced driver-assistance and unmanned driving systems for automobiles, and in particular relates to a parameterized learning decision control system and method suitable for lane-changing and lane-keeping behaviors. Based on a parameterized decision framework, the invention designs a parameterized learning control system suitable for lane-changing and lane-keeping behaviors, comprising a learning decision method designed on a reinforcement learning algorithm for the vehicle lane-changing and lane-keeping scenario, and a trajectory planning controller which, after corresponding parameterization, is applicable to straight roads and curves in this scenario.

Description

Parameterized learning decision control system and method suitable for lane changing and lane keeping
Technical Field
The invention belongs to the technical field of the design of advanced driver-assistance and unmanned driving systems for automobiles, and in particular relates to a parameterized learning decision control system and method suitable for lane-changing and lane-keeping behaviors.
Background
With the continuous development of intelligent driver-assistance and unmanned driving technology, motion control systems of different forms are continually being proposed and applied. In the motion trajectory planning and control problem, for example, to make the system more functional and adaptive to various scenarios under a hierarchical vehicle control framework, the integrated low-level motion controller needs to perform various driving tasks in various scenarios, such as lane changing and lane keeping. At the same time, each actuation subsystem, such as the driving, braking, and steering systems, must be capable of coordinated control and of stable switching between different tasks. The parameterized decision framework proposed in the prior art can meet these requirements: it is a trajectory planning control method based on a parametric decision framework, built on model predictive control, that integrates trajectory planning and motion control across various scenarios. This trajectory planning and control method has advantages and development potential because it takes a simple form and can be adapted to various driving tasks and conditions. Under this trajectory planning control framework, human-like driving decisions are described at the decision control layer as several decision parameters closely related to trajectory characteristics. Furthermore, the solution of the different decision parameters needs to adapt to changeable driving conditions and to continuously adapt to the behavior and feedback of real human drivers in real driving scenarios; such a continuous learning effect is difficult to achieve with a model-based control method. Therefore, for the design of the decision-layer control algorithm, a reinforcement learning algorithm, which among learning algorithms has advantages in sequential control and continuous learning, can be used. Under urban or highway conditions, lane changing and lane keeping are the most common behaviors, and their decision-parameter feature relationships are simple and consistent.
Disclosure of Invention
The invention provides a parameterized learning decision control system and a parameterized learning decision control method suitable for lane-changing and lane-keeping behaviors, comprising a learning decision method designed on a reinforcement learning algorithm and a trajectory planning controller which, after corresponding parameterization, is applicable to straight roads and curves in this scenario.
The technical scheme of the invention is described below with reference to the attached drawings:
A parameterized learning decision control system suitable for lane changing and lane keeping is characterized by comprising a perception-signal collection and data storage module A, a learning decision parameter module B, a trajectory planning and motion control module C, and an execution tracking module D.
The perception-signal collection and data storage module A is used to obtain the running-state information of the host vehicle and of the vehicles in the surrounding environment, to process the signals, and to collect data for the subsequent learning and training of the decision parameters.
The learning decision parameter module B is used to learn from the collected decision data; when the quantity of data collected by the system reaches a certain threshold, or the data have been updated to a certain degree, the system continues learning, and suitable decision-parameter values are learned by a reinforcement-learning method.
The trajectory planning and motion control module C is used for real-time trajectory planning and motion control of the vehicle; based on a model predictive control method, it determines the controller form from the specific decision-parameter values output by the learning decision parameter module B and the current road type judged by the perception-signal collection and data storage module A, and optimizes the trajectory in a rolling (receding-horizon) manner.
The execution tracking module D is used for tracking control of the control quantities output by the algorithm, and is implemented with a PID (proportional-integral-derivative) controller to ensure control precision.
The perception-signal collection and data storage module A is connected to the learning decision parameter module B, the trajectory planning and motion control module C, and the execution tracking module D; the learning decision parameter module B is connected to the trajectory planning and motion control module C; and the trajectory planning and motion control module C is connected to the execution tracking module D.
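The module interconnections above can be summarized in a minimal structural sketch; all class, method, and field names below are illustrative assumptions, not identifiers from the patent.

```python
# Minimal structural sketch of the four-module dataflow (A -> B -> C -> D);
# the placeholder bodies mark where each module's algorithm would run.

class PerceptionStorageA:
    def collect(self):
        # host lane/speed, per-position environment records, and road type
        return {"L_h": 3, "v_h": 20.0, "road_type": "straight", "env": {}}

class LearningDecisionB:
    def decide(self, state):
        # learned policy output (Ty, tf, a_tar); fixed here as a placeholder
        return (0.0, 3.0, 0.0)

class PlanningControlC:
    def plan(self, decision, state):
        # model-predictive trajectory optimization would go here;
        # returns the control quantities (a, delta_f)
        return (decision[2], 0.0)

class ExecutionTrackingD:
    def track(self, control):
        # PID tracking of the commanded (a, delta_f) on the actuators
        print("tracking control:", control)

def control_step(A, B, C, D):
    state = A.collect()                 # A feeds B, C, and D
    decision = B.decide(state)          # B feeds C with decision parameters
    control = C.plan(decision, state)   # C feeds D with control quantities
    D.track(control)

control_step(PerceptionStorageA(), LearningDecisionB(),
             PlanningControlC(), ExecutionTrackingD())
```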
A method of the parameterized learning decision control system for lane changing and lane keeping, the method comprising the following steps:
Step one: obtain, through the perception-signal collection and data storage module A, the state information of the host vehicle and of the environmental vehicles required by the vehicle control algorithm, comprising: the lane, speed, and acceleration of the surrounding vehicles and their relative distances to the host vehicle, taking the lane as reference, obtained by means of the on-board camera and radar environment-sensing elements of the on-board intelligent sensing module; and the driving intention of each environmental vehicle, namely lane keeping or lane changing, obtained from its deviation from the lane centerline or from turn-signal information, together with the lane and speed of the host vehicle; and store this information in the module.
Step two: learn appropriate decision-parameter values, namely the specific values of the behavior-terminal lateral offset, the action time, and the acceleration/deceleration behavior, through the learning decision parameter module B; discretize the two continuous variables, action time and acceleration/deceleration behavior, over their value-range spaces to obtain a discrete action space; perform the state design and return design of a kernel-based least-squares policy-iteration reinforcement learning method; and learn with the reinforcement learning algorithm once the quantity of data collected by the system reaches a certain threshold.
Step three: perform the online optimization solution for trajectory planning and motion control through the trajectory planning and motion control module C according to the decision-parameter values output by the learning decision parameter module B, using a state-space equation containing the vehicle dynamics equations and a six-dimensional state vector, and establishing constraint equations with terminal-state constraints so that the execution of the action can match different road types. The decision parameters corresponding to the lane-changing and lane-keeping behavior scenarios are unified and determined, namely the behavior-terminal lateral offset, the action time, and the acceleration/deceleration behavior, corresponding respectively to the terminal lateral-offset equality constraint, the prediction horizon, and the acceleration reference term of the objective function in the model predictive controller. For the two different road conditions, straight road and curve, two different sets of terminal-state equality constraints are used: under the straight-road condition the terminal lateral offset, heading angle, lateral velocity, and yaw rate of the vehicle are constrained, whereas under the curve condition only the terminal lateral displacement and heading angle are constrained.
Step four: perform tracking control of the control quantities output by the algorithm through the execution tracking module D, using a PID controller to ensure control precision.
The invention has the following beneficial effects:
1. The invention designs a parameterized learning control system suitable for lane-changing and lane-keeping behaviors, and uses consistent driving-decision and trajectory-planning forms across different driving tasks and environments.
2. The invention uses a learning decision method designed on a reinforcement learning algorithm, whose decision simultaneously comprises three variables: behavior-terminal lateral offset, action time, and acceleration/deceleration behavior.
3. The invention performs trajectory planning and motion control by the online optimization solution of the decision-parameter values using a model predictive control method, with different terminal-state constraints suited to different driving tasks and road conditions.
Drawings
FIG. 1 is a schematic view of position numbers of a host vehicle and an environmental vehicle;
FIG. 2 is a block diagram of the system architecture of the present invention;
FIG. 3 is a general flow diagram of the system of the present invention;
FIG. 4 is a lane change diagram of the host vehicle (H) and the environmental vehicles (N1-N8) under scene 1;
FIG. 5 is a lane change diagram of the host vehicle (H) and the environmental vehicles (N1-N8) under scene 2;
Detailed Description
Because the driving-behavior characteristics of a driver in a real driving environment are unknown at the system design stage, an accurate model is difficult to establish, and the system needs to improve its overall performance through continuous learning. To improve the system's adaptability to the different driving-behavior characteristics of different drivers, and to further ensure system safety while obtaining good driving performance, the invention designs, based on a parameterized decision framework, a parameterized learning control system suitable for lane-changing and lane-keeping behaviors. It comprises a learning decision method designed on a reinforcement learning algorithm for the vehicle lane-changing and lane-keeping scenario, and a trajectory planning controller which, after corresponding parameterization, is applicable to straight roads and curves in this scenario.
A parameterized learning decision control system suitable for lane-changing and lane-keeping behaviors comprises several sub-modules; its structural block diagram is shown in Fig. 2. It mainly comprises the perception-signal collection and data storage module A, the learning decision parameter module B, the trajectory planning and motion control module C, and the execution tracking module D, which together form a parameterized decision framework suitable for lane-changing and lane-keeping behaviors.
The perception-signal collection and data storage module A obtains the running-state information of the host vehicle and the surrounding environmental vehicles and performs signal processing, including: the lane, speed, and acceleration of the surrounding vehicles and their relative distances to the host vehicle, taking the lane as reference, obtained by means of the on-board camera and radar environment-sensing elements of the on-board intelligent sensing module; and the driving intention of each environmental vehicle (lane keeping or lane changing), obtained from its deviation from the lane centerline or from turn-signal information, together with the lane and speed of the host vehicle. The module also collects data for the subsequent learning and training of the decision parameters.
The learning decision parameter module B learns suitable decision-parameter values by a reinforcement-learning method. Under urban or highway conditions, lane-changing and lane-keeping behaviors are the most common, and the decision-parameter feature relationships are simple and consistent, namely the specific values of the behavior-terminal lateral offset, the action time, and the acceleration/deceleration behavior. The two continuous variables, action time and acceleration/deceleration behavior, are discretized over their value-range spaces to obtain a discrete action space, and the state design and return design are then carried out. When the quantity of data collected by the system reaches a certain threshold, learning is performed with a kernel-based least-squares policy-iteration reinforcement learning algorithm.
The trajectory planning and motion control module C performs the online optimization solution according to the decision-parameter values output by the learning decision parameter module B and is used for real-time trajectory planning and motion control of the vehicle. The perception-signal collection and data storage module A judges the type of the road currently being driven, and the trajectory is optimized in a rolling manner based on a model predictive control method. A nonlinear state-space equation with a six-dimensional state vector is established, together with constraint equations with terminal-state constraints, so that the execution of the action can match different road types.
The specific decision-parameter values output by the learning decision parameter module B determine the form of the controller. For the two different road conditions, straight road and curve, two different sets of terminal-state equality constraints are used: under the straight-road condition the terminal lateral offset, heading angle, lateral velocity, and yaw rate of the vehicle are constrained, whereas under the curve condition only the terminal lateral displacement and heading angle are constrained. The behavior-terminal lateral offset, action time, and acceleration/deceleration behavior correspond respectively to the terminal lateral-offset equality constraint, the prediction horizon, and the acceleration reference term of the objective function in the model predictive controller. The execution tracking module D performs tracking control of the control quantities output by the algorithm, and is implemented with a PID controller to ensure control precision.
On this basis, Fig. 3 shows the overall flowchart of the technical scheme of the invention; the specific implementation process is as follows.
As shown in Fig. 3, the learning process of the entire system takes place either during human driving or in a virtual simulation environment. When a human driver drives, only the perception-signal collection and data storage module A and the learning decision parameter module B work; when learning in the virtual simulation environment, or when verifying the learning effect, modules A-D work simultaneously. The perception-signal collection and data storage module A obtains the lane, speed, and acceleration of the surrounding vehicles and their relative distances to the host vehicle, taking the lane as reference, by means of the on-board camera and radar environment-sensing elements of the on-board intelligent sensing module; it obtains the driving intention of each environmental vehicle (lane keeping or lane changing) from its deviation from the lane centerline or from turn-signal information, together with the lane and speed of the host vehicle, and stores this information in the module. When the sample count in the learning decision parameter module B reaches the threshold (10^3), or after the quantity of updated data exceeds 20%, the decision parameters are learned and updated according to the designed kernel-based least-squares policy-iteration reinforcement learning algorithm; otherwise, the system continues to collect human driving data, or uses a random policy to search the action space in the simulation environment. The trajectory planning and motion control module C carries out the online optimization solution for trajectory planning and motion control according to the decision-parameter values output by the learning decision parameter module B, obtaining the control quantities, the front-wheel steering angle δ_f and the rate of change a of the longitudinal speed, whose final output acts on the execution tracking module D. Because the control precision of the vehicle actuators over the control quantities must be ensured, the vehicle execution module D adopts a feedback proportional-integral-derivative (PID) controller to realize tracking execution of the decided quantities.
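The learning trigger described above (a sample threshold of 10^3 or more than 20% updated data) can be sketched as follows; the function and variable names are illustrative.

```python
import random

SAMPLE_THRESHOLD = 1_000   # 10^3 samples, as stated above
UPDATE_FRACTION = 0.20     # 20% of the stored data updated

def should_learn(n_samples: int, n_updated: int) -> bool:
    """Trigger a learning/update pass of the decision parameters."""
    if n_samples >= SAMPLE_THRESHOLD:
        return True
    return n_samples > 0 and n_updated / n_samples > UPDATE_FRACTION

def next_action(policy, state, actions, exploring: bool):
    """While data are still being collected, a random policy searches the
    action space in the simulation environment; afterwards the learned
    policy is used."""
    return random.choice(actions) if exploring else policy(state)
```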
A parameterized learning decision control method suitable for lane changing and lane keeping behaviors comprises the following steps:
Step one: obtain, through the perception-signal collection and data storage module A, the state information of the host vehicle and of the environmental vehicles required by the vehicle control algorithm, comprising: the lane, speed, and acceleration of the surrounding vehicles and their relative distances to the host vehicle, taking the lane as reference, obtained by means of the on-board camera and radar environment-sensing elements of the on-board intelligent sensing module; and the driving intention of each environmental vehicle (lane keeping or lane changing), obtained from its deviation from the lane centerline or from turn-signal information, together with the lane and speed of the host vehicle; and store this information in the module. The specific method is as follows:
The state information of the host vehicle and the environmental vehicles required by the vehicle control algorithm is obtained in the perception-signal collection and data storage module A, comprising the surrounding-vehicle state information obtained by means of the on-board camera and radar environment-sensing elements of the on-board intelligent sensing module. As shown in Fig. 1, the different positions around the host vehicle are labeled, and the target vehicles at the corresponding positions are screened. If a target vehicle is present at a position, the activation flag signal of that position is P_N_flag = 1; otherwise P_N_flag = 0. When the activation flag signal P_N_flag at position N is 1, the corresponding vehicle's lane L_N, velocity v_N, acceleration a_N, and relative distance d_N to the host vehicle, taking its lane as reference, are recorded, together with the driving intention I_N of the environmental vehicle, obtained from its deviation from the lane centerline or from turn-signal information, and the host vehicle's lane L_h and velocity v_h. The driving intention I_N is calculated by equation (1), in which values of I_N of -1, 0, 1 respectively indicate an intention of the environmental vehicle to change lane to the right, keep its lane, or change lane to the left; Flag_light is the turn-signal state, whose values of -1, 0, 1 respectively indicate that the environmental vehicle's right turn signal is on, no turn signal is on, or its left turn signal is on; Δd is the lateral distance of the environmental vehicle, perpendicular to the lane-line direction, relative to the lane it occupies; and d_lane is the distance between two adjacent lanes. This information is finally stored in the module.
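Equation (1) itself is not reproduced above, so the following sketch is only an assumed reading of the two cues it combines: the turn signal takes precedence, and a lateral offset beyond half the lane width otherwise signals an intended change. Both the function and the threshold are illustrative.

```python
def driving_intention(flag_light: int, delta_d: float, d_lane: float) -> int:
    """Infer I_N in {-1, 0, 1} (right change, keep, left change).

    Assumptions: Flag_light in {-1, 0, 1} takes precedence; otherwise a
    lateral offset beyond half the lane width signals a lane change.
    """
    if flag_light != 0:
        return flag_light
    if abs(delta_d) > 0.5 * d_lane:
        return 1 if delta_d > 0 else -1
    return 0
```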
Step two: learn appropriate decision-parameter values, namely the specific values of the behavior-terminal lateral offset, the action time, and the acceleration/deceleration behavior, through the learning decision parameter module B; discretize the two continuous variables, action time and acceleration/deceleration behavior, over their value-range spaces to obtain a discrete action space; perform the state design and return design of the kernel-based least-squares policy-iteration reinforcement learning method; and learn with the reinforcement learning algorithm once the quantity of data collected by the system reaches a certain threshold. The specific method is as follows:
The learning decision parameter module B learns suitable decision-parameter values with a kernel-based least-squares policy-iteration reinforcement learning method. The driving-decision process for lane-changing and lane-keeping behaviors is modeled as a Markov decision process, comprising the state design, action design, and return design. According to the designed Markov decision process model and the recorded data, learning is performed with the kernel-based least-squares policy-iteration reinforcement learning method once the quantity of data collected by the system reaches a certain threshold.
2.1) Establishing the Markov decision process model.
First, state design: according to the numbering of the environmental-vehicle positions relative to the host vehicle in Fig. 1, and in order to fully express the traffic-flow state of the environment, the state of the vehicle at each position N is considered: its current lane L_N, velocity v_N, acceleration a_N, relative distance d_N to the host vehicle, taking its lane as reference, and driving intention I_N, obtained from the lateral deviation Δd of the environmental vehicle relative to the lane centerline or from the turn-signal information Flag_light, where the subscript N denotes the vehicle at position N. The state vector also includes the state of the host vehicle: its lane L_h and velocity v_h. The numerical values of these state quantities are read, calculated, and stored in the perception-signal collection and data storage module A. Thus, the state vector s can be represented as

s = (L_h, v_h, L_1, v_1, a_1, d_1, I_1, ..., L_8, v_8, a_8, d_8, I_8)^T.  (2)

When no environmental vehicle exists at a given position, the corresponding state-vector entries are set to 0.
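A sketch of assembling the state vector s of equation (2), assuming the eight environment positions N1-N8 of Fig. 1; the record field names are illustrative:

```python
import numpy as np

N_POSITIONS = 8   # environment-vehicle slots N1..N8 of Fig. 1

def build_state(host, env):
    """Assemble s of equation (2): (L_h, v_h) followed by
    (L_N, v_N, a_N, d_N, I_N) per position, zero-filled when empty."""
    s = [host["L_h"], host["v_h"]]
    for n in range(1, N_POSITIONS + 1):
        veh = env.get(n)          # None when P_N_flag == 0
        s.extend([0.0] * 5 if veh is None else
                 [veh["L"], veh["v"], veh["a"], veh["d"], veh["I"]])
    return np.asarray(s, dtype=float)
```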
Second, action design: in the framework of this problem, the decision parameters corresponding to the lane-changing and lane-keeping behavior scenarios are unified and determined: the behavior-terminal lateral offset T_y, the action time t_f, and the acceleration/deceleration behavior a_tar. These decision parameters can be applied directly to the controller in the trajectory planning and motion control module C, corresponding respectively to the terminal lateral-offset equality constraint, the prediction horizon, and the acceleration reference term of the objective function in the model predictive controller. Thus, the action vector a can be represented as

a = (T_y, t_f, a_tar)^T,  (3)

where the behavior-terminal lateral offset T_y ∈ {-d_lane, 0, d_lane}, with d_lane the distance between two adjacent lanes, corresponding respectively to changing lane to the left, keeping the lane, and changing lane to the right. Within the action space, the two continuous variables, action time t_f and acceleration/deceleration behavior a_tar, are discretized over their value ranges to obtain a discrete action space: the action time t_f takes the discrete values of equation (4), and the acceleration/deceleration behavior a_tar ∈ {-1.5, -0.5, 0, 0.5, 1.5}. These parameterized decisions can be used to describe human driving behavior, as shown in Table 1.
TABLE 1 parameterized decision and human decision analogy example
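The resulting discrete action space can be enumerated as follows; the lane width and the t_f grid are assumed values (the text discretizes t_f over its range without reproducing the grid of equation (4)), while the a_tar values are from the text:

```python
from itertools import product

D_LANE = 3.5                             # lane width in m (assumed value)
TY_SET = [-D_LANE, 0.0, D_LANE]          # left change / keep / right change
TF_SET = [2.0, 3.0, 4.0]                 # s; assumed grid over the t_f range
ATAR_SET = [-1.5, -0.5, 0.0, 0.5, 1.5]   # m/s^2, from the text

ACTIONS = list(product(TY_SET, TF_SET, ATAR_SET))
# 3 x 3 x 5 = 45 discrete actions a = (Ty, tf, a_tar)
```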
Third, return design. The return function considers a safety factor r_s, a rapidity factor r_r, and a ride-comfort factor r_c. The safety factor r_s is given by equation (5); the rapidity and comfort factors are

r_r = β1 · a_tar,  (6)
r_r = r_r - 0.5 if t_f = 4,  (7)
r_c = -β1 · |a_tar|,  (8)
r_c = r_c - 0.5 if t_f = 2,  (9)

where d_N is the relative distance of the vehicle at position N to the host vehicle, taking its lane as reference, d_c is the collision distance, TH = d_N / v_h is the time headway, TH_exp is the desired headway, L_N is the lane of the vehicle at position N, L_h is the host-vehicle lane, β1 and β2 are weight coefficients, t_f is the action time, and a_tar is the acceleration/deceleration behavior. The total return is then calculated as

r = r_s + r_r + r_c + r_a,  (10)

where the transfer function of reinforcement learning is replaced here by the actual trajectory planning and motion control module C, so that r_a is the return fed back by module C after trajectory planning; its specific value is explained further in the description of module C.
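The return terms can be sketched as follows; since equation (5) is not reproduced above, the safety term r_s below is only an assumed shape built from its listed variables, while r_r and r_c follow equations (6)-(9):

```python
def reward(d_N, d_c, v_h, TH_exp, L_N, L_h, a_tar, t_f, beta1, r_a):
    """Total return r = r_s + r_r + r_c + r_a, equation (10)."""
    TH = d_N / v_h                     # time headway
    # assumed shape for r_s (equation (5) not reproduced): penalize being
    # inside the collision distance or below the desired headway in-lane
    r_s = -1.0 if (L_N == L_h and (d_N < d_c or TH < TH_exp)) else 0.0
    r_r = beta1 * a_tar                # (6) rapidity
    if t_f == 4:
        r_r -= 0.5                     # (7)
    r_c = -beta1 * abs(a_tar)          # (8) ride comfort
    if t_f == 2:
        r_c -= 0.5                     # (9)
    return r_s + r_r + r_c + r_a       # (10)
```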
2.2) Kernel-based least-squares policy-iteration algorithm: in a continuous state space, a function-approximation method is used to represent the state-action value function, and the kernel-based least-squares policy-iteration algorithm is used to solve for the weight vector of the state-action value function in reinforcement learning. First, a kernel dictionary is obtained through a sparsification process. The feature vector is designed from the state vector s and action vector a of the state-action pair m = (s, a) and can be expressed as φ(m) = [s^T, a^T]^T. A radial basis function is selected as the kernel, which can be expressed as

κ(m_i, m_j) = <φ(m_i), φ(m_j)>,  (11)

where <·,·> denotes the inner product of two vectors, and the feature maps φ(m_i), φ(m_j) normalize state vectors of different ranges and distinguish action vectors from state vectors. The sample set is denoted M = {m_1, m_2, ..., m_p}, and the feature-vector set is Φ = {φ(m_1), φ(m_2), ..., φ(m_p)}. Screening is performed over this set: if the residual of linearly approximating the current feature vector by the feature vectors already in the dictionary exceeds a threshold, the feature vector is added to the kernel dictionary used to approximate the value function.
The screening process is described as follows. Assume that after traversing q samples, the kernel dictionary D_{t-1} contains t-1 (1 < t ≤ p) feature vectors. For the (q+1)-th sample, whether it should be added to the kernel dictionary is judged by computing

ξ = min_λ ‖ Σ_{j=1}^{t-1} λ_j φ(m_j) - φ(m_{q+1}) ‖²,  (12)

where λ = [λ_1, λ_2, ..., λ_{t-1}] is a weight vector. The solution of equation (12) is

ξ = w_{(q+1)(q+1)} - w_{t-1}(m_{q+1})^T λ,  with  λ = W_{t-1}^{-1} w_{t-1}(m_{q+1}),  (13)

where [W_{t-1}]_{i,j} = κ(m_i, m_j) is a (t-1) × (t-1) matrix, w_{(q+1)(q+1)} = κ(m_{q+1}, m_{q+1}) is the inner product of the current feature vector m_{q+1} with itself, and w_{t-1}(m_{q+1}) = [κ(m_1, m_{q+1}), κ(m_2, m_{q+1}), ..., κ(m_{t-1}, m_{q+1})]^T is the (t-1)-dimensional column vector of inner products between the dictionary feature vectors and the current one. If ξ > μ, the feature vector is added to the kernel dictionary; otherwise it is not. This continues until all samples have been tested.
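A sketch of this sparsification in code, using a radial-basis kernel with an assumed width σ:

```python
import numpy as np

def rbf(mi, mj, sigma=1.0):
    """Radial-basis kernel kappa(m_i, m_j); the width sigma is assumed."""
    diff = np.asarray(mi) - np.asarray(mj)
    return float(np.exp(-np.linalg.norm(diff) ** 2 / (2.0 * sigma ** 2)))

def build_dictionary(samples, mu=0.1, kernel=rbf):
    """Sparsification of equations (12)-(13): a sample joins the dictionary
    only if its feature vector is not well approximated by those in it."""
    D = [samples[0]]
    for m in samples[1:]:
        W = np.array([[kernel(a, b) for b in D] for a in D])  # W_{t-1}
        w = np.array([kernel(d, m) for d in D])               # w_{t-1}(m)
        lam = np.linalg.solve(W, w)                           # lambda in (13)
        xi = kernel(m, m) - w @ lam                           # residual in (13)
        if xi > mu:
            D.append(m)
    return D
```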
After the kernel dictionary is obtained, the state-action value function is linearly approximated using the feature vectors in the kernel dictionary. The state-action value function can be expressed as

Q̂(m_i) = Σ_{j=1}^{t} α_j κ(m_i, m_j),  (14)

where Q̂(m_i) is the estimate of the state-action value function at state-action pair m_i, α = (α_1, α_2, ..., α_t) is a weight vector, and φ(m_j) is the feature vector of pair m_j. For the ii-th sample pair m_ii and the (ii+1)-th sample pair m_{ii+1}, the incremental iterative update equations are

A_ii = A_{ii-1} + w_t(m_ii) (w_t(m_ii) - γ w_t(m_{ii+1}))^T,
b_ii = b_{ii-1} + w_t(m_ii) r_ii,
α_ii = A_ii^{-1} b_ii,  (15)

where w_t(m_ii) = [κ(m_1, m_ii), κ(m_2, m_ii), ..., κ(m_t, m_ii)]^T and w_t(m_{ii+1}) = [κ(m_1, m_{ii+1}), κ(m_2, m_{ii+1}), ..., κ(m_t, m_{ii+1})]^T are computed from m_ii, m_{ii+1} and the feature vectors in the dictionary; γ is the discount factor and r_ii the recorded return of sample ii; A_{ii-1}, A_ii are t × t matrices and b_{ii-1}, b_ii are t-dimensional column vectors, corresponding to the values of matrix A and vector b in two successive iterative updates; and α_ii is the linear-approximation weight vector of the state-action value function estimated after ii samples.
Based on the estimate Q̂ of the state-action value function, the policy is improved. The updated policy can be expressed as

π(s) = argmax_a Q̂(s, a).  (16)

Iteration continues until, for all sample states in the data set, the stored actions coincide with the actions given by the current policy; the algorithm has then converged and terminates.
The specific calculation process is as follows:
Step (1): obtain the data set M = {m_1, m_2, ..., m_p} and the kernel function κ, and initialize the empty kernel dictionary D_0 and the threshold μ;
Step (2): loop over i = 1 : p, computing equation (13); if ξ > μ, add the current feature vector to the dictionary, otherwise set i = i + 1;
Step (3): with the kernel dictionary obtained, perform policy iteration, initializing a zero matrix A, a zero vector b, and a zero weight vector α;
Step (4): loop repeatedly over i = 1 : p, computing equation (15), until the data-set policy is consistent with the current policy;
Step (5): output the weight vector α.
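A sketch of the policy-evaluation update (15) and greedy improvement (16), reusing rbf from the sparsification sketch above; the discount factor γ and the batch (rather than strictly incremental) accumulation of A and b are assumptions:

```python
import numpy as np

def lstdq_weights(transitions, D, kernel=rbf, gamma=0.95):
    """Accumulate A and b over samples (m_ii, r_ii, m_ii+1) and solve for
    the weight vector alpha, as in equation (15)."""
    t = len(D)
    A, b = np.zeros((t, t)), np.zeros(t)
    for m, r, m_next in transitions:
        w = np.array([kernel(d, m) for d in D])            # w_t(m_ii)
        w_next = np.array([kernel(d, m_next) for d in D])  # w_t(m_ii+1)
        A += np.outer(w, w - gamma * w_next)
        b += r * w
    return np.linalg.lstsq(A, b, rcond=None)[0]            # alpha

def greedy_action(state, actions, alpha, D, kernel=rbf):
    """Policy improvement, equation (16): argmax over the discrete actions
    of the kernel-approximated Q-value."""
    def q(a):
        m = np.concatenate([state, np.asarray(a)])
        return alpha @ np.array([kernel(d, m) for d in D])
    return max(actions, key=q)
```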
Step three: perform the online optimization solution for trajectory planning and motion control through the trajectory planning and motion control module C according to the decision-parameter values output by the learning decision parameter module B, using a state-space equation containing the vehicle dynamics equations and a six-dimensional state vector, and establishing constraint equations with terminal-state constraints so that the execution of the action can match different road types. The decision parameters corresponding to the lane-changing and lane-keeping behavior scenarios are unified and determined, namely the behavior-terminal lateral offset, the action time, and the acceleration/deceleration behavior, corresponding respectively to the terminal lateral-offset equality constraint, the prediction horizon, and the acceleration reference term of the objective function in the model predictive controller. For the two different road conditions, straight road and curve, two different sets of terminal-state equality constraints are used: under the straight-road condition the terminal lateral offset, heading angle, lateral velocity, and yaw rate of the vehicle are constrained, whereas under the curve condition only the terminal lateral displacement and heading angle are constrained. The specific method is as follows:
3.1) Establishment of the nonlinear trajectory-planning motion equations: the single-track (bicycle) vehicle dynamics model can be expressed as

M (dv_y/dt + v_x w_r) = F_yf + F_yr,
I_z (dw_r/dt) = l_f F_yf - l_r F_yr,  (17)

where M is the vehicle mass, v_x the longitudinal speed, v_y the vehicle lateral speed, w_r the vehicle yaw rate, F_yf and F_yr the front- and rear-wheel lateral forces, I_z the moment of inertia of the vehicle about the z-axis, and l_f, l_r the distances from the center of mass to the front and rear axles. Since the longitudinal speed and the steering motion of the vehicle are tracked and controlled in the execution tracking module D, the control quantities are simplified here to the front-wheel steering angle δ_f and the rate of change a of the longitudinal speed. The tire lateral forces F_yf, F_yr can be expressed as

F_yf = C_f (δ_f - (v_y + l_f w_r) / v_x),
F_yr = C_r (l_r w_r - v_y) / v_x,  (18)

where δ_f is the front-wheel steering angle and C_f, C_r are the front- and rear-wheel cornering stiffnesses. Meanwhile, from the kinematic relations of the vehicle, dψ/dt = w_r, dX/dt = v_x cos ψ - v_y sin ψ, and dY/dt = v_x sin ψ + v_y cos ψ, where ψ is the vehicle heading angle. Considering these motion equations in the global coordinate system, the nonlinear vehicle motion-space equation is established as

dx/dt = f(x, u),  x = [X, Y, ψ, v_x, v_y, w_r]^T,  u = [a, δ_f]^T,  (19)

where x is the six-dimensional state vector, u is the control vector, F_yf and F_yr are calculated from equation (18), and X, Y are the vehicle position in the global coordinate system.
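A sketch of equations (17)-(19) as a state-derivative function; the sign conventions follow the standard linear single-track model and are assumptions consistent with the reconstruction above:

```python
import numpy as np

def vehicle_dynamics(x, u, p):
    """State derivative for equations (17)-(19).
    x = (X, Y, psi, vx, vy, wr); u = (a, delta_f);
    p = dict with M, Iz, lf, lr, Cf, Cr. Requires vx > 0."""
    X, Y, psi, vx, vy, wr = x
    a, delta_f = u
    Fyf = p["Cf"] * (delta_f - (vy + p["lf"] * wr) / vx)   # (18), front
    Fyr = p["Cr"] * (p["lr"] * wr - vy) / vx               # (18), rear
    return np.array([
        vx * np.cos(psi) - vy * np.sin(psi),               # X_dot
        vx * np.sin(psi) + vy * np.cos(psi),               # Y_dot
        wr,                                                # psi_dot
        a,                                                 # vx_dot (tracked by module D)
        (Fyf + Fyr) / p["M"] - vx * wr,                    # vy_dot from (17)
        (p["lf"] * Fyf - p["lr"] * Fyr) / p["Iz"],         # wr_dot from (17)
    ])
```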
3.2) Establishment of the optimized trajectory planner: first come the terminal-state equality constraints, which depend on the road type. The idea is that, for a given task, its completion can be guaranteed only if certain terminal-state conditions are satisfied at the end of the prediction horizon. For lane-keeping and lane-changing tasks in a straight-road environment, the task is completed when, at the terminal time, the yaw rate and lateral velocity have returned to 0, the heading angle is consistent with the current lane centerline, and the position lies on the target-lane centerline; in a curve environment, the equality constraints requiring the yaw rate and lateral velocity to return to 0 can be relaxed. Thus, the terminal equality constraints in a straight-road environment are

w_r(t_f) = 0,  v_y(t_f) = 0,  ψ(t_f) = ψ_lane,  Y(t_f) = y_l,f,  (20)

where w_r(t_f), v_y(t_f), ψ(t_f), Y(t_f) are respectively the yaw rate, lateral velocity, heading angle, and lateral displacement at the end of the prediction horizon, ψ_lane is the heading of the lane centerline, and y_l,f is the desired terminal lateral displacement: for lane keeping, y_l,f = 0; for lane changing, y_l,f = d_lane, the lateral distance between adjacent lanes. The terminal equality constraints in a curved-road environment are

ψ(t_f) = ψ_lane,  P(t_f) = P_lane,  (21)

where ψ_lane is the heading angle of the target-lane centerline at the point of closest perpendicular distance to the current vehicle position, P(t_f) is the vehicle position at the end of the prediction horizon, and P_lane is the position of the target-lane centerline at that closest point. At the same time, the control quantities must satisfy the inequality constraints

a_min ≤ a ≤ a_max,  δ_f,min ≤ δ_f ≤ δ_f,max,  (22)

where the subscripts min and max denote the minimum and maximum values of the corresponding variables.
The objective function considers, over the prediction horizon, integral-type performance indices on the changes Δδ_f and Δa of the control quantities (the front-wheel steering angle δ_f and the rate of change a of the longitudinal speed) and on the deviation of a from the desired acceleration/deceleration behavior a_tar. The objective function of the controller can be expressed as

J = Σ_k ( w_1 ‖Δδ_f(k)‖² + w_2 ‖Δa(k)‖² + w_3 ‖a(k) - a_tar‖² ),  (23)

where w_1, w_2, w_3 are weight coefficients.
Thus, an optimization problem can be established as

min_u J subject to the vehicle model (19), the inequality constraints (22), and the terminal constraints (20) with P(t_f) ∈ R_ac, or (21) with P(t_f) ∈ R_cd,  (24)

where P(t_f) ∈ R_ac and P(t_f) ∈ R_cd denote the vehicle position at the end of the prediction horizon lying in the straight-road and curved-road terminal sets, respectively.
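A direct-shooting sketch of problem (24) for the straight-road case, reusing vehicle_dynamics from the sketch above; here the terminal equality constraints (20) are handled as quadratic penalties, and the bounds, weights, and horizon discretization are assumed values rather than the patent's:

```python
import numpy as np
from scipy.optimize import minimize

def plan_trajectory(x0, a_tar, y_lf, p, tf=3.0, N=15):
    """Plan over horizon tf with N Euler steps; x0 = (X, Y, psi, vx, vy, wr)
    with vx > 0, controls u_k = (a_k, delta_f_k)."""
    dt = tf / N

    def simulate(u_flat):
        u = u_flat.reshape(N, 2)
        x, cost = np.asarray(x0, dtype=float), 0.0
        for k in range(N):
            cost += (u[k, 0] - a_tar) ** 2                  # accel reference, (23)
            if k > 0:
                cost += np.sum((u[k] - u[k - 1]) ** 2)      # rate terms, (23)
            x = x + dt * vehicle_dynamics(x, u[k], p)       # dynamics (19)
        return x, cost

    def objective(u_flat):
        xT, cost = simulate(u_flat)
        # quadratic penalties standing in for (20): Y, psi, vy, wr terminal
        terminal = (xT[1] - y_lf) ** 2 + xT[2] ** 2 + xT[4] ** 2 + xT[5] ** 2
        return cost + 1e3 * terminal

    bounds = [(-3.0, 3.0), (-0.5, 0.5)] * N                 # (22), assumed limits
    return minimize(objective, np.zeros(2 * N), bounds=bounds,
                    method="L-BFGS-B")
```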
3.3) Driving-decision return calculation by the trajectory planning and motion control module: the transfer function in reinforcement learning is replaced by the actual trajectory planning and motion control module C, and the return r_a fed back by module C after trajectory planning is calculated by equation (25) from the result of solving the optimization problem (24).
Finally, the driving strategy is verified after learning. In driving scenario 1, shown in Fig. 4, environmental vehicle N1 keeps driving in lane 2; a second environmental vehicle first drives in lane 2 and then changes into lane 1; a third keeps driving along lane 3; and a fourth changes from lane 3 into lane 4 and then into lane 5, finally keeping its lane. In this scenario, the host vehicle first changes continuously from lane 3 to lane 5, then changes into lane 2, and finally into lane 1.
In driving scenario 2, shown in Fig. 5, environmental vehicle N3 changes into lane 1 after keeping to lane 2 for a period of time; environmental vehicle N4 changes from lane 2 into lane 3 and then into lane 4; environmental vehicle N5 keeps driving along lane 3; environmental vehicle N7 changes into lane 3 after keeping to lane 4 for a period of time; and environmental vehicle N8 keeps driving along lane 4. In this scenario, the host vehicle changes continuously from lane 3 to lane 1 and then keeps its lane.
It can thus be seen that the host vehicle can autonomously switch between lane-keeping and lane-changing operations and perform active lane changes according to the environment; the system is therefore a parameterized learning decision control system suitable for lane-changing and lane-keeping behaviors.

Claims (5)

1. A parameterized learning decision control system suitable for lane changing and lane keeping, characterized by comprising a perception-signal collection and data storage module (A), a learning decision parameter module (B), a trajectory planning and motion control module (C), and an execution tracking module (D);
the perception-signal collection and data storage module (A) is used to obtain the running-state information of the host vehicle and of the vehicles in the surrounding environment, to process the signals, and to collect data for the subsequent learning and training of the decision parameters;
the learning decision parameter module (B) is used to learn from the collected decision data; when the quantity of data collected by the system reaches a certain threshold, or the data have been updated to a certain degree, the system continues learning, and suitable decision-parameter values are learned by a reinforcement-learning method;
the trajectory planning and motion control module (C) is used for real-time trajectory planning and motion control of the vehicle; based on a model predictive control method, the controller form is determined from the specific decision-parameter values output by the learning decision parameter module (B) and the current road type judged by the perception-signal collection and data storage module (A), and the trajectory is optimized in a rolling manner;
the execution tracking module (D) is used for tracking control of the control quantities output by the algorithm, and is implemented with a PID controller to ensure control precision;
the perception-signal collection and data storage module (A) is connected to the learning decision parameter module (B), the trajectory planning and motion control module (C), and the execution tracking module (D); the learning decision parameter module (B) is connected to the trajectory planning and motion control module (C); and the trajectory planning and motion control module (C) is connected to the execution tracking module (D).
2. A method of the parameterized learning decision control system suitable for lane changing and lane keeping according to claim 1, characterized by comprising the following steps:
step one, obtaining, through the perception signal collection and data storage module (A), the state information of the host vehicle and of the environmental vehicles required by the vehicle control algorithm, including: the lane, speed and acceleration of the surrounding vehicles, obtained by means of the on-board camera and radar environment-sensing elements of the on-board intelligent perception module; the relative distance of each surrounding vehicle with respect to the host vehicle, referenced to its lane; the driving intention of each environmental vehicle, namely lane keeping or lane changing, obtained from the vehicle's deviation from the lane centre line or from its turn-signal information; and the lane and speed of the host vehicle; this information is stored in the module;
step two, learning appropriate decision parameter values, namely specific values of the terminal lateral offset, the action time and the acceleration/deceleration behavior, through the learning decision parameter module (B); the two continuous variables, action time and acceleration/deceleration behavior, are discretized over their value-range space to obtain a discrete action space; the state design and return design are carried out for a kernel-function-based least-squares policy iteration reinforcement learning method, and learning with the reinforcement learning algorithm is performed once the amount of data collected by the system reaches a certain threshold;
step three, performing an online optimization solution for trajectory planning and motion control through the trajectory planning and motion control module (C) according to the decision parameter values output by the learning decision parameter module (B), using a state-space equation containing the vehicle dynamics equations and a six-dimensional state vector, and establishing constraint equations with terminal-state constraints so that the executed motion can match different road types; the decision parameters corresponding to the lane-changing and lane-keeping behavior scenes are unified and determined as the terminal lateral offset, the action time and the acceleration/deceleration behavior, corresponding respectively to the terminal lateral-offset equality constraint, the prediction horizon and the acceleration reference term in the objective function of the model predictive controller; for the two road conditions of straight road and curve, two different sets of terminal-state equality constraints are used: under the straight-road condition the terminal lateral offset, heading angle, lateral velocity and yaw rate of the vehicle are constrained, whereas under the curve condition only the terminal lateral displacement and heading angle of the vehicle are constrained;
and step four, performing tracking control of the control quantities output by the algorithm through the execution tracking module (D), with a PID controller adopted to guarantee control accuracy.
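A minimal Python sketch of the step-four tracking loop is given below; the discrete-time PID form, the gains and the variable names are illustrative assumptions, not values taken from the patent.

class PID:
    """Discrete-time PID tracking controller (illustrative gains)."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, reference, measurement):
        error = reference - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# e.g. tracking the front-wheel angle commanded by module (C):
# steer_pid = PID(kp=2.0, ki=0.1, kd=0.05, dt=0.01)
# actuator_cmd = steer_pid.step(delta_f_cmd, delta_f_measured)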
3. The method of the parameterized learning decision control system for lane changing and lane keeping according to claim 1, wherein the specific method of step one is as follows:
obtaining, in the perception signal collection and data storage module (A), the state information of the host vehicle and of the environmental vehicles required by the vehicle control algorithm, as follows: the state information of the surrounding vehicles is obtained by means of the on-board camera and radar environment-sensing elements of the on-board intelligent perception module; the different positions of the surrounding vehicles are labelled, and the target vehicles at the corresponding positions are screened; if a target vehicle exists at a corresponding position, the activation flag signal of that position is P_N_flag = 1, otherwise P_N_flag = 0; when the activation flag signal P_N_flag at position N is 1, the lane L_N, velocity v_N and acceleration a_N of the corresponding vehicle, together with its relative distance d_N with respect to the host vehicle referenced to its lane, are recorded, the driving intention I_N of the environmental vehicle is obtained from its deviation from the lane centre line or from its turn-signal information, and the lane L_h and velocity v_h of the host vehicle are recorded; the driving intention I_N is calculated by equation (1),
wherein values of I_N of -1, 0 and 1 indicate, respectively, the intention of the environmental vehicle to change lane to the right, keep its lane, or change lane to the left; Flag_light is the turn-signal flag, whose values of -1, 0 and 1 indicate, respectively, that the environmental vehicle's right turn signal is on, that no turn signal is on, and that the left turn signal is on; Δd is the lateral distance of the environmental vehicle relative to its current lane, perpendicular to the lane-line direction; d_lane is the distance between two adjacent lanes; this information is finally stored in the module.
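The following Python sketch illustrates one plausible realisation of the intention estimate of equation (1); since the equation itself is not reproduced in the text, the thresholding of Δd (here 25% of d_lane) is an assumption.

def driving_intention(flag_light, delta_d, d_lane, thresh=0.25):
    # Turn signal dominates: -1 right signal, 0 none, 1 left signal.
    if flag_light != 0:
        return flag_light
    # Otherwise infer intention from lateral drift off the centre line;
    # the 25% threshold is an illustrative assumption, not the patent's value.
    if delta_d > thresh * d_lane:
        return 1    # drifting left: lane change to the left
    if delta_d < -thresh * d_lane:
        return -1   # drifting right: lane change to the right
    return 0        # lane keeping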
4. The method of the parameterized learning decision control system for lane changing and lane keeping according to claim 1, wherein the specific method of step two is as follows:
the learning decision parameter module (B) learns appropriate decision parameter values using a kernel-function-based least-squares policy iteration reinforcement learning method; the driving decision process for lane-changing and lane-keeping behaviors is modelled as a Markov decision process, comprising the state design, the action design and the return design; according to the designed Markov decision process model and the recorded data, learning is performed with the kernel-function-based least-squares policy iteration method once the amount of data collected by the system reaches a certain threshold;
2.1) establishing a Markov decision process model;
firstly, designing a state;
the relative positions of the environmental vehicles with respect to the host vehicle are numbered; for a complete representation of the traffic state in the environment, the state of the vehicle at each position N is taken into account, namely its current lane L_N, velocity v_N, acceleration a_N and relative distance d_N with respect to the host vehicle referenced to its lane, together with the driving intention I_N of the environmental vehicle, obtained from its lateral offset Δd relative to the lane centre line or from the turn-signal information Flag_light, where the subscript N denotes the vehicle at position N; the state vector also includes the state of the host vehicle, namely its lane L_h and velocity v_h; the numerical values of these state quantities are read, calculated and stored in the perception signal collection and data storage module (A); thus the state vector s can be represented as
s = (L_h, v_h, ..., L_N, v_N, a_N, d_N, I_N, ...)^T,    (2)
When no environment vehicle exists at the corresponding position, the corresponding state vector value is set to be 0;
Secondly, designing the action;
in the framework of this problem, the decision parameters corresponding to the lane-changing and lane-keeping behavior scenes are unified and determined as the terminal lateral offset T_y, the action time t_f and the acceleration/deceleration behavior a_tar; these decision parameters can be applied directly to the trajectory planning and motion controller in the trajectory planning and motion control module (C), corresponding respectively to the terminal lateral-offset equality constraint, the prediction horizon and the acceleration reference term in the objective function of the model predictive controller; thus the action vector a can be represented as
a = (T_y, t_f, a_tar)^T,    (3)
wherein the terminal lateral offset T_y ∈ {-d_lane, 0, d_lane}, d_lane being the distance between two adjacent lanes, corresponds respectively to changing lane to the left, keeping the lane and changing lane to the right; within the action space, the two continuous variables, action time t_f and acceleration/deceleration behavior a_tar, are discretized over their value-range space to obtain a discrete action space; the action time t_f is thus restricted to the finite set of values given in equation (4),
and the acceleration/deceleration behavior a_tar ∈ {-1.5, -0.5, 0, 0.5, 1.5}; these parameterized decisions are used to describe human driving behavior;
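A Python sketch of the resulting discrete action space follows; the lane width and the discretisation of t_f are assumptions, since equation (4) is not reproduced in the text, while the a_tar set is taken from the claim.

from itertools import product

d_lane = 3.5                              # lane width in metres (assumed)
T_y    = [-d_lane, 0.0, d_lane]           # terminal lateral offset: left / keep / right
t_f    = [2.0, 3.0, 4.0]                  # action time in seconds (assumed discretisation)
a_tar  = [-1.5, -0.5, 0.0, 0.5, 1.5]      # acceleration/deceleration set from the claim

# every action a = (T_y, t_f, a_tar)^T; here 3 x 3 x 5 = 45 candidates
actions = [(ty, tf, at) for ty, tf, at in product(T_y, t_f, a_tar)]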
Thirdly, return design;
in the design of the return function, a safety factor r_s, a rapidity factor r_r and a ride-comfort factor r_c are considered; the safety factor r_s is given by equation (5), and the rapidity and comfort factors are expressed respectively as:
r_r = β_1 a_tar,    (6)
r_r = r_r - 0.5, if t_f = 4,    (7)
r_c = -β_2 |a_tar|,    (8)
r_c = r_c - 0.5, if t_f = 2,    (9)
wherein d_N is the relative distance of the vehicle at position N with respect to the host vehicle, referenced to its lane; d_c is the collision distance; TH = d_N / v_h is the time headway; TH_exp is the desired time headway; L_N is the lane of the vehicle at position N; L_h is the lane of the host vehicle; β_1, β_2 are weight coefficients; t_f is the action time and a_tar the acceleration/deceleration behavior; thus the total return can be calculated as follows:
r = r_s + r_r + r_c + r_a,    (10)
wherein r_a is the return returned after the trajectory planning is carried out by the trajectory planning and motion control module (C);
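As a sketch, the return of equations (6)-(10) can be assembled in Python as below; the safety term r_s is passed in, since its formula (equation (5)) is not reproduced in the text, and the β values are illustrative assumptions.

def total_return(r_s, a_tar, t_f, r_a, beta1=0.5, beta2=0.5):
    r_r = beta1 * a_tar             # rapidity factor, eq. (6)
    if t_f == 4:                    # slow manoeuvres penalised, eq. (7)
        r_r -= 0.5
    r_c = -beta2 * abs(a_tar)       # ride-comfort factor, eq. (8)
    if t_f == 2:                    # abrupt manoeuvres penalised, eq. (9)
        r_c -= 0.5
    return r_s + r_r + r_c + r_a    # total return, eq. (10)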
2.2) the kernel-function-based least-squares policy iteration algorithm: in the continuous state space, a function-approximation method is used to represent the state-action value function; a kernel-function-based least-squares policy iteration algorithm is used to solve for the weight vector of the state-action value function in reinforcement learning; firstly, a kernel dictionary is obtained through a sparsification process; the feature vector is designed from the state vector s and the action vector a of the state-action pair m = (s, a) and can be expressed as φ(m) = [s^T, a^T]^T; selecting a radial basis function as the kernel function, it can be expressed as
κ(m_i, m_j) = <φ(m_i), φ(m_j)>,    (11)
wherein <·,·> denotes the inner product of two vectors, and φ(m_i), φ(m_j) are feature vectors; the state quantities are normalized over their respective ranges so that the action and state components remain distinguishable; the sample set is denoted M = {m_1, m_2, ..., m_p}, and the feature-vector set is Φ = {φ(m_1), φ(m_2), ..., φ(m_p)}; screening is performed on the feature-vector set: if the residual of linearly approximating the current feature vector by the feature vectors already in the dictionary exceeds a threshold, the feature vector is added to the kernel dictionary used to approximate the state-action value function;
the screening process is described as follows: assume that after traversing q samples the kernel dictionary D_{t-1} contains t-1 (1 < t ≤ p) feature vectors; for the (q+1)-th sample, to judge whether it should be added to the kernel dictionary, calculate:
ξ = min_λ || Σ_{j=1}^{t-1} λ_j φ(m_j) - φ(m_{q+1}) ||²,    (12)
wherein λ = [λ_1, λ_2, ..., λ_{t-1}] is the weight vector; the solution of equation (12) is:
λ = W_{t-1}^{-1} w_{t-1}(m_{q+1}),  ξ = w_{(q+1)(q+1)} - w_{t-1}(m_{q+1})^T λ,    (13)
wherein [W_{t-1}]_{i,j} = κ(m_i, m_j) is a (t-1)×(t-1)-dimensional matrix; w_{(q+1)(q+1)} = κ(m_{q+1}, m_{q+1}) is the inner product of the current feature vector m_{q+1} with itself; w_{t-1}(m_{q+1}) = [κ(m_1, m_{q+1}), κ(m_2, m_{q+1}), ..., κ(m_{t-1}, m_{q+1})]^T is the (t-1)-dimensional column vector of inner products between the feature vectors already in the dictionary and the current feature vector; if ξ > μ, the feature vector is added to the kernel dictionary, otherwise it is not; this is repeated until all samples have been tested;
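A direct Python transcription of this sparsification test, under the definitions above, might look as follows (the kernel function and the sample representation are assumptions):

import numpy as np

def build_kernel_dictionary(samples, kernel, mu):
    # The first sample seeds the dictionary.
    dictionary = [samples[0]]
    for m in samples[1:]:
        # W_{t-1}: kernel matrix of the current dictionary members.
        W = np.array([[kernel(mi, mj) for mj in dictionary] for mi in dictionary])
        # w_{t-1}(m): inner products of dictionary members with the candidate.
        w = np.array([kernel(mi, m) for mi in dictionary])
        lam = np.linalg.solve(W, w)           # eq. (13), weight vector lambda
        xi = kernel(m, m) - w @ lam           # residual xi of eq. (12)
        if xi > mu:                           # poorly approximated: add it
            dictionary.append(m)
    return dictionary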
after the kernel dictionary is obtained, the state-action value function is approximated linearly with the feature vectors in the kernel dictionary; the state-action value function is represented as:
Q(m_i) = Σ_{j=1}^{t} α_j κ(m_j, m_i),    (14)
wherein Q(m_i) is the estimate of the state-action value function at the state-action pair m_i; α = (α_1, α_2, ..., α_t) is the weight vector; φ(m_j) is the feature vector of the state-action pair m_j; for the ii-th sample pair m_ii and the (ii+1)-th sample pair m_{ii+1}, the incremental iterative update equations take the least-squares form
A_ii = A_{ii-1} + w_t(m_ii)(w_t(m_ii) - γ w_t(m_{ii+1}))^T,  b_ii = b_{ii-1} + r_ii w_t(m_ii),  α_ii = A_ii^{-1} b_ii,    (15)
with γ the discount factor and r_ii the return of the ii-th sample;
wherein w_t(m_ii) = [κ(m_1, m_ii), κ(m_2, m_ii), ..., κ(m_t, m_ii)]^T and w_t(m_{ii+1}) = [κ(m_1, m_{ii+1}), κ(m_2, m_{ii+1}), ..., κ(m_t, m_{ii+1})]^T are the feature vectors calculated from m_ii, m_{ii+1} and the feature vectors in the dictionary; A_{ii-1}, A_ii are t×t-dimensional matrices and b_{ii-1}, b_ii are t-dimensional column vectors, corresponding respectively to the values of the matrix A and the vector b in two successive iterative updates; α_ii is the linear-approximation weight vector of the state-action value function estimated after iterating over ii samples;
the policy is improved based on the estimate of the state-action value function; the updated policy can be expressed as:
π(s) = argmax_{a∈A} Q(s, a),    (16)
iteration continues until, for all sample states in the data set, the stored actions coincide with the actions produced by the current policy, at which point the algorithm has converged and terminates;
The specific calculation process is as follows:
step (1): obtain the data set M = {m_1, m_2, ..., m_p} and the kernel function κ, and initialize an empty kernel dictionary D_0 and a threshold μ;
step (2): for the loop i = 1 : p, calculate equation (13); if ξ > μ, add the current feature vector to the dictionary; otherwise set i = i + 1;
step (3): with the kernel dictionary obtained, perform policy iteration: initialize a zero matrix A, a zero vector b and a zero weight vector α;
step (4): for the loop i = 1 : p, calculate equation (15), repeating until the actions in the data set are consistent with those of the current policy;
step (5): output the weight vector α.
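The following Python sketch assembles steps (3)-(5) into a compact policy-iteration loop; the discount factor gamma, the greedy next-action selection and the convergence tolerance are standard least-squares policy iteration ingredients assumed here rather than spelled out in the text.

import numpy as np

def klspi(transitions, dictionary, kernel, action_space,
          gamma=0.95, max_iter=50, tol=1e-6):
    # transitions: list of (m, r, s_next) with m = (s, a) a state-action pair.
    t = len(dictionary)

    def w(m):                                  # w_t(m): kernel feature vector
        return np.array([kernel(d, m) for d in dictionary])

    def q(alpha, m):                           # value estimate, eq. (14)
        return alpha @ w(m)

    alpha = np.zeros(t)
    for _ in range(max_iter):
        A, b = np.zeros((t, t)), np.zeros(t)
        for m, r, s_next in transitions:
            # greedy policy improvement, eq. (16)
            a_next = max(action_space, key=lambda a: q(alpha, (s_next, a)))
            phi, phi_next = w(m), w((s_next, a_next))
            A += np.outer(phi, phi - gamma * phi_next)   # eq. (15)
            b += r * phi
        alpha_new = np.linalg.lstsq(A, b, rcond=None)[0]
        if np.linalg.norm(alpha_new - alpha) < tol:      # policy converged
            return alpha_new
        alpha = alpha_new
    return alpha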
5. The method of the parameterized learning decision control system for lane changing and lane keeping according to claim 1, wherein the specific method of step three is as follows:
3.1) establishing the nonlinear trajectory-planning and motion equations: the bicycle vehicle dynamics model can be expressed as
M (dv_y/dt + v_x w_r) = F_yf + F_yr,  I_z (dw_r/dt) = l_f F_yf - l_r F_yr,    (17)
wherein M is the vehicle mass; v_x is the longitudinal vehicle speed; v_y is the lateral vehicle speed; w_r is the vehicle yaw rate; F_yf, F_yr are respectively the lateral forces on the front and rear wheels of the vehicle; I_z is the moment of inertia of the vehicle about the z-axis; l_f, l_r are the distances from the centre of mass to the front and rear axles; the tracking of the longitudinal speed and of the steering motion of the vehicle is carried out in the execution tracking module (D), so the control quantities are here simplified to the front-wheel steering angle δ_f and the rate of change a of the longitudinal speed; the tire lateral forces F_yf, F_yr can be expressed with the linear tire model as
F_yf = C_f (δ_f - (v_y + l_f w_r)/v_x),  F_yr = -C_r (v_y - l_r w_r)/v_x,    (18)
wherein δ_f is the front-wheel steering angle and C_f, C_r are respectively the front- and rear-wheel cornering stiffnesses; meanwhile, according to the motion relationships of the vehicle, dφ/dt = w_r, where φ is the heading angle of the vehicle; considering the equations of motion of the vehicle in the global coordinate system, the nonlinear vehicle motion state-space equation is established as
dX/dt = v_x cos φ - v_y sin φ,  dY/dt = v_x sin φ + v_y cos φ,  dφ/dt = w_r,  dv_x/dt = a,  dv_y/dt = (F_yf + F_yr)/M - v_x w_r,  dw_r/dt = (l_f F_yf - l_r F_yr)/I_z,    (19)
wherein the state variable is the six-dimensional vector [X, Y, φ, v_x, v_y, w_r]^T and the control variable is u = [a, δ_f]^T; F_yf, F_yr can be calculated from equation (18); X and Y are the positions of the vehicle in the global coordinate system;
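A Python sketch of the state-space model of equations (17)-(19) as reconstructed above (the linear tire model and the parameter-dictionary layout are assumptions):

import numpy as np

def bicycle_dynamics(state, u, p):
    # state = [X, Y, phi, v_x, v_y, w_r]; u = [a, delta_f]; requires v_x > 0.
    X, Y, phi, vx, vy, wr = state
    a, delta_f = u
    # linear tire model, eq. (18)
    Fyf = p['Cf'] * (delta_f - (vy + p['lf'] * wr) / vx)
    Fyr = -p['Cr'] * (vy - p['lr'] * wr) / vx
    return np.array([
        vx * np.cos(phi) - vy * np.sin(phi),        # dX/dt
        vx * np.sin(phi) + vy * np.cos(phi),        # dY/dt
        wr,                                         # dphi/dt
        a,                                          # dv_x/dt (rate commanded directly)
        (Fyf + Fyr) / p['M'] - vx * wr,             # dv_y/dt, eq. (17)
        (p['lf'] * Fyf - p['lr'] * Fyr) / p['Iz'],  # dw_r/dt, eq. (17)
    ])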
3.2) establishing the optimized trajectory planner: firstly, the terminal-state equality constraints are related to the road type; for a given task, completion can be guaranteed only when certain terminal-state conditions are satisfied at the end of the prediction horizon; for lane-keeping and lane-changing tasks in a straight-road environment, the task is completed when, at the terminal moment, the yaw rate and the lateral velocity return to 0, the heading angle is consistent with the centre line of the lane, and the position lies on the centre line of the desired lane; in a curve environment, the equality constraints requiring the yaw rate and the lateral velocity to return to 0 can be relaxed; thus the terminal equality constraints in a straight-road environment are
w_r(t_f) = 0,  v_y(t_f) = 0,  φ(t_f) = 0,  Y(t_f) = y_{l,f},    (20)
wherein w_r(t_f), v_y(t_f), φ(t_f) and Y(t_f) are respectively the yaw rate, lateral velocity, heading angle and lateral displacement at the end of the prediction horizon; y_{l,f} is the desired terminal lateral displacement; for lane keeping, y_{l,f} = 0; for lane changing, y_{l,f} = ±d_lane, where d_lane is the lateral distance between adjacent lanes; the terminal equality constraints in a curved-road environment are
φ(t_f) = φ_lane,  p_lane(P(t_f)) = 0,    (21)
wherein φ_lane is the heading angle of the target-lane centre line at the point perpendicularly closest to the current vehicle position; P(t_f) is the predicted vehicle position at the end of the prediction horizon; p_lane(·) is the perpendicular distance from the vehicle position to the centre line of the nearest target lane; at the same time, the control quantities must satisfy the inequality constraints
a_min ≤ a ≤ a_max,  δ_f,min ≤ δ_f ≤ δ_f,max,    (22)
Wherein, subscripts min, max represent the minimum and maximum values of the corresponding variables, respectively;
the objective function considers, over the prediction horizon, the increments Δδ_f and Δa of the control quantities (the front-wheel steering angle δ_f and the rate of change a of the longitudinal speed), together with an integral performance index on the deviation between the rate of change a of the longitudinal speed and the desired acceleration/deceleration behavior a_tar; the objective function of the controller is expressed as
J = Σ ( w_1 ||Δδ_f||² + w_2 ||Δa||² + w_3 ||a - a_tar||² ),    (23)
wherein w_1, w_2, w_3 are weight coefficients;
thus the optimization problem can be established as: minimize the objective function (23) subject to the vehicle model (19), the terminal equality constraints (20) or (21), and the control inequality constraints (22),    (24)
wherein P(t_f) ∈ R_ac and P(t_f) ∈ R_cd denote that the predicted terminal vehicle position lies in the straight-road and curved-road terminal regions, respectively;
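The road-type switch of the terminal constraints can be sketched in Python as residual functions handed to a nonlinear programming solver; the residual forms for the curve case follow the reconstruction above and are assumptions:

def terminal_constraints(state_tf, road_type, y_lf, phi_lane=None, p_lane=None):
    # state_tf = [X, Y, phi, v_x, v_y, w_r] at the prediction-horizon terminal.
    X, Y, phi, vx, vy, wr = state_tf
    if road_type == 'straight':
        # eq. (20): yaw rate, lateral speed, heading and lateral offset all pinned
        return [wr, vy, phi, Y - y_lf]
    # eq. (21), curve: only heading and lateral position are constrained;
    # phi_lane and p_lane come from the target-lane centre-line geometry.
    return [phi - phi_lane, p_lane]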
3.3) the trajectory planning and motion control module performs the driving-decision return calculation: the transition function in reinforcement learning is replaced by the actual trajectory planning and motion control module (C), and the return r_a returned after the trajectory planning and motion control module (C) performs the trajectory planning is calculated.
CN201910952119.1A 2019-10-08 2019-10-08 Parameterized learning decision control system and method suitable for lane changing and lane keeping Active CN110568760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910952119.1A CN110568760B (en) 2019-10-08 2019-10-08 Parameterized learning decision control system and method suitable for lane changing and lane keeping

Publications (2)

Publication Number Publication Date
CN110568760A true CN110568760A (en) 2019-12-13
CN110568760B CN110568760B (en) 2021-07-02

Family

ID=68784244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910952119.1A Active CN110568760B (en) 2019-10-08 2019-10-08 Parameterized learning decision control system and method suitable for lane changing and lane keeping

Country Status (1)

Country Link
CN (1) CN110568760B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955930A (en) * 2016-05-06 2016-09-21 天津科技大学 Guidance-type policy search reinforcement learning algorithm
CN106114501A (en) * 2016-06-23 2016-11-16 吉林大学 A kind of have multimodal lane-change collision avoidance control method based on steering-by-wire
CN107168303A (en) * 2017-03-16 2017-09-15 中国科学院深圳先进技术研究院 A kind of automatic Pilot method and device of automobile
CN109204308A (en) * 2017-07-03 2019-01-15 上海汽车集团股份有限公司 The control method and system that the determination method of lane keeping algorithm, lane are kept
US20180093671A1 (en) * 2017-11-21 2018-04-05 GM Global Technology Operations LLC Systems and methods for adjusting speed for an upcoming lane change in autonomous vehicles
US20190302785A1 (en) * 2018-04-02 2019-10-03 Sony Corporation Vision-based sample-efficient reinforcement learning framework for autonomous driving
CN108819948A (en) * 2018-06-25 2018-11-16 大连大学 Driving behavior modeling method based on reverse intensified learning
CN108932840A (en) * 2018-07-17 2018-12-04 北京理工大学 Automatic driving vehicle urban intersection passing method based on intensified learning
CN110187639A (en) * 2019-06-27 2019-08-30 吉林大学 A kind of trajectory planning control method based on Parameter Decision Making frame

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
CHANG WANG et al.: "Cognitive Competence Improvement for Autonomous Vehicles: A Lane Change Identification Model for Distant Preceding Vehicles", 《DIGITAL OBJECT IDENTIFIER》 *
JINLONG HONG: "Engine Speed Control During Gear Shifting of AMT HEVs with Identified Intake-to-Power Delay", 《IFAC-PAPERSONLINE》 *
JUNJIE WANG et al.: "Lane Change Decision-making through Deep Reinforcement Learning with Rule-based Constraints", 《IJCNN 2019. INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS》 *
YUXIANG ZHANG et al.: "Deterministic Promotion Reinforcement Learning Applied to Longitudinal Velocity Control for Automated Vehicles", 《IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY》 *
YUXIANG ZHANG et al.: "Velocity control in a right-turn across traffic scenario for autonomous vehicles using kernel-based reinforcement learning", 《CHINESE AUTOMATION CONGRESS (CAC)》 *
ZHU BING et al.: "Car-following control of vehicles based on deep reinforcement learning" (in Chinese), 《CHINA JOURNAL OF HIGHWAY AND TRANSPORT》 *
CHEN HONG et al.: "Receding-horizon path planning of intelligent vehicles for dynamic obstacle avoidance" (in Chinese), 《CHINA JOURNAL OF HIGHWAY AND TRANSPORT》 *
CHEN YINYIN: "Research on reinforcement learning algorithms for autonomous driving" (in Chinese), 《CHINA MASTERS' THESES FULL-TEXT DATABASE, ENGINEERING SCIENCE AND TECHNOLOGY II》 *

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021077725A1 (en) * 2019-10-21 2021-04-29 南京航空航天大学 System and method for predicting motion state of surrounding vehicle based on driving intention
CN111192284B (en) * 2019-12-27 2022-04-05 吉林大学 Vehicle-mounted laser point cloud segmentation method and system
CN111192284A (en) * 2019-12-27 2020-05-22 吉林大学 Vehicle-mounted laser point cloud segmentation method and system
WO2021212728A1 (en) * 2020-04-24 2021-10-28 广州大学 Unmanned vehicle lane changing decision-making method and system based on adversarial imitation learning
CN111746544B (en) * 2020-07-13 2021-05-25 吉林大学 Lane changing method for embodying individual behavior of driver
CN111746544A (en) * 2020-07-13 2020-10-09 吉林大学 Lane changing method for embodying individual behavior of driver
CN111985614A (en) * 2020-07-23 2020-11-24 中国科学院计算技术研究所 Method, system and medium for constructing automatic driving decision system
CN111985614B (en) * 2020-07-23 2023-03-24 中国科学院计算技术研究所 Method, system and medium for constructing automatic driving decision system
CN114074680B (en) * 2020-08-11 2023-08-22 湖南大学 Vehicle channel change behavior decision method and system based on deep reinforcement learning
CN114074680A (en) * 2020-08-11 2022-02-22 湖南大学 Vehicle lane change behavior decision method and system based on deep reinforcement learning
CN112051846B (en) * 2020-08-17 2021-11-19 华中科技大学 Multi-mode switching control method and system for full-steering mobile robot
CN112051846A (en) * 2020-08-17 2020-12-08 华中科技大学 Multi-mode switching control method and system for full-steering mobile robot
CN111959492B (en) * 2020-08-31 2022-05-20 重庆大学 HEV energy management hierarchical control method considering lane change behavior in internet environment
CN111959492A (en) * 2020-08-31 2020-11-20 重庆大学 HEV energy management hierarchical control method considering lane change behavior in networking environment
CN111967094B (en) * 2020-09-01 2022-08-16 吉林大学 Backward lane line calculating method based on Mobileye lane line equation
CN111967094A (en) * 2020-09-01 2020-11-20 吉林大学 Backward lane line calculating method based on Mobileye lane line equation
CN114217601A (en) * 2020-09-03 2022-03-22 财团法人车辆研究测试中心 Hybrid decision-making method and system for self-driving
CN114217601B (en) * 2020-09-03 2024-02-27 财团法人车辆研究测试中心 Hybrid decision method and system for self-driving
CN112046484B (en) * 2020-09-21 2021-08-03 吉林大学 Q learning-based vehicle lane-changing overtaking path planning method
CN112046484A (en) * 2020-09-21 2020-12-08 吉林大学 Q learning-based vehicle lane-changing overtaking path planning method
CN114620059A (en) * 2020-12-14 2022-06-14 广州汽车集团股份有限公司 Automatic driving method and system thereof, and computer readable storage medium
CN114620059B (en) * 2020-12-14 2024-05-17 广州汽车集团股份有限公司 Automatic driving method, system thereof and computer readable storage medium
CN112578672B (en) * 2020-12-16 2022-12-09 吉林大学青岛汽车研究院 Unmanned vehicle trajectory control system based on chassis nonlinearity and trajectory control method thereof
CN112578672A (en) * 2020-12-16 2021-03-30 吉林大学青岛汽车研究院 Unmanned vehicle trajectory control system based on chassis nonlinearity and trajectory control method thereof
CN112590792B (en) * 2020-12-18 2024-05-10 的卢技术有限公司 Vehicle convergence control method based on deep reinforcement learning algorithm
CN112590792A (en) * 2020-12-18 2021-04-02 的卢技术有限公司 Vehicle convergence control method based on deep reinforcement learning algorithm
CN112965489A (en) * 2021-02-05 2021-06-15 北京理工大学 Intelligent vehicle high-speed lane change planning method based on collision detection
CN112896191A (en) * 2021-03-08 2021-06-04 京东鲲鹏(江苏)科技有限公司 Trajectory processing method and apparatus, electronic device and computer readable medium
CN112937608B (en) * 2021-03-31 2022-06-21 吉林大学 Track prediction-based integrated rolling decision method and device for unmanned vehicle in ice and snow environment and storage medium
CN112937608A (en) * 2021-03-31 2021-06-11 吉林大学 Track prediction-based integrated rolling decision method and device for unmanned vehicle in ice and snow environment and storage medium
CN113191248A (en) * 2021-04-25 2021-07-30 国能智慧科技发展(江苏)有限公司 Vehicle deviation route detection system based on video linkage and intelligent Internet of things
WO2022237115A1 (en) * 2021-05-13 2022-11-17 中车长春轨道客车股份有限公司 Capability managing and energy saving assisted driving method for railway vehicle, and related device
CN113264059A (en) * 2021-05-17 2021-08-17 北京工业大学 Unmanned vehicle motion decision control method supporting multiple driving behaviors and based on deep reinforcement learning
CN113177663A (en) * 2021-05-20 2021-07-27 启迪云控(上海)汽车科技有限公司 Method and system for processing intelligent network connection application scene
CN113177663B (en) * 2021-05-20 2023-11-24 云控智行(上海)汽车科技有限公司 Processing method and system of intelligent network application scene
CN113548047A (en) * 2021-06-08 2021-10-26 重庆大学 Personalized lane keeping auxiliary method and device based on deep learning
CN113511222A (en) * 2021-08-27 2021-10-19 清华大学 Scene self-adaptive vehicle interactive behavior decision and prediction method and device
CN113511222B (en) * 2021-08-27 2023-09-26 清华大学 Scene self-adaptive vehicle interaction behavior decision and prediction method and device
WO2023082726A1 (en) * 2021-11-12 2023-05-19 京东鲲鹏(江苏)科技有限公司 Lane changing strategy generation method and apparatus, computer storage medium, and electronic device
CN114084155B (en) * 2021-11-15 2023-10-20 清华大学 Predictive intelligent automobile decision control method and device, automobile and storage medium
CN114084155A (en) * 2021-11-15 2022-02-25 清华大学 Predictive intelligent automobile decision control method and device, vehicle and storage medium
CN114114929A (en) * 2022-01-21 2022-03-01 北京航空航天大学 Unmanned vehicle path tracking method based on LSSVM
CN114114929B (en) * 2022-01-21 2022-04-29 北京航空航天大学 Unmanned vehicle path tracking method based on LSSVM
CN115202341B (en) * 2022-06-16 2023-11-03 同济大学 Automatic driving vehicle lateral movement control method and system
CN115202341A (en) * 2022-06-16 2022-10-18 同济大学 Transverse motion control method and system for automatic driving vehicle
CN116088321A (en) * 2023-04-12 2023-05-09 宁波吉利汽车研究开发有限公司 Automatic driving decision control method and device and electronic equipment
CN116476825A (en) * 2023-05-19 2023-07-25 同济大学 Automatic driving lane keeping control method based on safe and reliable reinforcement learning
CN116476825B (en) * 2023-05-19 2024-02-27 同济大学 Automatic driving lane keeping control method based on safe and reliable reinforcement learning

Also Published As

Publication number Publication date
CN110568760B (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN110568760B (en) Parameterized learning decision control system and method suitable for lane changing and lane keeping
CN111845774B (en) Automatic driving automobile dynamic trajectory planning and tracking method based on transverse and longitudinal coordination
CN111338346B (en) Automatic driving control method and device, vehicle and storage medium
Chen et al. Human-centered trajectory tracking control for autonomous vehicles with driver cut-in behavior prediction
Rupp et al. Survey on control schemes for automated driving on highways
CN114379583B (en) Automatic driving vehicle track tracking system and method based on neural network dynamics model
Yoganandhan et al. Fundamentals and development of self-driving cars
Koga et al. Realization of different driving characteristics for autonomous vehicle by using model predictive control
Wu et al. Route planning and tracking control of an intelligent automatic unmanned transportation system based on dynamic nonlinear model predictive control
Kebbati et al. Lateral control for autonomous wheeled vehicles: A technical review
WO2024088068A1 (en) Automatic parking decision making method based on fusion of model predictive control and reinforcement learning
CN115303289A (en) Vehicle dynamics model based on depth Gaussian, training method, intelligent vehicle trajectory tracking control method and terminal equipment
Azam et al. N 2 C: neural network controller design using behavioral cloning
CN114030485A (en) Automatic driving automobile man lane change decision planning method considering attachment coefficient
CN114779641A (en) Environment self-adaptive MPC path tracking control method based on new course error definition
CN113184040B (en) Unmanned vehicle line-controlled steering control method and system based on steering intention of driver
Chen et al. An improved IOHMM-based stochastic driver lane-changing model
Chen et al. Online learning-informed feedforward-feedback controller synthesis for path tracking of autonomous vehicles
CN114077242A (en) Device and method for controlling a hardware agent in a control situation with a plurality of hardware agents
CN113033902A (en) Automatic driving track-changing planning method based on improved deep learning
Fehér et al. Proving ground test of a ddpg-based vehicle trajectory planner
CN115343950A (en) Vehicle path tracking control method and control system suitable for complex road surface
Zhan et al. Risk-aware lane-change trajectory planning with rollover prevention for autonomous light trucks on curved roads
Ting An output-feedback fuzzy approach to guaranteed cost control of vehicle lateral motion
Wang et al. Learning and generalizing motion primitives from driving data for path-tracking applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant