CN114442630B - Intelligent vehicle planning control method based on reinforcement learning and model prediction

Intelligent vehicle planning control method based on reinforcement learning and model prediction

Info

Publication number
CN114442630B
CN114442630B · CN202210088325.4A
Authority
CN
China
Prior art keywords
vehicle
intelligent vehicle
intelligent
potential field
planning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210088325.4A
Other languages
Chinese (zh)
Other versions
CN114442630A (en)
Inventor
陈剑
戚子恒
王通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210088325.4A
Publication of CN114442630A
Application granted
Publication of CN114442630B
Legal status: Active

Classifications

    • G PHYSICS; G05 CONTROLLING, REGULATING; G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots; G05D1/02 Control of position or course in two dimensions; G05D1/021 specially adapted to land vehicles
    • G05D1/0238 using optical position detecting means, using obstacle or wall sensors
    • G05D1/024 using optical position detecting means, in combination with a laser
    • G05D1/0214 with means for defining a desired trajectory, in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • G05D1/0221 with means for defining a desired trajectory, involving a learning process
    • G05D1/0223 with means for defining a desired trajectory, involving speed control of the vehicle
    • G05D1/0257 using a radar
    • G05D1/0276 using signals provided by a source external to the vehicle
    • G05D1/0278 using satellite positioning signals, e.g. GPS
    • Y02T10/40 Engine management systems (Y02T: climate change mitigation technologies related to transportation; Y02T10/10: internal combustion engine [ICE] based vehicles)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Optics & Photonics (AREA)
  • Electromagnetism (AREA)
  • Traffic Control Systems (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses an intelligent vehicle planning control method based on reinforcement learning and model prediction, comprising the following steps: acquiring road-boundary information and obstacle information in the vehicle-body coordinate system from an on-board lidar sensor; acquiring global reference waypoints in the vehicle-body coordinate system from an on-board GPS sensor; constructing a virtual scene around the intelligent vehicle; in that virtual scene, planning a path for the intelligent vehicle with a path-generation module, based on the road-boundary information, obstacle information and global reference waypoints in the body frame, to obtain the planned path; and tracking the planned path with a tracking-control module, thereby realizing planning control of the intelligent vehicle. The invention accelerates the network training of the planning part, preserves the intelligent vehicle's path-planning performance when its positioning is inaccurate, and improves the stability and comfort of the vehicle-body motion.

Description

Intelligent vehicle planning control method based on reinforcement learning and model prediction
Technical Field
The invention belongs to the field of intelligent-vehicle automatic driving and particularly relates to an intelligent vehicle planning control method based on reinforcement learning and model prediction in weak-GPS environments.
Background
With recent economic development and the rising technical level of the automobile industry, vehicle ownership keeps growing, aggravating problems such as traffic accidents, congestion, exhaust emissions and driver fatigue. Driverless cars are energy-saving, environmentally friendly, comfortable and efficient; they are an important trend in future automobile development and are highly valued worldwide.
Path planning and tracking control are key technologies of automated driving. The performance of the path-planning module depends heavily on a high-precision map and high-precision positioning equipment. Compared with a traditional meter-level electronic map, a centimeter-level high-precision map represents road details such as lane count, shape and width far more faithfully and helps the intelligent vehicle plan and decide more accurately. However, the information collection, quality inspection, operation and maintenance involved in producing such maps make them expensive to draw and to keep current. Meanwhile, because GPS signals are easily degraded or lost due to weather, tall buildings or tunnels, high-precision positioning usually has to be paired with a costly IMU for auxiliary localization, which greatly hinders the adoption of intelligent vehicles. The difficulty for the tracking-control module is how to handle the nonlinearity of the vehicle system and the constraints on the state and manipulated variables while tracking the path; and since sensors also introduce errors when measuring the body's motion state, the controller's robustness under such disturbances must be guaranteed.
In recent years, reinforcement learning has achieved great success in fields such as image recognition, speech recognition and robotics. Q-learning is one reinforcement-learning algorithm: an agent occupies some feasible state and, at each time step, transitions to a new state by performing an action, which is accompanied by a reward or a penalty; the agent's goal is to maximize the cumulative reward. Starting in an initially unknown environment, the algorithm interacts with that environment through continual trial and error, steering the vehicle toward actions that maximize its return, and eventually finds a collision-free path around the obstacles.
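For illustration, the following is a minimal tabular Q-learning sketch of the update rule described above; the discretized state/action sizes and the hyperparameters are illustrative assumptions, not part of the invention.

```python
import numpy as np

n_states, n_actions = 100, 4            # assumed discretization of the environment
Q = np.zeros((n_states, n_actions))     # action-value table
alpha, gamma, epsilon = 0.1, 0.95, 0.1  # assumed learning rate, discount, exploration

def choose_action(s):
    """Epsilon-greedy selection: occasionally explore, otherwise exploit Q."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s_next, a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```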
DDPG (Deep Deterministic Policy Gradient) uses an Actor-Critic network structure and adopts the experience-replay mechanism of the DQN (Deep Q-Network) algorithm, establishing a database called the experience pool that stores the agent's interaction data with the environment. During training, the agent randomly samples training data from the experience pool to train the neural networks, which prevents temporal correlation among the training data and effectively improves training efficiency and sample utilization.
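A minimal sketch of such an experience pool, assuming a fixed capacity and uniform random sampling; the class name and sizes are ours, not the patent's.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool storing (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform random sampling breaks the temporal correlation of consecutive steps
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```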
Model Predictive Control (MPC) has found wide application in industrial systems as an effective method for conveniently handling multi-variable constrained control. In recent years MPC has been extended to tracking control of moving bodies; it accomplishes the preset objective in a suboptimal manner while satisfying the system's constraints. In this control scheme, the control sequence is recomputed at every sampling instant by minimizing a cost function under the input and state constraints; after the first control input of the sequence is applied to the system, the online optimization is repeated at the next time step from the latest system state.
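The receding-horizon structure just described can be summarized by the loop below; solve_finite_horizon and plant_step are placeholders for the constrained optimization and the real system, not APIs from the patent.

```python
def mpc_loop(x0, n_steps, solve_finite_horizon, plant_step):
    """Generic receding-horizon loop: optimize, apply the first input, re-measure."""
    x, applied = x0, []
    for _ in range(n_steps):
        u_seq = solve_finite_horizon(x)   # minimize the cost under input/state constraints
        u0 = u_seq[0]                     # only the first input of the sequence is applied
        applied.append(u0)
        x = plant_step(x, u0)             # the system evolves; the next solve starts from here
    return applied
```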
Disclosure of Invention
To address the problem of inaccurate intelligent-vehicle positioning described in the background, the invention provides an intelligent vehicle planning control method based on reinforcement learning and model prediction, which improves existing planning and control algorithms so as to raise the stability and comfort of the intelligent vehicle when its positioning is inaccurate.
The technical scheme adopted by the invention is as follows:
the invention comprises the following steps:
step 1: obtaining an obstacle grid map through a vehicle-mounted laser radar sensor, determining road boundary information and obstacle information around a vehicle body under a laser radar sensor coordinate system based on the obstacle grid map, and then obtaining the road boundary information and the obstacle information under the vehicle body coordinate system after coordinate conversion;
step 2: acquiring global reference road points under a vehicle-mounted GPS sensor coordinate system by using a vehicle-mounted GPS sensor, acquiring vehicle body positioning and motion states by using the vehicle-mounted GPS sensor, and finally carrying out coordinate conversion on the global reference road points based on the vehicle body positioning and the motion states to acquire global reference road points under the vehicle body coordinate system;
step 3: constructing a virtual scene where the intelligent vehicle is located by the barrier grid map and the global reference road points;
step 4: under a virtual scene of the intelligent vehicle, carrying out path planning on the intelligent vehicle by utilizing a path generating module based on road boundary information, barrier information and global reference road points under a vehicle body coordinate system to obtain a planned path of the intelligent vehicle;
step 5: and tracking the planning path of the intelligent vehicle by using a tracking control module, thereby realizing the planning control of the intelligent vehicle.
The path-generation module in step 4 is obtained through the following training steps:
S1: the training of the DDPG-based reinforcement-learning agent is divided in sequence into an initial stage, an intermediate stage and a final stage; the first state space, input in the initial stage, consists of the distances from the intelligent vehicle to the left and right road boundaries and the position of an accurate global reference waypoint in the vehicle-body frame; the second state space, input in the intermediate stage, consists of the first state space plus the position of the nearest obstacle ahead of the intelligent vehicle in the vehicle-body frame; the third state space, input in the final stage, consists of the distances from the intelligent vehicle to the left and right road boundaries, the position of the nearest obstacle ahead in the vehicle-body frame and the position of an inaccurate reference waypoint in the vehicle-body frame;
S2: constructing the action space, which is the intelligent vehicle's front-wheel steering angle δ_f;
S3: training the DDPG-based reinforcement-learning agent on training sets formed from the action space and the different state spaces, setting the reward-and-penalty values and supervising the training process to obtain the trained agent.
The reward-and-penalty values comprise a reward R_arrive for reaching the end point, a penalty R_collision for a collision of the intelligent vehicle, and an intermediate-state penalty R_temp.
The intermediate-state penalty R_temp is calculated through the following steps:
A1: using the potential-field method, assigning corresponding potential-field functions to the road boundaries, the obstacles and the global reference waypoints of each training stage;
A2: computing from these potential-field functions the road-boundary potential field P_R, the obstacle potential field P_O, the accurate global-reference-waypoint potential field P_W and the inaccurate global-reference-waypoint potential field P_W′; superposing the potential fields that apply in the current training stage to obtain its total potential field P_U, which serves as the intermediate-state penalty R_temp;
A3: during training, setting the potential-field parameters of all the potential-field functions of each training stage in A1 according to the total potential field P_U, using a potential-field-based path-planning method; updating each stage's total potential field from the adjusted potential-field functions and taking the updated total potential field as that stage's intermediate-state penalty R_temp.
In the tracking-control module of step 5, a vehicle dynamics model is first established for the intelligent vehicle, and a prediction equation of the vehicle state is then built on the dynamics model;
next, from the prediction equation of the vehicle state and using a model-predictive-control algorithm, an objective optimization function and its constraints are established, yielding the path-tracking controller;
finally, the path-tracking controller tracks the planned path of the intelligent vehicle, thereby realizing planning control of the intelligent vehicle.
The objective optimization function is:

min_{U(t)} J = Σ_{i=0}^{N_p−1} [ ‖y(t+i|t) − r(t+i|t)‖²_Q + ‖u(t+i|t)‖²_R ] + ‖y(t+N_p|t) − r(t+N_p|t)‖²_P

The constraints of the objective optimization function are:

Δu_min ≤ Δu(k|t) ≤ Δu_max
u_min ≤ u(k|t) ≤ u_max
y_min ≤ y(k|t) ≤ y_max
β_min ≤ β(k|t) ≤ β_max
k = t, …, t+N_p−1
y(t+N_p|t) − r(t+N_p|t) ∈ Ω

wherein min_{U(t)} J denotes taking the set of front-wheel steering-angle control quantities that minimizes the intelligent vehicle's objective value over the prediction horizon corresponding to time t; J denotes the objective value of the intelligent vehicle, and U(t) denotes the set of front-wheel steering-angle control quantities over that horizon; ‖·‖²_Q, ‖·‖²_R and ‖·‖²_P denote squared norms weighted by the first weight matrix Q, the second weight matrix R and the third weight matrix P respectively; y(t+i|t) denotes the i-th predicted value of the vehicle-state yaw angle and lateral position at time t, r(t+i|t) the corresponding expected value, and u(t+i|t) the i-th control quantity at time t; y(t+N_p|t) and r(t+N_p|t) denote the predicted and expected yaw angle and lateral position at the N_p-th step at time t; N_p is the prediction horizon; Δu_max and Δu_min are the right and left limit increments of the vehicle's front-wheel steering angle; Δu(k|t) denotes the control increment of the front-wheel steering angle at time k given the current time t, u(k|t) the corresponding control quantity, and u_max and u_min its right and left limit positions; y(k|t) denotes the vehicle-state yaw angle and lateral position at time k given the current time t, with minimum y_min and maximum y_max; β(k|t) denotes the vehicle's centroid sideslip angle at time k given the current time t, with minimum β_min and maximum β_max; and Ω denotes the terminal constraint set.
The terminal constraint set in the objective optimization function is linearized in a preprocessing step.
The beneficial effects of the invention are as follows:
Aiming at scenes in which the intelligent vehicle's positioning is inaccurate, the invention provides a planning control method comprising a path-planning method based on DDPG reinforcement learning and a path-tracking method based on model predictive control, i.e., the path-generation module and the tracking-control module.
In the path-planning method, path generation under inaccurate positioning is realized with the DDPG algorithm, ensuring the safety and smoothness of the path. The potential-field method is used to improve DDPG's reward-and-penalty values, and the training is divided into an initial, an intermediate and a final stage, which raises the algorithm's convergence speed and training efficiency.
In the tracking-control method, the path-tracking controller is realized with a model-predictive-control algorithm, and a terminal cost and a terminal constraint are added to the objective optimization function, improving the stability and control accuracy of the control system; the terminal constraint set is linearized, which preserves the real-time performance of the intelligent vehicle's control system.
The planning control algorithm combining the path-planning and tracking-control methods can smoothly avoid obstacles when the intelligent vehicle's positioning is inaccurate, safely complete navigation tasks along the designed path, and keep the trajectory smooth and stable.
Drawings
Fig. 1 is a schematic diagram of the offset of the collected reference waypoints.
Fig. 2 is a schematic diagram of vehicle-body mislocalization offsetting the reference waypoints.
Fig. 3 is a schematic diagram of the DDPG network structure.
Fig. 4 is a flow diagram of path generation in the virtual environment.
Fig. 5 is the kinematic model of the intelligent vehicle.
Fig. 6 is a schematic diagram of path generation in the virtual environment.
Fig. 7 is the vehicle dynamics model.
Fig. 8 shows the reward curves of reinforcement-learning training.
Fig. 9 is the implementation flow of the planning control of the invention.
Fig. 10 shows the motion profiles of the intelligent vehicle when positioning is inaccurate.
Fig. 11 shows the centroid-sideslip-angle variation of the three methods when positioning is inaccurate.
Fig. 12 shows the lateral-acceleration variation of the three methods when positioning is inaccurate.
Detailed Description
The invention is further illustrated and described below in connection with specific embodiments. The technical features of the embodiments of the invention can be combined with one another provided they do not conflict.
As shown in Fig. 9, the present invention includes the following steps:
Step 1: the intelligent vehicle is equipped with a lidar sensor and a GPS sensor. An obstacle grid map is obtained through the on-board lidar sensor; road-boundary information and information on the obstacles around the vehicle body are determined in the lidar frame from the obstacle grid map and then converted into the vehicle-body frame. The obstacle information is specifically the position of the nearest obstacle ahead of the intelligent vehicle.
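A sketch of how the nearest front obstacle could be read out of such an occupancy grid; the grid orientation, resolution and origin below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def nearest_front_obstacle(grid, resolution=0.1, origin=(0.0, 0.0)):
    """Return the (x, y) of the closest occupied cell ahead of the vehicle (x > 0)."""
    rows, cols = np.nonzero(grid)              # indices of occupied cells
    xs = rows * resolution + origin[0]         # assumed row -> longitudinal x mapping
    ys = cols * resolution + origin[1]         # assumed col -> lateral y mapping
    ahead = xs > 0.0
    if not np.any(ahead):
        return None                            # no obstacle in front of the vehicle
    d2 = xs[ahead] ** 2 + ys[ahead] ** 2
    i = int(np.argmin(d2))
    return float(xs[ahead][i]), float(ys[ahead][i])
```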
Step 2: global reference waypoints are acquired in the on-board GPS sensor's frame, and the same sensor provides the vehicle-body localization and motion state; the global reference waypoints are finally transformed, based on that localization and motion state, into the vehicle-body frame. Environmental interference shifts the GPS signal, so the acquired global reference waypoints drift, as shown in Fig. 1; the same interference also makes the body localization inaccurate, so the global reference waypoints expressed in the body frame are offset, as shown in Fig. 2.
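The coordinate conversion of step 2 amounts to a 2-D rigid transform from the global frame into the body frame; a minimal sketch, assuming a planar pose (x, y, yaw):

```python
import numpy as np

def global_to_body(waypoint_xy, vehicle_xy, vehicle_yaw):
    """Express a global waypoint in the vehicle-body frame (x forward, y left)."""
    dx = waypoint_xy[0] - vehicle_xy[0]
    dy = waypoint_xy[1] - vehicle_xy[1]
    c, s = np.cos(vehicle_yaw), np.sin(vehicle_yaw)
    # Inverse rotation: p_body = R(-yaw) @ (p_global - p_vehicle)
    return np.array([c * dx + s * dy, -s * dx + c * dy])
```

Any error in vehicle_xy or vehicle_yaw propagates directly into the body-frame waypoint, which is exactly the offset depicted in Figs. 1 and 2.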
Step 3: a virtual scene around the intelligent vehicle is constructed from the obstacle grid map and the global reference waypoints.
step 4: as shown in fig. 4, in a virtual scene of the intelligent vehicle, a path generation module is utilized to plan the path of the intelligent vehicle based on road boundary information, barrier information and global reference road points in a vehicle body coordinate system, so as to obtain a planned path of the intelligent vehicle; the kinematic model of the intelligent vehicle is shown in fig. 5, and the generation of the planned path in the virtual environment is shown in fig. 6.
The path-generation module in step 4 is obtained through the following training steps:
S1: the network structure of the DDPG-based reinforcement-learning agent is shown in Fig. 3; its training is divided, from simple training scenes to difficult ones, into an initial stage, an intermediate stage and a final stage. The first state space, input in the initial stage, consists of the distances d_l and d_r from the intelligent vehicle to the left and right road boundaries and the position (d_wx, d_wy) of an accurate global reference waypoint in the vehicle-body frame; the second state space, input in the intermediate stage, consists of the first state space plus the position (d_ox, d_oy) of the nearest obstacle ahead of the intelligent vehicle in the vehicle-body frame; the third state space, input in the final stage, consists of the distances to the left and right road boundaries, the position of the nearest obstacle ahead in the vehicle-body frame and the position (d_wx′, d_wy′) of an inaccurate reference waypoint in the vehicle-body frame, i.e. the third state space is s = {d_l, d_r, d_ox, d_oy, d_wx′, d_wy′}.
S2: the action space is constructed, which is the intelligent vehicle's front-wheel steering angle δ_f.
S3: the DDPG-based reinforcement-learning agent is trained on training sets formed from the action space and the different state spaces, with reward-and-penalty values set and the training process supervised, to obtain the trained agent.
The reward-and-penalty values comprise a reward R_arrive for reaching the end point, a penalty R_collision for a collision of the intelligent vehicle, and an intermediate-state penalty R_temp.
The intermediate-state penalty R_temp is calculated through the following steps:
A1: using the potential-field method, corresponding potential-field functions are assigned to the road boundaries, the obstacles and the global reference waypoints of each training stage;
A2: from these potential-field functions the road-boundary potential field P_R, the obstacle potential field P_O, the accurate global-reference-waypoint potential field P_W and the inaccurate global-reference-waypoint potential field P_W′ are computed; the potential fields that apply in the current training stage are superposed to obtain its total potential field P_U, which serves as the intermediate-state penalty R_temp; i.e. the final-stage intermediate-state penalty is R_temp = P_R + P_O + P_W′.
The potential-field function of the road boundary is:
wherein P_R(d_l, d_r) is the road-boundary potential field, a_R is the potential-field intensity parameter, and d_s is the safe distance from the intelligent vehicle to the road boundary.
The potential-field function of an obstacle is:
wherein P_O(d_ox, d_oy) is the obstacle potential field, and a_o and b_o are the intensity and shape parameters of the obstacle potential function. X_s and Y_s denote the longitudinal and lateral safety distances between the vehicle and the obstacle, the longitudinal direction being the intelligent vehicle's direction of travel and the lateral direction being perpendicular to it, both in the horizontal plane; they are defined as:
X_s = X_0 − vT_0
Y_s = Y_0 + (v·sinθ_e − v_o·sinθ_e)T_0
wherein X_0 and Y_0 denote the minimum longitudinal and lateral safety distances, T_0 is a safety time interval, v is the speed of the intelligent vehicle, v_o is the speed of the obstacle, and θ_e is the heading-angle deviation between the intelligent vehicle and the obstacle.
The accurate and the inaccurate global reference waypoints use the same form of potential-field function, which is:
wherein P_W(d_wy) is the accurate global-reference-waypoint potential field, d_a is the error range of the waypoint's lateral position, and a_w is the potential-field intensity of the global reference waypoint.
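The exact potential-field expressions appear as images in the original publication and are not reproduced in this text, so the functional forms in the sketch below (exponential boundary and obstacle fields, quadratic waypoint field) are assumptions of ours; only the superposition R_temp = P_R + P_O + P_W′ follows the text.

```python
import numpy as np

def road_boundary_field(d_l, d_r, a_R=1.0, d_s=0.5):
    # Assumed form: grows as either boundary distance approaches the safe distance d_s
    return a_R * (np.exp(-max(d_l - d_s, 0.0)) + np.exp(-max(d_r - d_s, 0.0)))

def obstacle_field(d_ox, d_oy, a_o=2.0, b_o=1.0, X_s=10.0, Y_s=2.0):
    # Assumed form: elliptic exponential bump scaled by the safety distances X_s, Y_s
    return a_o * np.exp(-b_o * ((d_ox / X_s) ** 2 + (d_oy / Y_s) ** 2))

def waypoint_field(d_wy, a_w=0.5, d_a=0.3):
    # Assumed form: quadratic penalty on lateral deviation beyond the error range d_a
    return a_w * max(abs(d_wy) - d_a, 0.0) ** 2

def r_temp(d_l, d_r, d_ox, d_oy, d_wy):
    """Total potential P_U = P_R + P_O + P_W', used as the final-stage penalty."""
    return road_boundary_field(d_l, d_r) + obstacle_field(d_ox, d_oy) + waypoint_field(d_wy)
```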
A3: during training, the potential-field parameters of all the potential-field functions of each training stage in A1 are set according to the total potential field P_U, using a potential-field-based path-planning method; each stage's total potential field is updated from the adjusted potential-field functions, and the updated total potential field is taken as that stage's intermediate-state penalty R_temp.
Step 5: the planned path of the intelligent vehicle is tracked with the tracking-control module, thereby realizing planning control of the intelligent vehicle.
In the tracking-control module of step 5, a vehicle dynamics model is first established for the intelligent vehicle, and a prediction equation of the vehicle state is then built on the dynamics model; the vehicle dynamics model is shown in Fig. 7.
Next, from the prediction equation of the vehicle state and using a model-predictive-control algorithm, an objective optimization function with terminal constraint and terminal cost, together with its constraints, is established, yielding the path-tracking controller.
Finally, the path-tracking controller tracks the planned path by controlling the vehicle's front-wheel steering angle, thereby realizing planning control of the intelligent vehicle.
The objective optimization function with terminal constraint and terminal cost is:

min_{U(t)} J = Σ_{i=0}^{N_p−1} [ ‖y(t+i|t) − r(t+i|t)‖²_Q + ‖u(t+i|t)‖²_R ] + ‖y(t+N_p|t) − r(t+N_p|t)‖²_P

The constraints of the objective optimization function are:

Δu_min ≤ Δu(k|t) ≤ Δu_max
u_min ≤ u(k|t) ≤ u_max
y_min ≤ y(k|t) ≤ y_max
β_min ≤ β(k|t) ≤ β_max
k = t, …, t+N_p−1
y(t+N_p|t) − r(t+N_p|t) ∈ Ω

wherein ‖y(t+N_p|t) − r(t+N_p|t)‖²_P is the added terminal cost and y(t+N_p|t) − r(t+N_p|t) ∈ Ω is the added terminal constraint. min_{U(t)} J denotes taking the set of front-wheel steering-angle control quantities that minimizes the intelligent vehicle's objective value over the prediction horizon corresponding to time t. J denotes the objective value of the intelligent vehicle and reflects the requirements on the path-tracking error and on smooth variation of the control quantity over a future horizon; U(t) denotes the set of front-wheel steering-angle control quantities over the prediction horizon corresponding to time t. ‖y(t+i|t) − r(t+i|t)‖²_Q, the squared norm weighted by the first weight matrix Q, is the tracking-error weight at the i-th step and reflects the requirement on the path-tracking error; ‖u(t+i|t)‖²_R, weighted by the second weight matrix R, is the control-smoothness weight at the i-th step and reflects the requirement for smooth variation of the control quantity; ‖y(t+N_p|t) − r(t+N_p|t)‖²_P, weighted by the third weight matrix P, is the tracking-error weight at the N_p-th step. y(t+i|t) denotes the i-th predicted value of the vehicle-state yaw angle and lateral position at time t, and r(t+i|t) the corresponding expected value, obtained from the planned path of the intelligent vehicle; u(t+i|t) denotes the i-th control quantity at time t; y(t+N_p|t) and r(t+N_p|t) denote the predicted and expected yaw angle and lateral position at the N_p-th step at time t; N_p is the prediction horizon; Q, R and P are the first, second and third weight matrices; Δu_max and Δu_min are the right and left limit increments of the front-wheel steering angle; Δu(k|t) denotes the control increment of the front-wheel steering angle at time k given the current time t, u(k|t) the corresponding control quantity, and u_max and u_min its right and left limit positions; y(k|t) denotes the vehicle-state yaw angle and lateral position at time k given the current time t, with minimum y_min and maximum y_max; β(k|t) denotes the vehicle's centroid sideslip angle at time k given the current time t, with minimum β_min and maximum β_max; and Ω denotes the terminal constraint set.
The terminal constraint set in the objective optimization function is linearized in a preprocessing step, which guarantees the real-time performance of the control system.
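A sketch of the finite-horizon problem above, posed as a convex program with cvxpy; the linearized prediction model (A, B), the weights and bounds, and the box approximation of the terminal set Ω are illustrative assumptions rather than the patent's exact formulation.

```python
import cvxpy as cp
import numpy as np

def solve_mpc(x0, r, A, B, Np, Q, R, P, u_lim, du_lim, term_box):
    """One receding-horizon solve; returns the first control input of the sequence."""
    nx, nu = A.shape[0], B.shape[1]
    x = cp.Variable((nx, Np + 1))
    u = cp.Variable((nu, Np))
    cost, cons = 0, [x[:, 0] == x0]
    for k in range(Np):
        cost += cp.quad_form(x[:, k] - r[:, k], Q)     # tracking-error term (weight Q)
        cost += cp.quad_form(u[:, k], R)               # control-smoothness term (weight R)
        cons += [x[:, k + 1] == A @ x[:, k] + B @ u[:, k],
                 cp.abs(u[:, k]) <= u_lim]             # steering-angle limits
        if k > 0:
            cons += [cp.abs(u[:, k] - u[:, k - 1]) <= du_lim]   # increment limits
    cost += cp.quad_form(x[:, Np] - r[:, Np], P)       # terminal cost (weight P)
    cons += [cp.abs(x[:, Np] - r[:, Np]) <= term_box]  # linearized terminal set as a box
    cp.Problem(cp.Minimize(cost), cons).solve()
    return u.value[:, 0]
```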
In this embodiment, the training environment is a joint MATLAB/Simulink and Carsim simulation: the network structure, state space, action space and reward function of the reinforcement-learning algorithm are designed in MATLAB/Simulink, while Carsim provides a high-accuracy, high-fidelity vehicle model.
After the potential fields are designed, their parameters are set with the potential-field path-planning method; if the planned path does not meet the safety requirements, the potential-field parameters are adjusted.
When setting up the reinforcement-learning training scenes, training is divided into three stages from simple to difficult: the initial stage contains only the road boundaries and accurate reference waypoints; the intermediate stage adds an obstacle; the final stage further replaces the accurate reference waypoints with inaccurate ones, as sketched below.
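A sketch of how such a three-stage curriculum could be scheduled in the training loop; the episode thresholds and flag names are illustrative assumptions.

```python
def make_scene(episode):
    """Enable scene elements stage by stage: road only, then obstacle, then noisy waypoints."""
    scene = {"road_boundaries": True, "accurate_waypoints": True,
             "obstacle": False, "inaccurate_waypoints": False}
    if episode >= 300:                      # intermediate stage: an obstacle appears
        scene["obstacle"] = True
    if episode >= 600:                      # final stage: waypoints become inaccurate
        scene["accurate_waypoints"] = False
        scene["inaccurate_waypoints"] = True
    return scene
```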
As shown in Fig. 8, the reinforcement-learning training results indicate that the method improves both the training effect and the convergence rate of the conventional DDPG network.
The proposed controller is tested under the double-lane-change maneuver with noise added to the yaw rate and the lateral velocity, and its tracking performance is compared with that of a conventional model-predictive controller. The mean absolute error (MAE) of the tracking performance is given in the following table:
Table 1: mean absolute error (MAE) of tracking performance
As Table 1 shows, the proposed tracking-control method achieves higher tracking accuracy than the conventional model-predictive-control method in all three cases: no noise, yaw-rate noise and lateral-velocity noise.
The path-planning and tracking-control methods are combined to cope with inaccurate vehicle-body localization; the implementation flow is shown in Fig. 9. Fig. 10 compares the planning-control performance in a scene where both the designed reference waypoints and the vehicle-body localization are inaccurate: Framework A is the planning control method proposed by the invention, Framework B is conventional DDPG planning with pure-pursuit tracking, and PF+MPC is potential-field planning with model-predictive tracking. Figs. 11(a)-(c) show, in order, the centroid-sideslip-angle variation of the three methods, and Figs. 12(a)-(c) their lateral-acceleration variation, characterizing the stability and comfort of the trajectory. Table 2 gives a statistical analysis of the experimental data.
Table 2: analysis of the experimental results of the invention and the other methods
As can be seen from Figs. 9-12 and Table 2, the planning control method designed by the invention gives the intelligent vehicle a more comfortable and more stable motion state when positioning is inaccurate.

Claims (4)

1. An intelligent vehicle planning control method based on reinforcement learning and model prediction, characterized by comprising the following steps:
step 1: obtaining an obstacle grid map through an on-board lidar sensor, determining road-boundary information and information on obstacles around the vehicle body in the lidar frame from the obstacle grid map, and then obtaining the road-boundary information and obstacle information in the vehicle-body frame after coordinate conversion;
step 2: acquiring global reference waypoints in the on-board GPS sensor's frame together with the vehicle-body localization and motion state, and finally transforming the global reference waypoints, based on that localization and motion state, into the vehicle-body frame;
step 3: constructing a virtual scene around the intelligent vehicle from the obstacle grid map and the global reference waypoints;
step 4: in the virtual scene, planning a path for the intelligent vehicle with a path-generation module, based on the road-boundary information, obstacle information and global reference waypoints in the vehicle-body frame, to obtain the planned path of the intelligent vehicle;
step 5: tracking the planned path of the intelligent vehicle with a tracking-control module, thereby realizing planning control of the intelligent vehicle;
wherein the path-generation module in step 4 is obtained through the following training steps:
S1: the training of the DDPG-based reinforcement-learning agent is divided in sequence into an initial stage, an intermediate stage and a final stage; the first state space, input in the initial stage, consists of the distances from the intelligent vehicle to the left and right road boundaries and the position of an accurate global reference waypoint in the vehicle-body frame; the second state space, input in the intermediate stage, consists of the first state space plus the position of the nearest obstacle ahead of the intelligent vehicle in the vehicle-body frame; the third state space, input in the final stage, consists of the distances from the intelligent vehicle to the left and right road boundaries, the position of the nearest obstacle ahead in the vehicle-body frame and the position of an inaccurate reference waypoint in the vehicle-body frame;
S2: constructing the action space, which is the front-wheel steering angle δ_f of the intelligent vehicle;
S3: training the DDPG-based reinforcement-learning agent on training sets formed from the action space and the different state spaces, setting the reward-and-penalty values and supervising the training process to obtain the trained agent;
the reward-and-penalty values comprise a reward R_arrive for reaching the end point, a penalty R_collision for a collision of the intelligent vehicle, and an intermediate-state penalty R_temp;
the intermediate-state penalty R_temp is calculated through the following steps:
A1: using the potential-field method, assigning corresponding potential-field functions to the road boundaries, the obstacles and the global reference waypoints of each training stage;
A2: computing from these potential-field functions the road-boundary potential field P_R, the obstacle potential field P_O, the accurate global-reference-waypoint potential field P_W and the inaccurate global-reference-waypoint potential field P_W′, and superposing the potential fields that apply in the current training stage to obtain its total potential field P_U, which serves as the intermediate-state penalty R_temp;
A3: during training, setting the potential-field parameters of all the potential-field functions of each training stage in A1 according to the total potential field P_U, using a potential-field-based path-planning method, updating each stage's total potential field from the adjusted potential-field functions and taking the updated total potential field as that stage's intermediate-state penalty R_temp.
2. The intelligent vehicle planning control method based on reinforcement learning and model prediction according to claim 1, wherein in the tracking-control module of step 5, a vehicle dynamics model is first established for the intelligent vehicle and a prediction equation of the vehicle state is built on the dynamics model;
then, from the prediction equation of the vehicle state and using a model-predictive-control algorithm, an objective optimization function and its constraints are established, yielding a path-tracking controller;
finally, the path-tracking controller tracks the planned path of the intelligent vehicle, thereby realizing planning control of the intelligent vehicle.
3. The intelligent vehicle planning control method based on reinforcement learning and model prediction according to claim 2, wherein the objective optimization function is:

min_{U(t)} J = Σ_{i=0}^{N_p−1} [ ‖y(t+i|t) − r(t+i|t)‖²_Q + ‖u(t+i|t)‖²_R ] + ‖y(t+N_p|t) − r(t+N_p|t)‖²_P

the constraints of the objective optimization function are:

Δu_min ≤ Δu(k|t) ≤ Δu_max
u_min ≤ u(k|t) ≤ u_max
y_min ≤ y(k|t) ≤ y_max
β_min ≤ β(k|t) ≤ β_max
k = t, …, t+N_p−1
y(t+N_p|t) − r(t+N_p|t) ∈ Ω

wherein min_{U(t)} J denotes taking the set of front-wheel steering-angle control quantities that minimizes the intelligent vehicle's objective value over the prediction horizon corresponding to time t; J denotes the objective value of the intelligent vehicle and U(t) the set of front-wheel steering-angle control quantities over that horizon; ‖·‖²_Q, ‖·‖²_R and ‖·‖²_P denote squared norms weighted by the first weight matrix Q, the second weight matrix R and the third weight matrix P respectively; y(t+i|t) denotes the i-th predicted value of the vehicle-state yaw angle and lateral position at time t, r(t+i|t) the corresponding expected value, and u(t+i|t) the i-th control quantity at time t; y(t+N_p|t) and r(t+N_p|t) denote the predicted and expected yaw angle and lateral position at the N_p-th step at time t; N_p is the prediction horizon; Δu_max and Δu_min are the right and left limit increments of the front-wheel steering angle; Δu(k|t) denotes the control increment of the front-wheel steering angle at time k given the current time t, u(k|t) the corresponding control quantity, and u_max and u_min its right and left limit positions; y(k|t) denotes the vehicle-state yaw angle and lateral position at time k given the current time t, with minimum y_min and maximum y_max; β(k|t) denotes the vehicle's centroid sideslip angle at time k given the current time t, with minimum β_min and maximum β_max; and Ω denotes the terminal constraint set.
4. The intelligent vehicle planning control method based on reinforcement learning and model prediction according to claim 3, wherein the terminal constraint set in the objective optimization function is linearized in a preprocessing step.
CN202210088325.4A 2022-01-25 2022-01-25 Intelligent vehicle planning control method based on reinforcement learning and model prediction Active CN114442630B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210088325.4A CN114442630B (en) 2022-01-25 2022-01-25 Intelligent vehicle planning control method based on reinforcement learning and model prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210088325.4A CN114442630B (en) 2022-01-25 2022-01-25 Intelligent vehicle planning control method based on reinforcement learning and model prediction

Publications (2)

Publication Number Publication Date
CN114442630A CN114442630A (en) 2022-05-06
CN114442630B (en) 2023-12-05

Family

ID=81368785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210088325.4A Active CN114442630B (en) 2022-01-25 2022-01-25 Intelligent vehicle planning control method based on reinforcement learning and model prediction

Country Status (1)

Country Link
CN (1) CN114442630B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114578834B * 2022-05-09 2022-07-26 Peking University Target layering double-perception domain-based reinforcement learning unmanned vehicle path planning method
CN115540896B * 2022-12-06 2023-03-07 GAC Aion New Energy Automobile Co., Ltd. Path planning method and device, electronic equipment and computer readable medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112799386A * 2019-10-25 2021-05-14 Shenyang Institute of Automation, Chinese Academy of Sciences Robot path planning method based on artificial potential field and reinforcement learning
CN110794842A * 2019-11-15 2020-02-14 Beijing University of Posts and Telecommunications Reinforced learning path planning algorithm based on potential field
CN112666939A * 2020-12-09 2021-04-16 Shenzhen Institutes of Advanced Technology Robot path planning algorithm based on deep reinforcement learning
CN112650237A * 2020-12-21 2021-04-13 Wuhan University of Technology Ship path planning method and device based on clustering processing and artificial potential field

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
APF-DPPO: An automatic driving policy learning method based on the artificial potential field method to optimize the reward function; Junqiang Lin; Machines; full text *
Research on AUV collision-avoidance planning based on forward-looking sonar information; Liu Hexiang, Bian Xinqian, Qin Zheng, Wang Hongjian; Journal of System Simulation, No. 24; full text *
Low-cost navigation of intelligent vehicles based on reinforcement learning; Wang Tong; China Master's Theses Full-text Database, Engineering Science and Technology II; 2021; C035-484 *
Nonlinear predictive control for trajectory tracking of constrained nonholonomic mobile robots; Han Guangxin; Journal of Jilin University (Engineering and Technology Edition); pp. 177-181 *

Also Published As

Publication number Publication date
CN114442630A (en) 2022-05-06

Similar Documents

Publication Publication Date Title
CN111845774B (en) Automatic driving automobile dynamic trajectory planning and tracking method based on transverse and longitudinal coordination
CN111289008B (en) Local path planning method for unmanned vehicle
CN110262495B (en) Control system and method capable of realizing autonomous navigation and accurate positioning of mobile robot
Weiskircher et al. Predictive guidance and control framework for (semi-) autonomous vehicles in public traffic
CN114442630B (en) Intelligent vehicle planning control method based on reinforcement learning and model prediction
CN113608531B (en) Unmanned vehicle real-time global path planning method based on safety A-guidance points
CN113848914B (en) Method for planning local path by collision coefficient artificial potential field method in dynamic environment
CN113276848A (en) Intelligent driving lane changing and obstacle avoiding track planning and tracking control method and system
CN111137298B (en) Vehicle automatic driving method, device, system and storage medium
CN112947469A (en) Automobile track-changing track planning and dynamic track tracking control method
CN113255998B (en) Expressway unmanned vehicle formation method based on multi-agent reinforcement learning
CN112577506A (en) Automatic driving local path planning method and system
CN112977478B (en) Vehicle control method and system
CN114942642A (en) Unmanned automobile track planning method
CN116337045A (en) High-speed map building navigation method based on karto and teb
CN115993825A (en) Unmanned vehicle cluster control system based on air-ground cooperation
Zhang et al. Structured road-oriented motion planning and tracking framework for active collision avoidance of autonomous vehicles
Kanchwala et al. Development of an intelligent transport system for EV
López et al. Efficient local navigation approach for autonomous driving vehicles
CN114715193A (en) Real-time trajectory planning method and system
KR102618247B1 (en) Device for correcting localization heading error in autonomous car and operating methdo thereof
Farag et al. MPC track follower for self-driving cars
Shin et al. Design of a vision-based autonomous path-tracking control system and experimental validation
CN113460091B (en) Unprotected crossroad unmanned vehicle rolling optimization decision method
Wang Control system design for autonomous vehicle path following and collision avoidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant