CN113625718A - Method for planning driving path of vehicle - Google Patents

Method for planning driving path of vehicle

Info

Publication number
CN113625718A
CN113625718A
Authority
CN
China
Prior art keywords
vehicle
network
environment
value
speed
Prior art date
Legal status
Granted
Application number
CN202110927868.6A
Other languages
Chinese (zh)
Other versions
CN113625718B (en)
Inventor
莫建林
赖哲渊
张汉驰
Current Assignee
SAIC Volkswagen Automotive Co Ltd
Original Assignee
SAIC Volkswagen Automotive Co Ltd
Priority date
Filing date
Publication date
Application filed by SAIC Volkswagen Automotive Co Ltd filed Critical SAIC Volkswagen Automotive Co Ltd
Priority to CN202110927868.6A priority Critical patent/CN113625718B/en
Publication of CN113625718A publication Critical patent/CN113625718A/en
Application granted granted Critical
Publication of CN113625718B publication Critical patent/CN113625718B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0231Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0238Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors
    • G05D1/024Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors in combination with a laser
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0276Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Optics & Photonics (AREA)
  • Electromagnetism (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention provides a method for planning a driving path of a vehicle, comprising the following steps: generating an environment state feature map sequence of the host vehicle based on map data and target tracking results; acquiring vehicle state information of the host vehicle; inputting the environment state feature map sequence of the host vehicle and the host vehicle state information into a path planning model as environment and state data; and acquiring the planned trajectory of the host vehicle output by the path planning model. The method enables reinforcement learning and training of the path planning model, so that it can be better applied to autonomous-driving scenarios.

Description

Method for planning driving path of vehicle
Technical Field
The present invention relates to the field of automatic driving, and in particular, to a method and an apparatus for planning a driving path of a vehicle, and a computer readable medium.
Background
Some automatic-driving path planning methods are based on a modular approach: the automatic-driving system is explicitly divided into an environment perception module, a prediction module, a path planning module and a control module, with the output of each module used as the input of the next. After the perception and prediction results are obtained, the path planning module searches, using optimization methods such as dynamic programming, for an optimal planned path that satisfies a defined loss function. Other methods adopt imitation learning, which is based on deep learning and learns from large data sets of expert driving through a deep neural network model in order to learn an optimized planned path.
The modular approach is highly interpretable, unlike the essentially uninterpretable black-box mode of deep learning, but it has drawbacks: the modules influence one another, the accuracy of the perception and prediction modules strongly constrains the design of the downstream planning module, and inter-module interference is severe.
Although imitation-learning methods based on deep learning can directly learn a planned trajectory or a driving behavior end to end, they have drawbacks: a very large amount of labeled data is needed for learning, and the learned model only copes with scene patterns that appeared during learning. Data covering extreme scenes (such as running a red light, illegal driving, or vehicle collisions) is extremely rare in expert data, so the learned model cannot be applied well to automatic-driving scenarios.
Disclosure of Invention
The invention aims to provide a method for planning a driving path of a vehicle, which realizes the reinforcement learning and training of a path planning model and enables the path planning model to be better applied to the scene of vehicle automatic driving.
In order to solve the above technical problem, the invention provides a method for planning a driving path of a vehicle, comprising the following steps: generating an environment state feature map sequence of the host vehicle based on map data and target tracking results; acquiring vehicle state information of the host vehicle; inputting the environment state feature map sequence of the host vehicle and the host vehicle state information into a path planning model as environment and state data; and acquiring the planned trajectory of the host vehicle output by the path planning model.
In an embodiment of the present invention, the method further includes: and obtaining a model return estimation value output by the path planning model, and evaluating the model based on the return estimation value.
In an embodiment of the present invention, the path planning model includes a backbone neural network, a first feature vectorization module, a first fully connected network, a second fully connected network, a second feature vectorization module, a third fully connected network, and a fourth fully connected network, connected in sequence; the environment state feature map sequence of the host vehicle is input into the backbone neural network, and the vehicle state information is input into the first feature vectorization module.
In an embodiment of the present invention, the environment and state data are updated through the operation of a controller module and an environment and state data calculation module; wherein the second fully connected network inputs the planned trajectory data of the host vehicle into the controller module, the controller module controls the host vehicle to travel and inputs first data generated by the travel of the host vehicle and second data generated by observing the surroundings of the host vehicle into the environment and state data calculation module, and the environment and state data calculation module updates the environment and state data based on the first data and the second data; the fourth fully connected network outputs the model return estimate.
In an embodiment of the invention, the environment and state data calculation module outputs a vehicle state transition reward value of the path planning model.
In an embodiment of the present invention, the vehicle state information includes a speed, an acceleration, a heading angle and a heading angular velocity.
In an embodiment of the present invention, obtaining the model return estimation value output by the path planning model includes: calculating a vehicle speed reward value of the path planning model; calculating a vehicle position reward value of the path planning model; deriving a vehicle state transition reward value based on the vehicle speed reward value and the vehicle location reward value; and calculating to obtain the model return estimation value based on the vehicle state transition reward value.
In an embodiment of the present invention, calculating the vehicle speed reward value of the path planning model includes: obtaining a reward parameter G_speed from the actual speed V_real of the host vehicle and the desired speed V_exp of the host vehicle; and obtaining the vehicle speed reward value r_t,speed from the reward parameter G_speed. In an embodiment of the present invention, obtaining the reward parameter G_speed from the actual speed V_real and the desired speed V_exp includes:

[formula for G_speed given as an image in the original]

where |V_real − V_exp| denotes the absolute value of V_real − V_exp.

In an embodiment of the invention, obtaining the vehicle speed reward value r_t,speed from the reward parameter G_speed includes: when G_speed > 1 or G_speed = 1, r_t,speed = 0; when G_speed = 0, r_t,speed = 1; when 0 < G_speed < 1,

[formula for r_t,speed given as an image in the original]
In an embodiment of the present invention, the calculation of the desired speed V_exp of the host vehicle includes: when the host vehicle encounters a red-light road condition and the distance between the host vehicle and the red-light stop line is greater than L1, the desired speed of the host vehicle is

V_exp = V_exp,max;

when the distance between the host vehicle and the red-light stop line is less than or equal to L1, the desired speed of the host vehicle decelerates linearly according to

[formula given as an image in the original]

where LD is the distance from the host vehicle to the red-light stop line at the current time, L2 is the distance between the red-light stop line and the red light, and V_exp,max is the maximum desired speed.
In an embodiment of the present invention, the calculation of the desired speed V_exp of the host vehicle includes: when the host vehicle encounters an obstacle road condition, and the actual distance P between the host vehicle and the obstacle, the distance D2 between the obstacle stop line and the obstacle, and the distance D1 between the host vehicle and the obstacle stop line satisfy P > D1 + D2, the desired speed of the host vehicle is

V_exp = V_exp,max;

when P, D2 and D1 satisfy P ≤ D1 + D2, the desired speed of the host vehicle decelerates linearly according to

[formula given as an image in the original]

where V_exp,max is the maximum desired speed.
In an embodiment of the present invention, the calculation of the desired speed V_exp of the host vehicle includes: when the host vehicle encounters a green-light road condition, the desired speed of the host vehicle is

V_exp = V_exp,max,

where V_exp,max is the maximum desired speed.
In an embodiment of the present invention, calculating the vehicle position reward value of the path planning model includes: determining the vehicle position reward value from the distance S1 between the center point of the host vehicle and the center line of the lane, wherein:

when |S1| > 1 or |S1| = 1, r_t,position = −1;

when |S1| = 0, r_t,position = 0;

when 0 < |S1| < 1,

[formula given as an image in the original]

where |S1| denotes the absolute value of S1.
In an embodiment of the present invention, deriving a vehicle state transition reward value based on the vehicle speed reward value and the vehicle position reward value comprises: the vehicle state transition reward value is

r_t = r_t,speed + r_t,position,

where r_t,speed denotes the vehicle speed reward value and r_t,position denotes the vehicle position reward value.
In an embodiment of the present invention, calculating the model return estimate based on the vehicle state transition reward value includes: the model return estimate is

[formula given as an image in the original]

where ρ is an estimation coefficient, T is the total number of frames of environment state feature maps in the environment state feature map sequence, T corresponds to the time point at which path planning ends, and T is a positive integer.
In an embodiment of the present invention, the backbone neural network, the first feature vectorization module, the first fully connected network, and the second fully connected network are connected to form an Actor network; the Actor network is connected with the second feature vectorization module, the third fully connected network, and the fourth fully connected network to form a Critic network. The Actor network outputs the planned trajectory a_t of the host vehicle; the neural network weight parameters to be learned by the Actor network are θ_μ, and the Actor network is expressed in terms of its weight parameters as a_t = μ(s_t | θ_μ), where s_t denotes the environment and state data at the current time. The Critic network outputs the model return estimate Q_t; the neural network weight parameters to be learned by the Critic network comprise θ_μ of the first half of the network and θ_E of the second half of the network, and the Critic network is expressed in terms of its weight parameters as Q_t = Q(s_t, a_t | θ_μ, θ_E). The environment state feature map sequence comprises multiple frames of environment state feature maps.
In an embodiment of the present invention, the process of performing reinforcement learning on the neural network weight parameters θ_μ and θ_E of the Actor network and the Critic network includes: setting the replay buffer size RB used for reinforcement learning and the sample batch size N used during training, where RB and N are positive integers; initializing the Actor network μ(s_t | θ_μ) and the Critic network Q(s_t, a_t | θ_μ, θ_E) with respect to the neural network weight parameters θ_μ and θ_E; constructing a first target network μ′(s_t | θ_μ′) and a second target network Q′(s_t, a_t | θ_μ′, θ_E′) whose structures are identical to the Actor network μ(s_t | θ_μ) and the Critic network Q(s_t, a_t | θ_μ, θ_E); initializing the weight parameters θ_μ′ and θ_E′ of the first target network μ′(s_t | θ_μ′) and the second target network Q′(s_t, a_t | θ_μ′, θ_E′); setting the update period Num_update of the target network weight parameters; setting the initial value s_1 of the environment and state data and the initial value of the target network update counter Num_count; and, for environment and state data with a total of T frames, performing the learning step starting from s_1.
In an embodiment of the present invention, for environment and state data with a total of T frames, performing the learning step starting from s_1 includes: adding a perturbation to the output a_t of the current Actor network to obtain a_t,d as the motion trajectory indication of the current frame; based on the environment and state data s_t at the current time, executing a_t,d on the environment and state to obtain the environment and state data s_{t+1} after the vehicle state transition and the corresponding vehicle state transition reward value r_t; saving the sample vector (s_t, a_t,d, r_t, s_{t+1}) corresponding to the current vehicle state transition in the replay buffer; and randomly drawing N samples (s_i, a_i,d, r_i, s_{i+1}) (i = 1, 2, …, N, a_i,d ∈ a_t,d, r_i ∈ r_t) from the replay buffer and training the Actor network and the Critic network.
In an embodiment of the invention, the initialization comprises initializing the Actor network μ(s_t | θ_μ) and the Critic network Q(s_t, a_t | θ_μ, θ_E) with randomized values of the neural network weight parameters θ_μ and θ_E.
In an embodiment of the present invention, adding a perturbation to the output a_t of the current Actor network to obtain a_t,d as the motion trajectory indication of the current frame comprises:

a_t,d = μ(s_t | θ_μ) + σζ_t − βμ(s_t | θ_μ),

where ζ is a Gaussian random process, σ is a first perturbation parameter, and β is a second perturbation parameter.
In an embodiment of the invention, executing a_t,d on the environment and state based on the environment and state data s_t at the current time and obtaining the environment and state data s_{t+1} after the vehicle state transition is accomplished through the operation of the controller module and the environment and state data calculation module.
In one embodiment of the present invention, the controller module controls lateral and longitudinal movements of the host vehicle.
In an embodiment of the invention, randomly drawing N samples (s_i, a_i,d, r_i, s_{i+1}) (i = 1, 2, …, N, a_i,d ∈ a_t,d) from the replay buffer and training the Actor network and the Critic network comprises: calculating the target value of the model return estimate Q_t; calculating the average residual between the model return estimate Q_t of the current frame and the model return target value; selecting the manner of updating the weight parameters θ_μ and θ_E of the Critic network according to the sampling result of a Bernoulli distribution; updating the weight parameter θ_μ of the Actor network; updating the target network update counter Num_count; comparing the updated target network update counter Num_count with the update period Num_update to obtain a judgment result; and determining, according to the judgment result, whether to update the weight parameters θ_μ′ and θ_E′ of the target networks.
In an embodiment of the present invention, calculating the target value of the model return estimate Q_t includes: the target value of the model return estimate Q_t is

y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1} | θ_μ′) | θ_μ′, θ_E′),

where γ is the target value coefficient. Calculating the average residual between the model return estimate Q_t of the current frame and the model return target value includes: the average residual is

[formula given as an image in the original]

where y_i denotes the model return target value.
In an embodiment of the present invention, selecting the manner of updating the weight parameters θ_μ and θ_E of the Critic network according to the sampling result of a Bernoulli distribution includes: for each frame of the environment and state data, drawing one Bernoulli sample according to the Bernoulli distribution to obtain a sampling result;

if the sampling result is 1, updating the weight parameter θ_E of the Critic network according to

[update formulas given as images in the original]

while keeping the weight parameter θ_μ unchanged;

if the sampling result is 0, updating the weight parameters θ_μ and θ_E of the Critic network according to

[update formulas given as images in the original]

where ∂L/∂θ_E denotes the derivative of the function L with respect to θ_E, ∂L/∂θ_μ denotes the derivative of the function L with respect to θ_μ, and the probability that a Bernoulli sample in the Bernoulli distribution equals 1 is taken as k, with 0 < k < 1.
In an embodiment of the present invention, updating the weight parameter θ_μ of the Actor network comprises: updating the weight parameter θ_μ of the Actor network according to

[update formulas given as images in the original]

where J = Q(s, a | θ_μ, θ_E).
In an embodiment of the present invention, updating the target network update counter Num_count comprises: Num_count = Num_count + 1.
In an embodiment of the present invention, determining according to the judgment result whether to update the weight parameters θ_μ′ and θ_E′ of the target networks comprises: if the target network update counter Num_count is less than the update period Num_update, continuing the learning step; if the target network update counter Num_count equals the update period Num_update, updating the target network weight parameters θ_μ′ and θ_E′ according to

θ_E′ ← τθ_E + (1 − τ)θ_E′,
θ_μ′ ← τθ_μ + (1 − τ)θ_μ′,

and resetting the target network update counter Num_count to zero, where τ is the update coefficient of the target network weights.
In an embodiment of the present invention, the sequence of the environmental status characteristic maps of the host vehicle includes a plurality of frames of environmental status characteristic maps, and each of the environmental status characteristic maps is generated by the following steps: generating an environment static picture taking the self-vehicle as a picture center based on the map data; generating an environment dynamic picture taking the self-vehicle as a picture center based on the target detection tracking result; and generating the environment state characteristic graph according to the environment static picture and the environment dynamic picture.
In an embodiment of the present invention, generating the environment status feature map according to the environment still picture and the environment moving picture includes: taking the environment static picture as a base map; overlaying picture information contained in the environment dynamic picture on the base map; taking the self-vehicle central point of the current frame as a pixel central point on the environment state characteristic diagram; and setting the heading angle direction of the vehicle as the direction right above the environmental state characteristic diagram, and generating the environmental state characteristic diagram.
Compared with the prior art, the invention has the following advantages: according to the technical scheme, the path planning model for automatic driving path planning can be subjected to reinforcement learning and training through designing the reinforcement learning model of the shared network and the related algorithm for model training, so that the application requirement of automatic driving is better met.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principle of the invention. In the drawings:
fig. 1 is a flowchart of a method for planning a driving path of a vehicle according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a path planning model according to an embodiment of the present application.
Fig. 3 is a schematic training diagram of a path planning model according to an embodiment of the present application.
Fig. 4-6 are schematic diagrams illustrating calculation of a desired speed of a host vehicle according to some embodiments of the present disclosure.
FIG. 7 is a schematic illustration of the calculation of a vehicle location reward value in accordance with some embodiments of the present application.
Fig. 8 is a schematic system implementation environment diagram of a vehicle travel path planning apparatus according to an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described herein, and thus the present invention is not limited to the specific embodiments disclosed below.
As used herein, the singular forms "a," "an," and "the" also include the plural forms, unless the context clearly indicates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; these steps and elements do not form an exclusive list, and a method or apparatus may also include other steps or elements.
The relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present application unless specifically stated otherwise. Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description. Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate. In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
In the description of the present application, it is to be understood that the orientation or positional relationship indicated by the directional terms such as "front, rear, upper, lower, left, right", "lateral, vertical, horizontal" and "top, bottom", etc., are generally based on the orientation or positional relationship shown in the drawings, and are used for convenience of description and simplicity of description only, and in the case of not making a reverse description, these directional terms do not indicate and imply that the device or element being referred to must have a particular orientation or be constructed and operated in a particular orientation, and therefore, should not be considered as limiting the scope of the present application; the terms "inner and outer" refer to the inner and outer relative to the profile of the respective component itself.
Furthermore, it should be noted that the terms "first", "second", etc. are used to define the components or assemblies, and are only used for convenience to distinguish the corresponding components or assemblies, and the terms have no special meaning if not stated, and therefore, the scope of protection of the present application should not be construed as being limited. Further, although the terms used in the present application are selected from publicly known and used terms, some of the terms mentioned in the specification of the present application may be selected by the applicant at his or her discretion, the detailed meanings of which are described in relevant parts of the description herein. Further, it is required that the present application is understood not only by the actual terms used but also by the meaning of each term lying within.
Flow charts are used herein to illustrate operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in the exact order in which they are performed. Rather, various steps may be processed in reverse order or simultaneously. Meanwhile, other operations are added to or removed from these processes.
The embodiment of the invention describes a method, a device and a computer readable medium for planning a driving path of a vehicle.
Fig. 1 is a flowchart of a method for planning a driving path of a vehicle according to an embodiment of the present application.
As shown in fig. 1, the method for planning a driving path of a vehicle includes a step 101 of generating an environmental state feature map sequence of the vehicle based on map data and a target tracking result. And 102, acquiring the vehicle state information of the self vehicle. And 103, inputting the environment state feature diagram sequence of the self vehicle and the self vehicle state information into a path planning model as environment and state data. And 104, acquiring a planned track of the self vehicle output by the path planning model.
Specifically, in step 101, an environmental state feature map sequence of the own vehicle is generated based on the map data and the target tracking result.
In some embodiments, the sequence of environmental status feature maps of the host vehicle includes a plurality of frames of environmental status feature maps, each of the environmental status feature maps being generated by: step 1011, generating an environment static picture taking the self-vehicle as a picture center based on the map data; step 1012, generating an environment dynamic picture taking the self-vehicle as a picture center based on the target detection tracking result; and 1013, generating the environment state feature map according to the environment static picture and the environment dynamic picture.
In some embodiments, generating the environment state feature map from the environment static picture and the environment dynamic picture in step 1013 comprises: step 1101, taking the environment static picture as the base map; step 1102, overlaying the picture information contained in the environment dynamic picture on the base map; step 1103, taking the host vehicle center point of the current frame as the pixel center point of the environment state feature map; and step 1104, setting the heading angle direction of the host vehicle as the direction pointing to the top of the environment state feature map, and generating the environment state feature map.
In some embodiments, the method for planning the driving path of the vehicle further includes step 105 of obtaining a model return estimation value output by the path planning model, and evaluating the model based on the return estimation value.
Fig. 2 is a schematic structural diagram of a path planning model according to an embodiment of the present application.
Referring to fig. 2, in some embodiments, the path planning model 401 includes a trunk neural network 403, a feature vectorization first module 405, a first fully-connected network FC1, a second fully-connected network FC2, a feature vectorization second module 407, a third fully-connected network FC3, and a fourth fully-connected network FC4 connected in series.
In some embodiments, the sequence 421 of the feature map of the environmental state of the host vehicle is input to the backbone neural network 403, and the vehicle state information 423 of the host vehicle is input to the first feature vectorization module 405. The sequence 421 of the characteristic map of the environmental state of the own vehicle and the information 423 of the vehicle state of the own vehicle constitute the environmental and state data 411.
Fig. 3 is a schematic training diagram of a path planning model according to an embodiment of the present application.
Referring to fig. 3, in some embodiments, the environment and state data 411 is updated by the operation of the controller module 502 and the environment and state data calculation module 504.
In some embodiments, the environment and status data 411 may also be taken from other existing data sets. The controller module 502 and the environment and state data calculation module 504 are only used to better understand the training process of the path planning model in the method for planning a driving path of a vehicle according to the present application, and are not used to limit the structure of the path planning model according to the present application.
In some embodiments, the second fully-connected network FC2 inputs planned trajectory data FC2_ output of the own vehicle into the controller module 502, the controller module 502 controls the running of the own vehicle, and inputs first data generated by the running of the own vehicle and second data generated by observing the surroundings of the own vehicle into the environment and state data calculation module 504, and the environment and state data calculation module 504 updates the environment and state data 411 based on the first data and the second data. As described above, the environment and state data 411 includes the environment state feature map sequence 421 of the own vehicle and the vehicle state information 423 of the own vehicle.
In actual driving situations, the controller module 502 controls lateral and longitudinal movement of the vehicle, for example, by controlling components such as the throttle, brake, and steering wheel of the host vehicle.
In some embodiments, the fourth fully connected network FC4 outputs the model return estimate Q_t. The model return estimate Q_t is labeled 416 in fig. 3.
In some embodiments, the environment and state data calculation module 504 outputs the vehicle state transition reward value r_t of the path planning model. The vehicle state transition reward value r_t is labeled 510 in fig. 3.
In some embodiments, the vehicle state information includes speed, acceleration, heading angle, and heading angular velocity. In an actual driving situation, the raw data corresponding to the vehicle state information is acquired by, for example, a camera device, a laser radar, a millimeter wave radar, and the like mounted on the vehicle, and then the state information is obtained through data processing.
In some embodiments, obtaining the model return estimate output by the path planning model at step 105 includes: step 1051, calculating a vehicle speed reward value of the path planning model; step 1052, calculating a vehicle position reward value of the path planning model; step 1053, obtaining a vehicle state transition reward value based on the vehicle speed reward value and the vehicle position reward value; and 1054, calculating to obtain the model return estimation value based on the vehicle state transition reward value.
In some embodiments, calculating the vehicle speed reward value of the path planning model in step 1051 comprises: step 201, obtaining the reward parameter G_speed from the actual speed V_real of the host vehicle and the desired speed V_exp of the host vehicle; step 202, obtaining the vehicle speed reward value r_t,speed from the reward parameter G_speed.

In some embodiments, obtaining the reward parameter G_speed from the actual speed V_real and the desired speed V_exp in step 201 comprises:

[formula for G_speed given as an image in the original]

where |V_real − V_exp| denotes the absolute value of V_real − V_exp.

In some embodiments, obtaining the vehicle speed reward value r_t,speed from the reward parameter G_speed in step 202 comprises:

when G_speed > 1 or G_speed = 1, r_t,speed = 0;

when G_speed = 0, r_t,speed = 1;

when 0 < G_speed < 1,

[formula for r_t,speed given as an image in the original]
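The G_speed and r_t,speed expressions above appear only as images in the source. A minimal Python sketch of one plausible reading, assuming G_speed is the normalized speed error |V_real − V_exp| / V_exp,max and that the reward falls off linearly from 1 to 0 as G_speed grows from 0 to 1 (both assumptions, not the patent's exact formulas):

```python
def speed_reward(v_real: float, v_exp: float, v_exp_max: float) -> float:
    """Vehicle speed reward r_t,speed (illustrative reading only).

    Assumption: G_speed = |V_real - V_exp| / V_exp,max and the reward
    interpolates linearly on 0 < G_speed < 1; the patent gives both
    formulas only as images.
    """
    g_speed = abs(v_real - v_exp) / v_exp_max
    if g_speed >= 1.0:       # large deviation from the desired speed
        return 0.0
    if g_speed == 0.0:       # exactly at the desired speed
        return 1.0
    return 1.0 - g_speed     # assumed linear interpolation in between
```

For example, `speed_reward(8.0, 10.0, 16.7)` returns roughly 0.88 under these assumptions.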
Fig. 4, 5 and 6 are schematic diagrams illustrating calculation of a desired speed of a host vehicle according to some embodiments of the present application.
In some embodiments, the calculation of the desired speed V_exp of the host vehicle includes:

Case one, referring to fig. 4: when the host vehicle encounters a red traffic light,

when the distance between the host vehicle and the red-light stop line is greater than L1, the desired speed of the host vehicle is

V_exp = V_exp,max;

when the distance between the host vehicle and the red-light stop line is less than or equal to L1, the desired speed of the host vehicle decelerates linearly according to

[formula given as an image in the original]

where LD is the distance from the host vehicle to the red-light stop line at the current time, L2 is the distance between the red-light stop line and the red light, and V_exp,max is the maximum desired speed.

The values of L1 and L2 can be set as desired, for example L1 = 60 meters and L2 = 10 meters.
Case two, referring to fig. 5: when the host vehicle encounters an obstacle,

when the actual distance P between the host vehicle and the obstacle, the distance D2 between the obstacle stop line and the obstacle, and the distance D1 between the host vehicle and the obstacle stop line satisfy P > D1 + D2, the desired speed of the host vehicle is

V_exp = V_exp,max;

when P, D2 and D1 satisfy P ≤ D1 + D2, the desired speed of the host vehicle decelerates linearly according to

[formula given as an image in the original]

where V_exp,max is the maximum desired speed.

The values of D1 and D2 can be set as desired, for example D1 = 60 meters and D2 = 2 meters. An obstacle may be, for example, a leading vehicle or a road barrier.
Case three, referring to fig. 6: when the host vehicle encounters a green-light road condition, the desired speed of the host vehicle is

V_exp = V_exp,max,

where V_exp,max is the maximum desired speed.
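The case structure above (full speed beyond a threshold distance, linear deceleration inside it) is stated in the text, but the exact linear-deceleration expressions are given only as images. A compact sketch under simplified assumptions, where the ramp is taken proportional to the remaining distance to the stop point:

```python
def desired_speed_red_light(ld: float, l1: float, v_exp_max: float) -> float:
    """Case one: red light. ld = current distance to the red-light stop line.

    Assumption: the speed ramps from V_exp,max at ld == L1 down to 0 at the
    stop line; the patent's formula (an image) also involves L2, which this
    simplified ramp ignores.
    """
    if ld > l1:
        return v_exp_max
    return v_exp_max * max(ld, 0.0) / l1


def desired_speed_obstacle(p: float, d1: float, d2: float, v_exp_max: float) -> float:
    """Case two: obstacle. p = actual distance from the host vehicle to the obstacle.

    Assumption: the speed ramps from V_exp,max at p == D1 + D2 down to 0 at the
    obstacle stop line (p == D2).
    """
    if p > d1 + d2:
        return v_exp_max
    return v_exp_max * max(p - d2, 0.0) / d1


def desired_speed_green_light(v_exp_max: float) -> float:
    """Case three: green light, drive at the maximum desired speed."""
    return v_exp_max
```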
FIG. 7 is a schematic illustration of the calculation of a vehicle location reward value in accordance with some embodiments of the present application.
Referring to fig. 7, in some embodiments, the calculating of the vehicle location reward value of the path planning model of step 1052 comprises:
determining the vehicle position reward value from the distance S1 between the center point of the host vehicle and the center line of the lane, wherein:

when |S1| > 1 or |S1| = 1, r_t,position = −1;

when |S1| = 0, r_t,position = 0;

when 0 < |S1| < 1,

[formula given as an image in the original]

where |S1| denotes the absolute value of S1.
In fig. 7, S2 represents the self vehicle width, and S3 represents half the road width. The values of S2 and S3 can be set according to the needs of model training and the actual automatic driving situation, for example, S2 is 1.8 meters, and S3 is 2 meters. The dotted box 702 of fig. 7 includes a legend.
The rectangle illustrating the host vehicle illustrated in fig. 7 may correspond to a rectangular box area in the environmental status feature map that indicates the host vehicle. The center point of the bicycle is indicated by the center point of the rectangle.
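As with the speed reward, the expression for 0 < |S1| < 1 is given only as an image; the sketch below assumes a penalty that grows linearly with the offset |S1| from the lane center line, which matches the stated endpoints (0 at the center line, −1 at |S1| = 1) but is otherwise an illustrative assumption:

```python
def position_reward(s1: float) -> float:
    """Vehicle position reward r_t,position from the offset S1 between the
    host-vehicle center point and the lane center line.

    Assumption: the reward decreases linearly from 0 at the center line to -1
    at |S1| = 1; the patent's exact expression is given only as an image.
    """
    off = abs(s1)
    if off >= 1.0:
        return -1.0
    return -off  # 0 at the center line, approaching -1 near the lane edge
```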
In some embodiments, deriving a vehicle state transition reward value based on the vehicle speed reward value and the vehicle location reward value of step 1053 comprises:
the vehicle state transition reward value is

r_t = r_t,speed + r_t,position,

where r_t,speed denotes the vehicle speed reward value and r_t,position denotes the vehicle position reward value.
In some embodiments, calculating the model reward estimate based on the vehicle state transition reward value of step 1054 includes:
the model return estimate is

[formula given as an image in the original]

where ρ is an estimation coefficient, T is the total number of frames of environment state feature maps in the environment state feature map sequence, T corresponds to the time point at which path planning ends, and T is a positive integer.
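The return formula itself is an image in the source; given that ρ is described as an estimation coefficient and the sum runs over the T frames of the episode, one natural reading is a ρ-discounted sum of the per-frame state transition rewards. This is an assumption shown only for illustration:

```python
def model_return_estimate(rewards: list[float], rho: float) -> float:
    """Assumed reading of the model return: sum over t = 1..T of rho**(t-1) * r_t."""
    return sum((rho ** t) * r for t, r in enumerate(rewards))
```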
In order to explain the training process of the path planning model of the present application, the composition of the path planning model is further explained.
In some embodiments, the backbone neural network, the feature vectorization first module, the first fully-connected network and the second fully-connected network in the path planning model are connected to form an Actor network;
And the Actor network is connected with the feature vectorization second module, the third full-connection network and the fourth full-connection network to form a Critic network.
Wherein the Actor network outputs a planned trajectory a of the host vehicletThe neural network weight parameter needing to be learned by the Actor network is thetaμThe Actor network is expressed as a weight parameter in the form of at=μ(stu),stRepresenting the environment and state data at the current time;
the Critic network output model return estimation value QtThe neural network weight parameters needing to be learned by the Critic network comprise theta of the first half networkμAnd theta of the latter half networkEThe Critic network is expressed as a weight parameterIs of the form Qt=Q(st,atμ,θE) (ii) a The environment state characteristic diagram sequence comprises a multi-frame environment state characteristic diagram.
Next, a method for training the path planning model of the present application will be explained. The process of training the path planning model is also the process of implementing reinforcement learning.
In some embodiments, the process of performing reinforcement learning on the neural network weight parameters θ_μ and θ_E of the Actor network and the Critic network includes: step 301, setting the replay buffer size RB used for reinforcement learning and the sample batch size N used during training, where RB and N are positive integers; step 302, initializing the Actor network μ(s_t | θ_μ) and the Critic network Q(s_t, a_t | θ_μ, θ_E) with respect to the neural network weight parameters θ_μ and θ_E; step 303, constructing a first target network μ′(s_t | θ_μ′) and a second target network Q′(s_t, a_t | θ_μ′, θ_E′) whose structures are identical to the Actor network μ(s_t | θ_μ) and the Critic network Q(s_t, a_t | θ_μ, θ_E); step 304, initializing the weight parameters θ_μ′ and θ_E′ of the first target network μ′(s_t | θ_μ′) and the second target network Q′(s_t, a_t | θ_μ′, θ_E′); step 305, setting the update period Num_update of the target network weight parameters; step 306, setting the initial value s_1 of the environment and state data and the initial value of the target network update counter Num_count; step 307, for environment and state data with a total of T frames, performing the learning step starting from s_1.
In some embodiments, for environment and state data with a total of T frames, performing the learning step starting from s_1 in step 307 includes: step 3071, adding a perturbation to the output a_t of the current Actor network to obtain a_t,d as the motion trajectory indication of the current frame; step 3072, based on the environment and state data s_t at the current time, executing a_t,d on the environment and state to obtain the environment and state data s_{t+1} after the vehicle state transition and the corresponding vehicle state transition reward value r_t; step 3073, saving the sample vector (s_t, a_t,d, r_t, s_{t+1}) corresponding to the current vehicle state transition in the replay buffer; step 3074, randomly drawing N samples (s_i, a_i,d, r_i, s_{i+1}) (i = 1, 2, …, N, a_i,d ∈ a_t,d, r_i ∈ r_t) from the replay buffer and training the Actor network and the Critic network.
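A condensed sketch of steps 3071-3073, assuming an Actor callable μ(s | θ_μ) and an environment wrapper standing in for the controller module and the environment-and-state-data calculation module; names such as `actor`, `env_step` and `ReplayBuffer` are illustrative placeholders, not the patent's code:

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size replay buffer holding (s_t, a_t_d, r_t, s_t1) transitions."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, n: int):
        return random.sample(self.buffer, n)


def perturbed_action(actor_output: np.ndarray, sigma: float, beta: float) -> np.ndarray:
    """Step 3071: a_t,d = mu(s_t) + sigma * zeta_t - beta * mu(s_t), zeta_t ~ N(0, 1)."""
    zeta = np.random.randn(*actor_output.shape)
    return actor_output + sigma * zeta - beta * actor_output


def collect_episode(actor, env_step, s1, buffer: ReplayBuffer, total_frames: int,
                    sigma: float = 1.2, beta: float = 0.15):
    """Steps 3071-3073: roll the perturbed policy through T frames and store transitions.

    `actor(s)` returns the planned trajectory; `env_step(s, a)` stands in for the
    controller + environment/state-data calculation modules and returns (s_next, r).
    """
    s = s1
    for _ in range(total_frames):
        a_d = perturbed_action(actor(s), sigma, beta)
        s_next, r = env_step(s, a_d)
        buffer.add((s, a_d, r, s_next))
        s = s_next
```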
In some embodiments, the initialization in step 302 includes initializing the Actor network μ(s_t | θ_μ) and the Critic network Q(s_t, a_t | θ_μ, θ_E) with randomized values of the neural network weight parameters θ_μ and θ_E. The randomization scheme can be chosen as desired.
In some embodiments, adding a perturbation to the output a_t of the current Actor network in step 3071 to obtain a_t,d as the motion trajectory indication of the current frame comprises:

a_t,d = μ(s_t | θ_μ) + σζ_t − βμ(s_t | θ_μ),

where ζ is a Gaussian random process, σ is a first perturbation parameter, and β is a second perturbation parameter. The perturbation is added during model training; it is not added once training is finished and the model is used. σ is, for example, 1.2 and β is, for example, 0.15.
In some embodiments, executing a_t,d on the environment and state based on the environment and state data s_t at the current time in step 3072 and obtaining the environment and state data s_{t+1} after the vehicle state transition is accomplished by operation of the controller module 502 and the environment and state data calculation module 504.
In some embodiments, randomly drawing N samples (s_i, a_i,d, r_i, s_{i+1}) (i = 1, 2, …, N, a_i,d ∈ a_t,d) from the replay buffer in step 3074 and training the Actor network and the Critic network comprises: step 401, calculating the target value of the model return estimate Q_t; step 402, calculating the average residual between the model return estimate Q_t of the current frame and the model return target value; step 403, selecting the manner of updating the weight parameters θ_μ and θ_E of the Critic network according to the sampling result of a Bernoulli distribution; step 404, updating the weight parameter θ_μ of the Actor network; step 405, updating the target network update counter Num_count; step 406, comparing the updated target network update counter Num_count with the update period Num_update to obtain a judgment result; step 407, determining, according to the judgment result, whether to update the weight parameters θ_μ′ and θ_E′ of the target networks.
In some embodiments, calculating the target value of the model return estimate Q_t in step 401 includes: the target value of the model return estimate Q_t is

y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1} | θ_μ′) | θ_μ′, θ_E′),

where γ is the target value coefficient. γ is, for example, 0.9.

In some embodiments, calculating the average residual between the model return estimate Q_t of the current frame and the model return target value in step 402 includes: the average residual is

[formula given as an image in the original]

where y_i denotes the model return target value.
In some embodiments, selecting the manner of updating the weight parameters θ_μ and θ_E of the Critic network according to the sampling result of a Bernoulli distribution in step 403 comprises:

for each frame of the environment and state data, drawing one Bernoulli sample according to the Bernoulli distribution to obtain a sampling result;

if the sampling result is 1, updating the weight parameter θ_E of the Critic network according to

[update formulas given as images in the original]

while keeping the weight parameter θ_μ unchanged;

if the sampling result is 0, updating the weight parameters θ_μ and θ_E of the Critic network according to

[update formulas given as images in the original]

where ∂L/∂θ_E denotes the derivative of the function L with respect to θ_E, ∂L/∂θ_μ denotes the derivative of the function L with respect to θ_μ, and the probability that a Bernoulli sample in the Bernoulli distribution equals 1 is taken as k, with 0 < k < 1. k is, for example, 0.55.
In some embodiments, updating the weight parameter θ_μ of the Actor network in step 404 comprises: updating the weight parameter θ_μ of the Actor network according to

[update formulas given as images in the original]

where J = Q(s, a | θ_μ, θ_E).
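A PyTorch-style sketch of steps 401-404. The exact update formulas appear only as images in the source, so this assumes the usual actor-critic reading: the critic loss L is the mean squared residual between Q_t and the target y_i, and J = Q(s, μ(s)) is ascended for the actor, with the Bernoulli sample gating whether only θ_E or both θ_μ and θ_E receive the critic gradient. `actor`, `critic`, `target_actor`, `target_critic`, `opt_theta_e` and `opt_theta_mu` are placeholder modules/optimizers whose parameters are assumed to be split into the shared part (θ_μ) and the critic-only part (θ_E):

```python
import torch


def update_step(batch, actor, critic, target_actor, target_critic,
                opt_theta_e, opt_theta_mu, gamma: float = 0.9, k: float = 0.55):
    """One training update (steps 401-404), under the assumptions stated above."""
    s, a, r, s_next = batch  # tensors: states, perturbed actions, rewards, next states

    # Step 401: target value y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})).
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))

    # Step 402: average residual between Q_t and the target value (assumed MSE form of L).
    critic_loss = torch.mean((y - critic(s, a)) ** 2)

    # Step 403: Bernoulli gate - with probability k update only theta_E,
    # otherwise update both theta_mu and theta_E with the critic gradient.
    opt_theta_e.zero_grad()
    opt_theta_mu.zero_grad()
    critic_loss.backward()
    if torch.bernoulli(torch.tensor(k)).item() == 1:
        opt_theta_e.step()                      # theta_mu kept unchanged
    else:
        opt_theta_e.step()
        opt_theta_mu.step()

    # Step 404: actor update ascending J = Q(s, mu(s)); only theta_mu is adjusted here.
    actor_loss = -critic(s, actor(s)).mean()
    opt_theta_mu.zero_grad()
    actor_loss.backward()
    opt_theta_mu.step()
```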
In some embodiments, updating the target network update counter Num_count in step 405 comprises: Num_count = Num_count + 1.
In some embodiments, determining according to the judgment result in step 407 whether to update the weight parameters θ_μ′ and θ_E′ of the target networks comprises:

if the target network update counter Num_count is less than the update period Num_update, continuing the learning step;

if the target network update counter Num_count equals the update period Num_update, updating the target network weight parameters θ_μ′ and θ_E′ according to

θ_E′ ← τθ_E + (1 − τ)θ_E′,
θ_μ′ ← τθ_μ + (1 − τ)θ_μ′,

and resetting the target network update counter Num_count to zero, where τ is the update coefficient of the target network weights. τ is, for example, 0.1.
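The corresponding soft update of the target networks (step 407), continuing the PyTorch-style sketch above; `tau` plays the role of τ and the function is applied to both target networks once every Num_update steps:

```python
import torch


@torch.no_grad()
def soft_update(target_net, online_net, tau: float = 0.1):
    """theta' <- tau * theta + (1 - tau) * theta', per parameter tensor."""
    for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)
```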
As mentioned above, the environmental status characteristic diagram sequence of the own vehicle comprises a multi-frame environmental status characteristic diagram. In some embodiments, the environmental state feature map sequence is formed by the environmental state feature map of the current frame and several consecutive frames before the current frame. And taking the vehicle state information of the current frame as the vehicle state information of the own vehicle.
In some embodiments, the step 1011 of generating the image of the environment with the own vehicle as the center of the image based on the map data includes the step 501 of setting processing parameters of the image. And 502, acquiring local map information with the radius of R based on the coordinate position of the current central point of the self-vehicle. Step 503, performing coordinate transformation on the road center line and the road boundary line in the local map information. Step 504, determine the RGB values of the pixels of the environmental still picture. And 505, generating the environment static picture based on the RGB values of the pixel points.
In some embodiments, the processing parameters for the picture include an initial resolution, a final resolution, and a scale ratio of picture pixels to the actual perceptual environment. The radius R can be set according to the actual situation, for example, R is 100 meters.
In some embodiments, the coordinate transformation of the road centerline and the road boundary line in step 503 includes, in step 5031, taking a picture with all black pixels in RGB color representation as a base picture of the environmental still picture. Step 5032, placing the center point of the vehicle at the center of the base map, and setting the heading angle direction of the vehicle to be right above the base map. Step 5033, converting the coordinates of the road center line and the road boundary line from absolute coordinates in a world coordinate system to relative coordinates in a cartesian coordinate system with the self-vehicle as an origin and the heading angle direction of the self-vehicle as the positive direction of the y axis. Step 5034, converting the relative coordinates of the road center line and the road boundary line into pixel coordinates which are set to be right above the environment static picture by taking the vehicle center point as a pixel center point on the environment static picture and the heading angle direction of the vehicle.
In some embodiments, in step 5033, the specific transformation formulas for converting the coordinates of the road center line and the road boundary line from absolute coordinates in the world coordinate system to relative coordinates in a Cartesian coordinate system with the host vehicle as the origin and the heading angle direction of the host vehicle as the positive y-axis direction are:

x2 = (x − x_center)·cosθ + (y − y_center)·sinθ
y2 = (y − y_center)·cosθ − (x − x_center)·sinθ.

In step 5034, the specific conversion formulas for converting the relative coordinates of the road center line and the road boundary line into pixel coordinates, with the host vehicle center point as the pixel center point of the environment static picture and the heading angle direction of the host vehicle pointing to the top of the picture, are:

u = u_image_center + (x2 / scale)
v = v_image_center + (y2 / scale).

Combining step 5033 with step 5034, the conversion formulas from absolute coordinates to pixel coordinates are:

u = u_image_center + (((x − x_center)·cosθ + (y − y_center)·sinθ) / scale)
v = v_image_center + (((y − y_center)·cosθ − (x − x_center)·sinθ) / scale)

where x and y are the abscissa and ordinate of the absolute coordinates in the world coordinate system, u and v are the pixel coordinates, x_center and y_center are the absolute coordinates of the host vehicle center point, u_image_center and v_image_center are the pixel coordinates of the center point of the environment picture corresponding to the host vehicle center point, θ is the heading angle of the host vehicle, and scale is the scale ratio of picture pixels to the actual perceived environment.
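The world-to-pixel mapping of steps 5033 and 5034 can be written directly from the formulas above; here θ is assumed to be in radians and `scale` in meters per pixel:

```python
import math


def world_to_pixel(x: float, y: float,
                   x_center: float, y_center: float, theta: float,
                   u_image_center: float, v_image_center: float,
                   scale: float) -> tuple[float, float]:
    """Map absolute world coordinates (x, y) to pixel coordinates (u, v) in the
    ego-centered, heading-up environment picture (steps 5033 + 5034)."""
    # Step 5033: rotate/translate into the vehicle-centered Cartesian frame.
    x2 = (x - x_center) * math.cos(theta) + (y - y_center) * math.sin(theta)
    y2 = (y - y_center) * math.cos(theta) - (x - x_center) * math.sin(theta)
    # Step 5034: scale into pixel coordinates around the image center.
    u = u_image_center + x2 / scale
    v = v_image_center + y2 / scale
    return u, v
```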
In some embodiments, the determining RGB values of the pixels of the environmental still image in step 504 includes marking pixels in a polygonal area surrounded by the border line of the road as pure white pixels in an RGB color representation manner, where the polygonal area corresponds to a drivable area of the host vehicle. Then, for a point in the center line of the road, determining the RGB value of the point according to the deviation angle of the heading angle of the point and the heading angle of the own vehicle.
In some embodiments, determining the RGB value of a point in the road center line according to the deviation between the heading angle of the point and the heading angle of the host vehicle comprises:

determining the value of the V component of the point in the HSV color representation through

[formula given as an image in the original]

where π is the circular constant, [symbol given as an image in the original] is the heading angle of the point in the road center line, θ is the heading angle of the host vehicle, and V is the V component when the point's pixel is described in HSV; H is taken as 240 degrees and S is taken as 1;

after the value of the pixel point in the HSV color representation is obtained, the HSV value is converted into the corresponding RGB value.
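The V-component formula itself is an image in the source, but the HSV-to-RGB step (H = 240°, S = 1, V derived from the heading deviation) can be sketched with the standard library. The mapping V = 1 − |Δθ|/π used below is an assumption for illustration only:

```python
import colorsys
import math


def centerline_rgb(point_heading: float, ego_heading: float) -> tuple[int, int, int]:
    """RGB value for a road-center-line point, encoding its heading deviation
    from the host vehicle in the V channel (H = 240 deg, S = 1).

    Assumption: V = 1 - |delta| / pi with delta wrapped to [-pi, pi]; the
    patent's exact V formula is given only as an image.
    """
    delta = (point_heading - ego_heading + math.pi) % (2 * math.pi) - math.pi
    v = 1.0 - abs(delta) / math.pi
    r, g, b = colorsys.hsv_to_rgb(240.0 / 360.0, 1.0, v)
    return int(r * 255), int(g * 255), int(b * 255)
```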
In some embodiments, the generating the environmental still picture of step 505 comprises: and on the base map of the environment static picture, generating the environment static picture comprising the road center line and the drivable area around the self vehicle based on the drivable area surrounded by the road boundary line, the pixel point coordinates of the road center line and the pixel point RGB values.
In some embodiments, the generating the environmental dynamic picture centering on the self-vehicle based on the target detection and tracking result in step 1012 includes, in step 601, acquiring absolute coordinates of boundary points of the target object of which the target category is the vehicle. Step 602, performing coordinate transformation on the absolute coordinates of the target object. Step 603, determining the RGB values of the pixels of the target object in the environment dynamic picture. Step 604, generating the environment dynamic picture based on the RGB values of the pixels of the environment dynamic picture.
In some embodiments, the process of coordinate transformation in step 602 is, for example, similar to the process of coordinate transformation set forth in steps 5033 and 5034, and will not be described in detail here.
In some embodiments, determining the RGB values of the target object at the pixel points of the environment dynamic picture in step 603 comprises:
determining the value of the V component, in the HSV color representation mode, of the pixel points in the rectangular area corresponding to the target object by the formula shown in formula image BDA0003208634580000211,
wherein N_frames is the total number of the consecutive frames and N_position is the sequence number, among the consecutive frames, of the frame in which the rectangular area is located;
taking H as 0 degree for the self vehicle; taking H as 60 degrees for non-self vehicles; and taking S as 1;
and then, converting the value of the HSV color representation mode into the value of the corresponding RGB color representation mode.
The environment state picture is then generated from the environment static picture and the environment dynamic picture.
In some embodiments, generating the environment state picture further comprises performing resolution cropping on the environment state picture. For example, the environment state picture is cropped from an initial resolution to a final resolution.
In some embodiments, the input and output dimensions of the first, second, third, and fourth fully connected networks may be set according to model training needs and the actual situation of the autonomous driving. For example, if the path planning model outputs a vehicle trajectory for 5 seconds in the future (5 seconds after the current time), and outputs two-dimensional cartesian coordinates of a future position of the vehicle every half second, the Actor network outputs 20 values, and each value is a continuously variable value.
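In that example the 20 Actor output values correspond to 10 future positions (one every half second over 5 seconds) with two Cartesian coordinates each; a minimal reshape, shown here only for illustration, recovers the waypoint list:

```python
import numpy as np

actor_output = np.zeros(20)              # placeholder for the 20 continuous Actor outputs
waypoints = actor_output.reshape(10, 2)  # row t -> (x, y) of the ego position at 0.5*(t+1) seconds
```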
The feature vectorization first module expands the output of the backbone neural network into a one-dimensional vector, combines this vector with the vehicle state information of the host vehicle, and inputs the combined vector into the first fully-connected network FC1.
In some embodiments, the first fully-connected network FC1 has input and output dimensions of 2052 and 256, respectively, and the second fully-connected network FC2 has input and output dimensions of 256 and 20, respectively.
In the Critic network, the feature vectorization second module performs feature vectorization processing on the output of the Actor network (i.e., the output FC2_output of FC2) together with the result of the feature vectorization first module, and inputs the result into the third fully-connected network FC3. The output of the Critic network (i.e., the output of FC4) is the return estimate used for reinforcement learning of the model.
In some embodiments, the input and output dimensions of the third fully-connected network FC3 are 2072 and 256, respectively, and the input and output dimensions of the fourth fully-connected network FC4 are 256 and 1, respectively.
The front part of the Critic network is the Actor network, and the output of the Actor network (i.e., the output FC2_output of FC2) also serves as the input to the middle part of the Critic network. This Actor-Critic architecture may be referred to as a shared network.
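A minimal PyTorch sketch of this shared Actor-Critic head is given below; the 2048-dimensional flattened backbone feature and the ReLU activations are assumptions chosen so that the dimensions match the values stated above (2048 + 4 vehicle-state values = 2052, and 2052 + 20 = 2072).

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Sketch of the shared Actor-Critic head. The FC dimensions come from the
    text above; the 2048-dim flattened backbone feature and the ReLU
    activations are assumptions of this sketch."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone         # maps the feature-map sequence to a feature tensor
        self.fc1 = nn.Linear(2052, 256)  # FC1: vectorized features -> hidden
        self.fc2 = nn.Linear(256, 20)    # FC2: hidden -> 20 trajectory values (Actor output)
        self.fc3 = nn.Linear(2072, 256)  # FC3: [features, Actor output] -> hidden
        self.fc4 = nn.Linear(256, 1)     # FC4: hidden -> return estimate (Critic output)

    def forward(self, state_maps, vehicle_state):
        feat = torch.flatten(self.backbone(state_maps), start_dim=1)  # (B, 2048) assumed
        vec1 = torch.cat([feat, vehicle_state], dim=1)                # (B, 2052)
        action = self.fc2(torch.relu(self.fc1(vec1)))                 # planned trajectory, (B, 20)
        vec2 = torch.cat([vec1, action], dim=1)                       # (B, 2072)
        q_value = self.fc4(torch.relu(self.fc3(vec2)))                # return estimate, (B, 1)
        return action, q_value
```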
According to the method for planning the driving path of a vehicle of the present application, by designing a reinforcement learning model with a shared network and the associated model-training algorithm, reinforcement learning and training can be performed on the path planning model used for planning the automatic driving path, so that the application requirements of automatic driving can be better met.
Meanwhile, because both the input and the output of the model are expressed as intermediate quantities, the model can undergo reinforcement learning based on a simulation environment.
The present application further provides a driving path planning device for a vehicle, including: a memory for storing instructions executable by the processor; and a processor for executing the instructions to implement the method as previously described.
Fig. 8 is a schematic diagram of a system implementation environment of a driving path planning apparatus for a vehicle according to an embodiment of the present application. The driving path planning apparatus 800 of the vehicle may include an internal communication bus 801, a Processor (Processor)802, a Read Only Memory (ROM)803, a Random Access Memory (RAM)804, and a communication port 805. The vehicle driving path planning apparatus 800 is connected to a network through a communication port and may be connected to a server side, which may provide a strong data processing capability. The internal communication bus 801 may enable data communication between components of the driving path planning apparatus 800 of the vehicle, such as a CAN bus. The processor 802 may make the determination and issue the prompt. In some embodiments, the processor 802 may be comprised of one or more processors. The communication port 805 may enable sending and receiving information and data from a network. The vehicle's travel path planning apparatus 800 may also include various forms of program storage units and data storage units, such as a Read Only Memory (ROM)803 and a Random Access Memory (RAM)804, capable of storing various data files for computer processing and/or communication use, as well as possibly program instructions for execution by the processor 802. The processor executes these instructions to implement the main parts of the method. The results processed by the processor may be communicated to the user device via the communication port and displayed on a user interface, such as an interactive interface of the in-vehicle system.
The vehicle travel path planning apparatus 800 may be implemented as a computer program, stored in a memory, and executed by a processor 802 to implement the vehicle travel path planning method of the present application.
The present application also provides a computer readable medium having stored thereon computer program code which, when executed by a processor, implements a method of driving path planning for a vehicle as described above.
Aspects of the present application may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." The processor may be one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, or a combination thereof. Furthermore, aspects of the present application may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media. For example, computer-readable media may include, but are not limited to, magnetic storage devices (e.g., hard disk, floppy disk, magnetic tape…), optical disks (e.g., Compact Disk (CD), Digital Versatile Disk (DVD)…), smart cards, and flash memory devices (e.g., card, stick, key drive…).
The computer readable medium may comprise a propagated data signal with the computer program code embodied therein, for example, on a baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, and the like, or any suitable combination. The computer readable medium can be any computer readable medium that can communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device. Program code on a computer readable medium may be propagated over any suitable medium, including radio, electrical cable, fiber optic cable, radio frequency signals, or the like, or any combination of the preceding.
Similarly, it should be noted that in the preceding description of embodiments of the application, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to imply that more features are required than are expressly recited in the claims. Indeed, an embodiment may have fewer than all of the features of a single embodiment disclosed above.
Although the present application has been described with reference to the present specific embodiments, it will be recognized by those skilled in the art that the foregoing embodiments are merely illustrative of the present application and that various changes and substitutions of equivalents may be made without departing from the spirit of the application, and therefore, it is intended that all changes and modifications to the above-described embodiments that come within the spirit of the application fall within the scope of the claims of the application.

Claims (32)

1. A method for planning a driving path of a vehicle comprises the following steps:
generating an environment state characteristic diagram sequence of the self-vehicle based on the map data and the target tracking result;
acquiring vehicle state information of the self vehicle;
taking the environment state characteristic diagram sequence of the self vehicle and the self vehicle state information as environment and state data, and inputting the environment and state data into a path planning model;
and acquiring a planned track of the self-vehicle output by the path planning model.
2. The method for planning a travel path of a vehicle according to claim 1, characterized by further comprising:
and obtaining a model return estimation value output by the path planning model, and evaluating the model based on the return estimation value.
3. The method for planning a driving path of a vehicle according to claim 1, wherein the path planning model comprises a trunk neural network, a first feature vectorization module, a first fully-connected network, a second fully-connected network, a second feature vectorization module, a third fully-connected network, and a fourth fully-connected network, which are connected in sequence;
the environment state feature diagram sequence of the host vehicle is input into the trunk neural network, and the vehicle state information is input into the first feature vectorization module.
4. The method for planning a driving path of a vehicle according to claim 3, wherein the environment and state data is updated by operation of a controller module and an environment and state data calculation module;
wherein the second fully-connected network inputs planned trajectory data of the self-vehicle into the controller module, the controller module controls the self-vehicle to travel, and inputs first data generated by the self-vehicle to travel and second data generated by observing the periphery of the self-vehicle into the environment and state data calculation module, and the environment and state data calculation module updates the environment and state data based on the first data and the second data; the fourth fully-connected network outputs a model-reported estimate.
5. The method according to claim 4, wherein the environment and state data calculation module outputs a vehicle state transition reward value of the path planning model.
6. The method of claim 1, wherein the vehicle state information comprises speed, acceleration, heading angle, and heading angular velocity.
7. The method according to claim 2, wherein obtaining the model reward estimate output by the path planning model comprises:
calculating a vehicle speed reward value of the path planning model;
calculating a vehicle position reward value of the path planning model;
deriving a vehicle state transition reward value based on the vehicle speed reward value and the vehicle location reward value;
and calculating to obtain the model return estimation value based on the vehicle state transition reward value.
8. The method of claim 7, wherein calculating a vehicle speed reward value for the path planning model comprises:
according to the actual speed V of the bicyclerealAnd the desired speed V of the vehicleexpReceive the reward parameter G speed
According to the reward parameter GspeedObtaining the vehicle speed reward value rt,speed
9. The method for planning a driving path of a vehicle according to claim 8, wherein obtaining the reward parameter G_speed according to the actual speed V_real of the host vehicle and the desired speed V_exp of the host vehicle comprises:
Figure FDA0003208634570000021
wherein |V_real - V_exp| represents taking the absolute value of V_real - V_exp.
10. The method for planning a driving path of a vehicle according to claim 8, wherein obtaining the vehicle speed reward value r_t,speed according to the reward parameter G_speed comprises:
when G_speed > 1 or G_speed = 1, r_t,speed = 0;
when G_speed = 0, r_t,speed = 1;
when 0 < G_speed < 1,
Figure FDA0003208634570000022
11. The method for planning a driving path of a vehicle according to claim 8, wherein the calculation of the desired speed V_exp of the host vehicle comprises:
when the host vehicle meets a red-light road condition:
when the distance between the host vehicle and the red-light stop line is greater than L1, the desired speed of the host vehicle is
V_exp = V_exp,max;
when the distance between the host vehicle and the red-light stop line is less than or equal to L1, the desired speed of the host vehicle is linearly decelerated according to
Figure FDA0003208634570000031
wherein LD is the distance between the host vehicle and the red-light stop line at the current time, L2 is the distance between the red-light stop line and the red light, and V_exp,max is the maximum desired speed.
12. The method for planning a driving path of a vehicle according to claim 8, wherein the calculation of the desired speed V_exp of the host vehicle comprises:
when the host vehicle meets an obstacle road condition:
when the actual distance P between the host vehicle and the obstacle, the distance D2 between the obstacle stop line and the obstacle, and the distance D1 between the host vehicle and the obstacle stop line satisfy P > D1 + D2, the desired speed of the host vehicle is
V_exp = V_exp,max;
when the actual distance P between the host vehicle and the obstacle, the distance D2 between the obstacle stop line and the obstacle, and the distance D1 between the host vehicle and the obstacle stop line satisfy P ≤ D1 + D2, the desired speed of the host vehicle is linearly decelerated according to
Figure FDA0003208634570000032
wherein V_exp,max is the maximum desired speed.
13. The method for planning a driving path of a vehicle according to claim 8, wherein the calculation of the desired speed V_exp of the host vehicle comprises:
when the host vehicle meets a green-light road condition, the desired speed of the host vehicle is
V_exp = V_exp,max,
wherein V_exp,max is the maximum desired speed.
14. The method for planning a driving path of a vehicle according to claim 7, wherein calculating the vehicle location reward value of the path planning model comprises:
determining the vehicle location reward value according to the distance S1 between the center point of the host vehicle and the center line of the lane;
wherein,
when |S1| > 1 or |S1| = 1, r_t,position = -1;
when |S1| = 0, r_t,position = 0;
when 0 < |S1| < 1,
Figure FDA0003208634570000041
where |S1| represents the absolute value of S1.
15. The method for planning a travel path of a vehicle according to claim 7, wherein deriving a vehicle state transition reward value based on the vehicle speed reward value and the vehicle location reward value comprises:
the vehicle state transition reward value is
r_t = r_t,speed + r_t,position,
wherein r_t,speed represents the vehicle speed reward value and r_t,position represents the vehicle location reward value.
16. The method for planning a driving path of a vehicle according to claim 7, wherein calculating the model return estimation value based on the vehicle state transition reward value comprises:
the model return estimate is
Figure FDA0003208634570000042
wherein ρ is an estimation coefficient, T represents the total number of frames of environment state feature maps in the environment state feature map sequence, the total frame number corresponds to the time point at which the path planning ends, and T is a positive integer.
17. The method for planning a driving path of a vehicle according to claim 3, wherein the trunk neural network, the first feature vectorization module, the first fully-connected network and the second fully-connected network are connected to form an Actor network;
the Actor network is connected with the second feature vectorization module, the third fully-connected network and the fourth fully-connected network to form a Critic network;
wherein the Actor network outputs a planned trajectory a_t of the host vehicle, the neural network weight parameter to be learned by the Actor network is θ^μ, the Actor network is expressed with its weight parameter as a_t = μ(s_t|θ^μ), and s_t represents the environment and state data at the current time;
the Critic network outputs the model return estimation value Q_t, the neural network weight parameters to be learned by the Critic network comprise θ^μ of the first half of the network and θ^E of the latter half of the network, and the Critic network is expressed with its weight parameters as Q_t = Q(s_t, a_t|θ^μ, θ^E); the environment state feature map sequence comprises multiple frames of environment state feature maps.
18. The method for planning a driving path of a vehicle according to claim 17, wherein performing reinforcement learning on the neural network weight parameters θ^μ and θ^E of the Actor network and the Critic network comprises:
setting the playback buffer size RB for the reinforcement learning and the number N of samples in a training batch, wherein RB and N are positive integers;
initializing the Actor network μ(s_t|θ^μ) and the Critic network Q(s_t, a_t|θ^μ, θ^E) with respect to the neural network weight parameters θ^μ and θ^E;
constructing a first target network μ′(s_t|θ^μ′) and a second target network Q′(s_t, a_t|θ^μ′, θ^E′) whose structures are completely identical to those of the Actor network μ(s_t|θ^μ) and the Critic network Q(s_t, a_t|θ^μ, θ^E), respectively;
initializing the weight parameters θ^μ′ and θ^E′ of the first target network μ′(s_t|θ^μ′) and the second target network Q′(s_t, a_t|θ^μ′, θ^E′);
setting the update period value Num_update of the target network weight parameters;
setting an initial value s_1 of the environment and state data and an initial value of the target network update count value Num_count;
for the environment and state data with a total frame number of T frames, starting from the environment and state data s_1, performing a learning step.
19. The method for planning a driving path of a vehicle according to claim 18, wherein, for the environment and state data of T frames, starting from the environment and state data s_1, performing the learning step comprises:
adding a disturbance to the output a_t of the current Actor network to obtain a_t,d as the motion trajectory indication of the current frame;
executing a_t,d on the environment and state based on the environment and state data s_t at the current time, and obtaining the environment and state data s_{t+1} after the vehicle state transition and the corresponding vehicle state transition reward value r_t;
saving the sample vector (s_t, a_t,d, r_t, s_{t+1}) corresponding to the current vehicle state transition in the playback buffer;
randomly taking N samples (s_i, a_i,d, r_i, s_{i+1}) (i = 1, 2, …, N, a_i,d ∈ a_t,d, r_i ∈ r_t) from the playback buffer, and training the Actor network and the Critic network.
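A non-limiting Python sketch of one such learning step is given below; the objects actor, env, zeta and train_networks, as well as the buffer size value, are placeholders assumed for illustration, and the perturbation follows the form given in claim 21.

```python
import random
from collections import deque

RB = 100_000                          # assumed playback buffer size (RB in claim 18)
replay_buffer = deque(maxlen=RB)

def learning_step(s_t, actor, env, sigma, beta, zeta, N, train_networks):
    """One learning step of claim 19. `actor` maps state data to the trajectory
    a_t, `env.step` applies a_t,d and returns (s_next, r_t), `zeta()` returns
    Gaussian noise of the same shape (NumPy arrays assumed), and
    `train_networks` stands for the Actor/Critic updates of claims 24-30."""
    a_t = actor(s_t)                                  # current Actor output
    a_t_d = a_t + sigma * zeta() - beta * a_t         # perturbed action, as in claim 21
    s_next, r_t = env.step(a_t_d)                     # vehicle state transition and reward r_t
    replay_buffer.append((s_t, a_t_d, r_t, s_next))   # store the transition sample
    if len(replay_buffer) >= N:
        batch = random.sample(replay_buffer, N)       # N randomly drawn samples
        train_networks(batch)                         # train the Actor and Critic networks
    return s_next
```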
20. The method for planning a driving path of a vehicle according to claim 18, wherein the initializing comprises initializing, with randomized parameters, the Actor network μ(s_t|θ^μ) and the Critic network Q(s_t, a_t|θ^μ, θ^E) with respect to the neural network weight parameters θ^μ and θ^E.
21. The method for planning a driving path of a vehicle according to claim 19, wherein adding a disturbance to the output a_t of the current Actor network to obtain a_t,d as the motion trajectory indication of the current frame comprises:
a_t,d = μ(s_t|θ^μ) + σζ_t - βμ(s_t|θ^μ),
wherein ζ_t is a Gaussian random process, σ is a first disturbance parameter, and β is a second disturbance parameter.
22. The method for planning a driving path of a vehicle according to claim 19, wherein executing a_t,d on the environment and state based on the environment and state data s_t at the current time and obtaining the environment and state data s_{t+1} after the vehicle state transition is realized by operating the environment and state data calculation module.
23. The method for planning a driving path of a vehicle according to claim 22, wherein the controller module controls lateral and longitudinal movements of the host vehicle.
24. The method for planning a driving path of a vehicle according to claim 19, wherein randomly taking N samples (s_i, a_i,d, r_i, s_{i+1}) (i = 1, 2, …, N, a_i,d ∈ a_t,d) from the playback buffer and training the Actor network and the Critic network comprises:
calculating a target value of the model return estimation value Q_t;
calculating the average residual between the model return estimation value Q_t of the current frame and the model return target value;
selecting, according to a sampling result of a Bernoulli distribution, the manner of updating the weight parameters θ^μ and θ^E of the Critic network;
updating the weight parameter θ^μ of the Actor network;
updating the target network update count value Num_count;
comparing the updated target network update count value Num_count with the update period value Num_update to obtain a judgment result;
determining, according to the judgment result, whether to update the weight parameters θ^μ′ and θ^E′ of the target networks.
25. The method for planning a driving path of a vehicle according to claim 24, wherein calculating the target value of the model return estimation value Q_t comprises:
the target value of the model return estimate Q_t is
y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1}|θ^μ′)|θ^μ′, θ^E′),
where γ is a target value coefficient.
26. The method for planning a driving path of a vehicle according to claim 24, wherein calculating the average residual between the model return estimation value Q_t of the current frame and the model return target value comprises:
the average residual is
Figure FDA0003208634570000071
wherein y_i represents the model return target value.
27. The method for planning a driving path of a vehicle according to claim 24, wherein selecting, according to the sampling result of the Bernoulli distribution, the manner of updating the weight parameters θ^μ and θ^E of the Critic network comprises:
for each frame of the environment and state data, sampling a Bernoulli sample once according to Bernoulli distribution to obtain a sampling result;
if the sampling result is 1, the weight parameter θ^E of the Critic network is updated according to
Figure FDA0003208634570000072
Figure FDA0003208634570000073
and the weight parameter θ^μ is kept unchanged;
if the sampling result is 0, the weight parameters θ^μ and θ^E of the Critic network are updated according to
Figure FDA0003208634570000074
Figure FDA0003208634570000075
Figure FDA0003208634570000076
Figure FDA0003208634570000077
wherein the expression shown in
Figure FDA0003208634570000078
denotes the derivative of the function L with respect to θ^E, the expression shown in
Figure FDA0003208634570000079
denotes the derivative of the function L with respect to θ^μ, the probability that a Bernoulli sample of the Bernoulli distribution equals 1 is taken as k, and 0 < k < 1.
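For illustration only, the Bernoulli-gated choice between the two update paths of claim 27 can be sketched as follows in Python; the two callables stand in for the gradient updates whose exact formulas are given as formula images in the claim.

```python
import random

def bernoulli_gated_update(update_theta_E_only, update_theta_mu_and_E, k):
    """One Bernoulli sample with success probability k (0 < k < 1) per frame
    selects the update path of claim 27. The two callables stand in for the
    gradient updates whose exact formulas are given as formula images."""
    sample = 1 if random.random() < k else 0
    if sample == 1:
        update_theta_E_only()       # update theta^E, keep theta^mu unchanged
    else:
        update_theta_mu_and_E()     # update both theta^mu and theta^E
    return sample
```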
28. The method for planning a driving path of a vehicle according to claim 24, wherein updating the weight parameter θ^μ of the Actor network comprises:
updating the weight parameter θ^μ of the Actor network according to
Figure FDA0003208634570000081
Figure FDA0003208634570000082
wherein J = Q(s, a|θ^μ, θ^E).
29. The method for planning a driving path of a vehicle according to claim 24, wherein updating the target network update count value Num_count comprises:
Num_count = Num_count + 1.
30. The method for planning a driving path of a vehicle according to claim 24, wherein determining, according to the judgment result, whether to update the weight parameters θ^μ′ and θ^E′ of the target networks comprises:
if the target network update count value Num_count is less than the update period value Num_update, continuing the learning step;
if the target network update count value Num_count is equal to the update period value Num_update, updating the weight parameters θ^μ′ and θ^E′ of the target networks according to
θ^E′ ← τθ^E + (1 - τ)θ^E′
θ^μ′ ← τθ^μ + (1 - τ)θ^μ′
and resetting the target network update count value Num_count to zero;
wherein τ is the update coefficient of the target network weights.
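The soft update of claim 30 can be illustrated with the following Python sketch, in which the network weights are represented as plain lists of floats purely for readability.

```python
def soft_update(target_params, source_params, tau):
    """theta' <- tau * theta + (1 - tau) * theta', applied element-wise.
    Weights are plain lists of floats here, purely for illustration."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target_params, source_params)]

# When Num_count reaches Num_update, both target networks are refreshed, e.g.:
# theta_E_target = soft_update(theta_E_target, theta_E, tau)
# theta_mu_target = soft_update(theta_mu_target, theta_mu, tau)
# and Num_count is reset to zero.
```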
31. The method for planning a travel path of a vehicle according to claim 1, wherein the sequence of the environmental state feature maps of the host vehicle includes a plurality of frames of environmental state feature maps, and each of the environmental state feature maps is generated by:
generating an environment static picture taking the self-vehicle as a picture center based on the map data;
generating an environment dynamic picture taking the self-vehicle as a picture center based on the target detection tracking result;
and generating the environment state characteristic graph according to the environment static picture and the environment dynamic picture.
32. The method for planning a driving path of a vehicle according to claim 31, wherein generating the environment state feature map based on the environment static picture and the environment dynamic picture comprises:
taking the environment static picture as a base map;
overlaying picture information contained in the environment dynamic picture on the base map;
taking the self-vehicle central point of the current frame as a pixel central point on the environment state characteristic diagram;
and setting the heading angle direction of the vehicle as the direction right above the environmental state characteristic diagram, and generating the environmental state characteristic diagram.
CN202110927868.6A 2021-08-12 2021-08-12 Vehicle travel path planning method Active CN113625718B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110927868.6A CN113625718B (en) 2021-08-12 2021-08-12 Vehicle travel path planning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110927868.6A CN113625718B (en) 2021-08-12 2021-08-12 Vehicle travel path planning method

Publications (2)

Publication Number Publication Date
CN113625718A true CN113625718A (en) 2021-11-09
CN113625718B CN113625718B (en) 2023-07-21

Family

ID=78385148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110927868.6A Active CN113625718B (en) 2021-08-12 2021-08-12 Vehicle travel path planning method

Country Status (1)

Country Link
CN (1) CN113625718B (en)



Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3657130A1 (en) * 2017-01-12 2020-05-27 Mobileye Vision Technologies Ltd. Navigation based on vehicle activity
CN108931981A (en) * 2018-08-14 2018-12-04 汽-大众汽车有限公司 A kind of paths planning method of automatic driving vehicle
EP3688540A1 (en) * 2018-12-18 2020-08-05 Beijing Voyager Technology Co., Ltd. Systems and methods for autonomous driving
US20200331465A1 (en) * 2019-04-16 2020-10-22 Ford Global Technologies, Llc Vehicle path prediction
CN110631596A (en) * 2019-04-23 2019-12-31 太原理工大学 Equipment vehicle path planning method based on transfer learning
CN111141300A (en) * 2019-12-18 2020-05-12 南京理工大学 Intelligent mobile platform map-free autonomous navigation method based on deep reinforcement learning
CN111061277A (en) * 2019-12-31 2020-04-24 歌尔股份有限公司 Unmanned vehicle global path planning method and device
WO2021135554A1 (en) * 2019-12-31 2021-07-08 歌尔股份有限公司 Method and device for planning global path of unmanned vehicle
CN112381132A (en) * 2020-11-11 2021-02-19 上汽大众汽车有限公司 Target object tracking method and system based on fusion of multiple cameras
CN113156963A (en) * 2021-04-29 2021-07-23 重庆大学 Deep reinforcement learning automatic driving automobile control method based on supervision signal guidance

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAIQING SHEN et al.: "Path-Following Control of Underactuated Ships using Actor-Critic Reinforcement Learning with MLP Neural Networks", Sixth International Conference on Information Science and Technology, pages 317-321 *
张栩源 et al.: "自动驾驶汽车路径规划技术" [Path planning technology for autonomous vehicles], Automotive Engineer (《汽车工程师》), pages 35-39 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023102827A1 (en) * 2021-12-09 2023-06-15 华为技术有限公司 Path constraint method and device

Also Published As

Publication number Publication date
CN113625718B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN114384920B (en) Dynamic obstacle avoidance method based on real-time construction of local grid map
CN110007675B (en) Vehicle automatic driving decision-making system based on driving situation map and training set preparation method based on unmanned aerial vehicle
Chen et al. Attention-based hierarchical deep reinforcement learning for lane change behaviors in autonomous driving
KR102539942B1 (en) Method and apparatus for training trajectory planning model, electronic device, storage medium and program
US11003928B2 (en) Using captured video data to identify active turn signals on a vehicle
CN115303297B (en) Urban scene end-to-end automatic driving control method and device based on attention mechanism and graph model reinforcement learning
JP2020064619A (en) Device and method for training image recognition model and method for recognizing image
US11873006B2 (en) Virtual lane estimation using a recursive self-organizing map
Friji et al. A dqn-based autonomous car-following framework using rgb-d frames
Masmoudi et al. Autonomous car-following approach based on real-time video frames processing
CN113625718B (en) Vehicle travel path planning method
Souza et al. Vision-based waypoint following using templates and artificial neural networks
Hu et al. Learning dynamic graph for overtaking strategy in autonomous driving
Bhaggiaraj et al. Deep Learning Based Self Driving Cars Using Computer Vision
da Silva Bastos et al. Vehicle speed detection and safety distance estimation using aerial images of Brazilian highways
Holder et al. Learning to drive: End-to-end off-road path prediction
Chen et al. From perception to control: an autonomous driving system for a formula student driverless car
CN113570595B (en) Vehicle track prediction method and optimization method of vehicle track prediction model
Zhang et al. Learning how to avoiding obstacles for end-to-end driving with conditional imitation learning
CN113793371B (en) Target segmentation tracking method, device, electronic equipment and storage medium
CN115107806A (en) Vehicle track prediction method facing emergency scene in automatic driving system
Beglerovic et al. Polar occupancy map-a compact traffic representation for deep learning scenario classification
Souza et al. Template-based autonomous navigation and obstacle avoidance in urban environments
Kashyap et al. A Minimalistic Model for Converting Basic Cars Into Semi-Autonomous Vehicles Using AI and Image Processing
Souza et al. Vision and GPS-based autonomous vehicle navigation using templates and artificial neural networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant