CN114358128A - Method for training end-to-end automatic driving strategy - Google Patents

Method for training end-to-end automatic driving strategy

Info

Publication number
CN114358128A
CN114358128A (application CN202111480162.6A)
Authority
CN
China
Prior art keywords
learning
reinforcement learning
strategy
network
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111480162.6A
Other languages
Chinese (zh)
Inventor
徐坤
冯时羽
李慧云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202111480162.6A priority Critical patent/CN114358128A/en
Priority to PCT/CN2021/137801 priority patent/WO2023102962A1/en
Publication of CN114358128A publication Critical patent/CN114358128A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a method for training an end-to-end automatic driving strategy. The method comprises the following steps: inputting high-dimensional visual information reflecting the driving environment into a pre-trained representation network and automatically learning low-dimensional information, wherein the representation network performs supervised learning on collected teaching data and the low-dimensional information consists of abstract features strongly correlated with the automatic driving task; and constructing a reinforcement learning model in which the agent obtains its observations through the low-dimensional representation produced by the pre-trained representation network so as to obtain an optimized driving strategy, wherein the reinforcement learning process is implemented as a discrete-time Markov decision process and its goal is to maximize the expected long-term return. Because the method learns an abstract feature representation strongly correlated with the automatic driving task before reinforcement learning, the optimal driving strategy can be obtained more quickly and accurately.

Description

Method for training end-to-end automatic driving strategy
Technical Field
The invention relates to the technical field of automatic driving, in particular to a method for training an end-to-end automatic driving strategy.
Background
Automatic driving system architectures are generally divided into two types. One is the modular architecture, which comprises key components such as perception, planning, decision and control; the other is the end-to-end architecture, which directly maps the input information collected by the vehicle (e.g., visual information) to the control output (e.g., desired vehicle speed, steering angle commands, etc.).
The modular architecture allows each component to be clearly defined and governed by deterministic rules and has good interpretability, but its system structure is complex, it can only guarantee strategy behavior within its designed capability, and the whole-vehicle performance still requires extensive verification after the components are integrated.
The end-to-end method is an automatic driving paradigm that has emerged in recent years. It has a simple structure (it can be regarded as a single learning task), can automatically learn and extract features relevant to the automatic driving task, and automatically builds a direct input-output mapping for the automatic driving task. The two learning paradigms commonly used in end-to-end automatic driving are imitation learning and reinforcement learning.
Imitation learning has been applied to automatic driving navigation; it aims to learn from observed example data and is generally regarded as a supervised learning problem. Imitation learning currently requires a large amount of teaching trajectory data from experts, most samples are positive examples, and negative examples are very difficult to collect. In addition, imitation learning is limited by the problem of data distribution drift: as errors accumulate over successive time steps, the result can be catastrophic.
Reinforcement learning aims to collect the reward signal of the environment for each action through the interaction of the agent with the environment, so as to learn a mapping from environment states to behaviors. Existing reinforcement learning methods have great potential in automatic driving, and model-free reinforcement learning methods such as deep Q-networks (DQN) have been applied to vehicle control systems based on visual input information. However, current reinforcement learning and representation learning still have certain limitations.
1) Reinforcement learning with teaching
One of the major problems of reinforcement learning is the cold start problem, which is mainly caused by the sparsity of the reward signal in a high-dimensional space. When an agent begins learning in a new environment, it may take a considerable amount of time to obtain the first positive reward signal. To address this problem, reinforcement learning with teaching attempts to speed up the training process by combining the ideas of imitation learning and reinforcement learning, using imitation learning to provide the initial strategy for reinforcement learning. Reinforcement learning allows an agent to learn the optimal driving strategy from exploration, but its sample efficiency is low, which means the agent may need millions of exploration steps before obtaining the optimal policy.
2) Representation learning
In imitation-learning-based automatic driving, the teaching of an expert can be learned by adjusting the parameters of a neural network. In the vision-based urban road automatic driving task, the input is an image and the output is a high-level control command. For representation learning, the high-dimensional input is first encoded with a feature extractor: let I denote an observation of the input and f the feature extractor with parameters ρ; then h = f_ρ(I) is obtained, where h is a low-dimensional representation of the original input.
The observation module of the existing conditional imitation learning (CIL) method is realized with a convolutional neural network, and the other two modules are realized with fully connected networks. The output of these modules is a joint representation J(o, m, c) = <I(o), M(m), C(c)>, where o is the observed value, m is a low-dimensional vector representation accompanying the high-dimensional observation, and c is the input control command. Due to the complexity of the driving environment, imitation learning requires a large amount of data to be collected.
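As a concrete illustration of such a joint representation, the following PyTorch-style sketch concatenates an image encoding, a measurement encoding and a command encoding; the module layout and layer sizes are illustrative assumptions, not details taken from the CIL publication or from this disclosure.

```python
import torch
import torch.nn as nn

class JointRepresentation(nn.Module):
    """Sketch of a CIL-style joint representation J(o, m, c) = <I(o), M(m), C(c)>."""

    def __init__(self, img_feat_dim=512, meas_dim=1, cmd_dim=4, hidden=128):
        super().__init__()
        # I(o): convolutional encoder for the image observation
        self.image_enc = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, img_feat_dim), nn.ReLU(),
        )
        # M(m): fully connected encoder for the low-dimensional measurement vector
        self.meas_enc = nn.Sequential(nn.Linear(meas_dim, hidden), nn.ReLU())
        # C(c): fully connected encoder for the (one-hot) high-level command
        self.cmd_enc = nn.Sequential(nn.Linear(cmd_dim, hidden), nn.ReLU())

    def forward(self, o, m, c):
        # The joint representation is the concatenation of the three embeddings
        return torch.cat([self.image_enc(o), self.meas_enc(m), self.cmd_enc(c)], dim=-1)
```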
In summary, current end-to-end automatic driving methods still suffer from low sample efficiency, which restricts the development and application of the end-to-end automatic driving paradigm.
Disclosure of Invention
It is an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a method of training an end-to-end autopilot strategy. The method comprises the following steps:
inputting high-dimensional visual information reflecting the driving environment into a pre-trained representation network and automatically learning low-dimensional information, wherein the representation network performs supervised learning on collected teaching data, and the low-dimensional information consists of abstract features strongly correlated with the automatic driving task;
exploring with the constructed reinforcement learning method in the abstract feature space to obtain an optimized driving strategy, wherein the reinforcement learning process is implemented as a discrete-time Markov decision process: in state s_t, the agent obtains an observation o_t by observing the environment, takes action a_t based on the strategy π(a_t|s_t), obtains the reward signal r_t, and then transitions to the next state s_{t+1}; the goal of reinforcement learning is to maximize the expected long-term return.
Compared with the prior art, the invention provides an accelerated learning framework and method for end-to-end automatic driving strategies. Abstract low-dimensional information relevant to the automatic driving task is obtained through representation learning while irrelevant information is ignored, and the subsequent reinforcement learning explores in the low-dimensional abstract feature space, which improves sample efficiency and accelerates model training. By projecting the original high-dimensional observations to a low dimension and learning the representation before the reinforcement learning process, the invention alleviates the low sample efficiency of end-to-end methods.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a block diagram of a framework for training an end-to-end autopilot strategy according to one embodiment of the present invention;
FIG. 2 is a flow diagram of a method of training an end-to-end autopilot strategy according to one embodiment of the invention;
FIG. 3 is a graph illustrating a comparison of performance of a reinforcement learning training process according to one embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The invention provides a technical scheme for realizing automatic driving by combining representation learning and reinforcement learning. The reinforcement learning algorithm is improved by inputting a representation of the environment (rather than raw data perceived from the environment) as a new state into the system prior to reinforcement learning.
Referring to fig. 1, the proposed end-to-end learning framework for automatic driving generally includes pre-training feature extraction, a learning environment representation network, a teaching generation module, efficient exploration and strategy output in an abstract feature space, and the like.
The representation network uses a representation learning method to extract the important features of the currently observed data while ignoring irrelevant information; for example, ResNet-34 can be selected as the feature extractor. The representation of the environment is then input into the system as a new state, which speeds up the convergence of the reinforcement learning algorithm.
The teaching generation module is used to imitate, based on expert teaching data, the expert's control of the driving speed, the steering-wheel angle, and the like.
Efficient exploration in the abstract feature space uses a reinforcement learning method to learn the optimal automatic driving strategy. The reinforcement learning process follows a discrete-time Markov decision process: at time t the agent obtains an observation o_t by observing the environment, takes action a_t based on the strategy π(a_t|s_t), obtains the reward signal r_t, and then transitions to the next state s_{t+1}. The overall system goal is to maximize the expected long-term reward.
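For illustration, the interaction loop described above can be sketched as follows; `env`, `encoder` and `policy` are placeholder objects, and a Gym-style `env.step` interface returning `(obs, reward, done, info)` is assumed.

```python
def run_episode(env, encoder, policy, max_steps=1000):
    """Generic discrete-time MDP rollout: observe o_t, encode to s_t, act a_t, receive r_t."""
    obs = env.reset()                                 # raw high-dimensional observation
    episode_return = 0.0
    for t in range(max_steps):
        state = encoder(obs)                          # low-dimensional abstract state s_t
        action = policy(state)                        # a_t sampled from pi(a_t | s_t)
        obs, reward, done, info = env.step(action)    # reward r_t and next observation
        episode_return += reward
        if done:                                      # episode terminated
            break
    return episode_return
```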
Specifically, referring to FIG. 2, a method of training an end-to-end autopilot strategy is provided that includes the following steps.
Step S210, a representation network of the learning environment is constructed to automatically learn low-dimensional information from the high-dimensional visual information reflecting the driving environment, where the low-dimensional information consists of abstract features strongly correlated with the automatic driving task.
In one embodiment, the representation network of the learning environment takes ResNet-34 pre-trained on a large-scale dataset (e.g., ImageNet) as the feature extractor to convert observed high-dimensional data into low-dimensional data, so that complex high-dimensional input information is automatically distilled into simple low-dimensional abstract information.
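A minimal sketch of such a feature extractor, assuming torchvision's ImageNet-pretrained ResNet-34 with its classification head removed (the exact network configuration used in the embodiment is not reproduced here):

```python
import torch
import torch.nn as nn
from torchvision import models

def build_feature_extractor():
    """ResNet-34 pretrained on ImageNet, truncated before the final fully connected layer."""
    resnet = models.resnet34(pretrained=True)
    # Drop the 1000-class head; after global average pooling the output is a 512-d embedding.
    backbone = nn.Sequential(*list(resnet.children())[:-1], nn.Flatten())
    backbone.eval()  # used here as a fixed, pre-trained encoder
    return backbone

# Example: encode a batch of 224x224 RGB camera frames into 512-dimensional vectors
extractor = build_feature_extractor()
with torch.no_grad():
    h = extractor(torch.randn(4, 3, 224, 224))  # h has shape (4, 512)
```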
Specifically, the image obtained by the camera, together with a time stamp, is taken as the observation o_t of time step t. In addition, the encoder also takes the command c as input. In the urban automatic driving scene, an observation is composed of different types of sensor data, and the original reinforcement learning method alone cannot access low-dimensional state information. The embodiment of the invention therefore adds the representation network to project the original high-dimensional observations into a low-dimensional space.
With the pre-trained ResNet-34, abstract features in the input image that are strongly correlated with the driving task, such as traffic lights, lane lines and surrounding traffic participants, can be extracted, while irrelevant interference such as weather conditions and building positions is filtered out.
Step S220, the representation network performs supervised learning using the collected teaching data.
The representation network is trained by supervised learning based on the features extracted by the fixed feature extractor and on expert teaching data. For example, the teaching generation module consists of two PID (proportional-integral-derivative) controllers: the first performs longitudinal PID control, imitating the control of speed by operating the accelerator and the brake; the second performs lateral PID control, imitating the expert's control of the steering-wheel angle.
In particular, the longitudinal PID controller imitates the expert's speed v* by controlling throttle and brake. This speed is related to the average speed of the vehicle over the trajectory points and can be expressed as:

v* = (1/K) Σ_{i=i_0}^{i_0+K-1} ||w_{i+1} - w_i|| / Δt  (1)

where v* is the expected speed, Δt is the time interval between two track points, w_i is the i-th track point, i_0 is the starting position, K is the total number of track points, and i is the serial number of the current target track point. The longitudinal PID controller uses throttle and brake to reduce the difference between the current speed v and the expected speed v*.
The lateral PID controller is used to realize control of the desired steering angle s*. To reach the next track point w, the steering angle needs to satisfy:

s* = tan^{-1}(w_y / w_x)  (2)

where w_y is the lateral distance from the current position to the next target track point, and w_x is the longitudinal distance from the current position to the next target track point.
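A minimal sketch of the two controllers of the teaching generation module; the gains, the clipping ranges and the function signatures are assumptions for illustration only.

```python
import math

class PID:
    """Basic proportional-integral-derivative controller."""

    def __init__(self, kp, ki, kd, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def longitudinal_control(pid, v_current, v_target):
    """Throttle/brake command that reduces the gap between current and expected speed."""
    u = pid.step(v_target - v_current)
    throttle = min(max(u, 0.0), 1.0)
    brake = min(max(-u, 0.0), 1.0)
    return throttle, brake

def lateral_control(pid, w_x, w_y):
    """Steering command toward the next track point (quadrant-safe form of tan^-1(w_y / w_x))."""
    desired_angle = math.atan2(w_y, w_x)
    return min(max(pid.step(desired_angle), -1.0), 1.0)
```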
And step S230, exploring in an abstract feature space by using a reinforcement learning method to obtain an optimized automatic driving strategy.
During reinforcement learning, the autonomous vehicle explores in a low-dimensional abstract representation of the environment, where the search for the optimal strategy may converge faster because there is less noise and irrelevant information. In a driving scenario, this helps the agent focus on important information such as traffic lights and other traffic participants while ignoring information irrelevant to driving, such as weather conditions and the positions of buildings.
Reinforcement learning explores efficiently in the abstract feature space. For example, reinforcement learning is implemented as a discrete-time Markov decision process. Specifically, in state s_t (t is the time step), the agent obtains an observation o_t by observing the environment, takes action a_t based on the strategy π(a_t|s_t), obtains the reward signal r_t, and then transitions to the next state s_{t+1}. The overall system goal is to maximize the expected long-term return, described as:

Q = E[ Σ_t γ^t r_t ]  (3)

where γ is an attenuation factor with value between 0 and 1 and Q is the value function.
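For concreteness, the discounted return in equation (3) can be computed from a finite recorded reward sequence as follows (a plain illustrative helper, not part of the claimed method):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: discounted_return([1.0, 0.0, 2.0], gamma=0.9) == 1.0 + 0.9*0.0 + 0.81*2.0 == 2.62
```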
In an embodiment of the invention, the goal of reinforcement learning is to learn a policy π that maximizes the cumulative reward, expressed as:

π* = argmax_π E_{τ~p_π(τ)}[ Σ_t γ^t r(s_t, a_t) ]  (4)

over the Markov decision process M = (S, A, R, T, ρ_0, r, γ), where S denotes the state space, A the behavior (action) space, R the reward function, T the state transition function, ρ_0 the probability distribution of the initial state, r the reward, and γ the attenuation factor with value between 0 and 1.
Reinforcement learning can solve the problem of learning to control a dynamic system. For example, the soft actor-critic (SAC) algorithm is used for reinforcement learning, with the goal of learning the strategy π that maximizes the cumulative reward. Soft actor-critic evaluates a state value function and a state-behavior value function to optimize the objective (maximizing the expected return) with a policy-based optimization method, and directly obtains the optimized strategy π_θ, where the parameter θ is obtained by adjusting it along the gradient ∇_θ of the objective.
In one embodiment, the state value function V may be defined as:

V^π(s_t) = E_{a_t~π}[ Q^π(s_t, a_t) ]  (5)

where Q^π is the long-term return value function.
In one embodiment, a designed reward is used in reinforcement learning training, and safety is regarded as the most important factor. For example, the reward function is set to:

r_t = r_v + 0.05 r_step + 10 r_col + 0.8 r_safe  (6)

where r_v is the traffic efficiency, defined for example as r_v = v + 2(v_max - v), with v_max the speed limit and v the current speed; r_step is a constant step penalty used to push the agent to explore faster; r_col is the penalty for a collision; and r_safe is a safety term (equation (7)) defined in terms of an adjustable scaling factor λ_s, reward factors r_1 and r_2, distances d_1 and d_2 to the target track point, and a speed controller v_safe = v[1 - (v ≤ v_min, a_t ≤ 0)], where v_min is a lower speed threshold and a_t is the behavioral action.
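A sketch of the composite reward in equation (6); since the exact form of r_safe is not reproduced above, the safety term is passed in by the caller, and the sign and scale of the collision penalty are assumptions.

```python
def compute_reward(v, v_max, collided, r_safe, step_penalty=-1.0):
    """r_t = r_v + 0.05 * r_step + 10 * r_col + 0.8 * r_safe (equation (6))."""
    r_v = v + 2.0 * (v_max - v)            # traffic-efficiency term
    r_step = step_penalty                  # constant per-step penalty to push exploration
    r_col = -1.0 if collided else 0.0      # collision penalty (assumed sign/scale)
    return r_v + 0.05 * r_step + 10.0 * r_col + 0.8 * r_safe
```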
In one embodiment, the state-behavior value function Q is defined as:

Q^π(s_t, a_t) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t, a_t ]  (8)

and the optimized strategy π* may be obtained by maximizing the expected Q value over policies. In the actor-critic method, the value function and the Q function are learned alternately by minimizing the Bellman error, and the strategy is learned by maximizing the Q value, so the optimal strategy can be obtained. For soft actor-critic (SAC), to ease exploration, a maximum entropy term H is added to the objective, expressed as:

J(π) = E_{τ~p_π(τ)}[ Σ_{t=0}^{T} r(s_t, a_t) + α H(π(·|s_t)) ]  (9)

where α is a set parameter that determines the importance of the entropy term relative to the reward and can be used to control the randomness of the optimal strategy, τ denotes a trajectory in the strategy space, T denotes the total number of iterations, and p_π(τ) denotes the distribution function of the strategy.
In soft actor-critic (SAC), a stochastic strategy is optimized in an off-policy manner under the actor-critic framework. SAC incorporates the maximum entropy of the strategy into the reward to achieve stability and to encourage exploration. In the autonomous driving strategy learning process, this approach can learn a strategy that acts as randomly as possible while still being able to complete the task.
In particular, SAC learns a policy network π_φ, two Q networks Q_{θ1} and Q_{θ2}, and a value function network V_ψ. The target Q value is iterated as r(s_t, a_t) + γ V_target(s_{t+1}), where V_target denotes the target value function network. For example, the Q networks are updated with the following loss function:

J_Q(θ) = E_{(s_t,a_t)~D}[ (Q_θ(s_t, a_t) - (r(s_t, a_t) + γ V_target(s_{t+1})))^2 ]
value function network VψThe update is performed in the following manner:
Figure BDA0003394210600000081
where α is a non-negative parameter of the control entropy. Behavior from current policy
Figure BDA0003394210600000082
To obtain the compound.
Finally, the policy network is updated as follows:

J_π(φ) = E_{s_t~D, a_t~π_φ}[ α log π_φ(a_t|s_t) - Q_θ(s_t, a_t) ]
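A condensed PyTorch sketch of these three updates, assuming networks `q1`, `q2`, `value` and `value_target` and a policy whose `sample` method returns reparameterized actions with log-probabilities; this follows the original value-network variant of SAC and is not a verbatim transcription of the embodiment.

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, policy, q1, q2, value, value_target, alpha=0.2, gamma=0.99):
    s, a, r, s_next, done = batch  # tensors sampled from the replay buffer

    # Q loss: regress both Q networks toward r + gamma * V_target(s_{t+1})
    with torch.no_grad():
        q_backup = r + gamma * (1.0 - done) * value_target(s_next)
    q_loss = F.mse_loss(q1(s, a), q_backup) + F.mse_loss(q2(s, a), q_backup)

    # Value loss: V(s_t) should match E[ min Q(s_t, a) - alpha * log pi(a|s_t) ]
    a_new, log_pi = policy.sample(s)
    q_min = torch.min(q1(s, a_new), q2(s, a_new))
    value_loss = F.mse_loss(value(s), (q_min - alpha * log_pi).detach())

    # Policy loss: maximize E[ min Q(s_t, a) - alpha * log pi(a|s_t) ]
    policy_loss = (alpha * log_pi - q_min).mean()

    return q_loss, value_loss, policy_loss
```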
the flow of the embodiment of the invention refers to the following steps:
Figure BDA0003394210600000084
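One possible reading of this flow as a training loop; `supervised_finetune`, `SACAgent` and the environment interface are hypothetical placeholders used only to connect the steps described above.

```python
def train(env, demonstrations, num_episodes=1000):
    # Steps S210/S220: build and pre-train the representation network on teaching data
    encoder = build_feature_extractor()                 # e.g. ImageNet-pretrained ResNet-34
    encoder = supervised_finetune(encoder, demonstrations)

    # Step S230: SAC-based reinforcement learning in the abstract feature space
    agent = SACAgent(state_dim=512, action_dim=2)       # e.g. steering and throttle/brake
    for episode in range(num_episodes):
        obs, done = env.reset(), False
        while not done:
            state = encoder(obs)                        # low-dimensional abstract state
            action = agent.act(state)
            next_obs, reward, done, _ = env.step(action)
            agent.replay_buffer.add(state, action, reward, encoder(next_obs), done)
            agent.update()                              # SAC losses as sketched above
            obs = next_obs
    return agent
```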
in order to further verify the effect of the invention, experimental simulation was performed. The experiments were performed in an open source CARLA simulator. Training an end-to-end automatic driving strategy in a CARLA simulator is currently an accepted learning training method. Cara provides not only high-level navigation commands for steering, straight-ahead, lane-keeping, etc., but also low-level control commands including steering angle of the steering wheel, throttle brake, etc. In addition, CARLA provides a variety of sensors, including lidar, multi-view cameras, depth sensors, GPS, etc., that enable the collection of multi-source data.
In the experimental example of the present invention, Town1 was used as the training and verification environment, and Town2 and Town3 were used as the testing and evaluation environment. The proposed method was evaluated on 10 different trajectories of the map Town 3.
Because the invention combines reinforcement learning with representation learning and uses SAC as the reinforcement learning algorithm, the method of the invention is compared with the original soft actor-critic (SAC) algorithm, conditional imitation learning (CIL), soft Q imitation learning (SQIL) and deep Q-learning from demonstrations (DQfD) as baseline methods to show the improvement achieved by the invention.
Compared with these baseline methods, the proposed method improves performance to a certain extent. In the experiments, CIL and SQIL represent imitation learning methods and SAC represents reinforcement learning, and all methods are tested under the same illumination and weather conditions. The test results are given in Table 1 below.
Table 1. Comparison of test performance (the table is provided as an image; the key results are summarized below).
As can be seen from Table 1, the invention outperforms the prior art under various illumination and weather conditions such as clear days, rainy days, nights and rainy nights: the success rate exceeds 90%, the collision rate is also significantly improved, and the episode length (convergence time) is reduced. Therefore, compared with existing imitation learning and reinforcement learning methods for automatic driving, the proposed method avoids the cold start problem, converges faster, and exhibits stronger safety and higher stability under different weather conditions.
Fig. 3 compares training performance, with reward on the ordinate and iteration on the abscissa; the uppermost curve represents the invention. It can be seen intuitively that the invention significantly speeds up convergence and avoids the cold start problem to some extent.
In summary, the technical effects of the present invention are mainly reflected in the following aspects:
1) Representation learning is applied to end-to-end urban-road automatic driving strategy learning: the high-dimensional visual input is processed by representation learning, and low-dimensional information consisting of abstract features relevant to the automatic driving task is learned automatically. In this way, task-irrelevant noise in the original high-dimensional input is reduced, which accelerates the training process and improves sample efficiency.
2) To ensure the effectiveness of the reinforcement learning process and that the representation can properly describe the scenes relevant to the urban driving task, a feature extractor pre-trained on a large-scale dataset (such as ImageNet), for example ResNet-34, is used to improve the stability of the system. In addition, to ensure that the representation can describe the environment efficiently, representation learning is performed on data collected in the CARLA simulator using, for example, the DAgger technique (a schematic collection loop is sketched after this list).
3) The constructed reinforcement learning method searches in an abstract world formed by the low-dimensional abstract state information, and can obtain the optimal driving strategy accurately and efficiently.
4) Considering that a framework which learns the representation and the reinforcement learning objective simultaneously may cause a non-stationarity problem, the method learns the representation before the reinforcement learning process, so the optimal driving strategy can be obtained more quickly.
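A schematic DAgger-style collection loop; the expert, learner and environment interfaces are placeholders, and this sketch only illustrates the data-aggregation idea rather than the exact procedure used in the embodiment.

```python
def dagger_collect(env, expert, learner, encoder, iterations=5, horizon=500):
    """Aggregate expert-labelled states that are visited under the learner's own policy."""
    dataset = []
    for _ in range(iterations):
        obs = env.reset()
        for _ in range(horizon):
            state = encoder(obs)
            dataset.append((state, expert.act(obs)))         # expert labels the visited state
            obs, _, done, _ = env.step(learner.act(state))   # but the learner drives
            if done:
                break
        learner.fit(dataset)                                 # retrain on aggregated data
    return dataset
```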
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, Python, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A method of training an end-to-end autopilot strategy, comprising the steps of:
inputting high-dimensional visual information reflecting the driving environment into a pre-trained representation network and automatically learning low-dimensional information, wherein the representation network performs supervised learning on collected teaching data, and the low-dimensional information consists of abstract features strongly correlated with the automatic driving task;
constructing a reinforcement learning method that explores in the abstract feature space to obtain an optimized driving strategy, wherein the reinforcement learning process is implemented as a discrete-time Markov decision process: in state s_t, the agent obtains an observation o_t through the low-dimensional representation produced by the pre-trained representation network, takes a behavioral action a_t based on the strategy π(a_t|s_t), obtains the reward signal r_t, and then transitions to the next state s_{t+1}; the goal of reinforcement learning is to maximize the expected long-term return Q^π, expressed as:

Q^π = E[ Σ_t γ^t r_t ]

where γ is an attenuation factor with value between 0 and 1 and t denotes the time step.
2. The method of claim 1, wherein the teaching data comprise expert control data for the accelerator, brake and steering wheel, and longitudinal proportional-integral-derivative control and lateral proportional-integral-derivative control are adopted to reach target track points generated by the reinforcement learning method, so as to imitate the expert's control of speed and steering angle.
3. The method of claim 1, wherein a soft actor-critic algorithm is employed for reinforcement learning, evaluating a state value function and a state-behavior value function to maximize the expected return and obtain the optimized strategy, wherein:

the state value function V is set to:

V^π(s_t) = E_{a_t~π}[ Q^π(s_t, a_t) ]

the state-behavior value function Q is set to:

Q^π(s_t, a_t) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t, a_t ]

the reward function is set to:

r_t = r_v + 0.05 r_step + 10 r_col + 0.8 r_safe

where Q^π is the long-term return value function, r_v is the traffic efficiency, r_step is a constant step penalty, r_col is a penalty for a collision, and r_safe is a safety control item.
4. The method of claim 3, wherein the safety control item r_safe is defined in terms of an adjustable scaling factor λ_s, reward factors r_1 and r_2, distances d_1 and d_2 to the target track point, and a speed controller v_safe, and the traffic efficiency r_v is:

r_v = v + 2(v_max - v)

where v_max is the speed limit, v_safe is defined as v_safe = v[1 - (v ≤ v_min, a_t ≤ 0)], v_min is a lower speed threshold, and a_t is a behavioral action.
5. The method of claim 4, wherein the optimization objective of the soft actor-critic algorithm is expressed as:

J(π) = E_{τ~p_π(τ)}[ Σ_{t=0}^{T} r(s_t, a_t) + α H(π(·|s_t)) ]

where α is a set parameter that determines the importance of the entropy term relative to the reward, τ denotes a trajectory in the strategy space, T denotes the total number of iterations, and p_π(τ) denotes the distribution function of the strategy.
6. The method of claim 3, wherein the soft actor-critic algorithm learns a policy network π_φ, two Q networks Q_{θ1} and Q_{θ2}, and a value function network V_ψ; the Q networks are updated with the following loss function:

J_Q(θ) = E_{(s_t,a_t)~D}[ (Q_θ(s_t, a_t) - (r(s_t, a_t) + γ V_target(s_{t+1})))^2 ]

the value function network V_ψ is updated with the following formula:

J_V(ψ) = E_{s_t~D}[ (1/2) ( V_ψ(s_t) - E_{a_t~π_φ}[ Q_θ(s_t, a_t) - α log π_φ(a_t|s_t) ] )^2 ]

and the policy network update is expressed as:

J_π(φ) = E_{s_t~D, a_t~π_φ}[ α log π_φ(a_t|s_t) - Q_θ(s_t, a_t) ]

where V_target denotes the target value function network, α is a non-negative parameter controlling the entropy, and the behavioral action a_t is obtained from the current policy π_φ.
7. The method of claim 2, wherein the longitudinal proportional-integral-derivative control imitates the expert's speed v* by controlling throttle and brake, this speed being related to the average speed of the vehicle over the trajectory points and expressed as:

v* = (1/K) Σ_{i=i_0}^{i_0+K-1} ||w_{i+1} - w_i|| / Δt

where w denotes an expected trajectory point generated by reinforcement learning, v* is the expected speed, Δt is the time interval between two trajectory points, i_0 is the starting position, K is the total number of trajectory points, and i is the serial number of the current target trajectory point; the longitudinal PID controller uses throttle and brake to reduce the difference between the current speed v and the expected speed v*.
8. The method of claim 2, wherein the lateral proportional-integral-derivative control is used to realize control of the desired steering angle s*, and the steering angle needs to satisfy:

s* = tan^{-1}(w_y / w_x)

where w_y is the lateral distance from the current position to the next target track point and w_x is the longitudinal distance from the current position to the next target track point.
9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
10. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method of any one of claims 1 to 8.
CN202111480162.6A 2021-12-06 2021-12-06 Method for training end-to-end automatic driving strategy Pending CN114358128A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111480162.6A CN114358128A (en) 2021-12-06 2021-12-06 Method for training end-to-end automatic driving strategy
PCT/CN2021/137801 WO2023102962A1 (en) 2021-12-06 2021-12-14 Method for training end-to-end autonomous driving strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111480162.6A CN114358128A (en) 2021-12-06 2021-12-06 Method for training end-to-end automatic driving strategy

Publications (1)

Publication Number Publication Date
CN114358128A true CN114358128A (en) 2022-04-15

Family

ID=81097622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111480162.6A Pending CN114358128A (en) 2021-12-06 2021-12-06 Method for training end-to-end automatic driving strategy

Country Status (2)

Country Link
CN (1) CN114358128A (en)
WO (1) WO2023102962A1 (en)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596060B (en) * 2023-07-19 2024-03-15 深圳须弥云图空间科技有限公司 Deep reinforcement learning model training method and device, electronic equipment and storage medium
CN116946162B (en) * 2023-09-19 2023-12-15 东南大学 Intelligent network combined commercial vehicle safe driving decision-making method considering road surface attachment condition
CN117725985B (en) * 2024-02-06 2024-05-24 之江实验室 Reinforced learning model training and service executing method and device and electronic equipment
CN117911414A (en) * 2024-03-20 2024-04-19 安徽大学 Automatic driving automobile motion control method based on reinforcement learning
CN117933346A (en) * 2024-03-25 2024-04-26 之江实验室 Instant rewarding learning method based on self-supervision reinforcement learning


Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10308242B2 (en) * 2017-07-01 2019-06-04 TuSimple System and method for using human driving patterns to detect and correct abnormal driving behaviors of autonomous vehicles
CN108594804B (en) * 2018-03-12 2021-06-18 苏州大学 Automatic driving control method for distribution trolley based on deep Q network
CN110046712A (en) * 2019-04-04 2019-07-23 天津科技大学 Decision search learning method is modeled based on the latent space for generating model
CN111950691A (en) * 2019-05-15 2020-11-17 天津科技大学 Reinforced learning strategy learning method based on potential action representation space
JP2021135770A (en) * 2020-02-27 2021-09-13 ソニーグループ株式会社 Information processing apparatus and information processing method, computer program, as well as observation device
CN111460891B (en) * 2020-03-01 2023-05-26 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Automatic driving-oriented vehicle-road cooperative pedestrian re-identification method and system
WO2021226921A1 (en) * 2020-05-14 2021-11-18 Harman International Industries, Incorporated Method and system of data processing for autonomous driving
CN113561986B (en) * 2021-08-18 2024-03-15 武汉理工大学 Automatic driving automobile decision making method and device
CN113657292A (en) * 2021-08-19 2021-11-16 东南大学 Vehicle automatic tracking driving method based on deep reinforcement learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10792810B1 (en) * 2017-12-14 2020-10-06 Amazon Technologies, Inc. Artificial intelligence system for learning robotic control policies
CN110824912A (en) * 2018-08-08 2020-02-21 华为技术有限公司 Method and apparatus for training a control strategy model for generating an autonomous driving strategy
US10503174B1 (en) * 2019-01-31 2019-12-10 StradVision, Inc. Method and device for optimized resource allocation in autonomous driving on the basis of reinforcement learning using data from lidar, radar, and camera sensor
CN111679660A (en) * 2020-06-16 2020-09-18 中国科学院深圳先进技术研究院 Unmanned deep reinforcement learning method integrating human-like driving behaviors

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114770523A (en) * 2022-05-31 2022-07-22 苏州大学 Robot control method based on offline environment interaction
CN114770523B (en) * 2022-05-31 2023-09-15 苏州大学 Robot control method based on offline environment interaction
CN116881707A (en) * 2023-03-17 2023-10-13 北京百度网讯科技有限公司 Automatic driving model, training method, training device and vehicle
CN116149338A (en) * 2023-04-14 2023-05-23 哈尔滨工业大学人工智能研究院有限公司 Automatic driving control method, system and sprayer

Also Published As

Publication number Publication date
WO2023102962A1 (en) 2023-06-15

Similar Documents

Publication Publication Date Title
CN114358128A (en) Method for training end-to-end automatic driving strategy
US11663441B2 (en) Action selection neural network training using imitation learning in latent space
CN107229973B (en) Method and device for generating strategy network model for automatic vehicle driving
WO2022052406A1 (en) Automatic driving training method, apparatus and device, and medium
CN112232490B (en) Visual-based depth simulation reinforcement learning driving strategy training method
US11899748B2 (en) System, method, and apparatus for a neural network model for a vehicle
CN111231983B (en) Vehicle control method, device and equipment based on traffic accident memory network
Chen et al. Driving maneuvers prediction based autonomous driving control by deep Monte Carlo tree search
WO2020052480A1 (en) Unmanned driving behaviour decision making and model training
CN111260027A (en) Intelligent agent automatic decision-making method based on reinforcement learning
CN112487954B (en) Pedestrian crossing behavior prediction method for plane intersection
Darapaneni et al. Autonomous car driving using deep learning
JP2022547611A (en) Simulation of various long-term future trajectories in road scenes
CN116595871A (en) Vehicle track prediction modeling method and device based on dynamic space-time interaction diagram
CN112334914A (en) Mock learning using generative lead neural networks
CN116227620A (en) Method for determining similar scenes, training method and training controller
US20230061411A1 (en) Autoregressively generating sequences of data elements defining actions to be performed by an agent
CN112947466B (en) Parallel planning method and equipment for automatic driving and storage medium
CN112198794A (en) Unmanned driving method based on human-like driving rule and improved depth certainty strategy gradient
Wang et al. An End-to-End Deep Reinforcement Learning Model Based on Proximal Policy Optimization Algorithm for Autonomous Driving of Off-Road Vehicle
Carton Exploration of reinforcement learning algorithms for autonomous vehicle visual perception and control
US20230365146A1 (en) Selection-inference neural network systems
KR20240035565A (en) Autoregressively generates sequences of data elements that define the actions to be performed by the agent.
WO2023050048A1 (en) Method and apparatus for simulating environment for performing task
Ge et al. Urban Driving Based on Condition Imitation Learning and Multi-Period Information Fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination