CN114358128A - Method for training end-to-end automatic driving strategy - Google Patents

Method for training end-to-end automatic driving strategy

Info

Publication number
CN114358128A
CN114358128A (application CN202111480162.6A)
Authority
CN
China
Prior art keywords
learning
reinforcement learning
strategy
network
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111480162.6A
Other languages
Chinese (zh)
Inventor
徐坤
冯时羽
李慧云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202111480162.6A priority Critical patent/CN114358128A/en
Priority to PCT/CN2021/137801 priority patent/WO2023102962A1/en
Publication of CN114358128A publication Critical patent/CN114358128A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a method for training an end-to-end automatic driving strategy. The method comprises the following steps: inputting high-dimensional visual information reflecting the driving environment into a pre-trained representation network and automatically learning low-dimensional information, wherein the representation network performs supervised learning on collected teaching data and the low-dimensional information consists of abstract features strongly correlated with the automatic driving task; and constructing a reinforcement learning model in which the agent obtains its observations through the low-dimensional representation produced by the pre-trained representation network so as to obtain an optimized driving strategy, wherein the reinforcement learning process is implemented as a discrete-time Markov decision process and its goal is to maximize the expected long-term return. Because the method learns an abstract feature representation strongly correlated with the automatic driving task before reinforcement learning, the optimal driving strategy can be obtained more quickly and accurately.

Description

Method for training end-to-end automatic driving strategy
Technical Field
The invention relates to the technical field of automatic driving, in particular to a method for training an end-to-end automatic driving strategy.
Background
Automatic driving system architectures are generally divided into two types. One is the modular architecture, which comprises key components such as perception, planning, decision and control; the other is the end-to-end architecture, which directly maps the input information collected by the vehicle (e.g., visual information) to the control output (e.g., desired vehicle speed, steering angle commands, etc.).
The modular architecture allows each component to be clearly defined and governed by deterministic rules and has good interpretability, but its system structure is complex, it can only guarantee strategy behavior within its designed capability, and the whole-vehicle performance still requires extensive verification after the components are integrated.
The end-to-end method is an automatic driving paradigm that has emerged in recent years. It has a simple structure (it can be regarded as a single learning task), can automatically learn and extract features relevant to the automatic driving task, and automatically builds a direct input-output mapping for the automatic driving task. The two learning paradigms commonly used in end-to-end automatic driving are imitation learning and reinforcement learning.
Imitation learning has been applied to automatic driving navigation; it aims to learn from observed example data and is generally regarded as a supervised learning problem. Imitation learning currently requires a large amount of teaching trajectory data from experts, most samples are positive examples, and negative examples are very difficult to collect. In addition, imitation learning is limited by the problem of data distribution drift: as errors accumulate over successive time steps, the result can be catastrophic.
Reinforcement learning aims to collect the reward signal of the environment for each action through the interaction of the agent with the environment, so as to learn a mapping from environment states to behaviors. Existing reinforcement learning methods have great potential in automatic driving, and model-free reinforcement learning methods such as deep Q-networks (DQN) have been applied to vehicle control systems based on visual input information. However, current reinforcement learning and representation learning still have certain limitations.
1) Reinforcement learning with teaching
One of the major problems of reinforcement learning is the cold start problem, which is mainly caused by the sparsity of the reward signal in a high-dimensional space. When an agent begins learning in a new environment, it may take a considerable amount of time to obtain the first positive reward signal. To address this problem, reinforcement learning with teaching attempts to speed up the training process by combining the ideas of imitation learning and reinforcement learning, using imitation learning to provide the initial strategy for reinforcement learning. Reinforcement learning allows an agent to learn the optimal driving strategy from exploration, but its sample efficiency is low, which means the agent may need millions of exploration steps before obtaining the optimal policy.
2) Representation learning
In imitation-learning-based automatic driving, the teaching of an expert can be learned by adjusting the parameters of a neural network. In the vision-based urban road automatic driving task, the input is an image and the output is a high-level control command. For representation learning, the high-dimensional input is first encoded with a feature extractor: let I denote an observation of the input and f the feature extractor with parameters ρ; then h = f_ρ(I) is obtained, where h is a low-dimensional representation of the original input.
The observation module of the existing conditional imitation learning (CIL) method is realized with a convolutional neural network, and the other two modules are realized with fully connected networks. The output of these modules is a joint representation J(o, m, c) = <I(o), M(m), C(c)>, where o is the observed value, m is a low-dimensional vector representation accompanying the high-dimensional observation, and c is the input control command. Due to the complexity of the driving environment, imitation learning requires a large amount of data to be collected.
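As a concrete illustration of such a joint representation, the following PyTorch-style sketch concatenates an image encoding, a measurement encoding and a command encoding; the module layout and layer sizes are illustrative assumptions, not details taken from the CIL publication or from this disclosure.

```python
import torch
import torch.nn as nn

class JointRepresentation(nn.Module):
    """Sketch of a CIL-style joint representation J(o, m, c) = <I(o), M(m), C(c)>."""

    def __init__(self, img_feat_dim=512, meas_dim=1, cmd_dim=4, hidden=128):
        super().__init__()
        # I(o): convolutional encoder for the image observation
        self.image_enc = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, img_feat_dim), nn.ReLU(),
        )
        # M(m): fully connected encoder for the low-dimensional measurement vector
        self.meas_enc = nn.Sequential(nn.Linear(meas_dim, hidden), nn.ReLU())
        # C(c): fully connected encoder for the (one-hot) high-level command
        self.cmd_enc = nn.Sequential(nn.Linear(cmd_dim, hidden), nn.ReLU())

    def forward(self, o, m, c):
        # The joint representation is the concatenation of the three embeddings
        return torch.cat([self.image_enc(o), self.meas_enc(m), self.cmd_enc(c)], dim=-1)
```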
In summary, current end-to-end automatic driving methods still suffer from low sample efficiency, which restricts the development and application of the end-to-end automatic driving paradigm.
Disclosure of Invention
It is an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a method of training an end-to-end autopilot strategy. The method comprises the following steps:
inputting high-dimensional visual information reflecting the driving environment into a pre-trained representation network and automatically learning low-dimensional information, wherein the representation network performs supervised learning on collected teaching data, and the low-dimensional information consists of abstract features strongly correlated with the automatic driving task;
exploring with the constructed reinforcement learning method in the abstract feature space to obtain an optimized driving strategy, wherein the reinforcement learning process is implemented as a discrete-time Markov decision process: in state s_t, the agent obtains an observation o_t by observing the environment, takes action a_t based on the strategy π(a_t|s_t), obtains the reward signal r_t, and then transitions to the next state s_{t+1}; the goal of reinforcement learning is to maximize the expected long-term return.
Compared with the prior art, the invention provides an accelerated learning framework and method for end-to-end automatic driving strategies. Abstract low-dimensional information relevant to the automatic driving task is obtained through representation learning while irrelevant information is ignored, and the subsequent reinforcement learning explores in the low-dimensional abstract feature space, which improves sample efficiency and accelerates model training. By projecting the original high-dimensional observations to a low dimension and learning the representation before the reinforcement learning process, the invention alleviates the low sample efficiency of end-to-end methods.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a block diagram of a framework for training an end-to-end autopilot strategy according to one embodiment of the present invention;
FIG. 2 is a flow diagram of a method of training an end-to-end autopilot strategy according to one embodiment of the invention;
FIG. 3 is a graph illustrating a comparison of performance of a reinforcement learning training process according to one embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The invention provides a technical scheme for realizing automatic driving by combining representation learning and reinforcement learning. The reinforcement learning algorithm is improved by inputting a representation of the environment (rather than raw data perceived from the environment) as a new state into the system prior to reinforcement learning.
Referring to fig. 1, the proposed end-to-end learning framework for automatic driving generally includes pre-training feature extraction, a learning environment representation network, a teaching generation module, efficient exploration and strategy output in an abstract feature space, and the like.
The representation network uses a representation learning method to extract the important features of the currently observed data while ignoring irrelevant information; for example, ResNet-34 can be selected as the feature extractor. The representation of the environment is then input into the system as a new state, which speeds up the convergence of the reinforcement learning algorithm.
The teaching generation module is used to imitate, based on expert teaching data, the expert's control of the driving speed, the steering-wheel angle, and the like.
Efficient exploration in the abstract feature space uses a reinforcement learning method to learn the optimal automatic driving strategy. The reinforcement learning process follows a discrete-time Markov decision process: at time t the agent obtains an observation o_t by observing the environment, takes action a_t based on the strategy π(a_t|s_t), obtains the reward signal r_t, and then transitions to the next state s_{t+1}. The overall system goal is to maximize the expected long-term reward.
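For illustration, the interaction loop described above can be sketched as follows; `env`, `encoder` and `policy` are placeholder objects, and a Gym-style `env.step` interface returning `(obs, reward, done, info)` is assumed.

```python
def run_episode(env, encoder, policy, max_steps=1000):
    """Generic discrete-time MDP rollout: observe o_t, encode to s_t, act a_t, receive r_t."""
    obs = env.reset()                                 # raw high-dimensional observation
    episode_return = 0.0
    for t in range(max_steps):
        state = encoder(obs)                          # low-dimensional abstract state s_t
        action = policy(state)                        # a_t sampled from pi(a_t | s_t)
        obs, reward, done, info = env.step(action)    # reward r_t and next observation
        episode_return += reward
        if done:                                      # episode terminated
            break
    return episode_return
```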
Specifically, referring to FIG. 2, a method of training an end-to-end autopilot strategy is provided that includes the following steps.
Step S210, a representation network of the learning environment is constructed to automatically learn low-dimensional information from the high-dimensional visual information reflecting the driving environment, where the low-dimensional information consists of abstract features strongly correlated with the automatic driving task.
In one embodiment, the representation network of the learning environment takes ResNet-34 pre-trained on a large-scale dataset (e.g., ImageNet) as the feature extractor to convert observed high-dimensional data into low-dimensional data, so that complex high-dimensional input information is automatically distilled into simple low-dimensional abstract information.
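A minimal sketch of such a feature extractor, assuming torchvision's ImageNet-pretrained ResNet-34 with its classification head removed (the exact network configuration used in the embodiment is not reproduced here):

```python
import torch
import torch.nn as nn
from torchvision import models

def build_feature_extractor():
    """ResNet-34 pretrained on ImageNet, truncated before the final fully connected layer."""
    resnet = models.resnet34(pretrained=True)
    # Drop the 1000-class head; after global average pooling the output is a 512-d embedding.
    backbone = nn.Sequential(*list(resnet.children())[:-1], nn.Flatten())
    backbone.eval()  # used here as a fixed, pre-trained encoder
    return backbone

# Example: encode a batch of 224x224 RGB camera frames into 512-dimensional vectors
extractor = build_feature_extractor()
with torch.no_grad():
    h = extractor(torch.randn(4, 3, 224, 224))  # h has shape (4, 512)
```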
Specifically, the image obtained by the camera, together with a time stamp, is taken as the observation o_t of time step t. In addition, the encoder also takes the command c as input. In the urban automatic driving scene, an observation is composed of different types of sensor data, and the original reinforcement learning method alone cannot access low-dimensional state information. The embodiment of the invention therefore adds the representation network to project the original high-dimensional observations into a low-dimensional space.
With the pre-trained ResNet-34, abstract features in the input image that are strongly correlated with the driving task, such as traffic lights, lane lines and surrounding traffic participants, can be extracted, while irrelevant interference such as weather conditions and building positions is filtered out.
Step S220, the representation network performs supervised learning using the collected teaching data.
The representation network is trained by supervised learning based on the features extracted by the fixed feature extractor and on expert teaching data. For example, the teaching generation module consists of two PID (proportional-integral-derivative) controllers: the first performs longitudinal PID control, imitating the control of speed by operating the accelerator and the brake; the second performs lateral PID control, imitating the expert's control of the steering-wheel angle.
In particular, the longitudinal PID controller imitates the expert's speed v* by controlling throttle and brake. This speed is related to the average speed of the vehicle over the trajectory points and can be expressed as:

v* = (1/K) Σ_{i=i_0}^{i_0+K-1} ||w_{i+1} - w_i|| / Δt  (1)

where v* is the expected speed, Δt is the time interval between two track points, w_i is the i-th track point, i_0 is the starting position, K is the total number of track points, and i is the serial number of the current target track point. The longitudinal PID controller uses throttle and brake to reduce the difference between the current speed v and the expected speed v*.
The lateral PID controller is used to realize control of the desired steering angle s*. To reach the next track point w, the steering angle needs to satisfy:

s* = tan^{-1}(w_y / w_x)  (2)

where w_y is the lateral distance from the current position to the next target track point, and w_x is the longitudinal distance from the current position to the next target track point.
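A minimal sketch of the two controllers of the teaching generation module; the gains, the clipping ranges and the function signatures are assumptions for illustration only.

```python
import math

class PID:
    """Basic proportional-integral-derivative controller."""

    def __init__(self, kp, ki, kd, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def longitudinal_control(pid, v_current, v_target):
    """Throttle/brake command that reduces the gap between current and expected speed."""
    u = pid.step(v_target - v_current)
    throttle = min(max(u, 0.0), 1.0)
    brake = min(max(-u, 0.0), 1.0)
    return throttle, brake

def lateral_control(pid, w_x, w_y):
    """Steering command toward the next track point (quadrant-safe form of tan^-1(w_y / w_x))."""
    desired_angle = math.atan2(w_y, w_x)
    return min(max(pid.step(desired_angle), -1.0), 1.0)
```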
And step S230, exploring in an abstract feature space by using a reinforcement learning method to obtain an optimized automatic driving strategy.
During reinforcement learning, the autonomous vehicle explores in a low-dimensional abstract representation of the environment, where the search for the optimal strategy may converge faster because there is less noise and irrelevant information. In a driving scenario, this helps the agent focus on important information such as traffic lights and other traffic participants while ignoring information irrelevant to driving, such as weather conditions and the positions of buildings.
Reinforcement learning explores efficiently in the abstract feature space. For example, reinforcement learning is implemented as a discrete-time Markov decision process. Specifically, in state s_t (t is the time step), the agent obtains an observation o_t by observing the environment, takes action a_t based on the strategy π(a_t|s_t), obtains the reward signal r_t, and then transitions to the next state s_{t+1}. The overall system goal is to maximize the expected long-term return, described as:

Q = E[ Σ_t γ^t r_t ]  (3)

where γ is an attenuation factor with value between 0 and 1 and Q is the value function.
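For concreteness, the discounted return in equation (3) can be computed from a finite recorded reward sequence as follows (a plain illustrative helper, not part of the claimed method):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: discounted_return([1.0, 0.0, 2.0], gamma=0.9) == 1.0 + 0.9*0.0 + 0.81*2.0 == 2.62
```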
In an embodiment of the invention, the goal of reinforcement learning is to learn a policy π that maximizes the cumulative reward, expressed as:

π* = argmax_π E_{τ~p_π(τ)}[ Σ_t γ^t r(s_t, a_t) ]  (4)

over the Markov decision process M = (S, A, R, T, ρ_0, r, γ), where S denotes the state space, A the behavior (action) space, R the reward function, T the state transition function, ρ_0 the probability distribution of the initial state, r the reward, and γ the attenuation factor with value between 0 and 1.
Reinforcement learning can solve the problem of learning to control a dynamic system. For example, the soft actor-critic (SAC) algorithm is used for reinforcement learning, with the goal of learning the strategy π that maximizes the cumulative reward. Soft actor-critic evaluates a state value function and a state-behavior value function to optimize the objective (maximizing the expected return) with a policy-based optimization method, and directly obtains the optimized strategy π_θ, where the parameter θ is obtained by adjusting it along the gradient ∇_θ of the objective.
In one embodiment, the state value function V may be defined as:

V^π(s_t) = E_{a_t~π}[ Q^π(s_t, a_t) ]  (5)

where Q^π is the long-term return value function.
In one embodiment, a designed reward is used in reinforcement learning training, and safety is regarded as the most important factor. For example, the reward function is set to:

r_t = r_v + 0.05 r_step + 10 r_col + 0.8 r_safe  (6)

where r_v is the traffic efficiency, defined for example as r_v = v + 2(v_max - v), with v_max the speed limit and v the current speed; r_step is a constant step penalty used to push the agent to explore faster; r_col is the penalty for a collision; and r_safe is a safety term (equation (7)) defined in terms of an adjustable scaling factor λ_s, reward factors r_1 and r_2, distances d_1 and d_2 to the target track point, and a speed controller v_safe = v[1 - (v ≤ v_min, a_t ≤ 0)], where v_min is a lower speed threshold and a_t is the behavioral action.
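A sketch of the composite reward in equation (6); since the exact form of r_safe is not reproduced above, the safety term is passed in by the caller, and the sign and scale of the collision penalty are assumptions.

```python
def compute_reward(v, v_max, collided, r_safe, step_penalty=-1.0):
    """r_t = r_v + 0.05 * r_step + 10 * r_col + 0.8 * r_safe (equation (6))."""
    r_v = v + 2.0 * (v_max - v)            # traffic-efficiency term
    r_step = step_penalty                  # constant per-step penalty to push exploration
    r_col = -1.0 if collided else 0.0      # collision penalty (assumed sign/scale)
    return r_v + 0.05 * r_step + 10.0 * r_col + 0.8 * r_safe
```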
In one embodiment, the state-behavior value function Q is defined as:

Q^π(s_t, a_t) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t, a_t ]  (8)

and the optimized strategy π* may be obtained by maximizing the expected Q value over policies. In the actor-critic method, the value function and the Q function are learned alternately by minimizing the Bellman error, and the strategy is learned by maximizing the Q value, so the optimal strategy can be obtained. For soft actor-critic (SAC), to ease exploration, a maximum entropy term H is added to the objective, expressed as:

J(π) = E_{τ~p_π(τ)}[ Σ_{t=0}^{T} r(s_t, a_t) + α H(π(·|s_t)) ]  (9)

where α is a set parameter that determines the importance of the entropy term relative to the reward and can be used to control the randomness of the optimal strategy, τ denotes a trajectory in the strategy space, T denotes the total number of iterations, and p_π(τ) denotes the distribution function of the strategy.
In soft actor-critic (SAC), a stochastic strategy is optimized in an off-policy manner under the actor-critic framework. SAC incorporates the maximum entropy of the strategy into the reward to achieve stability and to encourage exploration. In the autonomous driving strategy learning process, this approach can learn a strategy that acts as randomly as possible while still being able to complete the task.
In particular, SAC learns a policy network π_φ, two Q networks Q_{θ1} and Q_{θ2}, and a value function network V_ψ. The target Q value is iterated as r(s_t, a_t) + γ V_target(s_{t+1}), where V_target denotes the target value function network. For example, the Q networks are updated with the following loss function:

J_Q(θ) = E_{(s_t,a_t)~D}[ (Q_θ(s_t, a_t) - (r(s_t, a_t) + γ V_target(s_{t+1})))^2 ]
value function network VψThe update is performed in the following manner:
Figure BDA0003394210600000081
where α is a non-negative parameter of the control entropy. Behavior from current policy
Figure BDA0003394210600000082
To obtain the compound.
Finally, the policy network is updated as follows:

J_π(φ) = E_{s_t~D, a_t~π_φ}[ α log π_φ(a_t|s_t) - Q_θ(s_t, a_t) ]
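A condensed PyTorch sketch of these three updates, assuming networks `q1`, `q2`, `value` and `value_target` and a policy whose `sample` method returns reparameterized actions with log-probabilities; this follows the original value-network variant of SAC and is not a verbatim transcription of the embodiment.

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, policy, q1, q2, value, value_target, alpha=0.2, gamma=0.99):
    s, a, r, s_next, done = batch  # tensors sampled from the replay buffer

    # Q loss: regress both Q networks toward r + gamma * V_target(s_{t+1})
    with torch.no_grad():
        q_backup = r + gamma * (1.0 - done) * value_target(s_next)
    q_loss = F.mse_loss(q1(s, a), q_backup) + F.mse_loss(q2(s, a), q_backup)

    # Value loss: V(s_t) should match E[ min Q(s_t, a) - alpha * log pi(a|s_t) ]
    a_new, log_pi = policy.sample(s)
    q_min = torch.min(q1(s, a_new), q2(s, a_new))
    value_loss = F.mse_loss(value(s), (q_min - alpha * log_pi).detach())

    # Policy loss: maximize E[ min Q(s_t, a) - alpha * log pi(a|s_t) ]
    policy_loss = (alpha * log_pi - q_min).mean()

    return q_loss, value_loss, policy_loss
```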
the flow of the embodiment of the invention refers to the following steps:
Figure BDA0003394210600000084
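One possible reading of this flow as a training loop; `supervised_finetune`, `SACAgent` and the environment interface are hypothetical placeholders used only to connect the steps described above.

```python
def train(env, demonstrations, num_episodes=1000):
    # Steps S210/S220: build and pre-train the representation network on teaching data
    encoder = build_feature_extractor()                 # e.g. ImageNet-pretrained ResNet-34
    encoder = supervised_finetune(encoder, demonstrations)

    # Step S230: SAC-based reinforcement learning in the abstract feature space
    agent = SACAgent(state_dim=512, action_dim=2)       # e.g. steering and throttle/brake
    for episode in range(num_episodes):
        obs, done = env.reset(), False
        while not done:
            state = encoder(obs)                        # low-dimensional abstract state
            action = agent.act(state)
            next_obs, reward, done, _ = env.step(action)
            agent.replay_buffer.add(state, action, reward, encoder(next_obs), done)
            agent.update()                              # SAC losses as sketched above
            obs = next_obs
    return agent
```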
in order to further verify the effect of the invention, experimental simulation was performed. The experiments were performed in an open source CARLA simulator. Training an end-to-end automatic driving strategy in a CARLA simulator is currently an accepted learning training method. Cara provides not only high-level navigation commands for steering, straight-ahead, lane-keeping, etc., but also low-level control commands including steering angle of the steering wheel, throttle brake, etc. In addition, CARLA provides a variety of sensors, including lidar, multi-view cameras, depth sensors, GPS, etc., that enable the collection of multi-source data.
In the experimental example of the present invention, Town1 was used as the training and verification environment, and Town2 and Town3 were used as the testing and evaluation environment. The proposed method was evaluated on 10 different trajectories of the map Town 3.
Because the invention combines reinforcement learning with representation learning and uses SAC as the reinforcement learning algorithm, the method of the invention is compared with the original soft actor-critic (SAC) algorithm, conditional imitation learning (CIL), soft Q imitation learning (SQIL) and deep Q-learning from demonstrations (DQfD) as baseline methods to show the improvement achieved by the invention.
Compared with these baseline methods, the proposed method improves performance to a certain extent. In the experiments, CIL and SQIL represent imitation learning methods and SAC represents reinforcement learning, and all methods are tested under the same illumination and weather conditions. The test results are given in Table 1 below.
Table 1. Comparison of test performance (the table is provided as an image; the key results are summarized below).
As can be seen from Table 1, the invention outperforms the prior art under various illumination and weather conditions such as clear days, rainy days, nights and rainy nights: the success rate exceeds 90%, the collision rate is also significantly improved, and the episode length (convergence time) is reduced. Therefore, compared with existing imitation learning and reinforcement learning methods for automatic driving, the proposed method avoids the cold start problem, converges faster, and exhibits stronger safety and higher stability under different weather conditions.
Fig. 3 compares training performance, with reward on the ordinate and iteration on the abscissa; the uppermost curve represents the invention. It can be seen intuitively that the invention significantly speeds up convergence and avoids the cold start problem to some extent.
In summary, the technical effects of the present invention are mainly reflected in the following aspects:
1) Representation learning is applied to end-to-end urban-road automatic driving strategy learning: the high-dimensional visual input is processed by representation learning, and low-dimensional information consisting of abstract features relevant to the automatic driving task is learned automatically. In this way, task-irrelevant noise in the original high-dimensional input is reduced, which accelerates the training process and improves sample efficiency.
2) To ensure the effectiveness of the reinforcement learning process and that the representation can properly describe the scenes relevant to the urban driving task, a feature extractor pre-trained on a large-scale dataset (such as ImageNet), for example ResNet-34, is used to improve the stability of the system. In addition, to ensure that the representation can describe the environment efficiently, representation learning is performed on data collected in the CARLA simulator using, for example, the DAgger technique (a schematic collection loop is sketched after this list).
3) The constructed reinforcement learning method searches in an abstract world formed by the low-dimensional abstract state information, and can obtain the optimal driving strategy accurately and efficiently.
4) Considering that a framework which learns the representation and the reinforcement learning objective simultaneously may cause a non-stationarity problem, the method learns the representation before the reinforcement learning process, so the optimal driving strategy can be obtained more quickly.
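A schematic DAgger-style collection loop; the expert, learner and environment interfaces are placeholders, and this sketch only illustrates the data-aggregation idea rather than the exact procedure used in the embodiment.

```python
def dagger_collect(env, expert, learner, encoder, iterations=5, horizon=500):
    """Aggregate expert-labelled states that are visited under the learner's own policy."""
    dataset = []
    for _ in range(iterations):
        obs = env.reset()
        for _ in range(horizon):
            state = encoder(obs)
            dataset.append((state, expert.act(obs)))         # expert labels the visited state
            obs, _, done, _ = env.step(learner.act(state))   # but the learner drives
            if done:
                break
        learner.fit(dataset)                                 # retrain on aggregated data
    return dataset
```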
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, Python, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A method of training an end-to-end autopilot strategy, comprising the steps of:
inputting high-dimensional visual information reflecting the driving environment into a pre-trained representation network and automatically learning low-dimensional information, wherein the representation network performs supervised learning on collected teaching data, and the low-dimensional information consists of abstract features strongly correlated with the automatic driving task;
constructing a reinforcement learning method that explores in the abstract feature space to obtain an optimized driving strategy, wherein the reinforcement learning process is implemented as a discrete-time Markov decision process: in state s_t, the agent obtains an observation o_t through the low-dimensional representation produced by the pre-trained representation network, takes a behavioral action a_t based on the strategy π(a_t|s_t), obtains the reward signal r_t, and then transitions to the next state s_{t+1}; the goal of reinforcement learning is to maximize the expected long-term return Q^π, expressed as:

Q^π = E[ Σ_t γ^t r_t ]

where γ is an attenuation factor with value between 0 and 1 and t denotes the time step.
2. The method of claim 1, wherein the teaching data comprise expert control data for the accelerator, brake and steering wheel, and longitudinal proportional-integral-derivative control and lateral proportional-integral-derivative control are adopted to reach target track points generated by the reinforcement learning method, so as to imitate the expert's control of speed and steering angle.
3. The method of claim 1, wherein a soft actor-critic algorithm is employed for reinforcement learning, evaluating a state value function and a state-behavior value function to maximize the expected return and obtain the optimized strategy, wherein:

the state value function V is set to:

V^π(s_t) = E_{a_t~π}[ Q^π(s_t, a_t) ]

the state-behavior value function Q is set to:

Q^π(s_t, a_t) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t, a_t ]

the reward function is set to:

r_t = r_v + 0.05 r_step + 10 r_col + 0.8 r_safe

where Q^π is the long-term return value function, r_v is the traffic efficiency, r_step is a constant step penalty, r_col is a penalty for a collision, and r_safe is a safety control item.
4. The method of claim 3, wherein the safety control item r_safe is defined in terms of an adjustable scaling factor λ_s, reward factors r_1 and r_2, distances d_1 and d_2 to the target track point, and a speed controller v_safe, and the traffic efficiency r_v is:

r_v = v + 2(v_max - v)

where v_max is the speed limit, v_safe is defined as v_safe = v[1 - (v ≤ v_min, a_t ≤ 0)], v_min is a lower speed threshold, and a_t is a behavioral action.
5. The method of claim 4, wherein the optimization objective of the soft actor-critic algorithm is expressed as:

J(π) = E_{τ~p_π(τ)}[ Σ_{t=0}^{T} r(s_t, a_t) + α H(π(·|s_t)) ]

where α is a set parameter that determines the importance of the entropy term relative to the reward, τ denotes a trajectory in the strategy space, T denotes the total number of iterations, and p_π(τ) denotes the distribution function of the strategy.
6. The method of claim 3, wherein the soft actor-critic algorithm learns a policy network π_φ, two Q networks Q_{θ1} and Q_{θ2}, and a value function network V_ψ; the Q networks are updated with the following loss function:

J_Q(θ) = E_{(s_t,a_t)~D}[ (Q_θ(s_t, a_t) - (r(s_t, a_t) + γ V_target(s_{t+1})))^2 ]

the value function network V_ψ is updated with the following formula:

J_V(ψ) = E_{s_t~D}[ (1/2) ( V_ψ(s_t) - E_{a_t~π_φ}[ Q_θ(s_t, a_t) - α log π_φ(a_t|s_t) ] )^2 ]

and the policy network update is expressed as:

J_π(φ) = E_{s_t~D, a_t~π_φ}[ α log π_φ(a_t|s_t) - Q_θ(s_t, a_t) ]

where V_target denotes the target value function network, α is a non-negative parameter controlling the entropy, and the behavioral action a_t is obtained from the current policy π_φ.
7. The method of claim 2, wherein the longitudinal proportional-integral-derivative control imitates the expert's speed v* by controlling throttle and brake, this speed being related to the average speed of the vehicle over the trajectory points and expressed as:

v* = (1/K) Σ_{i=i_0}^{i_0+K-1} ||w_{i+1} - w_i|| / Δt

where w denotes an expected trajectory point generated by reinforcement learning, v* is the expected speed, Δt is the time interval between two trajectory points, i_0 is the starting position, K is the total number of trajectory points, and i is the serial number of the current target trajectory point; the longitudinal PID controller uses throttle and brake to reduce the difference between the current speed v and the expected speed v*.
8. The method of claim 2, wherein the lateral proportional-integral-derivative control is used to realize control of the desired steering angle s*, and the steering angle needs to satisfy:

s* = tan^{-1}(w_y / w_x)

where w_y is the lateral distance from the current position to the next target track point and w_x is the longitudinal distance from the current position to the next target track point.
9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
10. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method of any one of claims 1 to 8.
CN202111480162.6A 2021-12-06 2021-12-06 Method for training end-to-end automatic driving strategy Pending CN114358128A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111480162.6A CN114358128A (en) 2021-12-06 2021-12-06 Method for training end-to-end automatic driving strategy
PCT/CN2021/137801 WO2023102962A1 (en) 2021-12-06 2021-12-14 Method for training end-to-end autonomous driving strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111480162.6A CN114358128A (en) 2021-12-06 2021-12-06 Method for training end-to-end automatic driving strategy

Publications (1)

Publication Number Publication Date
CN114358128A true CN114358128A (en) 2022-04-15

Family

ID=81097622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111480162.6A Pending CN114358128A (en) 2021-12-06 2021-12-06 Method for training end-to-end automatic driving strategy

Country Status (2)

Country Link
CN (1) CN114358128A (en)
WO (1) WO2023102962A1 (en)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596060B (en) * 2023-07-19 2024-03-15 深圳须弥云图空间科技有限公司 Deep reinforcement learning model training method and device, electronic equipment and storage medium
CN116946162B (en) * 2023-09-19 2023-12-15 东南大学 Intelligent network combined commercial vehicle safe driving decision-making method considering road surface attachment condition
CN117725985B (en) * 2024-02-06 2024-05-24 之江实验室 Reinforced learning model training and service executing method and device and electronic equipment
CN117911414A (en) * 2024-03-20 2024-04-19 安徽大学 Automatic driving automobile motion control method based on reinforcement learning
CN117933346A (en) * 2024-03-25 2024-04-26 之江实验室 Instant rewarding learning method based on self-supervision reinforcement learning


Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10308242B2 (en) * 2017-07-01 2019-06-04 TuSimple System and method for using human driving patterns to detect and correct abnormal driving behaviors of autonomous vehicles
CN108594804B (en) * 2018-03-12 2021-06-18 苏州大学 Automatic driving control method for distribution trolley based on deep Q network
CN110046712A (en) * 2019-04-04 2019-07-23 天津科技大学 Decision search learning method is modeled based on the latent space for generating model
CN111950691A (en) * 2019-05-15 2020-11-17 天津科技大学 Reinforced learning strategy learning method based on potential action representation space
JP2021135770A (en) * 2020-02-27 2021-09-13 ソニーグループ株式会社 Information processing apparatus and information processing method, computer program, as well as observation device
CN111460891B (en) * 2020-03-01 2023-05-26 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Automatic driving-oriented vehicle-road cooperative pedestrian re-identification method and system
WO2021226921A1 (en) * 2020-05-14 2021-11-18 Harman International Industries, Incorporated Method and system of data processing for autonomous driving
CN113561986B (en) * 2021-08-18 2024-03-15 武汉理工大学 Automatic driving automobile decision making method and device
CN113657292A (en) * 2021-08-19 2021-11-16 东南大学 Vehicle automatic tracking driving method based on deep reinforcement learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10792810B1 (en) * 2017-12-14 2020-10-06 Amazon Technologies, Inc. Artificial intelligence system for learning robotic control policies
CN110824912A (en) * 2018-08-08 2020-02-21 华为技术有限公司 Method and apparatus for training a control strategy model for generating an autonomous driving strategy
US10503174B1 (en) * 2019-01-31 2019-12-10 StradVision, Inc. Method and device for optimized resource allocation in autonomous driving on the basis of reinforcement learning using data from lidar, radar, and camera sensor
CN111679660A (en) * 2020-06-16 2020-09-18 中国科学院深圳先进技术研究院 Unmanned deep reinforcement learning method integrating human-like driving behaviors

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114770523A (en) * 2022-05-31 2022-07-22 苏州大学 Robot control method based on offline environment interaction
CN114770523B (en) * 2022-05-31 2023-09-15 苏州大学 Robot control method based on offline environment interaction
CN116881707A (en) * 2023-03-17 2023-10-13 北京百度网讯科技有限公司 Automatic driving model, training method, training device and vehicle
CN116149338A (en) * 2023-04-14 2023-05-23 哈尔滨工业大学人工智能研究院有限公司 Automatic driving control method, system and sprayer

Also Published As

Publication number Publication date
WO2023102962A1 (en) 2023-06-15

Similar Documents

Publication Publication Date Title
CN114358128A (en) Method for training end-to-end automatic driving strategy
US11663441B2 (en) Action selection neural network training using imitation learning in latent space
CN107229973B (en) Method and device for generating strategy network model for automatic vehicle driving
WO2022052406A1 (en) Automatic driving training method, apparatus and device, and medium
CN112232490B (en) Visual-based depth simulation reinforcement learning driving strategy training method
US11899748B2 (en) System, method, and apparatus for a neural network model for a vehicle
CN111231983B (en) Vehicle control method, device and equipment based on traffic accident memory network
Chen et al. Driving maneuvers prediction based autonomous driving control by deep Monte Carlo tree search
WO2020052480A1 (en) Unmanned driving behaviour decision making and model training
CN111260027A (en) Intelligent agent automatic decision-making method based on reinforcement learning
CN112487954B (en) Pedestrian crossing behavior prediction method for plane intersection
Darapaneni et al. Autonomous car driving using deep learning
JP2022547611A (en) Simulation of various long-term future trajectories in road scenes
CN116595871A (en) Vehicle track prediction modeling method and device based on dynamic space-time interaction diagram
CN112334914A (en) Mock learning using generative lead neural networks
CN116227620A (en) Method for determining similar scenes, training method and training controller
US20230061411A1 (en) Autoregressively generating sequences of data elements defining actions to be performed by an agent
CN112947466B (en) Parallel planning method and equipment for automatic driving and storage medium
CN112198794A (en) Unmanned driving method based on human-like driving rule and improved depth certainty strategy gradient
Wang et al. An End-to-End Deep Reinforcement Learning Model Based on Proximal Policy Optimization Algorithm for Autonomous Driving of Off-Road Vehicle
Carton Exploration of reinforcement learning algorithms for autonomous vehicle visual perception and control
US20230365146A1 (en) Selection-inference neural network systems
KR20240035565A (en) Autoregressively generates sequences of data elements that define the actions to be performed by the agent.
WO2023050048A1 (en) Method and apparatus for simulating environment for performing task
Ge et al. Urban Driving Based on Condition Imitation Learning and Multi-Period Information Fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination