CN112508164A - End-to-end automatic driving model pre-training method based on asynchronous supervised learning - Google Patents

End-to-end automatic driving model pre-training method based on asynchronous supervised learning

Info

Publication number
CN112508164A
Authority
CN
China
Prior art keywords
vehicle
training
model
strategy
automatic driving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010727803.2A
Other languages
Chinese (zh)
Other versions
CN112508164B (en)
Inventor
田大新
郑坤贤
段续庭
周建山
韩旭
郎平
林椿眄
赵元昊
郝威
龙科军
刘赫
拱印生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202010727803.2A
Publication of CN112508164A
Application granted
Publication of CN112508164B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0001 Details of the control system
    • B60W2050/0002 Automatic control, details of type of controller or control system architecture

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Traffic Control Systems (AREA)

Abstract

An end-to-end automatic driving model pre-training method based on asynchronous supervised learning executes multiple supervised learning processes asynchronously and in parallel on multiple demonstration data sets collected by real vehicles, which improves the stability of the supervised learning process and accelerates the convergence of pre-training. After the end-to-end automatic driving model has been pre-trained with expert demonstration data collected from the real world, its initial performance in the subsequent real-vehicle reinforcement learning stage is improved and its convergence is accelerated. In addition, the invention provides a visual analysis method for the end-to-end automatic driving model training process, which analyzes, from a microscopic perspective, the performance improvement brought by the asynchronous-supervised-learning-based pre-training method. The invention also designs a multi-vehicle distributed reinforcement-learning-driven automatic driving model training system, which is used to collect expert demonstration data and to verify the feasibility of applying the pre-training method in the real world.

Description

End-to-end automatic driving model pre-training method based on asynchronous supervised learning
Technical Field
The invention relates to the field of traffic, in particular to an end-to-end model pre-training method for an automatic driving vehicle.
Background Art
Current automated driving faces a significant challenge: the traditional automatic driving system is too large and complex in structure. In order to perfect the system so that it can meet the requirements of different working conditions, a traditional automatic driving system cannot avoid a huge and complicated structure caused by ever more elaborate logic. Such an over-complex traditional system faces three problems: algorithm bloat, limited performance, and decision conflicts:
(1) Algorithm bloat: a traditional automatic driving system relies on a manually defined rule base to enumerate the driving states of the unmanned vehicle, and the scale of the algorithm keeps growing as driving scenarios become more numerous and complex;
(2) Limited performance: the system architecture imposes bottlenecks on the depth of scenario coverage and on decision accuracy, so complex working conditions are difficult to handle;
(3) Decision conflicts: a traditional automatic driving system uses a finite-state machine to switch driving behaviors between states, and the state partition of a finite-state machine requires clear boundary conditions. In reality, 'grey zones' exist between driving behaviors, i.e., more than one reasonable behavior may be available in the same scenario, which leads to conflicting driving states.
The widespread success of Deep Reinforcement Learning (Deep RL) has made it increasingly popular for training end-to-end automatic driving models. Learning-based algorithms abandon the hierarchical structure of rule-based algorithms; they are simpler and more direct and greatly simplify the structure of the decision system. During Deep RL model training, the mapping between environment states and optimal actions can be established with very little prior knowledge through the cyclic process of observing the state, executing an action, and receiving a reward. However, because of this lack of prior knowledge, the initial performance of Deep RL is poor, so training an automatic driving model that can actually be deployed takes a long time (it requires too much real-world experience). In a simulation environment, the poor initial performance of a Deep RL model can be tolerated. However, if autonomous vehicles based on Deep RL models are to operate routinely in the real world, they must inevitably be trained in the real world with real vehicles. In this case, poor initial performance means that real vehicles may collide frequently, or that training may be interrupted by frequent human interventions to avoid danger, which greatly increases the workload of testers and the risk during training. Therefore, in order to deploy a Deep RL-based end-to-end automatic driving model on an actual autonomous vehicle, the problem of poor initial performance of the Deep RL model must be solved.
The invention introduces prior knowledge into Deep RL model training to solve the problem of poor initial performance when the model is trained in the real world. The invention provides an asynchronous supervised learning method for continuous-action Deep RL models, which executes multiple supervised learning processes in parallel and asynchronously on multiple training data sets collected from the real world. By running different supervised learning processes in different threads, multiple agents update the model parameters online in parallel and asynchronously; compared with the parameter updates of a single agent, the temporal correlation of strategy exploration is greatly reduced, so the supervised learning process is more stable. To avoid collecting human expert driving demonstrations, which is time-consuming and labor-intensive, the invention also uses a Manually Designed Heuristic Driving Strategy (MDHDP) to drive the vehicle and generate high-reward experience data as expert demonstrations, from which the supervised learning training data set is formed. To visually analyze the improvement brought by the pre-training process from a microscopic perspective, the invention provides a visualization method suitable for continuous-action Deep RL models; such visual analysis is important for testing and verifying neural network models with continuous outputs. Finally, the invention designs a multi-vehicle distributed reinforcement-learning-driven automatic driving model training system, which is used to collect expert demonstration data and to verify the feasibility of applying the pre-training method in the real world.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an end-to-end automatic driving model pre-training method that addresses the poor initial performance and slow convergence encountered when a reinforcement-learning-driven end-to-end automatic driving model is trained in the real world.
The technical solution adopted by the invention is as follows: an end-to-end automatic driving model pre-training method based on asynchronous supervised learning is designed, in which multiple supervised learning processes are executed asynchronously and in parallel on multiple demonstration data sets collected by real vehicles, so that the stability of the supervised learning process is improved and the convergence of the pre-training process is accelerated. After the end-to-end automatic driving model has been pre-trained with expert demonstration data collected from the real world, its initial performance in the subsequent real-vehicle reinforcement learning stage is improved and its convergence is accelerated. In addition, the invention provides a visual analysis method for the end-to-end automatic driving model training process, which analyzes the effectiveness of the pre-training method from a microscopic perspective. To avoid collecting human expert driving demonstrations, which is time-consuming and labor-intensive, the invention also uses a Manually Designed Heuristic Driving Strategy (MDHDP) to drive the vehicle and generate high-reward experience data as expert demonstrations, from which the supervised learning training data set is formed. Finally, to verify the feasibility of applying the pre-training method in the real world, the invention also designs a matching multi-vehicle distributed reinforcement-learning-driven automatic driving model training system.
An end-to-end automatic driving model pre-training method based on asynchronous supervised learning is characterized in that multiple supervised learning processes are executed asynchronously and in parallel (i.e., asynchronous supervised learning) on multiple demonstration data sets collected by real vehicles, so that the stability of the supervised learning process is improved and the convergence of the pre-training process is accelerated.
The demonstration data set is collected by a data collection vehicle i driven by a manually designed heuristic strategy π'_i, which is designed on the basis of preview theory and determines the wheel steering angle and the brake/throttle amount as follows.
The wheel steering angle is determined by collection vehicle i from its current speed v_it and the position of the preceding vehicle, in the following specific steps:
(1) Collection vehicle i determines the preview point F(x'_it, y'_it) from its current speed v_it and position E(x_it, y_it) by
l_EF = L + v_it × Δt,
where L is a fixed preview distance and Δt is the preview coefficient;
(2) The steering angle pointing at the preview point F is calculated with respect to the center of collection vehicle i (formula given as an image in the original);
(3) The steering angle is corrected according to the position and speed of the preceding vehicle j to avoid collision (correction formula given as an image in the original), where W = D/C, D is the lateral distance between the preceding vehicle j and collection vehicle i, and C is the side-collision safety threshold.
The brake and throttle amounts are determined by collection vehicle i from its current speed v_it, the speed limit of the current road section r_t, and the distance d_it to the preceding vehicle j, in the following specific steps:
(1) Determine the speed limit of the current road section r_t (formula given as an image in the original), where g is the acceleration of gravity, u is the friction coefficient, M is the vehicle mass, CA is the downforce coefficient, and ρ is the curvature of road section r_t;
(2) When the vehicle speed v_it does not exceed the speed limit, the throttle amount is increased; when v_it exceeds the speed limit or the distance to the preceding vehicle j is less than the forward-collision safety threshold, the braking amount is increased.
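For illustration only, the following Python sketch shows how such a preview-based heuristic strategy could be implemented. The arctangent steering rule toward the preview point, the simplified throttle/brake logic, and all function and parameter names are assumptions introduced here; the patent gives the exact steering and speed-limit formulas only as images.

```python
# A minimal sketch of a preview-based heuristic driving strategy in the spirit of
# the MDHDP described above. The steering law, the thresholds, and all names here
# are illustrative assumptions.
import math

def preview_point(x, y, heading, v, L=2.0, dt=0.5):
    """Preview point F at distance l_EF = L + v * dt ahead along the heading (assumed geometry)."""
    l_ef = L + v * dt
    return x + l_ef * math.cos(heading), y + l_ef * math.sin(heading)

def heuristic_action(x, y, heading, v, preview_pt, lead_dx, lead_lat, speed_limit,
                     side_threshold=0.5, forward_threshold=5.0):
    """Return (steering in [-1, 1], throttle, brake) for collection vehicle i."""
    fx, fy = preview_pt
    # Steering: angle from the vehicle center toward the preview point F, relative to heading.
    angle = math.atan2(fy - y, fx - x) - heading
    steering = max(-1.0, min(1.0, angle / (math.pi / 4)))
    # Side-collision correction: damp the steering when the lateral gap to vehicle j is small.
    w = abs(lead_lat) / side_threshold            # W = D / C
    if w < 1.0:
        steering *= w
    # Brake/throttle: increase throttle below the speed limit, brake otherwise or when too close.
    if v <= speed_limit and lead_dx > forward_threshold:
        throttle, brake = 0.5, 0.0
    else:
        throttle, brake = 0.0, 0.5
    return steering, throttle, brake
```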
The pre-training process is defined by the quintuple (state, action, loss function, state transition function, discount coefficient) as follows:
State: the set of time-varying environment states collected by demonstration vehicle i, where s_n denotes the state of the n-th experience in the pre-training demonstration data set Ω and consists of 4 consecutive single-channel images captured by the front camera;
Action: the set of demonstrated driving actions captured by demonstration vehicle i, where a_n denotes the demonstrated action (wheel steering angle) of the n-th experience in Ω;
Loss function: l_n denotes the loss of the n-th experience in Ω, computed from the demonstrated action a_n corresponding to s_n and the action μ(s_n) output by the pre-trained model for input s_n (loss formula given as an image in the original);
State transition function: given state s_n and action a_n (assuming n corresponds to the t-th time slot), the probability that the system transitions to state s_{n+1} in the next time slot is expressed as P(s_{n+1} | s_n, a_n);
Discount coefficient γ: γ ∈ [0, 1], used to balance current and long-term losses.
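For illustration, one experience of the demonstration data set Ω can be represented by a simple record such as the following; the field names, array types, and image resolution are assumptions and are not specified in the patent.

```python
# A minimal sketch of how experiences in the demonstration data set Ω could be stored.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class DemoExperience:
    state: np.ndarray       # s_n: 4 stacked single-channel front-camera frames, e.g. shape (4, 84, 84)
    action: float           # a_n: demonstrated wheel steering angle
    next_state: np.ndarray  # s_{n+1}: state observed in the following time slot

class DemoDataset:
    """Pre-training demonstration data set Ω collected by one demonstration vehicle."""
    def __init__(self):
        self.experiences: List[DemoExperience] = []

    def add(self, state, action, next_state):
        self.experiences.append(DemoExperience(state, action, next_state))

    def __len__(self):
        return len(self.experiences)
```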
The pre-training process comprises the following steps:
(1) A random strategy π_i is given, which outputs a probability distribution over actions a_n after the state s_n is input;
(2) The expected total loss function is derived, denoting the total loss incurred from the current state s_n until the final state when strategy π_i is always followed (formula given as an image in the original);
(3) The random-exploration total loss function is derived: if the agent in state s_n does not execute the action prescribed by strategy π_i but executes another action a, while still following strategy π_i in subsequent states, the corresponding expected total loss is obtained (formula given as an image in the original);
(4) The advantage function is derived, representing the advantage of an action a outside the random exploration strategy π_i (formula given as an image in the original);
(5) The problem formulation is determined: given the current state s_n, find the optimal strategy π_i* by minimizing the advantage function, so as to minimize the expected total loss function; when the exploration process converges, π_i* satisfies the corresponding optimality condition (given as an image in the original), where Π is the set of random strategies.
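The expected total loss, random-exploration loss, and advantage function in steps (2) to (5) appear only as formula images in the source text. For reference, a standard loss-based actor-critic formulation that is consistent with the surrounding definitions would take the following form; this is an assumed reconstruction, and the exact expressions in the patent may differ.

```latex
% Assumed standard forms; the patent's exact formulas are not reproduced in the text.
\begin{aligned}
  V^{\pi_i}(s_n) &= \mathbb{E}_{\pi_i}\Big[\textstyle\sum_{k\ge 0}\gamma^{k}\, l_{n+k} \,\Big|\, s_n\Big]
    && \text{(expected total loss when } \pi_i \text{ is always followed)}\\
  Q^{\pi_i}(s_n, a) &= \mathbb{E}\big[\, l_n + \gamma\, V^{\pi_i}(s_{n+1}) \,\big|\, s_n, a \,\big]
    && \text{(total loss when } a \text{ is taken once, then } \pi_i \text{ is followed)}\\
  A^{\pi_i}(s_n, a) &= Q^{\pi_i}(s_n, a) - V^{\pi_i}(s_n)
    && \text{(advantage of action } a\text{)}\\
  \pi_i^{*} &= \arg\min_{\pi_i \in \Pi} V^{\pi_i}(s_n)
    && \text{(optimal strategy at convergence)}
\end{aligned}
```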
The asynchronous supervised learning introduces an actor-critic neural network as a non-linear function estimator to predict the random action strategy π(a|s; θ) and the expected total loss function V(s; θ_v), so as to solve the problem formulation, where θ and θ_v are the parameters of the actor and critic networks and are updated as follows (update rules given as images in the original); θ' and θ'_v denote thread-specific parameters, and θ and θ_v are globally shared parameters.
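For illustration, the asynchronous update scheme can be sketched as follows: several worker processes each run a supervised learning process on their own demonstration data set and push gradients into the globally shared parameters θ and θ_v, in the style of asynchronous advantage actor-critic implementations. The stand-in network, optimizer, loss, and data interface shown here are assumptions, not the patent's implementation.

```python
# A minimal sketch of asynchronous supervised pre-training: each worker computes a
# supervised loss on its own demonstration data set and applies the gradients to the
# globally shared parameters (θ, θ_v).
import torch
import torch.nn as nn
import torch.multiprocessing as mp

class TinyDrivingNet(nn.Module):
    """Stand-in for the FIG. 1 model: actor mean/log-std heads plus a critic value head."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Flatten(), nn.Linear(4 * 84 * 84, 256), nn.ReLU())
        self.mu = nn.Linear(256, 1)
        self.log_sigma = nn.Linear(256, 1)
        self.value = nn.Linear(256, 1)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_sigma(h), self.value(h)

def worker(global_model, demo_batches, lr=1e-4):
    """One supervised learning process: imitate the demonstrated steering actions."""
    local_model = TinyDrivingNet()                           # thread-specific parameters θ', θ'_v
    optimizer = torch.optim.Adam(global_model.parameters(), lr=lr)
    for states, actions in demo_batches:                     # one demonstration data set
        local_model.load_state_dict(global_model.state_dict())  # sync θ' from the global θ
        mu, log_sigma, _value = local_model(states)
        dist = torch.distributions.Normal(mu, log_sigma.exp())
        loss = -dist.log_prob(actions).mean()                # supervised imitation loss
        optimizer.zero_grad()
        loss.backward()
        for gp, lp in zip(global_model.parameters(), local_model.parameters()):
            gp.grad = lp.grad.clone()                        # push local gradients to θ, θ_v
        optimizer.step()

if __name__ == "__main__":
    shared_model = TinyDrivingNet()
    shared_model.share_memory()                              # globally shared θ, θ_v
    # `demo_loaders` is a hypothetical list of per-vehicle (states, actions) batch iterables.
    demo_loaders = [[(torch.rand(8, 4, 84, 84), torch.rand(8, 1) * 2 - 1)] for _ in range(4)]
    workers = [mp.Process(target=worker, args=(shared_model, loader)) for loader in demo_loaders]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```

A shared optimizer and gradient clipping are common refinements of this scheme; they are omitted here for brevity.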
For the visual analysis method: specifically, with all other pixels of the input image held unchanged, the value of a certain pixel o is changed by an amount Δo, and the effect of this change on the output of a given layer of the neural network is computed from the weight and bias parameters of that layer (formulas given as images in the original). The influence of every pixel of the input image on the final output of the model is obtained in this way and plotted as the attention heat map of the end-to-end automatic driving model.
The number of pixels in the attention heat map of the end-to-end automatic driving model is the same as the number of pixels in the input image. Image regions that strongly influence the model output are highlighted in the heat map, so one can check from a microscopic perspective whether the regions the model attends to are relevant to the driving decision, thereby verifying the effectiveness of model training.
A multi-vehicle distributed reinforcement-learning-driven automatic driving model training system is composed of several robot vehicles, building models, a road-surface map, and the like, and comprises a strategy learning scene, a strategy verification scene, and a UWB positioning-reinforcement learning reward system. The robot vehicles in the strategy learning scene act as learners of the driving strategy during reinforcement learning training; multiple vehicles explore the environment in a distributed manner and execute the actor-critic network parameter update tasks in parallel and asynchronously.
The robot vehicle in the strategy verification scene inherits the global driving strategy updated in parallel and asynchronously by the other vehicles and drives in the strategy verification scene; the UWB positioning-reinforcement learning reward system gives rewards, and the strategy verification vehicle records the score.
The UWB positioning-reinforcement learning reward system determines the position of each vehicle from the UWB positioning tag mounted on the robot vehicle and, according to the reinforcement learning reward function, gives the rewards obtained in real time during strategy learning and strategy verification in the course of the vehicles' reinforcement learning training.
Compared with the prior art, the invention has the following advantages and positive effects. Aiming at the poor initial performance and slow convergence of existing reinforcement-learning-driven end-to-end automatic driving models, the invention proposes a set of methods for end-to-end automatic driving model pre-training, effect analysis, and real-world verification, with the asynchronous-supervised-learning-based pre-training method at its core. This resolves the difficulty of deploying reinforcement-learning-driven end-to-end automatic driving models in practice, greatly promotes the development of learning-driven end-to-end automatic driving technology, and supports the development of automatic driving technology in China. In summary, the method is of great significance for improving the overall performance of vehicle end-to-end automatic driving systems.
Drawings
FIG. 1 is a diagram of an end-to-end autopilot model architecture;
FIG. 2 is a diagram of the theoretical architecture of an asynchronous supervised learning approach;
FIG. 3 is a diagram of a multi-vehicle distributed reinforcement learning driven autopilot model training system architecture;
FIG. 4 is an example of the visualization analysis method.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents also fall within the scope defined by the appended claims. The end-to-end automatic driving model used by the invention is shown in FIG. 1. The input of the model is four preprocessed driving images captured by the front camera. The first convolutional layer contains 32 8×8 convolution kernels with stride 4, the following convolutional layer contains 32 4×4 convolution kernels with stride 2, and the next convolutional layer contains 32 3×3 convolution kernels with stride 1; finally there is a fully connected layer with 256 hidden units. Each of these four hidden layers is followed by a rectified linear unit (ReLU) activation. The neural network in FIG. 1 has two groups of outputs: two linear outputs representing the mean and variance of the normal distribution of the model output action, and one linear output representing the value function.
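For illustration, a minimal PyTorch sketch of this network is given below. The framework, the 84×84 input resolution, and the use of a log-standard-deviation head for the variance output are assumptions; the patent specifies only the layer sizes and the two output groups.

```python
# A minimal PyTorch sketch of the FIG. 1 network described above.
import torch
import torch.nn as nn

class EndToEndDrivingModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: 4 stacked single-channel front-camera frames (assumed 4 x 84 x 84).
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 32 kernels, 8x8, stride 4
            nn.Conv2d(32, 32, kernel_size=4, stride=2), nn.ReLU(),  # 32 kernels, 4x4, stride 2
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),  # 32 kernels, 3x3, stride 1
            nn.Flatten(),
        )
        self.fc = nn.Sequential(nn.Linear(32 * 7 * 7, 256), nn.ReLU())  # 256 hidden units
        # Two linear outputs for the action distribution, one linear output for the value.
        self.mu = nn.Linear(256, 1)         # mean of the normal distribution over the action
        self.log_sigma = nn.Linear(256, 1)  # spread of that distribution (log-std assumed)
        self.value = nn.Linear(256, 1)      # value-function output

    def forward(self, x):
        h = self.fc(self.features(x))
        return self.mu(h), self.log_sigma(h), self.value(h)

# Example: a batch of two stacked-frame observations.
mu, log_sigma, value = EndToEndDrivingModel()(torch.rand(2, 4, 84, 84))
```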
Pre-training process modeling
The pre-training process based on asynchronous supervised learning is defined by the quintuple (state, action, loss function, state transition function, discount coefficient) as follows:
State: the set of time-varying environment states collected by demonstration vehicle i, where s_n denotes the state of the n-th experience in the pre-training demonstration data set Ω and consists of 4 consecutive single-channel images captured by the front camera;
Action: the set of demonstrated driving actions captured by demonstration vehicle i, where a_n denotes the demonstrated action (wheel steering angle) of the n-th experience in Ω;
Loss function: l_n denotes the loss of the n-th experience in Ω, computed from the demonstrated action a_n corresponding to s_n and the action μ(s_n) output by the pre-trained model for input s_n (loss formula given as an image in the original);
State transition function: given state s_n and action a_n (assuming n corresponds to the t-th time slot), the probability that the system transitions to state s_{n+1} in the next time slot is expressed as P(s_{n+1} | s_n, a_n);
Discount coefficient γ: γ ∈ [0, 1], used to balance current and long-term losses.
Problem formula derivation
According to the pre-training process model, we further derive a problem formula of the pre-training process:
(1) A random strategy π_i is given, which outputs a probability distribution over actions a_n after the state s_n is input;
(2) The expected total loss function is derived, denoting the total loss incurred from the current state s_n until the final state when strategy π_i is always followed (formula given as an image in the original);
(3) The random-exploration total loss function is derived: if the agent in state s_n does not execute the action prescribed by strategy π_i but executes another action a, while still following strategy π_i in subsequent states, the corresponding expected total loss is obtained (formula given as an image in the original);
(4) The advantage function is derived, representing the advantage of an action a outside the random exploration strategy π_i (formula given as an image in the original);
(5) The problem formulation is determined: given the current state s_n, find the optimal strategy π_i* by minimizing the advantage function, so as to minimize the expected total loss function; when the exploration process converges, π_i* satisfies the corresponding optimality condition (given as an image in the original), where Π is the set of random strategies.
Pre-training demonstration data acquisition
The demonstration data set used for pre-training is collected by a data collection vehicle i driven by a manually designed heuristic strategy π'_i, which is designed on the basis of preview theory as follows.
Wheel steering: collection vehicle i determines the wheel steering angle from its current speed v_it and the position of the preceding vehicle:
(1) Collection vehicle i determines the preview point F(x'_it, y'_it) from its current speed v_it and position E(x_it, y_it) by
l_EF = L + v_it × Δt,
where L is a fixed preview distance and Δt is the preview coefficient;
(2) The steering angle pointing at the preview point F is calculated with respect to the center of collection vehicle i (formula given as an image in the original);
(3) The steering angle is corrected according to the position and speed of the preceding vehicle j to avoid collision (correction formula given as an image in the original), where W = D/C, D is the lateral distance between the preceding vehicle j and collection vehicle i, and C is the side-collision safety threshold.
Brake/throttle: collection vehicle i determines the brake and throttle amounts from its current speed v_it, the speed limit of the current road section r_t, and the distance d_it to the preceding vehicle j:
(1) Determine the speed limit of the current road section r_t (formula given as an image in the original), where g is the acceleration of gravity, u is the friction coefficient, M is the vehicle mass, CA is the downforce coefficient, and ρ is the curvature of road section r_t;
(2) When the vehicle speed v_it does not exceed the speed limit, the throttle amount is increased; when v_it exceeds the speed limit or the distance to the preceding vehicle j is less than the forward-collision safety threshold, the braking amount is increased.
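For illustration, a minimal sketch of assembling the pre-training demonstration data set Ω by rolling out the heuristic strategy is shown below; the vehicle interface used here (get_camera_frame, apply_control, and so on) is entirely hypothetical, since the patent does not specify the real-vehicle API.

```python
# Sketch of demonstration-data collection: drive the collection vehicle with the
# heuristic strategy and record (state, action) pairs for supervised pre-training.
from collections import deque

def collect_demonstrations(vehicle, heuristic_strategy, num_steps, frame_stack=4):
    frames = deque(maxlen=frame_stack)
    dataset = []                                   # pre-training demonstration data set Ω
    for _ in range(num_steps):
        frames.append(vehicle.get_camera_frame())  # single-channel front-camera image
        if len(frames) < frame_stack:
            continue                               # wait until 4 consecutive frames exist
        state = list(frames)                       # s_n: 4 stacked frames
        steering, throttle, brake = heuristic_strategy(vehicle)  # action given by π'_i
        vehicle.apply_control(steering, throttle, brake)
        dataset.append((state, steering))          # store (s_n, a_n)
    return dataset
```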
Asynchronous supervised learning method
To solve the above problem formulation, we introduce an actor-critic neural network as a non-linear function estimator to predict the random action strategy π(a|s; θ) and the expected total loss function V(s; θ_v), where θ and θ_v are the parameters of the actor and critic networks and are updated as follows (update rules given as images in the original); θ' and θ'_v denote thread-specific parameters, and θ and θ_v are globally shared parameters. Executing multiple supervised learning processes in parallel and asynchronously on multiple pre-training demonstration data sets in this way constitutes the asynchronous supervised learning method.
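For illustration, one supervised update step inside a worker thread might look as follows. The A3C-style weighting of the action log-probability by (R - V), the discounted-loss return target R, and the 0.5 critic coefficient are assumptions; the patent gives the update rules only as images.

```python
# One assumed supervised update for the actor (θ) and critic (θ_v) heads.
import torch

def supervised_update(model, optimizer, states, demo_actions, returns):
    """states: (B, 4, 84, 84); demo_actions, returns: (B, 1). Returns the scalar loss."""
    mu, log_sigma, value = model(states)
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    advantage = returns - value                            # (R - V(s; θ_v))
    actor_loss = -(dist.log_prob(demo_actions) * advantage.detach()).mean()
    critic_loss = advantage.pow(2).mean()                  # fit V to the return target
    loss = actor_loss + 0.5 * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```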
Visual analysis
The visual analysis method is designed on the basis of single-factor (univariate) analysis. Specifically, with all other pixels of the input image held unchanged, the value of a certain pixel o is changed by an amount Δo, and the effect of this change on the output of a given layer of the neural network is computed from the weight and bias parameters of that layer (formulas given as images in the original). The influence of every pixel of the input image on the final output of the model is obtained in this way and plotted as the attention heat map of the end-to-end automatic driving model. The number of pixels in the attention heat map is the same as the number of pixels in the input image; image regions that strongly influence the model output are highlighted in the heat map, so one can check from a microscopic perspective whether the regions the model attends to are relevant to the driving decision, thereby verifying the effectiveness of model training. For example, if the highlighted regions fall on the sky or roadside buildings in the input image, it can be inferred that the model training is flawed; conversely, if the highlighted regions fall on the road surface, other vehicles, and similar elements, the training can be considered effective.
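For illustration, the single-factor perturbation procedure described above can be sketched as follows; the perturbation magnitude, the choice of the steering mean as the probed output, and the model interface are assumptions.

```python
# A minimal sketch of the single-factor visualization: perturb one pixel at a time,
# measure the change in the model's steering output, and collect the magnitudes
# into an attention heat map.
import numpy as np
import torch

def attention_heatmap(model, frames, delta=0.1):
    """frames: np.ndarray of shape (4, H, W) in [0, 1]; returns an (H, W) heat map."""
    model.eval()
    base = torch.from_numpy(frames).float().unsqueeze(0)
    with torch.no_grad():
        base_mu = model(base)[0].item()               # reference steering output
    heat = np.zeros(frames.shape[1:], dtype=np.float32)
    for r in range(frames.shape[1]):
        for c in range(frames.shape[2]):
            perturbed = base.clone()
            perturbed[0, :, r, c] += delta            # change pixel o by Δo (in all 4 frames, assumed)
            with torch.no_grad():
                mu = model(perturbed)[0].item()
            heat[r, c] = abs(mu - base_mu)            # "attention" of the model to this pixel
    return heat / (heat.max() + 1e-8)                 # normalize for plotting
```

Note that this requires one forward pass per pixel, which is acceptable for the small input resolutions assumed here.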
Multi-vehicle distributed reinforcement learning driving automatic driving model training system
In order to verify the engineering feasibility of the pre-training method in the real world, the invention provides a multi-vehicle distributed reinforcement-learning-driven automatic driving model training system, which comprises a strategy learning scene, a strategy verification scene, and a UWB positioning-reinforcement learning reward system and is composed of several robot vehicles, building models, a road-surface map, and the like. The robot vehicles in the strategy learning scene act as learners of the driving strategy during reinforcement learning training; multiple vehicles explore the environment in a distributed manner and execute the actor-critic network parameter update tasks in parallel and asynchronously. The robot vehicle in the strategy verification scene inherits the global driving strategy updated in parallel and asynchronously by the other vehicles and drives in the strategy verification scene; the UWB positioning-reinforcement learning reward system gives rewards, and the strategy verification vehicle records the score. The UWB positioning-reinforcement learning reward system determines the position of each vehicle from the UWB positioning tag mounted on the robot vehicle and, according to the reinforcement learning reward function, gives the rewards obtained in real time during strategy learning and strategy verification in the course of the vehicles' reinforcement learning training.
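The patent does not specify the reinforcement learning reward function used by the UWB positioning-reinforcement learning reward system. Purely as an illustration of how UWB positions could be converted into rewards, a hypothetical progress-minus-deviation reward is sketched below; none of its terms or weights come from the patent.

```python
# Hypothetical reward computed from UWB tag positions. The reward terms, weights,
# and the collision signal are illustrative assumptions only.
import math

def uwb_reward(prev_pos, cur_pos, centerline_dist, collided,
               w_progress=1.0, w_deviation=0.5, collision_penalty=10.0):
    """prev_pos/cur_pos: (x, y) UWB positions; centerline_dist: lateral deviation in meters."""
    progress = math.dist(cur_pos, prev_pos)        # distance travelled this step
    reward = w_progress * progress - w_deviation * centerline_dist
    if collided:
        reward -= collision_penalty
    return reward
```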
In this embodiment, the theoretical framework of the asynchronous supervised learning method provided by the present invention is shown in FIG. 2. Multiple agents equipped with the end-to-end automatic driving model shown in FIG. 1 execute multiple supervised learning processes asynchronously and in parallel on multiple demonstration data sets collected by real vehicles, which improves the stability of the supervised learning process and accelerates the convergence of pre-training. After the end-to-end automatic driving model has been pre-trained with expert demonstration data collected from the real world, its initial performance in the subsequent real-vehicle reinforcement learning stage is improved and its convergence is accelerated.
In this embodiment, the architecture of the multi-vehicle distributed reinforcement-learning-driven automatic driving model training system provided by the present invention is shown in FIG. 3; it includes 2 strategy learning scenes, 1 strategy verification scene, and a UWB positioning-reinforcement learning reward system. After the real-vehicle training system has been built, the manually designed heuristic strategy π'_i provided by the invention drives the robot vehicles to collect demonstration data in the strategy learning scenes, and the pre-training demonstration data set Ω is constructed. The end-to-end automatic driving model is then pre-trained on the data set Ω with the asynchronous supervised learning method; after pre-training, the model is deployed on the robot vehicles and the subsequent reinforcement learning training is carried out in the real-vehicle training system.
The training samples may be biased, so that the trained model cannot actually solve the intended problem. This is difficult to analyze macroscopically from the training data alone; it is then necessary to determine, at the microscopic level, whether the model reacts to the correct positions in the input image. The training-effect visualization analysis method is therefore designed on the basis of single-factor analysis: by making a slight change to each pixel of the input image in turn and observing the change in the model output, the 'attention' of the model to each pixel is obtained, and the heat map of the regions to which the end-to-end automatic driving model is sensitive is drawn. As shown in FIG. 4, for example, changing the pixel in the blue square on the left produces a different result after the image is fed to the model; this difference is the importance of that pixel to the model output, and obtaining the importance of every pixel allows the heat map to be drawn.

Claims (10)

1. An end-to-end automatic driving model pre-training method based on asynchronous supervised learning, characterized in that a plurality of supervised learning processes are executed asynchronously and in parallel (i.e., asynchronous supervised learning) on a plurality of demonstration data sets collected by real vehicles, so that the stability of the supervised learning process is improved and the convergence of the pre-training process is accelerated.
2. The method according to claim 1, wherein the demonstration data set is collected by a data collection vehicle i driven by a manually designed heuristic strategy π'_i, which is designed on the basis of preview theory and determines the wheel steering angle and the brake/throttle amount; the wheel steering angle is determined by collection vehicle i from its current speed v_it and the position of the preceding vehicle, in the following specific steps:
(1) collection vehicle i determines the preview point F(x'_it, y'_it) from its current speed v_it and position E(x_it, y_it) by l_EF = L + v_it × Δt, where L is a fixed preview distance and Δt is the preview coefficient;
(2) the steering angle pointing at the preview point F is calculated with respect to the center of collection vehicle i (formula given as an image in the original);
(3) the steering angle is corrected according to the position and speed of the preceding vehicle j to avoid collision (correction formula given as an image in the original), where W = D/C, D is the lateral distance between the preceding vehicle j and collection vehicle i, and C is the side-collision safety threshold;
the brake and throttle amounts are determined by collection vehicle i from its current speed v_it, the speed limit of the current road section r_t, and the distance d_it to the preceding vehicle j, in the following specific steps:
(1) determine the speed limit of the current road section r_t (formula given as an image in the original), where g is the acceleration of gravity, u is the friction coefficient, M is the vehicle mass, CA is the downforce coefficient, and ρ is the curvature of road section r_t;
(2) when the vehicle speed v_it does not exceed the speed limit, the throttle amount is increased; when v_it exceeds the speed limit or the distance to the preceding vehicle j is less than the forward-collision safety threshold, the braking amount is increased.
3. The method of claim 1, wherein the pre-training process is defined by the quintuple (state, action, loss function, state transition function, discount coefficient) as follows:
state: the set of time-varying environment states collected by demonstration vehicle i, where s_n denotes the state of the n-th experience in the pre-training demonstration data set Ω and consists of 4 consecutive single-channel images captured by the front camera;
action: the set of demonstrated driving actions captured by demonstration vehicle i, where a_n denotes the demonstrated action (wheel steering angle) of the n-th experience in Ω;
loss function: l_n denotes the loss of the n-th experience in Ω, computed from the demonstrated action a_n corresponding to s_n and the action μ(s_n) output by the pre-trained model for input s_n (loss formula given as an image in the original);
state transition function: given state s_n and action a_n (assuming n corresponds to the t-th time slot), the probability that the system transitions to state s_{n+1} in the next time slot is expressed as P(s_{n+1} | s_n, a_n);
discount coefficient γ: γ ∈ [0, 1], used to balance current and long-term losses.
4. The training method of claim 1, wherein the pre-training process is:
(1) a random strategy π_i is given, which outputs a probability distribution over actions a_n after the state s_n is input;
(2) the expected total loss function is derived, denoting the total loss incurred from the current state s_n until the final state when strategy π_i is always followed (formula given as an image in the original);
(3) the random-exploration total loss function is derived: if the agent in state s_n does not execute the action prescribed by strategy π_i but executes another action a, while still following strategy π_i in subsequent states, the corresponding expected total loss is obtained (formula given as an image in the original);
(4) the advantage function is derived, representing the advantage of an action a outside the random exploration strategy π_i (formula given as an image in the original);
(5) the problem formulation is determined: given the current state s_n, find the optimal strategy π_i* by minimizing the advantage function, so as to minimize the expected total loss function; when the exploration process converges, π_i* satisfies the corresponding optimality condition (given as an image in the original), where Π is the set of random strategies.
5. The training method of claim 1, wherein the asynchronous supervised learning introduces an actor-critic neural network as a non-linear function estimator to predict the random action strategy π(a|s; θ) and the expected total loss function V(s; θ_v), so as to solve the problem formulation, where θ and θ_v are the parameters of the actor and critic networks and are updated as follows (update rules given as images in the original); θ' and θ'_v denote thread-specific parameters, and θ and θ_v are globally shared parameters.
6. A visual analysis method for the end-to-end automatic driving model training process, proposed for the training method of claim 1, characterized in that the visual analysis method is designed on the basis of single-factor (univariate) analysis: specifically, with all other pixels of the input image held unchanged, the value of a certain pixel o is changed by an amount Δo, and the effect of this change on the output of a given layer of the neural network is computed from the weight and bias parameters of that layer (formulas given as images in the original); the influence of every pixel of the input image on the final output of the model is obtained in this way and plotted as the attention heat map of the end-to-end automatic driving model.
7. The method of claim 6, wherein the number of pixels in the attention heat map of the end-to-end automatic driving model is the same as the number of pixels in the input image; image regions that strongly influence the model output are highlighted in the heat map, so that whether the regions the model attends to are relevant to the driving decision can be checked from a microscopic perspective, thereby verifying the effectiveness of model training.
8. A multi-vehicle distributed reinforcement-learning-driven automatic driving model training system provided for the asynchronous-supervised-learning-based end-to-end automatic driving model pre-training method, characterized in that the training system is composed of a plurality of robot vehicles, building models, a road-surface map, and the like, and comprises a strategy learning scene, a strategy verification scene, and a UWB positioning-reinforcement learning reward system; the robot vehicles in the strategy learning scene act as learners of the driving strategy during reinforcement learning training, and a plurality of vehicles explore the environment in a distributed manner and execute the actor-critic network parameter update tasks in parallel and asynchronously.
9. The training system of claim 8, wherein the robot vehicle in the strategy verification scene inherits the global driving strategy updated in parallel and asynchronously by the other vehicles and drives in the strategy verification scene, the UWB positioning-reinforcement learning reward system gives rewards, and the strategy verification vehicle records the score.
10. The training system of claim 8, wherein the UWB positioning-reinforcement learning reward system determines the position of each vehicle from the UWB positioning tag mounted on the robot vehicle and, according to the reinforcement learning reward function, gives the rewards obtained in real time during strategy learning and strategy verification in the course of the vehicles' reinforcement learning training.
CN202010727803.2A 2020-07-24 2020-07-24 End-to-end automatic driving model pre-training method based on asynchronous supervised learning Active CN112508164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010727803.2A CN112508164B (en) 2020-07-24 2020-07-24 End-to-end automatic driving model pre-training method based on asynchronous supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010727803.2A CN112508164B (en) 2020-07-24 2020-07-24 End-to-end automatic driving model pre-training method based on asynchronous supervised learning

Publications (2)

Publication Number Publication Date
CN112508164A true CN112508164A (en) 2021-03-16
CN112508164B CN112508164B (en) 2023-01-10

Family

ID=74953327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010727803.2A Active CN112508164B (en) 2020-07-24 2020-07-24 End-to-end automatic driving model pre-training method based on asynchronous supervised learning

Country Status (1)

Country Link
CN (1) CN112508164B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449823A (en) * 2021-08-31 2021-09-28 成都深蓝思维信息技术有限公司 Automatic driving model training method and data processing equipment
CN113561986A (en) * 2021-08-18 2021-10-29 武汉理工大学 Decision-making method and device for automatically driving automobile
CN113743469A (en) * 2021-08-04 2021-12-03 北京理工大学 Automatic driving decision-making method fusing multi-source data and comprehensive multi-dimensional indexes
CN114895560A (en) * 2022-04-25 2022-08-12 浙江大学 Foot type robot object tracking self-adaptive control method under motor stalling condition
AT526259A1 (en) * 2022-06-23 2024-01-15 Avl List Gmbh Method for training an artificial neural network of a driver model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492763A (en) * 2018-09-17 2019-03-19 同济大学 A kind of automatic parking method based on intensified learning network training
CN110291477A (en) * 2016-12-02 2019-09-27 斯塔斯凯机器人公司 Vehicle control system and application method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110291477A (en) * 2016-12-02 2019-09-27 斯塔斯凯机器人公司 Vehicle control system and application method
CN109492763A (en) * 2018-09-17 2019-03-19 同济大学 A kind of automatic parking method based on intensified learning network training

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yun-Peng Wang, "Cooperative channel assignment for VANETs based on multiagent reinforcement learning", Frontiers of Information Technology & Electronic Engineering *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743469A (en) * 2021-08-04 2021-12-03 北京理工大学 Automatic driving decision-making method fusing multi-source data and comprehensive multi-dimensional indexes
CN113743469B (en) * 2021-08-04 2024-05-28 北京理工大学 Automatic driving decision method integrating multi-source data and comprehensive multi-dimensional indexes
CN113561986A (en) * 2021-08-18 2021-10-29 武汉理工大学 Decision-making method and device for automatically driving automobile
CN113561986B (en) * 2021-08-18 2024-03-15 武汉理工大学 Automatic driving automobile decision making method and device
CN113449823A (en) * 2021-08-31 2021-09-28 成都深蓝思维信息技术有限公司 Automatic driving model training method and data processing equipment
CN114895560A (en) * 2022-04-25 2022-08-12 浙江大学 Foot type robot object tracking self-adaptive control method under motor stalling condition
CN114895560B (en) * 2022-04-25 2024-03-19 浙江大学 Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition
AT526259A1 (en) * 2022-06-23 2024-01-15 Avl List Gmbh Method for training an artificial neural network of a driver model

Also Published As

Publication number Publication date
CN112508164B (en) 2023-01-10

Similar Documents

Publication Publication Date Title
CN112508164B (en) End-to-end automatic driving model pre-training method based on asynchronous supervised learning
JP7287707B2 (en) Driverless vehicle lane change decision method and system based on adversarial imitation learning
CN108227710A (en) Automatic Pilot control method and device, electronic equipment, program and medium
US11474529B2 (en) System and method for motion planning of an autonomous driving machine
CN111222630A (en) Autonomous driving rule learning method based on deep reinforcement learning
Hu et al. Learning a deep cascaded neural network for multiple motion commands prediction in autonomous driving
CN112784485B (en) Automatic driving key scene generation method based on reinforcement learning
Huang et al. Deductive reinforcement learning for visual autonomous urban driving navigation
Siebinga et al. A human factors approach to validating driver models for interaction-aware automated vehicles
CN116134292A (en) Tool for performance testing and/or training an autonomous vehicle planner
Sun et al. Human-like highway trajectory modeling based on inverse reinforcement learning
Kim et al. An open-source low-cost mobile robot system with an RGB-D camera and efficient real-time navigation algorithm
CN115062202A (en) Method, device, equipment and storage medium for predicting driving behavior intention and track
Hao et al. Aggressive lane-change analysis closing to intersection based on UAV video and deep learning
Wang et al. Pre-training with asynchronous supervised learning for reinforcement learning based autonomous driving
CN116300944A (en) Automatic driving decision method and system based on improved Double DQN
CN115981302A (en) Vehicle following lane change behavior decision-making method and device and electronic equipment
WO2021258847A1 (en) Driving decision-making method, device, and chip
CN115031753A (en) Driving condition local path planning method based on safety potential field and DQN algorithm
Mohammed et al. Reinforcement learning and deep neural network for autonomous driving
Liu et al. Enhancing Social Decision-Making of Autonomous Vehicles: A Mixed-Strategy Game Approach With Interaction Orientation Identification
Wu et al. Learning driving behavior for autonomous vehicles using deep learning based methods
Tan et al. RCP‐RF: A comprehensive road‐car‐pedestrian risk management framework based on driving risk potential field
Bhattacharyya Modeling Human Driving from Demonstrations
US20240157978A1 (en) Mixed reality simulation for autonomous systems

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant