CN112784485B - Automatic driving key scene generation method based on reinforcement learning - Google Patents

Automatic driving key scene generation method based on reinforcement learning

Info

Publication number
CN112784485B
CN112784485B (application CN202110082493.8A; published as CN112784485A)
Authority
CN
China
Prior art keywords
init
light
dynamic
probability
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110082493.8A
Other languages
Chinese (zh)
Other versions
CN112784485A (en)
Inventor
董乾
薛云志
孟令中
杨光
王鹏淇
师源
武斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202110082493.8A priority Critical patent/CN112784485B/en
Publication of CN112784485A publication Critical patent/CN112784485A/en
Application granted granted Critical
Publication of CN112784485B publication Critical patent/CN112784485B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01M TESTING STATIC OR DYNAMIC BALANCE OF MACHINES OR STRUCTURES; TESTING OF STRUCTURES OR APPARATUS, NOT OTHERWISE PROVIDED FOR
    • G01M 17/00 Testing of vehicles
    • G01M 17/007 Wheeled or endless-tracked vehicles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2111/00 Details relating to CAD techniques
    • G06F 2111/08 Probabilistic or stochastic CAD

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a reinforcement-learning-based method for generating key scenes for automatic driving, which comprises the following steps: 1) select a road scene from a map library, set the driving route of the host vehicle in the simulation system, and establish a probability model for each dynamic environment element; 2) the simulation system controls the host vehicle to execute a simulation task, and the probability models of all dynamic elements in the selected road scene are trained with reinforcement learning to obtain the optimal parameters of each probability model for that road scene, which are stored in a test case library; 3) repeat steps 1)-2) until the optimal parameters of each probability model have been obtained for every road scene in the map library; 4) acquire several road scenes from the map library, combine them into a test map, and select the dynamic elements required in the simulation environment; 5) import the probability model and corresponding optimal parameters of each dynamic element contained in the test map from the test case library to generate key-scene test cases.

Description

Automatic driving key scene generation method based on reinforcement learning
Technical Field
The invention relates to an automatic driving key scene generation method based on reinforcement learning, and belongs to the technical field of computer software.
Background
Today, the performance of most perception and prediction algorithms is very sensitive to imbalance in the training data (the long-tail problem). Rare events are difficult to collect and are easily overlooked in large data streams, which greatly hinders the application of robots in the real world, especially in safety-critical areas such as automatic driving.
In the automotive industry, it is common to reproduce, through simulation, the key scenes collected during test drives. The prior art also proposes an alternative method, worst-case evaluation, which searches for the worst cases when evaluating vehicle controllers. Although evaluating against worst cases can be useful, many of the cases it finds are almost impossible in the real world and offer little practical guidance. In addition, the prior art mainly simulates the route or task completion of the simulated agent (such as an unmanned vehicle), but provides no modeling method for deploying the simulation environment so that it meets the key safety-scene requirements of an enterprise.
Reinforcement learning is a branch of machine learning in artificial intelligence used to control an agent that can act autonomously in an environment; by interacting with the environment, including perceiving it and receiving rewards, the agent continuously improves its behavior. The two most important characteristics of reinforcement learning are trial and error and delayed reward. Based on the reinforcement learning theory, the invention therefore provides a key scene generation method for the automatic driving test process.
Disclosure of Invention
The invention aims to provide a reinforcement-learning-based method for generating key scenes for automatic driving, and to solve two problems of the prior art: the lack of training of dynamic environment elements in the automatic driving simulation environment, and the lack of a way to generate key safety scenes for automatic driving that specifies how the dynamic environment elements should be deployed. For the dynamic environment elements in an automatic driving simulation scene, the invention continuously trains the model parameters during simulation through reinforcement learning to obtain neural network models of the dynamic environment elements under different road scenes, and thereby generates a series of key-scene test cases. The model parameters of a dynamic environment element include its initial position, movement speed, movement route, trigger distance, and the like. The invention designs a reasonable reward mechanism for the dynamic environment elements and, combined with the road scene, fully considers the motion trajectories of dynamic environment elements such as pedestrians, vehicles and traffic lights and their influence on the host vehicle.
In the invention, the map library of automatic driving test scenes can be preset by the test system, or map scenes can be imported by the user. The host vehicle is the virtual vehicle under test in the test system; its motion trajectory and behavior are controlled by the decision module of the simulation system. The dynamic environment elements mainly comprise three types, namely pedestrians, other running vehicles and traffic lights, which dynamically interfere with the driving of the virtual vehicle under test in the simulation system: pedestrians are road participants in the test scene; other running vehicles are non-tested vehicles that share the road of the test scene; traffic lights are relatively static traffic elements that control the timing of the lights at an intersection.
The method for generating the automatic driving key scene based on reinforcement learning comprises the following steps:
Step 1: initialize the test scene, select a road scene from the map library, set the driving route of the host vehicle, and establish an initial probability model for each of the three types of dynamic environment elements, namely pedestrians, other running vehicles and traffic lights;
Step 2: the decision module of the simulation system controls the host vehicle to start executing a simulation task; the probability model parameters of the three types of dynamic elements in the selected road scene are trained based on reinforcement learning;
Step 3: the three types of dynamic elements finally obtain the optimal probability model parameters for the selected road condition, and the optimal parameters are stored in a test case library;
Step 4: steps 1-3 are repeated until the probability models of the three types of dynamic elements have been trained to their optimal parameters for all road scenes in the map library;
Step 5: combine roads from the map library into an arbitrary test map, and select the dynamic elements required by the user in the simulation environment, mainly pedestrians, other running vehicles and traffic lights;
Step 6: according to the road scenes in the test map, import the dynamic-element probability model and corresponding optimal parameters for each road from the test case library to generate a series of key-scene test cases.
Further, step 1 specifically comprises:
the dynamic environment elements are other dynamic elements except the main vehicle in the test scene, and mainly comprise three types of pedestrians, other running vehicles and traffic lights.
The road conditions in the map library comprise a one-way lane, a two-way lane, a crossroad, a T-shaped intersection, a Y-shaped intersection, an entrance and an exit of a highway, an overpass and the like; the key scenes appearing on different road conditions are different, the situations of collision, line pressing, retrograde motion, red light running and the like can exist, different dynamic environment element models exist for each road condition, and the combination of the dynamic environment elements is equivalent to modeling of the key scenes of the road conditions.
For the initial models of the pedestrians and the vehicles, the parameters mainly comprise the elements of initial positions, movement routes, movement speeds, trigger distances and the like; the pedestrian movement route comprises straight movement along the direction of the road, crossing the road and the like, the vehicle movement route comprises straight movement, left turning, right turning, turning around, lane changing and the like, and the pedestrian movement route and the vehicle movement route have different options according to different road conditions and initial positions, but all the options are required to accord with traffic rules; for the initial model of traffic lights, it was primarily the time setting of the traffic lights, including the duration of the red, green, and yellow lights.
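As a concrete illustration, these initial model parameters could be organized as follows. This is a minimal sketch: the class and field names (ActorInitModel, TrafficLightInitModel) and all numeric ranges are illustrative assumptions rather than definitions from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ActorInitModel:
    """Initial model of a pedestrian or a non-ego vehicle (illustrative field names)."""
    init_position: Tuple[float, float]              # (X, Y) in the road scene
    route_options: List[str]                        # routes allowed by the road type and traffic rules
    speed_range: Tuple[float, float]                # admissible movement speed, m/s
    trigger_distance_range: Tuple[float, float]     # host-vehicle distance that triggers the motion

@dataclass
class TrafficLightInitModel:
    """Initial model of a traffic light: only the light timing is parameterized."""
    init_states: List[str] = field(default_factory=lambda: ["red", "green", "yellow"])
    t_red_range: Tuple[float, float] = (20.0, 90.0)     # assumed bounds, not from the patent
    t_green_range: Tuple[float, float] = (10.0, 60.0)   # assumed bounds
    t_yellow_range: Tuple[float, float] = (2.0, 5.0)    # assumed bounds

# Example: a pedestrian at an intersection whose legal routes are walking along
# the sidewalk or crossing the road.
pedestrian = ActorInitModel(
    init_position=(12.0, 3.5),
    route_options=["walk_along_sidewalk", "cross_road"],
    speed_range=(0.5, 2.5),
    trigger_distance_range=(5.0, 40.0),
)
```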
Further, step 2 specifically comprises:
Step 2.1: set the total number of training iterations E; initialize the iteration counter e to 0;
Step 2.2: obtain a road scene, select the types of dynamic elements to be trained in that road scene (pedestrians, other running vehicles and traffic lights), obtain the driving route of the host vehicle, and obtain the initial model of each type of dynamic element; the number of selected dynamic elements of each type is greater than or equal to 1. Note that, for pedestrians and vehicles, the chosen movement route and initial position must comply with the traffic rules;
Step 2.3: for each road scene the state S is determined; S comprises the road type, the route of the host vehicle and the speed of the host vehicle. The probability distribution of each dynamic element in the selected road scene can then be calculated from the current state of the host vehicle (the current state of the host vehicle is set in step 1; the road scene and the current state of the host vehicle are known conditions, e.g. the road scene is a crossroad and the host vehicle is turning right, fixed before the test). For pedestrians and vehicles, the joint probability distribution of the action elements is:
P(a_1, ..., a_n | S) = ∏_{i=1}^{n} P(a_i | S, a_1, ..., a_{i-1})  (1)
where the ith action element a_i comprises the concrete elements of the dynamic element, namely the initial position (x_i, y_i), the movement route l_i, the movement speed v_i and the trigger distance d_i. The probability of a_i factorizes autoregressively over these components:
P(a_i | S, a_1, ..., a_{i-1}) = P(x_i | S, a_1, ..., a_{i-1}) · P(y_i | S, a_1, ..., a_{i-1}, x_i) · P(l_i | S, a_1, ..., a_{i-1}, x_i, y_i) · P(v_i | S, a_1, ..., a_{i-1}, x_i, y_i, l_i) · P(d_i | S, a_1, ..., a_{i-1}, x_i, y_i, l_i, v_i)  (2)
The ith movement route is a discrete variable li_init_state obtained by discretizing the continuous random variable li_init_index; li_init_state is the initial state of the movement route of the ith scene element l_i. The difficulty with the movement route is that the available route options depend strongly on the road structure and the initial point. Assuming the total number of route options under the given conditions is N (i.e. the number of selectable movement routes), the conditional probability density of li_init_index can be modeled by a probability density function on the interval [0,1] constructed by a neural network, as in equation (3):
li_init_index ~ P(li_init_index | S, a_1, ..., a_{i-1}, x_i, y_i)  (3)
The discretization of the continuous random variable li_init_index is detailed in step 2.4.
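For illustration, a conditional density on [0,1] of this kind could be parameterized as sketched below. The patent only states that a neural network constructs a probability density function on [0,1]; the choice of a Beta distribution, the layer sizes and the 16-dimensional conditioning vector are assumptions made here.

```python
import torch
import torch.nn as nn
from torch.distributions import Beta

class RouteIndexDensity(nn.Module):
    """Conditional density of li_init_index on [0, 1], cf. equation (3)."""

    def __init__(self, cond_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Softplus(),   # two positive Beta concentration parameters
        )

    def forward(self, cond: torch.Tensor) -> Beta:
        params = self.net(cond) + 1e-3             # keep the parameters strictly positive
        return Beta(params[..., 0], params[..., 1])

# cond concatenates the scene state S, the preceding actions a_1..a_{i-1} and (x_i, y_i).
density = RouteIndexDensity(cond_dim=16)
li_init_index = density(torch.randn(1, 16)).sample()   # a value in (0, 1)
print(li_init_index)
```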
The duration of the traffic light is a continuous variable, and the probability distribution of the traffic-light durations is given by equation (4):
P(light_init_index, t_red, t_green, t_yellow | S, a_1, ..., a_{i-1}) = P(light_init_index | S, a_1, ..., a_{i-1}) · P(t_red | S, a_1, ..., a_{i-1}, light_init_index) · P(t_green | S, a_1, ..., a_{i-1}, light_init_index, t_red) · P(t_yellow | S, a_1, ..., a_{i-1}, light_init_index, t_red, t_green)  (4)
where light_init_index, t_red, t_green and t_yellow in equation (4) denote, respectively, the initial state of the traffic light (initially red, green or yellow), the red-light duration, the green-light duration and the yellow-light duration.
The conditional probability densities of the traffic-light durations t_red, t_green and t_yellow, namely P(t_red | S, a_1, ..., a_{i-1}, light_init_index), P(t_green | S, a_1, ..., a_{i-1}, light_init_index, t_red) and P(t_yellow | S, a_1, ..., a_{i-1}, light_init_index, t_red, t_green), can be modeled with Gaussian distributions.
The initial state light_init_state of the traffic light is a discrete variable obtained by discretizing the continuous random variable light_init_index; the conditional probability density of light_init_index can be modeled by a probability density function on the interval [0,1] constructed by a neural network, giving equation (5):
light_init_index ~ P(light_init_index | S, a_1, ..., a_{i-1})  (5)
The discretization of the continuous random variable light_init_index is detailed in step 2.4.
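As a concrete illustration of the factorization in equation (4), the traffic-light parameters could be sampled autoregressively as sketched below. Only the factorization order follows the text; the network architecture, the Beta head for light_init_index and the Gaussian heads for the durations are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Beta, Normal

class TrafficLightModel(nn.Module):
    """Autoregressive model of (light_init_index, t_red, t_green, t_yellow), cf. equation (4)."""

    def __init__(self, cond_dim: int, hidden: int = 64):
        super().__init__()
        def head(in_dim, out_dim):
            return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))
        self.index_head = head(cond_dim, 2)        # Beta parameters of light_init_index
        self.red_head = head(cond_dim + 1, 2)      # (mu, log_sigma) of t_red
        self.green_head = head(cond_dim + 2, 2)    # additionally conditioned on t_red
        self.yellow_head = head(cond_dim + 3, 2)   # additionally conditioned on t_green

    def sample(self, cond: torch.Tensor):
        a, b = (F.softplus(self.index_head(cond)) + 1e-3).chunk(2, dim=-1)
        light_init_index = Beta(a, b).rsample()
        def draw(h, inp):
            mu, log_sigma = h(inp).chunk(2, dim=-1)
            return Normal(mu, log_sigma.exp()).rsample()
        t_red = draw(self.red_head, torch.cat([cond, light_init_index], dim=-1))
        t_green = draw(self.green_head, torch.cat([cond, light_init_index, t_red], dim=-1))
        t_yellow = draw(self.yellow_head, torch.cat([cond, light_init_index, t_red, t_green], dim=-1))
        return light_init_index, t_red, t_green, t_yellow

model = TrafficLightModel(cond_dim=16)
light_init_index, t_red, t_green, t_yellow = model.sample(torch.randn(1, 16))
```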
Step 2.4: randomly sampling the probability distribution of each dynamic element to obtain the action parameters of the dynamic element model in the state S, namely obtaining the initial position X, the initial position Y, the movement speed v, the movement route L and the trigger distance D for pedestrians, obtaining the initial position X, the initial position Y, the movement speed v, the movement route L and the trigger distance D for vehicles, and obtaining the initial state light _ init _ state, the red light time setting t _ red, the yellow light time setting t _ yellow and the green light time setting t _ green of traffic lights;
step 2.4.1: modeling dynamic elements by using Gaussian distribution N (mu, sigma) for continuous random variables such as initial position (X, Y), motion speed v, trigger distance D, red light time setting t _ red, yellow light time setting t _ yellow and green light time setting t _ green, modeling dynamic elements by using polynomial distribution for discrete random variables such as initial state light _ init _ state of traffic light and motion route L, and using Neural Network (NN) for conditional probability inference;
the gaussian distribution probability sampling formula is as follows:
μkk←Mk(S) (6)
ε~N(0,1) (7)
ak=μkk*ε (8)
random variable akIs an operation of sampling from the kth node, MkIs a model representing the conditional distribution of the kth action, followed by akParameters that scale and move to represent the real scene:
bk=ak*lk+sk (9)
wherein lkAnd skRespectively the range and mean of the kth action.
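In code, the reparameterized sampling of equations (6)-(9) could look as follows; the numeric values in the example are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_continuous_action(mu_k: float, sigma_k: float, l_k: float, s_k: float) -> float:
    """Reparameterized Gaussian sampling, equations (6)-(9).

    mu_k and sigma_k come from the model M_k(S); l_k and s_k rescale the unit-level
    sample a_k to the range and mean of the real scene.
    """
    eps = rng.standard_normal()        # epsilon ~ N(0, 1), equation (7)
    a_k = mu_k + sigma_k * eps         # equation (8)
    b_k = a_k * l_k + s_k              # equation (9): scale and shift to scene units
    return b_k

# Example: sampling a trigger distance whose scene range is 40 m centered at 25 m.
trigger_distance = sample_continuous_action(mu_k=0.1, sigma_k=0.3, l_k=40.0, s_k=25.0)
print(f"sampled trigger distance: {trigger_distance:.1f} m")
```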
Step 2.4.2: for discrete random variables, such as the movement route L, the initial state light _ init _ index of the traffic light, [0,1] distribution, the probability sampling formula is as follows:
1) for the moving route L of pedestrians and other traveling vehicles, knowing the probability of li _ init _ index obeys equation (3), firstly, probability sampling is performed on the initial state li _ init _ index of the route, a discrete random variable li _ init _ state which can be further constructed by using a continuous random variable li _ init _ index is used, and when the kth route is selected, the correspondence between the continuous type and the discrete type is as follows:
li _ init _ state ═ k, where li _ init _ index ∈ ((k-1)/N, k/N) (10)
2) For the initial state of the traffic light, knowing the probability of light _ init _ index obeys equation (5), first performing probability sampling on the light _ init _ index of the initial state of the traffic light, and then performing mapping from the light _ init _ index to the light _ init _ state, the random variable light _ init _ state of the initial state of the traffic light can be further constructed as follows:
Figure BDA0002909880240000051
thus obtaining the initial state of the traffic light (initial red, green, yellow colors).
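A small helper pair like the following could implement this discretization; the equal-thirds split used for the traffic-light colors mirrors the reading of equation (11) given above and is otherwise an assumption.

```python
import math

def route_from_index(li_init_index: float, num_routes: int) -> int:
    """Map the continuous route index on [0, 1] to a discrete route choice, equation (10)."""
    k = math.ceil(li_init_index * num_routes)      # li_init_index in ((k-1)/N, k/N] selects route k
    return min(max(k, 1), num_routes)

def light_state_from_index(light_init_index: float) -> str:
    """Map light_init_index to an initial light color (equal thirds of [0, 1] assumed)."""
    if light_init_index < 1 / 3:
        return "red"
    if light_init_index < 2 / 3:
        return "green"
    return "yellow"

print(route_from_index(0.62, num_routes=4))   # -> 3
print(light_state_from_index(0.8))            # -> "yellow"
```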
Step 2.5: and (3) testing the main vehicle by taking the random sampling result in the step (2.4) as a condition, and calculating an incentive value R by using the operation result, wherein the design principle of the incentive value R is a key scene in which some main vehicle accidents are expected to occur or the main vehicle violates the traffic rules, and the calculation formula is as follows:
Figure BDA0002909880240000052
w1, w2, w3, w4 and w5 in the formula (12) are all non-negative weight coefficients, and w1+ w2+ w3+ w4+ w5 is 1; here, ped denotes a set of pedestrian dynamic elements, c denotes a set of other traveling vehicle dynamic elements, l denotes a set of traffic light dynamic elements, r denotes a set of traffic regulations violated by the object to be measured (host vehicle), and p denotes a set of penalty terms for the object to be measured (host vehicle).
Wherein, the first term R of formula (12)pedExpressing the reward value of the pedestrian, and the calculation formula is as follows:
Figure BDA0002909880240000053
both b1 and b2 are non-negative weight coefficients, and b1+ b2 is 1;
Figure BDA0002909880240000054
indicates for the ith action element aiAccording to the minimum distance dis between the pedestrian and the main vehiclepThe prize value earned, i.e. for the ith action element aiThe host vehicle presents the reward value obtained about the key scene of the pedestrian, and the following same reason.
Of formula (13)
Figure BDA0002909880240000055
Indicating the minimum distance dis between the pedestrian and the host vehiclepAvailable prize value, dispIndicating the distance between the pedestrian and the host vehicle, as the distance dispLess than thresholdpWhen the temperature of the water is higher than the set temperature,indicating that the distance between the pedestrian and the main vehicle is less than the safe distance, and obtaining the corresponding reward value DIS>0, DIS is a specific value which can be set, otherwise, the reward value is 0, and the calculation formula is as follows:
Figure BDA0002909880240000061
of formula (13)
Figure BDA0002909880240000062
Col for indicating traffic accident between main car and pedestrianpAvailable prize value, colpIndicating that if the host vehicle and the pedestrian have a traffic accident, the corresponding reward value COL is obtained, and the COL is a specific value which can be set, and the calculation formula is as follows:
Figure BDA0002909880240000063
second term R of formula (12)cIndicating the reward value of the vehicle, including the distance between the vehicle and the host vehicle being less than the safe distance discCollision accident col of main vehiclecIn this case, the calculation formula is as follows:
Figure BDA0002909880240000064
here, c1 and c2 are non-negative weight coefficients, and c1+ c2 is 1;
of formula (16)
Figure BDA0002909880240000065
A bonus value, dis, being available representing the minimum distance between the other travelling vehicles and the host vehiclecIndicating the distance between the other running vehicle and the host vehicle, when the distance discLess than thresholdcWhen the distance between other driving vehicles and the main vehicle is smaller than the safe distance, obtaining a corresponding reward value DIS, wherein DIS is a settable specific numerical value, otherwise, the reward value is 0, and the calculation formula is as follows;
Figure BDA0002909880240000066
of formula (16)
Figure BDA0002909880240000067
The prize value, col, available to indicate a traffic accident between the host vehicle and other vehiclescIt means that if the main vehicle and other running vehicles have a traffic accident, then the corresponding reward value COL is obtained, and the COL is a specific value which can be set, and the calculation formula is as follows:
Figure BDA0002909880240000068
a third term R of formula (12)lThe reward value of the traffic light comprises that the main vehicle runs the red light and the main vehicle runs the yellow light, and the calculation formula is as follows:
Rl=f1*Rred(ai)+f2*Ryellow(ai) (19)
both f1 and f2 are non-negative weight coefficients, and f1+ f2 is 1;
r of formula (19)redThe reward value obtained when the host vehicle runs the RED light is shown, RED shows that the host vehicle runs the RED light, and then the corresponding reward value RED is obtained, wherein RED is a specific value which can be set, and the calculation formula is as follows:
Figure BDA0002909880240000069
r of formula (19)yellowThe reward value which can be obtained when the main vehicle runs the red light is shown, YELLOW shows that the main vehicle runs the YELLOW light, the corresponding reward value YELLOW is obtained, YELLOW is a specific numerical value which can be set, and the calculation formula is as follows:
Figure BDA0002909880240000071
dis of formula (21)yellowThe distance value of the host car exceeding the stop line is shown after the host car finds that the yellow light is turned on until the host car stops, and alpha represents the coefficient of the distance.
Fourth term R of formula (12)rThe traffic rule violation behaviors such as main vehicle line pressing rate, reverse driving, illegal lane change and the like are represented by the following calculation formula:
Rr=g1*Rcross(ai)+g2*Rconverse(ai)+g3*Rlane_change(ai) (22)
g1, g2 and g3 are all non-negative weight coefficients, and g1+ g2+ g3 is 1;
r of formula (22)crossThe reward value which can be obtained when the main straight line runs is represented, CROSS represents the condition that the main straight line runs, and then a corresponding reward value CROSS is obtained, wherein CROSS is a specific value which can be set, and the calculation formula is as follows:
Figure BDA0002909880240000072
r of formula (22)converseThe reward value which can be obtained by the reverse driving of the main vehicle is represented, the change represents the condition that the main vehicle reverses, and then the corresponding reward value is obtained, the change is a specific value which can be set, and the calculation formula is as follows:
Figure BDA0002909880240000073
r of formula (22)lane_changeThe LANE _ CHANGE indicates that the host vehicle has an illegal LANE CHANGE, and the LANE _ CHANGE indicates that the host vehicle has an illegal LANE CHANGE, so that a corresponding reward value LANE _ CHANGE is obtained, the LANE _ CHANGE is a settable specific numerical value, and the calculation formula is as follows:
Figure BDA0002909880240000074
a fifth term R of formula (12)pIndicating the use of penalties to avoidThe special situation is avoided, namely, the dynamic element has some unreasonable situations, which are usually related to the distance between the main vehicle and the dynamic element, and the calculation formula is as follows;
Figure BDA0002909880240000075
wherein eta isiIs that the main vehicle is in state siRun-way of, p0Indicates the position of the dynamic element, and γ is a set distance threshold.
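To make the reward structure concrete, a simplified computation of equation (12) for a single sampled scenario with one dynamic element of each type could look like the sketch below. The weights, thresholds and reward constants (DIS, COL, RED, etc.) are illustrative placeholders, and the per-term formulas follow equations (13)-(26) as given above.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    """Outcome of one simulation run (illustrative fields, not from the patent)."""
    min_dist_pedestrian: float
    pedestrian_collision: bool
    min_dist_vehicle: float
    vehicle_collision: bool
    ran_red_light: bool
    ran_yellow_light: bool
    dist_over_stop_line: float
    crossed_line: bool
    drove_against_traffic: bool
    illegal_lane_change: bool
    element_dist_to_ego_route: float   # min distance from the dynamic element to the ego route

# Illustrative constants and weights (all configurable in the patent's formulation).
DIS, COL, RED, YELLOW, CROSS, CONVERSE, LANE_CHANGE, RP = 1.0, 5.0, 3.0, 1.0, 1.0, 2.0, 1.0, 2.0
THRESH_P, THRESH_C, ALPHA, GAMMA = 2.0, 3.0, 0.1, 30.0
W = dict(ped=0.3, c=0.3, l=0.2, r=0.1, p=0.1)       # w1..w5
B1, B2, C1, C2, F1, F2, G1, G2, G3 = 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.4, 0.3, 0.3

def reward(res: RunResult) -> float:
    r_ped = B1 * (DIS if res.min_dist_pedestrian < THRESH_P else 0.0) \
          + B2 * (COL if res.pedestrian_collision else 0.0)            # eqs (13)-(15)
    r_c = C1 * (DIS if res.min_dist_vehicle < THRESH_C else 0.0) \
        + C2 * (COL if res.vehicle_collision else 0.0)                 # eqs (16)-(18)
    r_l = F1 * (RED if res.ran_red_light else 0.0) \
        + F2 * ((YELLOW + ALPHA * res.dist_over_stop_line) if res.ran_yellow_light else 0.0)  # eqs (19)-(21)
    r_r = G1 * (CROSS if res.crossed_line else 0.0) \
        + G2 * (CONVERSE if res.drove_against_traffic else 0.0) \
        + G3 * (LANE_CHANGE if res.illegal_lane_change else 0.0)       # eqs (22)-(25)
    r_p = -RP if res.element_dist_to_ego_route > GAMMA else 0.0        # eq (26), penalty term
    return W["ped"] * r_ped + W["c"] * r_c + W["l"] * r_l + W["r"] * r_r + W["p"] * r_p  # eq (12)

print(reward(RunResult(1.5, True, 4.0, False, False, True, 0.8, False, False, False, 10.0)))
```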
Step 2.6: optimizing the probability model of the dynamic elements by using a strategy gradient method, wherein the objective function formula is as follows:
Figure BDA0002909880240000076
wherein a is the distribution from the strategy piφMiddle sampling action, phi ═ a1,...,an) (ii) a E is the expectation function and R is the prize value.
And (3) sampling and approximating the target function for N times, wherein the gradient for updating the model parameter phi is as follows:
Figure BDA0002909880240000081
in order to make the selection of the strategy as diverse as possible, an entropy term H (pi) is added in the objective functionφ):
H(πφ)=-∫πφ(x)logπφ(x)dx (29)
πφIs a distribution characterized by phi, where x is an independent variable, which, when taken at different values, has a probability density value of piφ(x) Will change correspondingly; for entropy term H (pi)φ) And synchronizing with the reward value to maximize, and then adding the entropy term to obtain the gradient of the objective function:
Figure BDA0002909880240000082
the updating formula of the parameter phi is as follows, the gradient descent method is used for optimizing the parameter to obtain the minimum value of the formula, and thus the maximum reward value and entropy value are obtained:
Figure BDA0002909880240000083
when using the autoregressive Gaussian distribution Pair strategyφIn modeling, the joint probabilities can be computed using chain rules:
Figure BDA0002909880240000084
πφ,iis the model parameter phi corresponding to the ith dynamic element model.
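A minimal sketch of this update, for a toy Gaussian policy over two continuous scenario parameters and a stand-in reward function, is shown below; the network shape, the entropy coefficient and the stand-in reward are assumptions, and the sketch only illustrates the policy-gradient-with-entropy update of equations (28)-(31).

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Toy policy pi_phi over two continuous action parameters (e.g. speed, trigger distance)."""
    def __init__(self, state_dim: int = 4, act_dim: int = 2):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))
        self.log_sigma = nn.Parameter(torch.zeros(act_dim))

    def dist(self, s: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.mu(s), self.log_sigma.exp())

def dummy_reward(actions: torch.Tensor) -> torch.Tensor:
    # Stand-in for running the simulator and evaluating equation (12).
    return -(actions - 1.5).pow(2).sum(dim=-1)

policy = GaussianPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
state = torch.zeros(1, 4)               # fixed state S for the toy example
entropy_coef = 0.01                     # assumed entropy weight

for step in range(200):
    dist = policy.dist(state.expand(64, -1))       # N = 64 samples per update, equation (28)
    actions = dist.sample()
    log_prob = dist.log_prob(actions).sum(dim=-1)
    rewards = dummy_reward(actions)
    # Negative of J(phi) + H(pi_phi): descending this loss maximizes both, cf. (30)-(31).
    loss = -(rewards.detach() * log_prob).mean() - entropy_coef * dist.entropy().sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(policy.mu(state))                 # approaches the reward-maximizing value (about 1.5)
```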
Step 2.7: adding 1 to the iteration number e; when the iteration number E of the model training is smaller than E, returning to the step 2.2; and when the iteration times of the model training are equal to E, completing the model parameter training of the dynamic elements.
Further, the final test case obtained in step 4 is specifically as follows:
For the selected road condition, the state of the test scene is determined. After the three types of dynamic elements have undergone iterative training based on reinforcement learning, the model of each type of dynamic element yields the probability distribution of the key-scene configuration parameters, which include the initial position, trigger position, movement speed, movement route, traffic-light switching times and the like. Based on the probability distributions of the configuration parameters of the dynamic-element models, key-scene test cases for unmanned-vehicle simulation can be generated quickly.
Further, the map of the test case in step 5 is a free combination of arbitrary road conditions from the map library that have undergone reinforcement-learning-based iterative training.
Further, the test case generation in step 6 specifically comprises:
Step 6.1: select the required road types from the map library and combine them freely into a test-case map;
Step 6.2: set the driving route of the host vehicle, and select the types and numbers of dynamic elements;
Step 6.3: once the driving route of the host vehicle has been added to the road scene, the state S of the test scene is determined. According to the road condition and the motion trajectory of the host vehicle, the dynamic-element model M trained under state S can be found in the test case library. For the ith state S_i, the dynamic-element model M gives the probability distribution P_i of the action parameters; this distribution is randomly sampled as in step 2.4 to obtain concrete action-parameter values a_i. For pedestrians and vehicles, a_i comprises the initial position X, initial position Y, movement speed v, movement route L and trigger distance D; for traffic lights, a_i comprises the initial state light_init_state, the red-light time setting t_red, the yellow-light time setting t_yellow and the green-light time setting t_green. These concrete action-parameter values are assigned to the dynamic elements;
Step 6.4: generate the final test case.
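As an illustration of steps 6.1-6.4, the sketch below samples a stored, already-trained distribution to instantiate one concrete test case; the storage format (plain Gaussian parameters and route probabilities per road scene) and all names and numbers are simplifying assumptions rather than the patent's data structures.

```python
import random

# Hypothetical test case library: per road scene, trained distribution parameters
# for one pedestrian and one traffic light (simplified to plain numbers here).
case_library = {
    "crossroad_right_turn": {
        "pedestrian": {
            "x": (12.0, 1.0), "y": (3.5, 0.5),          # (mean, std) of the initial position
            "speed": (1.4, 0.3), "trigger_distance": (18.0, 4.0),
            "routes": {"cross_road": 0.7, "walk_along_sidewalk": 0.3},
        },
        "traffic_light": {
            "init_state": {"red": 0.2, "green": 0.6, "yellow": 0.2},
            "t_red": (45.0, 8.0), "t_green": (30.0, 6.0), "t_yellow": (3.0, 0.5),
        },
    }
}

def sample_gauss(mean_std):
    mean, std = mean_std
    return random.gauss(mean, std)

def sample_discrete(probs):
    return random.choices(list(probs), weights=list(probs.values()))[0]

def generate_test_case(scene_id: str) -> dict:
    params = case_library[scene_id]
    ped, light = params["pedestrian"], params["traffic_light"]
    return {
        "road_scene": scene_id,
        "pedestrian": {
            "init_position": (sample_gauss(ped["x"]), sample_gauss(ped["y"])),
            "speed": sample_gauss(ped["speed"]),
            "trigger_distance": sample_gauss(ped["trigger_distance"]),
            "route": sample_discrete(ped["routes"]),
        },
        "traffic_light": {
            "init_state": sample_discrete(light["init_state"]),
            "t_red": sample_gauss(light["t_red"]),
            "t_green": sample_gauss(light["t_green"]),
            "t_yellow": sample_gauss(light["t_yellow"]),
        },
    }

print(generate_test_case("crossroad_right_turn"))
```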
The invention has the following positive effects:
(1) The prior art mostly considers the unmanned vehicle itself when simulating automatic driving and ignores the dynamic environment elements in the simulation scene. The invention instead studies the initial positions, movement speeds, movement routes, trigger distances and the like of the dynamic environment elements in the scene, and, based on reinforcement learning, can quickly generate a series of key-scene test cases while keeping the motion of the dynamic elements consistent with reality.
(2) Training the dynamic environment elements of the simulation scene in simulation through reinforcement learning yields key scenes with a high probability of accidents such as collisions in automatic driving, avoids invalid actions, alleviates the problems of excessive invalid exploration and slow training, and noticeably improves training efficiency.
(3) The reward mechanism is reasonably designed and, combined with real traffic rules, fully considers the influence of pedestrians, vehicles, traffic lights and the like on the host vehicle.
(4) Automatic driving test scene generation is focused on key scenes with few participants, such as the autonomous vehicle and one dynamic vehicle, so the probability distribution computation is simple and the model training is easy to implement.
Drawings
FIG. 1 is a flow chart of a method for generating an autopilot key scene;
FIG. 2 is a flow chart of model parameter training for three types of dynamic elements.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, belong to the scope of the present invention.
A method for generating key scenes for automatic driving based on reinforcement learning comprises the following steps:
Step 1: initialize the test scene, select a road scene from the map library, set the driving route of the host vehicle, and establish an initial probability model for each of the three types of dynamic environment elements, namely pedestrians, other running vehicles and traffic lights;
Step 2: the host vehicle starts to execute a simulation task; based on reinforcement learning, the probability model parameters of the three types of dynamic elements are trained for the selected road condition;
Step 3: the three types of dynamic elements finally obtain the optimal probability model parameters for the selected road condition, and the optimal parameters are stored in a test case library;
Step 4: steps 1-3 are repeated until the probability models of the three types of dynamic elements have been trained to their optimal parameters for all road scenes in the map library;
Step 5: combine roads from the map library into an arbitrary test map, and select the dynamic elements required by the user in the simulation environment, mainly pedestrians, other running vehicles and traffic lights;
Step 6: according to the selected roads, import the dynamic-element probability model corresponding to each road from the test case library to generate a series of key-scene test cases.
Further, step 1 specifically comprises:
the dynamic environment elements are other dynamic elements except the main vehicle in the test scene, and mainly comprise three types of pedestrians, other running vehicles and traffic lights.
The road conditions in the map library comprise a one-way lane, a two-way lane, a crossroad, a T-shaped intersection, a Y-shaped intersection, an entrance and an exit of a highway, an overpass and the like; the key scenes appearing on different road conditions are different, the situations of collision, line pressing, retrograde motion, red light running and the like can exist, different dynamic environment element models exist for each road condition, and the combination of the dynamic environment elements is equivalent to modeling of the key scenes of the road conditions.
For the initial models of the pedestrians and the vehicles, the parameters mainly comprise the elements of initial positions, movement routes, movement speeds, trigger distances and the like; the pedestrian movement route comprises straight movement along the direction of the road, crossing the road and the like, the vehicle movement route comprises straight movement, left turning, right turning, turning around, lane changing and the like, and the pedestrian movement route and the vehicle movement route have different options according to different road conditions and initial positions, but all the options are required to accord with traffic rules; for the initial model of traffic lights, it was primarily the time setting of the traffic lights, including the duration of the red, green, and yellow lights.
In one embodiment, the movement routes of the pedestrian and the vehicle have different options depending on the road condition and the initial position. For example, if a straight road is selected and the pedestrian's initial position is on the sidewalk on one side of the road, the pedestrian's movement route is to walk straight along the sidewalk; if an intersection is selected and the pedestrian's initial position is at the intersection, the pedestrian's movement route is to cross the road.
In one embodiment, the road condition is an intersection; the pedestrian's initial position is at the northeast corner of the crossroad, and its movement route crosses two roads of the intersection to reach the southwest corner; the vehicle's initial position is behind and to the right of the host vehicle, its movement route passes through the intersection together with the host vehicle, and a merging situation can occur; the traffic light times are set to 60 seconds of red, 30 seconds of green and 3 seconds of yellow.
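Expressed as a configuration sketch, this embodiment might look like the snippet below; the key names are illustrative, and only the values (the positions described in the text, the routes, and the 60/30/3-second light timing) come from the example above.

```python
# Illustrative configuration for the intersection embodiment described above.
example_scenario = {
    "road_condition": "crossroad",
    "pedestrian": {
        "init_position": "northeast_corner",
        "route": "cross_two_roads_to_southwest_corner",
    },
    "other_vehicle": {
        "init_position": "right_rear_of_host_vehicle",
        "route": "cross_intersection_with_host_vehicle",  # merging may occur
    },
    "traffic_light": {"t_red": 60, "t_green": 30, "t_yellow": 3},  # seconds
}
print(example_scenario)
```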
The detailed description of step 2 is provided in the Disclosure of the Invention above.
Further, the final test case obtained in step 4 is specifically as follows:
For the selected road condition, the state of the test scene is determined. After the three types of dynamic elements have undergone iterative training based on reinforcement learning, the model of each type of dynamic element yields the probability distribution of the key-scene configuration parameters, which include the initial position, trigger position, movement speed, movement route, traffic-light switching times and the like. Based on the probability distributions of the configuration parameters of the dynamic-element models, key-scene test cases for unmanned-vehicle simulation can be generated quickly.
Further, the map of the test case in step 5 is a free combination of arbitrary road conditions from the map library that have undergone reinforcement-learning-based iterative training.
Further, the test case generation in step 6 specifically comprises:
Step 6.1: select the required road types from the map library and combine them freely into a test-case map;
Step 6.2: set the driving route of the host vehicle, and select the types and numbers of dynamic elements;
Step 6.3: once the driving route of the host vehicle has been added to the road scene, the state S of the test scene is determined. According to the road condition and the motion trajectory of the host vehicle, the dynamic-element model M trained under state S can be found in the test case library. For the ith state S_i, the dynamic-element model M gives the probability distribution P_i of the action parameters; this distribution is randomly sampled as in step 2.4 to obtain concrete action-parameter values a_i. For pedestrians and vehicles, a_i comprises the initial position X, initial position Y, movement speed v, movement route L and trigger distance D; for traffic lights, a_i comprises the initial state light_init_state, the red-light time setting t_red, the yellow-light time setting t_yellow and the green-light time setting t_green. These concrete action-parameter values are assigned to the dynamic elements;
Step 6.4: generate the final test case.
The above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it; a person skilled in the art may modify the technical solution of the present invention or substitute equivalents without departing from its principle and scope, and the scope of protection of the present invention should be determined by the claims.

Claims (6)

1. An automatic driving key scene generation method based on reinforcement learning comprises the following steps:
1) selecting a road scene from a map library, setting a driving route of a main vehicle in a simulation system and respectively establishing a probability model for each dynamic environment element; the dynamic environment elements comprise pedestrians, other running vehicles except the main vehicle and traffic lights;
2) the simulation system controls the main vehicle to start executing a simulation task; training the probability models of all dynamic elements in the selected road scene based on a reinforcement learning technology to obtain the optimal parameters of all probability models for the selected road scene and storing the optimal parameters in a test case library;
3) steps 1)-2) are repeated to obtain the optimal parameters of each probability model for every road scene in the map library;
4) acquiring a plurality of road scenes from the map library, combining the road scenes to obtain a test map, and selecting dynamic elements required in a simulation environment;
5) importing the probability model and the corresponding optimal parameters of each dynamic element contained in the test map from a test case library to generate a key scene test case as an automatic driving key scene;
the method for training the probability model of each dynamic element in the selected road scene comprises the following steps:
21) setting the total iteration times E of model training; initializing the iteration times e to be 0;
22) setting a motion route and an initial position of each dynamic element in the selected road scene;
23) calculating the probability distribution of each dynamic element in the selected road scene according to the current state of the main vehicle;
24) randomly sampling the probability distribution of each dynamic element to obtain the action parameters of the probability model of each dynamic element in the state S;
25) testing the host vehicle with the random sampling result of step 24) as the condition, and then calculating a reward value R according to the running result of the test, wherein the reward value R is
R = Σ_{a_i ∈ ped} w1·R_ped(a_i) + Σ_{a_i ∈ c} w2·R_c(a_i) + Σ_{a_i ∈ l} w3·R_l(a_i) + Σ_{a_i ∈ r} w4·R_r(a_i) + Σ_{a_i ∈ p} w5·R_p(a_i)
where a_i is the ith dynamic element and n is the number of dynamic elements; w1, w2, w3, w4 and w5 are all non-negative weight coefficients; ped denotes the set of pedestrians in the selected road scene, c the set of other running vehicles in the selected road scene, l the set of traffic lights in the selected road scene, r the set of traffic rules violated by the host vehicle, and p the set of host-vehicle penalty terms;
the reward value
R_ped = b1·R_dis_p(a_i) + b2·R_col_p(a_i)
where b1 and b2 are non-negative weight coefficients; R_dis_p(a_i) denotes the reward obtained for the ith action element a_i according to the minimum distance dis_p between the pedestrian and the host vehicle, and R_col_p(a_i) denotes the reward obtained for the ith action element a_i according to a traffic accident col_p(a_i) between the host vehicle and the pedestrian;
the reward value
R_c = c1·R_dis_c(a_i) + c2·R_col_c(a_i)
where c1 and c2 are non-negative weight coefficients; R_dis_c(a_i) denotes the reward obtained for the ith action element a_i according to the minimum distance between the other running vehicles and the host vehicle, and R_col_c(a_i) denotes the reward obtained for the ith action element a_i according to a traffic accident between the host vehicle and other running vehicles;
the reward value R_l = f1·R_red(a_i) + f2·R_yellow(a_i), where f1 and f2 are non-negative weight coefficients, R_red(a_i) denotes the reward obtained for the ith action element a_i according to the host vehicle running a red light, and R_yellow(a_i) denotes the reward obtained according to the host vehicle running a yellow light;
R_r = g1·R_cross(a_i) + g2·R_converse(a_i) + g3·R_lane_change(a_i), where g1, g2 and g3 are non-negative weight coefficients, R_cross(a_i) denotes the reward obtained for the ith action element a_i according to the host vehicle driving over the line, R_converse(a_i) denotes the reward obtained according to the host vehicle driving against traffic, and R_lane_change(a_i) denotes the reward obtained according to an illegal lane change by the host vehicle;
the penalty term
R_p(a_i) = -RP, if min_{q ∈ h_i} ||ρ_0 - q|| > γ; 0, otherwise
where h_i is the driving route of the host vehicle in state s_i, ρ_0 is the position of the dynamic element, γ is a set threshold value, and RP is a driving-state reward value;
26) optimizing the probability model of the dynamic elements with the policy gradient method, wherein the objective function used for the optimization is determined from the reward value as
J(φ) = E_{a~π_φ}[R(a)]
where a = (a_1, ..., a_n) is an action sampled from the policy distribution π_φ, and E is the expectation function;
27) increasing the iteration counter e by 1; when e is smaller than E, returning to step 22); when e equals E, the training of the probability models of the dynamic elements is complete.
2. The method of claim 1, wherein the probability distribution for pedestrians and vehicles is
P(a_1, ..., a_n | S) = ∏_{i=1}^{n} P(a_i | S, a_1, ..., a_{i-1})
where the dynamic element a_i comprises the initial position (X, Y), movement route L, movement speed V and trigger distance D of the dynamic element, and the probability of the dynamic element a_i is
P(a_i | S, a_1, ..., a_{i-1}) = P(x_i | S, a_1, ..., a_{i-1}) · P(y_i | S, a_1, ..., a_{i-1}, x_i) · P(l_i | S, a_1, ..., a_{i-1}, x_i, y_i) · P(v_i | S, a_1, ..., a_{i-1}, x_i, y_i, l_i) · P(d_i | S, a_1, ..., a_{i-1}, x_i, y_i, l_i, v_i)
the probability distribution of the traffic-light durations is
P(light_init_index, t_red, t_green, t_yellow | S, a_1, ..., a_{i-1}) = P(light_init_index | S, a_1, ..., a_{i-1}) · P(t_red | S, a_1, ..., a_{i-1}, light_init_index) · P(t_green | S, a_1, ..., a_{i-1}, light_init_index, t_red) · P(t_yellow | S, a_1, ..., a_{i-1}, light_init_index, t_red, t_green)
where light_init_index, t_red, t_green and t_yellow respectively denote the initial state, red-light duration, green-light duration and yellow-light duration of the traffic light; the conditional probability densities of the traffic-light durations, namely P(t_red | S, a_1, ..., a_{i-1}, light_init_index), P(t_green | S, a_1, ..., a_{i-1}, light_init_index, t_red) and P(t_yellow | S, a_1, ..., a_{i-1}, light_init_index, t_red, t_green), are each modeled with a Gaussian distribution.
3. The method according to claim 2, wherein the initial state li_init_state of the movement route in each road scene is obtained by discretizing a continuous random variable li_init_index, li_init_state being the initial state of the movement route of the ith road scene l_i, and the conditional probability density of li_init_index is modeled by a probability density function on the interval [0,1] constructed by a neural network; the initial state of the traffic light is obtained by discretizing the continuous random variable light_init_index, and the conditional probability density of light_init_index is modeled by a probability density function on the interval [0,1] constructed by a neural network.
4. The method of claim 3, wherein the step 24) of randomly sampling the probability distribution of each dynamic element comprises:
241) for continuous random variables, modeling the dynamic elements with a Gaussian distribution; for discrete random variables, modeling the dynamic elements with a multinomial distribution;
242) sampling the probability distributions of the discrete random variables: a) for the movement routes of pedestrians and other running vehicles, first performing probability sampling on the route initial index li_init_index, then constructing the discrete random variable li_init_state from the continuous random variable li_init_index, where selecting the kth route corresponds, from continuous to discrete, to li_init_state = k with li_init_index ∈ ((k-1)/N, k/N); b) for the initial state of the traffic light, first performing probability sampling on light_init_index, then mapping light_init_index to light_init_state to construct the random variable light_init_state of the initial state of the traffic light.
5. The method of claim 4, wherein, for the initial state of the traffic light, the mapping from light_init_index to light_init_state assigns light_init_state = k when light_init_index ∈ ((k-1)/3, k/3], with k = 1, 2, 3 corresponding to the red, green and yellow initial states.
6. An automatic driving test method, characterized in that a simulation system adopts the automatic driving key scene obtained by the method of claim 1 to test a target main vehicle.
CN202110082493.8A 2021-01-21 2021-01-21 Automatic driving key scene generation method based on reinforcement learning Active CN112784485B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110082493.8A CN112784485B (en) 2021-01-21 2021-01-21 Automatic driving key scene generation method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110082493.8A CN112784485B (en) 2021-01-21 2021-01-21 Automatic driving key scene generation method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN112784485A CN112784485A (en) 2021-05-11
CN112784485B (en) 2021-09-10

Family

ID=75758033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110082493.8A Active CN112784485B (en) 2021-01-21 2021-01-21 Automatic driving key scene generation method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN112784485B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022221979A1 (en) * 2021-04-19 2022-10-27 华为技术有限公司 Automated driving scenario generation method, apparatus, and system
CN113485300B (en) * 2021-07-15 2022-10-04 南京航空航天大学 Automatic driving vehicle collision test method based on reinforcement learning
CN113823096B (en) * 2021-11-25 2022-02-08 禾多科技(北京)有限公司 Random traffic flow obstacle object arrangement method for simulation test
CN115257891B (en) * 2022-05-27 2024-06-04 浙江众合科技股份有限公司 CBTC scene test method based on integration of key position extraction and random position
CN115630583B (en) * 2022-12-08 2023-04-14 西安深信科创信息技术有限公司 Method, device, equipment and medium for generating simulated vehicle driving state

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111159832A (en) * 2018-10-19 2020-05-15 百度在线网络技术(北京)有限公司 Construction method and device of traffic information flow
CN111983934A (en) * 2020-06-28 2020-11-24 中国科学院软件研究所 Unmanned vehicle simulation test case generation method and system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111159832A (en) * 2018-10-19 2020-05-15 百度在线网络技术(北京)有限公司 Construction method and device of traffic information flow
CN111983934A (en) * 2020-06-28 2020-11-24 中国科学院软件研究所 Unmanned vehicle simulation test case generation method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Scenarios for Development, Test and Validation; Till Menzel et al.; 2018 IEEE Intelligent Vehicles Symposium (IV); 2018-06-30; full text *
面向决策规划***测试的具体场景自动化生成方法 (Automated generation method of concrete scenarios for decision and planning *** testing); Chen Junyi et al.; 汽车技术 (Automotive Technology); 2020-12-31 (No. 10); full text *

Also Published As

Publication number Publication date
CN112784485A (en) 2021-05-11

Similar Documents

Publication Publication Date Title
CN112784485B (en) Automatic driving key scene generation method based on reinforcement learning
Chen et al. Deep imitation learning for autonomous driving in generic urban scenarios with enhanced safety
CN110647839B (en) Method and device for generating automatic driving strategy and computer readable storage medium
US11243532B1 (en) Evaluating varying-sized action spaces using reinforcement learning
Nishi et al. Merging in congested freeway traffic using multipolicy decision making and passive actor-critic learning
WO2022052406A1 (en) Automatic driving training method, apparatus and device, and medium
CN109709956B (en) Multi-objective optimized following algorithm for controlling speed of automatic driving vehicle
Khodayari et al. A modified car-following model based on a neural network model of the human driver effects
US20230124864A1 (en) Graph Representation Querying of Machine Learning Models for Traffic or Safety Rules
CN112888612A (en) Autonomous vehicle planning
CN113044064B (en) Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning
JP6916552B2 (en) A method and device for detecting a driving scenario that occurs during driving and providing information for evaluating a driver's driving habits.
CN112508164B (en) End-to-end automatic driving model pre-training method based on asynchronous supervised learning
CN114035575B (en) Unmanned vehicle motion planning method and system based on semantic segmentation
Chen et al. Continuous decision making for on-road autonomous driving under uncertain and interactive environments
CN110956851A (en) Intelligent networking automobile cooperative scheduling lane changing method
Qiao et al. Behavior planning at urban intersections through hierarchical reinforcement learning
Pal et al. Emergent road rules in multi-agent driving environments
Sun et al. Human-like highway trajectory modeling based on inverse reinforcement learning
CN117373243A (en) Three-dimensional road network traffic guidance and emergency rescue collaborative management method for underground roads
Wen et al. Modeling human driver behaviors when following autonomous vehicles: An inverse reinforcement learning approach
CN114701517A (en) Multi-target complex traffic scene automatic driving solution based on reinforcement learning
Mohammed et al. Reinforcement learning and deep neural network for autonomous driving
CN115031753A (en) Driving condition local path planning method based on safety potential field and DQN algorithm
Alighanbari et al. Safe adaptive deep reinforcement learning for autonomous driving in urban environments. additional filter? How and where?

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant