CN113561995A - Automatic driving decision method based on multi-dimensional reward architecture deep Q learning - Google Patents


Info

Publication number
CN113561995A
Authority
CN
China
Prior art keywords
reward
driving
value
network
comfort
Prior art date
Legal status
Granted
Application number
CN202110956262.5A
Other languages
Chinese (zh)
Other versions
CN113561995B (en)
Inventor
崔建勋
张瞫
刘昕
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Priority to CN202110956262.5A
Publication of CN113561995A
Application granted
Publication of CN113561995B
Legal status: Active


Classifications

    • B60W60/001: Planning or execution of driving tasks (under B60W60/00, drive control systems specially adapted for autonomous road vehicles)
    • B60W50/00: Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0019: Control system elements or transfer functions (under B60W2050/0001, details of the control system)
    • G06N3/044: Recurrent networks, e.g. Hopfield networks (under G06N3/04, neural network architecture)
    • G06N3/08: Learning methods (under G06N3/02, neural networks)


Abstract

An automatic driving decision method based on multi-dimensional reward architecture deep Q learning, belonging to the technical field of automatic driving, solves the problem that existing driving decision methods cannot optimize multiple performance dimensions simultaneously. The method collects environmental information of an automatic driving vehicle in real time with a vision sensor and a LIDAR sensor to acquire image information and/or point cloud information; inputs the image information and/or point cloud information into a deep Q-value network with a multi-reward architecture to obtain reward evaluation values of the driving decision under three dimensions: safety, efficiency and comfort; sums the reward evaluation values under the three dimensions to obtain a total driving strategy reward evaluation value; and analyzes the driving strategy reward evaluation value with an epsilon-greedy algorithm to obtain the optimal decision action. The method is suitable for generating control strategies for automatic driving vehicles.

Description

Automatic driving decision method based on multi-dimensional reward architecture deep Q learning
Technical Field
The invention belongs to the technical field of automatic driving.
Background
The automatic driving decision is a crucial link in the overall chain of automatic driving technology. It takes state perception as input and outputs a driving decision for a specific traffic scene, serving the subsequent motion planning and vehicle control of the automatic driving vehicle. The intelligence of the automatic driving decision directly determines the level and quality of driving automation.
Traditionally, there are two general implementations of automatic driving decision-making: rule-based and learning-based methods. Rule-based methods manually enumerate the possible traffic driving states, specify a corresponding driving decision for each state, and store these as a rule set; when the automatic driving vehicle encounters a driving state contained in the rule set, the corresponding decision behavior is triggered.
The greatest benefit of this approach is safety and controllability, since everything stays within the scope of human design and understanding. Its biggest problem, however, is that it is impractical to enumerate all the traffic states an autonomous vehicle may encounter; for states not defined in the rule set, the vehicle will not know how to decide. In other words, the approach does not generalize.
Learning-based methods overcome exactly this generalization difficulty of rule-based methods: a decision model can be trained from state and action samples of some scenes, so that reasonable decision actions are generated even when unknown scenes are encountered. Among learning-based methods, decision-making based on reinforcement learning shows particularly strong potential. Reinforcement learning allows the autonomous vehicle to interact with the environment continuously, so that its decision level keeps improving through autonomous exploration. However, automatic driving decision-making is a complex behavior involving many target dimensions, such as safety, comfort, efficiency and economy, whereas conventional reinforcement-learning-based decision methods generally use a single accumulated benefit estimation function to estimate the combined benefit across dimensions. Because the combined benefit of multiple dimensions is considered jointly, it cannot be guaranteed that each dimension achieves its best benefit.
Disclosure of Invention
The invention aims to solve the problem that multi-dimensional performance cannot be optimized simultaneously in the existing driving decision method, and provides an automatic driving decision method based on multi-dimensional reward architecture deep Q learning.
The invention relates to an automatic driving decision method based on multidimensional reward architecture deep Q learning, which comprises the following steps:
Step one: collecting environmental information of the automatic driving vehicle in real time using a vision sensor and a LIDAR sensor, and acquiring image information and/or point cloud information;
Step two: inputting the image information and/or point cloud information into a deep Q-value network of a multi-reward architecture to obtain reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort;
Step three: summing the reward evaluation values of the driving decision under the three dimensions to obtain a total driving strategy reward evaluation value;
Step four: analyzing the driving strategy reward evaluation value with an epsilon-greedy algorithm to obtain the optimal decision action.
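As an illustration of steps three and four, the following is a minimal Python/PyTorch sketch of how the three per-dimension reward evaluation values could be summed and fed to epsilon-greedy action selection. The network object `mra_dqn`, the tuple `state`, and the action ordering are assumptions introduced for illustration only.

```python
import random
import torch

# Assumed discrete action set (cf. the five strategy actions of claim 3).
ACTIONS = ["accelerate", "decelerate", "turn_right", "turn_left", "do_nothing"]

def select_action(mra_dqn, state, epsilon=0.1):
    """Steps three and four: sum per-dimension Q estimates, then epsilon-greedy.

    `mra_dqn` is assumed to map the perceived state (a tuple of sensor
    tensors, e.g. image and/or point cloud) to three tensors of shape (5,):
    the reward evaluation values of the five actions under the safety,
    efficiency and comfort dimensions obtained in step two.
    """
    with torch.no_grad():
        q_safety, q_efficiency, q_comfort = mra_dqn(*state)
        # Step three: total driving-strategy reward evaluation
        # Q_MRA(s, a) = Q_s(s, a) + Q_o(s, a) + Q_l(s, a)
        q_total = q_safety + q_efficiency + q_comfort
    # Step four: explore with probability epsilon, otherwise pick the action
    # with the largest total reward evaluation value.
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    return int(torch.argmax(q_total).item())
```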
Further, in step two of the present invention, the specific method for obtaining the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort is as follows:
when the acquired data is only image information:
the image information is subjected to regularization processing, and the regularized data is sequentially input to the convolutional layers and the fully connected layers of the deep Q-value network of the multi-reward architecture to obtain the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving;
when the acquired data is only point cloud data:
the point cloud data is sequentially input into a recurrent neural network (LSTM) and a fully connected layer to obtain the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving;
when the received data contains both point cloud data and image information:
the point cloud data is input into the recurrent neural network, the image information is regularized and then input into the convolutional layers, the outputs of the recurrent neural network and the convolutional layers are concatenated and input into the fully connected layer, and the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving are obtained.
Further, in the present invention, the driving strategy in step two consists of five strategy actions: accelerate, decelerate, turn right, turn left, and do nothing.
Further, in the present invention, the deep Q-value network of the multi-reward architecture includes three reward functions: a safety reward function, an efficiency reward function, and a comfort reward function.
Further, in the present invention, the safety reward function is:
(Equation image BDA0003220391990000021: piecewise definition of the safety reward R_s(s, a).)
where r_s is a preset safety reward constant with a positive value, and R_s(s, a) is the safety reward function value obtained by taking action a in the current environment s.
Further, in the present invention, the efficiency reward function is:
(Equation image BDA0003220391990000022: piecewise definition of the efficiency reward R_o(s, a).)
where r_o is a preset efficiency reward constant with a positive value, and R_o(s, a) is the efficiency reward function value obtained by taking action a in the current environment s.
Further, in the present invention, the comfort reward function is:
(Equation image BDA0003220391990000031: piecewise definition of the comfort reward R_l(s, a).)
where r_l is a preset comfort reward constant with a positive value, and R_l(s, a) is the comfort reward function value obtained by taking action a in the current environment s.
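Since the piecewise conditions of R_s, R_o and R_l appear only as equation images in this text, the following Python sketch is an illustrative assumption rather than the patent's exact definitions. It ties safety to a collision flag, efficiency to closeness to a desired speed, and comfort to lane-change occurrence, consistent with the qualitative hints given later in the description, and uses the preset positive constants r_s, r_o and r_l.

```python
def safety_reward(collided: bool, r_s: float = 1.0) -> float:
    # Illustrative assumption: a collision forfeits the preset positive
    # safety constant r_s, otherwise the constant is granted.
    return -r_s if collided else r_s

def efficiency_reward(speed: float, desired_speed: float, r_o: float = 1.0) -> float:
    # Illustrative assumption: the closer the vehicle drives to the desired
    # speed, the larger the efficiency reward (scaled by the constant r_o).
    return r_o * (1.0 - abs(speed - desired_speed) / max(desired_speed, 1e-6))

def comfort_reward(lane_changed: bool, r_l: float = 1.0) -> float:
    # Illustrative assumption: fewer lane changes mean higher comfort, so a
    # lane change forfeits the preset positive comfort constant r_l.
    return -r_l if lane_changed else r_l
```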
Further, in the present invention, the deep Q-value network of the multi-reward architecture in step one is trained, and the specific training method is as follows:
s1, selecting an input environmental state sample S, inputting the sample S into a depth Q value network of a multi-reward architecture to be trained, and obtaining reward evaluation values of driving decisions in three dimensions of comprehensiveness, efficiency and comfort;
step S2, calculating a sum Q of reward valuations of driving decisions in three dimensionsRAMObtaining Q of any one of actions aRAMObtaining a parameter set to be trained of the deep network;
step S3, determining a loss objective function according to a parameter set to be trained of the deep network, minimizing the objective function, and obtaining a driving strategy of the next step;
and S4, acquiring the driving strategy in the step S3, updating the environmental state sample S of the arranged S1, and returning to execute the step S1 until the minimized value of the objective function converges.
Further, in the present invention, the loss objective function in step S3 is:
(Equation image BDA0003220391990000032: definition of the loss objective function L(θ_k).)
where θ_k, k ∈ {s, o, l}, is the parameter set to be trained of the deep network under the safety, efficiency or comfort dimension; s, a, s', a' respectively denote the environmental state at the current moment, the action taken in the current environmental state, the environmental state at the next moment, and the action taken in the environmental state at the next moment; and Q_k(s, a; θ_k), k ∈ {s, o, l}, denotes the cumulative expected total reward that can be obtained in the future over reward dimension k when action a is taken in environmental state s with parameter set θ_k.
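The loss equation itself is given only as an image in this text. A common choice for the per-dimension loss in this setting is the squared temporal-difference error, L(θ_k) = E[(R_k(s, a) + γ max_{a'} Q_k(s', a'; θ_k) - Q_k(s, a; θ_k))²] with discount factor γ; the PyTorch-style sketch of steps S2 and S3 below uses that assumption, and the batch layout, optimizer handling and value of γ are likewise assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(mra_dqn, optimizer, batch, gamma=0.99):
    """One update of the multi-reward deep Q-value network (steps S2-S3).

    `batch` is assumed to hold: `state` and `next_state` (tuples of sensor
    tensors), `action` (a batch of action indices) and `rewards` (a list of
    three per-dimension reward tensors [R_s, R_o, R_l]).
    """
    state, action, rewards, next_state = batch
    q_dims = mra_dqn(*state)            # three (batch, 5) tensors: Q_s, Q_o, Q_l
    q_dims_next = mra_dqn(*next_state)
    loss = 0.0
    for r_k, q_k, q_k_next in zip(rewards, q_dims, q_dims_next):
        # Q_k(s, a; theta_k) for the actions actually taken
        q_sa = q_k.gather(1, action.unsqueeze(1)).squeeze(1)
        # Assumed TD target: R_k(s, a) + gamma * max_a' Q_k(s', a'; theta_k)
        target = r_k + gamma * q_k_next.max(dim=1).values.detach()
        loss = loss + F.mse_loss(q_sa, target)   # per-dimension losses summed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```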
Further, in the present invention, in step three, the reward evaluation values for driving decisions in three dimensions are summed as:
Q_MRA(s, a; θ) = Σ_{k ∈ {s, o, l}} Q_k(s, a; θ_k)
where Q_MRA(s, a; θ) is the sum of the reward evaluation values of the driving decision under the three dimensions, θ is the parameter set of the deep network to be trained, and θ_k is the parameter set to be trained of the deep network under the safety, efficiency or comfort dimension.
The core of the method is a deep Q evaluation network with a multi-reward architecture (MRA-DQN) that trains an accumulated benefit evaluation function independently for each dimension; that is, instead of a single coupled estimate, the multi-reward architecture evaluates the benefit of each dimension separately, further improving the performance of the reinforcement-learning-based automatic driving decision model. The invention also designs a dedicated evaluation network structure for the benefit of each of the three dimensions of safety, efficiency and comfort. The three dimensions share a bottom deep network, and each then has its own unique deep network. The network of each dimension estimates the accumulated reward of its own dimension; these estimates are then aggregated into the total benefit Q_MRA, which finally determines the decision behavior to adopt, so that the optimal strategy is obtained through multi-dimensional consideration.
Drawings
FIG. 1 is a diagram of the deep Q network architecture of the multi-reward architecture of the present invention;
FIG. 2 is the deep Q network architecture with only image data collected by the camera sensor as input;
FIG. 3 is the deep Q network architecture with only 3D point cloud data collected by the LIDAR sensor as input;
FIG. 4 is the deep Q network architecture with both image and 3D point cloud perception data as input.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The first embodiment is as follows: this embodiment is described with reference to FIGS. 1 to 4. The automatic driving decision method based on multi-dimensional reward architecture deep Q learning according to this embodiment comprises:
Step one: collecting environmental information of the automatic driving vehicle in real time using a vision sensor and a LIDAR sensor, and acquiring image information and/or point cloud information;
Step two: inputting the image information and/or point cloud information into a deep Q-value network of a multi-reward architecture to obtain reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort;
Step three: summing the reward evaluation values of the driving decision under the three dimensions to obtain a total driving strategy reward evaluation value;
Step four: analyzing the driving strategy reward evaluation value with an epsilon-greedy algorithm to obtain the optimal decision action.
The invention provides an automatic driving decision method based on multi-reward architecture deep Q learning, whose core is a deep Q evaluation network (MRA-DQN), as shown in FIG. 1. The network receives perception inputs from the automatic driving sensors, including images from the vision sensor and point clouds from the LIDAR sensor, and outputs the cumulative reward values of the 5 automatic driving decisions (accelerate, decelerate, turn right, turn left, do nothing). The finally executed action is then determined from the magnitudes of the cumulative reward values using the epsilon-greedy strategy.
Further, in the second embodiment, in the second step, a specific method for obtaining the reward evaluation value of the driving decision in the three dimensions of safety, efficiency and comfort is as follows:
when the acquired data is only image information:
the image information is subjected to regularization processing, and the regularized data is sequentially input to the convolutional layers and the fully connected layers of the deep Q-value network of the multi-reward architecture to obtain the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving;
when the acquired data is only point cloud data:
the point cloud data is sequentially input into a recurrent neural network (LSTM) and a fully connected layer to obtain the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving;
when the received data contains both point cloud data and image information:
the point cloud data is input into the recurrent neural network, the image information is regularized and then input into the convolutional layers, the outputs of the recurrent neural network and the convolutional layers are concatenated and input into the fully connected layer, and the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving are obtained.
The Q network described in this embodiment is used to estimate the future cumulative benefit of a state-action pair, and is the key to automated driving behavior decision-making. The invention adopts a deep neural network to approximate the Q function. In order to accommodate the two types of autopilot sensor inputs of camera and LIDAR, three types of Q-deep neural networks are designed in the present invention, as shown in fig. 2, 3 and 4, respectively.
The input module provides the state perception information necessary for automatic driving decisions. A forward-looking camera, a 360-degree LIDAR, or a combination of the two may be employed here. The camera provides image data and the LIDAR provides 3D point cloud data.
(1) The deep Q network architecture with only image data collected by the camera sensor as input (FIG. 2). The input image is 80x80 pixels; it is first regularized by a Normalization layer and then passes through a series of convolutional layers, as shown in FIG. 2, which are shared by the Q-value functions of the three dimensions. After the shared convolutional layers, the data enters the respective fully connected layers of each dimension, which then output the Q values of the 5 discrete actions (accelerate, decelerate, turn right, turn left, do nothing) corresponding to that dimension. The Q values of the three dimensions are denoted Q_s, Q_o and Q_l, associated with safety, efficiency (driving closer to the desired speed means higher efficiency) and comfort (fewer lane changes mean higher comfort), each dimension having its own reward function (R_s, R_o, R_l).
(2) The deep Q network architecture with only 3D point cloud data collected by the LIDAR sensor as input (FIG. 3). The input is point cloud data collected by the LIDAR; because this data is time-sequential, the underlying network shared by the Q-value networks of the different dimensions adopts a recurrent neural network (LSTM) with 512 internal neurons. Similarly, the input point cloud data passes through the shared LSTM network, enters the own network of each dimension, and the accumulated benefit estimates of each dimension for the 5 discrete actions are output.
(3) The deep Q network architecture with both image and 3D point cloud perception data as input (FIG. 4). The inputs are image data collected by the camera and point cloud data collected by the LIDAR. The image data is processed by the convolutional network and the 3D point cloud data by the recurrent neural network (LSTM); a splicing (concatenation) layer then joins the feature-extraction results of the two, the joined features enter the network of each dimension, and the accumulated benefit estimates of each dimension for the 5 discrete actions are finally output.
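The following is a minimal PyTorch sketch of the fused architecture of FIG. 4, using the sizes stated in the description (80x80 input images, a 512-unit LSTM) where given; the convolution sizes, hidden widths, image channel count and point-cloud feature dimension are assumptions, since the figures are not reproduced here.

```python
import torch
import torch.nn as nn

class MRADQN(nn.Module):
    """Sketch of the multi-reward-architecture deep Q network (cf. FIG. 4).

    A shared convolutional trunk handles normalized 80x80 camera images and a
    shared 512-unit LSTM handles the time-sequenced LIDAR point-cloud data.
    Their features are concatenated (the splicing layer) and fed to three
    dimension-specific fully connected heads, each emitting Q values for the
    5 discrete actions.
    """
    def __init__(self, point_dim=3, n_actions=5):
        super().__init__()
        # Shared convolutional layers (sizes are assumptions).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        conv_out = 64 * 8 * 8  # 80 -> 19 -> 8 with the strides above
        # Shared 512-unit LSTM for the point-cloud sequence.
        self.lstm = nn.LSTM(input_size=point_dim, hidden_size=512, batch_first=True)
        # One head per reward dimension: safety (s), efficiency (o), comfort (l).
        def head():
            return nn.Sequential(nn.Linear(conv_out + 512, 256), nn.ReLU(),
                                 nn.Linear(256, n_actions))
        self.head_safety = head()
        self.head_efficiency = head()
        self.head_comfort = head()

    def forward(self, image, points):
        img_feat = self.conv(image)               # (batch, conv_out)
        _, (h_n, _) = self.lstm(points)           # h_n: (1, batch, 512)
        fused = torch.cat([img_feat, h_n.squeeze(0)], dim=1)  # splicing layer
        return (self.head_safety(fused),
                self.head_efficiency(fused),
                self.head_comfort(fused))
```

An instance of this class could serve as the `mra_dqn` assumed in the earlier selection and training sketches, with the state given as the (image, points) pair; the image-only and LIDAR-only variants of FIGS. 2 and 3 correspond to keeping only the convolutional or only the LSTM branch.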
Further, in the present embodiment, the driving strategy in step two consists of five strategy actions: accelerate, decelerate, turn right, turn left, and do nothing.
Further, in this embodiment, the deep Q-value network of the multi-reward architecture includes three reward functions: a safety reward function, an efficiency reward function, and a comfort reward function.
Further, in the present embodiment, the safety reward function is:
(Equation image BDA0003220391990000061: piecewise definition of the safety reward R_s(s, a).)
where r_s is a preset safety reward constant with a positive value, and R_s(s, a) is the safety reward function value obtained by taking action a in the current environment s.
Further, in the present embodiment, the efficiency reward function is:
(Equation image BDA0003220391990000062: piecewise definition of the efficiency reward R_o(s, a).)
where r_o is a preset efficiency reward constant with a positive value, and R_o(s, a) is the efficiency reward function value obtained by taking action a in the current environment s.
Further, in the present embodiment, the comfort reward function is:
(Equation image BDA0003220391990000063: piecewise definition of the comfort reward R_l(s, a).)
where r_l is a preset comfort reward constant with a positive value, and R_l(s, a) is the comfort reward function value obtained by taking action a in the current environment s.
The input module provides the state perception information necessary for automatic driving decisions. The present invention may employ a forward-looking camera, a 360-degree LIDAR, or a combination of the two. The camera provides image data and the LIDAR provides 3D point cloud data.
Further, in this embodiment, the deep Q-value network of the multi-reward architecture in step one is trained, and the specific training method is as follows:
s1, selecting an input environmental state sample S, inputting the sample S into a depth Q value network of a multi-reward architecture to be trained, and obtaining reward evaluation values of driving decisions in three dimensions of comprehensiveness, efficiency and comfort;
step S2, calculating a sum Q of reward valuations of driving decisions in three dimensionsRAMObtaining Q of any one of actions aRAMValue, obtainTaking a parameter set to be trained of the deep network;
step S3, determining a loss objective function according to a parameter set to be trained of the deep network, minimizing the objective function, and obtaining a driving strategy of the next step;
and S4, acquiring the driving strategy in the step S3, updating the environmental state sample S of the arranged S1, and returning to execute the step S1 until the minimized value of the objective function converges.
Further, in the present embodiment, the loss objective function in step S3 is:
(Equation image BDA0003220391990000071: definition of the loss objective function L(θ_k).)
where θ_k, k ∈ {s, o, l}, is the parameter set to be trained of the deep network under the safety, efficiency or comfort dimension; s, a, s', a' respectively denote the environmental state at the current moment, the action taken in the current environmental state, the environmental state at the next moment, and the action taken in the environmental state at the next moment; and Q_k(s, a; θ_k), k ∈ {s, o, l}, denotes the cumulative expected total reward that can be obtained in the future over reward dimension k when action a is taken in environmental state s with parameter set θ_k.
Further, in the third step, the reward evaluation values for driving decisions in three dimensions are summed as:
Q_MRA(s, a; θ) = Σ_{k ∈ {s, o, l}} Q_k(s, a; θ_k)
where Q_MRA(s, a; θ) is the sum of the reward evaluation values of the driving decision under the three dimensions, θ is the parameter set of the deep network to be trained, and θ_k is the parameter set to be trained of the deep network under the safety, efficiency or comfort dimension.
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that features described in different dependent claims and herein may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with individual embodiments may be used in other described embodiments.

Claims (10)

1. An automatic driving decision method based on multi-dimensional reward architecture deep Q learning is characterized by comprising the following steps:
Step one: collecting environmental information of the automatic driving vehicle in real time using a vision sensor and a LIDAR sensor, and acquiring image information and/or point cloud information;
Step two: inputting the image information and/or point cloud information into a deep Q-value network of a multi-reward architecture to obtain reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort;
Step three: summing the reward evaluation values of the driving decision under the three dimensions to obtain a total driving strategy reward evaluation value;
Step four: analyzing the driving strategy reward evaluation value with an epsilon-greedy algorithm to obtain the optimal decision action.
2. The automatic driving decision method based on the multidimensional reward architecture deep Q learning as claimed in claim 1, wherein in the second step, the specific method for obtaining the reward evaluation value of the driving decision in three dimensions of safety, efficiency and comfort is as follows:
when the acquired data is only image information:
the image information is subjected to regularization processing, and the regularized data is sequentially input to the convolutional layers and the fully connected layers of the deep Q-value network of the multi-reward architecture to obtain the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving;
when the acquired data is only point cloud data:
the point cloud data is sequentially input into a recurrent neural network and a fully connected layer of the deep Q-value network of the multi-reward architecture to obtain the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving;
when the received data contains both point cloud data and image information:
the point cloud data is input into the recurrent neural network of the deep Q-value network of the multi-reward architecture, the image information is regularized and then input into the convolutional layers of the deep Q-value network of the multi-reward architecture, the outputs of the recurrent neural network and the convolutional layers are concatenated and input into the fully connected layer, and the reward evaluation values of the driving decision under the three dimensions of safety, efficiency and comfort of automatic driving are obtained.
3. The method as claimed in claim 1 or 2, wherein the driving strategy in step two consists of five strategy actions: accelerate, decelerate, turn right, turn left, and do nothing.
4. The method as claimed in claim 3, wherein the deep Q-value network of the multi-reward architecture comprises three reward functions: a safety reward function, an efficiency reward function, and a comfort reward function.
5. The method according to claim 4, wherein the safety reward function is:
(Equation image FDA0003220391980000021: piecewise definition of the safety reward R_s(s, a).)
where r_s is a preset safety reward constant with a positive value, and R_s(s, a) is the safety reward function value obtained by taking action a in the current environment s.
6. The method of claim 4, wherein the efficiency reward function is:
(Equation image FDA0003220391980000022: piecewise definition of the efficiency reward R_o(s, a).)
where r_o is a preset efficiency reward constant with a positive value, and R_o(s, a) is the efficiency reward function value obtained by taking action a in the current environment s.
7. The method of claim 4, wherein the comfort reward function is:
(Equation image FDA0003220391980000023: piecewise definition of the comfort reward R_l(s, a).)
where r_l is a preset comfort reward constant with a positive value, and R_l(s, a) is the comfort reward function value obtained by taking action a in the current environment s.
8. The automatic driving decision method based on the multidimensional reward architecture deep Q learning as claimed in claim 1, wherein the deep Q value network of the multidimensional reward architecture in the first step is trained, and the specific method for training is as follows:
s1, selecting an input environmental state sample S, inputting the sample S into a depth Q value network of a multi-reward architecture to be trained, and obtaining reward evaluation values of driving decisions in three dimensions of comprehensiveness, efficiency and comfort;
step S2, calculating a sum Q of reward valuations of driving decisions in three dimensionsRAMTo obtain any one ofQ of action aRAMObtaining a parameter set to be trained of the deep network;
step S3, determining a loss objective function according to a parameter set to be trained of the deep network, minimizing the objective function, and obtaining a driving strategy of the next step;
and step S4, executing the driving strategy described in step S3, updating the environmental state sample S after arrangement of S1, and returning to execute step S1 until the minimized value of the objective function converges.
9. The method for automatic driving decision based on multidimensional reward architecture deep Q learning as claimed in claim 8, wherein the loss objective function in step S3 is:
(Equation image FDA0003220391980000031: definition of the loss objective function L(θ_k).)
where L(θ_k) is the value of the objective function; θ_k, k ∈ {s, o, l}, is the parameter set to be trained of the deep network under the safety, efficiency or comfort dimension; s, a, s', a' respectively denote the environmental state at the current moment, the action taken in the current environmental state, the environmental state at the next moment, and the action taken in the environmental state at the next moment; and Q_k(s, a; θ_k), k ∈ {s, o, l}, denotes the cumulative expected total reward that can be obtained in the future over reward dimension k when action a is taken in environmental state s with parameter set θ_k.
10. The method according to claim 9, wherein in step three, the reward evaluation values for the driving decision in three dimensions are summed as:
Q_MRA(s, a; θ) = Σ_{k ∈ {s, o, l}} Q_k(s, a; θ_k)
where Q_MRA(s, a; θ) is the sum of the reward evaluation values of the driving decision under the three dimensions, θ is the parameter set of the deep network to be trained, and θ_k is the parameter set to be trained of the deep network under the safety, efficiency or comfort dimension.
CN202110956262.5A 2021-08-19 2021-08-19 Automatic driving decision method based on multi-dimensional reward architecture deep Q learning Active CN113561995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110956262.5A CN113561995B (en) 2021-08-19 2021-08-19 Automatic driving decision method based on multi-dimensional reward architecture deep Q learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110956262.5A CN113561995B (en) 2021-08-19 2021-08-19 Automatic driving decision method based on multi-dimensional reward architecture deep Q learning

Publications (2)

Publication Number Publication Date
CN113561995A 2021-10-29
CN113561995B CN113561995B (en) 2022-06-21

Family

ID=78172192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110956262.5A Active CN113561995B (en) 2021-08-19 2021-08-19 Automatic driving decision method based on multi-dimensional reward architecture deep Q learning

Country Status (1)

Country Link
CN (1) CN113561995B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151564A1 (en) * 2018-11-12 2020-05-14 Honda Motor Co., Ltd. System and method for multi-agent reinforcement learning with periodic parameter sharing
CN114802307A (en) * 2022-05-23 2022-07-29 哈尔滨工业大学 Intelligent vehicle transverse control method under automatic and manual hybrid driving scene

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110850854A (en) * 2018-07-27 2020-02-28 通用汽车环球科技运作有限责任公司 Autonomous driver agent and policy server for providing policies to autonomous driver agents
CN111098852A (en) * 2019-12-02 2020-05-05 北京交通大学 Parking path planning method based on reinforcement learning
US20200175364A1 (en) * 2017-05-19 2020-06-04 Deepmind Technologies Limited Training action selection neural networks using a differentiable credit function
CN111275249A (en) * 2020-01-15 2020-06-12 吉利汽车研究院(宁波)有限公司 Driving behavior optimization method based on DQN neural network and high-precision positioning
CN111898211A (en) * 2020-08-07 2020-11-06 吉林大学 Intelligent vehicle speed decision method based on deep reinforcement learning and simulation method thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200175364A1 (en) * 2017-05-19 2020-06-04 Deepmind Technologies Limited Training action selection neural networks using a differentiable credit function
CN110850854A (en) * 2018-07-27 2020-02-28 通用汽车环球科技运作有限责任公司 Autonomous driver agent and policy server for providing policies to autonomous driver agents
CN111098852A (en) * 2019-12-02 2020-05-05 北京交通大学 Parking path planning method based on reinforcement learning
CN111275249A (en) * 2020-01-15 2020-06-12 吉利汽车研究院(宁波)有限公司 Driving behavior optimization method based on DQN neural network and high-precision positioning
CN111898211A (en) * 2020-08-07 2020-11-06 吉林大学 Intelligent vehicle speed decision method based on deep reinforcement learning and simulation method thereof

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151564A1 (en) * 2018-11-12 2020-05-14 Honda Motor Co., Ltd. System and method for multi-agent reinforcement learning with periodic parameter sharing
US11657251B2 (en) * 2018-11-12 2023-05-23 Honda Motor Co., Ltd. System and method for multi-agent reinforcement learning with periodic parameter sharing
CN114802307A (en) * 2022-05-23 2022-07-29 哈尔滨工业大学 Intelligent vehicle transverse control method under automatic and manual hybrid driving scene
CN114802307B (en) * 2022-05-23 2023-05-05 哈尔滨工业大学 Intelligent vehicle transverse control method under automatic and manual mixed driving scene

Also Published As

Publication number Publication date
CN113561995B (en) 2022-06-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant