CN108791302A - Driving behavior modeling - Google Patents

Driving behavior modeling

Info

Publication number
CN108791302A
CN108791302A (application CN201810662040.0A)
Authority
CN
China
Prior art keywords
driving
reward function
state
feature
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810662040.0A
Other languages
Chinese (zh)
Other versions
CN108791302B (en)
Inventor
邹启杰
李昊宇
裴腾达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University filed Critical Dalian University
Priority to CN201810662040.0A priority Critical patent/CN108791302B/en
Publication of CN108791302A publication Critical patent/CN108791302A/en
Application granted granted Critical
Publication of CN108791302B publication Critical patent/CN108791302B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00: Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W40/08: Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models, related to drivers or passengers
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00: Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/048: Activation functions
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00: Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0001: Details of the control system
    • B60W2050/0019: Control system elements or transfer functions
    • B60W2050/0028: Mathematical models, e.g. for simulation
    • B60W2050/0029: Mathematical model of the driver

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Mechanical Engineering (AREA)
  • Evolutionary Computation (AREA)
  • Transportation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a driver behavior modeling system, which specifically includes: a feature extractor, which extracts the features used to construct the reward function; a reward function generator, which obtains the reward function needed to construct the driving strategy; a driving strategy getter, which completes the construction of the driving strategy; and a judging device, which judges whether the optimal driving strategy constructed by the getter meets the judgment criterion. If not, the reward function is rebuilt and the optimal driving strategy is constructed again, iterating until the criterion is met, so that a driving strategy describing the real driving demonstrations is finally obtained. The application can handle new state scenes and obtain their corresponding actions, which greatly improves the generalization ability of the driver behavior model that is established; the applicable scenes are wider and the robustness is stronger.

Description

Driving behavior modeling
Technical field
The present invention relates to a modeling method, and specifically to driver behavior modeling.
Background technology
Autonomous driving is an important part of the intelligent transportation field. Owing to the limits of current technology, autonomous vehicles still require the intelligent driving system (intelligent driver assistance system) and the human driver to cooperate to complete the driving task. In this process, whether the goal is to better quantify driver information for the decisions of the intelligent system, or to provide personalized services by distinguishing between drivers, driver modeling is an essential step.
Among current driver modeling methods, reinforcement learning handles complicated sequential decision problems such as vehicle driving, with their large continuous spaces and multiple optimization objectives, particularly well, and is therefore also an effective approach to modeling driving behavior. Reinforcement learning, as a way of solving problems formulated as MDPs, requires interaction with the environment: actions are taken to obtain an evaluative feedback signal (the reward) from the environment, and the long-term return is maximized.
A survey of the existing literature shows that, in existing models of driving behavior, the reward function is set mainly in two ways: the traditional approach, in which researchers configure it manually for different scene states, and the approach in which it is set by inverse reinforcement learning. The traditional approach relies heavily on the subjectivity of the researcher; the quality of the reward function depends on the researcher's skill and experience. Moreover, to set the reward function correctly during vehicle driving, a large number of decision variables must be balanced; these variables are often incommensurable or even contradictory, and researchers frequently cannot design a reward function that balances every requirement.
Inverse reinforcement learning, by contrast, assigns suitable weights to the various driving features from the driving demonstration data, so it can automatically learn the required reward function and thereby overcomes the shortcomings of manual design. However, traditional inverse reinforcement learning methods can only learn from the scene states already present in the demonstration data, whereas in actual driving the true driving scenes often go beyond the range of the demonstrations because of differences in weather, scenery, and other factors. The inverse reinforcement learning approach therefore suffers from insufficient generalization when relating the scenes in the demonstration data to the decision actions.
Existing driving behavior modeling methods based on reinforcement learning theory follow two main lines of thought. In the first, the traditional reinforcement learning approach, setting the reward function relies on the researcher to analyze, organize, screen, and summarize the scenes and thereby obtain a series of features relevant to driving decisions, such as the headway to the vehicle ahead, whether the vehicle keeps clear of the curb, whether it keeps clear of pedestrians, a reasonable speed, the lane-change frequency, and so on; then, according to the demands of the driving scene, a series of experiments is designed to determine the weights these features should receive in the reward function under the corresponding scene environment, finally completing the overall design of the reward function, which serves as the model describing the driver's behavior. The second line of thought models on the basis of a probabilistic model, using maximum entropy inverse reinforcement learning to solve for the driving behavior function. It first assumes that there is one underlying, specific probability distribution that generates the demonstrated driving trajectories; the task is then to find a probability distribution that fits the driving demonstrations, and the problem of finding this distribution can be converted into a nonlinear programming problem, namely:
max −Σ p log p
s.t. Σ p = 1

where p denotes the probability distribution over the demonstration trajectories. After the distribution is obtained by solving the problem above, the relevant parameters can be sought, and the reward function r = θᵀf(s_t) can be acquired.
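As a concrete illustration of this linear reward form, the short sketch below evaluates r = θᵀf(s_t) for one state; the feature values and weights are hypothetical numbers chosen only for the example and are not taken from the patent.

```python
import numpy as np

# Hypothetical example: k = 4 scene features for one state s_t,
# e.g. headway, distance to curb, distance to pedestrians, speed deviation.
f_st = np.array([0.8, 0.6, 0.9, 0.2])

# Hypothetical weights theta, as would be learned by inverse reinforcement learning.
theta = np.array([1.5, 0.7, 2.0, -0.5])

# Linear reward of the form r = theta^T f(s_t).
r = theta @ f_st
print(r)
```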
Traditional driver behavior models analyze known driving data to describe and reason about driving behavior; however, the collected driving data cannot fully cover the inexhaustible variety of driving behavior, and the corresponding action cannot possibly be obtained for every state. In a practical driving scene, because of differences in weather, scenery, and objects, the driving state has countless possibilities, and traversing all states is impossible. Traditional driver behavior models therefore generalize weakly, rest on many model assumptions, and are not robust.
Secondly, in actual driving problems, if the reward function is set by the researcher alone, too many demands on the various features must be balanced; everything depends entirely on the researcher's experience and on repeated manual tuning, which is time-consuming and laborious and, more fatally, overly subjective. Under different scenes and environments the researcher must face a great many scene states; moreover, even for one fixed scene state, a change in the requirements also changes the driving behavior. To describe such a driving task accurately, a series of weights must be assigned so that these factors are captured precisely. Among existing methods, inverse reinforcement learning based on a probabilistic model starts mainly from the existing demonstration data, treats it as the available data, seeks the distribution of that data, and only on this basis can the action choice in a given state be obtained. But the distribution of the known data cannot represent the distribution of all data; obtaining the distribution correctly would require the corresponding action for every state.
Invention content
To solve the weak generalization of driver modeling in the prior art, namely the technical problem that no corresponding reward function can be established, and hence driving behavior cannot be modeled, for driving scenes that have no demonstration, the present application provides a driver behavior modeling system. It can handle new state scenes and obtain their corresponding actions, which greatly improves the generalization ability of the driver behavior model that is established; the applicable scenes are wider and the robustness is stronger.
To achieve the above goals, the technical essentials of the present solution are as follows. The driver behavior modeling system specifically includes:
a feature extractor, which extracts the features used to construct the reward function;
a reward function generator, which obtains the reward function needed to construct the driving strategy;
a driving strategy getter, which completes the construction of the driving strategy;
a judging device, which judges whether the optimal driving strategy constructed by the getter meets the judgment criterion; if not, the reward function is rebuilt and the optimal driving strategy is constructed again, iterating until the criterion is met, so that a driving strategy describing the real driving demonstrations is finally obtained.
Further, the specific implementation process by which the feature extractor extracts the features for constructing the reward function is:
S11. While the vehicle is travelling, the driving video captured by a camera placed behind the vehicle's windshield is sampled to obtain N groups of pictures of different vehicle driving environments and road conditions; the corresponding driving operation data, i.e. the steering angle under each road environment, are collected at the same time, and the two are jointly assembled into the training data;
S12. The collected pictures are translated, cropped, and adjusted in brightness to simulate scenes with different illumination and weather;
S13. A convolutional neural network is built, taking the processed pictures as input and the operation data corresponding to each picture as the label value; it is trained by seeking the optimum of a mean squared error loss with an optimization method based on the Nadam optimizer, thereby optimizing the weight parameters of the neural network;
S14. The network structure and weights of the trained convolutional neural network are preserved in order to establish a new convolutional neural network, completing the state feature extractor.
Further, the convolutional neural network established in step S13 comprises 1 input layer, 3 convolutional layers, 3 pooling layers and 4 fully connected layers; the input layer is connected in sequence to the first convolutional layer and the first pooling layer, then to the second convolutional layer and the second pooling layer, then to the third convolutional layer and the third pooling layer, and finally in sequence to the first, second, third and fourth fully connected layers.
Further, the trained convolutional neural network in step S14 does not include the output layer.
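A minimal sketch of such a feature extractor (steps S11 to S14) is given below, assuming TensorFlow/Keras. The filter counts, kernel sizes and input resolution are assumptions made only for illustration, since the text above fixes only the layer counts, the Nadam optimizer, the mean squared error loss, and the removal of the output layer.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(66, 200, 3)):
    # 3 convolutional + 3 pooling layers followed by 4 fully connected layers,
    # matching the layer counts given in the patent (sizes are assumptions).
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(24, 5, activation='relu'), layers.MaxPooling2D(),
        layers.Conv2D(36, 5, activation='relu'), layers.MaxPooling2D(),
        layers.Conv2D(48, 3, activation='relu'), layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(100, activation='relu'),
        layers.Dense(50, activation='relu'),
        layers.Dense(10, activation='relu'),
        layers.Dense(1),                     # label value: steering angle
    ])
    # Nadam optimizer with mean squared error loss, as in S13.
    model.compile(optimizer=tf.keras.optimizers.Nadam(), loss='mse')
    return model

def to_feature_extractor(trained):
    # Keep the structure and weights but drop the final output layer (S14);
    # the remaining top layer's activations serve as the state features f(s_t).
    return tf.keras.Model(inputs=trained.input, outputs=trained.layers[-2].output)
```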
Further, the specific implementation process of the reward function generator in obtaining the driving strategy is:
S21. The expert's driving demonstration data are obtained: the demonstration data are extracted by sampling the demonstration driving video; a continuous segment of driving video is sampled at a fixed frequency to obtain one demonstration trajectory. One set of expert demonstration data contains several trajectories and is denoted as a whole by D_E = {(s_1,a_1), (s_2,a_2), ..., (s_M,a_M)}, where D_E denotes all the driving demonstration data, (s_j,a_j) denotes the data pair formed by state j and the decision instruction corresponding to that state, M is the total number of driving demonstration data, N_T is the number of demonstration trajectories, and L_i is the number of state–decision-instruction pairs (s_j,a_j) contained in the i-th demonstration trajectory;
S22. The feature expectation of the driving demonstrations is sought;
First, each state s_t describing the driving environment situation in the demonstration data D_E is input into the state feature extractor to obtain the feature values f(s_t,a_t) for that state; f(s_t,a_t) denotes the group of driving-environment scene feature values, corresponding to s_t, that influence the driving decision result. The feature expectation of the driving demonstrations is then calculated as the γ-discounted accumulation of these feature values, where γ is the discount factor and is configured according to the problem at hand;
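The feature-expectation formula itself is not reproduced above; the sketch below assumes the standard discounted accumulation over the demonstration trajectories, averaged over the N_T trajectories, which is consistent with the discount factor described here.

```python
import numpy as np

def feature_expectation(trajectories, feature_fn, gamma=0.65):
    """Discounted feature expectation of the demonstrations (S22).

    Assumes the standard form mu_E = (1/N_T) * sum_i sum_t gamma^t * f(s_t, a_t);
    `trajectories` is a list of [(s_t, a_t), ...] lists and `feature_fn`
    wraps the state feature extractor.
    """
    mu = None
    for traj in trajectories:
        acc = None
        for t, (s, a) in enumerate(traj):
            f = np.asarray(feature_fn(s, a), dtype=float)
            acc = f * gamma**t if acc is None else acc + f * gamma**t
        mu = acc if mu is None else mu + acc
    return mu / len(trajectories)
```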
S23. The state–action set under the greedy strategy is sought;
S24. The weights of the reward function are sought.
Further, the concrete steps for seeking the state–action set under the greedy strategy are as follows. The reward function generator and the driving strategy getter are the two parts of one loop. First, the neural network in the driving strategy getter is obtained: the state features f(s_t,a_t) describing the environment situation, extracted from the driving demonstration data D_E, are input into the neural network to obtain the output g_w(s_t); g_w(s_t) is the set of Q values describing state s_t, i.e. [Q(s_t,a_1), ..., Q(s_t,a_n)]ᵀ, where Q(s_t,a_i) is the state–action value describing how good it is, in the current driving scene state s_t, to choose the decision driving action a_i; it is obtained from the formula Q(s,a) = θ μ(s,a), in which θ denotes the weights of the current reward function and μ(s,a) denotes the feature expectation.
Then, based on the ε-greedy strategy, the driving decision action corresponding to the driving scene state s_t is chosen: with a probability determined by ε, the decision action that maximizes the Q value in the Q-value set for the current driving scene s_t is chosen; otherwise an action is chosen at random. After the action a_t has been chosen, the corresponding Q(s_t,a_t) is recorded.
Thus, for the state feature f(s_t,a_t) of every state in the driving demonstrations D_E, the neural network is queried, and M state–action pairs (s_t,a_t) are obtained in total, each describing the driving decision action a_t chosen in the driving scene state s_t at time t; at the same time, from these action choices, the Q values of the M corresponding state–action pairs are obtained and denoted Q.
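A small sketch of this ε-greedy collection step follows; `q_network` stands for any callable wrapping the policy network g_w, and the default ε of 0.5 follows the value used later in the embodiment.

```python
import numpy as np

def epsilon_greedy_pairs(state_features, q_network, epsilon=0.5, rng=None):
    """For each demonstrated state feature f(s_t), query the policy network for
    its Q-value vector g_w(s_t), pick an action epsilon-greedily (with probability
    epsilon the argmax, otherwise a random action, following the convention in the
    embodiment), and record the chosen pair and its Q value (S23)."""
    rng = rng or np.random.default_rng()
    pairs, q_values = [], []
    for f_st in state_features:
        q = np.asarray(q_network(f_st))
        a = int(np.argmax(q)) if rng.random() < epsilon else int(rng.integers(len(q)))
        pairs.append((f_st, a))
        q_values.append(q[a])
    return pairs, np.array(q_values)
```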
Further, the concrete steps for seeking the weights of the reward function are:
An objective function J(θ) is first built from the following quantities: a loss term that is 0 if the current state–action pair appears among the driving demonstrations and 1 otherwise; the corresponding state–action values recorded above; the product of the driving demonstration feature expectation sought in S22 with the reward function weights θ; and a regularization term.
The objective function is then minimized by gradient descent, i.e. t = min_θ J(θ), and the variable θ that minimizes it is obtained; this θ is the sought weight vector of the required reward function.
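The exact objective is not reproduced above, so the sketch below is only an illustrative reconstruction from the listed ingredients (a margin-style loss term, the recorded state–action values treated as θᵀμ(s_t,a_t), the demonstration feature expectation, and an L2 regularizer); the functional form and hyperparameters are assumptions, not the claimed formula.

```python
import numpy as np

def fit_reward_weights(mu_pairs, in_demo, mu_E, reg=0.9, lr=0.01, steps=500):
    """Gradient descent on an assumed margin-style objective
        J(theta) = mean_t[ theta^T mu(s_t,a_t) + l(s_t,a_t) ] - theta^T mu_E
                   + (reg/2) * ||theta||^2,
    where l(s_t,a_t) = 0 if the pair occurs in the demonstrations and 1 otherwise.
    Since l does not depend on theta, only the other terms drive the gradient.

    mu_pairs : (M, k) array of feature expectations mu(s_t, a_t)
    in_demo  : (M,) boolean array, True if the pair appears in D_E
    mu_E     : (k,) demonstration feature expectation from S22
    """
    mu_pairs = np.asarray(mu_pairs, dtype=float)
    mu_E = np.asarray(mu_E, dtype=float)
    theta = np.zeros(mu_pairs.shape[1])
    for _ in range(steps):
        grad = mu_pairs.mean(axis=0) - mu_E + reg * theta   # dJ/dtheta
        theta -= lr * grad
    loss = (mu_pairs @ theta + (~np.asarray(in_demo)).astype(float)).mean() \
           - theta @ mu_E + 0.5 * reg * theta @ theta
    return theta, loss
```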
Further, the implementation process of the reward function generator in obtaining the driving strategy also includes: S25. Based on the obtained reward function weights θ, the reward function generator is built according to the formula r(s,a) = θᵀf(s,a).
Further, the specific implementation process by which the driving strategy getter completes the construction of the driving strategy is:
S31. The training data of the driving strategy getter are built.
The training data are obtained; each datum comprises two parts: one is the driving decision feature f(s_t) obtained by inputting the driving scene state at time t into the driving state extractor, and the other is a target value computed from the following quantities: r_θ(s_t,a_t), the reward function generated by the reward function generator based on the driving demonstration data, and Q^π(s_t,a_t) and Q^π(s_{t+1},a_{t+1}), which come from the Q values recorded in S23, namely the Q value of the driving scene s_t at time t and the Q value of the driving scene s_{t+1} at time t+1;
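The target formula is not reproduced above; the sketch below assumes a SARSA-style target y_t = r_θ(s_t,a_t) + γ·Q^π(s_{t+1},a_{t+1}), which is one plausible reading of the quantities listed (it does not use Q^π(s_t,a_t) directly):

```python
def policy_targets(rewards, q_next, gamma=0.9):
    """Training targets for the driving strategy getter (S31), assuming a
    SARSA-style form y_t = r_theta(s_t, a_t) + gamma * Q^pi(s_{t+1}, a_{t+1}).
    `rewards` holds r_theta(s_t, a_t) and `q_next` holds Q^pi(s_{t+1}, a_{t+1}),
    both indexed by t; gamma is the discount factor."""
    return [r + gamma * q for r, q in zip(rewards, q_next)]
```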
S32. The neural network is established.
The neural network has three layers. The first layer is the input layer; its number of neurons equals the number k of feature types output by the feature extractor, and it takes the driving scene features f(s_t,a_t) as input. The second layer is a hidden layer with 10 neurons. The number of neurons in the third layer equals the number n of driving actions available for decision in the action space. The activation function of the input layer and of the hidden layer is the sigmoid function, sigmoid(x) = 1/(1+e^(−x)), so that:
z = w^(1) x = w^(1) [1, f_t]ᵀ
h = sigmoid(z)
g_w(s_t) = sigmoid(w^(2) [1, h]ᵀ)
where w^(1) are the weights of the hidden layer; f_t is the feature of the driving scene state s_t at time t, i.e. the input of the neural network; z is the output of the hidden layer before the sigmoid activation is applied; h is the hidden layer output after the sigmoid activation; and w^(2) are the weights of the output layer;
The network output g_w(s_t) is the Q set of the driving scene state s_t at time t, i.e. [Q(s_t,a_1), ..., Q(s_t,a_n)]ᵀ; the Q^π(s_t,a_t) in S31 is obtained by inputting the state s_t into the neural network and selecting the entry for a_t in the output;
S33. The neural network is optimized.
The loss function established for optimizing this neural network is a cross-entropy cost function whose ingredients are: N, the number of training data; Q^π(s_t,a_t), the value obtained by inputting the state s_t describing the driving scene at time t into the neural network and selecting the entry of the corresponding driving decision action a_t in the output; the target value obtained in S31; and a regularization term over W = {w^(1), w^(2)}, the weights of the above neural network;
The training data obtained in S31 are fed into the neural network to optimize this cost function; the minimization of the cross-entropy cost function is completed by gradient descent, yielding the optimized neural network and hence the driving strategy getter.
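The following sketch mirrors the forward pass written out above and one assumed form of the S33 cost; since the exact cross-entropy formula is not reproduced, the loss below (binary cross-entropy between the selected network output and the S31 target, plus an L2 penalty on W) is an assumption made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def policy_forward(f_t, w1, w2):
    """Forward pass of the three-layer policy network of S32, following
    z = w1 [1, f_t]^T, h = sigmoid(z), g_w(s_t) = sigmoid(w2 [1, h]^T).
    w1 has shape (10, k+1) and w2 has shape (n, 11); the leading 1 is the bias input.
    Returns the Q-value vector [Q(s_t,a_1), ..., Q(s_t,a_n)]."""
    x = np.concatenate(([1.0], np.asarray(f_t, dtype=float)))
    h = sigmoid(w1 @ x)
    return sigmoid(w2 @ np.concatenate(([1.0], h)))

def policy_loss(batch, w1, w2, reg=0.9):
    """Assumed cross-entropy cost of S33: binary cross-entropy between the
    network output Q^pi(s_t, a_t) and the target y_t from S31, plus an L2
    penalty on W = {w1, w2}. `batch` is a list of (f_t, a_t, y_t) tuples.
    Minimizing this by gradient descent (e.g. with an autodiff library)
    yields the driving strategy getter."""
    ce = 0.0
    for f_t, a_t, y_t in batch:
        q = policy_forward(f_t, w1, w2)[a_t]
        ce += -(y_t * np.log(q) + (1.0 - y_t) * np.log(1.0 - q))
    return ce / len(batch) + 0.5 * reg * (np.sum(w1**2) + np.sum(w2**2))
```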
Further, the implementation process of the judging device includes:
The current reward function generator and driving strategy getter are regarded as a whole, and the current value of t from S22 is checked to see whether it satisfies t < ε, where ε is the threshold for judging whether the objective function meets the demand, i.e. for judging whether the reward function currently used to obtain the driving strategy is satisfactory; its value is set differently according to specific needs;
When the value of t does not satisfy this inequality, the reward function generator must be rebuilt: the neural network needed in the current S23 is replaced by the new neural network that has already been optimized in S33, i.e. the network used to generate the values Q(s_t,a_i) describing how good the chosen decision driving action a_i is in driving scene state s_t is replaced by the new network structure optimized in S33 by gradient descent; the reward function generator is then rebuilt, the driving strategy getter is obtained, and whether the value of t meets the demand is judged again;
When the inequality is satisfied, the current θ is the weight vector of the required reward function; the reward function generator then meets the requirements, and so does the driving strategy getter. The driving data of the driver for whom a model is to be established, i.e. the environment scene images during driving and the corresponding operation data, are then collected and input into the driving environment feature extractor to obtain the decision features for the current scene; the extracted features are input into the reward function generator to obtain the reward function for the corresponding scene state; finally, the obtained decision features and the computed reward function are input into the driving strategy getter to obtain that driver's corresponding driving strategy.
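An illustrative outer loop tying the four modules together is sketched below; the function names and control flow are assumptions drawn from this description of the judging device, not code from the patent.

```python
def train_driver_model(demos, feature_fn, build_policy, fit_reward, fit_policy,
                       eps=1e-3, max_iters=50):
    """Assumed control flow of the reward-generator / strategy-getter loop.

    fit_reward(demos, feature_fn, policy) -> (theta, t)     # S21 to S25
    fit_policy(demos, feature_fn, theta, policy) -> policy  # S31 to S33
    The loop stops once the objective value t drops below the threshold eps.
    """
    policy = build_policy()                       # freshly initialized network (S32)
    theta, t = fit_reward(demos, feature_fn, policy)
    iteration = 0
    while t >= eps and iteration < max_iters:
        policy = fit_policy(demos, feature_fn, theta, policy)  # optimize by gradient descent
        theta, t = fit_reward(demos, feature_fn, policy)       # rebuild the reward generator
        iteration += 1
    return theta, policy
```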
The advantageous effects of the present invention compared with the prior art are as follows. In the present invention, the method of describing the driver's decisions and establishing the driver behavior model uses a neural network to describe the strategy; once the network parameters are determined, states and actions correspond one to one, so the possible state–action pairs are no longer confined to the demonstration trajectories. In actual driving, where weather, scenery and other factors create a very large state space of driving scenes, the outstanding ability of neural networks to approximate arbitrary functions makes it possible to treat this strategy representation approximately as a black box: the feature values of the input state yield the corresponding state–action values, and the action is then chosen according to these output values, so that the corresponding action is obtained. This greatly broadens the applicability of driving behavior modeling by inverse reinforcement learning. Because conventional methods try to fit the demonstration trajectories with a particular probability distribution, the optimal strategy they obtain remains limited to the states present in the demonstration trajectories, whereas the present invention can handle new state scenes and obtain their corresponding actions, greatly improving the generalization ability of the driver behavior model that is established; the applicable scenes are wider and the robustness is stronger.
Description of the drawings
Fig. 1 is the new deep convolutional neural network;
Fig. 2 is a sample image from the driving video;
Fig. 3 is the workflow block diagram of this system;
Fig. 4 is the structure diagram of the neural network established in step S32.
Specific implementation mode
The invention will be further described below in conjunction with the accompanying drawings. The following embodiment serves only to illustrate the technical solution of the present invention clearly and is not intended to limit its scope of protection.
The present embodiment provides a driver behavior modeling system, comprising:
1. A feature extractor, which extracts the features used to construct the reward function, in the following way:
S11. While the vehicle is travelling, the driving video obtained by the camera placed behind the vehicle's windshield is sampled; a sample image is shown in Fig. 2.
N groups of pictures of different vehicle driving road environments and road conditions are obtained, together with the corresponding steering angles, comprising N1 straight-road samples and N2 curve samples, where the values may be N1 ≥ 300 and N2 ≥ 3000; together with the corresponding driving operation data, the training data are jointly constructed.
S12. The collected images are translated, cropped, adjusted in brightness and so on, to simulate scenes with different illumination and weather.
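A small sketch of these augmentations follows; the shift, crop and brightness ranges are assumptions chosen for illustration, since the text fixes only the kinds of operations.

```python
import numpy as np

def augment(image, rng=None, max_shift=20, crop=10, brightness=0.3):
    """Sketch of the S12 augmentations (translation, cropping, brightness change)
    used to simulate different illumination and weather. `image` is an HxWx3
    float array with values in [0, 1]; the parameter ranges are assumptions."""
    rng = rng or np.random.default_rng()
    # Horizontal translation by a random number of pixels.
    shift = int(rng.integers(-max_shift, max_shift + 1))
    image = np.roll(image, shift, axis=1)
    # Random crop, then pad back to the original height and width.
    h, w = image.shape[:2]
    top, left = rng.integers(0, crop + 1, size=2)
    cropped = image[top:h - crop + top, left:w - crop + left]
    image = np.pad(cropped, ((top, crop - top), (left, crop - left), (0, 0)), mode='edge')
    # Random brightness change.
    image = np.clip(image * (1.0 + rng.uniform(-brightness, brightness)), 0.0, 1.0)
    return image
```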
S13. A convolutional neural network is built, taking the processed pictures as input and the operation data of each corresponding picture as the label value, and is trained; the weight parameters of the neural network are optimized by seeking the optimum of the mean squared error loss with an optimization method based on the Nadam optimizer.
The convolutional neural network comprises 1 input layer, 3 convolutional layers, 3 pooling layers and 4 fully connected layers. The input layer is connected in sequence to the first convolutional layer and the first pooling layer, then to the second convolutional layer and the second pooling layer, then to the third convolutional layer and the third pooling layer, and finally in sequence to the first, second, third and fourth fully connected layers.
S14. The network structure and weights of the trained convolutional neural network, excluding the final output layer, are preserved in order to establish a new convolutional neural network, completing the state feature extractor.
2. A reward function generator, which obtains the reward function used for the driving strategy, in the following way:
In the process of obtaining the driving strategy, the reward function serves as the criterion for action selection in the reinforcement learning method. Its quality is decisive: it directly determines the quality of the obtained driving strategy and whether that strategy matches the strategy corresponding to the real driving demonstration data. The formula of the reward function is reward = θᵀf(s_t,a_t), where f(s_t,a_t) denotes the group of feature values that influence the driving decision result for the state s_t at time t under the driving environment scene (the vehicle's surroundings) and is used to describe the situation of those surroundings, and θ denotes the group of weights on the features that influence the driving decision; the magnitude of a weight indicates the proportion of the corresponding environmental feature in the reward function and embodies its importance. On the basis of the state feature extractor, this weight vector θ must be solved for in order to build the reward function that shapes the driving strategy.
S21. The expert's driving demonstration data are obtained.
The demonstration data are extracted by sampling the demonstration driving video data (which differ from the data used earlier for the driving environment feature extractor); a continuous segment of driving video can be sampled at a frequency of 10 Hz to obtain one demonstration trajectory. One expert demonstration should contain several trajectories, denoted as a whole by D_E = {(s_1,a_1), (s_2,a_2), ..., (s_M,a_M)}, where D_E denotes all the driving demonstration data, (s_j,a_j) denotes the data pair formed by the state j (the video frame of the driving environment at sampling time j) and the decision instruction corresponding to that state (such as the steering angle in the steering instruction), M is the total number of driving demonstration data, N_T is the number of demonstration trajectories, and L_i is the number of state–decision-instruction pairs (s_j,a_j) contained in the i-th demonstration trajectory.
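The sketch below illustrates this sampling step; the video frame rate is an assumption, and `frames` and `steering` stand for any time-aligned sequences of images and steering instructions.

```python
def build_demonstrations(frames, steering, fps=30.0, sample_hz=10.0):
    """Sketch of S21: sample a continuous driving video at 10 Hz and pair each
    sampled frame (state s_j) with its steering-angle instruction (action a_j)
    to form one demonstration trajectory. `frames` and `steering` are
    equal-length, time-aligned sequences; fps is the assumed video frame rate."""
    step = max(1, int(round(fps / sample_hz)))
    return [(frames[i], steering[i]) for i in range(0, len(frames), step)]
```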
S22. The feature expectation of the driving demonstrations is sought.
First, each state s_t describing the driving environment situation in the demonstration data D_E is input into the state feature extractor to obtain the feature values f(s_t,a_t) for that state; f(s_t,a_t) denotes the group of driving-environment scene feature values, corresponding to s_t, that influence the driving decision result. The feature expectation of the driving demonstrations is then calculated as the γ-discounted accumulation of these feature values, where γ is the discount factor, configured according to the problem at hand; a reference value may be 0.65.
S23. The state–action set under the greedy strategy is sought.
First, the neural network in the driving strategy getter of S32 is obtained. (The reward function generator and the driving strategy getter are the two parts of one loop; at the very beginning this neural network is the freshly initialized network of S32. As the loop proceeds, each iteration consists of completing one construction of the reward function that shapes the driving decision, then obtaining the corresponding optimal driving strategy in the driving strategy getter based on the current reward function, and judging whether the criterion for ending the loop is met; if it is not, the neural network optimized in the current step S33 is fed back into the rebuilding of the reward function.)
The state features f(s_t,a_t) describing the environment situation, extracted from the driving demonstration data D_E, are input into the neural network to obtain the output g_w(s_t); g_w(s_t) is the set of Q values describing state s_t, i.e. [Q(s_t,a_1), ..., Q(s_t,a_n)]ᵀ, where Q(s_t,a_i) is the state–action value describing how good it is, in the current driving scene state s_t, to choose the decision driving action a_i; it can be obtained from the formula Q(s,a) = θ μ(s,a), in which θ denotes the weights of the current reward function and μ(s,a) denotes the feature expectation.
Then, based on the ε-greedy strategy, with ε set for example to 0.5, the driving decision action corresponding to the driving scene state s_t is chosen: with a probability of fifty percent the decision action that maximizes the Q value in the Q-value set for the current driving scene s_t is chosen, and otherwise an action is chosen at random; after the action a_t has been chosen, the corresponding Q(s_t,a_t) is recorded.
Thus, for the state feature f(s_t,a_t) of every state in the driving demonstrations D_E, the neural network is queried, and M state–action pairs (s_t,a_t) are obtained in total, each describing the driving decision action a_t chosen in the driving scene state s_t at time t; at the same time, from these action choices, the Q values of the M corresponding state–action pairs are obtained and denoted Q.
S24. The weights of the reward function are sought.
An objective function J(θ) is first built from the following quantities: a loss term that is 0 if the current state–action pair appears among the driving demonstrations and 1 otherwise; the corresponding state–action values recorded above; the product of the driving demonstration feature expectation sought in S22 with the reward function weights θ; and a regularization term included to prevent overfitting, whose coefficient γ may be set to 0.9.
The objective function is minimized by gradient descent, i.e. t = min_θ J(θ), and the variable θ that minimizes it is obtained; this θ is the sought weight vector of the required reward function.
S25. Based on the obtained reward function weights θ, the reward function generator is built according to the formula r(s,a) = θᵀf(s,a).
3. A driving strategy getter, which completes the construction of the driving strategy, in the following way:
S31. The training data of the driving strategy getter are built.
The training data are obtained. They come from the earlier sampling of the demonstration data, but are processed into a new type of data, N in total. Each datum comprises two parts: one is the driving decision feature f(s_t) obtained by inputting the driving scene state at time t into the driving state extractor, and the other is a target value computed from the following quantities: r_θ(s_t,a_t), the reward function generated by the reward function generator based on the driving demonstration data, and Q^π(s_t,a_t) and Q^π(s_{t+1},a_{t+1}), which come from the group of Q values recorded in S23, namely the Q value of the driving scene s_t at time t and the Q value of the driving scene s_{t+1} at time t+1.
S32. The neural network is established.
The neural network has three layers. The first layer is the input layer; its number of neurons equals the number k of feature types output by the feature extractor, and it takes the driving scene features f(s_t,a_t) as input. The second layer is a hidden layer with 10 neurons. The number of neurons in the third layer equals the number n of driving actions available for decision in the action space. The activation function of the input layer and of the hidden layer is the sigmoid function, sigmoid(x) = 1/(1+e^(−x)), so that:
z = w^(1) x = w^(1) [1, f_t]ᵀ
h = sigmoid(z)
g_w(s_t) = sigmoid(w^(2) [1, h]ᵀ)
where w^(1) denotes the weights of the hidden layer; f_t denotes the feature of the driving scene state s_t at time t, i.e. the input of the neural network; z denotes the output of the hidden layer before the sigmoid activation is applied; h denotes the hidden layer output after the sigmoid activation; and w^(2) denotes the weights of the output layer. The network structure is shown in Fig. 4.
The network output g_w(s_t) is the Q set of the driving scene state s_t at time t, i.e. [Q(s_t,a_1), ..., Q(s_t,a_n)]ᵀ; the Q^π(s_t,a_t) in S31 is obtained by inputting the state s_t into the neural network and selecting the entry for a_t in the output.
S33. The neural network is optimized.
The loss function established for optimizing this neural network is a cross-entropy cost function whose ingredients are: N, the number of training data; Q^π(s_t,a_t), the value obtained by inputting the state s_t describing the driving scene at time t into the neural network and selecting the entry of the corresponding driving decision action a_t in the output; the target value obtained in S31; and, likewise to prevent overfitting, a regularization term over W = {w^(1), w^(2)}, the weights of the above neural network, whose coefficient γ may also be 0.9.
The training data obtained in S31 are fed into the neural network to optimize this cost function. The minimization of the cross-entropy cost function is completed by gradient descent, and the optimized neural network is obtained, giving the driving strategy getter.
4. A judging device, which judges whether the optimal driving strategy constructed by the getter meets the judgment criterion; if not, the reward function is rebuilt and the optimal driving strategy is constructed again, iterating until the criterion is met, so that the driving strategy describing the real driving demonstrations is finally obtained.
The current reward function generator and driving strategy getter are regarded as a whole, and the current value of t from S22 is checked to see whether it satisfies t < ε, where ε is the threshold for judging whether the objective function meets the demand, i.e. for judging whether the reward function currently used to obtain the driving strategy is satisfactory. Its value is set differently according to specific needs.
When the value of t does not satisfy this inequality, the reward function generator must be rebuilt: the neural network needed in the current S23 is replaced by the new neural network that has already been optimized in S33, i.e. the network used to generate the values Q(s_t,a_i) describing how good the chosen decision driving action a_i is in driving scene state s_t is replaced by the new network structure optimized in S33 by gradient descent. The reward function generator is then rebuilt, the driving strategy getter is obtained, and whether the value of t meets the demand is judged again.
When the inequality is satisfied, the current θ is the weight vector of the required reward function. The reward function generator then meets the requirements, and so does the driving strategy getter. One can then proceed as follows: the driving data of the driver for whom a model is to be established, i.e. the environment scene images during driving and the corresponding operation data such as the steering angle, are collected and input into the driving environment feature extractor to obtain the decision features for the current scene. The extracted features are then input into the reward function generator to obtain the reward function for the corresponding scene state. Finally, the obtained decision features and the computed reward function are input into the driving strategy getter to obtain that driver's corresponding driving strategy.
In a Markov decision process, a strategy must connect each state to its corresponding action. But when the state space is large, it is difficult to depict and express a definite strategy for the regions that have not been traversed. Traditional methods also ignore the description of this part: based only on the demonstration trajectories, they build a probability model of the whole trajectory distribution and give no concrete strategy representation for new states, i.e. no concrete way of determining which action to take in a new state. In the present invention the strategy is described by a neural network, which can approximate an arbitrary function to any precision and has outstanding generalization ability. Representing states by features makes it possible, on the one hand, to represent states that are not contained in the demonstration trajectories; on the other hand, by inputting the corresponding state features into the neural network, the corresponding action values can be sought, and hence the appropriate action according to the strategy. The problem that conventional methods cannot generalize from the driving demonstration data to driving scene states that were never traversed is thereby resolved.
The above is only a preferred embodiment of the invention, but the scope of protection of the invention is not confined to it; any equivalent substitution or change that a person skilled in the art makes, within the technical scope disclosed by the invention, to the technical solution of the invention and its inventive concept shall be covered by the scope of protection of the invention.

Claims (10)

1. A driver behavior modeling system, characterized in that it specifically comprises:
a feature extractor, which extracts the features used to construct the reward function;
a reward function generator, which obtains the reward function needed to construct the driving strategy;
a driving strategy getter, which completes the construction of the driving strategy;
a judging device, which judges whether the optimal driving strategy constructed by the getter meets the judgment criterion; if not, the reward function is rebuilt and the optimal driving strategy is constructed again, iterating until the criterion is met.
2. The driver behavior modeling system according to claim 1, characterized in that the specific implementation process by which the feature extractor extracts the features for constructing the reward function is:
S11. While the vehicle is travelling, the driving video captured by a camera placed behind the vehicle's windshield is sampled to obtain N groups of pictures of different vehicle driving environments and road conditions together with the corresponding steering angles; the corresponding driving operation data are collected at the same time, and the training data are jointly constructed;
S12. The collected pictures are translated, cropped, and adjusted in brightness to simulate scenes with different illumination and weather;
S13. A convolutional neural network is built, taking the processed pictures as input and the operation data corresponding to each picture as the label value; it is trained by seeking the optimum of a mean squared error loss with an optimization method based on the Nadam optimizer, thereby optimizing the weight parameters of the neural network;
S14. The network structure and weights of the trained convolutional neural network are preserved in order to establish a new convolutional neural network, completing the state feature extractor.
3. The driver behavior modeling system according to claim 2, characterized in that the convolutional neural network established in step S13 comprises 1 input layer, 3 convolutional layers, 3 pooling layers and 4 fully connected layers; the input layer is connected in sequence to the first convolutional layer and the first pooling layer, then to the second convolutional layer and the second pooling layer, then to the third convolutional layer and the third pooling layer, and finally in sequence to the first, second, third and fourth fully connected layers.
4. The driver behavior modeling system according to claim 2, characterized in that the trained convolutional neural network in step S14 does not include the output layer.
5. The driver behavior modeling system according to claim 1, characterized in that the specific implementation process of the reward function generator in obtaining the driving strategy is:
S21. The expert's driving demonstration data are obtained: the demonstration data are extracted by sampling the demonstration driving video; a continuous segment of driving video is sampled at a fixed frequency to obtain one demonstration trajectory; one set of expert demonstration data contains several trajectories and is denoted as a whole by:
D_E = {(s_1,a_1), (s_2,a_2), ..., (s_M,a_M)}, where D_E denotes all the driving demonstration data, (s_j,a_j) denotes the data pair formed by state j and the decision instruction corresponding to that state, M is the total number of driving demonstration data, N_T is the number of demonstration trajectories, and L_i is the number of state–decision-instruction pairs (s_j,a_j) contained in the i-th demonstration trajectory;
S22. The feature expectation of the driving demonstrations is sought;
first, each state s_t describing the driving environment situation in the demonstration data D_E is input into the state feature extractor to obtain the feature values f(s_t,a_t) for that state; f(s_t,a_t) denotes the group of driving-environment scene feature values, corresponding to s_t, that influence the driving decision result; the feature expectation of the driving demonstrations is then calculated as the γ-discounted accumulation of these feature values,
where γ is the discount factor and is configured according to the problem at hand;
S23. The state–action set under the greedy strategy is sought;
S24. The weights of the reward function are sought.
6. The driver behavior modeling system according to claim 5, characterized in that the concrete steps for seeking the state–action set under the greedy strategy are as follows: the reward function generator and the driving strategy getter are the two parts of one loop; first, the neural network in the driving strategy getter is obtained: the state features f(s_t,a_t) describing the environment situation, extracted from the driving demonstration data D_E, are input into the neural network to obtain the output g_w(s_t); g_w(s_t) is the set of Q values describing state s_t, i.e. [Q(s_t,a_1), ..., Q(s_t,a_n)]ᵀ, where Q(s_t,a_i) is the state–action value describing how good it is, in the current driving scene state s_t, to choose the decision driving action a_i; it is obtained from the formula Q(s,a) = θ μ(s,a), in which θ denotes the weights of the current reward function and μ(s,a) denotes the feature expectation.
Then, based on the ε-greedy strategy, the driving decision action corresponding to the driving scene state s_t is chosen: with a probability determined by ε, the decision action that maximizes the Q value in the Q-value set for the current driving scene s_t is chosen; otherwise an action is chosen at random; after the action a_t has been chosen, the corresponding Q(s_t,a_t) is recorded.
Thus, for the state feature f(s_t,a_t) of every state in the driving demonstrations D_E, the neural network is queried, and M state–action pairs (s_t,a_t) are obtained in total, each describing the driving decision action a_t chosen in the driving scene state s_t at time t; at the same time, from these action choices, the Q values of the M corresponding state–action pairs are obtained and denoted Q.
7. The driver behavior modeling system according to claim 5, characterized in that the concrete steps for seeking the weights of the reward function are: an objective function J(θ) is first built from the following quantities:
a loss term that is 0 if the current state–action pair appears among the driving demonstrations and 1 otherwise; the corresponding state–action values recorded above; the product of the driving demonstration feature expectation sought in S22 with the reward function weights θ; and a regularization term;
the objective function is minimized by gradient descent, i.e. t = min_θ J(θ), and the variable θ that minimizes it is obtained; this θ is the sought weight vector of the required reward function.
8. The driver behavior modeling system according to claim 5, characterized in that the implementation process of the reward function generator in obtaining the driving strategy further includes: S25. Based on the obtained reward function weights θ, the reward function generator is built according to the formula r(s,a) = θᵀf(s,a).
9. The driver behavior modeling system according to claim 1, characterized in that the specific implementation process by which the driving strategy getter completes the construction of the driving strategy is:
S31. The training data of the driving strategy getter are built;
the training data are obtained; each datum comprises two parts: one is the driving decision feature f(s_t) obtained by inputting the driving scene state at time t into the driving state extractor, and the other is a target value computed from the following quantities:
r_θ(s_t,a_t), the reward function generated by the reward function generator based on the driving demonstration data, and Q^π(s_t,a_t) and Q^π(s_{t+1},a_{t+1}), which come from the Q values recorded in S23, namely the Q value of the driving scene s_t at time t and the Q value of the driving scene s_{t+1} at time t+1;
S32. The neural network is established;
the neural network has three layers: the first layer is the input layer, whose number of neurons equals the number k of feature types output by the feature extractor and which takes the driving scene features f(s_t,a_t) as input; the second layer is a hidden layer with 10 neurons; the number of neurons in the third layer equals the number n of driving actions available for decision in the action space; the activation function of the input layer and of the hidden layer is the sigmoid function, sigmoid(x) = 1/(1+e^(−x)), so that:
z = w^(1) x = w^(1) [1, f_t]ᵀ
h = sigmoid(z)
g_w(s_t) = sigmoid(w^(2) [1, h]ᵀ)
where w^(1) are the weights of the hidden layer; f_t is the feature of the driving scene state s_t at time t, i.e. the input of the neural network; z is the output of the hidden layer before the sigmoid activation is applied; h is the hidden layer output after the sigmoid activation; and w^(2) are the weights of the output layer;
the network output g_w(s_t) is the Q set of the driving scene state s_t at time t, i.e. [Q(s_t,a_1), ..., Q(s_t,a_n)]ᵀ; the Q^π(s_t,a_t) in S31 is obtained by inputting the state s_t into the neural network and selecting the entry for a_t in the output;
S33. The neural network is optimized;
the loss function established for optimizing this neural network is a cross-entropy cost function whose ingredients are: N, the number of training data; Q^π(s_t,a_t), the value obtained by inputting the state s_t describing the driving scene at time t into the neural network and selecting the entry of the corresponding driving decision action a_t in the output; the target value obtained in S31; and a regularization term over W = {w^(1), w^(2)}, the weights of the above neural network;
the training data obtained in S31 are fed into the neural network to optimize this cost function; the minimization of the cross-entropy cost function is completed by gradient descent, yielding the optimized neural network and hence the driving strategy getter.
10. The driver behavior modeling system according to claim 1, characterized in that the implementation process of the judging device includes:
the current reward function generator and driving strategy getter are regarded as a whole, and the current value of t from S22 is checked to see whether it satisfies t < ε, where ε is the threshold for judging whether the objective function meets the demand, i.e. for judging whether the reward function currently used to obtain the driving strategy is satisfactory; its value is set differently according to specific needs;
when the value of t does not satisfy this inequality, the reward function generator must be rebuilt: the neural network needed in the current S23 is replaced by the new neural network that has already been optimized in S33, i.e. the network used to generate the values Q(s_t,a_i) describing how good the chosen decision driving action a_i is in driving scene state s_t is replaced by the new network structure optimized in S33 by gradient descent; the reward function generator is then rebuilt, the driving strategy getter is obtained, and whether the value of t meets the demand is judged again;
when the inequality is satisfied, the current θ is the weight vector of the required reward function; the reward function generator then meets the requirements, and so does the driving strategy getter; the driving data of the driver for whom a model is to be established, i.e. the environment scene images during driving and the corresponding operation data, are then collected and input into the driving environment feature extractor to obtain the decision features for the current scene; the extracted features are then input into the reward function generator to obtain the reward function for the corresponding scene state; finally, the obtained decision features and the computed reward function are input into the driving strategy getter to obtain that driver's corresponding driving strategy.
CN201810662040.0A 2018-06-25 2018-06-25 Driver behavior modeling system Active CN108791302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810662040.0A CN108791302B (en) 2018-06-25 2018-06-25 Driver behavior modeling system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810662040.0A CN108791302B (en) 2018-06-25 2018-06-25 Driver behavior modeling system

Publications (2)

Publication Number Publication Date
CN108791302A true CN108791302A (en) 2018-11-13
CN108791302B CN108791302B (en) 2020-05-19

Family

ID=64070795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810662040.0A Active CN108791302B (en) 2018-06-25 2018-06-25 Driver behavior modeling system

Country Status (1)

Country Link
CN (1) CN108791302B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103381826A (en) * 2013-07-31 2013-11-06 中国人民解放军国防科学技术大学 Adaptive cruise control method based on approximate policy iteration
CN105955930A (en) * 2016-05-06 2016-09-21 天津科技大学 Guidance-type policy search reinforcement learning algorithm
CN107168303A (en) * 2017-03-16 2017-09-15 中国科学院深圳先进技术研究院 A kind of automatic Pilot method and device of automobile
CN107229973A (en) * 2017-05-12 2017-10-03 中国科学院深圳先进技术研究院 The generation method and device of a kind of tactful network model for Vehicular automatic driving
CN107203134A (en) * 2017-06-02 2017-09-26 浙江零跑科技有限公司 A kind of front truck follower method based on depth convolutional neural networks
CN107480726A (en) * 2017-08-25 2017-12-15 电子科技大学 A kind of Scene Semantics dividing method based on full convolution and shot and long term mnemon
CN107679557A (en) * 2017-09-19 2018-02-09 平安科技(深圳)有限公司 Driving model training method, driver's recognition methods, device, equipment and medium
CN108108657A (en) * 2017-11-16 2018-06-01 浙江工业大学 A kind of amendment local sensitivity Hash vehicle retrieval method based on multitask deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邹启杰 et al. cite: 王勇鑫, 钱徽, 金卓军, 朱淼良: "Autonomous navigation performance evaluation method based on trajectory analysis" (基于轨迹分析的自主导航性能评估方法), 《计算机工程》 (Computer Engineering) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111923928A (en) * 2019-05-13 2020-11-13 长城汽车股份有限公司 Decision making method and system for automatic vehicle
CN110481561A (en) * 2019-08-06 2019-11-22 北京三快在线科技有限公司 Automatic driving vehicle automatic control signal generation method and device
WO2021093011A1 (en) * 2019-11-14 2021-05-20 深圳大学 Unmanned vehicle driving decision-making method, unmanned vehicle driving decision-making device, and unmanned vehicle
CN112052776A (en) * 2020-09-01 2020-12-08 中国人民解放军国防科技大学 Unmanned vehicle autonomous driving behavior optimization method and device and computer equipment
CN112373482A (en) * 2020-11-23 2021-02-19 浙江天行健智能科技有限公司 Driving habit modeling method based on driving simulator
CN112373482B (en) * 2020-11-23 2021-11-05 浙江天行健智能科技有限公司 Driving habit modeling method based on driving simulator
CN112997128A (en) * 2021-04-19 2021-06-18 华为技术有限公司 Method, device and system for generating automatic driving scene
CN112997128B (en) * 2021-04-19 2022-08-26 华为技术有限公司 Method, device and system for generating automatic driving scene

Also Published As

Publication number Publication date
CN108791302B (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN108819948A (en) Driving behavior modeling method based on reverse intensified learning
CN108791302A (en) Driving behavior modeling
CN108920805A (en) Driving behavior modeling with state feature extraction functions
CN111079561B (en) Robot intelligent grabbing method based on virtual training
CN112232490B (en) Visual-based depth simulation reinforcement learning driving strategy training method
CN106920243A (en) The ceramic material part method for sequence image segmentation of improved full convolutional neural networks
CN108891421A (en) A method of building driving strategy
CN110458060A (en) A kind of vehicle image optimization method and system based on confrontation study
CN106444379A (en) Intelligent drying remote control method and system based on internet of things recommendation
CN111136659A (en) Mechanical arm action learning method and system based on third person scale imitation learning
Li et al. Facial feedback for reinforcement learning: a case study and offline analysis using the TAMER framework
CN108944940A (en) Driving behavior modeling method neural network based
CN109726676A (en) The planing method of automated driving system
CN110321956A (en) A kind of herbage pest management method and device based on artificial intelligence
CN110110794A (en) The image classification method that neural network parameter based on characteristic function filtering updates
CN113779289A (en) Drawing step reduction system based on artificial intelligence
CN116957866A (en) Individualized teaching device of digital man teacher
CN116353623A (en) Driving control method based on self-supervision imitation learning
Hafez et al. Improving robot dual-system motor learning with intrinsically motivated meta-control and latent-space experience imagination
CN103793054B (en) A kind of action identification method simulating declarative memory process
CN108875555A (en) Video interest neural network based region and well-marked target extraction and positioning system
CN110990589A (en) Knowledge graph automatic generation method based on deep reinforcement learning
CN112329498A (en) Street space quality quantification method based on machine learning
CN110222822A (en) The construction method of black box prediction model internal feature cause-and-effect diagram
CN108791308A (en) The system for building driving strategy based on driving environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
OL01 Intention to license declared
OL01 Intention to license declared