CN107169567A - Method and apparatus for generating a decision network model for autonomous vehicle driving - Google Patents
- Publication number
- CN107169567A CN107169567A CN201710201086.8A CN201710201086A CN107169567A CN 107169567 A CN107169567 A CN 107169567A CN 201710201086 A CN201710201086 A CN 201710201086A CN 107169567 A CN107169567 A CN 107169567A
- Authority
- CN
- China
- Prior art keywords
- sample
- default
- training
- experience database
- vehicle
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Traffic Control Systems (AREA)
- Feedback Control In General (AREA)
Abstract
The present invention, applicable to the field of computer technology, provides a method and apparatus for generating a decision network model for autonomous vehicle driving. The method includes: generating a sample triple for each trial time according to the vehicle state information collected at that time, a preset vehicle action set and a preset reward function; storing all sample triples as sample data in a pre-built experience database and performing cluster analysis on all the sample data; uniformly sampling training samples, according to a preset sampling-ratio value, from each cluster obtained by the cluster analysis of the experience database, and computing the cumulative return value of each training sample; and training the decision network model for autonomous driving from all the training samples, the cumulative return value of each training sample and a preset deep learning algorithm, thereby effectively improving the training efficiency and the generalization ability of the decision network model.
Description
Technical field
The invention belongs to the field of computer technology, and in particular relates to a method and apparatus for generating a decision network model for autonomous vehicle driving.
Background art
With economic development and advancing urbanization, the global number of vehicles and the total road mileage keep rising, and a series of problems that conventional vehicles cannot properly solve, such as traffic congestion, accidents, pollution and scarce land resources, have become increasingly prominent. Driverless vehicle technology is regarded as an effective solution to these problems, and its development has attracted wide attention.
A driverless vehicle travels on the road under its own automatic driving system without a human driver and possesses environment-perception capability. At present, the control methods of such driving systems are mainly rule-based control decisions, that is, expert decision systems that map situations to output control decisions are built from known driving experience. Shallow learning methods such as expert rule systems can be regarded as processes of discovering rules from labeled data; when the rules are hard to abstract into formulas or simple logic, shallow learning fails to work.
With the rapid development of deep reinforcement learning, several research institutions have proposed "end-to-end" automatic driving algorithms, in which the control decision model of the driving system is built as a deep network: the network's input is status data such as camera images, lidar readings, GPS position and speed, and its output serves directly as the actuation signal controlling the vehicle. This approach requires no rule-based interpretation of the vehicle state, but training the deep network usually requires a large number of data samples. As the dimensionality of vehicle sensor data and the complexity of the network structure grow, the computational cost of model training increases sharply, and this huge consumption of computing resources is regarded as a major obstacle to training deep network models.
Summary of the invention
It is an object of the present invention to provide a method and apparatus for generating a decision network model for autonomous vehicle driving, aiming to solve the problems that training the decision model for autonomous driving is inefficient and that the learning ability of the vehicle decision model is weak, so that it cannot adapt well to different routes and scenes.
In one aspect, the invention provides a method for generating a decision network model for autonomous vehicle driving, the method comprising the steps of:
generating a sample triple for each trial time according to the vehicle state information collected at that trial time, a preset vehicle action set and a preset reward function;
storing all sample triples as sample data in a pre-built experience database, and performing cluster analysis on all sample data in the experience database;
uniformly sampling training samples, according to a preset sampling-ratio value, from each cluster obtained by the cluster analysis of the experience database, and computing the cumulative return value of each training sample; and
training the decision network model for autonomous driving of the current vehicle from all the training samples, the cumulative return value of each training sample and a preset deep learning algorithm.
In another aspect, the invention provides an apparatus for generating a decision network model for autonomous vehicle driving, the apparatus comprising:
a sample generation module for generating a sample triple for each trial time according to the vehicle state information collected at that trial time, a preset vehicle action set and a preset reward function;
a cluster analysis module for storing all sample triples as sample data in a pre-built experience database and performing cluster analysis on all sample data in the experience database;
a training sampling module for uniformly sampling training samples, according to a preset sampling-ratio value, from each cluster obtained by the cluster analysis of the experience database, and computing the cumulative return value of each training sample; and
a model generation module for training the decision network model for autonomous driving of the current vehicle from all the training samples, the cumulative return value of each training sample and a preset deep learning algorithm.
According to the present invention, the sample data in the experience database, i.e. the sample triples, are obtained from the collected vehicle state information, the vehicle action set and the reward function; the sample data are classified by cluster analysis and uniformly sampled in proportion from each cluster of the classification, yielding the training samples of the deep network, from which the decision network model for autonomous driving is finally trained. The cluster analysis thus condenses the training samples, and return values together with deep learning train the decision network model, so that the decision network model can be trained quickly on the condensed data set while the learning ability and generalization ability of the decision network model in the driving loop are effectively improved.
Brief description of the drawings
Fig. 1 is a flow chart of the implementation of the method for generating a decision network model for autonomous vehicle driving provided by embodiment one of the present invention;
Fig. 2 is a schematic structural diagram of the apparatus for generating a decision network model for autonomous vehicle driving provided by embodiment two of the present invention; and
Fig. 3 is a schematic structural diagram of the apparatus for generating a decision network model for autonomous vehicle driving provided by embodiment three of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only serve to illustrate the invention and are not intended to limit it.
The implementation of the invention is described in detail below with reference to specific embodiments:
Embodiment one:
Fig. 1 shows the implementation flow of the method for generating a decision network model for autonomous vehicle driving provided by embodiment one of the present invention; for convenience of description, only the parts related to the embodiment of the invention are shown, detailed as follows:
In step S101, a sample triple is generated for each trial time according to the vehicle state information collected at that trial time, a preset vehicle action set and a preset reward function.
The invention is applied to an interaction platform built on a racing simulation platform or race simulator (e.g. TORCS, The Open Racing Car Simulator), on which driving interaction trials of the driverless vehicle are carried out. During the current interaction trial, vehicle state information is collected by multiple preset sensors on the vehicle; it may include the vehicle's distance from the road centerline, the angle between the vehicle's heading and the road tangent, the readings of the vehicle's front laser rangefinders, the vehicle's velocity component along the road tangent, and so on. Except at the initial trial time, the vehicle state at each trial time is the result, or a function, of the vehicle state and vehicle action at the previous time; for example, if St denotes the vehicle state information at trial time t, then the vehicle state information at trial time t+1 is St+1 = f(St, at) = f(f(St-1, at-1)) = ..., where at is the vehicle action information at trial time t. Specifically, vehicle actions may include going straight, braking, and so on.
In the embodiment of the invention, after the vehicle state information of the current trial time is collected, the preset vehicle action set is traversed according to the preset reward function to find the vehicle action that yields the maximum reward value, and this action is sent to the vehicle; for ease of distinction, this vehicle action is called the maximum-reward action. The vehicle state information, the maximum-reward action and the return value of the maximum-reward action are combined into a sample triple; for example, the sample triple at trial time t can be expressed as (St, at, rt), where rt is the return value of the maximum-reward action at trial time t.
As an illustration, when the current driving position of the vehicle is considered, the formula of the reward function can be:
r = Δdis · cos(α · angle) · sgn(trackPos − threshold),
where r is the return value of the reward function, Δdis is the distance covered by the vehicle between adjacent trial times, α is a preset weight scaling factor, angle is the angle between the vehicle's current heading and the road tangent, trackPos is the vehicle's distance from the road centerline, and threshold is a preset threshold value. When trackPos exceeds threshold, r is negative infinity, which can represent a penalty for the vehicle coming too close to the road boundary. In addition, the reward function may also take into account travel speed, fuel consumption, smoothness, and so on.
In step S102, all sample triples are stored as sample data in a pre-built experience database, and cluster analysis is performed on all sample data in the experience database.
In the embodiment of the invention, at the end of the current interaction trial, the sample triples corresponding to every trial time during the interaction trial are stored as sample data in the experience database. The class of each sample datum (i.e. each sample triple) is computed by an initialized classification model, so that all sample data are assigned to the corresponding clusters. The cluster analysis thereby uncovers some intrinsic properties or regularities of the samples in the experience database, reducing the dimensionality or quantity of the training samples of the decision network model and achieving the purpose of condensing the training samples.
In the embodiment of the invention, when the classification model is initialized, driving data of professional drivers collected during driving, which may be called professional driving data, are used; the professional driving data consist of "state-action" 2-tuples composed of the vehicle state information and vehicle action information recorded while a professional driver drives. Cluster analysis is performed on the professional driving data by a preset clustering algorithm to initialize the classification model. The clustering algorithm can be K-means, principal component analysis (PCA), or a similar algorithm.
As an illustration, when the classification model is initialized with the K-means algorithm, multiple cluster centres are randomly selected from the professional driving data, the class of each "state-action" 2-tuple in the professional driving data is computed, and the cluster centre of each class is updated. The formula for computing the class can be:

c(i) = argmin over j of ||x(i) − u_j||²,

where x(i) is the i-th "state-action" 2-tuple in the professional driving data and u_j is the j-th cluster centre. The formula for updating the cluster centre of each class can be:

u_j = Σ_i 1{c(i) = j} · x(i) / Σ_i 1{c(i) = j}.
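A minimal K-means sketch matching the two formulas above; using the first k points as initial centres and a fixed iteration count are illustrative choices the patent does not specify.

```python
def kmeans(points, k, iters=20):
    """K-means over 'state-action' tuples encoded as numeric vectors:
    assign each x(i) to the nearest centre u_j, then recompute each
    centre as the mean of its members."""
    centers = list(points[:k])  # illustrative deterministic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k),
                    key=lambda m: sum((a - b) ** 2 for a, b in zip(x, centers[m])))
            clusters[j].append(x)
        for j, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster empties
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, clusters
```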
Preferably, whether the current interaction trial has ended is determined by detecting whether an accident occurs to the vehicle during the driving of the interaction trial or whether the vehicle completes the preset trial driving task; when an accident occurs during driving or the preset trial driving task is completed, the current interaction trial is determined to have ended, yielding a series of temporally continuous sample triples. Specifically, accidents during driving may include leaving the road, colliding, running out of fuel, and so on.
In step S103, training samples are uniformly sampled, according to a preset sampling-ratio value, from each cluster obtained by the cluster analysis of the experience database, and the cumulative return value of each training sample is computed.
In the embodiment of the invention, uniform sampling within each cluster of the experience database selects representative training samples and keeps the training samples independently and identically distributed, so that training the policy network model with these samples effectively improves its training efficiency. From the vehicle state information, the maximum-reward action and the maximum reward value in each training sample, the corresponding cumulative return value can be computed; since each vehicle state is the result of the previous vehicle state and action, the optimal policy can be determined through the cumulative return value. The cumulative return value is also the output of the policy network model.
Specifically, the cumulative return value Q(st, at) can be computed as r0 + γr1 + γ²r2 + ..., where γ is a preset parameter and 0 ≤ γ < 1.
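The uniform per-cluster sampling and the cumulative return Q(st, at) = r0 + γr1 + γ²r2 + ... can be sketched as below; interpreting the preset sampling-ratio value as a fraction of each cluster is an assumption.

```python
import random

def sample_training_set(clusters, ratio, seed=0):
    """Draw the same preset fraction of samples uniformly from every
    cluster, keeping the selection balanced across clusters."""
    rng = random.Random(seed)
    picked = []
    for cluster in clusters:
        n = max(1, int(len(cluster) * ratio))
        picked.extend(rng.sample(cluster, n))
    return picked

def cumulative_return(rewards, gamma=0.9):
    """Q = r0 + gamma*r1 + gamma^2*r2 + ..., with 0 <= gamma < 1."""
    return sum(r * gamma ** k for k, r in enumerate(rewards))
```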
In step S104, the decision network model for autonomous driving of the current vehicle is obtained by training from all the training samples, the cumulative return value of each training sample and a preset deep learning algorithm.
In the embodiment of the invention, the vehicle state information St and vehicle action information at in each training sample, together with the cumulative return value Q of the training sample, are stored into a preset data set, and the decision network model for autonomous driving is trained from this data set with the preset deep learning algorithm. The deep learning algorithm can be resilient backpropagation (Rprop), the backpropagation algorithm, long short-term memory (LSTM), or another deep learning algorithm.
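As a stand-in for the deep-learning step, the sketch below fits a linear model mapping sample features to the cumulative return Q by stochastic gradient descent; the patent names Rprop, backpropagation and LSTM, so this is only an illustration of the (sample → Q) regression, not the patented network.

```python
def train_decision_model(samples, lr=0.05, epochs=2000):
    """Fit weights w and bias b so that w.x + b approximates the
    cumulative return Q for each (feature vector, Q) training sample."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, q in samples:
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - q
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b
```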
Preferably, interaction trials are performed repeatedly and the decision network model is trained multiple times; after each training round, whether the experience database satisfies a preset constraint condition is checked, and when it is not satisfied, sample data in the experience database that do not meet the requirements are rejected, so that the experience data are updated in time and the quality of the training samples is improved.
Specifically, the constraint condition can be len(DSh) < μrms && num(DSh) < Knum, where DSh is the experience database, len(·) computes the number of sample data in the experience database, num(·) counts the number of interaction trials, μrms is the maximum quantity of sample data, and Knum is the maximum number of interaction trials.
Specifically, whether a sample datum meets the requirements can be judged from the time gap between adjacent sample data in the experience data: when the time gap between the current sample datum and the previous one is smaller than a preset distance threshold, the current sample datum is determined not to meet the requirements.
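The constraint check and the time-gap culling rule can be sketched as follows; the concrete limits and the (timestamp, triple) record layout are assumptions for illustration.

```python
def update_experience(db, trial_count, max_samples=10000, max_trials=100, min_gap=0.1):
    """If len(DSh) < u_rms and num(DSh) < K_num the constraint holds and
    the database is kept as-is; otherwise samples whose time gap from
    the previously kept sample is below the preset threshold are
    rejected. Each db entry is assumed to be (timestamp, triple)."""
    if len(db) < max_samples and trial_count < max_trials:
        return db
    kept = db[:1]
    for entry in db[1:]:
        if entry[0] - kept[-1][0] >= min_gap:
            kept.append(entry)
    return kept
```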
In the embodiment of the invention, the vehicle state information at different trial times, the corresponding maximum-reward actions and the return values of those actions form the sample data in the experience database; cluster analysis is performed on all sample data, each resulting cluster is uniformly sampled to obtain the training samples for training the autonomous-driving policy network model, and the experience database is updated according to the constraint condition after each training round to reject sample data that do not meet the requirements. The representativeness of the training samples is thereby improved and their dimensionality reduced, and training the decision network model with rewards and deep learning effectively improves the training efficiency, learning ability and generalization ability of the decision network model.
One of ordinary skill in the art will appreciate that all or part of the steps in the method of the above embodiment can be completed by instructing related hardware through a program; the program can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk or an optical disc.
Embodiment two:
Fig. 2 shows the structure of the apparatus for generating a decision network model for autonomous vehicle driving provided by embodiment two of the present invention; for convenience of description, only the parts related to the embodiment of the invention are shown, including:
a sample generation module 21 for generating a sample triple for each trial time according to the vehicle state information collected at that trial time, a preset vehicle action set and a preset reward function.
In the embodiment of the invention, after the vehicle state information of the current trial time is collected, the preset vehicle action set is traversed according to the preset reward function to find the action yielding the maximum reward value, i.e. the maximum-reward action, and the vehicle state information, the maximum-reward action and the return value of that action are combined into a sample triple.
a cluster analysis module 22 for storing all sample triples as sample data in a pre-built experience database and performing cluster analysis on all sample data in the experience database.
In the embodiment of the invention, the class of each sample datum (i.e. each sample triple) is computed by a classification model initialized with a preset clustering algorithm and professional driving data, so that all sample data are assigned to the corresponding clusters; the cluster analysis thereby uncovers some intrinsic properties or regularities of the samples in the experience database, reducing the dimensionality or quantity of the training samples of the decision network model and achieving the purpose of condensing the training samples.
a training sampling module 23 for uniformly sampling training samples, according to a preset sampling-ratio value, from each cluster obtained by the cluster analysis of the experience database, and computing the cumulative return value of each training sample.
In the embodiment of the invention, uniform sampling within each cluster of the experience database selects representative training samples and keeps them independently and identically distributed, so that training the policy network model with these samples effectively improves its training efficiency. From the vehicle state information, the maximum-reward action and the maximum reward value in each training sample, the corresponding cumulative return value can be computed; specifically, the cumulative return value Q(st, at) can be computed as r0 + γr1 + γ²r2 + ..., where γ is a preset parameter and 0 ≤ γ < 1.
a model generation module 24 for training the decision network model for autonomous driving of the current vehicle from all the training samples, the cumulative return value of each training sample and a preset deep learning algorithm.
In the embodiment of the invention, the vehicle state information St and vehicle action information at in each training sample, together with the cumulative return value Q of the training sample, are stored into a preset data set, and the decision network model for autonomous driving is trained from this data set with the preset deep learning algorithm.
In the embodiment of the invention, the vehicle state information at different trial times, the corresponding maximum-reward actions and the return values of those actions form the sample data in the experience database; representative training samples are chosen from the sample data by cluster analysis and uniform sampling, and the decision network model is obtained by training these samples with rewards and deep learning, so that the condensation of the training samples together with rewards and deep learning effectively improves the training efficiency, learning ability and generalization ability of the decision network model.
Embodiment three:
Fig. 3 shows the structure of the apparatus for generating a decision network model for autonomous vehicle driving provided by embodiment three of the present invention, including:
a sample generation module 31 for generating a sample triple for each trial time according to the vehicle state information collected at that trial time, a preset vehicle action set and a preset reward function.
In the embodiment of the invention, after the vehicle state information of the current trial time is collected, the preset vehicle action set is traversed according to the preset reward function to find the action yielding the maximum reward value, i.e. the maximum-reward action, and the vehicle state information, the maximum-reward action and the return value of that action are combined into a sample triple.
a trial termination module 32 for ending the current interaction trial when it is detected that an accident occurs to the vehicle during trial driving or that the vehicle completes the preset trial driving task, and obtaining the sample triple of each trial time in the interaction trial.
In the embodiment of the invention, accidents during driving may include leaving the road, colliding, running out of fuel, and so on.
a cluster analysis module 33 for storing all sample triples as sample data in a pre-built experience database and performing cluster analysis on all sample data in the experience database.
In the embodiment of the invention, the class of each sample datum (i.e. each sample triple) is computed by a classification model initialized with a preset clustering algorithm and professional driving data, so that all sample data are assigned to the corresponding clusters; the cluster analysis thereby uncovers some intrinsic properties or regularities of the samples in the experience database, reducing the dimensionality or quantity of the training samples of the decision network model and achieving the purpose of condensing the training samples.
a training sampling module 34 for uniformly sampling training samples, according to a preset sampling-ratio value, from each cluster obtained by the cluster analysis of the experience database, and computing the cumulative return value of each training sample.
In the embodiment of the invention, uniform sampling within each cluster of the experience database selects representative training samples and keeps them independently and identically distributed, so that training the policy network model with these samples effectively improves its training efficiency. From the vehicle state information, the maximum-reward action and the maximum reward value in each training sample, the corresponding cumulative return value can be computed.
a model generation module 35 for training the decision network model for autonomous driving of the current vehicle from all the training samples, the cumulative return value of each training sample and a preset deep learning algorithm.
In the embodiment of the invention, the vehicle state information St and vehicle action information at in each training sample, together with the cumulative return value Q of the training sample, are stored into a preset data set, and the decision network model for autonomous driving is trained from this data set with the preset deep learning algorithm.
an experience update module 36 for detecting whether the experience database satisfies a preset constraint condition, and updating the sample data in the experience database when the experience database does not satisfy the constraint condition.
In the embodiment of the invention, interaction trials are performed repeatedly and the decision network model is trained multiple times; after each training round, whether the experience database satisfies the preset constraint condition is checked, and when it is not satisfied, sample data that do not meet the requirements are rejected from the experience data, so that the experience data are updated in time and the quality of the training samples is improved.
Specifically, the constraint condition can be len(DSh) < μrms && num(DSh) < Knum, where DSh is the experience database, len(·) computes the number of sample data in the experience database, num(·) counts the number of interaction trials, μrms is the maximum quantity of sample data, and Knum is the maximum number of interaction trials.
Specifically, whether a sample datum meets the requirements can be judged from the time gap between adjacent sample data in the experience data: when the time gap between the current sample datum and the previous one is smaller than a preset distance threshold, the current sample datum is determined not to meet the requirements.
Preferably, the sample generation module 31 includes a vehicle state collection module 311, a vehicle action search module 312 and a sample generation submodule 313, wherein:
the vehicle state collection module 311 is used to collect the vehicle state information of the current trial time;
the vehicle action search module 312 is used to search the vehicle action set for the maximum-reward action according to the vehicle state information of the current trial time and the reward function, and to order the vehicle to perform the maximum-reward action; and
the sample generation submodule 313 is used to combine the vehicle state information of the current trial time, the maximum-reward action and the return value of the maximum-reward action into the sample triple of the current trial time.
Preferably, the cluster analysis module 33 includes a classification model initialization module 331 and a class generation module 332, wherein:
the classification model initialization module 331 is used to initialize, according to a preset clustering algorithm and pre-collected professional driving data, the classification model for performing cluster analysis on the experience database; and
the class generation module 332 is used to compute, through the classification model, the class to which the vehicle state information in each sample datum of the experience database belongs, so as to obtain the classes of all sample data in the experience database.
In the embodiment of the invention, the vehicle state information at different trial times, the corresponding maximum-reward actions and the return values of those actions form the sample data in the experience database; representative training samples are chosen from the sample data by cluster analysis and uniform sampling, the decision network model is obtained by training these samples with rewards and deep learning, multiple trials and training rounds are performed on the decision network model, and the experience database is updated after each training round, so that the condensation of the training samples together with rewards and deep learning effectively improves the training efficiency, learning ability and generalization ability of the decision network model.
In the embodiment of the invention, each module of the apparatus for generating a decision network model for autonomous vehicle driving can be realized by a corresponding hardware or software module; the modules can be independent software or hardware modules, or can be integrated into one software or hardware module, which is not intended to limit the invention.
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention shall be included in the scope of protection of the invention.
Claims (10)
1. A method for generating a decision network model for automatic vehicle driving, characterized in that the method comprises the following steps:
generating a sample triple corresponding to each test moment according to vehicle state information collected at each test moment, a preset vehicle action set and a preset reward function;
storing all sample triples as sample data in a pre-built experience database, and performing cluster analysis on all the sample data in the experience database;
uniformly collecting training samples, according to a preset sampling ratio, from each cluster obtained by the cluster analysis of the experience database, and calculating a cumulative return value corresponding to each training sample;
training, according to all the training samples, the cumulative return value of each training sample and a preset deep learning algorithm, to obtain a decision network model for automatic driving of the current vehicle.
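The per-cluster sampling and return-calculation steps of claim 1 might look like the following sketch. The helper names, the discount factor `gamma` and discounted-return formulation are illustrative assumptions; the patent does not specify how the cumulative return value is computed:

```python
import random

def sample_training_set(clusters, ratio, seed=0):
    """Uniformly draw the same fraction of samples from every cluster,
    so that rare driving situations are not under-represented."""
    rng = random.Random(seed)
    picked = []
    for cluster in clusters:
        k = max(1, int(len(cluster) * ratio))
        picked.extend(rng.sample(cluster, k))
    return picked

def cumulative_return(rewards, gamma=0.99):
    """Discounted return G_t = r_t + gamma * G_{t+1}, computed backwards
    over one episode's reward sequence."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))
```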
2. The method according to claim 1, characterized in that, after the step of training to obtain the decision network model for automatic vehicle driving, the method further comprises:
detecting whether the experience database satisfies a preset constraint condition, and updating the sample data in the experience database when the experience database does not satisfy the constraint condition.
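The constraint condition in claim 2 is left open; one plausible reading is a capacity bound with oldest-first eviction, sketched below. The class name, the capacity constraint and the FIFO update policy are all assumptions:

```python
from collections import deque

class ExperienceDatabase:
    """Experience store with a maximum-size constraint (one possible
    reading of the claim's unspecified constraint condition)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.samples = deque()

    def add(self, triple):
        self.samples.append(triple)

    def satisfies_constraint(self):
        return len(self.samples) <= self.capacity

    def update(self):
        # Drop the oldest samples until the constraint holds again.
        while not self.satisfies_constraint():
            self.samples.popleft()
```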
3. The method according to claim 1, characterized in that the step of generating a sample triple corresponding to each test moment according to the vehicle state information collected at each test moment, the preset vehicle action set and the preset reward function comprises:
collecting the vehicle state information at a current test moment;
searching the vehicle action set for a maximum-reward-value action according to the vehicle state information at the current test moment and the reward function, and sending the maximum-reward-value action to the vehicle;
combining the vehicle state information at the current test moment, the maximum-reward-value action and the reward value of the maximum-reward-value action to obtain the sample triple corresponding to the current test moment.
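The maximum-reward-value lookup in claim 3 amounts to a greedy argmax over the action set. A minimal sketch, with `reward_fn` standing in for the preset reward function (its signature is an assumption):

```python
def best_action(state, actions, reward_fn):
    """Evaluate the reward function for every action in the action set
    and return the highest-scoring action together with its reward value;
    with the state, these form the (state, action, reward) sample triple."""
    best_a = max(actions, key=lambda a: reward_fn(state, a))
    return best_a, reward_fn(state, best_a)
```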
4. The method according to claim 1, characterized in that, after the step of generating the sample triple corresponding to each test moment and before the step of storing all sample triples as sample data in the pre-built experience database, the method further comprises:
terminating the current interaction test when it is detected that the vehicle has an accident during the test drive or has completed a preset test-drive task, and obtaining the sample triple corresponding to each test moment of the interaction test.
5. The method according to claim 1, characterized in that the step of performing cluster analysis on all the sample data in the experience database comprises:
initializing, according to a preset clustering algorithm and professional driving data collected in advance, a classification model for performing cluster analysis on the experience database;
calculating, by means of the classification model, the class to which the vehicle state information in each sample data item of the experience database belongs, so as to obtain the classes of all the sample data in the experience database.
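The classification model of claim 5 could be seeded from the professional driving data as one centroid per hand-labelled group, with samples assigned to the nearest centroid. This toy one-dimensional version is illustrative only: real vehicle states would be feature vectors, and the patent's preset clustering algorithm is unspecified:

```python
def fit_centroids(groups):
    """One centroid per group of professional driving states
    (mean of each hand-labelled group; 1-D here for brevity)."""
    return [sum(g) / len(g) for g in groups]

def classify(centroids, sample_states):
    """Assign each sample's state to the class of its nearest centroid."""
    def nearest(x):
        return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
    return [nearest(s) for s in sample_states]
```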
6. An apparatus for generating a decision network model for automatic vehicle driving, characterized in that the apparatus comprises:
a sample generation module, configured to generate a sample triple corresponding to each test moment according to vehicle state information collected at each test moment, a preset vehicle action set and a preset reward function;
a cluster analysis module, configured to store all sample triples as sample data in a pre-built experience database, and to perform cluster analysis on all the sample data in the experience database;
a training sampling module, configured to uniformly collect training samples, according to a preset sampling ratio, from each cluster obtained by the cluster analysis of the experience database, and to calculate a cumulative return value corresponding to each training sample; and
a model generation module, configured to train, according to all the training samples, the cumulative return value of each training sample and a preset deep learning algorithm, to obtain a decision network model for automatic driving of the current vehicle.
7. The apparatus according to claim 6, characterized in that the apparatus further comprises:
an experience update module, configured to detect whether the experience database satisfies a preset constraint condition, and to update the sample data in the experience database when the experience database does not satisfy the constraint condition.
8. The apparatus according to claim 6, characterized in that the sample generation module comprises:
a vehicle state collection module, configured to collect the vehicle state information at a current test moment;
a vehicle action search module, configured to search the vehicle action set for a maximum-reward-value action according to the vehicle state information at the current test moment and the reward function, and to order the vehicle to execute the maximum-reward-value action; and
a sample generation submodule, configured to combine the vehicle state information at the current test moment, the maximum-reward-value action and the reward value of the maximum-reward-value action to obtain the sample triple corresponding to the current test moment.
9. The apparatus according to claim 6, characterized in that the apparatus further comprises:
a test termination module, configured to terminate the current interaction test when it is detected that the vehicle has an accident during the test drive or has completed a preset test-drive task, and to obtain the sample triple corresponding to each test moment of the interaction test.
10. The apparatus according to claim 6, characterized in that the cluster analysis module comprises:
a classification model initialization module, configured to initialize, according to a preset clustering algorithm and professional driving data collected in advance, a classification model for performing cluster analysis on the experience database; and
a classification generation module, configured to calculate, by means of the classification model, the class to which the vehicle state information in each sample data item of the experience database belongs, so as to obtain the classes of all the sample data in the experience database.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710201086.8A CN107169567B (en) | 2017-03-30 | 2017-03-30 | Method and device for generating decision network model for automatic vehicle driving |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107169567A true CN107169567A (en) | 2017-09-15 |
CN107169567B CN107169567B (en) | 2020-04-07 |
Family
ID=59849244
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710201086.8A Active CN107169567B (en) | 2017-03-30 | 2017-03-30 | Method and device for generating decision network model for automatic vehicle driving |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107169567B (en) |
Cited By (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107544516A (en) * | 2017-10-11 | 2018-01-05 | 苏州大学 | Automated driving system and method based on relative entropy depth against intensified learning |
CN107826105A (en) * | 2017-10-31 | 2018-03-23 | 清华大学 | Translucent automatic Pilot artificial intelligence system and vehicle |
CN107862346A (en) * | 2017-12-01 | 2018-03-30 | 驭势科技(北京)有限公司 | A kind of method and apparatus for carrying out driving strategy model training |
CN108647789A (en) * | 2018-05-15 | 2018-10-12 | 浙江大学 | A kind of intelligent body deep value function learning method based on the sampling of state distributed awareness |
CN108932840A (en) * | 2018-07-17 | 2018-12-04 | 北京理工大学 | Automatic driving vehicle urban intersection passing method based on intensified learning |
CN108944944A (en) * | 2018-07-09 | 2018-12-07 | 深圳市易成自动驾驶技术有限公司 | Automatic Pilot model training method, terminal and readable storage medium storing program for executing |
CN108995655A (en) * | 2018-07-06 | 2018-12-14 | 北京理工大学 | A kind of driver's driving intention recognition methods and system |
CN109344969A (en) * | 2018-11-01 | 2019-02-15 | 石家庄创天电子科技有限公司 | Nerve network system and its training method and computer-readable medium |
CN109739216A (en) * | 2019-01-25 | 2019-05-10 | 深圳普思英察科技有限公司 | The test method and system of the practical drive test of automated driving system |
CN109747655A (en) * | 2017-11-07 | 2019-05-14 | 北京京东尚科信息技术有限公司 | Steering instructions generation method and device for automatic driving vehicle |
CN109752952A (en) * | 2017-11-08 | 2019-05-14 | 华为技术有限公司 | Method and device for acquiring multi-dimensional random distribution and strengthening controller |
CN109871010A (en) * | 2018-12-25 | 2019-06-11 | 南方科技大学 | method and system based on reinforcement learning |
CN109901446A (en) * | 2017-12-08 | 2019-06-18 | 广州汽车集团股份有限公司 | Controlling passing of road junction, apparatus and system |
CN109934171A (en) * | 2019-03-14 | 2019-06-25 | 合肥工业大学 | Driver's passiveness driving condition online awareness method based on layered network model |
CN110160804A (en) * | 2019-05-31 | 2019-08-23 | 中国科学院深圳先进技术研究院 | A kind of test method of automatic driving vehicle, apparatus and system |
CN110196587A (en) * | 2018-02-27 | 2019-09-03 | 中国科学院深圳先进技术研究院 | Vehicular automatic driving control strategy model generating method, device, equipment and medium |
CN110363295A (en) * | 2019-06-28 | 2019-10-22 | 电子科技大学 | A kind of intelligent vehicle multilane lane-change method based on DQN |
CN110378460A (en) * | 2018-04-13 | 2019-10-25 | 北京智行者科技有限公司 | Decision-making technique |
CN110478911A (en) * | 2019-08-13 | 2019-11-22 | 苏州钛智智能科技有限公司 | The unmanned method of intelligent game vehicle and intelligent vehicle, equipment based on machine learning |
CN110673602A (en) * | 2019-10-24 | 2020-01-10 | 驭势科技(北京)有限公司 | Reinforced learning model, vehicle automatic driving decision method and vehicle-mounted equipment |
CN110738221A (en) * | 2018-07-18 | 2020-01-31 | 华为技术有限公司 | operation system and method |
CN110764496A (en) * | 2018-07-09 | 2020-02-07 | 株式会社日立制作所 | Automatic driving assistance device and method thereof |
CN110824912A (en) * | 2018-08-08 | 2020-02-21 | 华为技术有限公司 | Method and apparatus for training a control strategy model for generating an autonomous driving strategy |
CN110850861A (en) * | 2018-07-27 | 2020-02-28 | 通用汽车环球科技运作有限责任公司 | Attention-based hierarchical lane change depth reinforcement learning |
CN110991095A (en) * | 2020-03-05 | 2020-04-10 | 北京三快在线科技有限公司 | Training method and device for vehicle driving decision model |
CN111091020A (en) * | 2018-10-22 | 2020-05-01 | 百度在线网络技术(北京)有限公司 | Automatic driving state distinguishing method and device |
CN111325230A (en) * | 2018-12-17 | 2020-06-23 | 上海汽车集团股份有限公司 | Online learning method and online learning device of vehicle lane change decision model |
CN111426933A (en) * | 2020-05-19 | 2020-07-17 | 浙江巨磁智能技术有限公司 | Safety type power electronic module and safety detection method thereof |
CN111443621A (en) * | 2020-06-16 | 2020-07-24 | 深圳市城市交通规划设计研究中心股份有限公司 | Model generation method, model generation device and electronic equipment |
CN111899594A (en) * | 2019-05-06 | 2020-11-06 | 百度(美国)有限责任公司 | Automated training data extraction method for dynamic models of autonomous vehicles |
CN112100787A (en) * | 2019-05-28 | 2020-12-18 | 顺丰科技有限公司 | Vehicle motion prediction method, device, electronic device, and storage medium |
CN112201070A (en) * | 2020-09-29 | 2021-01-08 | 上海交通大学 | Deep learning-based automatic driving expressway bottleneck section behavior decision method |
CN112924177A (en) * | 2021-04-02 | 2021-06-08 | 哈尔滨理工大学 | Rolling bearing fault diagnosis method for improved deep Q network |
CN113511222A (en) * | 2021-08-27 | 2021-10-19 | 清华大学 | Scene self-adaptive vehicle interactive behavior decision and prediction method and device |
CN113807503A (en) * | 2021-09-28 | 2021-12-17 | 中国科学技术大学先进技术研究院 | Autonomous decision making method, system, device and terminal suitable for intelligent automobile |
CN114332500A (en) * | 2021-09-14 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Image processing model training method and device, computer equipment and storage medium |
CN114624645A (en) * | 2022-03-10 | 2022-06-14 | 扬州宇安电子科技有限公司 | Miniature rotor unmanned aerial vehicle radar reconnaissance system based on micro antenna array |
CN114880938A (en) * | 2022-05-16 | 2022-08-09 | 重庆大学 | Method for realizing decision of automatically driving automobile behavior |
CN116757272A (en) * | 2023-07-03 | 2023-09-15 | 西湖大学 | Continuous motion control reinforcement learning framework and learning method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102109821A (en) * | 2010-12-30 | 2011-06-29 | 中国科学院自动化研究所 | System and method for controlling adaptive cruise of vehicles |
CN103381826A (en) * | 2013-07-31 | 2013-11-06 | 中国人民解放军国防科学技术大学 | Adaptive cruise control method based on approximate policy iteration |
CN105109485A (en) * | 2015-08-24 | 2015-12-02 | 奇瑞汽车股份有限公司 | Driving method and system |
CN106295637A (en) * | 2016-07-29 | 2017-01-04 | 电子科技大学 | A kind of vehicle identification method based on degree of depth study with intensified learning |
Non-Patent Citations (2)
Title |
---|
WEI XIA ET AL: "A Control Strategy of Autonomous Vehicles Based on Deep Reinforcement Learning", 《2016 9TH INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE AND DESIGN》 * |
MAO, Zhe: "Research on Identification Methods for Motor-Vehicle Fatigue Driving Behavior", China Doctoral Dissertations Full-text Database, Engineering Science and Technology II *
Cited By (61)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107544516A (en) * | 2017-10-11 | 2018-01-05 | 苏州大学 | Automated driving system and method based on relative entropy depth against intensified learning |
CN107826105A (en) * | 2017-10-31 | 2018-03-23 | 清华大学 | Translucent automatic Pilot artificial intelligence system and vehicle |
CN109747655A (en) * | 2017-11-07 | 2019-05-14 | 北京京东尚科信息技术有限公司 | Steering instructions generation method and device for automatic driving vehicle |
CN109747655B (en) * | 2017-11-07 | 2021-10-15 | 北京京东乾石科技有限公司 | Driving instruction generation method and device for automatic driving vehicle |
CN109752952B (en) * | 2017-11-08 | 2022-05-13 | 华为技术有限公司 | Method and device for acquiring multi-dimensional random distribution and strengthening controller |
CN109752952A (en) * | 2017-11-08 | 2019-05-14 | 华为技术有限公司 | Method and device for acquiring multi-dimensional random distribution and strengthening controller |
CN107862346A (en) * | 2017-12-01 | 2018-03-30 | 驭势科技(北京)有限公司 | A kind of method and apparatus for carrying out driving strategy model training |
CN107862346B (en) * | 2017-12-01 | 2020-06-30 | 驭势科技(北京)有限公司 | Method and equipment for training driving strategy model |
CN109901446A (en) * | 2017-12-08 | 2019-06-18 | 广州汽车集团股份有限公司 | Controlling passing of road junction, apparatus and system |
US11348455B2 (en) | 2017-12-08 | 2022-05-31 | Guangzhou Automobile Group Co., Ltd. | Intersection traffic control method, apparatus and system |
CN109901446B (en) * | 2017-12-08 | 2020-07-07 | 广州汽车集团股份有限公司 | Intersection passage control method, device and system |
CN110196587A (en) * | 2018-02-27 | 2019-09-03 | 中国科学院深圳先进技术研究院 | Vehicular automatic driving control strategy model generating method, device, equipment and medium |
CN110378460B (en) * | 2018-04-13 | 2022-03-08 | 北京智行者科技有限公司 | Decision making method |
CN110378460A (en) * | 2018-04-13 | 2019-10-25 | 北京智行者科技有限公司 | Decision-making technique |
CN108647789B (en) * | 2018-05-15 | 2022-04-19 | 浙江大学 | Intelligent body depth value function learning method based on state distribution sensing sampling |
CN108647789A (en) * | 2018-05-15 | 2018-10-12 | 浙江大学 | A kind of intelligent body deep value function learning method based on the sampling of state distributed awareness |
CN108995655A (en) * | 2018-07-06 | 2018-12-14 | 北京理工大学 | A kind of driver's driving intention recognition methods and system |
CN110764496A (en) * | 2018-07-09 | 2020-02-07 | 株式会社日立制作所 | Automatic driving assistance device and method thereof |
CN110764496B (en) * | 2018-07-09 | 2023-10-17 | 株式会社日立制作所 | Automatic driving assistance device and method thereof |
CN108944944A (en) * | 2018-07-09 | 2018-12-07 | 深圳市易成自动驾驶技术有限公司 | Automatic Pilot model training method, terminal and readable storage medium storing program for executing |
CN108932840A (en) * | 2018-07-17 | 2018-12-04 | 北京理工大学 | Automatic driving vehicle urban intersection passing method based on intensified learning |
CN110738221B (en) * | 2018-07-18 | 2024-04-26 | 华为技术有限公司 | Computing system and method |
CN110738221A (en) * | 2018-07-18 | 2020-01-31 | 华为技术有限公司 | operation system and method |
CN110850861A (en) * | 2018-07-27 | 2020-02-28 | 通用汽车环球科技运作有限责任公司 | Attention-based hierarchical lane change depth reinforcement learning |
CN110850861B (en) * | 2018-07-27 | 2023-05-23 | 通用汽车环球科技运作有限责任公司 | Attention-based hierarchical lane-changing depth reinforcement learning |
CN110824912A (en) * | 2018-08-08 | 2020-02-21 | 华为技术有限公司 | Method and apparatus for training a control strategy model for generating an autonomous driving strategy |
CN111091020A (en) * | 2018-10-22 | 2020-05-01 | 百度在线网络技术(北京)有限公司 | Automatic driving state distinguishing method and device |
CN109344969B (en) * | 2018-11-01 | 2022-04-08 | 石家庄创天电子科技有限公司 | Neural network system, training method thereof, and computer-readable medium |
CN109344969A (en) * | 2018-11-01 | 2019-02-15 | 石家庄创天电子科技有限公司 | Nerve network system and its training method and computer-readable medium |
CN111325230A (en) * | 2018-12-17 | 2020-06-23 | 上海汽车集团股份有限公司 | Online learning method and online learning device of vehicle lane change decision model |
CN111325230B (en) * | 2018-12-17 | 2023-09-12 | 上海汽车集团股份有限公司 | Online learning method and online learning device for vehicle lane change decision model |
CN109871010A (en) * | 2018-12-25 | 2019-06-11 | 南方科技大学 | method and system based on reinforcement learning |
CN109739216A (en) * | 2019-01-25 | 2019-05-10 | 深圳普思英察科技有限公司 | The test method and system of the practical drive test of automated driving system |
CN109934171B (en) * | 2019-03-14 | 2020-03-17 | 合肥工业大学 | Online perception method for passive driving state of driver based on hierarchical network model |
CN109934171A (en) * | 2019-03-14 | 2019-06-25 | 合肥工业大学 | Driver's passiveness driving condition online awareness method based on layered network model |
US11704554B2 (en) | 2019-05-06 | 2023-07-18 | Baidu Usa Llc | Automated training data extraction method for dynamic models for autonomous driving vehicles |
CN111899594A (en) * | 2019-05-06 | 2020-11-06 | 百度(美国)有限责任公司 | Automated training data extraction method for dynamic models of autonomous vehicles |
CN112100787B (en) * | 2019-05-28 | 2023-12-08 | 深圳市丰驰顺行信息技术有限公司 | Vehicle motion prediction method, device, electronic equipment and storage medium |
CN112100787A (en) * | 2019-05-28 | 2020-12-18 | 顺丰科技有限公司 | Vehicle motion prediction method, device, electronic device, and storage medium |
CN110160804A (en) * | 2019-05-31 | 2019-08-23 | 中国科学院深圳先进技术研究院 | A kind of test method of automatic driving vehicle, apparatus and system |
CN110160804B (en) * | 2019-05-31 | 2020-07-31 | 中国科学院深圳先进技术研究院 | Test method, device and system for automatically driving vehicle |
CN110363295A (en) * | 2019-06-28 | 2019-10-22 | 电子科技大学 | A kind of intelligent vehicle multilane lane-change method based on DQN |
CN110478911A (en) * | 2019-08-13 | 2019-11-22 | 苏州钛智智能科技有限公司 | The unmanned method of intelligent game vehicle and intelligent vehicle, equipment based on machine learning |
CN110673602A (en) * | 2019-10-24 | 2020-01-10 | 驭势科技(北京)有限公司 | Reinforced learning model, vehicle automatic driving decision method and vehicle-mounted equipment |
CN110673602B (en) * | 2019-10-24 | 2022-11-25 | 驭势科技(北京)有限公司 | Reinforced learning model, vehicle automatic driving decision method and vehicle-mounted equipment |
CN110991095A (en) * | 2020-03-05 | 2020-04-10 | 北京三快在线科技有限公司 | Training method and device for vehicle driving decision model |
CN110991095B (en) * | 2020-03-05 | 2020-07-03 | 北京三快在线科技有限公司 | Training method and device for vehicle driving decision model |
CN111426933A (en) * | 2020-05-19 | 2020-07-17 | 浙江巨磁智能技术有限公司 | Safety type power electronic module and safety detection method thereof |
CN111443621A (en) * | 2020-06-16 | 2020-07-24 | 深圳市城市交通规划设计研究中心股份有限公司 | Model generation method, model generation device and electronic equipment |
CN111443621B (en) * | 2020-06-16 | 2020-10-27 | 深圳市城市交通规划设计研究中心股份有限公司 | Model generation method, model generation device and electronic equipment |
CN112201070A (en) * | 2020-09-29 | 2021-01-08 | 上海交通大学 | Deep learning-based automatic driving expressway bottleneck section behavior decision method |
CN112924177A (en) * | 2021-04-02 | 2021-06-08 | 哈尔滨理工大学 | Rolling bearing fault diagnosis method for improved deep Q network |
CN113511222A (en) * | 2021-08-27 | 2021-10-19 | 清华大学 | Scene self-adaptive vehicle interactive behavior decision and prediction method and device |
CN113511222B (en) * | 2021-08-27 | 2023-09-26 | 清华大学 | Scene self-adaptive vehicle interaction behavior decision and prediction method and device |
CN114332500A (en) * | 2021-09-14 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Image processing model training method and device, computer equipment and storage medium |
CN113807503B (en) * | 2021-09-28 | 2024-02-09 | 中国科学技术大学先进技术研究院 | Autonomous decision making method, system, device and terminal suitable for intelligent automobile |
CN113807503A (en) * | 2021-09-28 | 2021-12-17 | 中国科学技术大学先进技术研究院 | Autonomous decision making method, system, device and terminal suitable for intelligent automobile |
CN114624645B (en) * | 2022-03-10 | 2022-09-30 | 扬州宇安电子科技有限公司 | Miniature rotor unmanned aerial vehicle radar reconnaissance system based on micro antenna array |
CN114624645A (en) * | 2022-03-10 | 2022-06-14 | 扬州宇安电子科技有限公司 | Miniature rotor unmanned aerial vehicle radar reconnaissance system based on micro antenna array |
CN114880938A (en) * | 2022-05-16 | 2022-08-09 | 重庆大学 | Method for realizing decision of automatically driving automobile behavior |
CN116757272A (en) * | 2023-07-03 | 2023-09-15 | 西湖大学 | Continuous motion control reinforcement learning framework and learning method |
Also Published As
Publication number | Publication date |
---|---|
CN107169567B (en) | 2020-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107169567A (en) | The generation method and device of a kind of decision networks model for Vehicular automatic driving | |
Li et al. | Humanlike driving: Empirical decision-making system for autonomous vehicles | |
CN107229973A (en) | The generation method and device of a kind of tactful network model for Vehicular automatic driving | |
CN109709956B (en) | Multi-objective optimized following algorithm for controlling speed of automatic driving vehicle | |
CN103364006B (en) | For determining the system and method for vehicle route | |
CN106991251B (en) | Cellular machine simulation method for highway traffic flow | |
CN109466543A (en) | Plan autokinetic movement | |
CN113044064B (en) | Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning | |
CN110196587A (en) | Vehicular automatic driving control strategy model generating method, device, equipment and medium | |
CN107310550A (en) | Road vehicles travel control method and device | |
Scheel et al. | Situation assessment for planning lane changes: Combining recurrent models and prediction | |
Bayar et al. | Impact of different spacing policies for adaptive cruise control on traffic and energy consumption of electric vehicles | |
CN115601954B (en) | Lane change judgment method, device, equipment and medium for intelligent networked fleet | |
CN113715842B (en) | High-speed moving vehicle control method based on imitation learning and reinforcement learning | |
Koenig et al. | Bridging the gap between open loop tests and statistical validation for highly automated driving | |
CN117668413A (en) | Automatic driving comprehensive decision evaluation method and device considering multiple types of driving elements | |
Jia et al. | An LSTM-based speed predictor based on traffic simulation data for improving the performance of energy-optimal adaptive cruise control | |
Wen et al. | Modeling human driver behaviors when following autonomous vehicles: An inverse reinforcement learning approach | |
CN108839655A (en) | A kind of cooperating type self-adaptation control method based on minimum safe spacing | |
CN114954498A (en) | Reinforced learning lane change behavior planning method and system based on simulated learning initialization | |
Jebessa et al. | Analysis of reinforcement learning in autonomous vehicles | |
CN115096305A (en) | Intelligent driving automobile path planning system and method based on generation of countermeasure network and simulation learning | |
Mao et al. | Deep learning based vehicle position estimation for human drive vehicle at connected freeway | |
Tang et al. | Research on decision-making of lane-changing of automated vehicles in highway confluence area based on deep reinforcement learning | |
Zhang et al. | Lane Change Decision Algorithm Based on Deep Q Network for Autonomous Vehicles |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |