CN112232490A - Deep imitation reinforcement learning driving strategy training method based on vision - Google Patents

Deep imitation reinforcement learning driving strategy training method based on vision

Info

Publication number
CN112232490A
Authority
CN
China
Prior art keywords: network, learning, training, reinforcement learning, layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011154491.7A
Other languages
Chinese (zh)
Other versions
CN112232490B (en)
Inventor
邹启杰
熊康
高兵
汪祖民
王东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University filed Critical Dalian University
Priority to CN202011154491.7A
Publication of CN112232490A
Application granted
Publication of CN112232490B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 60/00 Drive control systems specially adapted for autonomous road vehicles
    • B60W 60/001 Planning or execution of driving tasks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a vision-based deep imitation reinforcement learning driving strategy training method, which comprises the following steps: constructing an imitation learning network; training the imitation learning network; splitting the trained imitation learning network to obtain a perception module; constructing a DDPG network to obtain a control module; completing the construction of a deep imitation reinforcement learning model from the perception module and the control module; and training the deep imitation reinforcement learning model. The imitation learning network comprises five convolutional layers and four fully connected layers: the convolutional layers extract features, and the fully connected layers predict the steering angle, throttle opening and brake opening. In addition, the reward function set during training of the deep imitation reinforcement learning model ensures comfort and safety when driving through curves.

Description

Deep imitation reinforcement learning driving strategy training method based on vision
Technical Field
The invention relates to the technical field of automatic driving, and in particular to a vision-based deep imitation reinforcement learning driving strategy training method.
Background
The rise of autonomous driving technology offers a new solution to existing traffic problems. Autonomous driving can effectively improve the driving efficiency of road vehicles and thereby relieve traffic pressure, and by exploiting the efficient and precise execution of machines it can reduce traffic accidents and improve driving safety. Meanwhile, advances in science and technology have promoted the rise of intelligent transportation, and computing power, traffic big data and the now-popular deep learning have jointly driven the rapid development of autonomous driving.
Among the various tasks of autonomous driving, sensors such as radar, lidar, ultrasonic sensors and infrared cameras are widely used, but the ordinary color camera, which is inexpensive and acquires rich information, remains the most widely accepted and highly reliable means of environment perception. Moreover, having an intelligent driving system use vision to recognize and analyze the traffic environment matches the way human drivers perceive their environment.
For learning human-like driving strategies, deep reinforcement learning, imitation learning, transfer learning and the like are commonly adopted. Driving strategy learning from the driver's viewpoint, when trained end to end, can adaptively extract features from raw images, which avoids the limitations of manual feature engineering, saves cost, removes the reliance on human experience, and improves the comprehensiveness, accuracy and objectivity of the learned strategy. However, these methods still have shortcomings for autonomous driving strategy learning: the quality of a strategy learned by imitation learning is limited by the individual expert's demonstration data, and a naive algorithm generalizes poorly to unknown scenes; deep reinforcement learning must explore the traffic environment on a large scale before it can make decisions, and its low learning efficiency makes it hard to guarantee the real-time performance of driving decision and planning; transfer learning is an effective route to driving strategy learning, but it too struggles to transfer driving strategies to atypical environments.
Therefore, on the one hand the learning of expert driving strategies must be addressed, and on the other hand the learned driving strategy should have a certain generalization capability, i.e. it should reach a human-like level even in unknown traffic environments, thereby enlarging the applicable scope of autonomous driving.
The performance of the end-to-end model in the vision-based end-to-end lane keeping method disclosed in Chinese patent document CN109446919A depends on the quantity and quality of the collected driving data, and collecting data for all kinds of driving scenes in order to obtain an excellent driving strategy takes a great deal of time. Moreover, it is impractical to collect driving data for every scene, so the model cannot handle unknown scenes and its performance is hard to improve further.
In the method for constructing a layered end-to-end vehicle automatic driving system disclosed in Chinese patent document CN108897313A, the first two layers of neural network models require a large amount of labelled data for training, while the last two layers still follow the traditional reinforcement learning training mode, which requires extensive exploration and makes training time-consuming. Because of the complex network structure, each network model also has to be trained separately, further increasing training complexity.
The biggest problem of the model-free deep reinforcement learning algorithm proposed by Berkeley in 2019 in "Model-free Deep Reinforcement Learning for Urban Autonomous Driving" is that the deep reinforcement learning part still follows the traditional learning mode, so learning efficiency is low in the huge search space of an urban simulation, which limits the applicable scope of the model.
The DAVE-2 network model proposed by the NVIDIA team in "End to End Learning for Self-Driving Cars" has the same shortcomings as the model of patent document CN109446919A: it requires a large amount of labelled data to train the network, and because it is difficult to collect driving data covering all scenes, it cannot handle other unknown scenes.
Disclosure of Invention
Aiming at the defects of the prior art, the application provides a vision-based deep imitation reinforcement learning driving strategy training method, which solves the problem of driving strategy learning in unknown environments and thus improves the generalization capability of the learned driving strategy; it also draws on the way humans drive by setting a reinforcement learning reward function suited to curves, further ensuring the stability of the vehicle when driving through curves.
In order to achieve this purpose, the technical scheme of the application is as follows. A vision-based deep imitation reinforcement learning driving strategy training method comprises the following steps:
constructing an imitation learning network;
training the imitation learning network;
splitting the trained imitation learning network to obtain a perception module;
constructing a DDPG network to obtain a control module;
completing the construction of a deep imitation reinforcement learning model from the perception module and the control module;
training the deep imitation reinforcement learning model.
Further, the imitation learning network comprises five convolutional layers and four fully connected layers: the convolutional layers extract features, and the fully connected layers predict the steering angle, throttle opening and brake opening. The five convolutional layers use 5x5 convolution kernels, with max-pooling and Dropout layers added to optimize the network. The five convolutional layers and the first three fully connected layers all use ReLU activation functions; the last fully connected layer is the output layer and comprises three fully connected branches, which use tanh, sigmoid and sigmoid activations respectively and output the three actions of steering, throttle and brake.
Further, the imitation learning network takes processed images of 64x64 pixels as input and outputs vehicle control information comprising the predicted steering angle, predicted throttle information and predicted braking information.
Further, training the imitation learning network specifically comprises:
collecting human driving data with TORCS (The Open Racing Car Simulator), selecting well-performed human driving data and the corresponding driver-view video frames as sample data, and using the human control commands, which comprise steering, throttle and brake, as labels;
training the imitation learning network with the DAgger algorithm, an iterative policy training algorithm in the online learning setting in which, at each iteration, the learner is retrained on all states encountered so far.
Further, splitting the trained imitation learning network to obtain the perception module specifically comprises:
saving the weights of the trained imitation learning network and splitting the network, taking the first seven layers as the perception module and assigning them the corresponding weights, where the first seven layers comprise the five convolutional layers and two fully connected layers;
the perception module takes a first-person driving image as input and outputs the corresponding feature vector; the last two fully connected layers serve as the action generation network, whose weights are used to initialize the Actor network of the control part so as to guarantee the initial performance of the whole model.
Further, constructing the DDPG network to obtain the control module specifically comprises:
dividing the DDPG network into an Actor network and a Critic network, where the Actor network has three layers: an input layer that receives the features generated by the perception module, a hidden fully connected layer, and an output layer consisting of three fully connected branches corresponding to the steering, throttle and brake outputs;
the Actor network has the same structure as the action generation network and is initialized with its weights; the Critic network consists of fully connected layers and takes the environment features generated by the perception module and the action information from the Actor network as input; the action information passes through one fully connected layer, is combined with the feature vector, passes through another fully connected layer, and a final fully connected layer outputs the values of the three actions for DDPG learning.
Further, training the deep imitation reinforcement learning model specifically comprises:
adding Ornstein-Uhlenbeck (OU) exploration noise to the deep imitation reinforcement learning model and setting a reward function suited to the task, in which a target speed is specified at curves; the reward function encourages the car to decelerate to the specified speed at curves and to accelerate on straight roads;
the OU exploration formula is as follows:
E dxt=E(μ-xt)dt+σdwt (1)
wherein E represents the recovery to the average value too fast, μ represents the average value, σ represents the amplitude of the fluctuation, and the specific parameters are shown in Table 1;
Table 1. Parameters of the OU noise (the parameter values are provided as an image in the original document and are not reproduced here)
The reward function (provided as an image in the original document) is defined in terms of the following quantities: I is an indicator that equals 1 when the condition in brackets is satisfied and 0 otherwise; d1 is the distance to the lane line directly ahead of the vehicle; d2 is a lane-centering parameter whose value approaches 0 as the vehicle approaches the center of the lane; vx is the longitudinal speed of the vehicle, and θ is the angle between the vehicle and the lane line; α and β are the target speed at the curve and the penalty discount, respectively. When d1 is less than 10, the vehicle is in a curve; when d1 is less than 40, a curve is about to be entered ahead, and α is set to 50, encouraging the vehicle to decelerate to 50 at the curve; when d1 is greater than 40, the vehicle is on a straight road, and acceleration is encouraged.
The invention can obtain the following technical effects: the intelligent driving control structure is divided into a perception part and a control part, and the driving strategy is learned efficiently through imitation learning and deep reinforcement learning, respectively. On the one hand, imitation learning based on the DAgger algorithm reduces the learning cost and obtains a good driving strategy from a small number of samples; on the other hand, incremental learning in unknown environments improves the ability to handle unknown environments, accelerates model convergence and improves overall real-time performance. In addition, the reward function set during training of the deep imitation reinforcement learning model also ensures comfort and safety when driving through curves.
Drawings
In order to illustrate the technical solution of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described as follows:
FIG. 1 is a workflow diagram of the deep imitation reinforcement learning model;
FIG. 2 is a schematic diagram of network splitting and weight sharing;
FIG. 3 is a schematic diagram of the deep imitation reinforcement learning model;
FIG. 4 is the imitation learning network architecture diagram.
Detailed Description
The invention is described in further detail below with reference to the figures and a specific embodiment. It is to be understood that the described embodiments are only some, not all, of the embodiments of the invention.
This embodiment provides a vision-based deep imitation reinforcement learning driving strategy training method that combines the advantages of imitation learning and deep reinforcement learning: an initial driving strategy is obtained through imitation learning, and online driving strategy learning is handled by deep reinforcement learning. Using the output of imitation learning as the input of deep reinforcement learning reduces the exploration space and improves learning efficiency; meanwhile, deep reinforcement learning enables driving strategy learning in unknown environments, which improves the generalization capability of the learned strategy. The invention also draws on the way humans drive by setting a reinforcement learning reward function suited to curves, further ensuring the stability of the vehicle when driving through curves. The method specifically comprises the following steps:
Constructing an imitation learning network;
the simulation learning network comprises 5 convolutional layers and four full-connection layers, wherein the convolutional layers are used for extracting characteristics, and the full-connection layers are used for predicting steering angles, accelerator opening degrees and brake opening degrees; the input of the simulated learning network is processed images with the size of 64x64 pixels, and the output of the simulated learning network is automobile control information comprising predicted steering angle, predicted accelerator information and predicted brake information. The 5 convolutional layers use 5x5 convolutional kernels, wherein a max pooling layer and a Dropout layer are added to optimize the network; the 5 convolutional layers and the first three full-connection layers all use Relu activation functions, the last full-connection layer is an output layer and comprises three full-connection networks, the three full-connection networks respectively use tanh, sigmoid and sigmoid activation functions, and 3 actions of steering, accelerating and braking are correspondingly output.
Training the imitation learning network;
the method comprises the steps of utilizing a TORCS (simulated driving simulator) to collect artificial driving data, selecting data with excellent driving performance and a vehicle video frame of a corresponding driver visual angle as sample data, and using artificial control instructions (steering, accelerator and brake) as labels. Only typical driving data need to be collected, and the driving data can be used as a sample of simulation learning.
Under the condition of less demonstration data, the simulation learning network is trained by using a DAgger algorithm so as to obtain the best training effect. DAgger is an iterative strategy training algorithm that reverts to an online learning state. In each iteration, the learner retrains the primary classifier on all states encountered. The main advantage of DAgger is that expert demonstration is used to teach the learner how to recover from past errors, enabling active learning to the expert.
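As an illustration of the training procedure, the sketch below outlines the DAgger loop under assumed interfaces: `env` stands in for the TORCS simulator, `expert` for the human demonstrator, and `learner` for the imitation network; the method names used on these objects are hypothetical.

```python
import numpy as np

def dagger_train(env, expert, learner, n_iterations=10, steps_per_iter=1000):
    """Sketch of DAgger: at every iteration the learner drives, the expert labels the
    states the learner actually visits, and the learner is retrained on the aggregate
    of all states collected so far."""
    # start from the initial expert demonstrations (states + control labels)
    states, labels = expert.initial_demonstrations()
    states, labels = list(states), list(labels)
    for _ in range(n_iterations):
        # retrain the policy on everything encountered so far
        learner.fit(np.array(states), np.array(labels))
        obs = env.reset()
        for _ in range(steps_per_iter):
            action = learner.predict(obs)      # the learner drives ...
            states.append(obs)
            labels.append(expert.label(obs))   # ... the expert labels the visited state
            obs, done = env.step(action)
            if done:
                obs = env.reset()
    return learner
```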
Splitting the trained imitation learning network to obtain a perception module;
carrying out weight value storage and network splitting on the simulated learning network after training is finished, splitting a front 7-layer network comprising 5 convolutional layers and 2 full-connection layers to serve as a sensing module and endowing the sensing module with corresponding weight values, wherein a first visual angle driving image is input by the sensing module, and a corresponding characteristic vector is output by the sensing module; the latter two layers of networks serve as action generating networks, and the action generating networks are used for initializing the Actor network in the control part by using the weights of the action generating networks so as to ensure the initial performance of the whole model, as shown in fig. 2.
Constructing a DDPG network to obtain a control module;
the DDPG network is divided into an Actor network and a Critic network. Because the perception part processes the input of the original image, the network of the control part can be simplified to improve the learning efficiency. The Actor network is divided into three layers, an input layer receives characteristics generated by a sensing part, a hidden layer is a full-connection layer, and an output layer consists of three full-connection networks and respectively corresponds to output steering, an accelerator and a brake.
The last two layers of networks (action generating networks) of the Actor network and the simulation learning network have the same structure, and the Actor network is initialized by using the weight of the action generating network. The Critic network is composed of a fully-connected network, the environmental characteristics generated by a sensing part and action information in an Actor network are used as input, the action information is combined with the characteristic vector after being processed by a layer of the fully-connected network, then is processed by a layer of the fully-connected network, and finally, the values of three actions are output by a layer of the fully-connected network and are provided for DDPG network learning.
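The following sketch shows one possible PyTorch layout of the Actor and Critic described above, continuing the assumptions of the earlier sketches (128-dimensional perception features, hidden width 64); the layer widths are illustrative, not taken from this document.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Same structure as the action generation network: one hidden FC layer and
    three output heads for steering, throttle and brake."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
        self.steer = nn.Sequential(nn.Linear(64, 1), nn.Tanh())
        self.throttle = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())
        self.brake = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, feat):
        h = self.hidden(feat)
        return torch.cat([self.steer(h), self.throttle(h), self.brake(h)], dim=-1)

class Critic(nn.Module):
    """The action passes through one FC layer, is concatenated with the feature
    vector, passes through another FC layer, and a final FC layer outputs the
    values of the three actions for DDPG learning."""
    def __init__(self, feat_dim=128, action_dim=3):
        super().__init__()
        self.action_fc = nn.Sequential(nn.Linear(action_dim, 64), nn.ReLU())
        self.merge_fc = nn.Sequential(nn.Linear(feat_dim + 64, 64), nn.ReLU())
        self.value_fc = nn.Linear(64, action_dim)

    def forward(self, feat, action):
        a = self.action_fc(action)
        h = self.merge_fc(torch.cat([feat, a], dim=-1))
        return self.value_fc(h)
```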
Completing the construction of the deep imitation reinforcement learning model (DIRL) from the perception module and the control module;
The deep imitation reinforcement learning model comprises the perception module and the control module; the workflow is shown in FIG. 1. The perception module takes a 64x64 first-person driving image from the TORCS simulator as input and processes it into the corresponding feature vector; the control module takes the feature vector generated by the perception module as input and outputs the final control command. The specific network structure is shown in FIG. 3.
Training the deep imitation reinforcement learning model;
and adding a proper OU exploration factor into the deep simulation reinforcement learning model, setting a reward function which accords with a task, exploring and learning in a simulation environment, and further improving the performance of the whole model. In order to solve the problem that the vehicle cannot be correctly braked at a curve when the vehicle is driven at a high speed, a specified speed is set at the curve in a reward function, and the vehicle is encouraged to decelerate to the specified speed at the curve and accelerate on a straight road through the reward function. Because the whole model has initial performance, a large amount of exploration time can be reduced, and the learning efficiency is improved.
The OU exploration process is:
dx_t = θ(μ - x_t)dt + σ dW_t   (1)
where θ represents how quickly the process reverts to the mean, μ is the mean value, and σ is the magnitude of the fluctuation; the specific parameters are shown in Table 1.
Table 1. Parameters of the OU noise (the parameter values are provided as an image in the original document and are not reproduced here)
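As an illustration of formula (1), the sketch below discretizes the OU process with an Euler-Maruyama step; since the parameter values of Table 1 are not reproduced here, the θ, μ and σ values passed to the constructor are placeholders, typically with one noise instance per action dimension.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise: dx_t = theta * (mu - x_t) dt + sigma * dW_t."""
    def __init__(self, theta, mu, sigma, dt=1.0):
        self.theta, self.mu, self.sigma, self.dt = theta, mu, sigma, dt
        self.x = mu

    def reset(self):
        self.x = self.mu

    def sample(self):
        # Euler-Maruyama discretization of the OU stochastic differential equation
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn()
        self.x = self.x + dx
        return self.x

# example: exploration noise added to the Actor's steering output during training
# (these theta/mu/sigma values are illustrative, not those of Table 1)
steer_noise = OUNoise(theta=0.6, mu=0.0, sigma=0.3)
```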
The reward function (provided as an image in the original document) is defined in terms of the following quantities: I is an indicator that equals 1 when the condition in brackets is satisfied and 0 otherwise; d1 is the distance to the lane line directly ahead of the vehicle; d2 is a lane-centering parameter whose value approaches 0 as the vehicle approaches the center of the lane; vx is the longitudinal speed of the vehicle, and θ is the angle between the vehicle and the lane line; α and β are the target speed at the curve and the penalty discount, respectively. When d1 is less than 10, the vehicle is in a curve; when d1 is less than 40, a curve is about to be entered ahead, and α is set to 50, encouraging the vehicle to decelerate to 50 at the curve; when d1 is greater than 40, the vehicle is on a straight road, and acceleration is encouraged.
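The exact reward formula is given only as an image in the original document and is not reproduced here; the sketch below merely illustrates the behaviour described in the preceding paragraph (penalizing heading error and lane-center deviation, and encouraging deceleration toward α = 50 when d1 < 40), with the base term and the weight β chosen as assumptions.

```python
import numpy as np

def reward(d1, d2, vx, theta, alpha=50.0, beta=0.5):
    """Illustrative reward only; the weights and the base term are assumptions."""
    # base term: reward longitudinal progress, penalize heading error (theta)
    # and deviation from the lane center (d2 -> 0 at the center)
    r = vx * np.cos(theta) - vx * abs(np.sin(theta)) - vx * abs(d2)
    if d1 < 40:
        # a curve is ahead (d1 < 40) or the car is already in it (d1 < 10):
        # penalize exceeding the target curve speed alpha, discounted by beta
        r -= beta * max(0.0, vx - alpha)
    return r
```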
The method solves the problem of learning driving strategies beyond the expert demonstrations: in the DIRL model, imitation learning first obtains a certain initial performance in a short time from a small amount of labelled data in a supervised manner; exploration learning is then carried out with the deep reinforcement learning algorithm DDPG, and by adding OU exploration noise and a corresponding reward function the agent further learns driving strategies beyond the expert demonstrations and thus obtains better performance.
The ability to cope with unknown traffic environments is improved. Because appropriate exploration noise is added to DDPG, the agent can obtain a large amount of driving data from unknown scenes in the simulation environment, and by adjusting the network parameters with the reward function it copes with unknown scenes better than imitation learning alone.
The exploration efficiency of deep reinforcement learning is improved and convergence is accelerated. With a small amount of labelled data, the imitation learning network is trained with the DAgger algorithm; the perception module and the action generation network are then obtained by network splitting, and the action generation network initializes the Actor network of the control part, so that the whole DIRL model has a certain initial performance. Because the model starts with this initial performance, a large amount of unnecessary exploration in reinforcement learning is avoided; and because the perception module handles the raw pixels, the network structure used for reinforcement learning can be greatly simplified, further improving learning efficiency.
The above description only presents preferred embodiments of the present invention, but the scope of the present invention is not limited thereto; any substitution or change that a person skilled in the art could make within the technical scope of the present invention based on the technical solution and inventive concept of the present invention shall fall within the scope of the present invention.

Claims (7)

1. A vision-based deep imitation reinforcement learning driving strategy training method, characterized by comprising the following steps:
constructing an imitation learning network;
training the imitation learning network;
splitting the trained imitation learning network to obtain a perception module;
constructing a DDPG network to obtain a control module;
completing the construction of a deep imitation reinforcement learning model from the perception module and the control module;
training the deep imitation reinforcement learning model.
2. The vision-based deep imitation reinforcement learning driving strategy training method according to claim 1, characterized in that the imitation learning network comprises five convolutional layers and four fully connected layers: the convolutional layers extract features, and the fully connected layers predict the steering angle, throttle opening and brake opening; the five convolutional layers use 5x5 convolution kernels, with max-pooling and Dropout layers added to optimize the network; the five convolutional layers and the first three fully connected layers all use ReLU activation functions, and the last fully connected layer is the output layer and comprises three fully connected branches, which use tanh, sigmoid and sigmoid activations respectively and output the three actions of steering, throttle and brake.
3. The method as claimed in claim 2, characterized in that the imitation learning network takes processed images of 64x64 pixels as input and outputs vehicle control information comprising the predicted steering angle, predicted throttle information and predicted braking information.
4. The vision-based deep imitation reinforcement learning driving strategy training method according to claim 1, characterized in that training the imitation learning network comprises:
collecting human driving data with TORCS (The Open Racing Car Simulator), selecting well-performed human driving data and the corresponding driver-view video frames as sample data, and using the human control commands, which comprise steering, throttle and brake, as labels;
training the imitation learning network with the DAgger algorithm, an iterative policy training algorithm in the online learning setting in which, at each iteration, the learner is retrained on all states encountered so far.
5. The vision-based deep imitation reinforcement learning driving strategy training method according to claim 1, characterized in that splitting the trained imitation learning network to obtain the perception module specifically comprises:
saving the weights of the trained imitation learning network and splitting the network, taking the first seven layers as the perception module and assigning them the corresponding weights, where the first seven layers comprise the five convolutional layers and two fully connected layers;
the perception module takes a first-person driving image as input and outputs the corresponding feature vector; the last two fully connected layers serve as the action generation network, whose weights are used to initialize the Actor network of the control part so as to guarantee the initial performance of the whole model.
6. The vision-based deep imitation reinforcement learning driving strategy training method according to claim 1, characterized in that constructing the DDPG network to obtain the control module specifically comprises:
dividing the DDPG network into an Actor network and a Critic network, where the Actor network has three layers: an input layer that receives the features generated by the perception module, a hidden fully connected layer, and an output layer consisting of three fully connected branches corresponding to the steering, throttle and brake outputs;
the Actor network has the same structure as the action generation network and is initialized with its weights; the Critic network consists of fully connected layers and takes the environment features generated by the perception module and the action information from the Actor network as input; the action information passes through one fully connected layer, is combined with the feature vector, passes through another fully connected layer, and a final fully connected layer outputs the values of the three actions for DDPG learning.
7. The vision-based deep imitation reinforcement learning driving strategy training method according to claim 1, characterized in that training the deep imitation reinforcement learning model specifically comprises:
adding Ornstein-Uhlenbeck (OU) exploration noise to the deep imitation reinforcement learning model and setting a reward function suited to the task, in which a target speed is specified at curves; the reward function encourages the car to decelerate to the specified speed at curves and to accelerate on straight roads;
the OU exploration process is:
dx_t = θ(μ - x_t)dt + σ dW_t   (1)
where θ represents how quickly the process reverts to the mean, μ is the mean value, and σ is the magnitude of the fluctuation; the specific parameters are shown in Table 1.
Table 1. Parameters of the OU noise (the parameter values are provided as an image in the original document and are not reproduced here)
The reward function (provided as an image in the original document) is defined in terms of the following quantities: I is an indicator that equals 1 when the condition in brackets is satisfied and 0 otherwise; d1 is the distance to the lane line directly ahead of the vehicle; d2 is a lane-centering parameter whose value approaches 0 as the vehicle approaches the center of the lane; vx is the longitudinal speed of the vehicle, and θ is the angle between the vehicle and the lane line; α and β are the target speed at the curve and the penalty discount, respectively. When d1 is less than 10, the vehicle is in a curve; when d1 is less than 40, a curve is about to be entered ahead, and α is set to 50, encouraging the vehicle to decelerate to 50 at the curve; when d1 is greater than 40, the vehicle is on a straight road, and acceleration is encouraged.
CN202011154491.7A 2020-10-26 2020-10-26 Vision-based deep imitation reinforcement learning driving strategy training method Active CN112232490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011154491.7A CN112232490B (en) Vision-based deep imitation reinforcement learning driving strategy training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011154491.7A CN112232490B (en) Vision-based deep imitation reinforcement learning driving strategy training method

Publications (2)

Publication Number Publication Date
CN112232490A true CN112232490A (en) 2021-01-15
CN112232490B CN112232490B (en) 2023-06-20

Family

ID=74109364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011154491.7A Active CN112232490B (en) Vision-based deep imitation reinforcement learning driving strategy training method

Country Status (1)

Country Link
CN (1) CN112232490B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112904864A (en) * 2021-01-28 2021-06-04 的卢技术有限公司 Automatic driving method and system based on deep reinforcement learning
CN113064424A (en) * 2021-03-17 2021-07-02 西安工业大学 Unmanned vehicle path planning method for improving DDPG algorithm
CN113353102A (en) * 2021-07-08 2021-09-07 重庆大学 Unprotected left-turn driving control method based on deep reinforcement learning
CN113715842A (en) * 2021-08-24 2021-11-30 华中科技大学 High-speed moving vehicle control method based on simulation learning and reinforcement learning
CN113741533A (en) * 2021-09-16 2021-12-03 中国电子科技集团公司第五十四研究所 Unmanned aerial vehicle intelligent decision-making system based on simulation learning and reinforcement learning
CN114104005A (en) * 2022-01-26 2022-03-01 苏州浪潮智能科技有限公司 Decision-making method, device and equipment of automatic driving equipment and readable storage medium
CN114444718A (en) * 2022-01-26 2022-05-06 北京百度网讯科技有限公司 Training method of machine learning model, signal control method and device
CN114609925A (en) * 2022-01-14 2022-06-10 中国科学院自动化研究所 Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish
CN114708568A (en) * 2022-06-07 2022-07-05 东北大学 Pure vision automatic driving control system, method and medium based on improved RTFNet
CN114925850A (en) * 2022-05-11 2022-08-19 华东师范大学 Deep reinforcement learning confrontation defense method for disturbance reward

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427827A (en) * 2019-07-08 2019-11-08 辽宁工程技术大学 It is a kind of it is multiple dimensioned perception and Global motion planning under autonomous driving network
CN110795821A (en) * 2019-09-25 2020-02-14 的卢技术有限公司 Deep reinforcement learning training method and system based on scene differentiation
CN111010294A (en) * 2019-11-28 2020-04-14 国网甘肃省电力公司电力科学研究院 Electric power communication network routing method based on deep reinforcement learning
US20200241542A1 (en) * 2019-01-25 2020-07-30 Bayerische Motoren Werke Aktiengesellschaft Vehicle Equipped with Accelerated Actor-Critic Reinforcement Learning and Method for Accelerating Actor-Critic Reinforcement Learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200241542A1 (en) * 2019-01-25 2020-07-30 Bayerische Motoren Werke Aktiengesellschaft Vehicle Equipped with Accelerated Actor-Critic Reinforcement Learning and Method for Accelerating Actor-Critic Reinforcement Learning
CN110427827A (en) * 2019-07-08 2019-11-08 辽宁工程技术大学 It is a kind of it is multiple dimensioned perception and Global motion planning under autonomous driving network
CN110795821A (en) * 2019-09-25 2020-02-14 的卢技术有限公司 Deep reinforcement learning training method and system based on scene differentiation
CN111010294A (en) * 2019-11-28 2020-04-14 国网甘肃省电力公司电力科学研究院 Electric power communication network routing method based on deep reinforcement learning

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112904864A (en) * 2021-01-28 2021-06-04 的卢技术有限公司 Automatic driving method and system based on deep reinforcement learning
CN113064424A (en) * 2021-03-17 2021-07-02 西安工业大学 Unmanned vehicle path planning method for improving DDPG algorithm
CN113353102A (en) * 2021-07-08 2021-09-07 重庆大学 Unprotected left-turn driving control method based on deep reinforcement learning
CN113353102B (en) * 2021-07-08 2022-11-25 重庆大学 Unprotected left-turn driving control method based on deep reinforcement learning
CN113715842A (en) * 2021-08-24 2021-11-30 华中科技大学 High-speed moving vehicle control method based on simulation learning and reinforcement learning
CN113741533A (en) * 2021-09-16 2021-12-03 中国电子科技集团公司第五十四研究所 Unmanned aerial vehicle intelligent decision-making system based on simulation learning and reinforcement learning
CN114609925A (en) * 2022-01-14 2022-06-10 中国科学院自动化研究所 Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish
CN114609925B (en) * 2022-01-14 2022-12-06 中国科学院自动化研究所 Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish
CN114104005B (en) * 2022-01-26 2022-04-19 苏州浪潮智能科技有限公司 Decision-making method, device and equipment of automatic driving equipment and readable storage medium
CN114444718A (en) * 2022-01-26 2022-05-06 北京百度网讯科技有限公司 Training method of machine learning model, signal control method and device
CN114104005A (en) * 2022-01-26 2022-03-01 苏州浪潮智能科技有限公司 Decision-making method, device and equipment of automatic driving equipment and readable storage medium
CN114925850A (en) * 2022-05-11 2022-08-19 华东师范大学 Deep reinforcement learning confrontation defense method for disturbance reward
CN114925850B (en) * 2022-05-11 2024-02-20 华东师范大学 Deep reinforcement learning countermeasure defense method for disturbance rewards
CN114708568A (en) * 2022-06-07 2022-07-05 东北大学 Pure vision automatic driving control system, method and medium based on improved RTFNet
CN114708568B (en) * 2022-06-07 2022-10-04 东北大学 Pure vision automatic driving control system, method and medium based on improved RTFNet

Also Published As

Publication number Publication date
CN112232490B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN112232490B (en) Vision-based deep imitation reinforcement learning driving strategy training method
CN112965499B (en) Unmanned vehicle driving decision-making method based on attention model and deep reinforcement learning
CN111061277B (en) Unmanned vehicle global path planning method and device
CN110745136A (en) Driving self-adaptive control method
CN113044064B (en) Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning
CN112731925A (en) Conical barrel identification and path planning and control method for unmanned formula racing car
CN114358128A (en) Method for training end-to-end automatic driving strategy
CN114035575B (en) Unmanned vehicle motion planning method and system based on semantic segmentation
Hu et al. Learning a deep cascaded neural network for multiple motion commands prediction in autonomous driving
CN113715842B (en) High-speed moving vehicle control method based on imitation learning and reinforcement learning
CN111645673B (en) Automatic parking method based on deep reinforcement learning
CN110281949B (en) Unified hierarchical decision-making method for automatic driving
CN110196587A (en) Vehicular automatic driving control strategy model generating method, device, equipment and medium
CN113255054A (en) Reinforcement learning automatic driving method based on heterogeneous fusion characteristics
CN110930811B (en) System suitable for unmanned decision learning and training
CN116595871A (en) Vehicle track prediction modeling method and device based on dynamic space-time interaction diagram
CN108921044A (en) Driver's decision feature extracting method based on depth convolutional neural networks
Gao et al. Autonomous driving based on modified sac algorithm through imitation learning pretraining
CN116382267B (en) Robot dynamic obstacle avoidance method based on multi-mode pulse neural network
Zhang et al. A convolutional neural network method for self-driving cars
CN112991744B (en) Automatic driving decision-making method and system suitable for long-distance urban road
CN114170488A (en) Automatic driving method based on condition simulation learning and reinforcement learning
Liu et al. End-to-end control of autonomous vehicles based on deep learning with visual attention
Liu et al. Personalized Automatic Driving System Based on Reinforcement Learning Technology
CN116048096B (en) Unmanned vehicle movement planning method based on hierarchical depth perception

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant