CN113741533A - Unmanned aerial vehicle intelligent decision-making system based on imitation learning and reinforcement learning - Google Patents

Unmanned aerial vehicle intelligent decision-making system based on imitation learning and reinforcement learning

Info

Publication number: CN113741533A (application CN202111089240.XA)
Authority: CN (China)
Prior art keywords: learning, decision, module, aerial vehicle, unmanned aerial
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202111089240.XA
Other languages: Chinese (zh)
Inventors: 柴兴华, 耿虎军, 柯良军, 刘子锋, 陈彦桥, 高峰, 张小龙, 田苗, 关俊志, 王小强, 王雅涵, 轩书哲, 张格玮
Current Assignee: CETC 54 Research Institute (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: CETC 54 Research Institute
Priority date: 2021-09-16 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2021-09-16
Publication date: 2021-12-03
Application filed by CETC 54 Research Institute


Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10: Simultaneous control of position or course in three dimensions
    • G05D1/101: Simultaneous control of position or course in three dimensions specially adapted for aircraft

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Traffic Control Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses an unmanned aerial vehicle (UAV) intelligent decision-making system based on imitation learning and reinforcement learning, comprising an environment perception module, an expert behavior demonstration module, a decision learning module and a control execution module. The system makes accurate decisions in complex real-time scenes so that the UAV can fly from a starting point to a set end point: it decides autonomously throughout the flight, selects UAV control strategies according to environmental information, real-time events and the like, avoids obstacles, and reaches the target point safely and efficiently. Compared with a traditional decision system built on hand-designed expert rules, the system first clones expert behavior through imitation learning to obtain initial values for the decision control network, and then applies deep reinforcement learning, setting a reward function over the events and states encountered while executing flight missions, to learn a more accurate decision control Q network with better generalization. This plays an important role in improving the autonomous control performance of the UAV.

Description

Unmanned aerial vehicle intelligent decision-making system based on imitation learning and reinforcement learning
Technical Field
The invention belongs to the intersection of UAV intelligent control decision-making and computer technology, and particularly relates to a UAV intelligent decision-making system based on imitation learning and reinforcement learning.
Background
Unmanned aerial vehicles are increasingly widely applied in various fields owing to their relatively low manufacturing cost, good maneuverability and high safety factor. However, traditional drone control technology still requires substantial human participation, and autonomous decision generation remains the bottleneck limiting drone autonomy. The main challenges facing autonomous UAV decision-making are: 1) decision models are computationally expensive to solve, while decision results are required in real time; 2) an accurate UAV mathematical model is difficult to establish in practical applications; 3) in actual flight the environment is complex and contains a large amount of missing and erroneous information, which increases the difficulty of decision-making.
Current intelligent decision-making methods include expert systems, differential games, dynamic programming and the like. An expert system matches situation information against expert rules to output decision information, but it depends excessively on those rules, so the decision process lacks flexibility and adapts poorly to complex environments. The differential game method treats the decision problem as a numerical optimization, but it relies on an accurate UAV mathematical model, which is difficult to obtain in practice. Approximate-value-function dynamic programming can solve the decision problem, but a unified standard decision model is hard to establish, and numerical solution easily runs into the curse of dimensionality. Unlike these traditional methods, reinforcement learning lets the agent learn by trial and error, guiding its actions with rewards obtained through interaction with the environment so that the agent accumulates the greatest reward. A reinforcement learning system learns mainly from its own experience: reinforcement signals provided by the environment evaluate the quality of generated actions, and the best model is obtained by executing, within finite time, the actions that maximize reward. However, reinforcement learning is essentially a data-driven optimization algorithm that depends strongly on information provided by the external environment, and obtaining high-quality control decision data has become the major bottleneck limiting its application in the field of UAV measurement and control.
Disclosure of Invention
In order to solve the problems in autonomous UAV decision control that traditional decision methods adapt poorly and that reinforcement learning methods are limited by the difficulty of acquiring data, the invention provides a UAV intelligent decision-making system based on imitation learning and reinforcement learning. It combines imitation learning of human control knowledge with adaptive reinforcement learning in complex environments, realizes UAV decision model training step by step, and handles complex dynamic scenes with strong generalization capability and high robustness.
The technical solution adopted by the invention is as follows:
An unmanned aerial vehicle intelligent decision-making system based on imitation learning and reinforcement learning comprises an environment perception module, an expert behavior demonstration module, a decision learning module and a control execution module;
the environment perception module: extracts and fuses the environmental information acquired by various sensors during UAV flight to form a state space vector, which it outputs to the expert behavior demonstration module and the decision learning module; the environment information includes images, orientations and distances;
the expert behavior demonstration module: collects the UAV control instructions given by experts, based on domain knowledge and operating experience, under various environments and events; these instructions, together with the data provided by the environment perception module, form the expert demonstration data set, which is output to the decision learning module;
the decision learning module: takes the state space vector as the input of its network structure, pre-learns the network on the expert demonstration data set to obtain a pre-learning model, then trains on the basis of the pre-learning model, learning the decision control actions to take under different scenes and different events, to obtain the final decision learning model; it outputs the action vector and the probability of each action to the control execution module;
the control execution module: after obtaining the actions and their probabilities output by the decision learning module, selects the action instruction with the largest probability for the UAV to execute, and obtains the UAV's new environment state after the action is executed.
The specific processing procedure of the environment perception module is as follows: the input is the environmental data acquired by various sensors during UAV flight; a ResNet18 network extracts high-dimensional features from the image data, which are then fused with the UAV's orientation and distance information to form the output state space vector s_t = (l_t, d_t, x_{t,1}, x_{t,2}, ..., x_{t,m}), where s_t is the UAV's state space vector at time t, l_t is its own orientation, d_t is the distance from the current position to the target position, and (x_{t,1}, x_{t,2}, ..., x_{t,m}) is the high-dimensional information of the image in the current field of view.
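A minimal PyTorch sketch of this fusion step is given below for illustration (the patent itself contains no code). Truncating ResNet18 at its 512-dimensional pooled features and the helper name build_state are assumptions; the text does not fix the feature width m.

    import torch
    import torchvision.models as models

    # Backbone for the image branch: ResNet18 with its classification head
    # removed, so the forward pass returns the pooled image features.
    backbone = models.resnet18(weights=None)
    backbone.fc = torch.nn.Identity()

    def build_state(image, orientation, distance):
        """Fuse image features with orientation l_t and distance d_t into s_t."""
        with torch.no_grad():
            x = backbone(image.unsqueeze(0)).squeeze(0)  # (512,) image features
        l = torch.tensor([orientation], dtype=torch.float32)
        d = torch.tensor([distance], dtype=torch.float32)
        return torch.cat([l, d, x])  # s_t = (l_t, d_t, x_{t,1}, ..., x_{t,m})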
The specific processing procedure of the expert behavior demonstration module is as follows: the input is the state space vector output by the environment perception module; using expert knowledge, the UAV is operated in a virtual reality simulation platform under different scenes and at different times, and the output is the control instruction sequence; the input state space vectors and the output control instruction sequences together constitute the expert demonstration data set.
The decision learning module comprises an imitation learning module and a reinforcement learning module;
the simulation learning module and the reinforcement learning module have the same network structure and are realized by a network structure Q, the input of the network structure is a state space vector obtained by the environment sensing module, the input layer is fully connected with the hidden layer, the hidden layer is fully connected with the output layer, and finally the action space and the probability thereof are obtained by output, wherein the action space is an action vector a taken at t timet=(at,f,at,b,at,w,at,e,at,u,at,d);
The imitation learning module pre-learns from the expert demonstration data set, training on the environmental images, orientation and distance information as input and the experts' UAV control instructions as output, to obtain the pre-learning model of the decision learning module; the reinforcement learning module performs reinforcement training starting from the pre-learning model's initialization parameters, learning the decision control actions to take under different scenes and different events, to obtain the final decision learning model.
The specific processing procedure of the control execution module is as follows:
From the action vector a_t = (a_{t,f}, a_{t,b}, a_{t,w}, a_{t,e}, a_{t,u}, a_{t,d}) output by the decision learning module, the action instruction with the largest probability is selected for execution, yielding the UAV's new environment state after the action is executed; the whole process from environment perception to control execution is then iterated in a loop to realize autonomous UAV control decisions.
The beneficial effects of the invention are as follows:
The invention makes accurate decisions in complex real-time scenes so that the UAV flies from the starting point to the set end point: 1) the whole flight process is decided fully autonomously, with UAV control strategies selected according to environmental information, real-time events and the like, avoiding obstacles such as buildings and pedestrians; 2) no expert rules need to be designed; instead, behavior cloning of expert behavior through imitation learning yields a rough initial value of the decision control network, completing the initialization of the decision network; 3) by adopting deep reinforcement learning and setting a reward function over the events and states of the flight process, decision generation becomes more accurate and generalizes better.
Drawings
FIG. 1 is a diagram of a decision network architecture design of the system of the present invention.
FIG. 2 is a schematic diagram of the system components and information interaction of the present invention.
FIG. 3 is a flow chart of decision network training for the system of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
An unmanned aerial vehicle intelligent decision-making system based on imitation learning and reinforcement learning comprises an environment perception module, an expert behavior demonstration module, a decision learning module (comprising an imitation learning module and a reinforcement learning module) and a control execution module. It should be noted that the system generally needs a virtual reality simulation platform to provide typical scenes for simulation environment data acquisition and operation of the UAV virtual model. The construction and realization of each module of the system are as follows:
1. The environment perception module: extracts and fuses features from the environmental information collected by various sensors during UAV flight, where the environmental information includes images, orientations, distances and the like, forming state space vectors that provide input data for the subsequent expert behavior demonstration module and decision learning module.
Extracting the environment information obtained by the multiple sensors mainly means extracting image information with a ResNet18 network to obtain high-dimensional features; the network extracts the high-dimensional information in the environment image from the input layer through convolutional layers, pooling layers and a fully connected layer. The image, orientation and distance features are then fused through a fully connected layer into the environment state vector s_t = (l_t, d_t, x_{t,1}, x_{t,2}, ..., x_{t,m}), where s_t is the UAV state at time t, containing its own orientation l_t, the distance d_t from the current position to the target position, and the high-dimensional information (x_{t,1}, x_{t,2}, ..., x_{t,m}) of the image in the current field of view.
Referring to fig. 1, the input layer of the network adopts a ResNet18 network, which includes 256 neurons, and the input image is in 64 × 64 pixel, 3-channel format; there are two hidden layers, with 128 and 64 neurons respectively, and the output layer has 6 neurons. The input layer and the hidden layer are fully connected, as are the hidden layer and the output layer; all neurons use the ReLU activation function, and the learning rate α is set to 0.01.
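Under these stated settings, the decision network Q can be sketched as follows. The input width of 514 (orientation and distance plus 512 ResNet18 features, matching the fusion sketch above) is an assumption; since the text mentions 256 input neurons, the dimension should be adjusted to the actual fused vector length.

    import torch
    import torch.nn as nn

    # Sketch of the decision network of FIG. 1: fused state vector in, two
    # fully connected hidden layers (128 and 64 units, ReLU), 6 action outputs.
    class QNetwork(nn.Module):
        def __init__(self, state_dim=514, n_actions=6):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 128), nn.ReLU(),
                nn.Linear(128, 64), nn.ReLU(),
                nn.Linear(64, n_actions),
            )

        def forward(self, s):
            return self.net(s)  # one score (Q value) per action

    q_net = QNetwork()
    optimizer = torch.optim.SGD(q_net.parameters(), lr=0.01)  # alpha = 0.01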
2. The expert behavior demonstration module: collects the UAV control instructions given by experts, based on domain knowledge and operating experience, under various environments and events; these instructions, together with the data provided by the environment perception module, form the expert demonstration data set, providing instruction learning samples for the imitation learning part of the subsequent decision learning module.
The module input is the state space vector s_t = (l_t, d_t, x_{t,1}, x_{t,2}, ..., x_{t,m}) output by the environment perception module. A UAV operator (i.e., a flight control expert), using expert knowledge and a handheld controller (an Xbox controller in this scheme), flies the UAV through a large number of different scenes and times in a virtual reality simulation platform (AirSim in this scheme) and outputs the control instruction sequences. The input state space vectors and the output control instruction sequences form the expert demonstration data set, which consists of all the experts' <state, action> matching pairs for subsequent imitation learning.
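A hedged sketch of how such <state, action> pairs might be recorded is shown below; env.get_state(), env.apply_action() and read_gamepad() are hypothetical placeholders for the AirSim and Xbox-controller hookups, which the patent does not detail.

    demonstrations = []  # accumulates <state, action> matching pairs

    def record_episode(env, n_steps):
        for _ in range(n_steps):
            s_t = env.get_state()    # fused state from the perception module
            a_t = read_gamepad()     # expert's control command (index 0..5)
            demonstrations.append((s_t, a_t))
            env.apply_action(a_t)    # the expert's command flies the UAV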
3. The decision learning module: comprises an imitation learning module and a reinforcement learning module. The two modules share the same network structure, realized by a network Q: the input is the state space vector obtained by the environment perception module, the input layer is fully connected to the hidden layer, the hidden layer is fully connected to the output layer, and the output yields the action space. The imitation learning module pre-learns from the expert demonstration data set to obtain the pre-learning model of the decision learning module; reinforcement learning then trains on the basis of the pre-learning model, learning the decision control actions to take under different scenes and events to obtain the final decision learning model.
3.1 The imitation learning module mainly pre-learns from the expert demonstration data set, i.e., performs imitation learning through behavior cloning, to obtain the initial learning model of the decision learning module as the initialization parameters of the Q network. The imitation learning training process is as follows: domain experts provide decision data {τ_1, τ_2, ..., τ_m}, where each piece of decision data is a sequence of states and actions, i.e.
τ_i = (s_1, a_1, s_2, a_2, ..., s_n, a_n)
All the <state, action> pairs are then extracted to form a new data set:
D = {(s_1, a_1), (s_2, a_2), (s_3, a_3), ...}
Classification (or regression) is then performed with the state as the feature and the action as the label, yielding an optimal decision strategy model. A large amount of human expert flight operation data is collected, where the state s is the environmental scene during flight and the action a is the action taken in that scene. These data are fed into the Q network so that the network's output is as close as possible to the action actually taken by the human, completing the task. That is, the decision control strategy is learned from the state-action pairs provided by human experts; this process is behavior cloning. Such imitation learning narrows the search space that reinforcement learning must explore freely and provides a pre-trained model for deep reinforcement learning.
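The behavior-cloning pre-training can be sketched as a standard supervised loop, with the state as feature and the expert's action as label; the batch size, epoch count, and the encoding of the six control commands as integer class indices are illustrative assumptions.

    import torch
    import torch.nn.functional as F
    from torch.utils.data import DataLoader, TensorDataset

    def pretrain(q_net, states, actions, epochs=10):
        """Fit Q so its output is as close as possible to the expert's action."""
        loader = DataLoader(TensorDataset(states, actions),
                            batch_size=64, shuffle=True)
        opt = torch.optim.SGD(q_net.parameters(), lr=0.01)
        for _ in range(epochs):
            for s, a in loader:
                logits = q_net(s)                  # network scores per action
                loss = F.cross_entropy(logits, a)  # match the expert's choice
                opt.zero_grad()
                loss.backward()
                opt.step()
        return q_net  # serves as the Q network initialization for RL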
3.2 The reinforcement learning module mainly performs reinforcement training on the basis of the Q network initialization parameters obtained by pre-learning, thereby reducing the search space and accelerating convergence. The implementation is to learn further through deep reinforcement learning after the imitation learning module has initialized the Q network. Specifically, the DQN algorithm is adopted. DQN is a value-function-based algorithm; it updates the Q value Q(s_t, A) of taking action A in the current state s_t as follows: first, action A is executed and the state s_{t+1} reached after one step is observed; then s_{t+1} is fed into the Q network to compute the Q values of all actions in s_{t+1}, and the maximum of these Q values plus the reward R is taken as the update target; finally, the loss between Q(s_t, A) and max_A Q(s_{t+1}, A) + R is computed and used to update the Q network. That is, the DQN update follows:
Q(s_t, A) ← Q(s_t, A) + α [R + γ max_{A'} Q(s_{t+1}, A') − Q(s_t, A)]
where Q(s_t, A) denotes the expectation of the cumulative reward obtained by the agent, up to the final state, after selecting action A in state s_t. The reward function is set as the sum of a reward R_d related to the distance to the destination and a reward R_t related to flight time:
R = R_d + R_t
The destination-related reward R_d is as follows: reaching the destination yields a large reward; the smaller the distance d to the destination, the greater the reward; and encountering an obstacle yields a negative reward. Namely:
[Formula: piecewise definition of R_d, with a large positive reward when the destination is reached, a reward that grows as the distance d shrinks, and a negative reward when an obstacle is encountered.]
The time-related negative reward (penalty) R_t increases with flight time, so that the flight mission is completed in a shorter time. The termination conditions for reaching a final state are set as reaching the destination, flying to the boundary of the map, or executing the set number of exploration steps; meeting any one of these conditions is the final state. Through continuous iterative training, an accurate Q network is obtained and used as the core network of the decision learning module of the UAV intelligent decision-making system.
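The reward shaping and the DQN update described above can be sketched as follows; the reward constants and the discount factor GAMMA are illustrative assumptions, since the text states the qualitative design but not the numbers.

    import torch
    import torch.nn.functional as F

    GAMMA = 0.99  # discount factor, a conventional choice

    def reward(reached, collided, distance, flight_time):
        """R = R_d + R_t: destination-related reward plus time penalty."""
        if reached:
            r_d = 10.0                     # large reward at the destination
        elif collided:
            r_d = -10.0                    # negative reward for an obstacle
        else:
            r_d = 1.0 / (distance + 1e-6)  # grows as the distance d shrinks
        r_t = -0.01 * flight_time          # penalty grows with flight time
        return r_d + r_t

    def dqn_update(q_net, opt, s, a, r, s_next, done):
        """One TD step: target = R + GAMMA * max_A' Q(s_{t+1}, A')."""
        with torch.no_grad():
            target = torch.tensor(r, dtype=torch.float32)
            if not done:
                target = target + GAMMA * q_net(s_next).max()
        loss = F.mse_loss(q_net(s)[a], target)  # loss between Q(s_t,A) and target
        opt.zero_grad()
        loss.backward()
        opt.step()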
3.3 The Q network model training flow of the decision learning module is shown in FIG. 3. At the start of training, the weights of the decision learning network are initialized with the parameters obtained by imitation learning. The steps are:
Step 3-1: set the parameters, including the initial number N of training loops, the flight target point, and the map boundary.
Step 3-2: set the maximum number of exploration steps n within one training run.
Step 3-3: acquire the current environment information and input it to the environment perception module.
Step 3-4: extract features from the environment information to obtain the environment's state space vector.
Step 3-5: input the state space vector into the Q network to obtain the probability of each action in the current state.
Step 3-6: compute the error δ between the network's actual value and the update target.
Step 3-7: update the network weights with the error δ via the backpropagation algorithm.
Step 3-8: according to the output action probabilities, execute the action with the largest probability to obtain the next state.
Step 3-9: judge whether the current state is a final state, i.e., whether the maximum number n of exploration steps has been reached, the target point has been reached, or the map boundary has been exceeded; if not, return to step 3-3 to continue exploring; if so, go to step 3-10.
Step 3-10: increase the training count by 1.
Step 3-11: judge whether the training count has reached the set number N of training loops; if not, start a new training run; if so, end the training process.
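Steps 3-1 to 3-11 then assemble into a training loop of the following shape, reusing dqn_update from the sketch above; env and its reset/step methods are hypothetical stand-ins for the simulation platform.

    def train(q_net, opt, env, N=1000, n=200):
        for episode in range(N):            # 3-1: N training loops
            s = env.reset()                 # 3-3/3-4: sense and encode state
            for step in range(n):           # 3-2: at most n exploration steps
                probs = q_net(s).softmax(dim=0)  # 3-5: action probabilities
                a = int(probs.argmax())          # 3-8: most probable action
                s_next, r, done = env.step(a)    # execute, observe new state
                dqn_update(q_net, opt, s, a, r, s_next, done)  # 3-6/3-7
                s = s_next
                if done:                    # 3-9: destination/boundary/limit
                    break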
4. The control execution module: is responsible for concretely executing the decisions of the decision learning module; after obtaining the probability of each action in the action space output by the decision learning module, it selects the action instruction with the largest probability for the UAV to execute, obtaining the UAV's new environment state after the action is executed.
From the action vector a_t = (a_{t,f}, a_{t,b}, a_{t,w}, a_{t,e}, a_{t,u}, a_{t,d}) and the probability of each action output by the decision learning module, the action instruction with the largest probability is selected for execution, yielding the UAV's new environment state after the action is executed; the whole process from environment perception to control execution is then iterated in a loop to realize autonomous UAV control decisions.

Claims (5)

1. An unmanned aerial vehicle intelligent decision-making system based on imitation learning and reinforcement learning, characterized by comprising an environment perception module, an expert behavior demonstration module, a decision learning module and a control execution module;
the environment perception module: extracts and fuses the environmental information acquired by various sensors during UAV flight to form a state space vector, which it outputs to the expert behavior demonstration module and the decision learning module; the environment information includes images, orientations and distances;
the expert behavior demonstration module: collects the UAV control instructions given by experts, based on domain knowledge and operating experience, under various environments and events; these instructions, together with the data provided by the environment perception module, form the expert demonstration data set, which is output to the decision learning module;
the decision learning module: takes the state space vector as the input of its network structure, pre-learns the network on the expert demonstration data set to obtain a pre-learning model, then trains on the basis of the pre-learning model, learning the decision control actions to take under different scenes and different events, to obtain the final decision learning model; it outputs the action vector and the probability of each action to the control execution module;
the control execution module: after obtaining the actions and their probabilities output by the decision learning module, selects the action instruction with the largest probability for the UAV to execute, and obtains the UAV's new environment state after the action is executed.
2. The unmanned aerial vehicle intelligent decision-making system based on imitation learning and reinforcement learning of claim 1, wherein the specific processing procedure of the environment perception module is as follows: the input is the environmental data acquired by various sensors during UAV flight; a ResNet18 network extracts high-dimensional features from the image data, which are then fused with the UAV's orientation and distance information to form the output state space vector s_t = (l_t, d_t, x_{t,1}, x_{t,2}, ..., x_{t,m}), where s_t is the UAV's state space vector at time t, l_t is its own orientation, d_t is the distance from the current position to the target position, and (x_{t,1}, x_{t,2}, ..., x_{t,m}) is the high-dimensional information of the image in the current field of view.
3. The unmanned aerial vehicle intelligent decision-making system based on imitation learning and reinforcement learning of claim 1, wherein the specific processing procedure of the expert behavior demonstration module is as follows: the input is the state space vector output by the environment perception module; using expert knowledge, the UAV is operated in a virtual reality simulation platform under different scenes and at different times, and the output is the control instruction sequence; the input state space vectors and the output control instruction sequences together constitute the expert demonstration data set.
4. The unmanned aerial vehicle intelligent decision-making system based on imitation learning and reinforcement learning of claim 1, wherein the decision learning module comprises an imitation learning module and a reinforcement learning module;
the imitation learning module and the reinforcement learning module share the same network structure, realized by a network Q: its input is the state space vector obtained by the environment perception module, the input layer is fully connected to the hidden layer, the hidden layer is fully connected to the output layer, and the output is the action space and the probability of each action, where the action space is the action vector taken at time t, a_t = (a_{t,f}, a_{t,b}, a_{t,w}, a_{t,e}, a_{t,u}, a_{t,d});
the imitation learning module pre-learns from the expert demonstration data set, training on the environmental images, orientation and distance information as input and the experts' UAV control instructions as output, to obtain the pre-learning model of the decision learning module; the reinforcement learning module performs reinforcement training starting from the pre-learning model's initialization parameters, learning the decision control actions to take under different scenes and different events, to obtain the final decision learning model.
5. The unmanned aerial vehicle intelligent decision-making system based on imitation learning and reinforcement learning of claim 4, wherein the specific processing procedure of the control execution module is as follows:
from the action vector a_t = (a_{t,f}, a_{t,b}, a_{t,w}, a_{t,e}, a_{t,u}, a_{t,d}) output by the decision learning module, the action instruction with the largest probability is selected for execution, yielding the UAV's new environment state after the action is executed; the whole process from environment perception to control execution is then iterated in a loop to realize autonomous UAV control decisions.
CN202111089240.XA 2021-09-16 2021-09-16 Unmanned aerial vehicle intelligent decision-making system based on imitation learning and reinforcement learning Pending CN113741533A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111089240.XA CN113741533A (en) 2021-09-16 2021-09-16 Unmanned aerial vehicle intelligent decision-making system based on imitation learning and reinforcement learning


Publications (1)

Publication Number Publication Date
CN113741533A (en) 2021-12-03

Family

ID=78739471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111089240.XA Pending CN113741533A (en) Unmanned aerial vehicle intelligent decision-making system based on imitation learning and reinforcement learning

Country Status (1)

Country Link
CN (1) CN113741533A (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107656545A (en) * 2017-09-12 2018-02-02 武汉大学 A kind of automatic obstacle avoiding searched and rescued towards unmanned plane field and air navigation aid
CN110488859A (en) * 2019-07-15 2019-11-22 北京航空航天大学 A kind of Path Planning for UAV based on improvement Q-learning algorithm
JP2021034050A (en) * 2019-08-21 2021-03-01 哈爾浜工程大学 Auv action plan and operation control method based on reinforcement learning
CN111123963A (en) * 2019-12-19 2020-05-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN111580560A (en) * 2020-05-29 2020-08-25 中国科学技术大学 Unmanned helicopter autonomous stunt flight method based on deep simulation learning
CN111618862A (en) * 2020-06-12 2020-09-04 山东大学 Robot operation skill learning system and method under guidance of priori knowledge
CN112162564A (en) * 2020-09-25 2021-01-01 南京大学 Unmanned aerial vehicle flight control method based on simulation learning and reinforcement learning algorithm
CN112232490A (en) * 2020-10-26 2021-01-15 大连大学 Deep simulation reinforcement learning driving strategy training method based on vision
CN112884131A (en) * 2021-03-16 2021-06-01 浙江工业大学 Deep reinforcement learning strategy optimization defense method and device based on simulation learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张堃 et al.: "Autonomous guidance maneuver control decision algorithm for UAV routes based on deep reinforcement learning", Systems Engineering and Electronics (系统工程与电子技术), no. 07, 31 July 2020 (2020-07-31), pages 1567-1574 *
李湛 et al.: "Indoor monocular UAV obstacle avoidance based on cross-sensor asynchronous transfer learning", Journal of Astronautics (宇航学报), vol. 41, no. 6, 30 June 2020 (2020-06-30), pages 811-819 *
王港 et al.: "Research on integrated application and decision-making of aerospace information based on deep reinforcement learning", Radio Engineering (无线电工程), vol. 49, no. 7, 31 July 2019 (2019-07-31), pages 564-570 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114326438A (en) * 2021-12-30 2022-04-12 北京理工大学 Safety reinforcement learning four-rotor control system and method based on control barrier function
CN114326438B (en) * 2021-12-30 2023-12-19 北京理工大学 Safety reinforcement learning four-rotor control system and method based on control obstacle function
CN114626277A (en) * 2022-04-02 2022-06-14 浙江大学 Active flow control method based on reinforcement learning
CN114626277B (en) * 2022-04-02 2023-08-25 浙江大学 Active flow control method based on reinforcement learning
CN115906673A (en) * 2023-01-10 2023-04-04 中国人民解放军陆军工程大学 Integrated modeling method and system for combat entity behavior model
CN115906673B (en) * 2023-01-10 2023-11-03 中国人民解放军陆军工程大学 Combat entity behavior model integrated modeling method and system
CN116679615A (en) * 2023-08-03 2023-09-01 中科航迈数控软件(深圳)有限公司 Optimization method and device of numerical control machining process, terminal equipment and storage medium
CN116679615B (en) * 2023-08-03 2023-10-20 中科航迈数控软件(深圳)有限公司 Optimization method and device of numerical control machining process, terminal equipment and storage medium

Similar Documents

Publication Publication Date Title
Ruan et al. Mobile robot navigation based on deep reinforcement learning
CN113741533A (en) Unmanned aerial vehicle intelligent decision-making system based on imitation learning and reinforcement learning
CN112162564B (en) Unmanned aerial vehicle flight control method based on imitation learning and reinforcement learning algorithms
CN110806756B (en) Unmanned aerial vehicle autonomous guidance control method based on DDPG
CN113110592A (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN112034888B (en) Autonomous control cooperation strategy training method for fixed wing unmanned aerial vehicle
CN110745136A (en) Driving self-adaptive control method
CN110531786B (en) Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN
CN110928189A (en) Robust control method based on reinforcement learning and Lyapunov function
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN111240356A (en) Unmanned aerial vehicle cluster convergence method based on deep reinforcement learning
CN113156419B (en) Specific language navigation method based on radar and visual multi-mode fusion
Ciou et al. Composite reinforcement learning for social robot navigation
CN115016534A (en) Unmanned aerial vehicle autonomous obstacle avoidance navigation method based on memory reinforcement learning
CN114967721B (en) Unmanned aerial vehicle autonomous path planning and obstacle avoidance strategy method based on DQ-CapsNet
CN116242364A (en) Multi-unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning
CN116225055A (en) Unmanned aerial vehicle autonomous flight path planning algorithm based on state decomposition in complex environment
Wang et al. Autonomous target tracking of multi-UAV: A two-stage deep reinforcement learning approach with expert experience
CN116562332B (en) Robot social movement planning method in man-machine co-fusion environment
Wu et al. An adaptive conversion speed Q-learning algorithm for search and rescue UAV path planning in unknown environments
CN117406762A (en) Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning
Li et al. Autopilot controller of fixed-wing planes based on curriculum reinforcement learning scheduled by adaptive learning curve
CN116385909A (en) Unmanned aerial vehicle target tracking method based on deep reinforcement learning
CN114326826B (en) Multi-unmanned aerial vehicle formation transformation method and system
CN115373415A (en) Unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination