CN116307331A - Aircraft trajectory planning method - Google Patents

Aircraft trajectory planning method

Info

Publication number
CN116307331A
CN116307331A (application CN202310540315.4A)
Authority
CN
China
Prior art keywords
tree
level
node
task
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310540315.4A
Other languages
Chinese (zh)
Other versions
CN116307331B (en)
Inventor
张筱
吴发国
郭宁
姚望
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202310540315.4A priority Critical patent/CN116307331B/en
Publication of CN116307331A publication Critical patent/CN116307331A/en
Application granted granted Critical
Publication of CN116307331B publication Critical patent/CN116307331B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q 10/047 Optimisation of routes or paths, e.g. travelling salesman problem
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q 50/40 Business processes related to the transportation industry

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Evolutionary Computation (AREA)
  • Development Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Operations Research (AREA)
  • Medical Informatics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure relates to a method for planning an aircraft trajectory, in the technical field of aviation. The method comprises: acquiring an initial state and a position target of an aircraft as an original task; decomposing the original task in an AND/OR tree with a high-level policy network; in each OR node, determining a sub-goal whose distance difference to the two ends of the OR node is smallest, forming a first task of a lower-layer AND node from the OR node and the sub-goal, and decomposing each first task into second tasks of lower-layer OR nodes; learning each second task of the last layer of the AND/OR tree with a low-level policy network to form a partition tree, the partition tree comprising a plurality of initial trajectories of the aircraft obtained through learning; and determining a target trajectory of the aircraft from the partition tree. Because each sub-goal is chosen so that its distance difference to the two ends of the OR node is smallest, the second tasks can be processed faster, which improves the efficiency of aircraft trajectory planning.

Description

Aircraft trajectory planning method
Technical Field
The disclosure relates to the field of aviation technology, and in particular relates to a planning method for an aircraft track.
Background
As air-route traffic continues to grow, it is increasingly important to plan the flight path of an aircraft in advance under limited airspace resources. However, existing aircraft trajectory planning methods involve a complex planning process, which makes trajectory planning inefficient.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a method of planning an aircraft trajectory.
According to a first aspect of embodiments of the present disclosure, there is provided a method for planning an aircraft trajectory, the method comprising:
acquiring an initial state and a position target of an aircraft as an original task;
decomposing the original task in an AND/OR tree with a high-level policy network; in each OR node of the AND/OR tree, determining a sub-goal whose distance difference to the two ends of the OR node is smallest, forming a first task of a lower-layer AND node from the OR node and the sub-goal, and decomposing each first task into second tasks of lower-layer OR nodes;
learning each second task of the last layer of the AND/OR tree with a low-level policy network to form a partition tree, wherein the partition tree comprises a plurality of initial trajectories of the aircraft obtained through learning;
and determining a target trajectory of the aircraft from the partition tree.
In some embodiments of the disclosure, the sum of the distances between the sub-goal and the two ends of the OR node is also minimal.
In some embodiments of the present disclosure, a loss function of the high-level policy network is used to assist in training the high-level policy network. The loss function is computed from a main loss function and an auxiliary loss function; the parameters associated with the main loss function and with the overall loss function are updated by stochastic gradient ascent, while the parameters associated with the auxiliary loss function are updated by stochastic gradient descent.
In some embodiments of the disclosure, the main loss function L_main is:

L_main = J(π_h) = E_(s,g) [ Q_h(s, g, π_h(s, g; θ_h)) ]

whose gradient ∇_θh J is estimated from N samples taken from the experience pool; the auxiliary loss function L_aux is:

L_aux = (1/N) Σ_{i=1}^{N} [ λ_1 (d_i^1 + d_i^2) + λ_2 |d_i^1 - d_i^2| ]

and the loss function L is:

L = L_main - β · L_aux

wherein J is the function that measures the high-level policy network π_h and ∇_θh J is its gradient, the expectation is taken over the joint distribution of the current initial state and the current position target, θ_h are the parameters of the high-level policy network π_h, N is the minimum number of samples taken from the experience pool when updating the parameters θ_h of the high-level policy network, λ_1 is the first weight parameter, λ_2 is the second weight parameter, d_i^1 is the distance of the i-th sub-goal from the current initial state, d_i^2 is the distance of the i-th sub-goal from the current position target, β is the weight of the auxiliary loss function, and i is a positive integer.
In some embodiments of the disclosure, decomposing the original task in an AND/OR tree with a high-level policy network includes:
taking the original task as the root node of the AND/OR tree, wherein the root node is the OR node of the first layer of the AND/OR tree;
cyclically executing the following first process until the number of layers of the AND/OR tree reaches a first preset number of layers:
inputting the OR node into the high-level policy network to obtain the high-level behavior output by the high-level policy network, wherein the distance difference between the high-level behavior and the two ends of the OR node is smallest;
determining the high-level behavior as the sub-goal;
forming the first task of the lower-layer AND node from the OR node and the sub-goal;
and decomposing each first task to form second tasks of lower-layer OR nodes, wherein each second task comprises the sub-goal.
In some embodiments of the present disclosure, when the number of layers of the AND/OR tree reaches the first preset number of layers, decomposing the original task in the AND/OR tree with the high-level policy network further includes:
cyclically executing the following second process until the number of layers of the AND/OR tree reaches a second preset number of layers:
inputting each second task of the deepest layer of the AND/OR tree into the low-level policy network to obtain the low-level behavior output by the low-level policy network;
determining, from the low-level behavior, the number of steps the low-level policy network learns to complete the second task;
if the number of steps is within a preset step-number range, storing the OR node where the second task is located in the AND/OR tree;
if the number of steps is outside the preset step-number range, executing the first process again;
wherein the second preset number of layers is the last layer of the AND/OR tree.
In some embodiments of the present disclosure, the high-level behavior output by the high-level policy network is:

a_h = π_h(s, g; θ_h) + ε

wherein a_h is the high-level behavior, π_h is the high-level policy network, s is the current initial state of the OR node, g is the current position target of the OR node, θ_h are the parameters of the high-level policy network π_h, and ε is random noise with a mean of 0.
In some embodiments of the disclosure, learning each second task of the last layer of the AND/OR tree with the low-level policy network to form the partition tree includes:
inputting each second task of the last layer of the AND/OR tree into the low-level policy network to obtain the low-level behavior output by the low-level policy network;
determining, from the low-level behavior, the number of steps the low-level policy network learns to complete the second task;
and if the number of steps is within a preset step-number range, storing the OR node where the second task is located in the AND/OR tree to form the partition tree.
In some embodiments of the present disclosure, the low-level behavior output by the low-level policy network is:

a_l = π_l(s, g; θ_l) + ε

wherein a_l is the low-level behavior, π_l is the low-level policy network, s is the current initial state of the OR node, g is the current position target of the OR node, θ_l are the parameters of the low-level policy network π_l, and ε is random noise with a mean of 0.
In some embodiments of the disclosure, determining the target trajectory of the aircraft according to the partition tree includes:
traversing the partition tree to form a planning tree;
forming a solution tree from the planning tree;
and determining the target trajectory of the aircraft from the solution tree.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
the original task is decomposed in an AND or tree with a high-level policy network to form an AND or tree alternating with nodes and or nodes layer by layer. In each or node, determining a sub-target through a high-level strategy network, wherein the distance difference between the sub-target and two ends of the or node is the smallest. And according to the nodes or the sub-targets, forming a first task of the lower layer and the nodes, and decomposing the first task to form a second task of the lower layer or the nodes. When the last layer of the AND OR tree is reached, each second task of the last layer is learned by a low-layer strategy network to form a partition tree. And determining a target track of the aircraft according to the division tree, and planning the aircraft track. The complexity of each second task in each layer or node is approximately the same as the distance difference between the sub-target and the two ends of the or node is the smallest when the sub-target is determined. When each second task is processed in parallel, the speed of processing the second task can be increased, so that the efficiency of aircraft track planning is improved. Meanwhile, the original tasks are decomposed through the high-level strategy network and the low-level strategy network to form a partition tree, so that the complexity of aircraft track planning is reduced, and the reliability of aircraft track planning is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the disclosure, and do not constitute a limitation on the disclosure. In the drawings:
FIG. 1 is a schematic diagram of a hierarchical structure shown in an exemplary embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of planning an aircraft trajectory shown in a first exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an AND/OR tree structure shown in a first exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an AND/OR tree structure shown in a second exemplary embodiment of the present disclosure;
FIG. 5 is a flow chart of a method of planning an aircraft trajectory shown in a second exemplary embodiment of the present disclosure;
FIG. 6 is a flow chart of a method of planning an aircraft trajectory shown in a third exemplary embodiment of the present disclosure;
FIG. 7 is a flow chart of a method of planning an aircraft trajectory shown in a fourth exemplary embodiment of the present disclosure;
FIG. 8 is a flow chart of a method of planning an aircraft trajectory shown in a fifth exemplary embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a solution tree shown in an exemplary embodiment of the present disclosure;
FIG. 10 is a flowchart of a method of planning an aircraft trajectory, as shown in a sixth exemplary embodiment of the present disclosure;
FIG. 11 is a flow chart of a method of planning an aircraft trajectory shown in a seventh exemplary embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a simulation environment shown in an exemplary embodiment of the present disclosure;
fig. 13 is a schematic plan view of an aircraft trajectory shown in an exemplary embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. Based on the embodiments in this disclosure, all other embodiments that a person of ordinary skill in the art would obtain without making any inventive effort are within the scope of protection of this disclosure. It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be arbitrarily combined with each other. In the embodiment of the disclosure, the term "and/or" is an association relationship describing an object, which means that three relationships may exist. For example, a and/or B, represent: a or B, or, A and B. In the embodiment of the disclosure, learning refers to a process of obtaining behaviors by a policy network, solving refers to a process of obtaining an estimated value by an estimated value network, and training refers to a process of updating parameters of each network.
At present, the task of air traffic is to effectively maintain and promote air traffic safety and ensure air traffic smoothness. With the increasing flow of air traffic, under the condition of limited airspace resources, the management of air traffic is more complex and the probability of flight conflict gradually rises. The smooth traffic of the air path is mainly embodied by planning the track of the aircraft, and it is particularly important to plan one or more optimal flight routes from the initial position to the target position in advance under the condition of comprehensively considering safety, endurance, range and the like. In an actual application scene, security threats such as geography, weather, military and the like often exist between an initial position and a target position, and even a forbidden airspace possibly exists, so that planning of an aircraft track is a sequence decision problem in a complex environment.
Most standard planning methods for sequence decision problems (e.g., Monte Carlo planning, Monte Carlo tree search, and dynamic programming) contain a fixed assumption, namely sequential planning. These methods start from the initial state or the target position and then plan behaviors forward or backward in time. However, such methods face two challenges: first, a planning model derived from data is difficult to trust over a long horizon; second, assigning credit to each individual action is difficult. These two challenges mean that, when facing long-sequence decision problems in a sparse-reward environment, the agent suffers from low reliability and sparse rewards. In addition, when the trajectory of the aircraft is planned with such standard planning methods, the planning process is complex, so the efficiency of aircraft trajectory planning is low.
In recent years, the combination of deep neural networks and reinforcement learning is called deep reinforcement learning. Deep reinforcement learning combines the good perceptibility of deep learning to high-dimensional data and the strategic learning ability of reinforcement learning to data, and is one of the important ways to realize general artificial intelligence. In reinforcement learning, rewards play the role of supervisory signals, and agents optimize the strategy network according to rewards. To solve the problems of low reliability and sparse rewards, the layered idea is added to reinforcement learning. The essence of hierarchical reinforcement learning is to break up the original task into subtasks at different levels of abstraction. Because the state space of the subtasks is limited, the subtasks have higher solving speed compared with the original tasks, and finally the solving efficiency of the whole problem is improved.
One commonly used type of hierarchical reinforcement learning algorithm is sub-goal-based hierarchical reinforcement learning: the high-level policy proposes sub-goals, and the low-level policy learns to reach the given sub-goals. In such algorithms, the difficulty of a sub-goal is usually not taken into account. Generating sub-goals with the high-level policy is not complex, but the actions generated by the low-level policy affect how efficiently the original task is completed. Therefore, when such a hierarchical reinforcement learning algorithm is used to plan the trajectory of an aircraft, planning efficiency remains low.
Based on the above, the present disclosure provides a method for planning an aircraft trajectory, so as to plan an optimal trajectory for the aircraft in a complex air-route traffic environment. The original task is decomposed in an AND/OR tree through a high-level policy network, splitting the long-sequence decision process into short-sequence decision processes. In each OR node of the AND/OR tree, a sub-goal is determined through the high-level policy network, and the sub-goal together with the OR node forms a first task of a lower-layer AND node. Decomposing the first task yields second tasks of lower-layer OR nodes. The complexity of the second tasks in the AND/OR tree gradually decreases as the number of decompositions increases. Learning the low-complexity second tasks with the low-level policy network reduces the complexity of low-level learning and improves the reliability of aircraft trajectory planning. Because the distance difference between each sub-goal and the two ends of its OR node is minimal, the second tasks in the OR nodes of each layer have approximately the same complexity; when the second tasks are processed in parallel, they can be processed faster, which improves the efficiency of aircraft trajectory planning.
To facilitate understanding, the hierarchical reinforcement learning model of the present disclosure is first constructed. The hierarchical reinforcement learning model includes an environment model, a hierarchical architecture, a hierarchical model, and a partition tree. The environment model of the present disclosure adds, on top of a Markov decision process (MDP), a set of goals (including position targets and sub-goals) that the agent is expected to learn, each goal being a state or a set of states. A Markov decision process together with a goal set is defined as a goal-conditioned Markov decision process (G-MDP). The goal-conditioned Markov decision process is written as the tuple

(S, G, A, r, γ)

wherein S describes the states of the agent, G describes the goals of the agent, A describes the behaviors the agent can take given a proposed goal, r is the reward value the agent's behavior obtains in the environment, and γ ∈ [0, 1] is the discount rate used to compute the cumulative reward. In the goal-conditioned Markov decision process, when the agent proposes a goal g according to the environment and takes the current action a_t, it obtains the reward value r_t. The agent then interacts with the environment according to the current action a_t and reaches the next state s_{t+1}. The basic property of a Markov decision process is the Markov property: for a random process, given the current state and all past states, the conditional probability distribution of the next state depends only on the current state and not on the past states, i.e.:

P(s_{t+1} | s_t, h_t) = P(s_{t+1} | s_t)

wherein P is the conditional probability, h_t is the history of past states, s_t is the current state, and s_{t+1} is the next state.

By the Markov property, the next state s_{t+1} depends only on the current state s_t, so the agent does not need to consider the history of states each time it makes a decision. At the beginning of each episode of the goal-conditioned Markov decision process, a position target is selected from the goal set. Solving the goal-conditioned Markov decision process means finding a goal-conditioned policy π that maximizes the state value function V^π and the behavior value function Q^π:

V^π(s, g) = E_π [ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, g ]

Q^π(s, a, g) = E_π [ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a, g ]

wherein V^π is the state value function, Q^π is the behavior value function, s is the state, a is the action, g is the goal, π is the policy, E_π is the expectation under the policy, γ^k is the discount rate applied to a reward received k steps later, r_{t+k} is the reward value at time t+k, s_t is the current state, a_t is the current action, g is the current goal, and t and k are positive integers.
The hierarchical architecture of the present disclosure is a hierarchical structure of nested policies, which enables the agent to learn tasks that require long sequences of primitive behaviors while each policy only needs to produce short behavior sequences. As shown in fig. 1, taking a two-layer structure as an example, the high-level policy outputs a sub-goal according to the high-level state and goal, and the sub-goal is passed down as the goal of the lower layer. The low-level policy attempts at most S steps to achieve this goal, where S, the maximum number of steps allowed per goal (the preset number of steps), is a user-defined hyperparameter. The low-level policy outputs primitive behaviors according to the low-level state and goal. The behavior space of the high-level policy is the same as the goal space (i.e., the state space) of the low-level policy, so the task is divided into shorter subtasks through the state space. Setting the behavior space of the high-level policy to the state space allows the agent to simulate a transition function that is assumed optimal for the low-level policy network, so that the agent can learn the multi-level policy networks in parallel.
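The patent itself contains no code; the following minimal Python sketch illustrates this nested two-level structure under stated assumptions: `env`, `high_policy` and `low_policy` are hypothetical callables, states and goals are numeric vectors, and `max_subgoal_steps` plays the role of the preset step count S.

```python
import numpy as np

def hierarchical_rollout(env, high_policy, low_policy, state, goal,
                         max_subgoal_steps=10, max_high_steps=20):
    """Two-level rollout: the high-level policy proposes sub-goals in the
    state space; the low-level policy gets at most `max_subgoal_steps`
    primitive actions to reach each sub-goal (the S in the text)."""
    trajectory = [state]
    for _ in range(max_high_steps):
        # High-level behavior: a sub-goal living in the same space as states.
        subgoal = high_policy(state, goal)
        for _ in range(max_subgoal_steps):
            action = low_policy(state, subgoal)            # primitive behavior
            state, reached = env.step(state, action, subgoal)  # assumed API
            trajectory.append(state)
            if reached:                                    # sub-goal achieved
                break
        if np.linalg.norm(np.asarray(state) - np.asarray(goal)) < env.tolerance:
            return trajectory, True                        # position target reached
    return trajectory, False
```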
The framework of the hierarchical model of the present disclosure is divided into a high-level goal-conditioned Markov decision process (the high level) and a low-level goal-conditioned Markov decision process (the low level), which can be regarded as independent; each is an episodic Markov decision process of finite length. At the beginning of an episode, the initial state s_0 of the agent and the position target g are given; g is the goal to be achieved by the high-level goal-conditioned Markov decision process. The high-level policy π_h then proposes the high-level behavior a_t^h and finally obtains the reward value r_t^h. According to the framework of the hierarchical model, the sub-goal of the low-level goal-conditioned Markov decision process is in fact the high-level behavior proposed by the high-level policy π_h, namely:

g^l = a_t^h

Then the low-level policy π_l proposes the low-level behavior a_t^l and obtains the reward value r_t^l. The return values of the agent at time t are:

G_t^h = Σ_{i=t}^{T} γ^{i-t} r_i^h

G_t^l = Σ_{i=t}^{T} γ^{i-t} r_i^l

wherein G_t^h is the high-level return value at time t, G_t^l is the low-level return value at time t, r_i^h is the high-level reward value at time i, r_i^l is the low-level reward value at time i, and i, t, and T are positive integers.
The state value functions of the high-level policy and the low-level policy at time t are:

V^{π_h}(s_t, g) = E_{π_h} [ G_t^h | s_t, g ]

V^{π_l}(s_t, g^l) = E_{π_l} [ G_t^l | s_t, g^l ]

wherein V^{π_h} is the state value function of the high-level policy and V^{π_l} is the state value function of the low-level policy.

The behavior value functions of the high-level policy and the low-level policy at time t are:

Q^{π_h}(s_t, a_t^h, g) = E_{π_h} [ G_t^h | s_t, a_t^h, g ]

Q^{π_l}(s_t, a_t^l, g^l) = E_{π_l} [ G_t^l | s_t, a_t^l, g^l ]

wherein Q^{π_h} is the behavior value function of the high-level policy and Q^{π_l} is the behavior value function of the low-level policy.
Whenever the agent completes the given goal, the reward value is 1 and the episode ends; otherwise the reward value is 0.
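A minimal sketch of this sparse, goal-conditioned reward, assuming states and goals are numeric vectors and the reaching tolerance is an illustrative value not given in the patent:

```python
import numpy as np

def sparse_goal_reward(state, goal, tolerance=1e-2):
    """Goal-conditioned sparse reward used at both levels: 1 (and episode
    termination) when the agent is within `tolerance` of the given goal,
    0 otherwise.  The tolerance value is only an example."""
    reached = np.linalg.norm(np.asarray(state) - np.asarray(goal)) <= tolerance
    return (1.0 if reached else 0.0), reached
```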
Illustratively, an agent in this disclosure refers to an autonomously movable software or hardware entity, which may be located within or independent of an aircraft.
Referring to fig. 2, an exemplary embodiment of the present disclosure provides a method for planning an aircraft trajectory, including:
s100, acquiring an initial state and a position target of the aircraft as an original task.
S200, decomposing the original task in an AND/OR tree with a high-level policy network; in each OR node, determining a sub-goal whose distance difference to the two ends of the OR node is smallest, forming a first task of a lower-layer AND node from the OR node and the sub-goal, and decomposing each first task into second tasks of lower-layer OR nodes.
S300, learning each second task of the last layer of the AND/OR tree with a low-level policy network to form a partition tree, wherein the partition tree comprises a plurality of initial trajectories of the aircraft obtained through learning.
S400, determining the target trajectory of the aircraft from the partition tree.
In this embodiment, the original task is decomposed in the AND/OR tree with a high-level policy network to form an AND/OR tree in which AND nodes and OR nodes alternate layer by layer. In each OR node, a sub-goal is determined through the high-level policy network such that the distance difference between the sub-goal and the two ends of the OR node is smallest. From the OR node and the sub-goal, a first task of a lower-layer AND node is formed, and the first task is decomposed into second tasks of lower-layer OR nodes. When the last layer of the AND/OR tree is reached, each second task of the last layer is learned by the low-level policy network to form a partition tree. The target trajectory of the aircraft is determined from the partition tree, completing the planning of the aircraft trajectory. Because each sub-goal minimizes the distance difference to the two ends of its OR node, the second tasks in the OR nodes of each layer have approximately the same complexity; when the second tasks are processed in parallel, they can be processed faster, which improves the efficiency of aircraft trajectory planning. At the same time, decomposing the original task with the high-level and low-level policy networks to form the partition tree reduces the complexity of aircraft trajectory planning and improves its reliability.
For example, the initial state and the position target in step S100 may be the initial position and the target position of the aircraft. The initial state is denoted s_0 and the position target is denoted g, so the original task is (s_0, g).

Illustratively, as shown in fig. 3 and fig. 4, the AND/OR tree of the present disclosure includes AND nodes and OR nodes. The root node of the AND/OR tree is an OR node, and the root node contains the original task (s_0, g); one end of the root node is s_0 and the other end is g. The root node determines sub-goals through the high-level policy network; each sub-goal together with the original task forms a first task of the second layer of the AND/OR tree, and each first task marks one AND node. The high-level policy network may determine several sub-goals, i.e., there may be several AND nodes in each layer. Each AND node of the second layer is decomposed into two OR nodes located in the third layer of the AND/OR tree. Each OR node contains the sub-goal of its AND node and one end of the original task. The decomposition of OR nodes and AND nodes is repeated in this way until the first preset number of layers is reached. Then, between the first preset number of layers and the second preset number of layers, the low-level policy network is added to learn the deepest OR nodes of the AND/OR tree. When the learning condition is not met, decomposition is performed again by the high-level policy network until the second preset number of layers of the AND/OR tree (i.e., the last layer of the AND/OR tree) is reached. The nodes between the first preset number of layers and the second preset number of layers are OR nodes; some OR nodes contain only sub-goals, some contain the initial state and a sub-goal, and some contain the position target and a sub-goal. The deepest layer of the AND/OR tree refers to the last layer of the AND/OR tree under the current structure.
Illustratively, determining in each OR node the sub-goal whose distance difference to the two ends of the OR node is smallest in step S200 means selecting, among all sub-goals available under the current high-level policy network, those whose distance difference to the two ends of the OR node is smallest. Under the current high-level policy network several sub-goals can be determined, each located between the two ends of the OR node. Selecting the sub-goals with the smallest distance difference to the two ends of the OR node increases the speed at which the second tasks can be processed and thus improves the efficiency of aircraft trajectory planning. Through subsequent training of the high-level policy network, the proposed sub-goals move ever closer to the midpoint of the two ends of the OR node, i.e., the distance difference between the sub-goal and the two ends of the OR node approaches, and may even equal, 0.
Illustratively, in the AND/OR tree and the partition tree, the different layers all use the same high-level policy network and the same low-level policy network.
Illustratively, the initial trajectories in step S300 are the trajectories of the aircraft that satisfy the air-route traffic requirements, including trajectories of different lengths between the initial state and the position target. The target trajectory in step S400 is one or more trajectories that, while satisfying the air-route traffic requirements, additionally satisfy conditions such as minimum distance, minimum number of turns, and/or minimum number of boundary-line crossings between the initial state and the position target. The target trajectory is one or more of the initial trajectories.
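One way to rank initial trajectories by the criteria just listed is sketched below; the scoring weights, the turn-angle threshold, and the externally supplied boundary-crossing counts are all illustrative assumptions, not values from the patent.

```python
import numpy as np

def path_length(track):
    pts = np.asarray(track, dtype=float)
    return float(np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1)))

def turn_count(track, angle_threshold_deg=15.0):
    """Count heading changes larger than the threshold along the trajectory."""
    v = np.diff(np.asarray(track, dtype=float), axis=0)
    turns = 0
    for a, b in zip(v[:-1], v[1:]):
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        if np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))) > angle_threshold_deg:
            turns += 1
    return turns

def select_target_trajectory(initial_tracks, boundary_crossings, w=(1.0, 0.5, 2.0)):
    """Rank the learned initial trajectories by weighted distance, number of
    turns and number of boundary-line crossings, and return the best one."""
    scores = [w[0] * path_length(t) + w[1] * turn_count(t) + w[2] * c
              for t, c in zip(initial_tracks, boundary_crossings)]
    return initial_tracks[int(np.argmin(scores))]
```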
In one embodiment, as shown in fig. 5, the decomposition of the original task in the AND/OR tree with the high-level policy network in step S200 is carried out as follows:
S210, taking the original task as the root node of the AND/OR tree, wherein the root node is the OR node of the first layer of the AND/OR tree.
The following first process is executed cyclically until the number of layers of the AND/OR tree reaches a first preset number of layers:
S220, inputting the OR node into the high-level policy network to obtain the high-level behavior output by the high-level policy network, wherein the distance difference between the high-level behavior and the two ends of the OR node is smallest.
S230, determining the high-level behavior as a sub-goal.
S240, forming a first task of the lower-layer AND node from the OR node and the sub-goal.
S250, decomposing the first tasks to form second tasks of lower-layer OR nodes, wherein each second task comprises a sub-goal.
In this embodiment, in the process of constructing the AND/OR tree, the original task is taken as the root node of the AND/OR tree. The OR node is input into the high-level policy network to obtain a high-level behavior, which determines the sub-goal. The sub-goal is added between the two ends of the OR node and, together with the OR node, forms a first task of the lower-layer AND node. Each first task is then decomposed into second tasks of lower-layer OR nodes, wherein each second task comprises a sub-goal. By continually decomposing first tasks and second tasks, an AND/OR tree in which AND nodes and OR nodes alternate layer by layer is formed. Decomposing the first and second tasks layer by layer reduces their complexity, which improves the reliability of aircraft trajectory planning. Because the second tasks in each layer of OR nodes have approximately the same complexity, they can be processed faster, which improves the efficiency of aircraft trajectory planning.
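A minimal sketch of this first process follows, assuming the node classes, the `high_policy(state, goal)` callable and the branching hyperparameter are placeholders; for simplicity the sketch counts only OR-node layers, whereas the patent counts alternating AND and OR layers.

```python
from dataclasses import dataclass, field

@dataclass
class OrNode:                 # task node, marked by a (state, goal) second task
    state: tuple
    goal: tuple
    children: list = field(default_factory=list)   # AND children

@dataclass
class AndNode:                # sequence node, marked by its sub-goal
    subgoal: tuple
    children: list = field(default_factory=list)   # exactly two OR children

def expand_and_or_tree(root, high_policy, target_or_layers, branching=3):
    """First process (sketch): starting from the root OR node, repeatedly add
    an AND layer and an OR layer until `target_or_layers` layers of OR nodes
    exist.  `high_policy(state, goal)` returns a sub-goal lying between the
    two ends of the OR node; `branching` plays the role of the M hyperparameter."""
    frontier, or_layers = [root], 1
    while or_layers < target_or_layers:
        next_frontier = []
        for or_node in frontier:
            for _ in range(branching):                     # up to M AND children
                sub = high_policy(or_node.state, or_node.goal)
                and_node = AndNode(subgoal=sub)
                left = OrNode(or_node.state, sub)           # second task (s, g_sub)
                right = OrNode(sub, or_node.goal)           # second task (g_sub, g)
                and_node.children = [left, right]
                or_node.children.append(and_node)
                next_frontier.extend([left, right])
        frontier, or_layers = next_frontier, or_layers + 1
    return frontier                                         # deepest OR nodes
```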
Illustratively, the high-level policy network trains a goal-conditioned policy π_h (i.e., the high-level policy). Since π_h is not the optimal policy, it can be improved through planning. In a large-scale sparse-reward environment, the probability of reaching the position target g directly from the initial state s_0 with primitive behaviors is extremely low, so this embodiment plans through a series of intermediate sub-goals. The sub-goals guide the agent from the initial state s_0 to the position target g, and the low-level agent is trained according to the sub-goals.
Illustratively, the principle of decomposing the first task to form the second tasks of the lower-layer OR nodes in step S250 is as follows. Let τ be a sequence of states, let σ be a sub-sequence of τ, and denote the optimal plan for a task by p*. If σ solves a task that is part of the task solved by τ, then σ is called a sub-plan of τ for that sub-task.

The present disclosure first exploits the composability of planning to make the planning problem tractable. The plan for a task (s_0, g) can be obtained by concatenating the plan for the task (s_0, g_sub) with the plan for the task (g_sub, g). The planning of each task is therefore decomposed into two parts, where (s_0, g_sub) and (g_sub, g) are subtasks of the task (s_0, g). Briefly, the return function of the optimal plan is composed of the return functions of its two sub-plans, and the optimal state value function decomposes over the two subtasks in the same way.

The first task can thus be decomposed into two second tasks, which increases the probability of success. Continuing the recursive decomposition reduces the difficulty of the goal. The present disclosure specifies the second preset number of layers of the AND/OR tree in advance.
Illustratively, the OR nodes of the AND/OR tree are also called task nodes. An OR node is marked by its second task (s, g), where s is the current initial state and g is the current position target. The root node of the AND/OR tree is the OR node marked by the original task (s_0, g).

Each non-terminal OR node has up to M AND nodes as child nodes, where M, the maximum number of children of an OR node, is a user-defined hyperparameter. An AND node is marked by the sub-goal used in the decomposition. As long as one first task of the lower layer is completed, the upper-layer OR node is completed.

Illustratively, the AND nodes of the AND/OR tree, also called sequence nodes, are marked by the sequence that decomposes the first task (s, g) into the two second tasks (s, g_sub) and (g_sub, g).

Since (s, g_sub) and (g_sub, g) can be decomposed further, each AND node has two child nodes, each of which is an OR node. Only when both second tasks of the lower layer are completed is the upper-layer AND node completed.
In an embodiment, when the number of layers of the AND/OR tree reaches the first preset number of layers, decomposing the original task in the AND/OR tree with the high-level policy network in step S200 further includes:
executing the following second process cyclically until the number of layers of the AND/OR tree reaches a second preset number of layers:
inputting each second task of the deepest layer of the AND/OR tree into the low-level policy network to obtain the low-level behavior output by the low-level policy network;
determining, from the low-level behavior, the number of steps the low-level policy network learns to complete the second task;
if the number of steps is within a preset step-number range, storing the OR node where the second task is located in the AND/OR tree;
if the number of steps is outside the preset step-number range, executing the first process again.
In this embodiment, when the number of layers of the AND/OR tree is between the first preset number of layers and the second preset number of layers, the AND/OR tree already has a certain structure, and each second task has already been simplified compared with the original task. The second task is input into the low-level policy network, the low-level behavior is output, and the number of steps needed to learn to complete the second task is determined. If the number of steps is within the preset step-number range, the low-level policy network can complete the second task as expected, and the OR node where the second task is located is stored in the AND/OR tree. If the number of steps is outside the preset step-number range, the second task is still too complex and needs to be decomposed again by the high-level policy network. Keeping the OR nodes whose second tasks have low complexity between the first and second preset numbers of layers avoids decomposing them again and reduces the complexity of constructing the AND/OR tree.
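The step-count check of this second process can be sketched as follows, reusing the OrNode class from the earlier sketch; the `env` and `low_policy` callables and the two-sided step range are illustrative assumptions (the patent only speaks of a "preset step-number range").

```python
def try_low_level(task_state, task_goal, low_policy, env, max_steps):
    """Run the low-level policy on a second task and return the number of
    steps it needed, or None if the task was not completed within the budget."""
    state = task_state
    for step in range(1, max_steps + 1):
        action = low_policy(state, task_goal)
        state, reached = env.step(state, action, task_goal)   # assumed API
        if reached:
            return step
    return None

def keep_or_redecompose(or_node, low_policy, env, step_range):
    """Second-process decision: keep (store) the OR node if its second task is
    solved within the preset step range, otherwise hand it back to the
    high-level policy for another round of decomposition."""
    lo, hi = step_range
    steps = try_low_level(or_node.state, or_node.goal, low_policy, env, hi)
    return steps is not None and steps >= lo
```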
In one embodiment, as shown in fig. 6, learning each second task of the last layer of the AND/OR tree with the low-level policy network in step S300 to form the partition tree includes:
S310, inputting each second task of the last layer of the AND/OR tree into the low-level policy network to obtain the low-level behavior output by the low-level policy network.
S320, determining, from the low-level behavior, the number of steps the low-level policy network learns to complete the second task.
S330, if the number of steps is within the preset step-number range, storing the OR node where the second task is located in the AND/OR tree to form the partition tree.
In this embodiment, since the nodes of the last layer of the AND/OR tree are OR nodes, their second tasks can be learned once they are input into the low-level policy network. From the low-level behaviors output by the low-level policy network, it is determined whether the number of steps needed to complete each second task meets expectations. If the number of steps is within the preset step-number range, the low-level policy network can complete the second task as expected, and the OR node where the second task is located is stored in the AND/OR tree to form the partition tree. By checking whether the OR nodes of the last layer can be learned by the low-level policy network, a partition tree reflecting the aircraft trajectory planning is formed and problematic trajectories are eliminated, which improves the validity of the planning. At the same time, inputting only low-complexity second tasks into the low-level policy network reduces the time required for planning and thus improves the efficiency of aircraft trajectory planning.
In one embodiment, after determining in step S320 the number of steps the low-level policy network learns to complete the second task, learning each second task of the last layer of the AND/OR tree with the low-level policy network in step S300 to form the partition tree further includes:
if the number of steps is outside the preset step-number range, determining the OR node where the second task is located as an unsolvable node and outputting an estimated value through the low-level estimation network.
In this embodiment, when the number of steps is outside the preset step-number range, the second task cannot meet the requirements of trajectory planning, and the OR node where it is located is determined to be an unsolvable node and is not stored. Removing nodes that do not meet the requirements reduces the complexity of the partition tree and improves its scalability.
Illustratively, as shown in fig. 7, the overall process of constructing the partition tree is as follows:
S500, given an initial state and a position target, obtaining the original task as the root node.
S510, taking the original task as the input of the high-level policy network and outputting high-level behaviors.
S520, obtaining a sub-goal from the high-level behavior and forming a first task of a lower-layer AND node from the OR node and the sub-goal.
S530, decomposing the first task to form the two second tasks of two lower-layer OR nodes.
S540, judging whether the number of layers of the AND/OR tree reaches the first preset number of layers. If yes, executing step S550. If not, taking each second task, instead of the original task, as the input of the high-level policy network and returning to step S510.
S550, judging whether the number of layers of the AND/OR tree reaches the second preset number of layers. If yes, executing step S570. If not, executing step S560.
S560, inputting the second tasks of the deepest layer of the AND/OR tree into the low-level policy network to obtain the low-level behavior output by the low-level policy network.
S561, determining, from the low-level behavior, the number of steps the low-level policy network learns to complete the second task.
S562, judging whether the number of steps is within the preset step-number range. If yes, executing step S563. If not, taking each second task, instead of the original task, as the input of the high-level policy network and returning to step S510.
S563, storing the OR node where the second task is located in the AND/OR tree and no longer decomposing it.
S570, inputting each second task of the last layer of the AND/OR tree into the low-level policy network to obtain the low-level behavior output by the low-level policy network.
S571, determining, from the low-level behavior, the number of steps the low-level policy network learns to complete the second task.
S572, if the number of steps is within the preset step-number range, storing the OR node where the second task is located in the AND/OR tree to form the partition tree.
S573, if the number of steps is outside the preset step-number range, determining the OR node where the second task is located as an unsolvable node.
When each second task, instead of the original task, is used as the input of the high-level policy network, the second task may be a second task of a deepest-layer OR node, of a non-terminal OR node, or of a terminal OR node of the AND/OR tree.
For example, if the number of layers of the AND/OR tree is between the first preset number of layers and the second preset number of layers and every OR node is a node that is no longer decomposed, expansion can be started again from the root node, and the nodes that were no longer decomposed can be decomposed further.
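A compact driver tying the two processes together is sketched below; it reuses the OrNode class and the expand_and_or_tree and keep_or_redecompose helpers from the earlier sketches, and the layer counting and the `solved` flag are simplifying assumptions rather than the patent's exact bookkeeping.

```python
def build_partition_tree(s0, g, high_policy, low_policy, env,
                         first_preset_layers, second_preset_layers, step_range):
    """Overall flow S500-S573 (sketch): expand the AND/OR tree with the
    high-level policy up to the first preset depth, then alternate low-level
    checks with further decomposition until the second preset depth; the OR
    nodes solved within the step budget form the partition tree."""
    root = OrNode(tuple(s0), tuple(g))
    frontier = expand_and_or_tree(root, high_policy, first_preset_layers)
    depth, solved, unsolvable = first_preset_layers, [], []
    while depth < second_preset_layers and frontier:
        kept, to_split = [], []
        for node in frontier:
            if keep_or_redecompose(node, low_policy, env, step_range):
                node.solved = True                 # stored, no further decomposition
                kept.append(node)
            else:
                to_split.append(node)
        solved.extend(kept)
        frontier = []
        for node in to_split:                      # one more round of decomposition
            frontier.extend(expand_and_or_tree(node, high_policy, 2))
        depth += 1
    for node in frontier:                          # last layer of the AND/OR tree
        node.solved = keep_or_redecompose(node, low_policy, env, step_range)
        (solved if node.solved else unsolvable).append(node)
    return root, solved, unsolvable
```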
Illustratively, a reinforcement learning network architecture is built in advance, before the initial state and the position target of the aircraft are acquired as the original task in step S100. First, a reinforcement learning network architecture (based on the environment model) is built for each layer of the AND/OR tree; each layer uses the networks of the deep deterministic policy gradient (DDPG) algorithm to realize policy training and node evaluation: a current high-level policy network, a current low-level policy network, a current high-level estimation network and a current low-level estimation network, together with their target counterparts.

The high-level policy network is the network used to train the high-level policy, and the low-level policy network is the network used to train the low-level policy. The current high-level policy network, current low-level policy network, current high-level estimation network and current low-level estimation network are used while constructing the partition tree and while traversing the partition tree to form the planning tree. The target high-level policy network, target low-level policy network, target high-level estimation network and target low-level estimation network are obtained by training the corresponding current networks to update their parameters, and after the update they serve respectively as the current high-level policy network, current low-level policy network, current high-level estimation network and current low-level estimation network.

According to the framework of the hierarchical model and the DDPG algorithm, the sub-goal is determined by the high-level policy network π_h, namely:

g_sub = π_h(s, g; θ_h) + ε

wherein g_sub is the high-level behavior (the sub-goal), s is the current initial state of the OR node, g is the current position target of the OR node, θ_h are the parameters of the high-level policy network π_h, and ε is random noise with a mean of 0.
The performance of the high-level policy π_h is measured by a function J(π_h):

J(π_h) = E_(s,g) [ Q_h(s, g, π_h(s, g)) ]

wherein the expectation is taken over the joint distribution of the current initial state and the current position target. In practice, the input is formed by concatenating the vectors s and g.

The optimal policy π_h* is the policy that maximizes the function J, and the function J is updated by stochastic gradient ascent using its gradient ∇_θh J:

∇_θh J ≈ (1/N) Σ_{i=1}^{N} ∇_a Q_h(s_i, g_i, a) |_{a = π_h(s_i, g_i)} ∇_θh π_h(s_i, g_i; θ_h)

wherein ∇_θh J is the gradient of the function J that measures the high-level policy network π_h, ∇_a Q_h is the gradient of the high-level estimation network, (s_i, g_i) are samples from the joint distribution of the current initial state and the current position target, θ_h are the parameters of the high-level policy network π_h, s_i is the i-th current initial state, g_i is the i-th current position target, N is the minimum number of samples taken from the experience pool when updating the parameters θ_h of the high-level policy network, and i is a positive integer.

In this way, the sub-goals can be determined through the high-level policy network.
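A minimal PyTorch sketch of this DDPG-style actor-critic setup is given below; the network sizes, layer counts and the replay-batch interface are illustrative assumptions, not values specified in the patent.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """pi(s, g): outputs a sub-goal (high level) or a primitive action (low level)."""
    def __init__(self, state_dim, goal_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim), nn.Tanh())
    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))

class Critic(nn.Module):
    """Q(s, g, a): estimation network used to evaluate behaviors and nodes."""
    def __init__(self, state_dim, goal_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
    def forward(self, state, goal, action):
        return self.net(torch.cat([state, goal, action], dim=-1))

def update_high_level_actor(actor, critic, actor_opt, batch):
    """One DDPG-style policy improvement step: ascend the critic's value of the
    actor's proposed sub-goals over a minibatch of N (s, g) samples."""
    states, goals = batch                         # each of shape (N, dim)
    subgoals = actor(states, goals)
    # Gradient ascent on J = E[Q(s, g, pi(s, g))], i.e. descent on -J.
    loss = -critic(states, goals, subgoals).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()
```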
Illustratively, the low-level behavior obtained through the low-level policy network π_l is:

a_l = π_l(s, g; θ_l) + ε

wherein a_l is the low-level behavior, θ_l are the parameters of the low-level policy network π_l, and ε is random noise with a mean of 0.
Illustratively, when determining the sub-goal through the high-level policy network, it is desirable that the sub-goal be equidistant from the two ends of the OR node, so that the decomposed second tasks have the same complexity. Specifically, the training of the sub-goals is assisted by adding a distance constraint to the high-level behavior (i.e., the sub-goal) output by the high-level policy network. The behavior output by the high-level policy network is a sub-goal g_sub at the "midpoint" between the current initial state s and the current position target g. The sub-goal g_sub has the same dimension on each component as the current initial state s and the current position target g.

The distance between the sub-goal g_sub output by the high-level policy network and the current initial state s is denoted d^1, and the distance between the sub-goal g_sub and the current position target g is denoted d^2.
In one embodiment, to place the sub-goal g_sub at the "midpoint" between the current initial state s and the current position target g, two constraints are used: minimizing the difference between d^1 and d^2, i.e., making the distance difference between the sub-goal and the two ends of the OR node minimal; and minimizing the sum of d^1 and d^2, i.e., making the sum of the distances between the sub-goal and the two ends of the OR node minimal.
In this embodiment, minimizing the distance difference between the sub-goal and the two ends of the OR node keeps the complexity difference between the first tasks below the OR node small, which improves the efficiency of processing the first tasks in parallel. Minimizing the sum of the distances between the sub-goal and the two ends of the OR node shortens the path of the aircraft trajectory, thereby optimizing the trajectory.
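The two constraint terms can be computed as sketched below, assuming Euclidean distance between numeric vectors (the patent does not fix a particular distance metric):

```python
import numpy as np

def midpoint_constraints(subgoal, state, goal):
    """Distance terms used to push the proposed sub-goal toward the midpoint
    of the OR node: d1 is the distance to the current initial state, d2 the
    distance to the current position target.  Minimising |d1 - d2| balances
    the two second tasks; minimising d1 + d2 keeps the sub-goal on a short path."""
    d1 = float(np.linalg.norm(np.asarray(subgoal) - np.asarray(state)))
    d2 = float(np.linalg.norm(np.asarray(subgoal) - np.asarray(goal)))
    return d1 + d2, abs(d1 - d2)
```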
In one embodiment, the two constraints described above are minimized through the loss function of the high-level policy network. The loss function of the high-level policy network is computed from a main loss function and an auxiliary loss function. The parameters associated with the main loss function and with the overall loss function are updated by stochastic gradient ascent, while the parameters associated with the auxiliary loss function are updated by stochastic gradient descent.
In this embodiment, training of the high-level policy network is assisted by the loss function, so that the sum and the difference of the distances between the proposed sub-goal and the two ends of the OR node are reduced as much as possible and the sub-goal moves toward the midpoint of the two ends of the OR node. Assisting the training of the sub-goals proposed by the high-level policy network through the loss function shortens the time required for trajectory planning and thus improves the efficiency of aircraft trajectory planning.
In one embodiment, the main loss function L_main is:

L_main = J(π_h)

the auxiliary loss function L_aux is:

L_aux = (1/N) Σ_{i=1}^{N} [ λ_1 (d_i^1 + d_i^2) + λ_2 |d_i^1 - d_i^2| ]

and the loss function L is:

L = L_main - β · L_aux

wherein J is the function that measures the high-level policy network π_h, whose gradient ∇_θh J is taken over the joint distribution of the current initial state and the current position target, θ_h are the parameters of the high-level policy network π_h, N is the minimum number of samples taken from the experience pool when updating the parameters θ_h of the high-level policy network, λ_1 is the first weight parameter, λ_2 is the second weight parameter, d_i^1 is the distance of the i-th sub-goal from the current initial state, d_i^2 is the distance of the i-th sub-goal from the current position target, β is the weight of the auxiliary loss function, and i is a positive integer. The weight β of the auxiliary loss function may, for example, be 0.3.
Illustratively, $\lambda_1$ represents the weight given to the sum of the distance between the sub-target and the current initial state and the distance between the sub-target and the current position target, while $\lambda_2$ represents the weight given to the difference between the distance of the sub-target from the current initial state and the distance of the sub-target from the current position target.
In this embodiment, within the loss function, the high-level policy network is trained through the auxiliary loss function so as to minimize both the sum and the difference of the distances between the sub-target and the two ends of the or node. Setting the first weight parameter to a smaller value prevents the aircraft from falling into a suboptimal trajectory, while setting the second weight parameter to a larger value pushes the sub-target toward the midpoint between the two ends of the or node. By adjusting the first and second weight parameters in the auxiliary loss function, the sub-targets and the trajectory of the aircraft are optimized, thereby improving the efficiency of aircraft trajectory planning.
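A rough sketch of how the auxiliary penalty could be assembled from a mini-batch of sampled sub-targets; the array layout, Euclidean distance and function names are assumptions for illustration, not the patent's own implementation:

```python
import numpy as np

def auxiliary_loss(s0_batch, g_batch, sub_batch, lam1, lam2):
    """Mean weighted sum/difference-of-distances penalty over N sampled sub-targets."""
    d1 = np.linalg.norm(sub_batch - s0_batch, axis=1)  # distances to current initial states
    d2 = np.linalg.norm(sub_batch - g_batch, axis=1)   # distances to current position targets
    return float(np.mean(lam1 * (d1 + d2) + lam2 * np.abs(d1 - d2)))

# In training this term would be combined with the policy objective J of the
# high-level network (J maximised by gradient ascent, the auxiliary term
# minimised by gradient descent), weighted by alpha, e.g. alpha = 0.3 as above.
```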
In one embodiment, as shown in fig. 8, the target trajectory of the aircraft in step S400 is determined according to the partition tree through the following steps:
s410, traversing the division tree to form a planning tree.
S420, forming a solution tree according to the planning tree.
S430, determining the target track of the aircraft according to the solution tree.
In this embodiment, by traversing the partition tree, the unsolvable nodes in the partition tree are pruned to form a planning tree containing multiple trajectories. The nodes with the optimal trajectory are retained from the planning tree to form a solution tree, and the target trajectory of the aircraft is determined from the solution tree. The solution tree is obtained by gradually optimizing the partition tree and removing non-optimal trajectories, thereby optimizing the trajectory of the aircraft. At the same time, pruning the partition tree reduces the complexity of the and-or tree structure and therefore improves the scalability of the and-or tree.
Illustratively, traversing the partition tree in step S410 to form a planning tree means that, in one traversal, a planning tree is formed from all selected and nodes and or nodes. After the traversal ends, the state value and the visit count of every visited node are updated through the high-level evaluation network. Starting from the end of the partition tree, the update is back-propagated to the root node, and on the back-propagation path the state values of all nodes (i.e., the estimates produced by the high-level evaluation network) are refreshed. The state value of a node is the value output by the high-level evaluation network when the node's current initial state and current position target are given as input. The label of the high-level evaluation network is the return value obtained by the agent.
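As a rough illustration of this back-propagation step, the following sketch walks from a leaf of the planning tree back to the root; the node fields and the `evaluate` callable are assumed names standing in for the high-level evaluation network:

```python
def backpropagate(leaf, evaluate):
    """Refresh visit counts and state values along the path leaf -> root.

    `evaluate(state, goal)` stands in for the high-level evaluation network;
    nodes are assumed to expose .state, .goal, .value, .visits and .parent."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.value = evaluate(node.state, node.goal)  # refreshed estimate
        node = node.parent
```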
The planning tree may also be extended and updated after it has been formed. If the number of child nodes of an or node in the planning tree is less than the maximum number of child nodes, step S200 is repeated to expand the node. If the number of child nodes of the or node equals the maximum number of child nodes, a node is selected according to the upper confidence bound rule for trees. Once the number of samples in the experience pool exceeds the minimum number of samples taken when updating the parameters of the high-level policy network, the unsolvable nodes in the planning tree are pruned and the parameters of each network are updated every time the training segments reach the preset segment period; every time the training segments reach twice the preset segment period, the child node of the root node with the lowest estimated value is pruned and step S200 is repeated to expand the node.
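The upper confidence bound rule for trees is not spelled out in the text; one common form of such a rule is sketched below, where the exploration constant `c` and the node fields are assumptions made only for illustration:

```python
import math

def select_child(or_node, c=1.4):
    """Upper-confidence-bound selection among the children of a fully expanded or node."""
    total_visits = sum(child.visits for child in or_node.children) + 1
    def ucb(child):
        if child.visits == 0:
            return float("inf")  # prefer children that have never been tried
        return child.value + c * math.sqrt(math.log(total_visits) / child.visits)
    return max(or_node.children, key=ucb)
```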
Illustratively, forming a solution tree according to the planning tree in step S420 means pruning the child node of the root node of the planning tree with the lowest state value after step S410 has been traversed a preset number of times, so as to form the solution tree. As shown in fig. 9, the solution tree has the following three properties: 1. The root node is in the solution tree. 2. Each or node has at most one child node. 3. If an and node is in the solution tree, both child nodes of that and node must also be in the solution tree.
Illustratively, determining the target trajectory of the aircraft from the solution tree in step S430 means that, once the solution tree has been formed by step S420, the original task has been decomposed into the solution tree. Therefore, the optimal trajectory of the aircraft can be determined from the first tasks and second tasks of each layer in the solution tree.
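A small recursive sketch of an extraction respecting the three properties listed above (keep the root, keep at most one child per or node, keep both children of every and node); the `kind`, `children` and `value` fields are illustrative assumptions:

```python
def extract_solution_tree(node):
    """Collect the nodes kept in the solution tree rooted at `node`."""
    kept = [node]                                            # property 1: the root is kept
    if node.kind == "or" and node.children:
        best = max(node.children, key=lambda ch: ch.value)   # property 2: at most one child per or node
        kept.extend(extract_solution_tree(best))
    elif node.kind == "and":
        for child in node.children:                          # property 3: both children of an and node are kept
            kept.extend(extract_solution_tree(child))
    return kept
```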
As shown in fig. 10, a specific description is given of a method for planning an aircraft trajectory in the present embodiment:
S600, constructing an environment model. The environment model specifies the state space, the target space and the behavior space of the agent on the basis of a goal-controlled Markov decision process. The state space is the same as the target space and represents the position-state information of the agent. The behavior space represents the movement direction of the agent.
S610, constructing a hierarchical architecture. The hierarchical architecture is a hierarchical structure using nested policies: a goal-controlled Markov decision process for the high level of the agent and a goal-controlled Markov decision process for the low level of the agent are set, and the goal-controlled Markov decision process is a finite-horizon episodic Markov decision process. When the high-level policy network outputs a sub-target, the sub-target is passed down as the target of the next layer. The target satisfies both the behavior space of the high-level policy network and the target space of the low-level policy network; that is, the state space of the high-level policy network and that of the low-level policy network are the same. The behavior space of the high-level policy network is set to be the state space, i.e. $\mathcal{A}_h = \mathcal{S}$.
S620, constructing a reward model. The reward model is: if the agent of any layer completes the given target at any time step, the reward value is 1 and the episode ends; otherwise the reward value is 0 (a minimal sketch of this reward model is given after step S690 below).
S630, constructing a reinforcement learning network architecture. Each layer in the and-or tree uses networks of the deep deterministic policy gradient algorithm to realize policy training and node evaluation. Each layer contains 4 deep neural networks for training. The parameters of the deep deterministic policy gradient algorithm are set, including the range of the network output values, the exploration probability values, and the architecture of the neural networks, i.e., the number of layers, the network nodes and the activation function employed by each network.
S640, constructing a loss function. Wherein a first weight parameter and a second weight parameter in the loss function are set.
S650, acquiring initial states and position targets of the aircraft as original tasks.
S660, constructing a division tree according to the original task.
S670, traversing the division tree to form a planning tree.
S680, forming a solution tree according to the planning tree.
S690, determining the target track of the aircraft according to the solution tree.
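As a purely illustrative sketch of the goal-conditioned environment and reward model of steps S600 to S620 (sparse binary reward, high-level behaviors drawn from the state space), with assumed names and the (-8, 8) position range quoted from the test configuration further below:

```python
import numpy as np

STATE_LOW, STATE_HIGH = -8.0, 8.0  # x/y range shared by state, target and sub-goal spaces

def reward(position, target, threshold):
    """Sparse goal-conditioned reward: 1 when the target is reached (episode ends), else 0."""
    return 1.0 if np.linalg.norm(np.asarray(position) - np.asarray(target)) <= threshold else 0.0

def clip_subgoal(subgoal):
    """High-level behaviors are themselves states/sub-goals, so they share the state-space bounds."""
    return np.clip(subgoal, STATE_LOW, STATE_HIGH)
```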
As shown in fig. 11, steps S660 to S680 include the steps of:
s700, the first preset layer number and the second preset layer number of the AND or tree are specified, and the maximum number of child nodes of each OR node is specified.
S710, taking the original task as a root node of an AND or tree.
S720, taking the original task as input of a high-level strategy network, and outputting high-level behaviors.
S730, taking the high-level behavior as a sub-target.
S740, forming a first task of the lower-layer and node according to the or node and the sub-target.
S750, decomposing each first task to form second tasks of the lower-layer or nodes (a rough sketch of this decomposition structure is given after this procedure).
And S760, when the number of layers of the AND or Tree does not reach the first preset number of layers, replacing the original task with each second task to serve as the input of the high-level strategy network, and returning to the step S720.
And S770, inputting each second task at the deepest layer in the AND or tree into the low-level strategy network when the number of layers of the AND or tree reaches the first preset number of layers and does not reach the second preset number of layers, and obtaining the low-level behavior output by the low-level strategy network.
S780, determining the number of steps learned by the low-level strategy network to complete the second task according to the low-level behaviors.
S790, judging whether the step number is within a preset step number range. If yes, go to step S800. If not, the second tasks replace the original tasks as input of the higher-level policy network, and then the step S720 is returned.
S800, storing the or node where the second task is located in an AND or tree and not decomposing any more.
And S810, when the number of layers of the AND or tree reaches a second preset number of layers, inputting each second task of the last layer of the AND or tree into the low-level strategy network to obtain the low-level behavior output by the low-level strategy network.
S820, determining the step number learned by the low-level strategy network to complete the second task according to the low-level behavior.
And S830, if the step number is within the preset step number range, storing the or node where the second task is located in the or tree to form a partition tree, and executing step S850.
S840, if the step number is outside the preset step number range, determining the or node where the second task is located as an unsolvable node and pruning it.
S850, traversing the division tree to form a planning tree. Wherein the root node is the first node to be examined.
S860, judging whether the number of the child nodes in the planning tree or the node is smaller than the maximum number of the child nodes. If so, each second task is used as an input of the higher-level policy network instead of the original task, and the process returns to step S720. If not, step S870 is performed.
S870, selecting nodes through the confidence upper bound rule of the tree.
S880, pruning the unsolvable nodes in the planning tree structure and updating the parameters of each network when the training segments reach the preset segment period.
And S890, pruning the child node with the lowest estimated value in the root node when the training segment reaches 2 times of the preset segment period.
S900, judging whether the number of times of traversal reaches the preset number of times of traversal. If yes, go to step S910. If not, return to step S850.
S910, taking the planning tree as a solution tree.
When the parameters of each network are updated, the update of the high-level policy network is assisted through the loss function. After the solution tree is obtained, the result of the solution tree is stored in a tree-structure experience pool for auxiliary training.
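The data structure implied by steps S710 to S760 can be sketched roughly as follows; the class layout, field names and the `high_level_policy` callable are illustrative assumptions rather than the patent's own implementation:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class OrNode:
    """An or node carries one (current initial state, current position target) task."""
    state: tuple
    goal: tuple
    children: List["AndNode"] = field(default_factory=list)
    value: float = 0.0
    visits: int = 0

@dataclass
class AndNode:
    """An and node (first task) splits its parent task at a sub-target."""
    subgoal: tuple
    left: Optional[OrNode] = None   # second task: initial state -> sub-target
    right: Optional[OrNode] = None  # second task: sub-target -> position target

def expand(or_node, high_level_policy):
    """One decomposition step (roughly S720-S750): propose a sub-target and
    attach an and node whose two or-node children are the decomposed second tasks."""
    sub = high_level_policy(or_node.state, or_node.goal)
    and_node = AndNode(subgoal=sub,
                       left=OrNode(or_node.state, sub),
                       right=OrNode(sub, or_node.goal))
    or_node.children.append(and_node)
    return and_node
```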
As shown in fig. 12 and 13, a test procedure of the aircraft trajectory planning method in the present embodiment will be described as follows:
the test environment is an ant four road environment in MuJoCo, an analog simulation environment of air traffic and the like.
In constructing the environment model, an initial state space is provided that includes the position of the agent in the MuJoCo simulation environment and the ranges of the angles and velocities of all joints. The range of the state-space dimensions on the x-axis and y-axis is set to (-8, 8), and the ranges of the joint angles and velocities are generated by the Ant quadruped environment in MuJoCo. The absolute values of the coordinates of the agent's initial state and of the position target are set to (3,6,5), and the two are not in the same quadrant. The position space of the sub-target is the same as the target space; the achievement threshold of the position target is 0.4 and the achievement threshold of the sub-target is 0.8.
In the process of constructing the hierarchical architecture, the behavior space of the high-level policy network is set to be the state space, i.e. $\mathcal{A}_h = \mathcal{S}$, and the position vectors x and y range over (-8, 8).
In the process of constructing the reward model, if the agent of any layer completes the given target at any time step, the reward value is 1 and the episode ends; otherwise the reward value is 0.
In the process of constructing the reinforcement learning network architecture, each layer in the and-or tree uses networks of the deep deterministic policy gradient algorithm to realize policy training and node evaluation. The range of the network output values uses bounded V values: the output of the V-value function is limited to a bounded interval with a negative sigmoid function, the lower limit of 0 being chosen because the reward values of the algorithm are non-negative. The exploration probability values are: with a probability of 20% the action is sampled uniformly at random from the action space of the layer, and with a probability of 80% the action is the sum of the action sampled from the policy of the layer and Gaussian noise. The architecture of the neural networks is: both the policy network and the evaluation network consist of 3 hidden layers, each hidden layer has 64 neurons, and the activation function is the ReLU function.
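A PyTorch-style sketch of the per-layer network architecture and the exploration rule just described; the framework choice, function names and the Gaussian noise scale are assumptions for illustration:

```python
import numpy as np
import torch
import torch.nn as nn

def make_network(in_dim, out_dim):
    """3 hidden layers of 64 ReLU units, as described for the policy and evaluation networks."""
    return nn.Sequential(
        nn.Linear(in_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, out_dim),
    )

def explore(policy, state_goal, low, high, noise_std=0.1):
    """20%: uniform random action from the layer's action space;
    80%: policy action plus Gaussian noise (noise scale is an assumption)."""
    if np.random.rand() < 0.2:
        return np.random.uniform(low, high)
    with torch.no_grad():
        action = policy(torch.as_tensor(state_goal, dtype=torch.float32)).numpy()
    return np.clip(action + np.random.normal(0.0, noise_std, size=action.shape), low, high)
```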
For the loss function, $\lambda_1$ initially takes the value 0.5 and, after each traversal, is decreased by 0.01n (n ∈ Z) according to the observed results; $\lambda_2$ initially takes the value 0.5 and, after each traversal, is increased by 0.01n (n ∈ Z) according to the observed results.
In the process of constructing the partition tree, the first preset layer number of the and-or tree is set to 3, the second preset layer number to 7, the maximum number of child nodes of each or node to 5, the preset number of steps for each layer to complete its target to 5, and the preset segment period to 5. The network parameters are updated 40 times after each traversal of the partition tree, and the minimum number of experience-pool samples is taken when the parameters of the high-level policy network are updated. After every preset segment period of traversals, the child node of the root node with the lowest state value is pruned.
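For readability, the hyperparameters of this test configuration could be collected as follows; the variable names are not from the patent, and the minimum experience-pool sample count is left out because its value is not recoverable from the text:

```python
partition_tree_config = {
    "first_preset_layers": 3,        # depth at which low-level solvability is first checked
    "second_preset_layers": 7,       # maximum depth of the and-or tree
    "max_children_per_or_node": 5,
    "steps_to_complete_target": 5,   # preset step budget per layer
    "preset_segment_period": 5,
    "updates_per_traversal": 40,
    "aux_loss_weight_alpha": 0.3,    # example value given earlier in the text
    "lambda1_init": 0.5,             # decreased after traversals
    "lambda2_init": 0.5,             # increased after traversals
}
```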
The above descriptions may be implemented alone or in various combinations, and these modifications are within the scope of the present disclosure.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices) and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In this disclosure, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of additional identical elements in an article or apparatus that comprises the element.
While the preferred embodiments of the present disclosure have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the disclosure.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present disclosure without departing from the spirit or scope of the disclosure. Thus, given that such modifications and variations of the present disclosure fall within the scope of the claims of the present disclosure and their equivalents, the intent of the present disclosure is to encompass such modifications and variations as well.

Claims (10)

1. A method of planning an aircraft trajectory, the method comprising:
acquiring an initial state and a position target of an aircraft as an original task;
decomposing the original task in an and-or tree by a high-level policy network: in each or node of the and-or tree, determining a sub-target whose distance difference from the two ends of the or node is smallest, forming a first task of a lower-layer and node according to the or node and the sub-target, and decomposing each first task to form second tasks of lower-layer or nodes;
Learning each second task of the last layer in the AND or tree by using a low-layer strategy network to form a division tree, wherein the division tree comprises a plurality of initial tracks of the aircraft obtained through learning;
and determining the target track of the aircraft according to the division tree.
2. A method of planning a trajectory of an aircraft according to claim 1, wherein the sum of the distances of the sub-targets from both ends of the or node is minimal.
3. The method of claim 2, wherein a loss function of the high-level policy network is used to assist training of the high-level policy network, the loss function being calculated from a main loss function and an auxiliary loss function, the parameter updating method for the main loss function and the loss function being a stochastic gradient ascent method, and the parameter updating method for the auxiliary loss function being a stochastic gradient descent method.
4. A method of planning an aircraft trajectory according to claim 3, characterized in that the gradient of the main loss function $J$ is estimated as:

$$\nabla_{\theta^{\mu_h}} J \approx \mathbb{E}_{(s_0, g)\sim\rho}\!\left[\nabla_{a} Q_h\!\left(s_0, g, a \mid \theta^{Q_h}\right)\Big|_{a=\mu_h(s_0, g)}\, \nabla_{\theta^{\mu_h}} \mu_h\!\left(s_0, g \mid \theta^{\mu_h}\right)\right]$$

the auxiliary loss function $L_{aux}$ is:

$$L_{aux} = \frac{1}{N}\sum_{i=1}^{N}\left[\lambda_1\left(d_1^{(i)} + d_2^{(i)}\right) + \lambda_2\left|d_1^{(i)} - d_2^{(i)}\right|\right]$$

and the loss function $L$ is:

$$L = J - \alpha L_{aux}$$

wherein $\nabla_{\theta^{\mu_h}} J$ is the gradient of the function $J$ measuring the high-level policy network $\mu_h$, $Q_h$ denotes the high-level evaluation network, $\rho$ is the joint distribution of the current initial state and the current position target, $\theta^{\mu_h}$ is the parameter of the high-level policy network $\mu_h$, $N$ is the minimum number of experience-pool samples taken when updating the parameter of the high-level policy network $\mu_h$, $\lambda_1$ is a first weight parameter, $\lambda_2$ is a second weight parameter, $d_1^{(i)}$ is the distance of the $i$-th sub-target from the current initial state, $d_2^{(i)}$ is the distance of the $i$-th sub-target from the current position target, $\alpha$ is the weight of the auxiliary loss function, and $i$ is a positive integer.
5. The method of aircraft trajectory planning according to claim 1, wherein said decomposing the raw mission in an and-or tree with a high-level policy network comprises:
taking the original task as a root node of the AND or tree, wherein the root node is the OR node of the AND or tree first layer;
the following first process is circularly executed until the number of layers of the AND or tree reaches a first preset number of layers:
inputting the or node into the high-level policy network to obtain the high-level behavior output by the high-level policy network; the distance difference between the high-level behavior and the two ends of the or node is the smallest;
Determining the higher-level behavior as the child target;
forming the first task of the lower layer of the AND node according to the OR node and the sub-target;
and decomposing the first tasks to form second tasks of the lower layer or nodes, wherein each second task comprises the sub-target.
6. The method of claim 5, wherein when the number of layers of the and-or tree reaches the first preset number of layers, the decomposing the original task in the and-or tree with a higher-layer policy network, further comprises:
and circularly executing the following second process until the number of layers of the AND or tree reaches a second preset number of layers:
inputting each second task in the deepest layer in the AND or tree into the low-level strategy network to obtain the low-level behavior output by the low-level strategy network;
determining the number of steps learned by the low-level policy network to complete the second task according to the low-level behavior;
if the step number is within the preset step number range, storing the or node where the second task is located in the AND or tree;
if the step number is out of the preset step number range, executing the first process again;
Wherein the second preset layer number is the last layer of the AND or tree.
7. The method of claim 5, wherein the high-level behavior output by the high-level policy network is as follows:

$$a_h = \mu_h\!\left(s_0, g \mid \theta^{\mu_h}\right) + \mathcal{N}_h$$

wherein $a_h$ is the high-level behavior, $\mu_h$ is the high-level policy network, $s_0$ is the current initial state of the or node, $g$ is the current position target of the or node, $\theta^{\mu_h}$ is the parameter of the high-level policy network $\mu_h$, and $\mathcal{N}_h$ is random noise with a mean value of 0.
8. The method of claim 1, wherein learning each of the second tasks of the last layer in the and or tree with a low-level policy network forms a partition tree, comprising:
inputting each second task of the last layer in the AND or tree into the low-level strategy network to obtain the low-level behavior output by the low-level strategy network;
determining the number of steps learned by the low-level policy network to complete the second task according to the low-level behavior;
and if the step number is within a preset step number range, storing the or node where the second task is located in the AND or tree to form the partition tree.
9. The method of claim 8, wherein the low-level behavior output by the low-level policy network is as follows:

$$a_l = \mu_l\!\left(s_0, g \mid \theta^{\mu_l}\right) + \mathcal{N}_l$$

wherein $a_l$ is the low-level behavior, $\mu_l$ is the low-level policy network, $s_0$ is the current initial state of the or node, $g$ is the current position target of the or node, $\theta^{\mu_l}$ is the parameter of the low-level policy network $\mu_l$, and $\mathcal{N}_l$ is random noise with a mean value of 0.
10. The method of planning a trajectory of an aircraft according to any one of claims 1 to 9, wherein said determining a target trajectory of the aircraft from the partition tree comprises:
traversing the division tree to form a planning tree;
forming a solution tree according to the planning tree;
and determining the target track of the aircraft according to the solution tree.
CN202310540315.4A 2023-05-15 2023-05-15 Aircraft trajectory planning method Active CN116307331B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310540315.4A CN116307331B (en) 2023-05-15 2023-05-15 Aircraft trajectory planning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310540315.4A CN116307331B (en) 2023-05-15 2023-05-15 Aircraft trajectory planning method

Publications (2)

Publication Number Publication Date
CN116307331A true CN116307331A (en) 2023-06-23
CN116307331B CN116307331B (en) 2023-08-04

Family

ID=86803476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310540315.4A Active CN116307331B (en) 2023-05-15 2023-05-15 Aircraft trajectory planning method

Country Status (1)

Country Link
CN (1) CN116307331B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116882607A (en) * 2023-07-11 2023-10-13 中国人民解放军军事科学院***工程研究院 Key node identification method based on path planning task

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105353768A (en) * 2015-12-08 2016-02-24 清华大学 Unmanned plane locus planning method based on random sampling in narrow space
CN112947592A (en) * 2021-03-30 2021-06-11 北京航空航天大学 Reentry vehicle trajectory planning method based on reinforcement learning
CN113848974A (en) * 2021-09-28 2021-12-28 西北工业大学 Aircraft trajectory planning method and system based on deep reinforcement learning
CN114253296A (en) * 2021-12-22 2022-03-29 中国人民解放军国防科技大学 Airborne trajectory planning method and device for hypersonic aircraft, aircraft and medium
US20220375352A1 (en) * 2021-08-18 2022-11-24 The 28Th Research Institute Of China Electronics Technology Group Corporation Flight trajectory multi-objective dynamic planning method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105353768A (en) * 2015-12-08 2016-02-24 清华大学 Unmanned plane locus planning method based on random sampling in narrow space
CN112947592A (en) * 2021-03-30 2021-06-11 北京航空航天大学 Reentry vehicle trajectory planning method based on reinforcement learning
US20220375352A1 (en) * 2021-08-18 2022-11-24 The 28Th Research Institute Of China Electronics Technology Group Corporation Flight trajectory multi-objective dynamic planning method
CN113848974A (en) * 2021-09-28 2021-12-28 西北工业大学 Aircraft trajectory planning method and system based on deep reinforcement learning
CN114253296A (en) * 2021-12-22 2022-03-29 中国人民解放军国防科技大学 Airborne trajectory planning method and device for hypersonic aircraft, aircraft and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Yuan et al.: "Bi-level path-trajectory planning for hypersonic vehicles with complex no-fly zones", 宇航学报 (Journal of Astronautics), vol. 43, no. 5, pages 615-626 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116882607A (en) * 2023-07-11 2023-10-13 中国人民解放军军事科学院***工程研究院 Key node identification method based on path planning task
CN116882607B (en) * 2023-07-11 2024-02-02 中国人民解放军军事科学院***工程研究院 Key node identification method based on path planning task

Also Published As

Publication number Publication date
CN116307331B (en) 2023-08-04

Similar Documents

Publication Publication Date Title
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
US11403526B2 (en) Decision making for autonomous vehicle motion control
Phung et al. Motion-encoded particle swarm optimization for moving target search using UAVs
CN110956148B (en) Autonomous obstacle avoidance method and device for unmanned vehicle, electronic equipment and readable storage medium
Hug et al. Particle-based pedestrian path prediction using LSTM-MDL models
CN112685165B (en) Multi-target cloud workflow scheduling method based on joint reinforcement learning strategy
CN116307331B (en) Aircraft trajectory planning method
CN114415735B (en) Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method
Tagliaferri et al. A real-time strategy-decision program for sailing yacht races
CN116136945A (en) Unmanned aerial vehicle cluster countermeasure game simulation method based on anti-facts base line
Abed-Alguni Cooperative reinforcement learning for independent learners
CN115964898A (en) Bignty game confrontation-oriented BC-QMIX on-line multi-agent behavior decision modeling method
Mitchell et al. Persistent multi-robot mapping in an uncertain environment
Tien et al. Deep Reinforcement Learning Applied to Airport Surface Movement Planning
CN114861368A (en) Method for constructing railway longitudinal section design learning model based on near-end strategy
CN114527759A (en) End-to-end driving method based on layered reinforcement learning
Etesami et al. Non-cooperative multi-agent systems with exploring agents
Nobahari et al. Optimization of fuzzy rule bases using continous ant colony system
Ma Model-based reinforcement learning for cooperative multi-agent planning: exploiting hierarchies, bias, and temporal sampling
Lin et al. Identification and prediction using neuro-fuzzy networks with symbiotic adaptive particle swarm optimization
Walther Topological path planning for information gathering in alpine environments
Wang Knowledge transfer in reinforcement learning: How agents should benefit from prior knowledge
CN117556681B (en) Intelligent air combat decision method, system and electronic equipment
CN114970714B (en) Track prediction method and system considering uncertain behavior mode of moving target
Ellis Multi-agent path finding with reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant