CN116307331B - Aircraft trajectory planning method


Info

Publication number
CN116307331B
Authority
CN
China
Prior art keywords
tree
node
level
task
target
Prior art date
Legal status
Active
Application number
CN202310540315.4A
Other languages
Chinese (zh)
Other versions
CN116307331A (en)
Inventor
张筱
吴发国
郭宁
姚望
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN202310540315.4A
Publication of CN116307331A
Application granted
Publication of CN116307331B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q 10/047 Optimisation of routes or paths, e.g. travelling salesman problem
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q 50/40 Business processes related to the transportation industry

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Evolutionary Computation (AREA)
  • Development Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Operations Research (AREA)
  • Medical Informatics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure relates to a method for planning an aircraft trajectory, in the technical field of aviation. The method comprises: acquiring an initial state and a position target of an aircraft as an original task; decomposing the original task in an AND-OR tree with a high-level policy network, wherein in each OR node a sub-target is determined whose distance difference to the two ends of the OR node is smallest, a first task of the lower-level AND node is formed from the OR node and the sub-target, and each first task is decomposed to form the second tasks of the lower-level OR nodes; learning each second task of the last layer of the AND-OR tree with a low-level policy network to form a partition tree, the partition tree containing a plurality of learned initial trajectories of the aircraft; and determining a target trajectory of the aircraft from the partition tree. Because the sub-target determined in each OR node minimizes the distance difference to the two ends of that node, the second tasks can be processed faster, which improves the efficiency of aircraft trajectory planning.

Description

Aircraft trajectory planning method
Technical Field
The disclosure relates to the field of aviation technology, and in particular to a method for planning an aircraft trajectory.
Background
With the continuing growth of air-route traffic, planning an aircraft's flight path in advance under limited airspace resources has become increasingly important. However, existing aircraft trajectory planning methods involve a complex planning process, which leads to low planning efficiency.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a method of planning an aircraft trajectory.
According to a first aspect of embodiments of the present disclosure, there is provided a method for planning an aircraft trajectory, the method comprising:
acquiring an initial state and a position target of an aircraft as an original task;
decomposing the original task in an AND-OR tree with a high-level policy network; determining, in each OR node of the AND-OR tree, a sub-target whose distance difference to the two ends of the OR node is smallest, forming a first task of the lower-level AND node from the OR node and the sub-target, and decomposing each first task to form the second tasks of the lower-level OR nodes;
learning each second task of the last layer of the AND-OR tree with a low-level policy network to form a partition tree, wherein the partition tree comprises a plurality of learned initial trajectories of the aircraft;
and determining the target trajectory of the aircraft according to the partition tree.
In some embodiments of the disclosure, the sum of the distances from the sub-target to the two ends of the OR node is also minimal.
In some embodiments of the present disclosure, a loss function of the high-level policy network is used to assist in training the high-level policy network. The loss function is calculated from a main loss function and an auxiliary loss function; the parameters of the main loss function and of the overall loss function are updated by stochastic gradient ascent, while the parameters of the auxiliary loss function are updated by stochastic gradient descent.
In some embodiments of the disclosure, the main loss function $J(\theta^{hi})$ is given by:
$$J(\theta^{hi}) = \mathbb{E}_{(s,g)\sim\rho}\Big[Q^{hi}\big(s, g, \mu^{hi}(s, g; \theta^{hi})\big)\Big]$$
the auxiliary loss function $L_{aux}(\theta^{hi})$ is given by:
$$L_{aux}(\theta^{hi}) = \frac{1}{N}\sum_{i=1}^{N}\Big[\lambda_{1}\big(d_{i}^{s}+d_{i}^{g}\big)+\lambda_{2}\big|d_{i}^{s}-d_{i}^{g}\big|\Big]$$
and the loss function $L(\theta^{hi})$ is given by:
$$L(\theta^{hi}) = J(\theta^{hi}) - \omega\, L_{aux}(\theta^{hi})$$
where $J$ is the function measuring the high-level policy network $\mu^{hi}$, $\nabla_{\theta^{hi}}J$ is its gradient, $\rho$ is the joint distribution of the current initial state and the current position target, $\theta^{hi}$ are the parameters of the high-level policy network $\mu^{hi}$, $N$ is the minimum number of experience-pool samples taken when updating the parameters of the high-level policy network, $\lambda_{1}$ is the first weight parameter, $\lambda_{2}$ is the second weight parameter, $d_{i}^{s}$ is the distance of the $i$-th sub-target from the current initial state, $d_{i}^{g}$ is the distance of the $i$-th sub-target from the current position target, $\omega$ is the weight of the auxiliary loss function, and $i$ is a positive integer.
In some embodiments of the disclosure, decomposing the original task in an AND-OR tree with a high-level policy network includes:
taking the original task as the root node of the AND-OR tree, wherein the root node is the OR node of the first layer of the AND-OR tree;
cyclically executing the following first process until the number of layers of the AND-OR tree reaches a first preset number of layers:
inputting the OR node into the high-level policy network to obtain the high-level behavior output by the high-level policy network, wherein the distance difference between the high-level behavior and the two ends of the OR node is smallest;
determining the high-level behavior as the sub-target;
forming the first task of the lower-level AND node from the OR node and the sub-target;
and decomposing each first task to form the second tasks of the lower-level OR nodes, wherein each second task contains the sub-target.
In some embodiments of the present disclosure, when the number of layers of the AND-OR tree reaches the first preset number of layers, decomposing the original task in the AND-OR tree with the high-level policy network further includes:
cyclically executing the following second process until the number of layers of the AND-OR tree reaches a second preset number of layers:
inputting each second task in the deepest layer of the AND-OR tree into the low-level policy network to obtain the low-level behavior output by the low-level policy network;
determining, according to the low-level behavior, the number of steps the low-level policy network learns to complete the second task;
if the number of steps is within a preset step-number range, retaining the OR node where the second task is located in the AND-OR tree;
if the number of steps is outside the preset step-number range, executing the first process again;
wherein the second preset number of layers is the last layer of the AND-OR tree.
In some embodiments of the present disclosure, the high-level behavior output by the high-level policy network is:
$$a^{hi} = \mu^{hi}(s, g; \theta^{hi}) + \epsilon$$
where $a^{hi}$ is the high-level behavior, $\mu^{hi}$ is the high-level policy network, $s$ is the current initial state of the OR node, $g$ is the current position target of the OR node, $\theta^{hi}$ are the parameters of the high-level policy network $\mu^{hi}$, and $\epsilon$ is random noise with a mean value of 0.
In some embodiments of the disclosure, learning each second task of the last layer of the AND-OR tree with the low-level policy network to form the partition tree includes:
inputting each second task of the last layer of the AND-OR tree into the low-level policy network to obtain the low-level behavior output by the low-level policy network;
determining, according to the low-level behavior, the number of steps the low-level policy network learns to complete the second task;
and if the number of steps is within a preset step-number range, retaining the OR node where the second task is located in the AND-OR tree to form the partition tree.
In some embodiments of the present disclosure, the low-level behavior output by the low-level policy network is:
$$a^{lo} = \mu^{lo}(s, g; \theta^{lo}) + \epsilon$$
where $a^{lo}$ is the low-level behavior, $\mu^{lo}$ is the low-level policy network, $s$ is the current initial state of the OR node, $g$ is the current position target of the OR node, $\theta^{lo}$ are the parameters of the low-level policy network $\mu^{lo}$, and $\epsilon$ is random noise with a mean value of 0.
In some embodiments of the disclosure, determining the target trajectory of the aircraft according to the partition tree includes:
traversing the partition tree to form a planning tree;
forming a solution tree according to the planning tree;
and determining the target trajectory of the aircraft according to the solution tree.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
the original task is decomposed in an AND or tree with a high-level policy network to form an AND or tree alternating with nodes and or nodes layer by layer. In each or node, determining a sub-target through a high-level strategy network, wherein the distance difference between the sub-target and two ends of the or node is the smallest. And according to the nodes or the sub-targets, forming a first task of the lower layer and the nodes, and decomposing the first task to form a second task of the lower layer or the nodes. When the last layer of the AND OR tree is reached, each second task of the last layer is learned by a low-layer strategy network to form a partition tree. And determining a target track of the aircraft according to the division tree, and planning the aircraft track. The complexity of each second task in each layer or node is approximately the same as the distance difference between the sub-target and the two ends of the or node is the smallest when the sub-target is determined. When each second task is processed in parallel, the speed of processing the second task can be increased, so that the efficiency of aircraft track planning is improved. Meanwhile, the original tasks are decomposed through the high-level strategy network and the low-level strategy network to form a partition tree, so that the complexity of aircraft track planning is reduced, and the reliability of aircraft track planning is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the disclosure, and do not constitute a limitation on the disclosure. In the drawings:
FIG. 1 is a schematic diagram of a hierarchical structure shown in an exemplary embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of planning an aircraft trajectory shown in a first exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an AND-OR tree structure shown in a first exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an AND-OR tree structure shown in a second exemplary embodiment of the present disclosure;
FIG. 5 is a flow chart of a method of planning an aircraft trajectory shown in a second exemplary embodiment of the present disclosure;
FIG. 6 is a flow chart of a method of planning an aircraft trajectory shown in a third exemplary embodiment of the present disclosure;
FIG. 7 is a flow chart of a method of planning an aircraft trajectory shown in a fourth exemplary embodiment of the present disclosure;
FIG. 8 is a flow chart of a method of planning an aircraft trajectory shown in a fifth exemplary embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a solution tree shown in an exemplary embodiment of the present disclosure;
FIG. 10 is a flowchart of a method of planning an aircraft trajectory, as shown in a sixth exemplary embodiment of the present disclosure;
FIG. 11 is a flow chart of a method of planning an aircraft trajectory shown in a seventh exemplary embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a simulation environment shown in an exemplary embodiment of the present disclosure;
fig. 13 is a schematic plan view of an aircraft trajectory shown in an exemplary embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. Based on the embodiments in this disclosure, all other embodiments that a person of ordinary skill in the art would obtain without making any inventive effort are within the scope of protection of this disclosure. It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be arbitrarily combined with each other. In the embodiment of the disclosure, the term "and/or" is an association relationship describing an object, which means that three relationships may exist. For example, a and/or B, represent: a or B, or, A and B. In the embodiment of the disclosure, learning refers to a process of obtaining behaviors by a policy network, solving refers to a process of obtaining an estimated value by an estimated value network, and training refers to a process of updating parameters of each network.
At present, the task of air traffic management is to maintain and promote air traffic safety and keep air routes flowing smoothly. As air traffic grows and airspace resources remain limited, managing air traffic becomes more complex and the probability of flight conflicts gradually rises. Smooth air-route traffic is mainly achieved by planning aircraft trajectories, so it is particularly important to plan one or more optimal flight routes from the initial position to the target position in advance, taking safety, endurance, range and other factors into account. In practical scenarios, geographic, meteorological and military threats, and even prohibited airspace, often lie between the initial position and the target position, so planning an aircraft trajectory is a sequential decision problem in a complex environment.
Most standard planning methods for sequential decision problems (e.g., Monte Carlo planning, Monte Carlo tree search and dynamic programming) share a fixed assumption: sequential planning. These methods start from an initial state or a target position and then plan behavior forward or backward in time. Such methods face two challenges: first, a planning model derived from data is hard to trust over a long horizon; second, assigning credit to each individual action is difficult. These two challenges leave the agent with low reliability and sparse rewards when facing long-sequence decision problems in a sparse-reward environment. In addition, when the trajectory of an aircraft is planned with standard planning methods, the planning process is complex, so the efficiency of trajectory planning is low.
In recent years, the combination of deep neural networks and reinforcement learning has become known as deep reinforcement learning. Deep reinforcement learning combines the strong perception ability of deep learning for high-dimensional data with the policy-learning ability of reinforcement learning, and is one of the important routes toward general artificial intelligence. In reinforcement learning, rewards play the role of supervisory signals, and the agent optimizes its policy network according to the rewards. To address the problems of low reliability and sparse rewards, a hierarchical idea is added to reinforcement learning. The essence of hierarchical reinforcement learning is to break the original task into subtasks at different levels of abstraction. Because the state space of each subtask is limited, the subtasks are solved faster than the original task, which ultimately improves the efficiency of solving the whole problem.
One commonly used class of hierarchical reinforcement learning algorithms is sub-target-based hierarchical reinforcement learning: the high-level policy proposes sub-targets, and the low-level policy learns under the given sub-targets, without the difficulty of the sub-targets being taken into account. The process of generating sub-targets by the high-level policy is not complex, whereas the behaviors generated by the low-level policy affect how efficiently the original task is completed. Therefore, when such hierarchical reinforcement learning algorithms are used to plan the trajectory of an aircraft, planning efficiency remains low.
On this basis, the present disclosure provides a method for planning an aircraft trajectory, so as to plan an optimal trajectory for the aircraft in a complex air-route traffic environment. The original task is decomposed in an AND-OR tree through a high-level policy network, splitting the long-sequence decision process into short-sequence decision processes. In each OR node of the AND-OR tree, a sub-target is determined by the high-level policy network, and a first task of the lower-level AND node is formed from the sub-target and the OR node. Decomposing the first task then yields the second tasks of the lower-level OR nodes. The complexity of the second tasks in the AND-OR tree gradually decreases as the number of decompositions increases. Learning the low-complexity second tasks with the low-level policy network reduces the complexity of low-level learning and improves the reliability of trajectory planning. Because the distance difference between the sub-target and the two ends of the OR node is smallest, the second tasks in each layer have approximately the same complexity; when they are processed in parallel, they can be handled faster, which improves the efficiency of trajectory planning.
To facilitate understanding, the hierarchical reinforcement learning model of the present disclosure is constructed first. The hierarchical reinforcement learning model comprises an environment model, a hierarchical architecture, a hierarchical layer model and a partition tree. The environment model of the present disclosure adds, on top of a Markov decision process (MDP), a set of targets (including position targets and sub-targets) that the agent is expected to learn, each target being a state or a set of states. A Markov decision process together with a set of targets is defined as a goal-conditioned Markov decision process (G-MDP), written as the tuple $\langle S, G, A, R, \gamma \rangle$, where $S$ describes the states of the agent, $G$ describes the targets of the agent, $A$ describes the behaviors the agent takes according to the proposed targets, $R$ is the reward value the agent obtains from the environment for its behavior, and $\gamma \in [0,1]$ is the discount rate used to compute the cumulative reward. In a goal-conditioned Markov decision process, when the agent proposes a target according to the environment and takes a current behavior $a_t$, it obtains a reward value $r_t$; the agent then interacts with the environment according to the current behavior $a_t$ and reaches the next state $s_{t+1}$. The basic property of a Markov decision process is the Markov property: for a stochastic process, given the current state and all past states, the conditional probability distribution of the next state depends only on the current state and not on the past states, i.e.:
$$P(s_{t+1}\mid s_t, s_{t-1}, \ldots, s_1) = P(s_{t+1}\mid s_t)$$
where $P$ is the conditional probability, $s_1, \ldots, s_{t-1}$ are the historical states, $s_t$ is the current state, and $s_{t+1}$ is the next state.
It follows from the Markov property that the next state $s_{t+1}$ depends only on the current state $s_t$, so the agent does not need to consider historical states when making decisions. At the beginning of each episode of the goal-conditioned Markov decision process, a position target is selected from the target set. Solving the goal-conditioned Markov decision process means finding a goal-conditioned policy $\pi$ that maximizes the state value function $V^{\pi}$ and the behavior value function $Q^{\pi}$:
$$V^{\pi}(s_t, g_t) = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k}\,\Big|\, s_t, g_t\Big],\qquad Q^{\pi}(s_t, a_t, g_t) = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k}\,\Big|\, s_t, a_t, g_t\Big]$$
where $V^{\pi}$ is the state value function, $Q^{\pi}$ is the behavior value function, $s$ is the state, $a$ is the behavior, $g$ is the target, $\mathbb{E}_{\pi}$ is the expectation under the policy $\pi$, $\gamma^{k}$ is the discount rate of the reward $k$ steps ahead, $r_{t+k}$ is the reward value at time $t+k$, $s_t$ is the current state, $a_t$ is the current behavior, $g_t$ is the current target, and $t$ and $k$ are positive integers.
The hierarchical architecture of the present disclosure is a layered structure using nested policies, which enables the agent to learn tasks that require long sequences of primitive behaviors while each policy only needs short sequences of behaviors. As shown in Fig. 1, taking a two-layer structure as an example, when the high-level policy outputs a sub-target according to the high-level state and target, the sub-target is passed down as the target of the lower layer. The low-level policy attempts at most $S$ steps to achieve this target, where $S$ (the preset number of steps) is a user-defined hyperparameter. The low-level policy outputs primitive behaviors according to the low-level state and target. The behavior space of the high-level policy is the same as the target space (i.e., the state space) of the low-level policy, and the task is divided into shorter subtasks through the state space. Setting the behavior space of the high-level policy to the state space lets the agent simulate a transition function assumed optimal for the low-level policy network, so that the agent can learn the multi-level policy networks in parallel.
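To make the nesting concrete, the following sketch (an illustration under assumed interfaces, not code from the patent) shows a two-level rollout in which the high-level policy proposes a sub-target and the low-level policy is given at most S primitive steps to reach it; `env.step`, `high_policy`, `low_policy` and the tolerance `eps` are hypothetical names.
```python
import numpy as np

def hierarchical_rollout(env, high_policy, low_policy, s0, goal,
                         S=20, eps=0.5, max_subgoals=50):
    """Two-level rollout: the high-level policy proposes sub-targets in the state
    space; the low-level policy gets at most S primitive steps per sub-target."""
    state = np.asarray(s0, dtype=float)
    goal = np.asarray(goal, dtype=float)
    trajectory = [state]
    for _ in range(max_subgoals):
        if np.linalg.norm(state - goal) <= eps:          # position target reached
            break
        sub_target = high_policy(state, goal)            # high-level behavior = sub-target
        for _ in range(S):                               # low level tries at most S steps
            action = low_policy(state, sub_target)       # primitive (low-level) behavior
            state = env.step(state, action)
            trajectory.append(state)
            if np.linalg.norm(state - sub_target) <= eps:
                break                                    # sub-target reached, ask for the next one
    return trajectory
```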
The framework of the hierarchical model of the present disclosure is divided into a high-level goal-conditioned Markov decision process (the high level) and a low-level goal-conditioned Markov decision process (the low level). The high-level and low-level goal-conditioned Markov decision processes can be regarded as independent, and each is a finite-horizon episodic Markov decision process. At the beginning of an episode, the agent is first given an initial state $s_0$ and a position target $g$, where $g$ is the goal to be achieved by the high-level goal-conditioned Markov decision process. The high-level policy $\mu^{hi}$ then proposes a high-level behavior $a^{hi}$ and finally obtains a reward value $r^{hi}$. According to the framework of the hierarchical model, the sub-target $g^{lo}$ of the low-level goal-conditioned Markov decision process is in fact the high-level behavior $a^{hi}$ proposed by the high-level policy $\mu^{hi}$, namely:
$$g^{lo} = a^{hi}$$
The low-level policy $\mu^{lo}$ then proposes a low-level behavior $a^{lo}$ and obtains a reward value $r^{lo}$.
The return values of the agent at time $t$ are:
$$R^{hi}_{t} = \sum_{i=t}^{T}\gamma^{\,i-t}\, r^{hi}_{i},\qquad R^{lo}_{t} = \sum_{i=t}^{T}\gamma^{\,i-t}\, r^{lo}_{i}$$
where $R^{hi}_{t}$ is the high-level return value at time $t$, $R^{lo}_{t}$ is the low-level return value at time $t$, $r^{hi}$ is the high-level reward value, $r^{lo}$ is the low-level reward value, and $i$, $t$ and $T$ are positive integers.
The state value functions of the high-level and low-level policies at time $t$ are:
$$V^{hi}(s_t, g_t) = \mathbb{E}\big[R^{hi}_{t}\big],\qquad V^{lo}(s_t, g^{lo}_t) = \mathbb{E}\big[R^{lo}_{t}\big]$$
where $V^{hi}$ is the state value function of the high-level policy and $V^{lo}$ is the state value function of the low-level policy.
The behavior value functions of the high-level and low-level policies at time $t$ are:
$$Q^{hi}(s_t, g_t, a^{hi}_t) = \mathbb{E}\big[R^{hi}_{t}\big],\qquad Q^{lo}(s_t, g^{lo}_t, a^{lo}_t) = \mathbb{E}\big[R^{lo}_{t}\big]$$
where $Q^{hi}$ is the behavior value function of the high-level policy and $Q^{lo}$ is the behavior value function of the low-level policy.
Whenever the agent completes the given target, the reward value is 1 and the episode ends; otherwise the reward value is 0.
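A minimal sketch of this sparse reward rule, under the assumption that states and targets are points and that "completing the target" means coming within a hypothetical tolerance `eps`:
```python
import numpy as np

def sparse_reward(state, target, eps=0.5):
    """Reward model used at both levels: 1 when the given target is reached
    (and the episode ends), 0 otherwise."""
    reached = np.linalg.norm(np.asarray(state, dtype=float)
                             - np.asarray(target, dtype=float)) <= eps
    return (1.0 if reached else 0.0), reached            # (reward value, done flag)
```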
Illustratively, an agent in this disclosure refers to an autonomously movable software or hardware entity, which may be located within or independent of an aircraft.
Referring to fig. 2, an exemplary embodiment of the present disclosure provides a method for planning an aircraft trajectory, including:
S100, acquiring an initial state and a position target of the aircraft as an original task.
S200, decomposing the original task in an AND-OR tree with a high-level policy network; determining, in each OR node of the AND-OR tree, a sub-target whose distance difference to the two ends of the OR node is smallest, forming the first task of the lower-level AND node from the OR node and the sub-target, and decomposing each first task to form the second tasks of the lower-level OR nodes.
S300, learning each second task of the last layer of the AND-OR tree with a low-level policy network to form a partition tree, wherein the partition tree comprises a plurality of learned initial trajectories of the aircraft.
S400, determining the target trajectory of the aircraft according to the partition tree.
In this embodiment, the original task is decomposed in the AND-OR tree with a high-level policy network to form an AND-OR tree in which AND nodes and OR nodes alternate layer by layer. In each OR node, a sub-target is determined by the high-level policy network such that the distance difference between the sub-target and the two ends of the OR node is smallest. The first task of the lower-level AND node is formed from the OR node and the sub-target, and the first task is decomposed to form the second tasks of the lower-level OR nodes. When the last layer of the AND-OR tree is reached, each second task of the last layer is learned with the low-level policy network to form a partition tree. The target trajectory of the aircraft is then determined from the partition tree, completing the trajectory planning. Because the sub-target determined in each OR node minimizes the distance difference to the two ends of that node, the second tasks in each layer have approximately the same complexity; when they are processed in parallel, they can be handled faster, improving the efficiency of trajectory planning. At the same time, decomposing the original task through the high-level and low-level policy networks to form the partition tree reduces the complexity of trajectory planning and improves its reliability.
The initial state and the position target in step S100 may be, for example, the initial position and the target position of the aircraft. The initial state is denoted $s_0$ and the position target is denoted $g$, so the original task is $(s_0, g)$.
Illustratively, as shown in Figs. 3 and 4, the AND-OR tree of the present disclosure includes AND nodes and OR nodes. The root node of the AND-OR tree is an OR node, and the root node contains the original task $(s_0, g)$; one end of the root node is $s_0$ and the other end is $g$. The root node determines sub-targets through the high-level policy network; each sub-target together with the original task forms a first task of the second layer of the AND-OR tree, and each first task marks one AND node. The high-level policy network may determine several sub-targets, i.e., each layer may contain several AND nodes. Each AND node of the second layer is decomposed into two OR nodes located in the third layer of the AND-OR tree, and each such OR node contains the sub-target of its AND node and one end of the original task. In this way, the decomposition of OR nodes and AND nodes is repeated until the first preset number of layers is reached. Then, between the first and second preset numbers of layers, the low-level policy network is added to learn the OR nodes in the deepest layer of the AND-OR tree. When the learning condition is not met, the high-level policy network decomposes the task again, until the second preset number of layers of the AND-OR tree (i.e., the last layer of the AND-OR tree) is reached. The nodes between the first and second preset numbers of layers of the AND-OR tree are OR nodes; some OR nodes contain only sub-targets, some contain the initial state and a sub-target, and some contain the position target and a sub-target. The deepest layer of the AND-OR tree refers to the last layer of the currently constructed AND-OR tree.
Illustratively, determining in each OR node the sub-target with the smallest distance difference to the two ends of the OR node in step S200 refers to the sub-targets that, among all sub-targets available under the current high-level policy network, have the smallest distance difference to the two ends of the OR node. Since several sub-targets can be determined under the current high-level policy network, each sub-target lies between the two ends of the OR node. Selecting, among all sub-targets, those with the smallest distance difference to the two ends of the OR node increases the speed at which the second tasks are processed and thus improves the efficiency of trajectory planning. Through subsequent training of the high-level policy network, the proposed sub-targets move ever closer to the midpoint between the two ends of the OR node, i.e., the distance difference between the sub-target and the two ends of the OR node approaches, or even equals, 0.
Illustratively, in the AND-OR tree and the partition tree, the different layers all use the same high-level policy network and the same low-level policy network.
Illustratively, the initial trajectories in step S300 are aircraft trajectories that satisfy the air-traffic requirements, including trajectories of different lengths between the initial state and the position target. The target trajectory in step S400 is one or more trajectories that, while satisfying the air-traffic requirements, additionally satisfy conditions such as minimum distance, minimum number of turns and/or minimum crossing of boundary lines between the initial state and the position target. The target trajectory is one or more of the initial trajectories.
In one embodiment, as shown in Fig. 5, decomposing the original task in the AND-OR tree with the high-level policy network in step S200 comprises:
S210, taking the original task as the root node of the AND-OR tree, wherein the root node is the OR node of the first layer of the AND-OR tree.
The following first process is executed cyclically until the number of layers of the AND-OR tree reaches a first preset number of layers:
S220, inputting the OR node into the high-level policy network to obtain the high-level behavior output by the high-level policy network, wherein the distance difference between the high-level behavior and the two ends of the OR node is smallest.
S230, determining the high-level behavior as the sub-target.
S240, forming the first task of the lower-level AND node from the OR node and the sub-target.
S250, decomposing each first task to form the second tasks of the lower-level OR nodes, wherein each second task contains the sub-target.
In this embodiment, during construction of the AND-OR tree, the original task is taken as the root node of the AND-OR tree. The OR node is input into the high-level policy network to obtain a high-level behavior, which determines the sub-target. According to the OR node and the sub-target, the sub-target is added between the two ends of the OR node and the first task of the lower-level AND node is formed. The first tasks of the AND nodes are decomposed to form the second tasks of the lower-level OR nodes, each second task containing the sub-target. By continually decomposing the first and second tasks, an AND-OR tree is formed in which AND nodes and OR nodes alternate layer by layer. Decomposing the first and second tasks layer by layer reduces their complexity and therefore improves the reliability of aircraft trajectory planning. The second tasks in each layer have approximately the same complexity, so they can be processed faster, which improves the efficiency of trajectory planning.
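One possible realization of this first process is sketched below; the `OrNode`/`AndNode` classes and the `decompose` helper are illustrative names under assumed interfaces, not structures taken from the patent.
```python
from dataclasses import dataclass, field

@dataclass
class OrNode:
    state: tuple                                   # current initial state of the task
    target: tuple                                  # current position target of the task
    children: list = field(default_factory=list)   # AND-node children
    solvable: bool = True

@dataclass
class AndNode:
    state: tuple
    target: tuple
    sub_target: tuple                              # sub-target proposed by the high-level policy
    children: list = field(default_factory=list)   # exactly two OR-node children

def decompose(or_node, high_policy):
    """One step of the first process: propose a sub-target for an OR node, form the
    lower-level AND node, and split its first task into two second tasks."""
    sub = tuple(high_policy(or_node.state, or_node.target))   # high-level behavior
    and_node = AndNode(or_node.state, or_node.target, sub)
    and_node.children = [OrNode(or_node.state, sub),           # second task (s, sub-target)
                         OrNode(sub, or_node.target)]          # second task (sub-target, g)
    or_node.children.append(and_node)
    return and_node
```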
Illustratively, the high-level policy network trains the goal-conditioned policy $\mu^{hi}$ (i.e., the high-level policy). Since $\mu^{hi}$ is not the optimal policy, it can be improved through planning. In a large-scale sparse-reward environment, the probability of reaching the position target $g$ directly from the initial state $s_0$ using primitive behaviors is very low, so this embodiment plans through a series of intermediate sub-targets. The sub-targets can guide the agent from the initial state $s_0$ to the position target $g$, and the low-level agent is trained according to the sub-targets.
Illustratively, the principle by which a first task is decomposed into the second tasks of the lower-level OR nodes in step S250 is as follows: a plan for a task is a sequence of states leading from the current initial state to the current position target, and the optimal plan for the task is the plan with the highest return.
The present disclosure first exploits the composability of plans to make the planning problem tractable. A plan for the task $(s_0, g)$ can be obtained by concatenating a plan for the task $(s_0, g^{lo})$ with a plan for the task $(g^{lo}, g)$. The plan of each task is therefore decomposed into two parts, where $(s_0, g^{lo})$ and $(g^{lo}, g)$ are subtasks of the task $(s_0, g)$; the return of the optimal composed plan and the optimal state value function are obtained from the return functions of the two corresponding sub-plans.
Thus a first task can be decomposed into two second tasks, which further raises the probability of success. By continuing the recursive decomposition, the difficulty of the target is reduced. The present disclosure specifies the second preset number of layers of the AND-OR tree as a fixed hyperparameter.
Illustratively, the OR nodes in the AND-OR tree are also referred to as task nodes. An OR node is marked by its second task, consisting of its current initial state and its current position target. The root node of the AND-OR tree is the OR node marked by the original task $(s_0, g)$.
Each non-terminal OR node has a number of AND nodes as child nodes, up to a maximum number of children per OR node that is a user-defined hyperparameter. An AND node is marked by its first task, i.e., the task of its parent OR node together with the proposed sub-target. As soon as one first task of the lower layer is completed, the OR node of the upper layer is completed.
Illustratively, the AND nodes in the AND-OR tree, also referred to as sequence nodes, are marked by the sequence of their two subtasks: the first task is decomposed into the two second tasks $(s, g^{lo})$ and $(g^{lo}, g)$.
Since $(s, g^{lo})$ and $(g^{lo}, g)$ can be decomposed further, each AND node has two child nodes, each of which is an OR node. Only when both second tasks of the lower layer are completed is the AND node of the upper layer completed.
In an embodiment, when the number of layers of the AND-OR tree reaches the first preset number of layers, decomposing the original task in the AND-OR tree with the high-level policy network in step S200 further includes:
cyclically executing the following second process until the number of layers of the AND-OR tree reaches a second preset number of layers:
inputting each second task in the deepest layer of the AND-OR tree into the low-level policy network to obtain the low-level behavior output by the low-level policy network;
determining, according to the low-level behavior, the number of steps the low-level policy network learns to complete the second task;
if the number of steps is within the preset step-number range, retaining the OR node where the second task is located in the AND-OR tree;
if the number of steps is outside the preset step-number range, executing the first process again.
In this embodiment, when the number of layers of the AND-OR tree lies between the first and second preset numbers of layers, the AND-OR tree already has a certain structure, and each second task has been simplified compared with the original task. The second task is input into the low-level policy network, the low-level behavior is output, and the number of steps learned to complete the second task is determined. If the number of steps is within the preset step-number range, the low-level policy network can complete the second task as expected, and the OR node where the second task is located is retained in the AND-OR tree. If the number of steps is outside the preset step-number range, the second task is still too complex and must be decomposed again through the high-level policy network. By retaining, between the first and second preset numbers of layers, the OR nodes whose second tasks have low complexity so that they need not be decomposed again, the complexity of constructing the AND-OR tree is reduced.
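A sketch of this feasibility check, reusing the illustrative `OrNode` class from the earlier sketch; `env.step`, `low_policy` and the step budget `max_steps` are assumed interfaces standing in for the preset step-number range.
```python
import numpy as np

def try_second_task(or_node, low_policy, env, max_steps=20, eps=0.5):
    """Second-process check: let the low-level policy attempt the second task and
    count the steps. A task finished within the preset step range keeps its OR node
    as a leaf; otherwise the node is handed back to the high-level policy."""
    state = np.asarray(or_node.state, dtype=float)
    target = np.asarray(or_node.target, dtype=float)
    for step in range(1, max_steps + 1):
        state = env.step(state, low_policy(state, target))
        if np.linalg.norm(state - target) <= eps:
            return True, step            # within the preset step range: keep the node
    return False, max_steps              # outside the range: decompose again
```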
In one embodiment, as shown in Fig. 6, learning each second task of the last layer of the AND-OR tree with the low-level policy network to form the partition tree in step S300 includes:
S310, inputting each second task of the last layer of the AND-OR tree into the low-level policy network to obtain the low-level behavior output by the low-level policy network.
S320, determining, according to the low-level behavior, the number of steps the low-level policy network learns to complete the second task.
S330, if the number of steps is within the preset step-number range, retaining the OR node where the second task is located in the AND-OR tree to form the partition tree.
In this embodiment, since the nodes of the last layer of the AND-OR tree are OR nodes, their second tasks can be learned after being input into the low-level policy network. Through the low-level behaviors output by the low-level policy network, it is determined whether the number of steps the low-level policy network needs to complete the second task meets expectations. If the number of steps is within the preset step-number range, the low-level policy network can complete the second task as expected, and the OR node where the second task is located is retained in the AND-OR tree to form the partition tree. By checking whether the OR nodes of the last layer can be learned by the low-level policy network, a partition tree that reflects the aircraft trajectory planning is formed and problematic trajectories are eliminated, which improves the effectiveness of trajectory planning. At the same time, inputting only low-complexity second tasks into the low-level policy network reduces the time required for planning and thus improves planning efficiency.
In one embodiment, after the number of steps the low-level policy network learns to complete the second task is determined according to the low-level behavior in step S320, learning each second task of the last layer of the AND-OR tree with the low-level policy network in step S300 to form the partition tree further includes:
if the number of steps is outside the preset step-number range, determining the OR node where the second task is located as an unsolvable node, and outputting an estimated value through the low-level estimation network.
In this embodiment, when the number of steps is outside the preset step-number range, the second task cannot meet the requirements of trajectory planning, and the OR node where the second task is located is determined to be an unsolvable node and is not retained. Removing nodes that do not meet the requirements reduces the complexity of the partition tree and improves its extensibility.
Illustratively, as shown in Fig. 7, the overall process of constructing the partition tree is as follows:
S500, giving the initial state and the position target, and obtaining the original task as the root node.
S510, taking the original task as the input of the high-level policy network, and outputting the high-level behavior.
S520, obtaining the sub-target from the high-level behavior, and forming the first task of the lower-level AND node from the OR node and the sub-target.
S530, decomposing the first task to form the two second tasks of the two lower-level OR nodes.
S540, judging whether the number of layers of the AND-OR tree reaches the first preset number of layers. If yes, go to step S550. If not, take each second task instead of the original task as the input of the high-level policy network and return to step S510.
S550, judging whether the number of layers of the AND-OR tree reaches the second preset number of layers. If yes, go to step S570. If not, go to step S560.
S560, inputting the second tasks in the deepest layer of the AND-OR tree into the low-level policy network to obtain the low-level behavior output by the low-level policy network.
S561, determining, according to the low-level behavior, the number of steps the low-level policy network learns to complete the second task.
S562, judging whether the number of steps is within the preset step-number range. If yes, go to step S563. If not, take each second task instead of the original task as the input of the high-level policy network and return to step S510.
S563, retaining the OR node where the second task is located in the AND-OR tree and no longer decomposing it.
S570, inputting each second task of the last layer of the AND-OR tree into the low-level policy network to obtain the low-level behavior output by the low-level policy network.
S571, determining, according to the low-level behavior, the number of steps the low-level policy network learns to complete the second task.
S572, if the number of steps is within the preset step-number range, retaining the OR node where the second task is located in the AND-OR tree to form the partition tree.
S573, if the number of steps is outside the preset step-number range, determining the OR node where the second task is located as an unsolvable node.
When a second task replaces the original task as the input of the high-level policy network, it may be the second task of an OR node in the deepest layer of the tree, the second task of a non-terminal OR node in the tree, or the second task of a terminal OR node in the tree.
For example, if the number of layers of the AND-OR tree lies between the first and second preset numbers of layers and every OR node is a node that is no longer decomposed, the expansion may be restarted from the root node, and the nodes that are no longer decomposed may be decomposed further.
Illustratively, the reinforcement learning network architecture is built in advance, before the initial state and position target of the aircraft are acquired as the original task in step S100. First, a reinforcement learning network architecture (based on the environment model) is built for each layer of the AND-OR tree; each layer uses the networks of the Deep Deterministic Policy Gradient (DDPG) algorithm to realize policy training and node estimation.
The high-level policy network is the network for training the high-level policy, and the low-level policy network is the network for training the low-level policy. The current high-level policy network, current low-level policy network, current high-level estimation network and current low-level estimation network are used in the processes of constructing the partition tree and of traversing the partition tree to form the planning tree. The target high-level policy network, target low-level policy network, target high-level estimation network and target low-level estimation network are obtained by training the corresponding current networks to update their parameters, and after updating they are used as the current high-level policy network, current low-level policy network, current high-level estimation network and current low-level estimation network, respectively.
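A minimal PyTorch sketch of these eight networks, with current policy (actor) and estimation (critic) networks for each level plus target copies kept by soft update; the layer sizes, dimensions and update rate are assumptions, not values from the patent.
```python
import copy
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(256, 256), out_act=None):
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.ReLU()]
        d = h
    layers.append(nn.Linear(d, out_dim))
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

state_dim = goal_dim = 2          # assumed 2-D position state / target space
hi_action_dim = goal_dim          # high-level behavior space = state space (sub-targets)
lo_action_dim = 2                 # assumed dimension of the primitive behavior

# current networks: policy (actor) and estimation (critic) for each level
policy_hi = mlp(state_dim + goal_dim, hi_action_dim, out_act=nn.Tanh())
critic_hi = mlp(state_dim + goal_dim + hi_action_dim, 1)
policy_lo = mlp(state_dim + goal_dim, lo_action_dim, out_act=nn.Tanh())
critic_lo = mlp(state_dim + goal_dim + lo_action_dim, 1)

# target networks start as copies of the current ones and track them by soft update
targets = {name: copy.deepcopy(net) for name, net in
           [("policy_hi", policy_hi), ("critic_hi", critic_hi),
            ("policy_lo", policy_lo), ("critic_lo", critic_lo)]}

def soft_update(target, source, tau=0.005):
    with torch.no_grad():
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.mul_(1.0 - tau).add_(tau * sp)
```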
According to the framework of the hierarchical model and the DDPG algorithm, the sub-target is determined by the high-level policy network $\mu^{hi}$, namely:
$$a^{hi} = \mu^{hi}(s, g; \theta^{hi}) + \epsilon$$
where $a^{hi}$ is the high-level behavior, $s$ is the current initial state of the OR node, $g$ is the current position target of the OR node, $\theta^{hi}$ are the parameters of the high-level policy network $\mu^{hi}$, and $\epsilon$ is random noise with a mean value of 0.
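A small sketch of this sub-target proposal, assuming the high-level policy network is a module that takes the concatenated (s, g) vector; the noise scale is a hypothetical hyperparameter.
```python
import torch

def propose_sub_target(policy_hi, state, goal, noise_std=0.1):
    """High-level behavior a_hi = mu_hi(s, g; theta_hi) + eps, with eps zero-mean
    Gaussian exploration noise (the noise scale is an assumed hyperparameter)."""
    s = torch.as_tensor(state, dtype=torch.float32)
    g = torch.as_tensor(goal, dtype=torch.float32)
    with torch.no_grad():
        a = policy_hi(torch.cat([s, g]))     # (s, g) are concatenated, as described above
    return a + noise_std * torch.randn_like(a)
```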
The high-level policy $\mu^{hi}$ is measured by the function $J(\theta^{hi})$, expressed as:
$$J(\theta^{hi}) = \mathbb{E}_{(s,g)\sim\rho}\Big[Q^{hi}\big(s, g, \mu^{hi}(s, g; \theta^{hi})\big)\Big]$$
where $(s, g)$ is the combination of the current initial state and the current position target, and $\mathbb{E}_{(s,g)\sim\rho}$ is the expectation over their joint distribution. In actual operation, $(s, g)$ is obtained by concatenating the vectors $s$ and $g$.
The optimal policy is the policy that maximizes the function $J(\theta^{hi})$. The function is updated by stochastic gradient ascent, using the gradient $\nabla_{\theta^{hi}} J$:
$$\nabla_{\theta^{hi}} J \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_{a} Q^{hi}(s_i, g_i, a)\Big|_{a=\mu^{hi}(s_i, g_i;\theta^{hi})}\; \nabla_{\theta^{hi}} \mu^{hi}(s_i, g_i; \theta^{hi})$$
where $\nabla_{\theta^{hi}} J$ is the gradient of the function $J$ measuring the high-level policy network $\mu^{hi}$, $\nabla_{a} Q^{hi}$ is the gradient of the high-level estimation network $Q^{hi}$ with respect to the behavior, $\rho$ is the joint distribution of the current initial state and the current position target, $\theta^{hi}$ are the parameters of the high-level policy network, $s_i$ is the $i$-th current initial state, $g_i$ is the $i$-th current position target, $N$ is the minimum number of experience-pool samples taken when updating the parameters of the high-level policy network, and $i$ is a positive integer.
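A sketch of this stochastic-gradient-ascent update in PyTorch, assuming `optimizer` holds only the actor parameters and `batch` is a minibatch of N (state, target) pairs sampled from the experience pool; these names are illustrative.
```python
import torch

def update_high_level_actor(policy_hi, critic_hi, optimizer, batch):
    """Stochastic gradient ascent on J(theta_hi): maximize the mean critic value
    Q_hi(s_i, g_i, mu_hi(s_i, g_i)) over the N sampled (s_i, g_i) pairs."""
    s, g = batch["state"], batch["goal"]          # tensors of shape (N, dim)
    a = policy_hi(torch.cat([s, g], dim=-1))      # mu_hi(s_i, g_i; theta_hi)
    q = critic_hi(torch.cat([s, g, a], dim=-1))   # Q_hi(s_i, g_i, a)
    loss = -q.mean()                              # ascent on J == descent on -J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                              # updates only the actor parameters
    return -loss.item()                           # current estimate of J
```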
Through the above method, the sub-targets can be determined by the high-level policy network.
Illustratively, the low-level behavior obtained through the low-level policy network $\mu^{lo}$ is:
$$a^{lo} = \mu^{lo}(s, g; \theta^{lo}) + \epsilon$$
where $a^{lo}$ is the low-level behavior, $\theta^{lo}$ are the parameters of the low-level policy network $\mu^{lo}$, and $\epsilon$ is random noise with a mean value of 0.
Illustratively, when the sub-target is determined through the high-level policy network, it is desirable that the sub-target be equidistant from the two ends of the OR node, so that the two decomposed second tasks have the same complexity. Specifically, the training of the sub-target is assisted by adding a distance constraint to the high-level behavior (i.e., the sub-target) output by the high-level policy network. The behavior output by the high-level policy network is a sub-target at the "midpoint" between the current initial state $s$ and the current position target $g$, and the sub-target has the same dimension as the current initial state $s$ and the current position target $g$ on each component.
The distance of the sub-target output by the high-level policy network from the current initial state $s$ is denoted $d^{s}$, and its distance from the current position target $g$ is denoted $d^{g}$.
In one embodiment, placing the sub-target at the "midpoint" between the current initial state $s$ and the current position target $g$ can be achieved through two constraints: minimizing $|d^{s} - d^{g}|$, i.e., the distance difference between the sub-target and the two ends of the OR node is smallest; and minimizing $d^{s} + d^{g}$, i.e., the sum of the distances between the sub-target and the two ends of the OR node is smallest.
In this embodiment, minimizing the distance difference between the sub-target and the two ends of the OR node keeps the difference in complexity among the first tasks below the OR node small, which improves the efficiency of processing the first tasks in parallel. Minimizing the sum of the distances between the sub-target and the two ends of the OR node shortens the path of the aircraft trajectory and thus optimizes the trajectory.
In one embodiment, the two constraints above are minimized through the loss function of the high-level policy network. The loss function of the high-level policy network is calculated from a main loss function and an auxiliary loss function. The parameters of the main loss function and of the overall loss function are updated by stochastic gradient ascent, while the parameters of the auxiliary loss function are updated by stochastic gradient descent.
In this embodiment, the loss function assists the training of the high-level policy network so that the sum and the difference of the distances between the proposed sub-target and the two ends of the OR node are reduced as much as possible and the sub-target moves toward the midpoint between the two ends of the OR node. Training the sub-targets proposed by the high-level policy network with this loss function shortens the time required for trajectory planning and therefore improves the efficiency of aircraft trajectory planning.
In one embodiment, the main loss function $J(\theta^{hi})$ is given by:
$$J(\theta^{hi}) = \mathbb{E}_{(s,g)\sim\rho}\Big[Q^{hi}\big(s, g, \mu^{hi}(s, g; \theta^{hi})\big)\Big] \approx \frac{1}{N}\sum_{i=1}^{N} Q^{hi}\big(s_i, g_i, \mu^{hi}(s_i, g_i; \theta^{hi})\big)$$
the auxiliary loss function $L_{aux}(\theta^{hi})$ is given by:
$$L_{aux}(\theta^{hi}) = \frac{1}{N}\sum_{i=1}^{N}\Big[\lambda_{1}\big(d_{i}^{s}+d_{i}^{g}\big)+\lambda_{2}\big|d_{i}^{s}-d_{i}^{g}\big|\Big]$$
and the loss function $L(\theta^{hi})$ is given by:
$$L(\theta^{hi}) = J(\theta^{hi}) - \omega\, L_{aux}(\theta^{hi})$$
where $J$ is the function measuring the high-level policy network $\mu^{hi}$, $\nabla_{\theta^{hi}} J$ is its gradient, $\rho$ is the joint distribution of the current initial state and the current position target, $\theta^{hi}$ are the parameters of the high-level policy network $\mu^{hi}$, $N$ is the minimum number of experience-pool samples taken when updating the parameters of the high-level policy network, $\lambda_{1}$ is the first weight parameter, $\lambda_{2}$ is the second weight parameter, $d_{i}^{s}$ is the distance of the $i$-th sub-target from the current initial state, $d_{i}^{g}$ is the distance of the $i$-th sub-target from the current position target, $\omega$ is the weight of the auxiliary loss function, and $i$ is a positive integer. The weight $\omega$ of the auxiliary loss function may be, for example, 0.3.
Illustratively, $\lambda_{1}$ is the weight given to the sum of the distances from the sub-target to the current initial state and to the current position target, and $\lambda_{2}$ is the weight given to the difference between those two distances.
In this embodiment, within the loss function, the auxiliary loss function trains the high-level policy network to minimize the sum and the difference of the distances between the sub-target and the two ends of the OR node. Setting the first weight parameter to a smaller value prevents the aircraft from falling into a suboptimal trajectory, and setting the second weight parameter to a larger value pushes the sub-target closer to the midpoint between the two ends of the OR node. Adjusting the first and second weight parameters in the auxiliary loss function optimizes the sub-targets and the aircraft trajectory, thereby improving the efficiency of trajectory planning.
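A sketch of how the main term, the auxiliary term and the weights λ1, λ2 and ω could be combined in the actor update, following the reconstruction above; the exact combination used by the authors may differ, and the default weights shown are only consistent with the qualitative guidance in this paragraph.
```python
import torch

def high_level_actor_loss(policy_hi, critic_hi, s, g,
                          lam1=0.1, lam2=0.9, omega=0.3):
    """Quantity to minimize: the negative of L = J - omega * L_aux, where the
    auxiliary term penalizes the sum and the difference of the sub-target's
    distances to the two ends of the OR node."""
    sub = policy_hi(torch.cat([s, g], dim=-1))                # proposed sub-targets
    j = critic_hi(torch.cat([s, g, sub], dim=-1)).mean()      # main term J(theta_hi)
    d_s = torch.norm(sub - s, dim=-1)                         # distance to current initial state
    d_g = torch.norm(sub - g, dim=-1)                         # distance to current position target
    aux = (lam1 * (d_s + d_g) + lam2 * (d_s - d_g).abs()).mean()
    return -(j - omega * aux)
```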
In one embodiment, as shown in Fig. 8, determining the target trajectory of the aircraft in step S400 according to the partition tree comprises:
S410, traversing the partition tree to form a planning tree.
S420, forming a solution tree according to the planning tree.
S430, determining the target trajectory of the aircraft according to the solution tree.
In this embodiment, by traversing the partition tree, the unsolvable nodes in the partition tree are pruned to form a planning tree containing multiple trajectories. According to the planning tree, the nodes of the optimal trajectory are retained to form the solution tree, and the target trajectory of the aircraft is determined from the solution tree. Determining the solution tree by gradually optimizing the partition tree removes non-optimal trajectories and optimizes the aircraft trajectory. At the same time, pruning the partition tree reduces the structural complexity of the solution tree and therefore improves its extensibility.
Illustratively, traversing the partition tree in step S410 to form a planning tree means that in one traversal, one planning tree is formed with all selected nodes and/or nodes. And after the one-time traversal is finished, updating the state values of each node and all the access times of the nodes through the high-level estimation network. And, starting from the end of the partition tree, back propagates to the root node. On the back propagation path, the state values of all nodes (i.e., the estimates solved by the higher-level estimation network) are updated. Wherein the state value of the node is as follows For input and output by higher-layer evaluation network +.>. The label of the high-level estimation network is the return value obtained by the agent.
Illustratively, after the planning tree is formed, it may be further expanded and updated. If the number of child nodes of an OR node in the planning tree is less than the maximum number of child nodes, step S200 is repeated to expand that node. If the number of child nodes of an OR node equals the maximum number of child nodes, a node is selected according to the tree's upper-confidence-bound rule. Once the number of samples in the experience pool exceeds the minimum number of experience-pool samples taken when updating the parameters of the higher-layer policy network, every time the training segments reach the preset segment period, the unsolvable nodes in the planning tree are pruned and the parameters of each network are updated. Every time the training segments reach twice the preset segment period, the child node of the root node with the lowest estimated value is pruned, and step S200 is repeated to expand nodes.
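Illustratively, the tree upper-confidence-bound selection mentioned above follows the familiar UCT pattern and can be sketched as below, reusing the Node record of the previous sketch; the exploration constant c is not specified in the text and is an assumption.

import math

def select_child(node: "Node", c: float = 1.0) -> "Node":
    # Pick the child maximising an upper-confidence score: exploitation via the
    # child's state value plus an exploration bonus that shrinks with visit count.
    def ucb(child: "Node") -> float:
        if child.visits == 0:
            return float("inf")  # try unvisited children first
        return child.value + c * math.sqrt(math.log(node.visits + 1) / child.visits)
    return max(node.children, key=ucb)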
Illustratively, forming a solution tree according to the planning tree in step S420 means that, after step S410 has been traversed the preset number of times, the child node of the planning tree's root node with the lowest state value is pruned to form the solution tree. As shown in fig. 9, the solution tree has the following three properties: 1. the root node is in the solution tree; 2. each OR node has at most one child node in the solution tree; 3. if an AND node is in the solution tree, both of its child nodes must also be in the solution tree.
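Illustratively, the three properties can be checked mechanically. The sketch below reuses the Node record introduced earlier and assumes an is_and flag marking AND nodes; both are assumptions of the sketch rather than the embodiment's data structures.

def is_valid_solution_tree(root: "Node") -> bool:
    # Property 1: the root node is in the solution tree (it is the starting point).
    stack = [root]
    while stack:
        node = stack.pop()
        if getattr(node, "is_and", False):
            # Property 3: an AND node in the solution tree keeps both of its children.
            if len(node.children) != 2:
                return False
        elif len(node.children) > 1:
            # Property 2: an OR node keeps at most one child.
            return False
        stack.extend(node.children)
    return True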
Illustratively, determining the target trajectory of the aircraft from the solution tree in step S430 means that, once the solution tree has been formed in step S420, the original task has been fully decomposed into it; the optimal trajectory of the aircraft can therefore be determined from the first task and the second task of each layer in the solution tree.
As shown in fig. 10, a specific description is given of a method for planning an aircraft trajectory in the present embodiment:
S600, constructing an environment model. The environment model specifies the state space, the target space and the behavior space of the agent on the basis of a goal-controlled Markov decision process. The state space is the same as the target space and represents the positional state information of the agent, and the behavior space represents the direction of movement of the agent.
S610, constructing a hierarchical architecture. The hierarchical architecture is a layered structure using nested strategies: a Markov decision process controlled by the agent's high-level target and a Markov decision process controlled by its low-level target are set up, each goal-controlled Markov decision process being a finite-length episodic process. When the higher-layer policy network outputs a sub-target, the sub-target is passed down as the target of the next layer. The sub-target lies both in the behavior space of the higher-layer policy network and in the target space of the lower-layer policy network; that is, the state spaces of the higher-layer and lower-layer policy networks are the same. The behavior space of the higher-layer policy network is therefore set to the state space, i.e. the two coincide (a minimal sketch of this nested structure follows step S690).
S620, constructing a reward model. The reward model is: if the agent at any layer completes the given target at any time, the reward value is 1 and the episode ends; otherwise the reward value is 0 (see the reward function in the sketch following step S690).
S630, constructing a reinforcement learning network architecture. Each layer of the AND or tree uses networks of the deep deterministic policy gradient (DDPG) algorithm to perform policy training and node evaluation, and each layer contains 4 deep neural networks to be trained. The parameters of the DDPG algorithm are set, including the range of the network output values, the exploration probability, and the architecture of the neural networks, i.e. the number of layers, the network nodes and the activation function used by each network.
S640, constructing a loss function. Wherein a first weight parameter and a second weight parameter in the loss function are set.
S650, acquiring initial states and position targets of the aircraft as original tasks.
S660, constructing a division tree according to the original task.
S670, traversing the division tree to form a planning tree.
S680, forming a solution tree according to the planning tree.
S690, determining the target track of the aircraft according to the solution tree.
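Illustratively, the nested structure of steps S610 and S620 can be sketched in Python as follows. The callables high_policy, low_policy and env, the step budget and the sub-target threshold of 0.8 (taken from the test description below) are assumptions of this sketch rather than the exact implementation of the embodiment.

import numpy as np

def sparse_reward(state: np.ndarray, goal: np.ndarray, threshold: float) -> float:
    # Reward model of step S620: 1 when the given target is reached, 0 otherwise.
    return 1.0 if float(np.linalg.norm(state - goal)) < threshold else 0.0

def nested_rollout(state: np.ndarray, goal: np.ndarray, high_policy, low_policy,
                   env, max_low_steps: int, sub_threshold: float = 0.8) -> np.ndarray:
    # Step S610: the higher layer outputs a sub-target in the state space, which is
    # passed down as the target of the lower layer; the lower layer then acts until
    # the sub-target is reached or its step budget is spent.
    sub_goal = high_policy(state, goal)
    for _ in range(max_low_steps):
        action = low_policy(state, sub_goal)
        state = env.step(action)          # assumed environment interface returning the next state
        if sparse_reward(state, sub_goal, sub_threshold) == 1.0:
            break
    return state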
As shown in fig. 11, steps S660 to S680 include the steps of:
S700, specifying the first preset layer number and the second preset layer number of the AND or tree, and the maximum number of child nodes of each OR node.
S710, taking the original task as a root node of an AND or tree.
S720, taking the original task as input of a high-level strategy network, and outputting high-level behaviors.
S730, taking the high-level behavior as a sub-target.
S740, forming the first task of the lower-layer AND node according to the OR node and the sub-target.
S750, decomposing each first task to form the second tasks of the lower-layer OR nodes.
S760, when the number of layers of the AND or tree has not reached the first preset number of layers, taking each second task in place of the original task as the input of the high-level strategy network and returning to step S720.
S770, when the number of layers of the AND or tree reaches the first preset number of layers but has not reached the second preset number of layers, inputting each second task at the deepest layer of the AND or tree into the low-level strategy network to obtain the low-level behavior output by the low-level strategy network.
S780, determining the number of steps learned by the low-level strategy network to complete the second task according to the low-level behaviors.
S790, judging whether the step number is within a preset step number range. If yes, go to step S800. If not, each second task replaces the original task as the input of the higher-level policy network and the process returns to step S720.
S800, storing the OR node where the second task is located in the AND or tree without decomposing it further.
S810, when the number of layers of the AND or tree reaches the second preset number of layers, inputting each second task of the last layer of the AND or tree into the low-level strategy network to obtain the low-level behavior output by the low-level strategy network.
S820, determining the step number learned by the low-level strategy network to complete the second task according to the low-level behavior.
S830, if the step number is within the preset step number range, storing the OR node where the second task is located in the AND or tree to form a partition tree, and executing step S850.
S840, if the step number is outside the preset step number range, determining the OR node where the second task is located as an unsolvable node and pruning it.
S850, traversing the division tree to form a planning tree; the root node is the first node to be examined.
S860, judging whether the number of child nodes of an OR node in the planning tree is smaller than the maximum number of child nodes. If so, each second task replaces the original task as the input of the higher-level policy network and the process returns to step S720. If not, step S870 is performed.
S870, selecting a node through the tree's upper-confidence-bound rule.
S880, when the training segments reach the preset segment period, pruning the unsolvable nodes in the planning tree and updating the parameters of each network.
S890, when the training segments reach twice the preset segment period, pruning the child node of the root node with the lowest estimated value.
S900, judging whether the number of times of traversal reaches the preset number of times of traversal. If yes, go to step S910. If not, return to step S850.
S910, taking the planning tree as a solution tree.
When the parameters of each network are updated, the loss function assists in updating the higher-layer strategy network. After the solution tree is obtained, the result of the solution tree is stored in a tree-structure experience pool for auxiliary training.
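Illustratively, steps S850 to S910 can be organised into a single traversal loop as sketched below. All helper callables (expand, select_child, backpropagate, value_net, prune_unsolvable, prune_worst_root_child) are either the illustrative functions of the earlier sketches or assumed placeholders, not the exact routines of the embodiment.

def build_solution_tree(root, n_traversals: int, max_children: int, segment_period: int,
                        expand, select_child, backpropagate, value_net,
                        prune_unsolvable, prune_worst_root_child):
    # One iteration per traversal of the planning tree (steps S850-S900).
    for t in range(1, n_traversals + 1):
        node = root
        while True:
            if len(node.children) < max_children:
                expand(node)            # S860: room left, expand via the higher-layer policy
                break
            node = select_child(node)   # S870: tree upper-confidence-bound selection
        backpropagate(node, value_net)  # refresh state values and visit counts along the path
        if t % segment_period == 0:
            prune_unsolvable(root)      # S880: periodic pruning of unsolvable nodes
        if t % (2 * segment_period) == 0:
            prune_worst_root_child(root)  # S890: drop the root child with the lowest value
    return root                           # S910: the remaining planning tree is taken as the solution tree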
As shown in fig. 12 and 13, a test procedure of the aircraft trajectory planning method in the present embodiment will be described as follows:
The test environment is the ant four-road environment in MuJoCo, a physics-simulation environment that serves as an analogue of air traffic and similar scenarios.
In constructing the environment model, an initial state space is provided that comprises the position of the agent in the MuJoCo simulation environment and the ranges of the angles and velocities of all joints. The range of the state space along the x-axis and y-axis is set to (-8, 8), and the ranges of the joint angles and velocities are generated by the ant four-road environment in MuJoCo. The absolute values of the coordinates of the agent's initial state and of the position target are set to (3, 6, 5), and the two are not in the same quadrant. The position space of the sub-target is the same as the target space; the realization threshold of the position target is 0.4 and that of the sub-target is 0.8.
In constructing the hierarchical architecture, the behavior space of the higher-level policy network is set to the state space, i.e. the position vector (x, y) with each component ranging over (-8, 8).
In constructing the reward model, if the agent at any layer completes the given target at any time, the reward value is 1 and the episode ends; otherwise the reward value is 0.
In constructing the reinforcement learning network architecture, each layer of the AND or tree uses deep deterministic policy gradient (DDPG) networks for policy training and node evaluation. The range of the network output values uses a bounded V value: the output V-value function is limited to a bounded range by means of a negative sigmoid function, the lower limit of 0 reflecting the algorithm's non-negative reward values. The exploration probabilities are: a 20% probability of uniformly randomly sampling an action from the layer's action space, and an 80% probability of taking the sum of an action sampled from the layer's policy and Gaussian noise. The neural network architecture is: both the policy network and the evaluation network consist of 3 hidden layers of 64 neurons each, with ReLU activation functions.
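Illustratively, the bounded V value and the exploration scheme described above could be realised as sketched below. The bound v_max and the noise scale sigma are not given in the text and are assumptions, and the sigmoid-based squashing is only a loose rendering of the negative-sigmoid bound.

import numpy as np
from typing import Optional

def bounded_v(raw: np.ndarray, v_max: float) -> np.ndarray:
    # Squash the evaluation network's raw output into [0, v_max]; the lower
    # bound of 0 matches the non-negative reward values.
    return v_max / (1.0 + np.exp(-raw))

def explore(policy_action: np.ndarray, low: float, high: float,
            sigma: float = 0.1, rng: Optional[np.random.Generator] = None) -> np.ndarray:
    # 20%: uniform random action from the layer's action space;
    # 80%: action sampled from the layer's policy plus Gaussian noise.
    rng = rng or np.random.default_rng()
    if rng.random() < 0.2:
        return rng.uniform(low, high, size=policy_action.shape)
    return np.clip(policy_action + rng.normal(0.0, sigma, size=policy_action.shape), low, high)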
For the loss function, λ1 is initially set to 0.5 and is then decreased by 0.01n (n ∈ Z) over subsequent traversals while the results are observed; λ2 is initially set to 0.5 and is then increased by 0.01n (n ∈ Z) over subsequent traversals while the results are observed.
In constructing the partition tree, the first preset layer number of the AND or tree is set to 3, the second preset layer number to 7, the maximum number of child nodes of each OR node to 5, the preset number of steps for each layer to complete its target to 5, and the preset segment period to 5. The network parameters are updated 40 times after each traversal of the partition tree, and the minimum number N of experience-pool samples taken when updating the parameters of the higher-level policy network is set. After every preset segment period of traversals, the child node of the root node with the lowest state value is pruned.
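Illustratively, the hyperparameters of this test run can be collected into a single configuration record. The key names are editorial, and any value not stated in the text (such as the minimum number N of experience-pool samples) is left unset.

test_config = {
    "first_preset_layer_number": 3,      # depth at which low-layer learning begins
    "second_preset_layer_number": 7,     # last layer of the AND or tree
    "max_children_per_or_node": 5,
    "preset_steps_per_layer": 5,         # step budget for a layer to complete its target
    "preset_segment_period": 5,          # traversals between pruning/update rounds
    "updates_per_traversal": 40,
    "lambda1_init": 0.5,                 # later decreased by 0.01*n, n in Z
    "lambda2_init": 0.5,                 # later increased by 0.01*n, n in Z
    "aux_loss_weight_example": 0.3,      # example value given in the description
    "position_target_threshold": 0.4,
    "sub_target_threshold": 0.8,
    "min_experience_pool_samples": None, # N: value not given in the text
}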
The above embodiments may be implemented individually or in various combinations, and such variations fall within the scope of the present disclosure.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices) and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In this disclosure, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of additional identical elements in an article or apparatus that comprises the element.
While the preferred embodiments of the present disclosure have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the disclosure.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present disclosure without departing from the spirit or scope of the disclosure. Thus, given that such modifications and variations of the present disclosure fall within the scope of the claims of the present disclosure and their equivalents, the intent of the present disclosure is to encompass such modifications and variations as well.

Claims (7)

1. A method of planning an aircraft trajectory, the method comprising:
acquiring an initial state and a position target of an aircraft as an original task;
decomposing the original task in an AND or tree by a high-level strategy network: determining, in each OR node of the AND or tree, a sub-target having the smallest difference between its distances to the two ends of the OR node, forming a first task of a lower-layer AND node according to the OR node and the sub-target, and decomposing each first task to form second tasks of lower-layer OR nodes;
Learning each second task of the last layer in the AND or tree by using a low-layer strategy network to form a division tree, wherein the division tree comprises a plurality of initial tracks of the aircraft obtained through learning;
determining a target track of the aircraft according to the partition tree;
wherein the decomposing the original task in an AND or tree with a high-level policy network comprises:
taking the original task as a root node of the AND or tree, wherein the root node is the OR node of the first layer of the AND or tree;
the following first process is circularly executed until the number of layers of the AND or tree reaches a first preset number of layers:
inputting the OR node into the high-level policy network to obtain the high-level behavior output by the high-level policy network, wherein the difference between the distances from the high-level behavior to the two ends of the OR node is the smallest;
determining the high-level behavior as the sub-target;
forming the first task of the lower-layer AND node according to the OR node and the sub-target;
decomposing the first tasks to form second tasks of the lower-layer OR nodes, wherein each second task comprises the sub-target;
When the number of layers of the and or tree reaches the first preset number of layers, decomposing the original task in the and or tree by using a high-layer strategy network, and further comprising:
and circularly executing the following second process until the number of layers of the AND or tree reaches a second preset number of layers:
inputting each second task in the deepest layer in the AND or tree into the low-level strategy network to obtain the low-level behavior output by the low-level strategy network;
determining the number of steps learned by the low-level policy network to complete the second task according to the low-level behavior;
if the step number is within the preset step number range, storing the OR node where the second task is located in the AND or tree;
if the step number is out of the preset step number range, executing the first process again;
the second preset layer number is the last layer of the AND or tree.
2. The method of planning an aircraft trajectory according to claim 1, wherein the sum of the distances of the sub-target from the two ends of the OR node is minimal.
3. The method of claim 2, wherein a loss function of the higher-level strategy network is used to assist training of the higher-level strategy network, the loss function being calculated from a primary loss function and an auxiliary loss function, the parameter update method of the primary loss function being a stochastic gradient ascent method and the parameter update method of the auxiliary loss function being a stochastic gradient descent method.
4. The method of planning an aircraft trajectory according to claim 3, characterized in that the formula of the primary loss function is as follows:
the formula of the auxiliary loss function is as follows:
the formula of the loss function is as follows:
wherein J(θ_h) is the function measuring the higher-layer policy network π_h, ∇J(θ_h) is the gradient thereof, ρ is the joint distribution of the current initial state and the current position target, θ_h is the parameter of the higher-layer policy network π_h, N is the minimum number of experience-pool samples taken when updating the parameter θ_h of the higher-layer policy network, λ1 is the first weight parameter, λ2 is the second weight parameter, d_i^s is the distance of the i-th sub-target from the current initial state, d_i^g is the distance of the i-th sub-target from the current position target, w is the weight of the auxiliary loss function, and i is a positive integer.
5. The method of claim 1, wherein the high-level behavior output by the high-level strategy network is as follows: a_h = π_h(s, g | θ_h) + ε,
wherein a_h is the high-level behavior, π_h is the high-level policy network, s is the current initial state of the OR node, g is the current position target of the OR node, θ_h is the parameter of the high-level policy network π_h, and ε is random noise with a mean value of 0.
6. The method of claim 1, wherein the low-level behavior output by the low-level strategy network is as follows: a_l = π_l(s, g | θ_l) + ε,
wherein a_l is the low-level behavior, π_l is the low-level policy network, s is the current initial state of the OR node, g is the current position target of the OR node, θ_l is the parameter of the low-level policy network π_l, and ε is random noise with a mean value of 0.
7. The method of planning a trajectory of an aircraft according to any one of claims 1 to 6, wherein said determining a target trajectory of the aircraft from the partition tree comprises:
traversing the division tree to form a planning tree;
forming a solution tree according to the planning tree;
and determining the target track of the aircraft according to the solution tree.
CN202310540315.4A 2023-05-15 2023-05-15 Aircraft trajectory planning method Active CN116307331B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310540315.4A CN116307331B (en) 2023-05-15 2023-05-15 Aircraft trajectory planning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310540315.4A CN116307331B (en) 2023-05-15 2023-05-15 Aircraft trajectory planning method

Publications (2)

Publication Number Publication Date
CN116307331A CN116307331A (en) 2023-06-23
CN116307331B true CN116307331B (en) 2023-08-04

Family

ID=86803476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310540315.4A Active CN116307331B (en) 2023-05-15 2023-05-15 Aircraft trajectory planning method

Country Status (1)

Country Link
CN (1) CN116307331B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116882607B (en) * 2023-07-11 2024-02-02 中国人民解放军军事科学院***工程研究院 Key node identification method based on path planning task

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105353768A (en) * 2015-12-08 2016-02-24 清华大学 Unmanned plane locus planning method based on random sampling in narrow space
CN112947592A (en) * 2021-03-30 2021-06-11 北京航空航天大学 Reentry vehicle trajectory planning method based on reinforcement learning
CN113848974A (en) * 2021-09-28 2021-12-28 西北工业大学 Aircraft trajectory planning method and system based on deep reinforcement learning
CN114253296A (en) * 2021-12-22 2022-03-29 中国人民解放军国防科技大学 Airborne trajectory planning method and device for hypersonic aircraft, aircraft and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113724534B (en) * 2021-08-18 2022-07-05 中国电子科技集团公司第二十八研究所 Flight trajectory multi-target dynamic planning method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105353768A (en) * 2015-12-08 2016-02-24 清华大学 Unmanned plane locus planning method based on random sampling in narrow space
CN112947592A (en) * 2021-03-30 2021-06-11 北京航空航天大学 Reentry vehicle trajectory planning method based on reinforcement learning
CN113848974A (en) * 2021-09-28 2021-12-28 西北工业大学 Aircraft trajectory planning method and system based on deep reinforcement learning
CN114253296A (en) * 2021-12-22 2022-03-29 中国人民解放军国防科技大学 Airborne trajectory planning method and device for hypersonic aircraft, aircraft and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Path-trajectory bi-level planning for hypersonic vehicles in complex no-fly zones; Zhang Yuan et al.; Journal of Astronautics; Vol. 43, No. 5; pp. 615-626 *

Also Published As

Publication number Publication date
CN116307331A (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
US11403526B2 (en) Decision making for autonomous vehicle motion control
Phung et al. Motion-encoded particle swarm optimization for moving target search using UAVs
CN110956148B (en) Autonomous obstacle avoidance method and device for unmanned vehicle, electronic equipment and readable storage medium
Hug et al. Particle-based pedestrian path prediction using LSTM-MDL models
CN112685165B (en) Multi-target cloud workflow scheduling method based on joint reinforcement learning strategy
CN111142522A (en) Intelligent agent control method for layered reinforcement learning
CN116307331B (en) Aircraft trajectory planning method
CN111665861A (en) Trajectory tracking control method, apparatus, device and storage medium
Tagliaferri et al. A real-time strategy-decision program for sailing yacht races
Abed-Alguni Cooperative reinforcement learning for independent learners
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy
CN115964898A (en) Bignty game confrontation-oriented BC-QMIX on-line multi-agent behavior decision modeling method
Tien et al. Deep Reinforcement Learning Applied to Airport Surface Movement Planning
Mitchell et al. Persistent multi-robot mapping in an uncertain environment
CN115453880A (en) Training method of generative model for state prediction based on antagonistic neural network
CN114527759A (en) End-to-end driving method based on layered reinforcement learning
Etesami et al. Non-cooperative multi-agent systems with exploring agents
Nobahari et al. Optimization of fuzzy rule bases using continous ant colony system
Lin et al. Identification and prediction using neuro-fuzzy networks with symbiotic adaptive particle swarm optimization
Ma Model-based reinforcement learning for cooperative multi-agent planning: exploiting hierarchies, bias, and temporal sampling
Ellis Multi-agent path finding with reinforcement learning
Walther Topological path planning for information gathering in alpine environments
Wang Knowledge transfer in reinforcement learning: How agents should benefit from prior knowledge
CN117556681B (en) Intelligent air combat decision method, system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant