CN116718198B - Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph - Google Patents

Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph

Info

Publication number
CN116718198B
CN116718198B (application CN202311003515.2A)
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
network
state
knowledge graph
Prior art date
Legal status
Active
Application number
CN202311003515.2A
Other languages
Chinese (zh)
Other versions
CN116718198A (en)
Inventor
王必良
李金滔
廖甜
汪礼辉
汤俊
Current Assignee
Hunan Jingde Technology Co ltd
Original Assignee
Hunan Jingde Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hunan Jingde Technology Co ltd filed Critical Hunan Jingde Technology Co ltd
Priority to CN202311003515.2A
Publication of CN116718198A
Application granted
Publication of CN116718198B


Classifications

    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01C - MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 - Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 - Instruments for performing navigational calculations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/02 - Knowledge representation; Symbolic representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Automation & Control Theory (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a path planning method and system for an unmanned aerial vehicle cluster based on a time sequence knowledge graph. The method comprises the following steps: taking each state in the flight environment of the unmanned aerial vehicle cluster as a node in a knowledge graph and the relations among the states as edges, so as to form a knowledge graph of the flight environment of the unmanned aerial vehicle cluster; predicting possible changes of the flight environment in real time and updating the knowledge graph; and taking the unmanned aerial vehicle cluster as an agent, taking the current state of the cluster and the action selected in the current state as input and the environmental feedback (the next state and a reward) as output, and carrying out iterative training through reinforcement learning with maximization of the total reward as the optimization target, so as to find the optimal flight path on the knowledge graph. The invention makes full use of environmental information, real-time feedback and prediction results to realize effective path planning for the unmanned aerial vehicle cluster.

Description

Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph
Technical Field
The invention relates to the technical field of unmanned aerial vehicle cluster path planning, in particular to a path planning method and system for unmanned aerial vehicle clusters based on a time sequence knowledge graph.
Background
With the development of science and technology, unmanned aerial vehicles are widely applied in many fields such as agriculture, electric power inspection and traffic monitoring. In actual use, unmanned aerial vehicles often need to fly autonomously in diverse environments and predict future environmental changes in order to make optimal decisions. Therefore, the path planning problem of unmanned aerial vehicles has become an important subject of research on unmanned aerial vehicle technology.
Traditional unmanned aerial vehicle path planning methods are mainly based on heuristic search algorithms or on single-objective or multi-objective optimization. However, these methods often fail to achieve satisfactory results in the face of environmental changes and real-time requirements. In addition, as the number of unmanned aerial vehicles increases and they form clusters, the path planning problem becomes more complex and requires more efficient and intelligent methods.
Disclosure of Invention
First, the technical problem to be solved
In view of the above-mentioned drawbacks and shortcomings of the prior art, the present invention provides a path planning method and system for an unmanned aerial vehicle cluster based on a time sequence knowledge graph, which solves the technical problem that existing unmanned aerial vehicle cluster path planning methods are too complex and cannot cope with environmental changes and real-time requirements.
(II) technical scheme
In order to achieve the above purpose, the main technical scheme adopted by the invention comprises the following steps:
in a first aspect, an embodiment of the present invention provides a path planning method for an unmanned aerial vehicle cluster based on a time sequence knowledge graph, including the following steps:
taking each state in the flight environment of the unmanned aerial vehicle cluster as a node in the knowledge graph, taking the relation among the states as edges, and connecting the nodes through the edges to form the knowledge graph of the flight environment of the unmanned aerial vehicle cluster;
predicting the possible change of the flight environment in real time and updating the knowledge graph;
taking the unmanned aerial vehicle cluster as an agent, taking the current state of the unmanned aerial vehicle cluster and the action selected in the current state as input, and taking environmental feedback as output, wherein the environmental feedback comprises the next state and a reward; with maximization of the total reward as the optimization target, iterative training is carried out through reinforcement learning so as to find the optimal flight path on the knowledge graph.
The unmanned aerial vehicle cluster path planning method based on the time sequence knowledge graph provided by the embodiment of the invention can fully utilize environment information, real-time feedback and prediction results to realize effective path planning of the unmanned aerial vehicle cluster.
Optionally, the state includes: the position, weather conditions and flight restrictions of the unmanned aerial vehicle;
the actions include: a flight path of the unmanned aerial vehicle from one node to another node;
the relationship between states, including: distance between nodes and weather changes.
Optionally, the optimal flight path is the flight path with the shortest flight time and/or the lowest energy consumption.
Optionally, predicting the possible change of the flight environment in real time and updating the knowledge graph includes:
respectively expressing weather conditions and flight limits as time sequence data of the weather conditions and time sequence data of the flight limits, predicting the weather conditions at the future moment through a prediction model, and predicting the flight limits according to the weather conditions;
and updating each node and each edge in the knowledge graph according to the prediction result.
Optionally, when updating each node and edge in the knowledge graph according to the prediction result, the time of each update is correspondingly recorded.
Optionally, in the process of performing iterative training through reinforcement learning, setting a Q function for each unmanned aerial vehicle cluster; the Q function is Q (s, a), representing the expected rewards available to perform action a in state s;
at the end of each iteration, the Q function is updated according to the following equation:
Q(s, a) ← Q(s, a) + α[ r + γ·max_{a'} Q(s', a') − Q(s, a) ]
where α is the learning rate, controlling the relative weight of new information versus old information; γ is a discount factor, controlling the relative weight of current versus future rewards; r is the reward, and s' and a' are the new state and the new action; max_{a'} Q(s', a') is the maximum Q value attainable in the new state s'.
Optionally, the position of the unmanned aerial vehicle includes a current position of the unmanned aerial vehicle and a target position of the unmanned aerial vehicle, and the state further includes a current time;
in the process of iterative training through reinforcement learning, a Q function is set for each unmanned aerial vehicle cluster; the Q function is Q(s, a; θ), representing the expected reward obtained by performing action a in state s, where θ is a network parameter and the target network has parameters θ⁻;
at the end of each iteration, the Q function is updated according to the following equation:
θ ← θ + α[ r + γ·max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ) ]·∇_θ Q(s, a; θ)
where θ⁻ is the network parameter of the target network and ∇_θ Q(s, a; θ) is the gradient of the Q value Q(s, a; θ) with respect to θ.
Optionally, the selection policy for selecting to execute action a in state s is a greedy policy or an epsilon-greedy policy; the reward is a function of time of flight and/or flight safety factors.
Optionally, during the iterative training through reinforcement learning, each worker trains a policy network π(a|s; θ) and a value network V(s; w), where θ and w are the respective network parameters; the optimization objective in the training process is to maximize the advantage function and minimize the loss functions; each worker starts from the current state, selects and executes an action according to the policy network, observes the reward and the new state, and iterates for k steps or until a termination state is encountered. The method comprises the following steps:
First, the advantage function A(s, a) is used to represent the advantage of taking action a in state s over acting according to the policy π, and the advantage function is calculated as:
A(s_t, a_t) = Σ_{l=0}^{T−t−1} (γλ)^l · δ_{t+l}
where δ_t is the TD residual, defined as δ_t = r_t + γ·V(s_{t+1}; w) − V(s_t; w); r_t is the reward obtained at time t; V(s; w) is the value function, i.e. the expected total reward in state s computed by the value network with parameters w; γ is a discount factor used to adjust the importance of future rewards; λ is a hyper-parameter used to adjust the weight of each TD residual; and T is the termination time.
Second, each worker updates the policy network and the value network, including:
The loss of the policy network is calculated using the output probability of the policy network multiplied by the advantage of the corresponding action, so the loss function of the policy network L(θ) is computed as:
L(θ) = − log π(a_t | s_t; θ) · A(s_t, a_t)
where L(θ) is the loss function of the policy network; log π(a_t | s_t; θ) is the logarithm of the probability of selecting action a_t in state s_t according to the policy with network parameters θ; and A(s_t, a_t) is the relative advantage of taking action a_t in state s_t.
The worker updates the network parameters θ of the policy network using back-propagation and an optimizer. Meanwhile, the square of the difference between the total reward predicted by the value network and the actual total reward is used as the loss function of the value network, so the loss function of the value network L(w) is computed as:
L(w) = ( R_t − V(s_t; w) )²
where R_t is the discounted reward sum starting from state s_t, i.e. the actual value of the future rewards computed from the real reward sequence starting at time t; V(s_t; w) is the expected total reward in state s_t computed by the value function with network parameters w; and w are the network parameters of the value network.
The worker updates the parameters w of the value network using back-propagation and an optimizer. Finally, each worker synchronizes its network parameters to the global network after updating its own network parameters.
In a second aspect, an embodiment of the present invention provides a computer system comprising a memory and a processor; the memory is used to store a computer program, and the processor is used to implement the unmanned aerial vehicle cluster path planning method based on a time sequence knowledge graph as described in any of the above when executing the computer program.
(III) beneficial effects
The beneficial effects of the invention are as follows: according to the unmanned aerial vehicle cluster path planning method and system based on the time sequence knowledge graph, the effective path planning of the unmanned aerial vehicle cluster can be realized based on the knowledge graph, reinforcement learning and prediction technology, and the method and system comprise environment learning, dynamic planning and prediction of flight paths.
Drawings
Fig. 1 is a flowchart of a path planning method of an unmanned aerial vehicle cluster based on a time-series knowledge-graph according to a preferred embodiment of the present invention;
fig. 2 is a schematic diagram of a knowledge graph according to a preferred embodiment of the present invention.
Description of the embodiments
The invention will be better explained by the following detailed description of the embodiments with reference to the drawings.
In order that the above-described aspects may be better understood, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The embodiment of the invention uses the knowledge graph as an environment learning and knowledge representation tool of the unmanned aerial vehicle cluster. Knowledge graph is a structured information representation that can represent entities in the environment and their relationships in the form of graphs.
In unmanned cluster path planning, embodiments of the present invention consider each possible state in the flight environment (e.g., location, weather conditions, flight restrictions, etc.) as a node in a knowledge-graph, while the transition relationships between these states (e.g., flying from one location to another, weather changes, etc.) are used as edges. The embodiment of the invention can show the information of the whole flight environment in the form of the knowledge graph, and provide the information for the unmanned aerial vehicle cluster to make decisions.
Reinforcement learning is a machine learning method that obtains the maximum reward through continual trial and error and learning from the results. In unmanned aerial vehicle cluster path planning, embodiments of the present invention treat each unmanned aerial vehicle as a reinforcement learning agent whose goal is to find a flight path that maximizes the reward (e.g., shortest flight time, lowest energy consumption, etc.).
Taking Q-learning as an example, a Q function Q(s, a) may be defined that represents the expected reward obtained by performing action a in state s. Each drone selects an action a (i.e. flies to a node) based on the current state s, then obtains a new state s' and a reward r, and then updates the Q function according to the following update rule:
Q(s, a) ← Q(s, a) + α[ r + γ·max_{a'} Q(s', a') − Q(s, a) ]
where α is the learning rate and γ is the discount factor.
Predictive techniques are used to predict changes that may occur in the flight environment, such as weather changes, changes in flight restrictions, and the like. The prediction results can be used as attributes of nodes and edges in the knowledge graph, so that unmanned aerial vehicle clusters can refer to the attributes in decision making.
Based on the above inventive concept, referring to fig. 1, the path planning method of the unmanned aerial vehicle cluster based on the time sequence knowledge graph according to the embodiment of the present invention includes the following steps:
s1, taking each state in the flight environment of an unmanned aerial vehicle cluster as a node in a knowledge graph, and taking the relation among a plurality of states as edges, and connecting the nodes through the edges to form the knowledge graph of the flight environment of the unmanned aerial vehicle cluster;
s2, predicting the possible change of the flight environment in real time and updating the knowledge graph;
And S3, taking the unmanned aerial vehicle cluster as an agent, taking the current state of the unmanned aerial vehicle cluster and the action selected in the current state as input, and taking environmental feedback as output, wherein the environmental feedback comprises the next state and a reward; with maximization of the total reward as the optimization target, iterative training is carried out through reinforcement learning so as to find the optimal flight path on the knowledge graph.
According to the unmanned aerial vehicle cluster path planning method based on the time sequence knowledge graph, which is provided by the embodiment of the invention, the environment information, the real-time feedback and the prediction result can be fully utilized based on the knowledge graph, the reinforcement learning and the prediction technology, and the effective path planning of the unmanned aerial vehicle cluster can be realized, including the environment learning, the dynamic planning and the prediction of the flight path.
In this embodiment S1, referring to fig. 2, in the knowledge graph the states (nodes) include: the position, weather conditions and flight restrictions of the unmanned aerial vehicle; the position of the unmanned aerial vehicle comprises the current position of the unmanned aerial vehicle and the target position of the unmanned aerial vehicle; the states are represented by the set E. The actions include: a flight path of the unmanned aerial vehicle from one node to another node. The relationships (edges) between states include: the distance between nodes and weather changes; the relationships between states are represented by the set R. Thus, the knowledge graph may be represented as G = (E, R), where E is the set of entities and R is the set of relationships. For the unmanned aerial vehicle flight environment, a relationship can be represented by a triplet (e1, r, e2), where e1, e2 ∈ E and r ∈ R. In this way, the knowledge graph can be used to represent various information of the flight environment, such as possible positions of the unmanned aerial vehicle, flight conditions, etc. Meanwhile, by updating the knowledge graph, changes of the environment can be reflected in real time.
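As an illustration of this triple representation, the following minimal Python sketch stores the entity set E and relation instances as (e1, r, e2) triples; the class names, field names and example entities are illustrative assumptions, not part of the patent:

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Triple:
    head: str      # entity e1, e.g. a position node
    relation: str  # relation r, e.g. "flight_path" or "weather"
    tail: str      # entity e2

@dataclass
class FlightKnowledgeGraph:
    entities: set = field(default_factory=set)   # E: positions, weather states, flight restrictions
    triples: set = field(default_factory=set)    # relation instances as (e1, r, e2) triples

    def add_triple(self, head, relation, tail):
        self.entities.update({head, tail})
        self.triples.add(Triple(head, relation, tail))

    def neighbors(self, entity):
        # Entities reachable from `entity`, used when expanding candidate flight actions.
        return {t.tail for t in self.triples if t.head == entity}

# Example: two positions connected by a flight edge, plus a weather attribute
g = FlightKnowledgeGraph()
g.add_triple("pos_A", "flight_path", "pos_B")
g.add_triple("pos_A", "weather", "clear")
print(g.neighbors("pos_A"))  # {'pos_B', 'clear'}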
Using knowledge maps to represent flight environment information may bring about several advantages:
structured information expression: the knowledge graph organizes various entities and relationships in the environment together to form a structured information model, so that the unmanned aerial vehicle cluster can better understand and operate the environment information.
Dynamically updating: the knowledge graph can be dynamically updated along with the change of the environment, so that the unmanned aerial vehicle cluster can acquire real-time environment information, and a more accurate decision is made.
Efficient information retrieval: the knowledge graph supports efficient information retrieval, so that the unmanned aerial vehicle cluster can quickly find relevant environment information, and decision efficiency is improved.
In implementation, in this embodiment S3, the optimal flight path is the flight path with the shortest flight time and/or the lowest energy consumption.
In this embodiment S2, predicting the possible change of the flight environment in real time and updating the knowledge graph includes:
and respectively expressing the weather conditions and the flight restrictions as time sequence data of the weather conditions and time sequence data of the flight restrictions, predicting the weather conditions at the future moment through a prediction model, and predicting the flight restrictions according to the weather conditions. And updating each node and each edge in the knowledge graph according to the prediction result. In some preferred embodiments the time of each update is also recorded correspondingly.
In order to realize real-time dynamic updating of the knowledge graph, a time series analysis method can be adopted during implementation to predict possible environmental changes. Time series analysis is a statistical technique for analyzing time series data that can be used to predict future trends and patterns. In an unmanned aerial vehicle cluster flight environment, factors such as weather conditions and flight restrictions can be treated as time series data. S201, let X = {x1, x2, …, xt} denote the observed sequence of environmental states. The prediction model may be expressed as:
x_{t+1} = f(x1, x2, …, xt) + ε
where f is the prediction model and ε is the prediction error.
And S202, in implementation, after a predicted result is obtained, the predicted environment change needs to be updated into the knowledge graph. The updating step is as follows
And determining the entity and relation which need to be updated according to the prediction result. For example, if a predicted weather change affects the flight conditions at a location, then the entity at that location and its associated relationship may need to be updated.
For entities that need to be updated, their attributes are modified. For example, the flight conditions of the location are modified.
For relationships that need to be updated, their attributes are modified. For example, the flight time of a drone from one location to another is modified.
This process can be expressed as:
G' = update(G, ΔE, ΔR)
where G is the original knowledge graph, ΔE and ΔR are the predicted changes in the entities and relationships, and G' is the updated knowledge graph.
The prediction and update process is dynamic, and changes of the environment can be reflected in real time, so that the unmanned aerial vehicle cluster is helped to make more accurate decisions.
S202A, in order to better predict the environmental state, a long-short-term memory network (LSTM) can be used for prediction, so that the prediction performance is better.
Assuming that there is a sequence of environmental states X = {x1, x2, …, xt}, one wants to predict the environmental state x_{t+1} at the next time step. Using LSTM, the following prediction model can be constructed:
x_{t+1} = LSTM(X)
where LSTM(X) represents the result of predicting the sequence X with the LSTM network.
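A minimal PyTorch sketch of such an LSTM predictor, assuming each environmental state x_t is encoded as a fixed-length feature vector; the feature choice, layer sizes and placeholder data below are illustrative assumptions:

import torch
import torch.nn as nn

class StatePredictor(nn.Module):
    # Predicts the next environmental state vector x_{t+1} from the sequence x_1..x_t.
    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=state_dim, hidden_size=hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, state_dim)

    def forward(self, x_seq: torch.Tensor) -> torch.Tensor:
        # x_seq: (batch, t, state_dim) -> prediction of shape (batch, state_dim)
        out, _ = self.lstm(x_seq)
        return self.head(out[:, -1, :])   # last hidden state summarises the sequence X

# Usage sketch: predict x_{t+1} from a window of past weather/restriction features
state_dim = 4                            # e.g. wind speed, visibility, restriction flag, time of day
model = StatePredictor(state_dim)
x_seq = torch.randn(1, 12, state_dim)    # 12 past observations (placeholder data)
x_next = model(x_seq)                    # predicted environmental state at t+1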
In some preferred embodiments, the knowledge-graph is updated for a time-series knowledge-graph taking into account time factors. Whenever the environmental state changes (e.g., the predicted outcome occurs), it is necessary to update not only the state of the relevant entities and relationships, but also the time of this update. In this way, each entity and relationship can be tracked over time.
Specifically, assuming that the environmental state x_{t+1} at time t+1 has been predicted, the following steps are required to update the time sequence knowledge graph:
the entities and relationships that need to be updated are determined from xt+1. This may include some entities and relationships that are affected by xt+1.
For entities that need to be updated, their state is modified and the update time is recorded as t+1.
For relationships that need to be updated, their states are modified and the update time is recorded as t+1.
This process can be expressed as:
G_{t+1} = update(G_t, ΔE, ΔR)
where G_t is the knowledge graph at time t, ΔE and ΔR are the changes in entities and relationships produced by x_{t+1}, and G_{t+1} is the knowledge graph at time t+1 after updating.
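Continuing the illustrative Python sketch above, one possible way to apply ΔE/ΔR while recording the update time t+1 is shown below; the structure of the facts dictionary and the field names are assumptions, not part of the patent:

from dataclasses import dataclass

@dataclass
class TimedFact:
    head: str
    relation: str
    tail: str
    updated_at: int   # time step of the last update, as required for the time sequence graph

class TemporalKnowledgeGraph:
    def __init__(self):
        self.facts = {}   # (head, relation) -> TimedFact

    def apply_update(self, delta_facts, t):
        # Apply the predicted changes ΔE/ΔR at time t, recording t as the update time.
        for head, relation, tail in delta_facts:
            self.facts[(head, relation)] = TimedFact(head, relation, tail, updated_at=t)

    def state_at(self, head, relation):
        return self.facts.get((head, relation))

# Usage: the prediction model forecasts a weather change at pos_B for time t+1 = 8
g_t = TemporalKnowledgeGraph()
g_t.apply_update([("pos_B", "weather", "clear")], t=7)
g_t.apply_update([("pos_B", "weather", "thunderstorm"),
                  ("pos_A->pos_B", "flight_restriction", "closed")], t=8)
print(g_t.state_at("pos_B", "weather"))   # TimedFact(..., updated_at=8)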
In this embodiment S3, the unmanned aerial vehicle cluster (as Agent) learns the optimal action strategy by interacting with the environment (e.g., time-series knowledge graph). In the iterative training process through reinforcement learning, a Q function is set for each unmanned aerial vehicle cluster. In step S3, the Q function is updated according to each environmental feedback, and the next action is selected according to the latest Q function, so that iteration is continued until an optimal flight path is found.
The following specific steps can be adopted:
s301, initializing a Q table.
We can use a Q table Q (s, a) to represent the value of taking action a in state s. The state s is the current position and environmental information of the unmanned aerial vehicle obtained from the knowledge graph, and the action a is a possible flight path. This Q table may be random in the initial phase, with gradual updates as the learning process proceeds.
S302, selecting and executing actions.
In state s, the drone cluster selects an action a according to the Q table. The selection policy may be a greedy policy (selecting the currently optimal action) or an epsilon-greedy policy (selecting a random action with epsilon probability and selecting the currently optimal action with 1-epsilon probability) to ensure a certain degree of exploration.
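A minimal Python sketch of the two selection strategies mentioned above; the Q table is assumed to be a dictionary keyed by (state, action) pairs, and the helper names are illustrative:

import random

def greedy(q_table, state, actions):
    # Always pick the currently best-valued action.
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

def epsilon_greedy(q_table, state, actions, epsilon=0.1):
    # With probability epsilon explore a random action, otherwise exploit the best one.
    if random.random() < epsilon:
        return random.choice(actions)
    return greedy(q_table, state, actions)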
After the action is selected, the drone cluster performs this action and observes the feedback of the environment, including the new state s' and the reward r. The reward r may be a function based on time of flight, flight safety, etc.
S303, updating the Q table.
Based on the observed feedback, we update the Q table. The update rule is the standard Q-Learning rule:
Q(s, a) ← Q(s, a) + α[ r + γ·max_{a'} Q(s', a') − Q(s, a) ]
where α is the learning rate (controlling the relative weight of new information versus old information), γ is the discount factor (controlling the relative weight of current versus future rewards), r is the reward, s' and a' are the new state and new action, and max_{a'} Q(s', a') is the maximum Q value attainable in the new state s'.
And S304, repeating the steps S302 and S303.
Steps S302 and S303 are repeated until the Q table converges or the preset number of learning rounds is reached. Finally, the optimal flight path of the unmanned aerial vehicle cluster consists of the action with the largest Q value in each state.
This process may be expressed as the following pseudocode: initializing Q (s, a);
for each round of learning:
acquiring an initial state s;
when the end condition is not reached:
selecting action a based on Q (s, a);
performing action a, observing the reward r and the new state s';
update Q(s, a): Q(s, a) ← Q(s, a) + α[ r + γ·max_{a'} Q(s', a') − Q(s, a) ];
update the state s to the new state s'.
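The pseudocode above can be fleshed out into the following Python sketch. The env object and its reset()/actions()/step() interface are illustrative assumptions standing in for the knowledge-graph environment of S1 and S2, not an API defined by the patent:

import random
from collections import defaultdict

def train_q_table(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Tabular Q-learning corresponding to steps S301-S304.
    Q = defaultdict(float)                      # Q(s, a), initialised arbitrarily (here 0)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            acts = env.actions(s)
            if random.random() < epsilon:       # epsilon-greedy exploration (S302)
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(s, a)
            # Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]   (S303)
            best_next = max((Q[(s_next, x)] for x in env.actions(s_next)), default=0.0)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

def extract_path(Q, env, start, max_steps=50):
    # Greedy rollout: in each state take the action with the largest Q value (S304).
    path, s = [start], start
    for _ in range(max_steps):
        acts = env.actions(s)
        if not acts:
            break
        a = max(acts, key=lambda x: Q[(s, x)])
        s, _, done = env.step(s, a)
        path.append(s)
        if done:
            break
    return path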
The unmanned aerial vehicle cluster path planning method based on reinforcement learning and time sequence knowledge patterns can reflect the change of the environment in real time, so that the unmanned aerial vehicle clusters can find the optimal flight path in the continuously changed environment.
In some preferred embodiments, since the problems faced in unmanned aerial vehicle cluster path optimization are complex and variable, a Deep Q-Network (DQN) can be employed to improve the Q function: a DQN can handle complex, high-dimensional state spaces and automatically extract useful features.
S301A, in DQN the Q function is approximated by a deep neural network, denoted as Q(s, a; θ), where s is the state, a is the action and θ is a network parameter. For the unmanned aerial vehicle cluster path optimization problem, the state s can contain information such as the current position of the unmanned aerial vehicle, the target position, the current time and the weather conditions, and the action a is a possible flight path.
The update rule of DQN is similar to Q-Learning, but a target network is used when calculating the target Q value, denoted as Q(s', a'; θ⁻), where θ⁻ is a parameter of the target network. Doing so increases the stability of the learning process.
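A minimal PyTorch sketch of this target-network scheme follows; the layer sizes, the single-batch update without an experience replay buffer, and the placeholder tensors are simplifying assumptions, not part of the patent:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Approximates Q(s, a; θ): input is a state vector, output is one Q value per action.
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))
    def forward(self, s):
        return self.net(s)

state_dim, n_actions = 8, 5                       # illustrative sizes
q_net = QNetwork(state_dim, n_actions)            # parameters θ
target_net = QNetwork(state_dim, n_actions)       # parameters θ⁻
target_net.load_state_dict(q_net.state_dict())    # θ⁻ starts as a copy of θ
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(s, a, r, s_next, done):
    # One gradient step toward the target r + γ·max_a' Q(s', a'; θ⁻).
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Placeholder batch of transitions
s = torch.randn(32, state_dim)
a = torch.randint(0, n_actions, (32,))
r = torch.randn(32)
s_next = torch.randn(32, state_dim)
done = torch.zeros(32)
dqn_update(s, a, r, s_next, done)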
S302A, the update rule is as follows:
θ ← θ + α[ r + γ·max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ) ]·∇_θ Q(s, a; θ)
where r is the reward, γ is the discount factor, α is the learning rate, s' and a' are the new state and action, θ⁻ is the network parameter of the target network, and ∇_θ Q(s, a; θ) is the gradient of the Q value Q(s, a; θ) with respect to θ. Furthermore, in other preferred embodiments, an Actor (policy network) and a Critic (value network) can be trained simultaneously during reinforcement learning, with multiple parallel workers performing the learning; the workers explore different parts of the state space simultaneously, which enhances sample diversity and improves learning efficiency. The steps are as follows.
S3031 initializing
We initialize a policy network π(a|s; θ) and a value network V(s; w), where θ and w are network parameters, s is a state and a is an action. A(s, a) denotes the advantage function, representing the relative advantage of taking action a in state s;
s3032 parallel exploration
Each worker starts from its current state, selects and performs an action according to the policy network, and then observes rewards and new states, doing so k steps or until a termination state is encountered.
S3033, calculating the advantage function
The advantage function is calculated using the following formula:
A(s_t, a_t) = Σ_{l=0}^{T−t−1} (γλ)^l · δ_{t+l}
where A(s, a) is the advantage function, representing the relative advantage of taking action a in state s; δ_t is the TD residual (temporal-difference error), defined as δ_t = r_t + γ·V(s_{t+1}; w) − V(s_t; w), where r_t is the reward obtained at time t and V(s; w) is the value function representing the expected total reward in state s; γ is the discount factor, used to adjust the importance of future rewards; λ is a hyper-parameter used to adjust the weight of each TD residual: the closer λ is to 1, the more strongly future TD residuals are taken into account, and the closer λ is to 0, the more attention is paid to the current TD residual; T is the termination time.
The advantage function represents the advantage of taking action a over acting according to the policy π. The optimization goal is to make the advantage function as large as possible.
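A plain-Python sketch of this advantage computation over one k-step rollout, assuming the value estimates V(s_t; w) have already been obtained from the value network; the variable names are illustrative:

def td_residuals(rewards, values, gamma=0.99):
    # δ_t = r_t + γ·V(s_{t+1}) − V(s_t); `values` has one extra entry for the state after the last step.
    return [rewards[t] + gamma * values[t + 1] - values[t] for t in range(len(rewards))]

def advantages(rewards, values, gamma=0.99, lam=0.95):
    # A(s_t, a_t) = Σ_l (γλ)^l · δ_{t+l}, accumulated backwards for efficiency.
    deltas = td_residuals(rewards, values, gamma)
    advs, running = [0.0] * len(deltas), 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advs[t] = running
    return advs

# Example: a 3-step rollout with a bootstrap value for the final state
print(advantages(rewards=[1.0, 0.0, 2.0], values=[0.5, 0.4, 0.8, 0.0]))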
S3034, updating the policy network and the value network, including:
The loss function of the policy network is calculated as:
L(θ) = − log π(a_t | s_t; θ) · A(s_t, a_t)
where L(θ) is the loss function of the policy network, which the training process aims to minimize; log π(a_t | s_t; θ) is the logarithm of the probability of selecting action a_t in state s_t under the policy with parameters θ; A(s_t, a_t) is the advantage function, expressing the relative advantage of taking action a_t in state s_t; and θ are the parameters of the policy network. The meaning of the whole formula is that the loss of the policy network is the output (log-)probability of the policy network multiplied by the advantage of the corresponding action, with a negative sign. This form of loss function encourages the policy network to select more advantageous actions in the future, and thus helps the unmanned aerial vehicle cluster find better flight paths.
The parameters θ of the policy network are updated using back-propagation and an optimizer (e.g., Adam).
At the same time, we also calculate the loss function of the value network:
L(w) = ( R_t − V(s_t; w) )²
where L(w) is the loss function of the value network, which the training process aims to minimize; R_t is the discounted reward sum starting from state s_t, i.e. the actual value of the future rewards calculated from the real reward sequence starting at time t; V(s_t; w) is the expected total reward in state s_t calculated by the value function with parameters w; and w are the parameters of the value network. We likewise use back-propagation and an optimizer to update the parameters w of the value network. The meaning of the whole formula is that the square of the difference between the total reward predicted by the value network and the actual total reward is used as the loss; this form of loss function encourages the value network to predict future rewards more accurately, thereby helping the policy network make better decisions.
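The two loss terms and the corresponding update for a single worker can be sketched in PyTorch as follows; the layer sizes, the shared optimizer and the placeholder rollout tensors are assumptions, and a full implementation would also synchronize parameters with the global network as described in S3035:

import torch
import torch.nn as nn

state_dim, n_actions = 8, 5                     # illustrative sizes
policy_net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(policy_net.parameters()) + list(value_net.parameters()), lr=3e-4)

def worker_update(states, actions, advantages, returns):
    # L(θ) = −log π(a|s;θ)·A(s,a);  L(w) = (R_t − V(s_t;w))².
    logits = policy_net(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    policy_loss = -(log_probs * advantages.detach()).mean()
    value_loss = ((returns - value_net(states).squeeze(-1)) ** 2).mean()
    loss = policy_loss + value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return policy_loss.item(), value_loss.item()

# Usage with a rollout of 3 transitions (placeholder tensors)
states = torch.randn(3, state_dim)
actions = torch.tensor([0, 2, 1])
advs = torch.tensor([0.5, -0.1, 1.2])
rets = torch.tensor([1.0, 0.4, 2.1])
worker_update(states, actions, advs, rets)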
S3035, synchronizing network parameters;
after updating own network parameters, each worker synchronizes the parameters to the global network.
The above process continues to iterate until the termination condition is met.
Through the above steps, the unmanned aerial vehicle clusters can learn and optimize their own flight paths in parallel, greatly improving learning efficiency. Meanwhile, by using the advantage function and parallel exploration, complex and changeable flight environments can be handled better, so that the unmanned aerial vehicle cluster can find a better flight path.
Correspondingly, the computer system of the embodiment of the invention comprises a memory and a processor, wherein the memory is used for storing a computer program; and the processor is used for realizing the path planning method of the unmanned aerial vehicle cluster based on the time sequence knowledge graph according to any embodiment when executing the computer program.
In summary, the knowledge graph is a method for representing and using knowledge and can be used to represent the information of the flight environment; prediction techniques can be used to predict possible environmental changes so as to update the knowledge graph; and reinforcement learning is a machine learning method by which the optimal strategy can be learned through interaction with the environment.
Although knowledge graph, prediction technology and reinforcement learning are each widely used in the field, combining them for solving the path planning problem of unmanned aerial vehicle clusters is an entirely new challenge. It is necessary to study how to represent the flight environment information as a knowledge graph, how to update the knowledge graph according to predicted environmental changes, and how to find an optimal flight path on the knowledge graph using reinforcement learning.
Therefore, with the path planning method and system for unmanned aerial vehicle clusters based on the time sequence knowledge graph, the unmanned aerial vehicle clusters can learn and optimize their own flight paths in parallel, greatly improving learning efficiency. The method not only makes full use of environmental information, real-time feedback and prediction results to realize effective path planning for the unmanned aerial vehicle cluster, but also, by using the advantage function and parallel exploration, handles complex and changeable flight environments better, so that the unmanned aerial vehicle cluster can find a better flight path.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions.
It should be noted that the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. Several of these means may be embodied by one and the same item of hardware. The use of the terms first, second, third, etc. are for convenience of description only and do not denote any order. These terms may be understood as part of the component name.
Furthermore, it should be noted that in the description of the present specification, the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to a specific feature, structure, material, or characteristic described in connection with the embodiment or example being included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art upon learning the basic inventive concepts.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, the present invention should also include such modifications and variations provided that they fall within the scope of the equivalent technology of the present invention.

Claims (7)

1. The unmanned aerial vehicle cluster path planning method based on the time sequence knowledge graph is characterized by comprising the following steps of:
taking each state in the flight environment of the unmanned aerial vehicle cluster as a node in the knowledge graph, taking the relation among the states as edges, and connecting the nodes through the edges to form the knowledge graph of the flight environment of the unmanned aerial vehicle cluster; the states include: the position, weather conditions and flight restrictions of the unmanned aerial vehicle; the relationship between the states includes: distance between nodes and weather changes;
predicting the possible change of the flight environment in real time and updating the knowledge graph; comprising the following steps: respectively expressing the weather conditions and the flight restrictions as time sequence data of the weather conditions and time sequence data of the flight restrictions, predicting the weather conditions at the future moment through a prediction model, and predicting the flight restrictions according to the weather conditions; updating each node and each side in the knowledge graph according to the prediction result;
taking the unmanned aerial vehicle cluster as an intelligent agent, taking the current state of the unmanned aerial vehicle cluster and the action selected in the current state as input, taking environmental feedback as output, wherein the environmental feedback comprises the next state and a reward, taking the maximized total reward as an optimization target, and carrying out iterative training through reinforcement learning so as to find the optimal flight path on a knowledge graph; the actions include: a flight path of the unmanned aerial vehicle from one node to another node;
in the iterative training process through reinforcement learning, each worker trains a policy network π(a|s; θ) and a value network V(s; w), where θ and w are the respective network parameters; the optimization objective in the training process is to maximize the advantage function and minimize the loss functions; each worker starts from the current state, selects and executes an action according to the policy network, observes the reward and the new state, and iterates for k steps or until a termination state is encountered; the method comprises the following steps:
first, the advantage function A(s, a) is used to represent the advantage of taking action a in state s over acting according to the policy π, and the advantage function is calculated as:
A(s_t, a_t) = Σ_{l=0}^{T−t−1} (γλ)^l · δ_{t+l}
wherein δ_t is the TD residual, defined as δ_t = r_t + γ·V(s_{t+1}; w) − V(s_t; w); r_t is the reward obtained at time t; V(s; w) is the value function, i.e. the expected total reward in state s computed by the value network with parameters w; γ is a discount factor for adjusting the importance of future rewards; λ is a hyper-parameter for adjusting the weight of each TD residual; and T is the termination time;
second, each worker updates the policy network and the value network, including:
calculating the loss of the policy network using the output probability of the policy network multiplied by the advantage of the corresponding action, whereby the loss function of the policy network L(θ) is calculated as:
L(θ) = − log π(a_t | s_t; θ) · A(s_t, a_t)
wherein L(θ) is the loss function of the policy network; log π(a_t | s_t; θ) is the logarithm of the probability of selecting action a_t in state s_t according to the policy with network parameters θ; and A(s_t, a_t) is the relative advantage of taking action a_t in state s_t;
the worker updates the network parameters θ of the policy network using back-propagation and an optimizer; meanwhile, the square of the difference between the total reward predicted by the value network and the actual total reward is used as the loss function of the value network, whereby the loss function of the value network L(w) is calculated as:
L(w) = ( R_t − V(s_t; w) )²
wherein R_t is the discounted reward sum starting from state s_t, i.e. the actual value of the future rewards calculated from the real reward sequence starting at time t; V(s_t; w) is the expected total reward in state s_t calculated by the value function with network parameters w; and w are the network parameters of the value network;
the worker updates the parameters w of the value network using back-propagation and an optimizer;
finally, each worker synchronizes its network parameters to the global network after updating its own network parameters.
2. The path planning method for unmanned aerial vehicle clusters based on a time sequence knowledge graph according to claim 1, wherein the optimal flight path is the flight path with the shortest flight time and/or the lowest energy consumption.
3. The path planning method of unmanned aerial vehicle clusters based on time sequence knowledge graph according to claim 1, wherein when updating each node and edge in the knowledge graph according to the prediction result, the time of each update is also recorded correspondingly.
4. The path planning method for unmanned aerial vehicle clusters based on a time sequence knowledge graph according to claim 1 or 3, wherein a Q function is set for each unmanned aerial vehicle cluster during the iterative training through reinforcement learning; the Q function is Q(s, a), representing the expected reward obtained by performing action a in state s;
at the end of each iteration, the Q function is updated according to the following equation:
Q(s, a) ← Q(s, a) + α[ r + γ·max_{a'} Q(s', a') − Q(s, a) ]
wherein α is the learning rate, for controlling the relative weight of new information versus old information; γ is a discount factor, for controlling the relative weight of current versus future rewards; r is the reward, and s' and a' are the new state and the new action; max_{a'} Q(s', a') is the maximum Q value attainable in the new state s'.
5. The path planning method of unmanned aerial vehicle clusters based on a time sequence knowledge graph according to claim 4, wherein the unmanned aerial vehicle positions comprise the current position of the unmanned aerial vehicle and the target position of the unmanned aerial vehicle, and the state further comprises the current time;
in the iterative training process through reinforcement learning, a Q function is set for each unmanned aerial vehicle cluster; the Q function is Q(s, a; θ), representing the expected reward obtained by performing action a in state s, where θ is a network parameter and the target network has parameters θ⁻;
at the end of each iteration, the Q function is updated according to the following equation:
θ ← θ + α[ r + γ·max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ) ]·∇_θ Q(s, a; θ)
wherein θ⁻ is the network parameter of the target network and ∇_θ Q(s, a; θ) is the gradient of the Q value Q(s, a; θ) with respect to θ.
6. The path planning method of unmanned aerial vehicle clusters based on a time sequence knowledge graph according to claim 5, wherein the selection strategy for selecting to execute action a in state s is a greedy strategy or an ε-greedy strategy; the reward is a function of flight time and/or flight safety factors.
7. A computer system, characterized in that: the system comprises a memory and a processor, wherein the memory is used for storing a computer program; the processor is configured to implement the path planning method of the unmanned aerial vehicle cluster based on the time-series knowledge-graph according to any one of claims 1 to 6 when executing the computer program.
CN202311003515.2A 2023-08-10 2023-08-10 Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph Active CN116718198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311003515.2A CN116718198B (en) 2023-08-10 2023-08-10 Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph


Publications (2)

Publication Number Publication Date
CN116718198A CN116718198A (en) 2023-09-08
CN116718198B true CN116718198B (en) 2023-11-03

Family

ID=87875652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311003515.2A Active CN116718198B (en) 2023-08-10 2023-08-10 Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph

Country Status (1)

Country Link
CN (1) CN116718198B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11573577B2 (en) * 2019-01-30 2023-02-07 The Government Of The United States Of America, As Represented By The Secretary Of The Navy Method and system for optimal trajectory path tasking for an unmanned aerial vehicle (UAV)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611402A (en) * 2020-05-15 2020-09-01 广东新快易通智能信息发展有限公司 Driving behavior knowledge graph generation method, device and system based on position
WO2023065545A1 (en) * 2021-10-19 2023-04-27 平安科技(深圳)有限公司 Risk prediction method and apparatus, and device and storage medium
US11531501B1 (en) * 2021-12-10 2022-12-20 Agile Systems, LLC Collaborative decision making in dynamically changing environment
CN116225046A (en) * 2022-09-09 2023-06-06 西安工业大学 Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment
CN115542318A (en) * 2022-10-12 2022-12-30 南京航空航天大学 Air-ground combined multi-domain detection system and method for unmanned aerial vehicle group target
CN116242364A (en) * 2023-03-09 2023-06-09 西安电子科技大学 Multi-unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Myoung Hoon Lee et al. ICT Express. 2023, Vol. 9, No. 3, pp. 403-408. *
A survey of knowledge-based deep reinforcement learning; Li Chenxi, Cao Lei, Zhang Yongliang, Chen Xiliang, Zhou Yuhuan, Duan Liwen; Systems Engineering and Electronics; Vol. 39, No. 11, pp. 2603-2613 *

Also Published As

Publication number Publication date
CN116718198A (en) 2023-09-08

Similar Documents

Publication Publication Date Title
Feriani et al. Single and multi-agent deep reinforcement learning for AI-enabled wireless networks: A tutorial
Mavrovouniotis et al. A memetic ant colony optimization algorithm for the dynamic travelling salesman problem
CN115905691B (en) Preference perception recommendation method based on deep reinforcement learning
Ates Enhanced equilibrium optimization method with fractional order chaotic and application engineering
Hafez et al. Topological Q-learning with internally guided exploration for mobile robot navigation
Zhou et al. Efficient and robust reinforcement learning with uncertainty-based value expansion
Liu et al. Self-attention-based multi-agent continuous control method in cooperative environments
CN113722980A (en) Ocean wave height prediction method, system, computer equipment, storage medium and terminal
CN112598137A (en) Optimal decision method based on improved Q-learning
He et al. A competitive swarm optimizer with probabilistic criteria for many-objective optimization problems
Das et al. Evolving fuzzy reasoning approach using a novel nature-inspired optimization tool
CN114815801A (en) Adaptive environment path planning method based on strategy-value network and MCTS
Lin et al. When architecture meets AI: A deep reinforcement learning approach for system of systems design
Zhang et al. An enhanced grey wolf optimizer boosted machine learning prediction model for patient-flow prediction
CN117520956A (en) Two-stage automatic feature engineering method based on reinforcement learning and meta learning
Byeon Advances in Value-based, Policy-based, and Deep Learning-based Reinforcement Learning
CN116718198B (en) Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph
Gosavi Solving Markov decision processes via simulation
CN117273057A (en) Multi-agent collaborative countermeasure decision-making method and device based on reinforcement learning
CN117273125A (en) Multi-model online self-adaptive preferential technology driven evolution algorithm based on reinforcement learning
Zhan et al. Relationship explainable multi-objective optimization via vector value function based reinforcement learning
CN116306947A (en) Multi-agent decision method based on Monte Carlo tree exploration
Cao et al. Efficient multi-objective reinforcement learning via multiple-gradient descent with iteratively discovered weight-vector sets
Mateou et al. Tree-structured multi-layer fuzzy cognitive maps for modelling large scale, complex problems
Zhan et al. Relationship explainable multi-objective reinforcement learning with semantic explainability generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant