CN114123178B - Multi-agent reinforcement learning-based intelligent power grid partition network reconstruction method - Google Patents
- Publication number: CN114123178B (application CN202111364422.3A)
- Authority
- CN
- China
- Prior art keywords
- power grid
- power
- agent
- environment
- agents
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for ac mains or ac distribution networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2113/00—Details relating to the application field
- G06F2113/04—Power grid distribution networks
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2203/00—Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
- H02J2203/10—Power transmission or distribution systems management focussing at grid-level, e.g. load flow analysis, node profile computation, meshed network optimisation, active network management or spinning reserve management
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2203/00—Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
- H02J2203/20—Simulating, e g planning, reliability check, modelling or computer assisted design [CAD]
Abstract
The invention provides a multi-agent reinforcement learning-based intelligent power grid partition network reconstruction method, which comprises the following steps: step 1, dividing a power grid into N areas according to the operation requirements of the power grid, and constructing the basic elements of multi-agent reinforcement learning, including the environment, agents, states, observations, actions and reward functions; step 2, running a power system simulation environment and creating an initial operating state data set of the power system; step 3, constructing a deep neural network model and training decision-making agents with reinforced inter-agent learning; and step 4, providing a power grid reconstruction strategy with the trained agents. Through the interaction of multiple agents with the power simulation environment, the invention learns an optimal network reconstruction strategy offline and then applies that strategy to an actual power grid online.
Description
Technical Field
The invention relates to the field of multi-agent reinforcement learning, and in particular to a smart grid partition network reconstruction method based on multi-agent reinforcement learning.
Background
Network reconstruction refers to changing the network topology of the power grid, i.e., changing the operating states of its tie switches and sectionalizing switches, so that load is transferred between feeders or distribution stations and the operating state of the grid changes. When the power grid fails, network reconstruction can restore it to safe and stable operation. Traditional network reconstruction relies on optimization algorithms or expert experience: optimization algorithms often carry an enormous computational burden and low processing speed, which hinders real-time application, while expert experience lacks the means to cope with risks that have not yet occurred and struggles with the operational safety of increasingly complex power systems. In addition, traditional network reconstruction finds it difficult to account simultaneously for the uncertainty of wind power, photovoltaic generation and load. Before a reconstruction is executed, the post-reconstruction operating state of the grid must be estimated, and the accuracy of this estimate directly determines the quality of the reconstruction action, which further increases the difficulty of the task. Reinforcement learning fully accounts for the dynamics of the environment and can predict the new environment that follows an action, offering a new approach to network reconstruction. Moreover, reinforcement learning-based methods are fast and efficient, making them well suited to online application in power systems.
Disclosure of Invention
The invention aims to provide a multi-agent reinforcement learning-based intelligent power grid partition network reconstruction method that realizes automatic decision-making and safe operation of a power grid.
The purpose of the invention is realized in the following way:
a smart grid partition network reconstruction method based on multi-agent reinforcement learning comprises the following steps:
dividing a power grid into N areas according to the operation requirement of the power grid, and constructing basic elements of multi-agent reinforcement learning, including environment, agents, states, observation, actions and rewarding functions;
step 2, operating a simulation environment of the power system, and creating an initial operation state data set of the power system;
step 3, constructing a deep neural network model, and training decision-making agents by applying enhanced Inter-Agent Learning (RIAL);
and 4, providing a strategy for power grid reconstruction by using the trained intelligent agent.
Further, the construction of the basic elements of the multi-agent reinforcement learning method in step 1 comprises the following steps:
step 1.1: construct the power system simulation environment as the interaction environment of the agents, providing the various attributes and state values of the power grid as decision references for the agents. When the power system operates safely, i.e., no line is overloaded, the agents do not act. If and only if a line overload exists in the power system, the agents perform a series of consecutive decision actions to restore the system to safe operation. At each operating step, the environment modifies the relevant parameters of the grid according to the actions of all agents, then performs a power flow calculation to update the grid state according to the time-varying patterns of power plant and load power;
step 1.2: construct N zone control agents. Each agent acts as both decision maker and learner, interacting with the environment to gain experience and learning continuously to obtain an optimal strategy. Each agent supervises one area, and through cooperation the agents jointly learn an optimal global strategy;
step 1.3: construct a global state space. The state reflects the operating condition of the power system at a given moment. The grid topology and the active power of the power plants, loads and transmission lines serve as the current system features;
step 1.4: construct an observation space for each agent. An observation reflects the operating condition of the regional grid visible to a given agent at a given moment. The grid topology and the active power of the power plants, loads and transmission lines serve as observables;
step 1.5: construct an environmental action space for each agent. The environmental action of each agent affects the environment and the team reward. The environmental action is one of two operations: switching a line, or switching the bus bar of a substation device. When the grid operates safely, the chosen environmental action is to keep the status quo; once a line limit violation is found, the grid topology is changed to restore grid security. Owing to the operating limits of a real grid, operations on the same line or distribution station must be separated by at least 3 steps, where one step corresponds to 5 minutes in the actual grid;
step 1.6: construct a communication action space for each agent. The communication action of each agent is received by the other agents at the next moment and used as a basis for their decisions, but it does not directly affect the environment or the reward. The communication action is a multidimensional vector whose dimension is determined by the communication capacity and communication requirements between agents in the actual application scenario;
step 1.7: the bonus function includes two cases. The first is a reward function based on line overload in the reconstruction process;
and secondly, a reward function obtained based on whether the system is restored to safe operation or not at the end of the reconstruction of the round.
Bonus function based on line overload: and the sum of the per-unit values of the line overload amounts of all overload lines at the current moment.
Wherein is P i actual The actual active power per unit value, P, of the ith line i threshold And the per unit value of the active power threshold value of the ith line is represented by O, and the per unit value of the active power threshold value of the ith line is represented by a sequence number set of the overload line.
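As a concrete illustration, the overload-based reward of step 1.7 can be sketched in a few lines of Python. The function and argument names are our own, and negating the overload sum so that it acts as a penalty is an assumption; the patent only specifies the per-unit quantities and the set O:

```python
def overload_reward(p_actual, p_threshold):
    """Per-step reward based on line overload: the (negated, by assumption)
    sum of the per-unit overload amounts over the set O of overloaded lines.
    p_actual / p_threshold are per-unit active powers for every line."""
    overload = [a - t for a, t in zip(p_actual, p_threshold)]
    # O = indices with positive overload; negate so overload is penalized
    return -sum(x for x in overload if x > 0)

# Example: lines 0 and 2 exceed their thresholds by 0.2 and 0.1 p.u.
r = overload_reward([1.2, 0.8, 1.1], [1.0, 1.0, 1.0])
```

Lines within their threshold contribute nothing, so the reward is zero exactly when the grid is safe, which matches the condition under which the agents stay idle.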
Further, the construction of the power system operating state data set in step 2 comprises the following steps:
step 2.1: establish a topology model and a power flow calculation model of the power grid according to the grid structure assigned to the agents;
step 2.2: establish a time-varying model of the active power of each power plant and load in the grid from the historical and forecast data of the real grid;
step 2.3: design random network attacks. After the grid is operating safely and stably, randomly disconnect a line, so that the resulting contingency is handed to the agents to resolve.
Further, training with the RIAL algorithm in step 3 proceeds as follows:
all agents are trained simultaneously using a Deep Q-Network (DQN), with two modifications to DQN: first, no experience replay pool is used; second, the environmental and communication actions taken by an agent are fed in as input at the next time step.
The deep Q learning of multiple agents includes the steps of:
step 3.1: establishing a simulation environment of the power system;
step 3.2: determining a state space, an observation space, an environment action space and a communication action space;
step 3.3: determining a neural network structure of the intelligent agent according to the RIAL architecture and initializing neural network parameters;
step 3.4: initializing an environment, and inputting a fault state of a power system as an initial state;
step 3.5: at each step, all agents select their respective actions; upon receiving the joint action, the environment transitions to a new state and produces a reward, and the neural network parameters of the agents are updated according to this transition;
step 3.6: and judging whether the environment reaches a convergence or divergence condition, if not, returning to the step 3.5, otherwise, returning to the step 3.4.
Compared with the prior art, the invention has the following beneficial effects:
the method solves post-fault reconstruction of a complex power grid with a multi-agent approach and needs no explicit model of the complex power system. It learns an optimal reconstruction strategy through the interaction of the agents with the environment and the information exchange among the agents, realizes automatic network reconstruction without relying on expert systems or traditional model-based algorithms, adapts to the uncertainty of wind power, photovoltaics and load, and is more robust against unknown risks. The partitioned multi-agent design trains efficiently and decides quickly.
Drawings
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is a diagram of the RIAL architecture of the present invention;
FIG. 3 is a DQN training flow diagram for multiple agents of the present invention;
FIG. 4 is a schematic illustration of multi-agent communication according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
An intelligent power grid partition automatic decision-making method based on multi-agent reinforcement learning, the general flow chart of which is shown in fig. 1, comprises the following steps:
step 1: the power grid is divided into N areas according to the operation requirement of the power grid, and basic elements of multi-agent reinforcement learning (MARL) are constructed, including environment, agents, states, observation, actions and rewarding functions.
Step 2: the power system simulation environment is run, and an initial running state data set of the power system is created.
Step 3: a deep neural network model is constructed, and reinforced inter-agent learning (RIAL) is applied to train the decision agents.
Step 4: and providing a strategy for power grid control by using the trained agent.
The invention also includes:
1. the basic element construction process of the multi-agent reinforcement learning method in the step 1 is as follows:
(1) Construct the power system simulation environment as the interaction environment of the agents, providing the various attributes and state values of the power grid as decision references for the agents. When the power system operates safely, i.e., no line is overloaded, the agents do not act. If and only if a line overload exists in the power system, the agents perform a series of consecutive decision actions to restore the system to safe operation. At each operating step, the environment modifies the relevant parameters of the grid according to the actions of all agents, then performs a power flow calculation to update the grid state according to the time-varying patterns of power plant and load power.
(2) Construct N zone control agents. Each agent acts as both decision maker and learner, interacting with the environment to gain experience and learning continuously to obtain an optimal strategy. Each agent supervises one area, and through cooperation the agents jointly learn an optimal global strategy.
(3) Construct a global state space. The state reflects the operating condition of the power system at a given moment. The grid topology and the active power of the power plants, loads and transmission lines serve as the current system features.
(4) Construct an observation space for each agent. An observation reflects the operating condition of the regional grid visible to a given agent at a given moment. The grid topology and the active power of the power plants, loads and transmission lines serve as observables.
(5) Construct an environmental action space for each agent. The environmental action of each agent affects the environment and the team reward. The environmental action is one of two operations: switching a line, or switching the bus bar of a substation device. When the grid operates safely (no line in the grid violates its limit), the chosen environmental action is to keep the status quo; once a line limit violation is found, the grid topology is changed to restore grid security. Owing to the operating limits of a real grid, operations on the same line or distribution station must be separated by at least 3 steps, where one step corresponds to 5 minutes in the actual grid.
(6) Construct a communication action space for each agent. The communication action of each agent is received by the other agents at the next moment and used as a basis for their decisions, but it does not directly affect the environment or the reward. The communication action is a multidimensional vector whose dimension is determined by the communication capacity and communication requirements between agents in the actual application scenario.
(7) The reward function covers two cases. The first is a reward based on the amount of line overload during the reconstruction process. The second is a reward based on whether the system has been restored to safe operation when the current reconstruction round ends.
Reward based on line overload: the sum of the per-unit overload amounts of all overloaded lines at the current moment,
overload = Σ_{i∈O} (P_i^actual − P_i^threshold),
where P_i^actual is the per-unit actual active power of the i-th line, P_i^threshold is the per-unit active power threshold of the i-th line, and O is the index set of overloaded lines.
An end condition for a reconstruction round is defined. When the power system is restored to safety, i.e., no line is overloaded, the round is a success, ends, and receives a large reward, e.g., 100. If the power system has still not reached safety after many actions (exceeding the set maximum number of steps), the round is a failure, ends, and receives a large penalty, e.g., -100.
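The end condition can be written as a small terminal-check function. The +/-100 magnitudes follow the example values given above; the function name and the `(done, terminal_reward)` return convention are our own:

```python
def round_outcome(n_overloaded, step, max_steps,
                  success_reward=100.0, fail_penalty=-100.0):
    """End-of-round check for one reconstruction round.
    Returns (done, terminal_reward)."""
    if n_overloaded == 0:      # safety restored: the round succeeds
        return True, success_reward
    if step >= max_steps:      # ran out of actions: the round fails
        return True, fail_penalty
    return False, 0.0          # still overloaded, keep reconstructing

outcome = round_outcome(n_overloaded=0, step=4, max_steps=20)
```

This terminal reward is added on top of the per-step overload-based reward, so the agents are pushed both to reduce overload at every step and to finish within the step budget.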
2. The construction of the power system operating state data set in step 2 comprises the following steps:
(1) Establish a topology model and a power flow calculation model of the power grid according to the grid structure assigned to the agents.
(2) Establish a time-varying model of the active power of each power plant and load in the grid from the historical and forecast data of the real grid.
(3) Design random network attacks. After the grid is operating safely and stably, randomly disconnect a line (simulating accidents that may occur in a real grid, such as a burnt cable or man-made damage), so that the resulting contingency is handed to the agents to resolve.
3. The deep neural network model in step 3 is constructed as follows:
each agent contains two recurrent neural networks (RNNs), corresponding to the environmental action and the communication action respectively. The input of the environmental-action RNN is the agent's own observation at the current moment, the messages received from the other agents at the previous moment, its own environmental action at the previous moment, and its own individual number; its output is the Q function over the agent's environmental actions and the selected environmental action at the current moment. The input of the communication-action RNN is the agent's own observation at the current moment, the messages received from the other agents at the previous moment, its own communication action at the previous moment, and its own individual number; its output is the Q function over the agent's communication actions and the selected communication action. Each RNN consists of a GRU layer, a BN layer, a ReLU activation layer and a fully connected layer.
The RIAL architecture is shown in FIG. 2, where i denotes an agent, i' denotes the other agents, o_t^i denotes the observation of agent i at time t, m_{t-1}^{i'} denotes the communication actions received from the other agents at time t-1, a is the environmental action, and Q is the value function.
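The network input described above (own observation, last-step messages, own previous action, own index) can be assembled as follows. The flat concatenation and the one-hot encodings are illustrative assumptions; the patent does not fix an exact encoding:

```python
def one_hot(index, size):
    """One-hot encoding used here (by assumption) for discrete inputs."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def rial_network_input(obs, incoming_msgs, prev_action,
                       agent_id, n_actions, n_agents):
    """Input vector of an agent's action RNN: its observation o_t^i, the
    messages m_{t-1}^{i'} from the other agents, its own action at the
    previous moment, and its own individual number."""
    x = list(obs)                                # o_t^i
    for m in incoming_msgs:                      # m_{t-1}^{i'} from each other agent
        x.extend(m)
    x.extend(one_hot(prev_action, n_actions))    # own action at time t-1
    x.extend(one_hot(agent_id, n_agents))        # own individual number
    return x

x = rial_network_input(obs=[0.9, 1.1], incoming_msgs=[[0.0, 1.0]],
                       prev_action=2, n_actions=3, agent_id=0, n_agents=2)
```

The communication-action RNN takes the same shape of input, with the agent's previous communication action in place of its previous environmental action.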
4. The training method by using RIAL algorithm in step 3 is:
all agents were trained simultaneously using deep Q learning (DQN), but there are two modifications to DQN: first, no experience reuse pool is used; second, the environmental actions and communication actions taken by the agent are taken as input to the next time step.
The deep Q learning of multiple agents includes the steps of:
step 1: establishing a simulation environment of the power system;
step 2: determining a state space, an observation space, an environment action space and a communication action space;
step 3: determining a neural network structure of the intelligent agent according to the RIAL architecture and initializing neural network parameters;
step 4: initializing an environment, and inputting a fault state of a power system as an initial state;
step 5: at each step, all agents select their respective actions; upon receiving the joint action, the environment transitions to a new state and produces a reward, and the neural network parameters of the agents are updated according to this transition;
step 6: and judging whether the environment reaches a convergence or divergence condition, if not, returning to the step 5, otherwise, returning to the step 4.
The DQN training flow is as shown in figure 3. The communication process of the multi-agent is shown in fig. 4.
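The one-step-delayed communication of FIG. 4 (each message chosen at time t is received by every other agent at time t+1) can be sketched as a single exchange round; the function name and list-based message representation are our own:

```python
def exchange_messages(outgoing):
    """One communication round: outgoing[i] is agent i's communication
    action at time t; the returned inbox[i] holds the messages agent i
    will receive from all OTHER agents at time t+1."""
    n = len(outgoing)
    return [[outgoing[j] for j in range(n) if j != i] for i in range(n)]

inbox = exchange_messages(["m0", "m1", "m2"])
```

Because delivery is delayed by one step, messages never influence the environment or the reward at the step they are sent, matching the definition of the communication action space above.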
Claims (1)
1. A smart grid partition network reconstruction method based on multi-agent reinforcement learning is characterized by comprising the following steps:
step 1: dividing the power grid into N areas according to the operation requirement of the power grid, and constructing basic elements of multi-agent reinforcement learning, including environment, agents, states, observation, actions and rewarding functions;
step 1.1: constructing an interaction environment taking a power system simulation environment as an intelligent agent, and providing decision reference for the intelligent agent for various attributes and state values of a power grid; when the power system runs safely, i.e. no overload line exists, the intelligent agent is not operated; if and only if the line overload exists in the power system, the intelligent agent performs a series of continuous decision behaviors, so that the power system is restored to safe operation; each time a step length is operated, the environment modifies relevant parameters in the power grid according to actions of all intelligent agents, and then the power flow calculation is carried out to update the power grid state according to time-varying rules of power plants and load power;
step 1.2: constructing N regional control intelligent agents; the intelligent agent is used as a decision maker and a learner, interacts with the environment to obtain experience, and continuously learns to obtain an optimal strategy; each intelligent agent is responsible for supervising an area, and the intelligent agents continuously learn an optimal global strategy through cooperation;
step 1.3: constructing a global state space; the state reflects the running state of the power system at a certain moment; active power of a power grid topological structure, a power plant, a load and a transmission line is used as a current system characteristic;
step 1.4: constructing an observation space for each agent; observing and reflecting the operation state of the regional power grid which can be observed by a certain agent at a certain moment; taking the power grid topological structure, a power plant, a load and the active power of a transmission line as observables;
step 1.5: building an environmental action space for each agent; the environmental actions of each agent can affect the environment and team rewards; the environmental action is selected from one of the following two actions to be performed: switching a line; switching bus bars for a device of a substation; when the power grid runs safely, the environment action is selected to be kept as it is; once the line out-of-limit is found, changing the topology of the power grid to restore the power grid security; according to the operation limit of the actual power grid, the operation of the same line or power distribution station needs to be separated by at least 3 step sizes, and one step size corresponds to 5 minutes in the actual power grid;
step 1.6: constructing a communication action space for each agent; the communication action of each intelligent agent can be received by other intelligent agents at the next moment and used as the basis of decision, but the environment or rewards are not directly influenced; the communication action is a multidimensional vector, and the dimension of the multidimensional vector is determined by the communication capacity and the communication requirement between the intelligent agents in the actual application scene;
step 1.7: the rewarding function comprises two cases, namely, a rewarding function based on the line overload in the reconstruction process and a rewarding function obtained based on whether the system recovers safe operation or not when the reconstruction of the round is finished;
in the reconstruction process, a reward function based on the line overload amount is the sum of the line overload amount per unit values of all overload lines at the current moment;
wherein P_i^actual is the per-unit actual active power of the i-th line; P_i^threshold is the per-unit active power threshold of the i-th line; and O is the index set of overloaded lines;
step 2: operating a power system simulation environment, and creating an initial operating state data set of the power system;
step 2.1: establishing a topology model and a power flow calculation model of the power grid according to the power grid structure assigned to the agents;
step 2.2: establishing a time-varying rule model of active power of each power plant and load in the power grid by using the historical data and the forecast data of the real power grid;
step 2.3: designing random network attack; randomly disconnecting a line after the power grid runs safely and stably, so that the creation event is handed over to an intelligent agent for solving;
step 3: constructing a deep neural network model, and training decision-making agents by applying enhanced inter-agent learning;
all agents were trained simultaneously using deep Q network learning, and there were two modifications to the deep Q network: first, no experience reuse pool is used; secondly, taking the environmental action and the communication action taken by the intelligent agent as the input of the next time step;
the deep Q network learning of multiple agents includes the steps of:
step 3.1: establishing a simulation environment of the power system;
step 3.2: determining a state space, an observation space, an environment action space and a communication action space;
step 3.3: determining a neural network structure of the intelligent agent according to the RIAL architecture and initializing neural network parameters;
step 3.4: initializing an environment, and inputting a fault state of a power system as an initial state;
step 3.5: at each step, all agents select their respective actions; upon receiving the joint action, the environment transitions to a new state and produces a reward, and the neural network parameters of the agents are updated according to this transition;
step 3.6: judging whether the environment reaches a convergence or divergence condition, if not, returning to the step 3.5, otherwise, returning to the step 3.4;
step 4: and providing a strategy for power grid reconstruction by using the trained agent.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111364422.3A CN114123178B (en) | 2021-11-17 | 2021-11-17 | Multi-agent reinforcement learning-based intelligent power grid partition network reconstruction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111364422.3A CN114123178B (en) | 2021-11-17 | 2021-11-17 | Multi-agent reinforcement learning-based intelligent power grid partition network reconstruction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114123178A CN114123178A (en) | 2022-03-01 |
CN114123178B true CN114123178B (en) | 2023-12-19 |
Family
ID=80396390
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111364422.3A Active CN114123178B (en) | 2021-11-17 | 2021-11-17 | Multi-agent reinforcement learning-based intelligent power grid partition network reconstruction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114123178B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114662982B (en) * | 2022-04-15 | 2023-07-14 | 四川大学 | Multistage dynamic reconstruction method for urban power distribution network based on machine learning |
CN114925850B (en) * | 2022-05-11 | 2024-02-20 | 华东师范大学 | Deep reinforcement learning countermeasure defense method for disturbance rewards |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110945542A (en) * | 2018-06-29 | 2020-03-31 | 东莞理工学院 | Multi-agent deep reinforcement learning agent method based on smart power grid |
CN112186799A (en) * | 2020-09-22 | 2021-01-05 | 中国电力科学研究院有限公司 | Distributed energy system autonomous control method and system based on deep reinforcement learning |
CN112615379A (en) * | 2020-12-10 | 2021-04-06 | 浙江大学 | Power grid multi-section power automatic control method based on distributed multi-agent reinforcement learning |
CN112927505A (en) * | 2021-01-28 | 2021-06-08 | 哈尔滨工程大学 | Signal lamp self-adaptive control method based on multi-agent deep reinforcement learning in Internet of vehicles environment |
CN113097994A (en) * | 2021-03-15 | 2021-07-09 | 国网浙江省电力有限公司 | Power grid operation mode adjusting method and device based on multiple reinforcement learning agents |
CN113363998A (en) * | 2021-06-21 | 2021-09-07 | 东南大学 | Power distribution network voltage control method based on multi-agent deep reinforcement learning |
CN113392935A (en) * | 2021-07-09 | 2021-09-14 | 浙江工业大学 | Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism |
CN113452026A (en) * | 2021-06-29 | 2021-09-28 | 华中科技大学 | Intelligent training method, evaluation method and system for weak evaluation of power system |
WO2023093537A1 (en) * | 2021-11-26 | 2023-06-01 | 南京邮电大学 | Multi-end collaborative voltage treatment method and system for power distribution network with high-penetration-rate photovoltaic access, and storage medium |
WO2023109699A1 (en) * | 2021-12-17 | 2023-06-22 | 深圳先进技术研究院 | Multi-agent communication learning method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200119556A1 (en) * | 2018-10-11 | 2020-04-16 | Di Shi | Autonomous Voltage Control for Power System Using Deep Reinforcement Learning Considering N-1 Contingency |
2021-11-17: application CN202111364422.3A filed in China; granted as CN114123178B, status Active.
Non-Patent Citations (2)
Title |
---|
Research on Group Confrontation Strategy Based on Deep Reinforcement Learning; Liu Qiang; Jiang Feng; Intelligent Computer and Applications (Issue 05); full text *
Research on Distribution Network Reconfiguration Technology for Shipboard Medium-Voltage DC Power Systems; Liu Sheng; Wang Tianqi; Zhang Lanyong; Ship Science and Technology (Issue 01); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114123178B (en) | Multi-agent reinforcement learning-based intelligent power grid partition network reconstruction method | |
CN112615379B (en) | Power grid multi-section power control method based on distributed multi-agent reinforcement learning | |
CN114217524B (en) | Power grid real-time self-adaptive decision-making method based on deep reinforcement learning | |
CN104934968A (en) | Multi-agent based distribution network disaster responding recovery coordinate control method and multi-agent based distribution network disaster responding recovery coordinate control device | |
CN110336270B (en) | Updating method of transient stability prediction model of power system | |
CN112701681B (en) | Power grid accidental fault safety regulation and control strategy generation method based on reinforcement learning | |
CN116454926B (en) | Multi-type resource cooperative regulation and control method for three-phase unbalanced management of distribution network | |
CN114666204B (en) | Fault root cause positioning method and system based on causal reinforcement learning | |
CN113761791A (en) | Power system automatic operation method and device based on physical information and deep reinforcement learning | |
CN112327098A (en) | Power distribution network fault section positioning method based on low-voltage distribution network comprehensive monitoring unit | |
Kodama et al. | Multi‐agent‐based autonomous power distribution network restoration using contract net protocol | |
CN108270216A (en) | A kind of Complicated Distribution Network fault recovery system and method for considering multiple target | |
CN115133540B (en) | Model-free real-time voltage control method for power distribution network | |
CN116151562A (en) | Mobile emergency vehicle scheduling and power distribution network toughness improving method based on graphic neural network reinforcement learning | |
CN108521345A (en) | A kind of information physical collaboration countermeasure for the isolated island micro-capacitance sensor considering communication disruption | |
CN114417710A (en) | Overload dynamic decision generation method and related device for power transmission network | |
Yang et al. | Control method of power grid topology structure based on reinforcement learning | |
Zhang et al. | Reinforcement Learning based Optimization of Line Switching off during Cascading failures in Power Grids | |
CN117791560A (en) | Active power distribution network elastic self-healing method considering dynamic micro-grid and controller | |
CN113725853B (en) | Power grid topology control method and system based on active person in-loop reinforcement learning | |
CN113837654B (en) | Multi-objective-oriented smart grid hierarchical scheduling method | |
CN117914001B (en) | Power system, fault studying and judging method, device, equipment and medium | |
CN115660324B (en) | Power grid multi-section out-of-limit regulation and control method and system based on graph reinforcement learning | |
CN117526309A (en) | Power distribution network recovery method and device, electronic equipment and storage medium | |
CN117057623A (en) | Comprehensive power grid safety optimization scheduling method, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||