CN111200285B

CN111200285B - Micro-grid hybrid coordination control method based on reinforcement learning and multi-agent theory

Info

Publication number: CN111200285B
Application number: CN202010089205.7A
Authority: CN
Inventors: 窦春霞; 张立国
Original assignee: Yanshan University
Current assignee: Yanshan University
Priority date: 2020-02-12
Filing date: 2020-02-12
Publication date: 2023-12-19
Anticipated expiration: 2040-02-12
Also published as: CN111200285A

Abstract

The invention discloses a micro-grid hybrid coordination control method based on reinforcement learning and multi-agent theory, which comprises the following steps: designing a transition voltage layer control strategy based on a voltage layering mode, and designing a double-energy-storage role-division control strategy, wherein when an energy storage unit works in a voltage stabilizing mode, two energy storages work separately; when the auxiliary energy storage is needed to continuously absorb power or supplement power, the two energy storage working modes are converted into cooperative charge/discharge; constructing an action space and a state space based on Q-Learning: designing a reinforcement learning control framework based on multiple agents: basic updating rules comprising design state-action pairs and selecting corresponding cost functions; designing a basic action selection mechanism and a return value strategy: the method comprises the steps of designing a selection strategy adopted by a system in an initial state and reporting values in various states; designing a reinforcement learning algorithm flow: and designing a proper algorithm flow based on the strategy to realize the control strategy.

Description

Micro-grid hybrid coordination control method based on reinforcement learning and multi-agent theory

Technical Field

The invention relates to the field of intelligent power grid control, in particular to a micro power grid hybrid coordination control method based on reinforcement learning and multi-agent theory.

Background

With the rapid development of economy, the energy consumption of China is gradually increased year by year, wherein the total consumption of non-renewable energy sources such as fossil energy sources is rapidly increased. The power supply of China is mainly from thermal power generation, but with the transitional exploitation of non-renewable energy sources such as fossil energy sources and the like and the increasing of negative effects on the environment in the traditional power generation process, the research of renewable energy sources such as wind, light, water and the like in China and even worldwide gradually brings up a schedule. The development and utilization of green clean energy not only can make a certain contribution to environmental protection, but also can provide a new energy supply form for economic development. Therefore, the development and utilization of clean energy become important tasks for the development of energy in China, and wind power generation and photovoltaic power generation in China are developed at a relatively rapid speed.

In recent years, compared with the traditional large-scale centralized power generation and distribution mode, the micro-grid based on the distributed power generation technology has gained wide attention and application at home and abroad due to the outstanding advantages of short construction period, less investment, flexible installation place, reliable power supply, easy maintenance, high energy utilization rate, small environmental pollution and the like. The micro-grid combines a distributed power supply, a load, an energy storage device, a control device and the like to form a single controllable unit, and simultaneously supplies electric energy and heat energy to a user. Advanced information technology, control technology and power technology are integrated in the micro-grid, stable power supply can be provided, diversified load demands are met, and maximization of energy benefit, economic benefit and environmental benefit can be guaranteed. At the same time, the micro-grid can also provide electrical support into the regular grid at the necessary moment. Micro-grids will be an integral part of the future grid construction process. In China, the distributed power generation technology is greatly promoted, the method is a specific embodiment of a sustainable development road, and the method is a powerful support for adjusting energy structures, solving electricity utilization problems in remote areas and protecting the environment.

With the rapid development of distributed power generation technology, the problem to be solved is gradually revealed. The micro-grid after being connected with the large power grid can easily meet the load demand, but when the micro-grid is in island operation, an effective control mechanism is needed in order to ensure the safe and stable operation of the system under a plurality of connection conditions. Firstly, the energy supply of distributed power generation mainly depends on renewable energy sources such as wind, light and the like, but the inherent characteristics of intermittence, uncontrollability and the like which depend on natural factors can cause certain instability to the energy supply of a power grid. Therefore, in order to ensure the stable operation of the load, the distributed energy source must be reasonably and effectively controlled, so that the distributed energy source can operate in different modes according to real-time natural conditions and load requirements. Moreover, if the system is frequently switched between different operation modes, the control difficulty is also increased, the operation stability is reduced, and how to reduce the mode switching times under the condition of ensuring the stable operation of the system is also a considerable problem. In addition to the above-mentioned different operation mode conversion problems, the control problems of important power indexes such as voltage, power and the like are not neglected.

The problem of controlling the multi-mode conversion of a micro grid based on distributed power generation technology, multi-agent system (MAS) technology is certainly one of the most effective and widely used means. So far, many control methods for the multi-mode conversion of the micro-grid based on the distributed power generation technology have also been studied. However, most people focus on MAS-based logic switching control or continuous dynamic regulation issues and do not adequately consider the switching conditions and switching behavior of the microgrid system. Therefore, how to reasonably plan the switching conditions and switching behaviors and reduce the number of mode switching times is a considerable problem under the condition of ensuring the stable operation of the system. In the detection schemes of bus voltage, most of the adopted ideas are to directly divide the voltage into three or five grades, when the bus voltage rises or falls to a certain grade due to a certain factor, the system adopts a control measure to maintain the bus voltage and the stability of the system, but the condition that the bus voltage just fluctuates between certain two grades is not considered, and the system does not stop converting the control measure. For energy storage units, there are generally two uses: the first is that when the renewable energy generating capacity is not supplied sufficiently, the energy storage unit is responsible for providing electric energy output for the load; and the other is to call the energy storage unit to 'peak clipping and valley filling' when the bus voltage fluctuates to a certain extent, so as to stabilize the bus voltage. However, most articles adopt a single energy storage scheme, the single energy storage scheme needs to continuously switch charging and discharging when stabilizing voltage fluctuation, the performance is general, the energy storage life can be greatly lost, and the other part adopts improved double energy storage control on the basis, but the double energy storage control strategy is single.

Disclosure of Invention

According to the problems existing in the prior art, the invention discloses a micro-grid hybrid coordination control method based on reinforcement learning and multi-agent theory, which is characterized in that a transition voltage layer control strategy based on a voltage layering control mode is designed, and a transition layer is added between adjacent voltage layers;

designing a double-energy-storage role-division control strategy, and separating two energy storages to work when the energy storage unit works in a voltage stabilizing mode; when the auxiliary energy storage is needed to continuously absorb power or supplement power, the two energy storage working modes are converted into cooperative charge/discharge; .

Constructing an action space and a state space based on Q-Learning: analyzing the respective necessary states and actions of the busbar voltage detection unit, the energy storage unit, the photovoltaic power generation unit, the wind power generation unit, the diesel power generation unit and the load control unit, and acquiring respective action spaces and state spaces;

designing a reinforcement learning control framework based on multiple agents: basic updating rules comprising design state-action pairs and selecting corresponding cost functions;

designing a basic action selection mechanism and a return value strategy: the method comprises the steps of designing a selection strategy adopted by a system in an initial state and reporting values in various states;

designing a reinforcement learning algorithm flow: and designing a proper algorithm flow based on the strategy to realize the control strategy.

The reinforcement learning control framework based on the multiple agents is designed in the following way: comparing the state information acquired currently by the system with the state information acquired at the previous moment, if the state is the same, not generating an action instruction, and continuing to acquire the state information at the next moment;

after all states and actions are determined, the system generates a multidimensional Q matrix, the Q value of the state of the reinforcement learning model system is obtained by learning iteration to approach the optimal action value function,

the basic update rule for the state-action pair is as follows:

where s is the current state of the agent, a is the action taken based on the current state, Q (s _t ,a _t ) Representing that the agent is in state s _t Next action instruction a selected by the set learning strategy _t The method comprises the steps of carrying out a first treatment on the surface of the Beta is a decay factor whose size determines whether the selected strategy is prone to current rewards or future rewards, R is a state and behavior based rewards;

determining the conditions of executing actions, changing the state of the intelligent agent after instructions and the previous state through the reward function, determining the rewarding or punishment degree which is applied to the actions based on the previous state, and maximizing the sum of expected values of the reward function by trying to combine all actions allowed based on the current state, wherein the reward function is as follows:

wherein:representing the reward of the system in the next j steps in the time t, wherein pi is the basic action selection mechanism and the return value strategy of the strategy selected by the system are designed in the following way:

setting the enabling priority of the unit: the load control unit cuts off loads, wherein wind power/photovoltaic power generation, energy storage power supply, diesel engine set power supply and load control unit;

and the three conditions of the renewable energy power generation supply quantity just meeting the requirement, the renewable energy power generation supply quantity being larger than the requirement and the renewable energy power generation supply quantity being smaller than the requirement are designed according to the conditions.

By adopting the technical scheme, the micro-grid hybrid coordination control method based on reinforcement learning and multi-agent theory provided by the invention takes the maximum value of total return expectations acquired by a plurality of micro-grid agent units as a target through introducing reinforcement learning algorithm, and autonomously learns iterative action cost functions according to historical data and current states. The optimal action strategy of the multi-agent mixed coordination control is the finally obtained converged state-action table, and the mixed coordination control rule in the controller is further optimized by the action cost function. And a transition voltage layer control strategy based on a voltage layering technology and a double-energy-storage coordination control strategy are designed to reduce unstable factors caused by the control strategy and stabilize bus voltage.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a voltage detection class diagram incorporating a transition layer;

FIG. 2 is a model diagram of a microgrid system;

FIG. 3 is a diagram of the environment, agent and control system relationship;

fig. 4 is a multi-agent coordinated control framework based on reinforcement learning.

Detailed Description

In order to make the technical scheme and advantages of the present invention more clear, the technical scheme in the embodiment of the present invention is clearly and completely described below with reference to the accompanying drawings in the embodiment of the present invention:

the micro-grid hybrid coordination control method based on reinforcement learning and multi-agent theory as shown in fig. 4 comprises the following steps:

step 1: in order to optimally control the bus voltage, the voltage layering control strategy divides the bus voltage into 6 detections and the likeStage: (-, 0.95U) _ref ]，(0.95U _ref ，0.96U _ref ]，(0.96U _ref ，0.98U _ref ]，(0.98U _ref ，1.02U _ref ]，(1.02U _ref ，1.05U _ref ]，(1.05U _ref ，—]Wherein U is _ref Is the reference voltage.

Due to the renewable energy generation and the randomness of the load demands, the bus voltage can fluctuate to a certain extent. When the bus voltage jumps from one range to another, the detection unit transmits the state change signal to the hybrid control unit. To prevent frequent transmission of state change signals by the detection unit due to bus voltage jumps back and forth between two voltage detection levels caused by uncertainty in power generation and demand, a transition voltage layer is innovatively added between the voltage detection levels. As shown in fig. 1, when the bus voltage fluctuates in a certain voltage layer area and does not trigger a state change signal, the lower boundary of the upper layer area and the upper boundary of the lower layer area contained in the voltage layer area are in an inactive state, and when the bus voltage continues to rise (fall) to exceed the upper boundary (lower boundary) of the current layer area, the voltage layer area fails, the upper voltage layer area (the lower voltage layer area) is activated, the detection unit detects the bus voltage state signal fluctuation, and specific state change information is transmitted to the hybrid control unit. Event-triggered functions (event-triggered functions, ETFs) of the voltage detection units are formed by ETF (U _s ) The expression is:

wherein Sgn ()' is a sign function; 1 (t) step function; u is the current bus voltage;and->The upper and lower limits of the voltage layer in which the current voltage is located are respectively represented, and the voltage range of each layer is defined above; when ETF (U) _s ) When=1, the bus voltage is shifted up by one layer region, when ETF (U _s ) When= -1, the bus voltage is allowed to move down one layer region, ETF (U _s ) When=0, the voltage layer area is not changed; t is t _s Is the trigger time.

Step 2: in order to exert the capacity advantage of the two groups of energy storage to the maximum extent and maintain the service life of the energy storage, the energy storage control unit performs role-division control on the two groups of energy storage. When the bus voltage fluctuates due to unstable power generation of renewable energy sources or other factors, the energy storage works in a voltage stabilizing working mode (comprising two working modes of power absorption and power compensation), the double energy storage works separately, one group of energy storage is specially used for power absorption, absorbs redundant electric energy of the bus voltage and carries out peak clipping, the other group of energy storage is specially used for power compensation, and the bus is supplied with energy to inhibit the reduction of the bus voltage and carries out valley filling; when the energy generation amount of the renewable energy source continuously rises (falls) and the energy storage auxiliary continuously absorbs power (supplementing power), the two energy storage working modes are converted into cooperative charge/discharge. The event triggering function of the control strategy is defined by ETF (E _s ) The expression is:

ETF(E _s )＝Sgn[I _s ]×[1(t)-1(t-t _s )](5)

I _s for storing the current between the bus and the energy, I _s > 0 represents energy storage charging, whereas I _s < 0 denotes energy storage discharge, t _s Is the trigger time. When ETF (E) _s ) When=1, the two stored energy are in a cooperative charging mode, and when ETF (E _s ) When the energy storage device is in the state of the = -1, the two energy storage devices are in a cooperative discharge working mode, and when the energy storage device is in the state of the 0, the energy storage devices are in a angular color working mode (voltage stabilization). In addition, in order to maintain the service life of the energy storage and better enable the energy storage to provide service for the system, the two groups of energy storage SOCs are set to have no excessively large difference value, and when the energy storage is controlledThe unit detects the SOC ₁ -SOC ₂ When the energy storage capacity is more than 0.3, namely the two energy storage capacities are different by 0.3, the roles of the two energy storage are switched. In summary, the control mode can eliminate time loss caused by frequent switching of charge and discharge modes in single energy storage, further improve the reaction speed of the energy storage unit, stabilize busbar voltage fluctuation more quickly and effectively prolong the service life of the energy storage.

Step 3: the control scheme of each unit is designed as follows based on the micro grid system model diagram shown in fig. 2.

1) Bus voltage detection unit: in order to facilitate optimal control of the bus voltage, the bus voltage has been divided into 6 detection levels (-, 0.95U) according to the conventional voltage hierarchical control strategy in embodiment (1) _ref ]，(0.95U _ref ，0.96U _ref ]，(0.96U _ref ，0.98U _ref ]，(0.98U _ref ，1.02U _ref ]，(1.02U _ref ，1.05U _ref ]，(1.05U _ref ，—]Corresponding to the above, six states are provided, and the state space comprises: ultralow, too low, normal, too high; but its state change factor depends on other units or natural factors so it has no action space.

2) An energy storage unit: to maintain the stored energy life, the working space limiting its capacity is 0.1 SOC-0.9 SOC, i.e. if the stored energy is used as a discharging or charging unit, it is disconnected from operation when its capacity is less than 0.1SOC or greater than 0.9SOC, thus it has three capacity states of full (chargeable), full (dischargeable) and chargeable and dischargeable. And because the energy storage is stabilized, the charge/discharge is cooperated and waits for four working states, the state space comprises: the system is characterized by comprising the following steps of electricity deficiency (chargeable), electricity full (dischargeable), chargeable and dischargeable, voltage stabilizing, cooperative charging, cooperative discharging and waiting; each working state has the condition that the working state can be converted to other states, so twelve working states are converted, namely twelve actions are converted, and the action space comprises: voltage stabilization, cooperative charge, voltage stabilization, cooperative discharge, voltage stabilization, wait, cooperative charge, voltage stabilization, cooperative charge, cooperative discharge, cooperative charge, wait, cooperative discharge, voltage stabilization, cooperative discharge, cooperative charge, cooperative discharge, wait, voltage stabilization, wait, cooperative charge, wait, and cooperative discharge.

3) Photovoltaic power generation unit: because the maximum power of the photovoltaic power generation completely depends on the quality of natural conditions, when the power generation power is smaller than the load power, the working mode of the power generation unit adopts a maximum power point tracking (MTTP) mode, so that the power generation unit provides more power as much as possible, and the power generation cost is reduced; when the generated power is larger than the load power, the working mode adopts a constant power mode, so that the normal operation of the load is ensured; when the generated power is too low, the power generation unit is out of operation. Thus, there are three working states in total, the state space comprising: constant power mode, MTTP mode, stop running; different from the energy storage unit, the photovoltaic power generation unit has four kinds of operating conditions of conversion altogether, and the action space includes: constant power, MTTP, stop operation, MTTP, constant power.

4) A wind power generation unit: the photovoltaic power generation unit has three working states and four actions, namely, a state space comprises: constant power mode, MTTP mode, stop running; the action space includes: constant power, MTTP, stop operation, MTTP, constant power.

5) Diesel power generation unit: when renewable energy sources are insufficient in power generation and the stored energy electric quantity is consumed, the diesel generator set is started to supply power, and normal operation of the load is maintained. The cell thus has two states, two actions. State space: the operation and stop are carried out; action space: starting and stopping.

6) Load control unit: the load control unit is responsible for controlling the supply of the unimportant loads. When all the generated power is smaller than the load power, the load control unit cuts off the unimportant loads one by one so as to maintain the stable bus voltage and ensure the normal operation of the important loads. The unit has three states of all online, partial cutting and all cutting, and corresponds to two actions of cutting load and online load. State space: all on-line, partially resected, and all resected; action space: cutting off load and loading on line.

Step 4: reinforcement learning is an unsupervised learning method, in which an agent repeatedly interacts with the environment to learn continuously, and selects an optimal or near optimal action to achieve a system objective or maintain a system optimal state, and the basic model generally includes two parts of the environment and the system as shown in fig. 3.

The Q value of the system state of the reinforcement learning model of the control module approaches the optimal action value function through learning iteration, and has little correlation with the strategy being followed. After all states and actions are determined, the system generates a multidimensional Q matrix. It should be noted that, the collection and processing of the state information and the issuing of the next action command by the system are an uninterrupted process, and the action command has coverage, that is, the agent must be ready to accept the action change command transmitted by the system at any time. In order to reduce the calculation amount of system data processing, the state information acquired currently by the system is compared with the state information acquired at the previous moment, if the states are the same, no action instruction is generated, and the acquisition of the state information at the next moment is continued. The basic update rule for the state-action pair is as follows:

where s is the current state of the agent, a is the action taken based on the current state, Q (s _t ,a _t ) Then it indicates that the agent is in state s _t Next action instruction a selected by the set learning strategy _t The method comprises the steps of carrying out a first treatment on the surface of the Beta is a decay factor whose magnitude determines whether the selected strategy is more prone to current rewards or future rewards, and R is a state and behavior based reward, the result of which is given by equation (8). During the learning iteration of the agents, the rewards earned by each agent depend on both the own actions and the actions of the other agents caused by the own actions. A multi-agent coordinated control framework based on reinforcement learning is shown in fig. 4.

The system decides the rewarding or punishment degree which is applied to the action based on the previous state by judging the state of the agent after executing the action changing instruction and the state before through the rewarding function, so the system maximizes the sum of expected values of the rewarding function by trying to combine all actions allowed based on the current state, and the rewarding function is as follows:

wherein:representing the reward of the system for the next j steps in time t, pi being the strategy selected by the system.

Step 5: on the premise of meeting the voltage safety evaluation index, the control system regulates and controls each power generation unit and the load control unit based on the state information of the bus voltage, and takes the problems of cost increase caused by energy storage and a diesel engine set and environmental pollution caused by the diesel engine set into consideration, the starting priority of each unit is ordered as follows: wind power/photovoltaic power generation (energy storage and voltage regulation) > energy storage and power supply > diesel unit power supply > load control unit cuts load.

When the renewable energy power generation supply quantity just meets the requirement:

the bus voltage is (0.98U) _ref ，1.02U _ref ]The range fluctuates, the power generation source mainly depends on renewable energy sources such as wind and light to supply power, the working mode is a maximum power point tracking mode, and the double energy storage units provide power compensation and real-time voltage stabilization;

when the renewable energy power generation supply amount is larger than the demand:

when wind power or photovoltaic power generation is sufficient, the bus voltage rises to (1.02U) _ref ，1.05U _ref ]Range (both default across the voltage buffer layer below). Energy storage action: the energy storage no longer provides power compensation, the working mode is switched to the charging mode, the self-power is supplemented, and the bus voltage is stabilized within the range (namely, the return value is r=r when the system selects the action ₊ The return value when selecting other actions is r=r _- The action return values expressed below are positive, and other action return values not expressed are negative, and are not described in detail;

if the bus voltage continues to riseHigh reach (1.05U _ref ，—]Range, renewable energy action: the working mode of the wind power or photovoltaic power generation unit is converted from a maximum power point tracking mode to a constant power mode operation, and meanwhile, the bus voltage is stabilized in the range;

when the renewable energy power generation supply quantity is smaller than the demand:

when the renewable energy source power generation power is reduced, the bus voltage is reduced to (0.98U) _ref ，1.02U _ref ]During the range, the energy storage action: the energy storage unit enables power compensation and stabilizes the bus voltage in the range;

if renewable energy generation power continues to drop, bus voltage drops to (0.96U) _ref ，0.98U _ref ]And when the energy storage is in the range, the energy storage action is as follows: the energy storage starts to supply energy to maintain the voltage stability of the bus;

when the energy storage electric quantity is insufficient, the energy storage action is as follows: and the stored energy exits the operation. The bus voltage continues to drop to (0.95U) _ref ，0.96U _ref ]The diesel generating set acts: starting a diesel generating set;

if the load is large, and the power generated by the diesel generator set is difficult to meet, the bus voltage enters (-) 0.95U _ref ]Range, load controller action: the load controller cuts off the loads one by one to maintain the voltage at 0.95U _ref In the vicinity (note that the load controller adopts partition management for the cleavable load, each partition is composed of a plurality of loads, the load capacity is approximately the same, and the minimum unit of load shedding by the load controller is "zone"); when the renewable energy source is recovered, the bus voltage rises to (0.96U) _ref ，0.98U _ref ]When the load controller acts: the load controller attempts to ablate the load line by line (the process is reversed from ablation).

Step 6: based on the above strategies, a suitable algorithm flow is designed to implement the control strategy:

1) Initializing the setting: reading a system structure, loading a state-action matrix, setting an objective function and a reward function, and initializing a Q table;

2) Parameter setting: setting an action strategy and an attenuation factor beta;

3) Detecting a state s in a current environment _t ；

4) Judging the current state s _t And the last state s _t-1 If so, carrying out the step 5); otherwise, returning to the step 3

5) Selecting a state s according to an action policy _t Action a corresponding to _t ；

6) Execution a _t Returning the prize value R and the next state s _t+1 ；

7) And (3) updating the Q value according to the formula (5), storing the data into a knowledge base, and returning to the step (3).

The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art, who is within the scope of the present invention, should make equivalent substitutions or modifications according to the technical scheme of the present invention and the inventive concept thereof, and should be covered by the scope of the present invention.

Claims

1. A micro-grid hybrid coordination control method based on reinforcement learning and multi-agent theory is characterized by comprising the following steps:

designing a transition voltage layer control strategy based on a voltage layering control mode, and adding a transition layer between adjacent voltage layers;

designing a double-energy-storage role-division control strategy, and separating two energy storages to work when the energy storage unit works in a voltage stabilizing mode; when the auxiliary energy storage is needed to continuously absorb power or supplement power, the two energy storage working modes are converted into cooperative charge/discharge; the two energy storage separation works comprise that one group of energy storage is specially used for power absorption, absorbing redundant electric energy of bus voltage, carrying out peak clipping, and the other group of energy storage is specially used for power compensation, supplying energy to the bus to inhibit the reduction of the bus voltage, and carrying out valley filling;

2. The reinforcement learning and multi-agent theory-based micro-grid hybrid coordination control method of claim 1, further characterized by: the reinforcement learning control framework based on the multiple agents is designed in the following way: comparing the state information acquired currently by the system with the state information acquired at the previous moment, if the state is the same, not generating an action instruction, and continuing to acquire the state information at the next moment;

the basic update rule for the state-action pair is as follows:

wherein s is _t A is the current state of the intelligent agent _t For actions taken in accordance with the current state, Q (s _t ,a _t ) Representing that the agent is in state s _t Next action instruction a selected by the set learning strategy _t The method comprises the steps of carrying out a first treatment on the surface of the Beta is a decay factor whose size determines whether the selected strategy is prone to current rewards or future rewards, R is a state and behavior based rewards;