CN111242443B - Deep reinforcement learning-based economic dispatching method for virtual power plant in energy internet - Google Patents

Deep reinforcement learning-based economic dispatching method for virtual power plant in energy internet

Info

Publication number
CN111242443B
CN111242443B (application CN202010010410.XA)
Authority
CN
China
Prior art keywords
network
operator
time slot
information
power generation
Prior art date
Legal status
Active
Application number
CN202010010410.XA
Other languages
Chinese (zh)
Other versions
CN111242443A (en
Inventor
Sun Di (孙迪)
Wang Ning (王宁)
Guan Xin (关心)
Lin Lin (林霖)
Current Assignee
State Grid Heilongjiang Electric Power Co Ltd
Heilongjiang University
Original Assignee
State Grid Heilongjiang Electric Power Co Ltd
Heilongjiang University
Priority date
Filing date
Publication date
Application filed by State Grid Heilongjiang Electric Power Co Ltd, Heilongjiang University
Priority to CN202010010410.XA
Publication of CN111242443A
Application granted
Publication of CN111242443B
Active legal status
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02EREDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E40/00Technologies for an efficient electrical power generation, transmission or distribution
    • Y02E40/70Smart grids as climate change mitigation technology in the energy generation sector
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications


Abstract

A deep reinforcement learning-based economic dispatching method for a virtual power plant in the energy internet, belonging to the technical field of energy distribution of virtual power plants. The invention solves the problems of large communication load and delay, high computational complexity and poor reliability of data transmission in existing methods. The invention provides a distributed power generation economic dispatching structure using a three-layer architecture based on edge computing, wherein the first and second layers are edge computing layers and the third layer is a cloud computing layer. The proposed three-layer edge computing architecture reduces the computational complexity of processing training tasks at the central node and further reduces the communication load between the VPP operator and the DGs, thereby also reducing the response time for industrial users, while preserving the privacy of industrial users and improving the reliability of data transmission. The invention can be applied to the energy distribution of virtual power plants.

Description

Deep reinforcement learning-based economic dispatching method for virtual power plant in energy internet
Technical Field
The invention belongs to the technical field of energy distribution of virtual power plants, and particularly relates to a virtual power plant economic dispatching method in an energy internet based on deep reinforcement learning.
Background
With the access of large-scale distributed generation to the energy internet, the traditional microgrid, constrained by geographical conditions, has certain limitations: it hinders the effective utilization of multi-region large-scale distributed generation, and power curtailment occurs very frequently. Because the construction scale of renewable energy stations does not match the demand of local loads, the accommodation capacity for renewable energy is limited, resulting in a certain amount of power curtailment in areas where wind farms and photovoltaic stations are concentrated. Compared with a microgrid, a virtual power plant (VPP) can aggregate energy and load over a wider range, better match the construction scale of renewable energy with the demand scale of local loads, and reduce power curtailment.
Economic dispatch scenarios are complex; for example, managing intelligent devices of distributed renewable energy and industrial users requires large amounts of data of different types to be transmitted in real time. Because of the close relationship between industrial users and VPP operators, reasonable economic scheduling should take full account of user participation. Industrial users can participate in economic dispatch by contracting with VPP operators. The VPP operator needs to receive data from the demand-side industrial users and the DG units (distributed generation units). Since data transmission between the VPP operator and the devices requires a certain degree of performance guarantee to achieve optimal economic scheduling, the VPP employs advanced control, sensing and communication techniques to sense and collect data and transmit it to the VPP's economic scheduling control center. Achieving optimal economic scheduling in complex situations requires considering the wireless links between most devices and the VPP operator, and large data transfers can easily exceed the transmission capacity limits. Thus, large numbers of resource-limited devices cannot directly send their demands to the VPP operator, which poses a significant challenge to efficient economic scheduling.
Traditionally, VPP operators dispatch geographically dispersed distributed power sources in a centralized fashion. User information and real-time status data of DGs from multiple regions are sent to the cloud for storage and processing, which results in a large network communication load and heavy consumption of computing resources, and consequently in high network delay and computational complexity. In practice, long-distance data transmission from the various DGs and industrial users to a cloud computing center consumes a large amount of energy. Moreover, the transmitted data raises privacy concerns for industrial users in different regions: in the traditional cloud computing mode, locally sensitive data must be uploaded to the cloud computing center, which increases the risk of user privacy disclosure. In addition, the generation and transmission of large amounts of data make it difficult to guarantee the reliability of data transmission in a complex environment.
Disclosure of Invention
The invention aims to solve the problems of high computational complexity, large communication load and delay and poor reliability of data transmission in the conventional method, and provides a deep reinforcement learning-based economic dispatching method for a virtual power plant in an energy Internet.
The technical scheme adopted by the invention for solving the technical problems is as follows: the method for economically scheduling the virtual power plant in the energy internet based on deep reinforcement learning comprises the following steps:
step one, for any region i, collecting power-generation-side and user-side information from region i by using the industrial-side server and the power-supply-side server of region i, where i = 1, 2, ..., I, and I is the total number of regions;
training an actor-critic network with the information collected in each region, so as to obtain, for each region, an actor-critic network trained with that region's information;
step two, deploying the trained actor-critic networks on the edge nodes of the corresponding regions respectively;
and step three, the industrial-side server and the power-supply-side server of each region collect information from the power generation side and the user side in real time, input the collected information into the actor-critic network on the corresponding edge node, and obtain the decision information of each region in real time. A minimal sketch of this workflow is given below.
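For illustration only, the following Python sketch outlines the three-step offline-training / edge-deployment / online-scheduling workflow described above. The function and parameter names (train_actor_critic, collect_offline_data, collect_realtime_state, edge_nodes) are assumptions introduced for the example and are not prescribed by the patent.

```python
from typing import Callable, Dict, Any, Iterable

def offline_training(regions: Iterable, collect_offline_data: Callable,
                     train_actor_critic: Callable) -> Dict[Any, Any]:
    """Step one: each region's industrial-side and power-supply-side servers
    collect generation-side and user-side data; the VPP operator cloud server
    trains one actor-critic network per region on that data."""
    return {i: train_actor_critic(collect_offline_data(i)) for i in regions}

def deploy_to_edge(models: Dict, edge_nodes: Dict) -> None:
    """Step two: place each trained actor-critic network on the edge node of its region."""
    for i, model in models.items():
        edge_nodes[i]["model"] = model

def online_scheduling_step(regions: Iterable, edge_nodes: Dict,
                           collect_realtime_state: Callable) -> Dict:
    """Step three: in real time, feed the collected state into the corresponding
    edge model and return the dispatch decision of every region."""
    return {i: edge_nodes[i]["model"](collect_realtime_state(i)) for i in regions}
```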
The invention has the beneficial effects that: the invention provides a deep reinforcement learning-based economic dispatching method for a virtual power plant in the energy internet. Since real-time economic dispatch scenarios are considered, demand response and energy delivery are performed in real time. On the second layer, an agent manages the distributed power sources and industrial users of its local area and performs online scheduling; compared with placing the scheduling of all areas in a cloud center, this reduces communication delay and the response time to industrial users. Computation and storage are completed in the edge node, the application is launched on the edge server, and new energy supplies power to the server nearby, so energy consumption can be significantly reduced. In the framework proposed by the invention, the first and second layers are edge computing layers, while the third layer is a cloud computing layer. The proposed three-layer edge computing architecture reduces the computational complexity of processing the training task at the central node and further reduces the communication load between the VPP operator and the DGs, thereby also reducing the response time for industrial users, while preserving the privacy of industrial users and improving the reliability of data transmission.
Drawings
FIG. 1 is a diagram of an economic dispatch architecture as set forth in the present invention;
FIG. 2 is a block diagram of the distributed power generation economic dispatch architecture utilizing a three-tier architecture based on edge computing as proposed by the present invention;
FIG. 3 is a diagram of an information delivery model for DRL-based VPP economic scheduling of the present invention;
in the figure: s_i is the real-time state of region i, a_i is the action corresponding to state s_i, r_i is the reward value, π is the policy, V is the state-value function, θ is the parameter of the actor network in a thread, θ_v is the parameter of the critic network in a thread, θ' is the parameter of the global actor network, and θ'_v is the parameter of the global critic network;
FIG. 4 is a graph of photovoltaic power, wind power, controllable-load power and uncontrollable-load power for a random day;
in the figure: PV denotes photovoltaic and WT denotes wind turbine;
FIG. 5 is a graph of the return value as a function of iteration number;
FIG. 6 is a graph comparing the generated power of wind power with the actual power;
FIG. 7 is a graph of generated power versus actual power for a photovoltaic cell;
FIG. 8 is a graph of power generated by a gas turbine versus actual power;
FIG. 9 is a graph of the optimization results for a controllable load;
FIG. 10 is a graph comparing the cost of the inventive process and the DPG process.
Detailed Description
The first embodiment is as follows: the method for economically scheduling the virtual power plant in the energy internet based on the deep reinforcement learning comprises the following steps:
step one, for any region i, collecting power-generation-side and user-side information from region i by using the industrial-side server and the power-supply-side server of region i, where i = 1, 2, ..., I, and I is the total number of regions;
training the actor-critic network of the VPP operator cloud server with the information collected in each region, so as to obtain, for each region, an actor-critic network trained with that region's information;
step two, deploying the trained actor-critic networks on the edge nodes of the corresponding regions respectively;
and step three, the industrial-side server and the power-supply-side server of each region collect information from the power generation side and the user side in real time, input the collected information into the actor-critic network on the corresponding edge node, and obtain the decision information of each region in real time.
The second embodiment is as follows: the difference between this embodiment and the first embodiment is that: in step one, the actor-critic network of the VPP operator cloud server is trained with the information collected in each region using an asynchronous method, with 8 threads running in parallel.
The third embodiment is as follows: the difference between this embodiment and the first embodiment is that: the objective function of the actor-critic network is:

C_i = Σ_{k=0}^{K} [ C_pdp^i(k) + C_pom^i(k) + C_wdp^i(k) + C_wom^i(k) + C_ddp^i(k) + C_dom^i(k) + C_de^i(k) + C_d^i(k) + λ·L_cl^i(k)·x_i(k) ]

wherein: C_i is the total operating cost of region i; C_pdp^i(k) is the initial depreciation cost of the photovoltaic investment of region i in time slot k, k = 0, 1, ..., K (24 hours are considered in the VPP, so K equals 23); C_pom^i(k) is the photovoltaic operation and maintenance cost of region i in time slot k; C_wdp^i(k) is the initial depreciation cost of the wind turbine of region i in time slot k; C_wom^i(k) is the wind turbine operation and maintenance cost of region i in time slot k; C_ddp^i(k) is the initial depreciation cost of the micro gas turbine of region i in time slot k; C_dom^i(k) is the micro gas turbine operation and maintenance cost of region i in time slot k; C_de^i(k) is the environmental protection cost of the micro gas turbine of region i in time slot k; C_d^i(k) is the fuel cost consumed by the micro gas turbine of region i in time slot k; λ is the compensation coefficient; L_cl^i(k) is the controllable load of region i in time slot k; x_i(k) is the selected interruptible-load percentage vector of region i in time slot k, and x_i(k) takes values in [0, 1].
The fourth embodiment is as follows: the difference between this embodiment and the first embodiment is that: the specific training process of the actor network in the actor-critic network is:
the actor network consists of a μ network and a σ network, each of which consists of 2 fully connected layers;
the activation function of the 1st fully connected layer of both the μ network and the σ network is tanh, with input dimension 5 and output dimension h;
the activation function of the 2nd fully connected layer of both the μ network and the σ network is softplus, with input dimension h and output dimension m;
the power-generation-side and user-side information is input into the μ network and the σ network to obtain their outputs; normal random sampling is then performed on the outputs of the μ network and the σ network to obtain the 4-dimensional action output by the actor network.
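A minimal PyTorch sketch of such an actor network is shown below. The hidden size (hidden_dim = 64), class name and the small constant added to σ are illustrative assumptions; the layer arrangement (tanh, then softplus; 5-dimensional state in, 4-dimensional action out) follows the description above.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu/sigma networks: two fully connected layers each, tanh then softplus,
    mapping the 5-dim state to the parameters of a 4-dim normal distribution."""
    def __init__(self, state_dim: int = 5, hidden_dim: int = 64, action_dim: int = 4):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, action_dim), nn.Softplus())
        self.sigma_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, action_dim), nn.Softplus())

    def forward(self, state: torch.Tensor):
        mu = self.mu_net(state)
        sigma = self.sigma_net(state) + 1e-6   # keep the standard deviation strictly positive
        return mu, sigma

    def sample_action(self, state: torch.Tensor) -> torch.Tensor:
        mu, sigma = self.forward(state)
        return torch.distributions.Normal(mu, sigma).sample()   # 4-dim action
```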
The fifth embodiment is as follows: the difference between this embodiment and the fourth embodiment is that: the specific training process of the critic network in the actor-critic network is:
the critic network consists of fully connected layers;
the power-generation-side and user-side information and the 4-dimensional action output by the actor network are input into the fully connected layers of the critic network, the outputs of the fully connected layers are spliced to obtain a splicing result, and the splicing result is linearly transformed to obtain the one-dimensional return value output by the critic network.
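Correspondingly, a minimal sketch of the critic network (state and action each encoded by a fully connected layer, concatenated, then linearly mapped to a scalar) might look as follows; the layer sizes and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Encodes the 5-dim state and the 4-dim action with fully connected layers,
    splices the encodings, and outputs a one-dimensional value."""
    def __init__(self, state_dim: int = 5, action_dim: int = 4, hidden_dim: int = 64):
        super().__init__()
        self.state_enc = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.Tanh())
        self.action_enc = nn.Sequential(nn.Linear(action_dim, hidden_dim), nn.Tanh())
        self.head = nn.Linear(2 * hidden_dim, 1)       # linear transform of the spliced result

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.state_enc(state), self.action_enc(action)], dim=-1)
        return self.head(z)                            # one-dimensional value estimate
```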
The sixth embodiment is as follows: the difference between this embodiment and the fifth embodiment is that: the reward function of the actor-critic network is a negative weighted combination of the scheduling costs, given by equation (20) in the detailed description below, wherein K_1, K_2, K_3 and K_4 are all weight values.
The training of the actor network is guided by the value output by the critic network.
Edge computing is used to provide computing services on batch-processing equipment near the network edge of a VPP. First, edge computing can greatly reduce data transfer from the devices to the VPP operator through pre-processing. Second, the edge computing architecture can shift the computational burden to the edge. Fig. 1 shows the economic dispatch architecture proposed by the method of the invention, which consists of four main components: a power source side server (PSS), an industrial user side server, a proxy edge server and a VPP operator cloud server. The power source side server connects the power devices through different communication technologies (e.g., 5G, WiFi). It collects and processes power generation data from the distributed power equipment and transmits the data to the proxy edge server in real time. The PSS also receives scheduling information from the proxy edge server and provides power to the industrial users. The industrial customer-side server likewise connects the power devices through different communication technologies (e.g., 5G, WiFi); it collects and processes the power consumption information of industrial users and transmits the data to the proxy edge server in real time. The proxy edge server makes local economic dispatch decisions according to the analysis results of the industrial user side server and the power supply side server, and interacts with the servers on both sides. The VPP operator cloud server meets the computing requirements of the proxy edge servers and manages each proxy. It not only helps the proxy servers provide real-time analysis and computation, but also collects the scheduling information of the managed proxies.
FIG. 2 illustrates the distributed power generation economic dispatch structure using a three-layer architecture based on edge computing as proposed by the invention. First, the VPP operator sets up agents to manage distributed generation and industrial users in different regions. On the demand side, the users' controllable loads participate in demand response, which can reduce the load demand during peak hours. In contrast to the VPP operator, each agent is an edge computing server. The industrial customer-side server and the power-side server collect data from each distributed generation unit and extract and aggregate the data in real time. The distributed generation may be photovoltaic generation, wind power generation and micro gas turbines. The proxy server provides the optimal economic dispatch strategy for its area and finally sends the decision information to the VPP operator. The proposed structure is suitable for offline training and real-time online scheduling. First, in the offline training phase, the industrial-side server and the power supply server process and collect information from the power generation side and the user side in a specific area, and transmit the collected information to the VPP operator cloud server. The VPP operator cloud server performs model training on the large-scale offline data and transmits the trained model to the proxy edge server of the specific area. During real-time economic dispatching, the data of industrial users and distributed power sources are collected by the two servers and transmitted to the proxy edge server, which feeds them into the previously trained model as input to obtain the real-time dispatch strategy. This three-layer economic dispatch model matches the distributed nature of the power sources and solves the problem of large-scale data transmission in VPP economic dispatch. It is more flexible and adapts to the expansion of dynamic networks, making it a more scalable solution.
The goal of economic dispatch by the VPP operator is to minimize the compensation paid to industrial users and the operating cost of the DGs (including photovoltaic, wind turbines and micro gas turbines). On the basis of minimizing the cost of the VPP operator, the optimal economic scheduling algorithm fully considers C_pom, C_wom and C_dom. In particular, the environmental cost C_de and the fuel cost C_d of the micro gas turbine are also considered. In addition, the initial depreciation costs of the DG units are taken into account and defined as C_pdp, C_wdp and C_ddp, respectively. The needs of industrial users are considered as well, including the compensation cost for industrial users participating in demand response, denoted C_dr. Industrial users are treated as schedulable resources participating in the economic scheduling of the VPP. The proposed algorithm reduces the economic loss of the VPP during peak power consumption by curtailing controllable loads, which can shift load peaks to valleys owing to the increased flexibility of users. In this case, the industrial users correspond to a virtual power generation resource. Therefore, the compensation cost C_dr for the demand side, which compensates users who choose to shed controllable load, is added to the objective function of the proposed model. The objective function consists of two parts: the first part is the operating cost of the DGs, and the second part is the compensation cost of the demand side and the controllable load when the system operates.

C = Σ_{i=1}^{I} C_i,   C_i = Σ_{k=0}^{K} [ C_i^DG(k) + C_i^dr(k) ]

where C is the total operating cost of the DGs and industrial users managed in the VPP, C_i is the operating cost of the DGs and industrial users managed in region i, C_i^DG(k) is the operating cost of the DGs in region i, and C_i^dr(k) is the compensation cost of region i for industrial users participating in demand response.
In the real-time scheme, an edge agent of the VPP is denoted by i. In the proposed optimal economic dispatch model, three types of DG are considered: photovoltaic, wind turbine and micro gas turbine. The operating cost of a DG device includes the initial depreciation cost and the operation and maintenance costs; specifically, the environmental protection and fuel costs of the micro gas turbine are also considered. Here k denotes the time-slot index, and P_p^i(k), P_w^i(k) and P_d^i(k) represent the actual consumed power of the photovoltaic, wind turbine and micro gas turbine, respectively, in time slot k.
(1) Photovoltaic: the initial depreciation cost of the photovoltaic investment, C_pdp^i(k), is expressed in terms of the annual interest rate r, the installation cost per unit capacity of the photovoltaic cell c_p^in, the photovoltaic capacity coefficient K_p and the photovoltaic service life n_p.
The operation and maintenance cost of the photovoltaic is
C_pom^i(k) = K_pom · P_p^i(k)
where C_pom^i(k) is the photovoltaic maintenance and operation cost and K_pom is the photovoltaic maintenance and operation cost coefficient.
(2) Wind turbine: the initial investment cost of the wind turbine is converted into output power per unit time and, as the depreciation cost of the wind turbine, is included in its operating cost. The initial depreciation cost C_wdp^i(k) is expressed in terms of the unit installation cost of the wind turbine c_w^in, the capacity coefficient of the wind turbine K_w, the annual interest rate r and the service life of the wind turbine n_w.
The operation and maintenance cost of the wind turbine during operation is
C_wom^i(k) = K_wom · P_w^i(k)
where K_wom is the operating cost coefficient of the wind turbine.
(3) Micro gas turbine: the initial depreciation cost of the micro gas turbine, C_ddp^i(k), is modeled in terms of the installation cost per unit capacity of the micro gas turbine c_d^in, the capacity coefficient of the micro gas turbine K_d and the service life of the micro gas turbine n_d.
Operation and maintenance cost of the micro gas turbine:
C_dom^i(k) = K_dom · P_d^i(k)
where K_dom is the operation and maintenance cost coefficient of the micro gas turbine.
Environmental protection cost of the micro gas turbine:
C_de^i(k) = Σ_{m=1}^{M} β_m · α_dm · P_d^i(k)
where m indexes the emitted pollutants, M is the total number of pollutants, β_m is the treatment cost per unit emission of pollutant m, and α_dm is the amount of pollutant m emitted by the micro gas turbine per unit of generated electricity.
The power generation efficiency of the micro gas turbine, η_d, is a function of its output power P_d^i(k).
The fuel consumption characteristic of the micro gas turbine can be expressed as (10)
C_d^i(k) = (c_d / L) · P_d^i(k) / η_d
where C_d^i(k) is the fuel cost, c_d is the natural gas price and L is the minimum energy released by the natural gas (its lower heating value).
According to the above description, the operating cost of the DG is:
C_i^DG(k) = C_pdp^i(k) + C_pom^i(k) + C_wdp^i(k) + C_wom^i(k) + C_ddp^i(k) + C_dom^i(k) + C_de^i(k) + C_d^i(k)
the demand response can effectively integrate the potential of the user side response, thereby enhancing the safety, stability and economy of the power grid operation. In this context, we consider the demand response of an industrial user during the model building process. In order to achieve the best economic dispatch strategy, each agent selects the controllable load size to be reduced. This is inconvenient for industrial users as the controllable load is reduced, for which purpose it needs to be compensated. The VPP operator should provide power compensation to the user who chooses to curtail the controllable load. Controlling a variable of controllable load to be X i (k) And a compensation coefficient lambda. X i (k) Is a variable derived from the power information of all industrial users in the area, defined as the percentage of the maximum interruptible controllable load in each time slot of the industrial area considering agent i, with a compensation cost at the load end of
Figure BDA0002356951350000084
This approach may reduce or reduce part of the power consumption, thereby avoiding peak loads for industrial users. The load of the industrial user is obtained from the acquisition and is classified as controllable load->
Figure BDA0002356951350000085
And a non-controllable load->
Figure BDA0002356951350000086
Since controllable loads can directly respond to economic scheduling of VPPs, consideration is given herein primarily to the reduction of controllable loads involved in the VPPs scheduling process. The compensation cost for the managed controllable load of agent i can be expressed as:
Figure BDA0002356951350000087
where λ is the compensation factor, x i (k) Expressed as a vector of percentage of selected interruptible load, the range of values is 0,1]. The objective function of economic dispatch for each agent i can be expressed as:
Figure BDA0002356951350000088
for the entire VPP system, the power balance constraint is a fundamental problem and should be fully considered in the model building process. In each management area of agent i, the total power consumption of the individual DG units should be equal to the total power consumption of the industrial users. For the total power demand of an industrial user, the curtailment of the controllable load of the industrial user by the agent i, i.e. the
Figure BDA0002356951350000091
The actual power consumption of the DG in each agent management area is limited by the actual power generation in that area. Photovoltaic, wind energy of DG, actual power of micro gas turbine is
Figure BDA0002356951350000092
Respectively as follows:
Figure BDA0002356951350000093
Figure BDA0002356951350000094
Figure BDA0002356951350000095
the percentage of interruptible load in the industrial domain managed by agent i should not exceed the percentage of maximum interrupt controllable load per timeslot, i.e. the percentage of maximum interrupt controllable load per timeslot
0≤x i (k)≤X i (k) (18)
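The following Python sketch shows how an action for one time slot can be checked against these constraints. It assumes the power-balance constraint equates total DG consumption with the uncontrollable load plus the non-curtailed controllable load, as reconstructed above; the tolerance value, dictionary keys and function name are assumptions for the example.

```python
def action_is_feasible(action: dict, state: dict, tol: float = 1e-3) -> bool:
    """Check one time slot's action against the constraints of region i.

    action: {"P_p", "P_w", "P_d", "x"}  consumed DG power and interruptible-load percentage
    state:  {"P_p_max", "P_w_max", "P_d_max",  available generation in the slot
             "L_cl", "L_ucl", "X"}              loads and maximum shed ratio X_i(k)
    """
    # generation limits: consumed power cannot exceed available generation
    limits_ok = all(0.0 <= action[p] <= state[p + "_max"]
                    for p in ("P_p", "P_w", "P_d"))
    # interruptible-load bound of Eq. (18): 0 <= x_i(k) <= X_i(k)
    shed_ok = 0.0 <= action["x"] <= state["X"]
    # power balance: total DG consumption equals the (curtailed) load demand
    supply = action["P_p"] + action["P_w"] + action["P_d"]
    demand = state["L_ucl"] + state["L_cl"] * (1.0 - action["x"])
    return limits_ok and shed_ok and abs(supply - demand) <= tol
```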
The VPP operator manages all the regions and aggregates the scheduling information of each region. Based on the above description, the objective function of the optimal economic scheduling policy is defined as follows:
C = Σ_{i=1}^{I} C_i
in the invention, the optimal economic dispatching strategy provided by the invention minimizes the power generation cost of the distributed power supply and simultaneously meets the limitations of power balance and power generation capacity of the VPP.
To make the solution more practical, various cost components are incorporated into the objective function. The objective function established by the invention is a nonlinear cost function; although the invention does not add non-convex constraints, in real scenarios generating units are generally affected by the valve-point effect and the cost function is generally non-convex. To address these difficulties, previous work has often employed heuristic methods. The deep reinforcement learning method adopted here can adapt to nonlinear, non-convex conditions and relaxes the nonlinear and non-convex constraints. In a practical economic scheduling scheme, the scheduling process should generally be completed in a short time. Owing to the stochastic nature of photovoltaic and wind power generation and the flexibility of the load, the state transitions from one time slot to the next constitute a large state space, and the state information needs to be updated quickly. DRL, as an effective artificial intelligence algorithm, has achieved great success in many problem areas, such as the internet of things, where it can find different optimization strategies within a reasonable time frame. In the invention, the proposed DRL-based algorithm relaxes the constraint of nonlinear characteristics and improves the solution accuracy by fitting the value function with a deep learning algorithm. The economic scheduling problem considered here is nonlinear, the transition probabilities are unknown, and the state space is large and continuous; DRL can estimate the probability distribution of state transitions without environment information. The offline-trained model can be directly applied to online economic dispatch, and the proposed DRL-based optimal economic dispatch algorithm significantly improves the computational efficiency.
The information delivery model of the DRL-based VPP economic scheduling is shown in Fig. 3. The algorithm adopts an offline data training mode: the power supply side server and the user side server collect historical data and transmit this information to the VPP cloud server. The VPP cloud server uses DRL to train the networks independently on the data transmitted from different areas, thereby obtaining the economic scheduling strategies of the different areas. In the online economic dispatching stage, each proxy edge server obtains the corresponding network weights from the VPP cloud server. The power side server and the industrial customer side server gather real-time transmission information and power requirements and then transmit all the gathered information to the corresponding proxy edge server. The proxy edge server obtains the real-time optimal economic dispatch strategy from the trained weights and the real-time state information, and feeds the result back to the servers of both sides.
Offline training and online scheduling are realized at different nodes. First, the model is fully trained on offline data in the cloud center. Then the proposed DRL-based method is combined with edge computing, and the trained model is placed at the edge node so that it can be applied online in a real environment. If the online environment differs slightly from the offline training environment, the offline-trained model can learn these changes and dynamically adjust its actions to achieve optimal scheduling. During online scheduling, the distributed generation data and the demand data of the industrial users can be transmitted directly to the edge nodes without being sent to the cloud center, which is better suited to real-time economic scheduling scenarios.
For the VPP, 24 hours are considered, denoted by k ∈ {0, 1, ..., 23}. The goal of economic scheduling is to find an optimal economic scheduling solution that minimizes the operating cost of the VPP. For region i, the state set is S_i, s_i ∈ S_i, with
s_i(k) = ( P̄_p^i(k), P̄_w^i(k), P̄_d^i(k), L_cl^i(k), L_ucl^i(k) )
where the quantities aggregated by the power supply side server and the industrial user side server respectively represent the actual photovoltaic generation capacity, the wind generation capacity, the actual generation capacity of the micro gas turbine, the controllable load of the industrial users and the uncontrollable load demand in time slot k. The action set is A_i, a_i ∈ A_i, with
a_i(k) = ( P_p^i(k), P_w^i(k), P_d^i(k), x_i(k) )
respectively representing the actual consumed power of photovoltaic generation, wind power generation and the micro gas turbine in time slot k, and the control coefficient of the controllable load. A_i is a continuous action space satisfying the power balance constraint, and a_i is a selected action satisfying the action constraints.
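The 5-dimensional state and 4-dimensional action described above can be represented, for example, as follows; the field names are illustrative assumptions introduced for the example.

```python
from dataclasses import dataclass

@dataclass
class RegionState:
    """5-dim state s_i(k) aggregated by the power-supply-side and industrial-side servers."""
    pv_generation: float        # actual photovoltaic generation capacity in slot k
    wind_generation: float      # actual wind generation capacity in slot k
    mt_generation: float        # actual micro-gas-turbine generation capacity in slot k
    controllable_load: float    # industrial controllable load demand in slot k
    uncontrollable_load: float  # industrial uncontrollable load demand in slot k

@dataclass
class RegionAction:
    """4-dim action a_i(k) output by the actor network."""
    pv_consumed: float          # actual power consumption of photovoltaic generation
    wind_consumed: float        # actual power consumption of wind generation
    mt_consumed: float          # actual power consumption of the micro gas turbine
    shed_ratio: float           # control coefficient x_i(k) of the controllable load
```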
In any time slot, a policy π is introduced in order to find the mapping from states to actions. The policy represents a conditional probability distribution over actions given the current state: π(a_i | s_i) is the probability of taking action a_i in state s_i. The next state is denoted s'_i and the initial state is denoted s_i^0.
In practical cases, the state transition probabilities are unknown, and the state space and action space are continuous. Given s_i and a_i, a reward value r_i(s_i, a_i) related to the objective function can be obtained. The reward value is a key component for evaluating the quality of an action and guiding the learning process. To set the reward well, it was set through repeated experiments as a function related to the cost; the specific setting of the reward value, equation (20), is a negative weighted combination of the scheduling costs,
where K_1, K_2, K_3 and K_4 are the set weight values. The reward value is negative because the cost of the virtual power plant is to be minimized. The total return over the K hours is obtained as:
R_i = Σ_{k=0}^{K} r_i( s_i(k), a_i(k) )
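A hedged sketch of a per-slot reward and of the accumulated K-hour return is given below. The exact grouping of cost terms under K_1–K_4 is specified by equation (20) of the patent and is not reproduced here; the sketch simply assumes the reward is the negative weighted sum of per-slot cost terms.

```python
def slot_reward(cost_terms, weights) -> float:
    """Illustrative reward: negative weighted sum of per-slot cost terms.
    The actual assignment of terms to the weights K_1..K_4 follows Eq. (20)
    of the patent and is an assumption here."""
    return -sum(w * c for w, c in zip(weights, cost_terms))

def total_return(rewards, gamma: float = 1.0) -> float:
    """Accumulated return over the K-hour scheduling horizon; setting
    gamma = 0.90 (the discount factor of the description) gives the
    discounted return used by the value functions."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))
```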
to maximize the return value, we use the gradient ascent method to update the strategy in the proposed algorithm, i.e. we use
Figure BDA0002356951350000113
From (23), the state value function V can be obtained π (s i ) Sum state contribution function Q π (s i ,a i ) And gamma is a discount factor representing the discount rate of the return value.
Figure BDA0002356951350000114
Figure BDA0002356951350000115
The goal is to select the best strategy and maximize the state effort function, which is expressed as follows:
Figure BDA0002356951350000116
in order to find the optimal economic dispatch strategy, it is usually considered to represent the function by using a data table. However, this approach limits the scale of the reinforcement learning algorithm. When the size of the problem is too large, the storage space for storing the table may be large, and it takes a long time to accurately calculate each value in the table. If learning experience is obtained from a small training data set, the generalization capability of the training pattern is insufficient. In order to solve the above problem, a state value function and a state action value function are parameterized using a deep neural network in consideration of a large-scale state action space. In the algorithm provided by the invention, the deep neural network is used for extracting the characteristics of large-scale input state data to train economic dispatchModels, which make the trained models more generalized. Starting from the first layer of neurons, the next layer of neurons is entered by a non-linear activation function, and continues to pass down to the output layer. Since the nonlinear function is essential for the deep neural network, the deep neural network has sufficient capability to extract data features. Theta.theta. v For approximating a function V(s) of state values i ) And the state merit function Q(s) i ,a i )。
Q(s i ,a i )≈Q(s i ,a iv ) (26)
V(s i )≈V(s iv ) (27)
The deep neural network is used as a function approximator, and the parameter theta of the deep neural network is a strategy parameter. Pi obeys a Gaussian distribution and can be used to solve the problem of continuous motion space, i.e.
Figure BDA0002356951350000121
/>
The per-slot reward of each region i is given by (20), and the total return accumulates these rewards over the scheduling horizon.
In our scenario, to increase the probability of policies with higher reward values, the policy gradient is updated; the gradient is calculated as:
∇_θ log π(a_i | s_i; θ) · ( R_i − b(s_i) )
where R_i is the total return in region i and is estimated by Q(s_i, a_i), i.e. R_i ≈ Q(s_i, a_i); b(s_i) is a baseline used to reduce the estimation error, and V(s_i) is used to estimate the baseline, i.e. b(s_i) ≈ V(s_i).
A^π(s_i, a_i; θ, θ_v) = Q^π(s_i, a_i; θ_v) − V^π(s_i; θ_v)    (31)
Equation (31) is the advantage function, representing the advantage of the action value function over the value function. The advantage function is positive if the action value function is greater than the value function and negative if it is smaller. The parameters are updated in the direction that increases the policy probability when the advantage function is positive, and in the direction that decreases it when the advantage function is negative. Therefore, the algorithm converges faster when the advantage function is employed.
The advantage is estimated from the accumulated discounted rewards and the critic's value estimate.
The policy gradient is updated as:
∇_θ log π(a_i | s_i; θ) · A(s_i, a_i; θ, θ_v)
The parameters θ and θ_v are updated by accumulating, in each thread, the policy gradient weighted by the advantage and the gradient of the squared advantage (the critic's estimation error), respectively.
in order to make the training strategy more adaptive and prevent premature convergence to a suboptimal deterministic strategy, entropy regularization is added to the strategy gradient, i.e.
Figure BDA0002356951350000128
Figure BDA0002356951350000129
When the neural network training is carried out, required data are independently and simultaneously distributed, in order to break the correlation between the data, an asynchronous method is adopted, a plurality of threads can be operated in parallel, and each thread has an own environment copy. During the training process, multiple threads maintain a global operator-critical network, and each thread maintains a copy of the local network weight values of the global network. The local network accumulates gradient updates and passes the gradients to the global network for parameter updates. The local network will then synchronize the parameters in the global network. The local network can not only update its own independent network by learning the environment status, but also interact with the global network. We define the global shared parameter vector as θ 'and θ' v
Figure BDA0002356951350000131
Figure BDA0002356951350000132
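The asynchronous update described above follows the usual advantage actor-critic pattern: each worker thread computes gradients on its local copy and pushes them to the global network. The PyTorch-style sketch below is a generic example under stated assumptions (entropy weight 0.01 and discount 0.90 taken from the description; the optimizer is assumed to be built over the global networks' parameters, and the critic's output is used as the baseline); it is not a verbatim reproduction of the patent's equations.

```python
import torch

def worker_update(local_actor, local_critic, global_actor, global_critic,
                  optimizer, states, actions, rewards,
                  gamma: float = 0.90, entropy_beta: float = 0.01):
    """One asynchronous update: accumulate gradients on the local copy,
    apply them to the global actor-critic, then re-sync the local copy.
    states/actions are tensors of shape (T, 5) and (T, 4); rewards is a list."""
    returns, R = [], 0.0
    for r in reversed(rewards):                      # discounted returns R_i
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns, dtype=torch.float32).unsqueeze(-1)

    mu, sigma = local_actor(states)
    dist = torch.distributions.Normal(mu, sigma)
    values = local_critic(states, actions)           # critic estimate used as baseline
    advantage = returns - values

    critic_loss = advantage.pow(2).mean()
    actor_loss = -(dist.log_prob(actions).sum(-1, keepdim=True)
                   * advantage.detach()).mean()
    entropy = dist.entropy().sum(-1).mean()          # entropy regularization
    loss = actor_loss + critic_loss - entropy_beta * entropy

    optimizer.zero_grad()
    loss.backward()
    # push local gradients into the global networks, then synchronize
    for lp, gp in zip(list(local_actor.parameters()) + list(local_critic.parameters()),
                      list(global_actor.parameters()) + list(global_critic.parameters())):
        gp.grad = lp.grad
    optimizer.step()
    local_actor.load_state_dict(global_actor.state_dict())
    local_critic.load_state_dict(global_critic.state_dict())
```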
In this way, each region achieves its optimal economic dispatch. In the offline training process of the numerical experiments, 8 threads are implemented; the VPP operator communicates with each region and computes C. Based on this algorithm, the economic dispatch model for region i is obtained. In the online scheduling stage, each proxy edge server, i.e. agent i, first obtains the corresponding network weights from the VPP cloud server. The DRL-based economic dispatch model is shown in Fig. 3.
Experimental part
To train the DRL-based economic dispatch model, the load data of photovoltaic, wind power, micro gas turbines and industrial users are trained with an offline dataset. Fig. 4 shows the power of photovoltaic and wind power generation and the power of the controllable and uncontrollable loads on a random day. The maximum power of the micro gas turbine is set to 200 kW. Since the industrial load consists mainly of various industrial processes, the power demand generally does not vary much and has no particularly significant peak-valley difference. The periods of higher load demand are 9:00-10:00, 12:00-14:00 and 19:00-21:00, and the period of lower load demand is 1:00-5:00. It can be seen that photovoltaic and wind power generation have a larger peak-valley difference: the peak period of photovoltaic generation is 10:00-16:00 and that of wind power generation is 10:00-18:00. The photovoltaic and wind generation power over one day is random, and the power consumption of the controllable and uncontrollable loads is also random.
The emission costs of pollution and the operating and maintenance costs of photovoltaic, wind power generation and micro gas turbines are listed in tables 1 and 2.
TABLE 1 (pollutant emission costs)
TABLE 2 (operation and maintenance costs of photovoltaic, wind power generation and micro gas turbines)
The structure of the neural network in the DRL-based algorithm used in the invention is described in detail below. The state is expressed as a 5-dimensional vector, and the resulting action has 4 dimensions. The action is obtained by random sampling from a normal distribution conditioned on the state, and a neural network is used to compute the μ and σ parameters required for the normal distribution. The state is input into the μ network and the σ network, respectively, giving the 4-dimensional μ and σ parameters. The μ network consists of 2 MLP layers: the first layer has input dimension 5 and output dimension h and is activated with tanh; the second layer has input dimension h and output dimension m and is activated with softplus. The σ network also consists of 2 MLP layers: the first layer has input dimension 5 and is activated with tanh, and the two-layer network outputs a 4-dimensional vector activated with softplus; to ensure that the σ network does not output 0, 1×10⁻⁶ is added to the output σ vector. After that, the 4-dimensional action is sampled randomly from the normal distribution. Based on the state and the action, the Q value is calculated using the critic network. In the critic network, the state is encoded with one MLP with input dimension 5 and tanh activation, and the action is encoded with another MLP, also with tanh activation. The two encoded outputs are then spliced together and a linear transformation outputs the score, with a final output dimension of 1. The actor-critic network thus implements two neural networks, with a discount coefficient of 0.90 and an entropy weight of 0.01. Generally, the critic generates the return estimate used for the actor update, and the critic is updated faster than the actor. The convergence speed increases as the learning rate increases; however, too high a learning rate may lead to a local optimum instead of the global optimum. Therefore, the learning rate is set to a moderate value.
In the invention, the numerical experiments are carried out on a computer with an 8-core CPU and 16 GB of memory. The number of threads is 8, i.e. each local actor and critic network corresponds to one sub-thread, for a total of 8 threads. The environment is learned asynchronously through the sub-threads, and the learning results are regularly updated to the global network. There are many random choices at the beginning of learning, but after multiple iterations the economic dispatch model converges and selects the action that optimizes the objective. The optimal economic scheduling strategy is trained on the offline dataset. The main advantage of DRL is that, after being fully trained on such offline data, the model can be applied online in a real environment. When the online environment changes slightly, the DRL model can learn these changes and dynamically adjust its actions to achieve optimal scheduling.
In order to verify the convergence of the algorithm, 100 days of data are sampled as training data; each episode runs any one of the 100 days, and after 45,000 episodes the model can generate the optimal actions. There are 24 steps per episode, where each step is one hour, and the iterative process is shown in Fig. 5. The actions are obtained by random sampling from a normal distribution conditioned on the state. It can be seen that the algorithm fluctuates strongly in the first 30,000 episodes, mainly due to the randomness of policy selection, and is therefore constantly exploring; owing to the constraints of the action interval and the equality constraints, the fluctuation interval is approximately between -300 and -400. After 32,000 episodes of training there is a clear breakthrough, as the model learns how to select the optimal action. From 35,000 episodes onward, the model begins to converge gradually. The training results show that the proposed model can minimize the cost of the fully trained VPP operator. Although there are many random choices and many iterations at the beginning of learning, the deep reinforcement learning model can converge and learn to choose actions close to the optimal target value.
In the virtual power plant, photovoltaic and wind power generation are cheaper and more environmentally friendly than the micro gas turbine, so the trained strategy relies mainly on wind and photovoltaic generation. The load is therefore mainly supplied by wind and photovoltaic power, and the remainder is supplemented by the gas turbine or curtailed from the controllable load through demand response. Figs. 6, 7 and 8 compare the generated power of wind power, photovoltaics and the gas turbine with the actual power consumption; dark gray is the generated power, light gray is the actual power consumption, the horizontal axis is time in hours and the vertical axis is power. As can be seen from Figs. 6 and 7, the difference between the actual generation and the final consumption of wind and photovoltaic power is approximately 0, and the actual power output of photovoltaic and wind generation is small during 1:00-7:00 and 23:00-24:00; the load at these times needs to be supplied by the micro gas turbine. As can be seen from Fig. 8, during 1:00-7:00 and 23:00-24:00 the micro gas turbine is the main power supply unit. As can be seen from Fig. 9, at 20:00-24:00 the controllable-load shedding has a high weight, with almost all of it shed, owing to the large electricity demand of the industrial users and the high cost of the gas turbine in this period. It can therefore be concluded that, using the proposed algorithm to minimize the cost of the virtual power plant, the early learning stage is relatively random under the preset reward, and during training the model learns the correct strategy selection over time, so as to minimize the cost of the virtual power plant by stably controlling the distributed generation and the demand response.
To verify the effectiveness of the proposed method, the proposed algorithm is compared with other reinforcement learning algorithms. The method of the invention is compared with the deterministic policy gradient algorithm (DPG), which can also solve this continuous-action-space problem. The results are shown in Fig. 10, with the light gray curve being DPG and the dark gray curve being the proposed DRL-based algorithm. Comparing the costs of DPG and the proposed DRL-based algorithm over 30 days, it can be seen from the figure that the cost of the proposed method is significantly lower from day 22 onward. Compared with the proposed method, DPG uses the return value at the current moment as an unbiased estimate of the action-value function under the current policy, so the obtained policy has higher variance, poorer generalization and, in some cases, instability. The proposed method uses a neural network to fit the action-value function and obtains a smaller variance by subtracting a baseline. To break the correlation between data, an asynchronous update mechanism is used to create multiple parallel environments; because the parallel environments do not interfere with each other, the sub-threads can simultaneously update the parameters of the main network.
TABLE 3
(running-time comparison of the proposed method with DDPG and DPG)
Compared with DDPG and DPG, the number of episodes is set to 45,000 and the running times of the different methods are compared. As can be seen from Table 3, compared with the different deep reinforcement learning methods adapted to solve the economic dispatch of the virtual power plant, the proposed method has the lowest time complexity. Since each decision takes only a few milliseconds, in a real-time virtual power plant economic dispatch scenario a decision can be made within a few milliseconds from the state input. A traditional heuristic method needs to re-run the optimization process for each state, and its time complexity is higher.
The invention adapts to the stochastic characteristics of distributed renewable energy generation and provides a VPP optimal economic scheduling algorithm based on deep reinforcement learning. A framework based on edge computing is further utilized, so that the optimal scheduling solution can be achieved with lower computational complexity. The performance of the proposed algorithm is evaluated using real-world meteorological and load data, and the experimental results show that the proposed DRL-based model can successfully learn the characteristics of distributed generation and industrial user demand in the economic scheduling problem of the virtual power plant and learn to select actions that minimize the cost of the virtual power plant. Comparison with DPG shows that the proposed method has better performance; comparison with DPG and DDPG shows that it has lower time complexity.
The above-described calculation examples of the present invention are merely to explain the calculation model and the calculation flow of the present invention in detail, and are not intended to limit the embodiments of the present invention. It will be apparent to those skilled in the art that other variations and modifications of the present invention can be made based on the above description, and it is not intended to be exhaustive or to limit the invention to the precise form disclosed, and all such modifications and variations are possible and contemplated as falling within the scope of the invention.

Claims (3)

1. The method for economically scheduling a virtual power plant in the energy internet based on deep reinforcement learning is characterized by comprising the following steps:
step one, for any region i, collecting power-generation-side and user-side information from region i by using the industrial-side server and the power-supply-side server of region i, where i = 1, 2, ..., I, and I is the total number of regions;
training an actor-critic network with the information collected in each region, so as to obtain, for each region, an actor-critic network trained with that region's information;
the objective function of the operator-critical network is as follows:
Figure FDA0004105715940000011
wherein: c is the total operating cost of the area i,
Figure FDA0004105715940000012
initial depreciation cost for the photovoltaic investment in time slot K for zone i, K =0,1, ..., K, \ h>
Figure FDA0004105715940000013
For the photovoltaic operation and maintenance costs of zone i in time slot k, <' >>
Figure FDA0004105715940000014
For the initial depreciation cost of the wind turbine for zone i in time slot k, based on the wind turbine status of the wind turbine>
Figure FDA0004105715940000015
For the wind turbine operating and maintenance costs of zone i at time slot k>
Figure FDA0004105715940000016
Based on the initial depreciation cost of the micro gas turbine in the time slot k for the area i>
Figure FDA0004105715940000017
For the micro gas turbine operating and maintenance costs in zone i at time slot k, based on>
Figure FDA0004105715940000018
Environmental protection costs of micro gas turbines in time slot k for zone i->
Figure FDA0004105715940000019
The cost of the micro gas turbine itself consumed in the time slot k for the area i, λ is the compensation factor,
Figure FDA00041057159400000110
controllable load for zone i in time slot k, x i (k) Selection of interruptible load percentage vector, x, for region i in time slot k i (k) Get (1)The value range is [0,1 ]];
the specific training process of the actor network in the actor-critic network comprises the following steps:
the actor network consists of a μ network and a σ network, each composed of 2 fully connected layers;
the activation function of the 1st fully connected layer of both the μ network and the σ network is tanh, with input dimension 5 and output dimension h;
the activation function of the 2nd fully connected layer of both the μ network and the σ network is softplus, with input dimension h and output dimension m;
the power-generation-side and user-side information is input into the μ network and the σ network to obtain their outputs; a normal random sample drawn from the μ and σ outputs then gives the 4-dimensional action output by the actor network;
the specific training process of the critic network in the operator-critic network comprises the following steps:
the critic network is composed of full connection layers;
inputting the information of the power generation side and the user side and the 4-dimensional action output by the operator network into a full connection layer of the critic network, splicing the output of the full connection layer to obtain a splicing result, and performing linear transformation on the splicing result to obtain a one-dimensional return value output by the critic network;
step two, deploying the trained actor-critic networks at the edge nodes of the corresponding areas respectively;
and step three, the industrial-side server and the power-supply-side server in each area collect information from the power generation side and the user side in real time, input the collected information into the actor-critic network on the corresponding edge node, and obtain the dispatch decision information for each area in real time.
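As an illustration of the network structure recited in claim 1, the following PyTorch sketch is one possible, non-authoritative reading: an actor made of a μ network and a σ network (two fully connected layers each, tanh then softplus, input dimension 5), normal random sampling of the μ/σ outputs to obtain the 4-dimensional action, and a critic that passes state and action through fully connected layers, concatenates the results, and applies a linear transform to a one-dimensional return value. The hidden size h, the output size m, and all identifiers are illustrative assumptions.

```python
import torch
import torch.nn as nn

STATE_DIM, H, M, ACTION_DIM = 5, 64, 4, 4   # h and m are not fixed in the claim; illustrative values

class Actor(nn.Module):
    """mu network and sigma network: 2 fully connected layers each,
    tanh on the 1st layer and softplus on the 2nd, as recited in claim 1."""
    def __init__(self):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(STATE_DIM, H), nn.Tanh(),
                                nn.Linear(H, M), nn.Softplus())
        self.sigma = nn.Sequential(nn.Linear(STATE_DIM, H), nn.Tanh(),
                                   nn.Linear(H, M), nn.Softplus())

    def forward(self, state):
        mu, sigma = self.mu(state), self.sigma(state)
        # Normal random sampling on the mu/sigma outputs yields the 4-dimensional action.
        return torch.distributions.Normal(mu, sigma + 1e-6).sample()

class Critic(nn.Module):
    """Fully connected layers over state and action, spliced (concatenated),
    then a linear transform to a one-dimensional return value."""
    def __init__(self):
        super().__init__()
        self.fc_state = nn.Linear(STATE_DIM, H)
        self.fc_action = nn.Linear(ACTION_DIM, H)
        self.out = nn.Linear(2 * H, 1)

    def forward(self, state, action):
        spliced = torch.cat([torch.relu(self.fc_state(state)),
                             torch.relu(self.fc_action(action))], dim=-1)
        return self.out(spliced)

if __name__ == "__main__":
    s = torch.rand(1, STATE_DIM)   # stand-in for generation-side and user-side information
    a = Actor()(s)                 # 4-dimensional dispatch action
    value = Critic()(s, a)         # one-dimensional return estimate
    print(a.shape, value.shape)
```

Such a network, once trained, only needs a forward pass per time slot, which is what makes deployment at the edge nodes in steps two and three feasible in real time.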
2. The deep reinforcement learning-based economic dispatching method for the virtual power plant in the energy internet according to claim 1, wherein in step one the actor-critic network on the VPP operator cloud server is trained with the information collected from each area using an asynchronous method in which 8 threads run in parallel.
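Claim 2 recites asynchronous training with 8 parallel threads. A minimal, hypothetical sketch of that pattern is shown below (Python threading with a shared global model; the environment, loss, and update rule are placeholders rather than the patented procedure):

```python
import threading
import torch

N_WORKERS = 8                                          # claim 2: 8 threads run in parallel

global_model = torch.nn.Linear(5, 5)                   # placeholder for the shared actor-critic
optimizer = torch.optim.Adam(global_model.parameters(), lr=1e-3)
lock = threading.Lock()

def worker(region_id: int):
    """Each thread trains on one area's data and asynchronously pushes
    its gradients into the shared global network (A3C-style pattern)."""
    local_model = torch.nn.Linear(5, 5)
    local_model.load_state_dict(global_model.state_dict())
    for _ in range(100):                               # placeholder episode loop
        state = torch.rand(1, 5)                       # stand-in for area data
        loss = local_model(state).pow(2).mean()        # stand-in for the actor-critic loss
        loss.backward()
        with lock:                                     # asynchronous global update
            for g_p, l_p in zip(global_model.parameters(), local_model.parameters()):
                g_p.grad = l_p.grad.clone()
            optimizer.step()
            optimizer.zero_grad()
        local_model.zero_grad()
        local_model.load_state_dict(global_model.state_dict())

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
print("asynchronous training finished")
```

Under CPython the threads interleave rather than run fully in parallel, but the structure mirrors the 8-worker asynchronous update described in the claim.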
3. The deep reinforcement learning-based economic dispatching method for the virtual power plant in the energy internet according to claim 2, wherein the reward function of the actor-critic network has the expression:
[reward function expression – given only as image FDA0004105715940000021 in the published text]
wherein: k 1 、K 2 、K 3 And K 4 Are weighted values.
CN202010010410.XA 2020-01-06 2020-01-06 Deep reinforcement learning-based economic dispatching method for virtual power plant in energy internet Active CN111242443B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010010410.XA CN111242443B (en) 2020-01-06 2020-01-06 Deep reinforcement learning-based economic dispatching method for virtual power plant in energy internet

Publications (2)

Publication Number Publication Date
CN111242443A CN111242443A (en) 2020-06-05
CN111242443B true CN111242443B (en) 2023-04-18

Family

ID=70876028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010010410.XA Active CN111242443B (en) 2020-01-06 2020-01-06 Deep reinforcement learning-based economic dispatching method for virtual power plant in energy internet

Country Status (1)

Country Link
CN (1) CN111242443B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738627B (en) * 2020-08-07 2020-11-27 中国空气动力研究与发展中心低速空气动力研究所 Wind tunnel test scheduling method and system based on deep reinforcement learning
CN112381359B (en) * 2020-10-27 2021-10-26 惠州蓄能发电有限公司 Multi-critic reinforcement learning power economy scheduling method based on data mining
CN113191680B (en) * 2021-05-21 2023-08-15 上海交通大学 Self-adaptive virtual power plant distributed architecture and economic dispatching method thereof
CN113315172B (en) * 2021-05-21 2022-09-20 华中科技大学 Distributed source load data scheduling system of electric heating comprehensive energy
CN114301909B (en) * 2021-12-02 2023-09-22 阿里巴巴(中国)有限公司 Edge distributed management and control system, method, equipment and storage medium
CN114244679A (en) * 2021-12-07 2022-03-25 国网福建省电力有限公司经济技术研究院 Layered control method for communication network of virtual power plant under cloud-edge-end architecture
CN113962390B (en) * 2021-12-21 2022-04-01 中国科学院自动化研究所 Method for constructing diversified search strategy model based on deep reinforcement learning network
CN114862177B (en) * 2022-04-29 2023-05-26 国网江苏省电力有限公司南通供电分公司 Energy storage and distribution method and system for energy interconnection
CN115062869B (en) * 2022-08-04 2022-12-09 国网山东省电力公司东营供电公司 Comprehensive energy scheduling method and system considering carbon emission
CN116111599A (en) * 2022-09-08 2023-05-12 贵州电网有限责任公司 Intelligent power grid uncertainty perception management control method based on interval prediction

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
MX2018000942A (en) * 2015-07-24 2018-08-09 Deepmind Tech Ltd Continuous control with deep reinforcement learning.
US11727265B2 (en) * 2019-06-27 2023-08-15 Intel Corporation Methods and apparatus to provide machine programmed creative support to a user

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103824134A (en) * 2014-03-06 2014-05-28 河海大学 Two-stage optimized dispatching method for virtual power plant
CN108604310A (en) * 2015-12-31 2018-09-28 威拓股份有限公司 Method, controller and the system of distribution system are controlled for using neural network framework
CN109976909A (en) * 2019-03-18 2019-07-05 中南大学 Low delay method for scheduling task in edge calculations network based on study
CN110443447A (en) * 2019-07-01 2019-11-12 中国电力科学研究院有限公司 A kind of method and system learning adjustment electric power system tide based on deeply

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Using peer-to-peer energy-trading platforms to incentivize prosumers to form federated power plants. Nat. Energy, vol. 3, 2018, full text. *
Chen Chunwu. Research on the economic operation model of virtual power plants in a smart grid environment. China Master's Theses Full-text Database, 2015, full text. *

Also Published As

Publication number Publication date
CN111242443A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN111242443B (en) Deep reinforcement learning-based economic dispatching method for virtual power plant in energy internet
Lin et al. Deep reinforcement learning for economic dispatch of virtual power plant in internet of energy
CN112615379B (en) Power grid multi-section power control method based on distributed multi-agent reinforcement learning
Zeng et al. A potential game approach to distributed operational optimization for microgrid energy management with renewable energy and demand response
Du et al. Distributed MPC for coordinated energy efficiency utilization in microgrid systems
Xi et al. A wolf pack hunting strategy based virtual tribes control for automatic generation control of smart grid
CN108039737B (en) Source-grid-load coordinated operation simulation system
CN106026084B (en) A kind of AGC power dynamic allocation methods based on virtual power generation clan
Zhang et al. A cyber-physical-social system with parallel learning for distributed energy management of a microgrid
Xi et al. A deep reinforcement learning algorithm for the power order optimization allocation of AGC in interconnected power grids
CN114331059A (en) Electricity-hydrogen complementary park multi-building energy supply system and coordinated scheduling method thereof
Yoldas et al. Optimal control of microgrids with multi-stage mixed-integer nonlinear programming guided $ Q $-learning algorithm
Liu et al. Optimal dispatch strategy of virtual power plants using potential game theory
Bi et al. Real-time energy management of microgrid using reinforcement learning
CN115795992A (en) Park energy Internet online scheduling method based on virtual deduction of operation situation
Yin et al. Deep Stackelberg heuristic dynamic programming for frequency regulation of interconnected power systems considering flexible energy sources
Huang et al. Distributed real-time economic dispatch for islanded microgrids with dynamic power demand
CN113869742A (en) Power dispatching system of comprehensive supply and demand side based on mobile home and critic network
CN117117878A (en) Power grid demand side response potential evaluation and load regulation method based on artificial neural network and multi-agent reinforcement learning
CN110599032A (en) Deep Steinberg self-adaptive dynamic game method for flexible power supply
CN116307071A (en) Method for accessing high-proportion photovoltaic into low-voltage power distribution network
Zile Smart energy management in solar/wind power stations using artificial neural networks
Sage et al. Economic Battery Storage Dispatch with Deep Reinforcement Learning from Rule-Based Demonstrations
Hu et al. Optimal Energy Management in Microgrids Based on Reinforcement Learning
Sun et al. Distributed Optimal Scheduling of Integrated Energy Systems Based on Federated Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant