CN116755409A - Coal-fired power generation system coordination control method based on value distribution DDPG algorithm - Google Patents

Coal-fired power generation system coordination control method based on value distribution DDPG algorithm

Info

Publication number
CN116755409A
Authority
CN
China
Prior art keywords
distribution
coal
power generation
network
fired power
Prior art date
Legal status
Granted
Application number
CN202310812903.9A
Other languages
Chinese (zh)
Other versions
CN116755409B (en)
Inventor
刘晓敏
余梦君
祝琛
王浩宇
杨春雨
周林娜
赵峻
彭献勇
范赫
Current Assignee
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN202310812903.9A priority Critical patent/CN116755409B/en
Publication of CN116755409A publication Critical patent/CN116755409A/en
Application granted granted Critical
Publication of CN116755409B publication Critical patent/CN116755409B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00Programme-control systems
    • G05B19/02Programme-control systems electric
    • G05B19/418Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM]
    • G05B19/41865Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM] characterised by job scheduling, process planning, material flow
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00Program-control systems
    • G05B2219/30Nc systems
    • G05B2219/32Operator till task planning
    • G05B2219/32252Scheduling production, machining, job shop

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Manufacturing & Machinery (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention discloses a coal-fired power generation system coordination control method based on a value distribution DDPG algorithm, which comprises the following steps: modeling the coordination control problem of the coal-fired power generation system as a Markov decision process; evaluating the long-term expected control cost function through a Q function and modeling the scalar Q value as a distribution using quantiles; introducing the distributional expression of the Bellman operator to construct a target distribution, measuring the minimal 1-Wasserstein distance between the distributions, and designing a quantile loss to minimize the distance between distributions; and constructing two actor-critic network structures, an estimation network and a target network, standardizing the data samples of the coal-fired power generation system, and adopting the DDPG algorithm to estimate the value distribution and update the strategy, thereby obtaining the coordinated control strategy. Based on system operation data and the value distribution DDPG algorithm, the invention addresses the problems that the environment is complex and uncertain, that the coal-fired power generation system is difficult to model accurately, and that existing methods lack adaptive capability.

Description

Coal-fired power generation system coordination control method based on value distribution DDPG algorithm
Technical Field
The invention belongs to the field of data-driven control of coal-fired power generation systems, and in particular relates to a coal-fired power generation system coordination control method based on a value distribution DDPG algorithm.
Background
In recent years, global energy consumption has risen continuously, and the integration of new renewable energy sources threatens grid stability. With thermal power generation serving as a primary energy source, the safe and stable operation of coal-fired power plants is essential for maintaining grid reliability. The boiler-turbine system is the core component of a thermal power plant, and intelligent operation strategies are required to accommodate fluctuations in grid demand. The coordinated control system plays an important role in maintaining the energy supply-demand balance of the coal-fired generating unit and optimizing operating efficiency. Recently, researchers have used input convex neural networks for dynamic modeling of boiler-turbine systems, applying nonlinear model predictive control to achieve fast and stable tracking performance under wide load changes. In addition, researchers have proposed an H-infinity-LQR-based coordinated control scheme to reasonably schedule the various devices of the system and improve the load adaptability of the generating unit under various working conditions. However, these methods depend on modeling precision and show poor adaptive capability under environmental changes or upgrading and retrofitting of the thermal power unit. A policy-iteration integral reinforcement learning method has therefore been proposed to design an optimal tracking control law for the boiler-turbine system, avoiding the construction of an accurate model and improving the adaptive capability of the algorithm; however, it does not fully consider the uncertainty of the coal-fired power generation system environment.
DDPG is a reinforcement learning method that combines Q learning and the deterministic policy gradient algorithm under a deep network framework. It performs self-learning through reward feedback and environment interaction, is suitable for complex continuous control problems, and has been widely applied to wind turbines, autonomous driving, robotics, the Internet of Things, and other fields. The environment of a coal-fired power generation system is complex, with many influencing factors such as temperature, pressure, humidity, and coal quality, all of which significantly affect power generation efficiency and combustion performance. Most existing DDPG algorithms evaluate the expected return of executing a policy in the current state-action pair through a scalar Q value function, and therefore cannot reflect the influence of environmental uncertainty on policy evaluation.
Therefore, there is an urgent need to develop a DDPG algorithm with adaptive, self-learning, and uncertainty environment handling capabilities to solve the problem of coordinated control of a coal-fired power generation system under a given objective.
Disclosure of Invention
The invention aims to: the method solves the problem of coordination control of the coal-fired power generation system with unmodeled dynamic and uncertain environmental factors, so that the control strategy of the coal-fired power generation system has self-adaption and self-learning capabilities, and the problem that the scalar Q value is difficult to represent the value estimation of the uncertain coal-fired power generation system is solved by introducing a value distribution form.
The invention comprises the following steps: the invention provides a coal-fired power generation system coordination control method based on a value distribution DDPG algorithm, which comprises the following steps:
step 1, constructing a coordination control cost function of a coal-fired power generation system;
step 2, evaluating the long-term expected control cost function through a Q function, introducing a value distribution form Z corresponding to the scalar Q value, and modeling the Z distribution using the quantile method;
step 3, introducing a distributed expression of a Bellman operator to construct target distribution, measuring a minimum distance between Z distribution and the target distribution, and designing quantile loss to minimize the distance between the distributions;
and step 4, constructing two actor-critic network structures, an estimation network and a target network, standardizing the data samples of the coal-fired power generation system, taking the quantile Huber loss as the loss function of the estimation network and the target network, and training and updating the estimation network and the target network with the DDPG algorithm to estimate the value distribution and update the strategy, thereby obtaining the coordinated control strategy.
Further, the step 1 specifically includes the following:
step 101, analyzing the operating characteristics of the coal-fired power generation system, specifically: steam output is critical to a coal-fired power generation system because it directly determines the power output of the plant. The fuel flow directly affects combustion efficiency, while the feed-water and steam flow valves directly affect steam output. The coal-fired power plant needs to stabilize the drum water level to improve boiler performance. Moreover, although the water level depends on the control inputs, controlling it directly is quite cumbersome; the water level change is instead reflected indirectly through the drum steam pressure and the density of the fluid in the drum. During operation of the coal-fired generating unit, the goal of the coordinated control system is to make the unit meet the electric load demand while stabilizing the steam pressure, electric power, and drum water level at the required levels, thereby reducing energy consumption as much as possible;
step 102, selecting steam flow, fuel flow and feedwater flow as control variables and steam pressure, drum fluid density and power output as state variables according to the operation characteristics;
step 103, formulating the coal-fired power generation system coordination control problem under the reinforcement learning framework and defining a four-tuple, as follows:
(1) the action space, representing the set of actions the reinforcement learning agent can perform. For the coordinated control problem, the control input serves as the action, i.e., a = [u_1, u_2, u_3]^T, where u_1, u_2, u_3 denote the control-valve openings for fuel flow, steam flow, and drum feed-water flow, respectively. Each valve opening ranges over (0, 1), so each control variable is a continuous action space, with 0 representing fully closed and 1 fully open;
(2) the state space, representing the set of information observed by the agent. The selected system state variables are the boiler drum steam pressure x_1 (kg/cm^2), the electric power x_2 (MW), and the fluid density in the boiler drum x_3 (kg/cm^3), together with the deviations e_1, e_2, and e_3 between the current state and the target state T_h,
where h = 1, 2, or 3; e_1 denotes the deviation between the current system state variable, boiler drum steam pressure x_1, and the target boiler drum steam pressure h_1;
e_2 denotes the deviation between the current system state variable, electric power x_2, and the target electric power h_2;
e_3 denotes the deviation between the current system state variable, fluid density in the boiler drum x_3, and the target fluid density h_3;
at the same time, the current control input a = [u_1, u_2, u_3]^T is taken as part of the observed state information, so the complete observation can be described as
s = [x_1, x_2, x_3, e_1, e_2, e_3, u_1, u_2, u_3]^T (1)
(3) r is the reward function, used to evaluate the objective and effect of the agent's action. A penalty term is imposed when the system state is far from the target state T_h, and a bonus term is added when the system state approaches the target state; the reward r is therefore set as
where ω_1, ω_2, and ω_3 are the tracking-error penalty weight, the control-consumption penalty weight, and the bonus weight for approaching the target state, respectively, and ‖e‖ denotes the tracking-error norm between the current system state and the target state T_h;
(4) γ is the discount factor, used to weigh the importance of future rewards. Specifically, the discount factor determines how strongly the agent discounts future rewards;
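As an illustration of the reward shaping in (3) and (4), the following sketch implements a hypothetical reward of this form; the weights w1-w3, the norm choice, and the relative tolerance used to decide that the state "approaches the target" are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def reward(x, u, target, w1=1.0, w2=0.1, w3=0.5, tol=0.05):
    """Hypothetical reward: penalize tracking error and control effort,
    add a bonus when every state is within `tol` (relative) of its target."""
    e = np.asarray(x, dtype=float) - np.asarray(target, dtype=float)
    r = -w1 * np.linalg.norm(e) - w2 * np.linalg.norm(np.asarray(u, dtype=float))
    if np.all(np.abs(e) <= tol * np.abs(np.asarray(target, dtype=float))):
        r += w3  # bonus term for approaching the target state
    return float(r)

r_on_target = reward([121.0, 389.0, 92.0], [0.0, 0.0, 0.0], [121.0, 389.0, 92.0])
r_far = reward([102.0, 300.0, 80.0], [0.3, 0.7, 0.4], [121.0, 389.0, 92.0])
```

A reward of this shape is largest (here, just the bonus w3) when the state sits exactly on the target with zero control effort, and decreases with both tracking error and valve usage.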
step 104, modeling the coal-fired power generation system coordination control problem as a Markov decision process. Specifically, consider an agent whose observed information state at time t is s_t; after taking action a_t, it transitions to the next state s_{t+1} and obtains the scalar reward r_t from the environment. The objective is to maximize the cumulative return
where t_l denotes the start time of the state and γ ∈ (0, 1) is the discount factor;
step 105, the coordination control cost function of the coal-fired power generation system is as follows
Further, the step 2 specifically includes the following content:
step 201, defining the function that describes, at a specific time t, the coal-fired power generation system being in state s_t, performing action a_t, and thereafter executing the coordinated control strategy μ(s), giving the long-term expected control cost function, i.e.
Step 202, introducing a value distribution form corresponding to the Q function, namely Z μ (s, a) satisfy
Step 203, define [0,1 ]]N quantile points on τ 1 ,…,τ N Representing the cumulative probability function associated with such a Z distribution:
wherein the method comprises the steps ofRepresenting an inverse cumulative probability function with respect to the Z distribution;
in step 204, the distribution modeling is performed on the value distribution form by using quantiles, specifically:
where the preceding symbol represents the τ_i quantile of the cumulative probability distribution of Z^μ(s, a).
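A minimal sketch of this quantile parameterization, assuming N equal-probability bins; the midpoint positions are the optimal support points later referenced in step 303, and recovering the scalar Q value as the mean of the quantile estimates is a standard convention in quantile-based distributional RL:

```python
import numpy as np

def quantile_midpoints(N):
    # tau_hat_i = (2i - 1) / (2N): the midpoint of each of the N
    # equal-probability bins, the 1-Wasserstein-optimal support points.
    return (2 * np.arange(1, N + 1) - 1) / (2 * N)

def q_from_quantiles(theta):
    # Scalar Q value recovered as the mean of the N quantile estimates.
    return float(np.mean(theta))

tau_hat = quantile_midpoints(4)
```

With N = 4 this yields the midpoints 0.125, 0.375, 0.625, 0.875, each carrying probability mass 1/N.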
Further, the step 3 specifically includes the following contents;
step 301, introducing the distributional expression of the Bellman operator to construct the target distribution, namely
where T^μ is the Bellman operator, and the notation indicates that both sides of the equation follow the same probability law, i.e., the random variables Y and U are identically distributed;
step 302, measuring the minimal 1-Wasserstein distance between the two distributions at the quantiles
where the preceding symbol represents the inverse cumulative distribution function of the distribution Z;
step 303, the optimal value positions of the N quantiles are
Step 304, introducing a distance between quantile loss metric distributions
Wherein the method comprises the steps ofRepresenting if sigma<0 is true, then-> Calculate for a loss with respect to sigma;
step 305, minimizing the distance between distributions based on the minimum quantile loss
where d_1(Y, U) denotes the 1-Wasserstein distance between the distributions of Y and U;
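The quantile loss of steps 304-305 can be sketched as follows; ρ_τ(σ) = σ(τ − 1{σ<0}) is the standard quantile-regression (pinball) loss, which the omitted formula is understood to denote:

```python
def quantile_loss(tau, sigma):
    # rho_tau(sigma) = sigma * (tau - 1{sigma < 0}): penalizes
    # under- and over-estimation asymmetrically according to tau.
    indicator = 1.0 if sigma < 0 else 0.0
    return sigma * (tau - indicator)
```

For tau = 0.9, overshooting the quantile estimate (σ > 0) costs nine times as much as undershooting it by the same amount, which is what drives the estimate toward the 0.9 quantile.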
further, the step 4 specifically includes the following contents;
step 401, constructing an estimation network and a target network
The estimation network and the target network comprise an actor network, a critic network, a target actor network, and a target critic network;
step 402, collecting a raw data sample of a coal-fired power generation system as an estimation network and a target network training set;
step 403, training and updating the estimation network and the target network by adopting a DDPG algorithm to realize the estimation of the value distribution and the updating of the strategy;
step 404, repeating step 403 until the highest control cost function value is stably obtained, thereby obtaining the coordinated control strategy.
Further, in step 401, the input of the actor network is the observed information s of the coal-fired power generation system, and the output is μ (s|θ μ );
The input of the critic network is the observed information s and the control input a of the coal-fired power generation system, and its output is
The input of the target actor network is the next observed information s′ obtained after the coal-fired power generation system executes control input a under observed information s, and its output is μ′(s′|θ^{μ′});
The input of the target critic network is the next observed information s′ obtained after the coal-fired power generation system executes control input a under observed information s, together with the corresponding target actor network output μ′(s′|θ^{μ′}), and its output is
Further, in step 403, the estimated network and the target network are trained and updated by using a DDPG algorithm to implement the estimation of the value distribution and the update of the policy, which specifically includes the following steps:
step 4031, computing the temporal-difference error from the value distribution outputs of the critic network and the target critic network:
wherein s' represents the next observation information obtained by the coal-fired power generation system by executing the control input a on the observation information s;
step 4032, calculating a fractional Huber loss function between value distributions
where the indicator term equals 1 if δ_ij < 0 holds and 0 otherwise, and κ is the hyperparameter of the Huber loss function;
step 4033, minimizing the quantile Huber loss function to update the critic network parameters
Wherein B is a standardized coal-fired power generation training set;
step 4034, updating the actor network along the gradient direction of Z(s, μ(s|θ^μ))
Step 4035, updating the target network parameters based on the estimated network parameters
where ξ denotes the update magnitude of θ^{μ′} and θ^{Z′}.
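The soft (Polyak) update of step 4035 can be sketched as follows, assuming the conventional form θ′ ← ξθ + (1 − ξ)θ′ with update magnitude ξ:

```python
def soft_update(target_params, source_params, xi=5e-3):
    # theta' <- xi * theta + (1 - xi) * theta': the target network
    # slowly tracks the estimation network, stabilizing training.
    return [xi * s + (1.0 - xi) * t
            for t, s in zip(target_params, source_params)]

updated = soft_update([0.0, 1.0], [1.0, 1.0], xi=0.5)
```

Because ξ is small (5e-3 in the embodiment), the target networks change slowly, which keeps the TD targets of step 4031 from chasing a moving estimate.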
The technical scheme provided by the invention has the following beneficial effects:
1) By organically combining the reinforcement learning optimization framework with the coordinated control of the complex, uncertain coal-fired power generation system, the self-learning capability of the control strategy is significantly improved, flexibly meeting the practical requirements of coordinated control.
2) By introducing the value distribution concept into DDPG, the complete distribution information of the cost function is described, so that the cost function is predicted accurately, the uncertainty of the coal-fired power generation operating environment is better handled, and learning efficiency is guaranteed.
Drawings
FIG. 1 is a flow chart of coordinated control of a coal-fired power generation system based on a value distribution DDPG algorithm;
FIG. 2 is a diagram of a value distribution DDPG algorithm training;
FIG. 3 is the trajectory diagram of state x_1 and error e_1;
FIG. 4 is the trajectory diagram of state x_2 and error e_2;
FIG. 5 is the trajectory diagram of state x_3 and error e_3;
Fig. 6 is a control input trace diagram.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
The invention relates to a coal-fired power generation system coordination control method based on value distribution DDPG, which is shown in figure 1 and comprises the following steps:
step 1, modeling a coal-fired power generation system coordination control problem as a Markov decision process under a reinforcement learning optimization framework based on operation characteristic analysis of the coal-fired power generation system, and constructing a coal-fired power generation system coordination control cost function; the method specifically comprises the following steps:
step 101, analyzing the operation characteristics of a coal-fired power generation system;
the concrete steps are as follows: steam output is critical to coal-fired power generation systems because it is directly related to the power output effects of the plant. The fuel flow directly affects combustion efficiency, while the water feed and steam flow valves directly affect steam output. The coal-fired power plant needs to achieve stabilization of drum water level to improve boiler performance. In addition, the height of the water level depends on the control input, and the direct control of the water level is quite cumbersome, and the water level change is indirectly reflected through the steam pressure of the steam drum and the density of the fluid in the steam drum. In the operation process of the coal-fired power generation unit, the coordination control system aims at enabling the unit to meet the electric load requirement and stabilizing the steam pressure, the electric power and the drum water level at required levels, so that energy consumption is reduced as much as possible.
Step 102, selecting steam flow, fuel flow and feedwater flow as control variables and steam pressure, drum fluid density and electric power output as state variables according to the operating characteristics.
Step 103, formulating a coal-fired power generation system coordination control problem under the reinforcement learning framework, and defining a four-element groupThe method comprises the following steps:
(1) the action space, representing the set of actions the reinforcement learning agent can perform. For the coordinated control problem, the control input serves as the action, i.e., a = [u_1, u_2, u_3]^T, where u_1, u_2, and u_3 denote the control-valve openings for fuel flow, steam flow, and drum feed-water flow, respectively. Each valve opening ranges over (0, 1), so each control variable is a continuous action space, with 0 representing fully closed and 1 fully open.
(2)Is a state space representing a collection of information observed by an agent. Selecting system state variable boiler drum steam pressure x 1 (kg/cm 2), electric power x 2 (MW) and fluid Density in boiler drum x 3 (kg/cm 3), current system state and target system state T h Deviation e between 1 、e 2 And e 3
Wherein h=1, 2 or 3; e, e 1 Representing the boiler drum steam pressure x of the previous system state variable 1 Boiler drum steam pressure h with target system state quantity 1 Deviation between;
e 2 representing the pre-system state variable power x 2 Power h as a target system state variable 2 Deviation between;
e 3 representing the fluid density x in the boiler drum of the previous system state variable 3 Fluid density h in boiler drum corresponding to state variable of target system 3 Deviation between;
at the same time, the current control input a = [u_1, u_2, u_3]^T is taken as part of the observed state information, so the complete observation can be described as
s = [x_1, x_2, x_3, e_1, e_2, e_3, u_1, u_2, u_3]^T (1)
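The observation vector of Eq. (1) can be assembled as a simple sketch (function and variable names are illustrative):

```python
import numpy as np

def observation(x, target, u):
    # s = [x1, x2, x3, e1, e2, e3, u1, u2, u3]^T per Eq. (1),
    # with e_h = x_h - h_h the deviation from the target state.
    x = np.asarray(x, dtype=float)
    e = x - np.asarray(target, dtype=float)
    return np.concatenate([x, e, np.asarray(u, dtype=float)])

s = observation([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
```

The resulting 9-dimensional vector stacks the three states, their target deviations, and the three valve openings.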
(3) r is the reward function, used to evaluate the objective and effect of the agent's action. A penalty term is imposed when the system state is far from the target state T_h, and a bonus term is added when the system state approaches the target state; the reward r is therefore set as
where ω_1, ω_2, and ω_3 are the tracking-error penalty weight, the control-consumption penalty weight, and the bonus weight for approaching the target state, respectively, and ‖e‖ denotes the tracking-error norm between the current system state and the target state T_h.
(4) γ is the discount factor, used to weigh the importance of future rewards. Specifically, the discount factor determines how strongly the agent discounts future rewards.
Step 104, modeling the coal-fired power generation system coordination control problem as a Markov decision process. Specifically, consider an agent whose observed information state at time t is s_t; after taking action a_t, it transitions to the next state s_{t+1} and obtains the scalar reward r_t from the environment, so as to maximize the cumulative return, which is
where t_l denotes the start time of the state and γ ∈ (0, 1) is the discount factor;
step 105, the coordination control cost function of the coal-fired power generation system is as follows
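The discounted cumulative return of step 104 can be sketched as follows (the backward accumulation is an implementation convenience, equivalent to the forward sum of γ^k r_k):

```python
def discounted_return(rewards, gamma=0.99):
    # R = sum_k gamma^k * r_k, accumulated backwards.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, three unit rewards with γ = 0.5 give 1 + 0.5 + 0.25 = 1.75.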
Step 2, evaluating the long-term expected control cost function through a Q function, introducing the value distribution form Z corresponding to the scalar Q value, and modeling the Z distribution using the quantile method; the method specifically comprises the following steps:
step 201, defining the function that describes, at a specific time t, the coal-fired power generation system being in state s_t, performing action a_t, and thereafter executing the coordinated control strategy μ(s), giving the long-term expected control cost function, i.e.
Step 202, introducing a value distribution form corresponding to the Q function, namely Z μ (s, a) satisfy
Step 203, define [0,1 ]]N quantile points on τ 1 ,…,τ N Representing the cumulative probability function associated with such a Z distribution:
wherein the method comprises the steps ofRepresenting an inverse cumulative probability function with respect to the Z distribution.
In step 204, the value distribution form is modeled with quantiles, specifically:
where the preceding symbol represents the τ_i quantile of the cumulative probability distribution of Z^μ(s, a).
Step 3, introducing a distributed expression of a Bellman operator to construct target distribution, measuring a minimum distance between the Z distribution and the target distribution, and designing quantile loss to minimize the distance between the distributions; the method specifically comprises the following steps:
step 301, introducing the distributional expression of the Bellman operator to construct the target distribution, namely
where T^μ is the Bellman operator, and the notation indicates that both sides of the equation follow the same probability law, i.e., the random variables Y and U are identically distributed;
step 302, measuring the minimal 1-Wasserstein distance between the two distributions at the quantiles
where the preceding symbol represents the inverse cumulative distribution function of the distribution Z.
Step 303, the optimal value positions of the N quantiles are
Step 304, introducing a distance between quantile loss metric distributions
Wherein the method comprises the steps ofRepresenting if sigma<0 is true, then-> Calculated for a loss with respect to sigma.
Step 305, minimizing the distance between distributions based on the minimum quantile loss
where d_1(Y, U) denotes the 1-Wasserstein distance between the distributions of the random variables Y and U.
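For equal-size one-dimensional samples, the 1-Wasserstein distance d_1(Y, U) reduces to the mean absolute difference between sorted samples, i.e. the inverse-CDF form underlying step 302; a sketch:

```python
import numpy as np

def wasserstein1(y, u):
    # d_1 between two empirical 1-D distributions with equal sample
    # counts: integrate |F_Y^{-1}(tau) - F_U^{-1}(tau)| over tau,
    # which for samples is the mean gap between order statistics.
    return float(np.mean(np.abs(np.sort(np.asarray(y, dtype=float))
                                - np.sort(np.asarray(u, dtype=float)))))
```

Note that the distance depends only on the distributions, not the sample order: {0, 1} and {1, 0} are the same distribution and have distance zero.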
And step 4, constructing two actor-critic network structures, an estimation network and a target network, standardizing the data samples of the coal-fired power generation system, and training and updating the estimation network and the target network with the DDPG algorithm to estimate the value distribution and update the strategy, thereby obtaining the coordinated control strategy. The method specifically comprises the following steps:
In step 401, the estimation network and the target network are constructed as two actor-critic network structures. Together they comprise four neural networks: an actor network, a critic network, a target actor network, and a target critic network. The actor network parameters are denoted by θ^μ and the critic network parameters by θ^Z; the target actor network parameters are denoted by θ^{μ′} and the target critic network parameters by θ^{Z′};
wherein, the input of the actor network is observed information s of the coal-fired power generation system, and the output is mu (s|theta) μ );
The input of the critic network is the observed information s and the control input a of the coal-fired power generation system, and its output is
The input of the target actor network is the next observed information s′ obtained after the coal-fired power generation system executes control input a under observed information s, and its output is μ′(s′|θ^{μ′});
The input of the target critic network is the next observed information s′ obtained after the coal-fired power generation system executes control input a under observed information s, together with the corresponding target actor network output μ′(s′|θ^{μ′}), and its output is
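The four-network layout of step 401 can be sketched with plain NumPy multi-layer perceptrons; the layer sizes, initialization, and quantile count here are illustrative assumptions (a deep learning framework with automatic differentiation would be used in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM, N_QUANT = 9, 3, 32  # illustrative dimensions

def mlp_init(sizes):
    # One (W, b) pair per layer, weights scaled by fan-in.
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x, out_tanh=False):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU hidden activations
    return np.tanh(x) if out_tanh else x

actor = mlp_init([OBS_DIM, 64, 64, ACT_DIM])             # mu(s | theta_mu)
critic = mlp_init([OBS_DIM + ACT_DIM, 64, 64, N_QUANT])  # quantiles of Z(s, a)
# Target networks start as exact copies of the estimation networks.
target_actor = [(W.copy(), b.copy()) for W, b in actor]
target_critic = [(W.copy(), b.copy()) for W, b in critic]

s = rng.standard_normal(OBS_DIM)
a = mlp_forward(actor, s, out_tanh=True)          # bounded action
z = mlp_forward(critic, np.concatenate([s, a]))   # N_QUANT quantile values
```

The distributional critic differs from a standard DDPG critic only in its head: it emits N quantile values of Z(s, a) instead of a single scalar Q.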
Step 402, converting the data samples originally collected by the coal-fired power generation system into a standardized training set with a mean of 0 and a variance of 1 using z-score normalization.
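The z-score normalization of step 402 can be sketched as:

```python
import numpy as np

def zscore(data):
    # Standardize each column to mean 0 and (population) variance 1;
    # constant columns are left centered rather than divided by zero.
    data = np.asarray(data, dtype=float)
    sd = data.std(axis=0)
    return (data - data.mean(axis=0)) / np.where(sd == 0.0, 1.0, sd)

normed = zscore([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
```

Standardizing the raw plant samples puts pressures, powers, and densities on comparable scales, which keeps no single state variable from dominating the network gradients.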
Step 403, training and updating the estimation network and the target network by adopting a DDPG algorithm to realize the estimation of the value distribution and the updating of the strategy, specifically:
(1) Computing the temporal-difference error from the value distribution outputs of the critic network and the target critic network
Wherein s' represents the next observation information obtained by the coal-fired power generation system by executing the control input a on the observation information s;
(2) Computing the quantile Huber loss function between value distributions
where the indicator term equals 1 if δ_ij < 0 holds and 0 otherwise, and κ is the hyperparameter of the Huber loss function.
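Combining the Huber function with the quantile weights gives the quantile Huber loss minimized in step (3); the following sketch follows the standard quantile-regression formulation, and the averaging convention over the indices i and j is an assumption:

```python
import numpy as np

def huber(delta, kappa=0.3):
    # Quadratic near zero, linear in the tails.
    a = np.abs(delta)
    return np.where(a <= kappa, 0.5 * delta ** 2, kappa * (a - 0.5 * kappa))

def quantile_huber_loss(theta, targets, kappa=0.3):
    # theta: N quantile estimates of Z(s, a); targets: TD-target samples.
    N = len(theta)
    tau_hat = (2 * np.arange(1, N + 1) - 1) / (2 * N)
    delta = np.asarray(targets)[None, :] - np.asarray(theta)[:, None]  # delta_ij
    weight = np.abs(tau_hat[:, None] - (delta < 0))  # |tau_i - 1{delta_ij < 0}|
    return float(np.mean(weight * huber(delta, kappa) / kappa))
```

The Huber smoothing makes the loss differentiable at zero while keeping the asymmetric quantile weighting, so large TD errors are penalized linearly rather than quadratically.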
(3) Minimizing the quantile Huber loss function to update the critic network parameters
Wherein B is a standardized coal-fired power generation training set.
(4) Updating the actor network along the gradient direction of Z(s, μ(s|θ^μ))
(5) Updating the target network parameters from the estimation network parameters by soft update
where ξ denotes the update magnitude of θ^{μ′} and θ^{Z′}, usually ξ ≪ 1.
Step 404, repeating step 403 until the highest control cost function value is stably obtained; training then ends and the coordinated control strategy is obtained.
The method is suitable for the coal-fired power generation system under various operation conditions. In order to better understand the present invention, a detailed description of a coal-fired power generation system coordination control method of the value distribution DDPG algorithm will be given below with reference to specific embodiments.
When designing the coal-fired power generation system coordination controller based on the value distribution DDPG algorithm, four multi-layer perceptron neural networks are used as the actor network, critic network, target actor network and target critic network, respectively. The hidden-layer activation function of each network is ReLU, the output-layer activation function of the actor network and the target actor network is Tanh, and the optimizer is Adam. The mini-batch size |B| for network training is 128, the soft update rate ξ = 5e-3, and the network learning rate is 1e-4. The Huber hyperparameter κ = 0.3 is chosen.
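A NumPy-only sketch of the actor network's forward pass (ReLU hidden layers, Tanh output layer, as described above); the hidden width of 64 and the random weight initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_actor_forward(s, weights, biases):
    """Forward pass of an MLP actor: ReLU hidden layers, Tanh output layer."""
    h = s
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)                 # ReLU activation
    return np.tanh(h @ weights[-1] + biases[-1])       # Tanh bounds actions to (-1, 1)

obs_dim, hidden, act_dim = 9, 64, 3   # 9-dim observation s, 3 control valve openings
weights = [rng.normal(0, 0.1, (obs_dim, hidden)),
           rng.normal(0, 0.1, (hidden, hidden)),
           rng.normal(0, 0.1, (hidden, act_dim))]
biases = [np.zeros(hidden), np.zeros(hidden), np.zeros(act_dim)]

a = mlp_actor_forward(rng.normal(size=obs_dim), weights, biases)
# the three action components are bounded by the Tanh output layer
```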
The initial state and the target state are set to x(0) = [102, 438.93]^T and T_h = [h_1, h_2, h_3]^T = [121, 389, 92]^T, and the initial control input is a_0 = [0.3102, 0.6711, 0.3967]. The control cost function is
Where discount coefficient γ=0.99.
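With γ = 0.99, the discounted cumulative return that the controller maximizes can be computed as follows; the reward sequence is an illustrative assumption:

```python
import numpy as np

gamma = 0.99  # discount coefficient from the embodiment
rewards = np.array([1.0, 0.5, 0.25, 0.125])  # illustrative reward sequence r_0..r_3

# cumulative discounted return R = sum_t gamma^t * r_t
discounts = gamma ** np.arange(len(rewards))
ret = float(np.dot(discounts, rewards))
# each later reward contributes only gamma^t of its raw value
```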
The method is applied to a 160 MW coal-fired power generation system. Operation data are collected through interaction with the coal-fired power generation environment and network training is performed, followed by multiple operation tests in that environment. The resulting average reward curve is shown in fig. 2, where the horizontal axis represents the number of training steps, the vertical axis represents the corresponding reward, the solid line is the average reward curve over multiple runs, and the shading represents the reward variance. As can be seen from fig. 2, the reward curve has converged after 2e5 training steps. After training, the value distribution DDPG algorithm is applied to the coal-fired power generation system to test its tracking of the target state. The trajectories of the system variables x_1, x_2, x_3 and the corresponding tracking targets h_1, h_2, h_3, together with the state and error traces, are shown in fig. 3, fig. 4 and fig. 5, respectively, and the system input variable trajectories are shown in fig. 6; the system is finally able to track the given target values.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed; any modifications, equivalent replacements and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (7)

1. The coal-fired power generation system coordination control method based on the value distribution DDPG algorithm is characterized by comprising the following steps of:
step 1, constructing a coordination control cost function of a coal-fired power generation system;
step 2, evaluating a long-term expected control cost function through a Q function, introducing a value distribution form Z corresponding to the scalar Q value, and modeling the Z distribution using a quantile method;
step 3, introducing a distributional expression of the Bellman operator to construct a target distribution, measuring the minimal distance between the Z distribution and the target distribution, and designing a quantile loss to minimize that distance;
step 4, constructing an estimation network and a target network, and training and updating the estimation network and the target network by adopting a DDPG algorithm to realize the estimation of the value distribution and the updating of the strategy, so as to obtain a coordinated control strategy.
2. The method for coordinated control of a coal-fired power generation system based on a value distribution DDPG algorithm according to claim 1, wherein the constructing the coordinated control cost function of the coal-fired power generation system in step 1 comprises the following steps:
step 101, selecting steam flow, fuel flow and water supply flow as control variables of a coal-fired power generation system;
steam pressure, drum fluid density and power output are selected as state variables of the coal-fired power generation system;
step 102, formulating the coal-fired power generation system coordination control problem under a reinforcement learning framework, and defining a quadruple (𝒜, 𝒮, r, γ) as follows:
(1) 𝒜 is the action space, the set of actions a performed by the reinforcement learning agent, i.e., a = [u_1, u_2, u_3]^T, where u_1, u_2, u_3 respectively represent the control valve openings of the fuel flow, steam flow and drum feedwater flow;
(2) 𝒮 is the state space, the set of information observed by the agent; the observations comprise the system state variables boiler drum steam pressure x_1, electric power output x_2 and fluid density in the boiler drum x_3, together with the deviations e_1, e_2 and e_3 between the current system state and the target system state T_h;
Thus, the complete observation information is described as
s = [x_1, x_2, x_3, e_1, e_2, e_3, u_1, u_2, u_3]^T (1)
(3) r is a reward function for evaluating the objective and effect of the agent when executing the action;
where ω_1, ω_2 and ω_3 are respectively the tracking error penalty weight, the control consumption penalty weight and the reward weight for approaching the target state value, and ‖e‖ denotes the tracking error norm between the current system state and the target state T_s;
(4) γ is a discount factor for measuring the importance of future rewards;
step 104, modeling the coal-fired power generation system coordination control problem as a Markov decision process: the observed state of the agent at time t is denoted s_t ∈ 𝒮; after taking action a_t ∈ 𝒜, the system transitions to the next state s_{t+1} and obtains a scalar reward r_t from the environment; to this end, the cumulative return obtained is maximized:
R_{t_l} = Σ_{t=t_l}^∞ γ^{t−t_l} r_t
where t_l denotes the state start time and γ ∈ (0, 1) is the discount factor;
step 105, the coordination control cost function of the coal-fired power generation system is as follows
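The reward structure of item (3) in claim 2 — a tracking error penalty, a control consumption penalty, and a bonus near the target state — can be sketched as follows. The exact functional form, the weights ω_1, ω_2, ω_3 and the tolerance are illustrative assumptions, since the claim does not fix them here:

```python
import numpy as np

def reward(e, u, omega1=1.0, omega2=0.1, omega3=0.5, tol=0.05):
    """Hypothetical reward: penalize the tracking error norm ||e|| and the
    control effort ||u||, and add a bonus when the state is near the target."""
    err = float(np.linalg.norm(e))
    bonus = omega3 if err < tol else 0.0
    return -omega1 * err - omega2 * float(np.linalg.norm(u)) + bonus

u0 = np.array([0.3102, 0.6711, 0.3967])   # initial control input from the embodiment
r_far = reward(np.array([1.0, 0.5, 0.2]), u0)
r_near = reward(np.array([0.01, 0.0, 0.0]), u0)
# being near the target yields a strictly higher reward than being far away
```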
3. The coordinated control method of a coal-fired power generation system based on a value distribution DDPG algorithm according to claim 1, wherein in step 2, the long-term expected control cost function is evaluated through the Q function, a value distribution form Z corresponding to the scalar Q value is introduced, and the Z distribution is modeled using the quantile method, specifically comprising the following steps:
step 201, defineFor describing that at a specific time t, the coal-fired power generation system is in a current state s t When act a is performed t And then performs a long-term desired control cost function of the coordinated control strategy μ(s), i.e
step 202, introducing the value distribution form corresponding to the scalar Q value, i.e., a random variable Z^μ(s, a) satisfying
Q^μ(s, a) = 𝔼[Z^μ(s, a)];
step 203, defining N quantile points τ_1, …, τ_N on [0, 1] representing the cumulative probability levels associated with the Z distribution, where F_Z^{-1}(τ) denotes the inverse cumulative probability function of the Z distribution;
step 204, modeling the Z distribution using quantiles, specifically
Z^μ(s, a) ≈ {z_i}_{i=1}^N with z_i = F_{Z^μ(s,a)}^{-1}(τ_i)
where z_i denotes the τ_i-quantile of the cumulative probability distribution of Z^μ(s, a).
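The quantile representation of claim 3 can be illustrated with a distribution whose inverse cumulative probability function is known in closed form; Uniform(0, 1) is an illustrative choice:

```python
N = 10
# quantile midpoints tau_hat_i = (2i - 1) / (2N), i = 1..N
tau_hat = [(2 * i - 1) / (2 * N) for i in range(1, N + 1)]

# For Z ~ Uniform(0, 1) the inverse cumulative probability function is
# F^{-1}(tau) = tau, so the quantile atoms of Z are tau_hat itself
z_quantiles = tau_hat

# the scalar Q value is recovered as the mean of the quantile atoms
q_estimate = sum(z_quantiles) / N
# q_estimate equals 0.5, the true mean of Uniform(0, 1)
```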
4. The coordinated control method of a coal-fired power generation system based on a value distribution DDPG algorithm according to claim 1, wherein in step 3, a distributional expression of the Bellman operator is introduced to construct a target distribution, the minimal distance between the Z distribution and the target distribution is measured, and a quantile loss is designed to minimize that distance, specifically comprising the following steps:
step 301, introducing the distributional expression of the Bellman operator to construct the target distribution
𝒯^μ Z(s, a) :≐ r(s, a) + γZ(s′, μ(s′))
where 𝒯^μ is the Bellman operator, and Y ≐ U indicates that both sides of the equation follow the same probability law, i.e., the random variables Y and U are identically distributed;
step 302, measuring the 1-Wasserstein distance between the Z distribution and the target distribution over the quantile levels
d_1(Y, U) = ∫₀¹ |F_Y^{-1}(ω) − F_U^{-1}(ω)| dω
where F_Z^{-1} denotes the inverse cumulative distribution function of the distribution Z;
step 303, the optimal value positions of the N quantiles are the interval midpoints
τ̂_i = (τ_{i−1} + τ_i)/2 = (2i − 1)/(2N), i = 1, …, N;
Step 304, introducing a distance between quantile loss metric distributions
Wherein the method comprises the steps ofRepresenting if sigma<0 is true, then-> Calculate for a loss with respect to sigma;
step 305, minimizing the distance between distributions based on the quantile loss
min Σ_{i=1}^N 𝔼_U[ρ_{τ̂_i}(U − z_i)]
where d_1(Y, U) denotes the 1-Wasserstein distance between the distributions Y and U.
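The quantile loss ρ_τ(σ) = (τ − 𝕀{σ < 0})σ of claim 4 can be checked numerically: minimizing its sample average over a scalar θ recovers the τ-quantile of the samples. The sample set and the search grid are illustrative assumptions:

```python
import numpy as np

def quantile_loss(theta, samples, tau):
    """Sample average of rho_tau(sigma) = (tau - 1{sigma < 0}) * sigma,
    with sigma = sample - theta."""
    sigma = samples - theta
    return float(np.mean((tau - (sigma < 0)) * sigma))

samples = np.arange(1.0, 101.0)          # empirical distribution over 1..100
tau = 0.25                               # target quantile level
candidates = np.linspace(0.0, 101.0, 1011)
losses = [quantile_loss(c, samples, tau) for c in candidates]
best = float(candidates[int(np.argmin(losses))])
# best lies near the 25th percentile of the samples (~25)
```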
5. The coal-fired power generation system coordination control method based on the value distribution DDPG algorithm according to claim 1, wherein in step 4, an estimation network and a target network are constructed, data samples of the coal-fired power generation system are standardized, and the estimation network and the target network are trained and updated by adopting the DDPG algorithm to realize the estimation of the value distribution and the updating of the strategy, so as to obtain the coordinated control strategy:
Step 401, constructing an estimation network and a target network
The estimation network and the target network comprise an actor network, a critic network, a target actor network and a target critic network;
step 402, collecting a raw data sample of a coal-fired power generation system as an estimation network and a target network training set;
step 403, training and updating the estimation network and the target network by adopting a DDPG algorithm to realize the estimation of the value distribution and the updating of the strategy;
step 404, continuously cycling step 403 until the control cost function stably attains its maximum, thereby obtaining the coordinated control strategy.
6. The coordinated control method of a coal-fired power generation system based on a value distribution DDPG algorithm according to claim 5, wherein in step 401, the input of the actor network is the observed information s of the coal-fired power generation system, and the output is μ(s|θ^μ);
The input of the critic network is the observed information s and the control input a of the coal-fired power generation system, and the output is the value distribution Z(s, a|θ^Z);
The input of the target actor network is the next observed information s′ obtained by the coal-fired power generation system executing the control input a under the observed information s, and the output is μ′(s′|θ^μ′);
The input of the target critic network is the next observed information s′ obtained by the coal-fired power generation system executing the control input a under the observed information s, together with the corresponding target actor network output μ′(s′|θ^μ′), and the output is the target value distribution Z′(s′, μ′(s′|θ^μ′)|θ^Z′).
7. The coordinated control method of a coal-fired power generation system based on a value distribution DDPG algorithm according to claim 5, wherein in step 403, the estimation network and the target network are trained and updated by the DDPG algorithm to realize the estimation of the value distribution and the update of the strategy, and specifically comprises the following steps:
step 4031, calculating the temporal-difference errors from the value-distribution outputs of the critic network and the target critic network:
δ_ij = r + γZ_j′(s′, μ′(s′|θ^μ′)) − Z_i(s, a), i, j = 1, …, N
where s′ represents the next observation information obtained by the coal-fired power generation system executing the control input a under the observation information s;
step 4032, calculating the quantile Huber loss function between value distributions
ρ^κ_{τ̂_i}(δ_ij) = |τ̂_i − 𝕀{δ_ij < 0}| · L_κ(δ_ij)/κ,
where L_κ(δ_ij) = ½δ_ij² if |δ_ij| ≤ κ and L_κ(δ_ij) = κ(|δ_ij| − ½κ) otherwise; 𝕀{δ_ij < 0} = 1 if δ_ij < 0 holds and 0 otherwise; κ is the hyperparameter of the Huber loss function;
step 4033, minimizing the quantile Huber loss function
L(θ^Z) = (1/|B|) Σ_{(s,a,r,s′)∈B} Σ_{i=1}^N (1/N) Σ_{j=1}^N ρ^κ_{τ̂_i}(δ_ij)
to update the critic network parameters, where B is the standardized coal-fired power generation training set;
step 4034, updating the actor network along the gradient direction of Z(s, μ(s|θ^μ)):
∇_{θ^μ}J ≈ (1/|B|) Σ_{s∈B} ∇_a (1/N) Σ_{i=1}^N Z_i(s, a)|_{a=μ(s|θ^μ)} ∇_{θ^μ}μ(s|θ^μ);
Step 4035, updating the target network parameters based on the estimated network parameters
Wherein xi represents θ μ′ And theta Z′ Is used for updating the amplitude of the update.
CN202310812903.9A 2023-07-04 2023-07-04 Coal-fired power generation system coordination control method based on value distribution DDPG algorithm Active CN116755409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310812903.9A CN116755409B (en) 2023-07-04 2023-07-04 Coal-fired power generation system coordination control method based on value distribution DDPG algorithm

Publications (2)

Publication Number Publication Date
CN116755409A true CN116755409A (en) 2023-09-15
CN116755409B CN116755409B (en) 2024-01-12

Family

ID=87953192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310812903.9A Active CN116755409B (en) 2023-07-04 2023-07-04 Coal-fired power generation system coordination control method based on value distribution DDPG algorithm

Country Status (1)

Country Link
CN (1) CN116755409B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474295A * 2023-12-26 2024-01-30 Changchun University of Technology Multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016157188A * 2015-02-23 2016-09-01 Denso IT Laboratory, Inc. Attitude estimation model generation device and attitude estimation device
CN110365056A * 2019-08-14 2019-10-22 Electric Power Research Institute of China Southern Power Grid Distributed-energy-participating distribution network voltage regulation optimization method based on DDPG
US20200327411A1 * 2019-04-14 2020-10-15 Di Shi Systems and Method on Deriving Real-time Coordinated Voltage Control Strategies Using Deep Reinforcement Learning
CN115123159A * 2022-06-27 2022-09-30 Chongqing University of Posts and Telecommunications AEB control method and system based on DDPG deep reinforcement learning
US11571809B1 * 2019-09-15 2023-02-07 X Development Llc Robotic control using value distributions
US20230071450A1 * 2021-09-09 2023-03-09 Siemens Aktiengesellschaft System and method for controlling large scale power distribution systems using reinforcement learning
CN115933410A * 2023-01-09 2023-04-07 China University of Mining and Technology Q-learning-based optimal tracking control method for dual-time-scale coal-fired power generation system
CN116131254A * 2022-12-26 2023-05-16 Shanghai Jiao Tong University Economic dispatching method of cogeneration unit based on DDPG algorithm
CN116207750A * 2023-02-28 2023-06-02 Kunming Power Supply Bureau of Yunnan Power Grid Co., Ltd. Power distribution network reactive power optimization method based on deep deterministic policy gradient algorithm
CN116316911A * 2023-03-28 2023-06-23 Electric Power Research Institute of State Grid Jiangsu Electric Power Co., Ltd. Active-reactive coordination optimization method and device for power system
CN116341376A * 2023-03-17 2023-06-27 Xi'an Jiaotong University Space thermionic nuclear power control method based on DDPG deep reinforcement learning algorithm


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
LAI, TUNG-YI et al.: "Combining Deep Deterministic Policy Gradient with Cross-Entropy Method", 2019 International Conference on Technologies and Applications of Artificial Intelligence (TAAI)
LI Ye; LIU Zhongxin; LAN Ge; SADER Malika; CHEN Zengqiang: "A DDPG-based solution for optimal consensus of continuous-time linear multi-agent systems", Science China Technological Sciences, vol. 66, no. 8
LIU Jianwei; GAO Feng; LUO Xionglin: "A survey of deep reinforcement learning based on value function and policy gradient", Chinese Journal of Computers, no. 06
PENG Liuyang; SUN Yuanzhang; XU Jian; LIAO Siyang; YANG Li: "Adaptive economic dispatch under uncertainty based on deep reinforcement learning", Automation of Electric Power Systems, vol. 44, no. 9
PENG Liuyang; SUN Yuanzhang; XU Jian; LIAO Siyang; YANG Li: "Adaptive economic dispatch under uncertainty based on deep reinforcement learning", Automation of Electric Power Systems, no. 09
SU Shihui; LEI Yong; LI Yongkai; ZHU Yingwei: "Medium- and short-term photovoltaic power prediction based on improved DDPG algorithm", Semiconductor Optoelectronics, no. 05

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474295A * 2023-12-26 2024-01-30 Changchun University of Technology Multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm
CN117474295B * 2023-12-26 2024-04-26 Changchun University of Technology Multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm

Also Published As

Publication number Publication date
CN116755409B (en) 2024-01-12

Similar Documents

Publication Publication Date Title
CN109062053B (en) Denitration ammonia injection control method based on multivariate correction
US20060184477A1 (en) Method and apparatus for optimizing a system model with gain constraints using a non-linear programming optimizer
CN113433911B (en) Accurate control system and method for ammonia spraying of denitration device based on accurate concentration prediction
CN116755409B (en) Coal-fired power generation system coordination control method based on value distribution DDPG algorithm
CN108121215B (en) Process control loops method of evaluating performance and device based on full loop reconstruct emulation
CN106599586A (en) Neural network-based SCR intelligent ammonia-spraying optimization method and apparatus
Na et al. Design of a fuzzy model predictive power controller for pressurized water reactors
Feng et al. Controller optimization approach using LSTM-based identification model for pumped-storage units
CN112213945A (en) Improved robust prediction control method and system for electric vehicle participating in micro-grid group frequency modulation
CN114909706B (en) Two-level network balance regulation and control method based on reinforcement learning algorithm and differential pressure control
Kim et al. Design of a model predictive load-following controller by discrete optimization of control rod speed for PWRs
CN116822380A (en) Collaborative optimization method for tail gas recycling in copper smelting process based on digital twin
CN113885328A (en) Nuclear power tracking control method based on integral reinforcement learning
CN115933410B (en) Optimal tracking control method for double-time-scale coal-fired power generation system based on Q learning
CN115001002B (en) Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling
CN111612255B (en) Wind power plant power curve modeling method based on support vector regression
CN114326395A (en) Intelligent generator set control model online updating method based on working condition judgment
Chen et al. Modeling of goethite iron precipitation process based on time-delay fuzzy gray cognitive network
Lindberg A comparison between MPC and PID controllers for education and steam reformers
Ponkiya Improved Artificial Intelligence-based Optimization and Energy Dispatch Techniques for Integrated Energy System
Katikaneni Power Dispatch and Storage Configuration Optimization of an Integrated Energy System using Deep Reinforcement Learning and Hyperparameter Tuning
CN116488150A (en) New energy micro-grid optimizing operation method based on improved model predictive control
CN118157224A (en) Automatic parameter adjusting method for wind farm and related components
CN118192249A (en) Boiler turbine system load control method based on experience-oriented Q learning
Kiš et al. Nearly-optimal Explicit MPC-based Reference Governors with Long Prediction Horizons Generated with Machine Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant