CN116755409A - Coal-fired power generation system coordination control method based on value distribution DDPG algorithm - Google Patents

Coal-fired power generation system coordination control method based on value distribution DDPG algorithm

Info

Publication number
CN116755409A
Authority
CN
China
Prior art keywords
distribution
coal
power generation
network
fired power
Prior art date
Legal status
Granted
Application number
CN202310812903.9A
Other languages
Chinese (zh)
Other versions
CN116755409B (en)
Inventor
刘晓敏
余梦君
祝琛
王浩宇
杨春雨
周林娜
赵峻
彭献勇
范赫
Current Assignee
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN202310812903.9A priority Critical patent/CN116755409B/en
Publication of CN116755409A publication Critical patent/CN116755409A/en
Application granted granted Critical
Publication of CN116755409B publication Critical patent/CN116755409B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00Programme-control systems
    • G05B19/02Programme-control systems electric
    • G05B19/418Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM]
    • G05B19/41865Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM] characterised by job scheduling, process planning, material flow
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00Program-control systems
    • G05B2219/30Nc systems
    • G05B2219/32Operator till task planning
    • G05B2219/32252Scheduling production, machining, job shop

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Manufacturing & Machinery (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention discloses a coal-fired power generation system coordination control method based on a value distribution DDPG algorithm, which comprises the following steps: modeling the coordination control problem of the coal-fired power generation system as a Markov decision process; evaluating the long-term expected control cost function through a Q function and modeling the scalar Q value as a distribution using quantiles; introducing the distributional expression of the Bellman operator to construct a target distribution, measuring the minimal 1-Wasserstein distance between the distributions, and designing a quantile loss to minimize the distance between distributions; and constructing two actor-critic network structures, an estimation network and a target network, standardizing the data samples of the coal-fired power generation system, and adopting the DDPG algorithm to estimate the value distribution and update the strategy, thereby obtaining the coordinated control strategy. Based on system operation data and the value distribution DDPG algorithm, the invention addresses the problems that the environment is complex and uncertain, that the coal-fired power generation system is difficult to model accurately, and that existing methods lack adaptive capability.

Description

Coal-fired power generation system coordination control method based on value distribution DDPG algorithm
Technical Field
The invention belongs to the field of data-driven control of coal-fired power generation systems, and in particular relates to a coal-fired power generation system coordination control method based on a value distribution DDPG algorithm.
Background
In recent years, global energy consumption has risen continuously, and the integration of new renewable energy sources threatens grid stability. With thermal power generation serving as a primary energy source, the safe and stable operation of coal-fired power plants is essential for maintaining grid reliability. The boiler-turbine system is the core component of a thermal power plant, and intelligent operation strategies are required to accommodate fluctuations in grid demand. The coordinated control system plays an important role in maintaining the energy supply-demand balance of the coal-fired generating unit and optimizing operating efficiency. Recently, researchers have used input convex neural networks for dynamic modeling of boiler-turbine systems, applying nonlinear model predictive control to achieve fast and stable tracking performance under wide load changes. In addition, researchers have proposed an H-infinity-LQR-based coordinated control scheme to reasonably schedule the various devices of the system and improve the load adaptability of the generating unit under various working conditions. However, these methods depend on modeling precision and show poor adaptive capability under environmental changes or upgrading and retrofitting of the thermal power unit. A policy-iteration integral reinforcement learning method has therefore been proposed to design an optimal tracking control law for the boiler-turbine system, avoiding the construction of an accurate model and improving the adaptive capability of the algorithm; however, it does not fully consider the uncertainty of the coal-fired power generation system environment.
DDPG is a reinforcement learning method that combines Q learning and the deterministic policy gradient algorithm under a deep network framework. It performs self-learning through reward feedback and environment interaction, is suitable for complex continuous control problems, and has been widely applied to wind turbines, autonomous driving, robotics, the Internet of Things, and other fields. The environment of a coal-fired power generation system is complex, with many influencing factors such as temperature, pressure, humidity, and coal quality, all of which significantly affect power generation efficiency and combustion performance. Most existing DDPG algorithms evaluate the expected return of executing a policy in the current state-action pair through a scalar Q value function, and therefore cannot reflect the influence of environmental uncertainty on policy evaluation.
Therefore, there is an urgent need to develop a DDPG algorithm with adaptive, self-learning, and uncertainty environment handling capabilities to solve the problem of coordinated control of a coal-fired power generation system under a given objective.
Disclosure of Invention
The invention aims to: the method solves the problem of coordination control of the coal-fired power generation system with unmodeled dynamic and uncertain environmental factors, so that the control strategy of the coal-fired power generation system has self-adaption and self-learning capabilities, and the problem that the scalar Q value is difficult to represent the value estimation of the uncertain coal-fired power generation system is solved by introducing a value distribution form.
The invention comprises the following steps: the invention provides a coal-fired power generation system coordination control method based on a value distribution DDPG algorithm, which comprises the following steps:
step 1, constructing a coordination control cost function of a coal-fired power generation system;
step 2, evaluating the long-term expected control cost function through a Q function, introducing a value distribution form Z corresponding to the scalar Q value, and modeling the Z distribution using the quantile method;
step 3, introducing a distributed expression of a Bellman operator to construct target distribution, measuring a minimum distance between Z distribution and the target distribution, and designing quantile loss to minimize the distance between the distributions;
and step 4, constructing two actor-critic network structures, an estimation network and a target network, standardizing the data samples of the coal-fired power generation system, taking the quantile Huber loss as the loss function of the estimation network and the target network, and training and updating the estimation network and the target network with the DDPG algorithm to estimate the value distribution and update the strategy, thereby obtaining the coordinated control strategy.
Further, the step 1 specifically includes the following:
step 101, analyzing the operating characteristics of the coal-fired power generation system, specifically: steam output is critical to a coal-fired power generation system because it directly determines the power output of the plant. The fuel flow directly affects combustion efficiency, while the feed-water and steam flow valves directly affect steam output. The coal-fired power plant needs to stabilize the drum water level to improve boiler performance. Moreover, although the water level depends on the control inputs, controlling it directly is quite cumbersome; the water level change is instead reflected indirectly through the drum steam pressure and the density of the fluid in the drum. During operation of the coal-fired generating unit, the goal of the coordinated control system is to make the unit meet the electric load demand while stabilizing the steam pressure, electric power, and drum water level at the required levels, thereby reducing energy consumption as much as possible;
step 102, selecting steam flow, fuel flow and feedwater flow as control variables and steam pressure, drum fluid density and power output as state variables according to the operation characteristics;
step 103, formulating the coal-fired power generation system coordination control problem under the reinforcement learning framework and defining a four-tuple, as follows:
(1) the action space, representing the set of actions the reinforcement learning agent can perform. For the coordinated control problem, the control input serves as the action, i.e., a = [u_1, u_2, u_3]^T, where u_1, u_2, u_3 denote the control-valve openings for fuel flow, steam flow, and drum feed-water flow, respectively. Each valve opening ranges over (0, 1), so each control variable is a continuous action space, with 0 representing fully closed and 1 fully open;
(2) the state space, representing the set of information observed by the agent. The selected system state variables are the boiler drum steam pressure x_1 (kg/cm^2), the electric power x_2 (MW), and the fluid density in the boiler drum x_3 (kg/cm^3), together with the deviations e_1, e_2, and e_3 between the current state and the target state T_h,
where h = 1, 2, or 3; e_1 denotes the deviation between the current system state variable, boiler drum steam pressure x_1, and the target boiler drum steam pressure h_1;
e_2 denotes the deviation between the current system state variable, electric power x_2, and the target electric power h_2;
e_3 denotes the deviation between the current system state variable, fluid density in the boiler drum x_3, and the target fluid density h_3;
at the same time, the current control input a = [u_1, u_2, u_3]^T is taken as part of the observed state information, so the complete observation can be described as
s = [x_1, x_2, x_3, e_1, e_2, e_3, u_1, u_2, u_3]^T (1)
(3) r is the reward function, used to evaluate the objective and effect of the agent's action. A penalty term is imposed when the system state is far from the target state T_h, and a bonus term is added when the system state approaches the target state; the reward r is therefore set as
where ω_1, ω_2, and ω_3 are the tracking-error penalty weight, the control-consumption penalty weight, and the bonus weight for approaching the target state, respectively, and ‖e‖ denotes the tracking-error norm between the current system state and the target state T_h;
(4) γ is the discount factor, used to weigh the importance of future rewards. Specifically, the discount factor determines how strongly the agent discounts future rewards;
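As an illustration of the reward shaping in (3) and (4), the following sketch implements a hypothetical reward of this form; the weights w1-w3, the norm choice, and the relative tolerance used to decide that the state "approaches the target" are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def reward(x, u, target, w1=1.0, w2=0.1, w3=0.5, tol=0.05):
    """Hypothetical reward: penalize tracking error and control effort,
    add a bonus when every state is within `tol` (relative) of its target."""
    e = np.asarray(x, dtype=float) - np.asarray(target, dtype=float)
    r = -w1 * np.linalg.norm(e) - w2 * np.linalg.norm(np.asarray(u, dtype=float))
    if np.all(np.abs(e) <= tol * np.abs(np.asarray(target, dtype=float))):
        r += w3  # bonus term for approaching the target state
    return float(r)

r_on_target = reward([121.0, 389.0, 92.0], [0.0, 0.0, 0.0], [121.0, 389.0, 92.0])
r_far = reward([102.0, 300.0, 80.0], [0.3, 0.7, 0.4], [121.0, 389.0, 92.0])
```

A reward of this shape is largest (here, just the bonus w3) when the state sits exactly on the target with zero control effort, and decreases with both tracking error and valve usage.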
step 104, modeling the coal-fired power generation system coordination control problem as a Markov decision process. Specifically, consider an agent whose observed information state at time t is s_t; after taking action a_t, it transitions to the next state s_{t+1} and obtains the scalar reward r_t from the environment. The objective is to maximize the cumulative return
where t_l denotes the start time of the state and γ ∈ (0, 1) is the discount factor;
step 105, the coordination control cost function of the coal-fired power generation system is as follows
Further, the step 2 specifically includes the following content:
step 201, defining the function that describes, at a specific time t, the coal-fired power generation system being in state s_t, performing action a_t, and thereafter executing the coordinated control strategy μ(s), giving the long-term expected control cost function, i.e.
Step 202, introducing a value distribution form corresponding to the Q function, namely Z μ (s, a) satisfy
Step 203, define [0,1 ]]N quantile points on τ 1 ,…,τ N Representing the cumulative probability function associated with such a Z distribution:
wherein the method comprises the steps ofRepresenting an inverse cumulative probability function with respect to the Z distribution;
in step 204, the distribution modeling is performed on the value distribution form by using quantiles, specifically:
where the preceding symbol represents the τ_i quantile of the cumulative probability distribution of Z^μ(s, a).
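A minimal sketch of this quantile parameterization, assuming N equal-probability bins; the midpoint positions are the optimal support points later referenced in step 303, and recovering the scalar Q value as the mean of the quantile estimates is a standard convention in quantile-based distributional RL:

```python
import numpy as np

def quantile_midpoints(N):
    # tau_hat_i = (2i - 1) / (2N): the midpoint of each of the N
    # equal-probability bins, the 1-Wasserstein-optimal support points.
    return (2 * np.arange(1, N + 1) - 1) / (2 * N)

def q_from_quantiles(theta):
    # Scalar Q value recovered as the mean of the N quantile estimates.
    return float(np.mean(theta))

tau_hat = quantile_midpoints(4)
```

With N = 4 this yields the midpoints 0.125, 0.375, 0.625, 0.875, each carrying probability mass 1/N.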
Further, the step 3 specifically includes the following contents;
step 301, introducing the distributional expression of the Bellman operator to construct the target distribution, namely
where T^μ is the Bellman operator, and the notation indicates that both sides of the equation follow the same probability law, i.e., the random variables Y and U are identically distributed;
step 302, measuring the minimal 1-Wasserstein distance between the two distributions at the quantiles
where the preceding symbol represents the inverse cumulative distribution function of the distribution Z;
step 303, the optimal value positions of the N quantiles are
Step 304, introducing a distance between quantile loss metric distributions
Wherein the method comprises the steps ofRepresenting if sigma<0 is true, then-> Calculate for a loss with respect to sigma;
step 305, minimizing the distance between distributions based on the minimum quantile loss
where d_1(Y, U) denotes the 1-Wasserstein distance between the distributions of Y and U;
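The quantile loss of steps 304-305 can be sketched as follows; ρ_τ(σ) = σ(τ − 1{σ<0}) is the standard quantile-regression (pinball) loss, which the omitted formula is understood to denote:

```python
def quantile_loss(tau, sigma):
    # rho_tau(sigma) = sigma * (tau - 1{sigma < 0}): penalizes
    # under- and over-estimation asymmetrically according to tau.
    indicator = 1.0 if sigma < 0 else 0.0
    return sigma * (tau - indicator)
```

For tau = 0.9, overshooting the quantile estimate (σ > 0) costs nine times as much as undershooting it by the same amount, which is what drives the estimate toward the 0.9 quantile.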
further, the step 4 specifically includes the following contents;
step 401, constructing an estimation network and a target network
The estimation network and the target network comprise an actor network, a critic network, a target actor network, and a target critic network;
step 402, collecting a raw data sample of a coal-fired power generation system as an estimation network and a target network training set;
step 403, training and updating the estimation network and the target network by adopting a DDPG algorithm to realize the estimation of the value distribution and the updating of the strategy;
step 404, repeating step 403 until the highest control cost function value is stably obtained, thereby obtaining the coordinated control strategy.
Further, in step 401, the input of the actor network is the observed information s of the coal-fired power generation system, and the output is μ (s|θ μ );
The input of the critic network is the observed information s and the control input a of the coal-fired power generation system, and its output is
The input of the target actor network is the next observed information s′ obtained after the coal-fired power generation system executes control input a under observed information s, and its output is μ′(s′|θ^{μ′});
The input of the target critic network is the next observed information s′ obtained after the coal-fired power generation system executes control input a under observed information s, together with the corresponding target actor network output μ′(s′|θ^{μ′}), and its output is
Further, in step 403, the estimated network and the target network are trained and updated by using a DDPG algorithm to implement the estimation of the value distribution and the update of the policy, which specifically includes the following steps:
step 4031, computing the temporal-difference error from the value distribution outputs of the critic network and the target critic network:
wherein s' represents the next observation information obtained by the coal-fired power generation system by executing the control input a on the observation information s;
step 4032, calculating a fractional Huber loss function between value distributions
where the indicator term equals 1 if δ_ij < 0 holds and 0 otherwise, and κ is the hyperparameter of the Huber loss function;
step 4033, minimizing the quantile Huber loss function to update the critic network parameters
Wherein B is a standardized coal-fired power generation training set;
step 4034, updating the actor network along the gradient direction of Z(s, μ(s|θ^μ))
Step 4035, updating the target network parameters based on the estimated network parameters
where ξ denotes the update magnitude of θ^{μ′} and θ^{Z′}.
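The soft (Polyak) update of step 4035 can be sketched as follows, assuming the conventional form θ′ ← ξθ + (1 − ξ)θ′ with update magnitude ξ:

```python
def soft_update(target_params, source_params, xi=5e-3):
    # theta' <- xi * theta + (1 - xi) * theta': the target network
    # slowly tracks the estimation network, stabilizing training.
    return [xi * s + (1.0 - xi) * t
            for t, s in zip(target_params, source_params)]

updated = soft_update([0.0, 1.0], [1.0, 1.0], xi=0.5)
```

Because ξ is small (5e-3 in the embodiment), the target networks change slowly, which keeps the TD targets of step 4031 from chasing a moving estimate.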
The technical scheme provided by the invention has the following beneficial effects:
1) By organically combining the reinforcement learning optimization framework with the coordinated control of the complex, uncertain coal-fired power generation system, the self-learning capability of the control strategy is significantly improved, flexibly meeting the practical requirements of coordinated control.
2) By introducing the value distribution concept into DDPG, the complete distribution information of the cost function is described, so that the cost function is predicted accurately, the uncertainty of the coal-fired power generation operating environment is better handled, and learning efficiency is guaranteed.
Drawings
FIG. 1 is a flow chart of coordinated control of a coal-fired power generation system based on a value distribution DDPG algorithm;
FIG. 2 is a diagram of a value distribution DDPG algorithm training;
FIG. 3 is the trajectory diagram of state x_1 and error e_1;
FIG. 4 is the trajectory diagram of state x_2 and error e_2;
FIG. 5 is the trajectory diagram of state x_3 and error e_3;
Fig. 6 is a control input trace diagram.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
The invention relates to a coal-fired power generation system coordination control method based on value distribution DDPG, which is shown in figure 1 and comprises the following steps:
step 1, modeling a coal-fired power generation system coordination control problem as a Markov decision process under a reinforcement learning optimization framework based on operation characteristic analysis of the coal-fired power generation system, and constructing a coal-fired power generation system coordination control cost function; the method specifically comprises the following steps:
step 101, analyzing the operation characteristics of a coal-fired power generation system;
the concrete steps are as follows: steam output is critical to coal-fired power generation systems because it is directly related to the power output effects of the plant. The fuel flow directly affects combustion efficiency, while the water feed and steam flow valves directly affect steam output. The coal-fired power plant needs to achieve stabilization of drum water level to improve boiler performance. In addition, the height of the water level depends on the control input, and the direct control of the water level is quite cumbersome, and the water level change is indirectly reflected through the steam pressure of the steam drum and the density of the fluid in the steam drum. In the operation process of the coal-fired power generation unit, the coordination control system aims at enabling the unit to meet the electric load requirement and stabilizing the steam pressure, the electric power and the drum water level at required levels, so that energy consumption is reduced as much as possible.
Step 102, selecting steam flow, fuel flow and feedwater flow as control variables and steam pressure, drum fluid density and electric power output as state variables according to the operating characteristics.
Step 103, formulating a coal-fired power generation system coordination control problem under the reinforcement learning framework, and defining a four-element groupThe method comprises the following steps:
(1) the action space, representing the set of actions the reinforcement learning agent can perform. For the coordinated control problem, the control input serves as the action, i.e., a = [u_1, u_2, u_3]^T, where u_1, u_2, and u_3 denote the control-valve openings for fuel flow, steam flow, and drum feed-water flow, respectively. Each valve opening ranges over (0, 1), so each control variable is a continuous action space, with 0 representing fully closed and 1 fully open.
(2)Is a state space representing a collection of information observed by an agent. Selecting system state variable boiler drum steam pressure x 1 (kg/cm 2), electric power x 2 (MW) and fluid Density in boiler drum x 3 (kg/cm 3), current system state and target system state T h Deviation e between 1 、e 2 And e 3
Wherein h=1, 2 or 3; e, e 1 Representing the boiler drum steam pressure x of the previous system state variable 1 Boiler drum steam pressure h with target system state quantity 1 Deviation between;
e 2 representing the pre-system state variable power x 2 Power h as a target system state variable 2 Deviation between;
e 3 representing the fluid density x in the boiler drum of the previous system state variable 3 Fluid density h in boiler drum corresponding to state variable of target system 3 Deviation between;
at the same time, the current control input a = [u_1, u_2, u_3]^T is taken as part of the observed state information, so the complete observation can be described as
s = [x_1, x_2, x_3, e_1, e_2, e_3, u_1, u_2, u_3]^T (1)
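The observation vector of Eq. (1) can be assembled as a simple sketch (function and variable names are illustrative):

```python
import numpy as np

def observation(x, target, u):
    # s = [x1, x2, x3, e1, e2, e3, u1, u2, u3]^T per Eq. (1),
    # with e_h = x_h - h_h the deviation from the target state.
    x = np.asarray(x, dtype=float)
    e = x - np.asarray(target, dtype=float)
    return np.concatenate([x, e, np.asarray(u, dtype=float)])

s = observation([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
```

The resulting 9-dimensional vector stacks the three states, their target deviations, and the three valve openings.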
(3) r is the reward function, used to evaluate the objective and effect of the agent's action. A penalty term is imposed when the system state is far from the target state T_h, and a bonus term is added when the system state approaches the target state; the reward r is therefore set as
where ω_1, ω_2, and ω_3 are the tracking-error penalty weight, the control-consumption penalty weight, and the bonus weight for approaching the target state, respectively, and ‖e‖ denotes the tracking-error norm between the current system state and the target state T_h.
(4) γ is the discount factor, used to weigh the importance of future rewards. Specifically, the discount factor determines how strongly the agent discounts future rewards.
Step 104, modeling the coal-fired power generation system coordination control problem as a Markov decision process. Specifically, consider an agent whose observed information state at time t is s_t; after taking action a_t, it transitions to the next state s_{t+1} and obtains the scalar reward r_t from the environment, so as to maximize the cumulative return, which is
where t_l denotes the start time of the state and γ ∈ (0, 1) is the discount factor;
step 105, the coordination control cost function of the coal-fired power generation system is as follows
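The discounted cumulative return of step 104 can be sketched as follows (the backward accumulation is an implementation convenience, equivalent to the forward sum of γ^k r_k):

```python
def discounted_return(rewards, gamma=0.99):
    # R = sum_k gamma^k * r_k, accumulated backwards.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, three unit rewards with γ = 0.5 give 1 + 0.5 + 0.25 = 1.75.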
Step 2, evaluating the long-term expected control cost function through a Q function, introducing the value distribution form Z corresponding to the scalar Q value, and modeling the Z distribution using the quantile method; the method specifically comprises the following steps:
step 201, defining the function that describes, at a specific time t, the coal-fired power generation system being in state s_t, performing action a_t, and thereafter executing the coordinated control strategy μ(s), giving the long-term expected control cost function, i.e.
Step 202, introducing a value distribution form corresponding to the Q function, namely Z μ (s, a) satisfy
Step 203, define [0,1 ]]N quantile points on τ 1 ,…,τ N Representing the cumulative probability function associated with such a Z distribution:
wherein the method comprises the steps ofRepresenting an inverse cumulative probability function with respect to the Z distribution.
In step 204, the value distribution form is modeled with quantiles, specifically:
where the preceding symbol represents the τ_i quantile of the cumulative probability distribution of Z^μ(s, a).
Step 3, introducing a distributed expression of a Bellman operator to construct target distribution, measuring a minimum distance between the Z distribution and the target distribution, and designing quantile loss to minimize the distance between the distributions; the method specifically comprises the following steps:
step 301, introducing the distributional expression of the Bellman operator to construct the target distribution, namely
where T^μ is the Bellman operator, and the notation indicates that both sides of the equation follow the same probability law, i.e., the random variables Y and U are identically distributed;
step 302, measuring the minimal 1-Wasserstein distance between the two distributions at the quantiles
where the preceding symbol represents the inverse cumulative distribution function of the distribution Z.
Step 303, the optimal value positions of the N quantiles are
Step 304, introducing a distance between quantile loss metric distributions
Wherein the method comprises the steps ofRepresenting if sigma<0 is true, then-> Calculated for a loss with respect to sigma.
Step 305, minimizing the distance between distributions based on the minimum quantile loss
where d_1(Y, U) denotes the 1-Wasserstein distance between the distributions of the random variables Y and U.
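For equal-size one-dimensional samples, the 1-Wasserstein distance d_1(Y, U) reduces to the mean absolute difference between sorted samples, i.e. the inverse-CDF form underlying step 302; a sketch:

```python
import numpy as np

def wasserstein1(y, u):
    # d_1 between two empirical 1-D distributions with equal sample
    # counts: integrate |F_Y^{-1}(tau) - F_U^{-1}(tau)| over tau,
    # which for samples is the mean gap between order statistics.
    return float(np.mean(np.abs(np.sort(np.asarray(y, dtype=float))
                                - np.sort(np.asarray(u, dtype=float)))))
```

Note that the distance depends only on the distributions, not the sample order: {0, 1} and {1, 0} are the same distribution and have distance zero.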
And step 4, constructing two actor-critic network structures, an estimation network and a target network, standardizing the data samples of the coal-fired power generation system, and training and updating the estimation network and the target network with the DDPG algorithm to estimate the value distribution and update the strategy, thereby obtaining the coordinated control strategy. The method specifically comprises the following steps:
In step 401, the estimation network and the target network are constructed as two actor-critic network structures. Together they comprise four neural networks: an actor network, a critic network, a target actor network, and a target critic network. The actor network parameters are denoted by θ^μ and the critic network parameters by θ^Z; the target actor network parameters are denoted by θ^{μ′} and the target critic network parameters by θ^{Z′};
wherein, the input of the actor network is observed information s of the coal-fired power generation system, and the output is mu (s|theta) μ );
The input of the critic network is the observed information s and the control input a of the coal-fired power generation system, and its output is
The input of the target actor network is the next observed information s′ obtained after the coal-fired power generation system executes control input a under observed information s, and its output is μ′(s′|θ^{μ′});
The input of the target critic network is the next observed information s′ obtained after the coal-fired power generation system executes control input a under observed information s, together with the corresponding target actor network output μ′(s′|θ^{μ′}), and its output is
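The four-network layout of step 401 can be sketched with plain NumPy multi-layer perceptrons; the layer sizes, initialization, and quantile count here are illustrative assumptions (a deep learning framework with automatic differentiation would be used in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM, N_QUANT = 9, 3, 32  # illustrative dimensions

def mlp_init(sizes):
    # One (W, b) pair per layer, weights scaled by fan-in.
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x, out_tanh=False):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU hidden activations
    return np.tanh(x) if out_tanh else x

actor = mlp_init([OBS_DIM, 64, 64, ACT_DIM])             # mu(s | theta_mu)
critic = mlp_init([OBS_DIM + ACT_DIM, 64, 64, N_QUANT])  # quantiles of Z(s, a)
# Target networks start as exact copies of the estimation networks.
target_actor = [(W.copy(), b.copy()) for W, b in actor]
target_critic = [(W.copy(), b.copy()) for W, b in critic]

s = rng.standard_normal(OBS_DIM)
a = mlp_forward(actor, s, out_tanh=True)          # bounded action
z = mlp_forward(critic, np.concatenate([s, a]))   # N_QUANT quantile values
```

The distributional critic differs from a standard DDPG critic only in its head: it emits N quantile values of Z(s, a) instead of a single scalar Q.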
Step 402, converting the data samples originally collected by the coal-fired power generation system into a standardized training set with a mean of 0 and a variance of 1 using z-score normalization.
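The z-score normalization of step 402 can be sketched as:

```python
import numpy as np

def zscore(data):
    # Standardize each column to mean 0 and (population) variance 1;
    # constant columns are left centered rather than divided by zero.
    data = np.asarray(data, dtype=float)
    sd = data.std(axis=0)
    return (data - data.mean(axis=0)) / np.where(sd == 0.0, 1.0, sd)

normed = zscore([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
```

Standardizing the raw plant samples puts pressures, powers, and densities on comparable scales, which keeps no single state variable from dominating the network gradients.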
Step 403, training and updating the estimation network and the target network by adopting a DDPG algorithm to realize the estimation of the value distribution and the updating of the strategy, specifically:
(1) Computing the temporal-difference error from the value distribution outputs of the critic network and the target critic network
Wherein s' represents the next observation information obtained by the coal-fired power generation system by executing the control input a on the observation information s;
(2) Computing the quantile Huber loss function between value distributions
where the indicator term equals 1 if δ_ij < 0 holds and 0 otherwise, and κ is the hyperparameter of the Huber loss function.
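Combining the Huber function with the quantile weights gives the quantile Huber loss minimized in step (3); the following sketch follows the standard quantile-regression formulation, and the averaging convention over the indices i and j is an assumption:

```python
import numpy as np

def huber(delta, kappa=0.3):
    # Quadratic near zero, linear in the tails.
    a = np.abs(delta)
    return np.where(a <= kappa, 0.5 * delta ** 2, kappa * (a - 0.5 * kappa))

def quantile_huber_loss(theta, targets, kappa=0.3):
    # theta: N quantile estimates of Z(s, a); targets: TD-target samples.
    N = len(theta)
    tau_hat = (2 * np.arange(1, N + 1) - 1) / (2 * N)
    delta = np.asarray(targets)[None, :] - np.asarray(theta)[:, None]  # delta_ij
    weight = np.abs(tau_hat[:, None] - (delta < 0))  # |tau_i - 1{delta_ij < 0}|
    return float(np.mean(weight * huber(delta, kappa) / kappa))
```

The Huber smoothing makes the loss differentiable at zero while keeping the asymmetric quantile weighting, so large TD errors are penalized linearly rather than quadratically.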
(3) Minimizing the quantile Huber loss function to update the critic network parameters
Wherein B is a standardized coal-fired power generation training set.
(4) Updating the actor network along the gradient direction of Z(s, μ(s|θ^μ))
(5) Updating the target network parameters from the estimation network parameters by soft update
where ξ denotes the update magnitude of θ^{μ′} and θ^{Z′}, usually ξ ≪ 1.
Step 404, repeating step 403 until the highest control cost function value is stably obtained; training then ends and the coordinated control strategy is obtained.
The method is suitable for the coal-fired power generation system under various operation conditions. In order to better understand the present invention, a detailed description of a coal-fired power generation system coordination control method of the value distribution DDPG algorithm will be given below with reference to specific embodiments.
When designing the coal-fired power generation system coordination controller based on the value distribution DDPG algorithm, four multi-layer perceptron neural networks are used as the actor network, critic network, target actor network and target critic network, respectively. The hidden-layer activation function of each network is ReLU, the output-layer activation function of the actor network and the target actor network is Tanh, and the optimizer is Adam. The mini-batch size |B| for network training is 128, the soft update rate ξ = 5e-3, and the network learning rate is 1e-4. The Huber hyperparameter κ = 0.3 is chosen.
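A NumPy-only sketch of the actor network's forward pass (ReLU hidden layers, Tanh output layer, as described above); the hidden width of 64 and the random weight initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_actor_forward(s, weights, biases):
    """Forward pass of an MLP actor: ReLU hidden layers, Tanh output layer."""
    h = s
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)                 # ReLU activation
    return np.tanh(h @ weights[-1] + biases[-1])       # Tanh bounds actions to (-1, 1)

obs_dim, hidden, act_dim = 9, 64, 3   # 9-dim observation s, 3 control valve openings
weights = [rng.normal(0, 0.1, (obs_dim, hidden)),
           rng.normal(0, 0.1, (hidden, hidden)),
           rng.normal(0, 0.1, (hidden, act_dim))]
biases = [np.zeros(hidden), np.zeros(hidden), np.zeros(act_dim)]

a = mlp_actor_forward(rng.normal(size=obs_dim), weights, biases)
# the three action components are bounded by the Tanh output layer
```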
The initial state and the target state are set to x(0) = [102, 438.93]^T and T_h = [h_1, h_2, h_3]^T = [121, 389, 92]^T, and the initial control input is a_0 = [0.3102, 0.6711, 0.3967]. The control cost function is
Where discount coefficient γ=0.99.
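With γ = 0.99, the discounted cumulative return that the controller maximizes can be computed as follows; the reward sequence is an illustrative assumption:

```python
import numpy as np

gamma = 0.99  # discount coefficient from the embodiment
rewards = np.array([1.0, 0.5, 0.25, 0.125])  # illustrative reward sequence r_0..r_3

# cumulative discounted return R = sum_t gamma^t * r_t
discounts = gamma ** np.arange(len(rewards))
ret = float(np.dot(discounts, rewards))
# each later reward contributes only gamma^t of its raw value
```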
The method is applied to a 160 MW coal-fired power generation system. Operation data are collected through interaction with the coal-fired power generation environment and network training is performed, followed by multiple operation tests in that environment. The resulting average reward curve is shown in fig. 2, where the horizontal axis represents the number of training steps, the vertical axis represents the corresponding reward, the solid line is the average reward curve over multiple runs, and the shading represents the reward variance. As can be seen from fig. 2, the reward curve has converged after 2e5 training steps. After training, the value distribution DDPG algorithm is applied to the coal-fired power generation system to test its tracking of the target state. The trajectories of the system variables x_1, x_2, x_3 and the corresponding tracking targets h_1, h_2, h_3, together with the state and error traces, are shown in fig. 3, fig. 4 and fig. 5, respectively, and the system input variable trajectories are shown in fig. 6; the system is finally able to track the given target values.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed; any modifications, equivalent replacements and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (7)

1. The coal-fired power generation system coordination control method based on the value distribution DDPG algorithm is characterized by comprising the following steps of:
step 1, constructing a coordination control cost function of a coal-fired power generation system;
step 2, evaluating a long-term expected control cost function through a Q function, introducing a value distribution form Z corresponding to the scalar Q value, and modeling the Z distribution using a quantile method;
step 3, introducing a distributional expression of the Bellman operator to construct a target distribution, measuring the minimal distance between the Z distribution and the target distribution, and designing a quantile loss to minimize that distance;
step 4, constructing an estimation network and a target network, and training and updating the estimation network and the target network by adopting a DDPG algorithm to realize the estimation of the value distribution and the updating of the strategy, so as to obtain a coordinated control strategy.
2. The method for coordinated control of a coal-fired power generation system based on a value distribution DDPG algorithm according to claim 1, wherein the constructing the coordinated control cost function of the coal-fired power generation system in step 1 comprises the following steps:
step 101, selecting steam flow, fuel flow and water supply flow as control variables of a coal-fired power generation system;
steam pressure, drum fluid density and power output are selected as state variables of the coal-fired power generation system;
step 102, formulating the coal-fired power generation system coordination control problem under a reinforcement learning framework, and defining a quadruple (𝒜, 𝒮, r, γ) as follows:
(1) 𝒜 is the action space, the set of actions a performed by the reinforcement learning agent, i.e., a = [u_1, u_2, u_3]^T, where u_1, u_2, u_3 respectively represent the control valve openings of the fuel flow, steam flow and drum feedwater flow;
(2) 𝒮 is the state space, the set of information observed by the agent; the observations comprise the system state variables boiler drum steam pressure x_1, electric power output x_2 and fluid density in the boiler drum x_3, together with the deviations e_1, e_2 and e_3 between the current system state and the target system state T_h;
Thus, the complete observation information is described as
s = [x_1, x_2, x_3, e_1, e_2, e_3, u_1, u_2, u_3]^T (1)
(3) r is a reward function for evaluating the objective and effect of the agent when executing the action;
where ω_1, ω_2 and ω_3 are respectively the tracking error penalty weight, the control consumption penalty weight and the reward weight for approaching the target state value, and ‖e‖ denotes the tracking error norm between the current system state and the target state T_s;
(4) γ is a discount factor for measuring the importance of future rewards;
step 104, modeling the coal-fired power generation system coordination control problem as a Markov decision process: the observed state of the agent at time t is denoted s_t ∈ 𝒮; after taking action a_t ∈ 𝒜, the system transitions to the next state s_{t+1} and obtains a scalar reward r_t from the environment; to this end, the cumulative return obtained is maximized:
R_{t_l} = Σ_{t=t_l}^∞ γ^{t−t_l} r_t
where t_l denotes the state start time and γ ∈ (0, 1) is the discount factor;
step 105, the coordination control cost function of the coal-fired power generation system is as follows
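The reward structure of item (3) in claim 2 — a tracking error penalty, a control consumption penalty, and a bonus near the target state — can be sketched as follows. The exact functional form, the weights ω_1, ω_2, ω_3 and the tolerance are illustrative assumptions, since the claim does not fix them here:

```python
import numpy as np

def reward(e, u, omega1=1.0, omega2=0.1, omega3=0.5, tol=0.05):
    """Hypothetical reward: penalize the tracking error norm ||e|| and the
    control effort ||u||, and add a bonus when the state is near the target."""
    err = float(np.linalg.norm(e))
    bonus = omega3 if err < tol else 0.0
    return -omega1 * err - omega2 * float(np.linalg.norm(u)) + bonus

u0 = np.array([0.3102, 0.6711, 0.3967])   # initial control input from the embodiment
r_far = reward(np.array([1.0, 0.5, 0.2]), u0)
r_near = reward(np.array([0.01, 0.0, 0.0]), u0)
# being near the target yields a strictly higher reward than being far away
```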
3. The coordinated control method of a coal-fired power generation system based on a value distribution DDPG algorithm according to claim 1, wherein in step 2, the long-term expected control cost function is evaluated through the Q function, a value distribution form Z corresponding to the scalar Q value is introduced, and the Z distribution is modeled using the quantile method, specifically comprising the following steps:
step 201, defineFor describing that at a specific time t, the coal-fired power generation system is in a current state s t When act a is performed t And then performs a long-term desired control cost function of the coordinated control strategy μ(s), i.e
step 202, introducing the value distribution form corresponding to the scalar Q value, i.e., a random variable Z^μ(s, a) satisfying
Q^μ(s, a) = 𝔼[Z^μ(s, a)];
step 203, defining N quantile points τ_1, …, τ_N on [0, 1] representing the cumulative probability levels associated with the Z distribution, where F_Z^{-1}(τ) denotes the inverse cumulative probability function of the Z distribution;
step 204, modeling the Z distribution using quantiles, specifically
Z^μ(s, a) ≈ {z_i}_{i=1}^N with z_i = F_{Z^μ(s,a)}^{-1}(τ_i)
where z_i denotes the τ_i-quantile of the cumulative probability distribution of Z^μ(s, a).
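The quantile representation of claim 3 can be illustrated with a distribution whose inverse cumulative probability function is known in closed form; Uniform(0, 1) is an illustrative choice:

```python
N = 10
# quantile midpoints tau_hat_i = (2i - 1) / (2N), i = 1..N
tau_hat = [(2 * i - 1) / (2 * N) for i in range(1, N + 1)]

# For Z ~ Uniform(0, 1) the inverse cumulative probability function is
# F^{-1}(tau) = tau, so the quantile atoms of Z are tau_hat itself
z_quantiles = tau_hat

# the scalar Q value is recovered as the mean of the quantile atoms
q_estimate = sum(z_quantiles) / N
# q_estimate equals 0.5, the true mean of Uniform(0, 1)
```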
4. The coordinated control method of a coal-fired power generation system based on a value distribution DDPG algorithm according to claim 1, wherein in step 3, a distributional expression of the Bellman operator is introduced to construct a target distribution, the minimal distance between the Z distribution and the target distribution is measured, and a quantile loss is designed to minimize that distance, specifically comprising the following steps:
step 301, introducing the distributional expression of the Bellman operator to construct the target distribution
𝒯^μ Z(s, a) :≐ r(s, a) + γZ(s′, μ(s′))
where 𝒯^μ is the Bellman operator, and Y ≐ U indicates that both sides of the equation follow the same probability law, i.e., the random variables Y and U are identically distributed;
step 302, measuring the 1-Wasserstein distance between the Z distribution and the target distribution over the quantile levels
d_1(Y, U) = ∫₀¹ |F_Y^{-1}(ω) − F_U^{-1}(ω)| dω
where F_Z^{-1} denotes the inverse cumulative distribution function of the distribution Z;
step 303, the optimal value positions of the N quantiles are the interval midpoints
τ̂_i = (τ_{i−1} + τ_i)/2 = (2i − 1)/(2N), i = 1, …, N;
Step 304, introducing a distance between quantile loss metric distributions
Wherein the method comprises the steps ofRepresenting if sigma<0 is true, then-> Calculate for a loss with respect to sigma;
step 305, minimizing the distance between distributions based on the quantile loss
min Σ_{i=1}^N 𝔼_U[ρ_{τ̂_i}(U − z_i)]
where d_1(Y, U) denotes the 1-Wasserstein distance between the distributions Y and U.
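The quantile loss ρ_τ(σ) = (τ − 𝕀{σ < 0})σ of claim 4 can be checked numerically: minimizing its sample average over a scalar θ recovers the τ-quantile of the samples. The sample set and the search grid are illustrative assumptions:

```python
import numpy as np

def quantile_loss(theta, samples, tau):
    """Sample average of rho_tau(sigma) = (tau - 1{sigma < 0}) * sigma,
    with sigma = sample - theta."""
    sigma = samples - theta
    return float(np.mean((tau - (sigma < 0)) * sigma))

samples = np.arange(1.0, 101.0)          # empirical distribution over 1..100
tau = 0.25                               # target quantile level
candidates = np.linspace(0.0, 101.0, 1011)
losses = [quantile_loss(c, samples, tau) for c in candidates]
best = float(candidates[int(np.argmin(losses))])
# best lies near the 25th percentile of the samples (~25)
```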
5. The coal-fired power generation system coordination control method based on the value distribution DDPG algorithm according to claim 1, wherein in step 4, an estimation network and a target network are constructed, data samples of the coal-fired power generation system are standardized, and the estimation network and the target network are trained and updated by adopting the DDPG algorithm to realize the estimation of the value distribution and the updating of the strategy, so as to obtain the coordinated control strategy:
Step 401, constructing an estimation network and a target network
The estimation network and the target network comprise an actor network, a critic network, a target actor network and a target critic network;
step 402, collecting a raw data sample of a coal-fired power generation system as an estimation network and a target network training set;
step 403, training and updating the estimation network and the target network by adopting a DDPG algorithm to realize the estimation of the value distribution and the updating of the strategy;
step 404, continuously cycling step 403 until the control cost function stably attains its maximum, thereby obtaining the coordinated control strategy.
6. The coordinated control method of a coal-fired power generation system based on a value distribution DDPG algorithm according to claim 5, wherein in step 401, the input of the actor network is the observed information s of the coal-fired power generation system, and the output is μ(s|θ^μ);
The input of the critic network is the observed information s and the control input a of the coal-fired power generation system, and the output is the value distribution Z(s, a|θ^Z);
The input of the target actor network is the next observed information s′ obtained by the coal-fired power generation system executing the control input a under the observed information s, and the output is μ′(s′|θ^μ′);
The input of the target critic network is the next observed information s′ obtained by the coal-fired power generation system executing the control input a under the observed information s, together with the corresponding target actor network output μ′(s′|θ^μ′), and the output is the target value distribution Z′(s′, μ′(s′|θ^μ′)|θ^Z′).
7. The coordinated control method of a coal-fired power generation system based on a value distribution DDPG algorithm according to claim 5, wherein in step 403, the estimation network and the target network are trained and updated by the DDPG algorithm to realize the estimation of the value distribution and the update of the strategy, and specifically comprises the following steps:
step 4031, calculating the temporal-difference errors from the value-distribution outputs of the critic network and the target critic network:
δ_ij = r + γZ_j′(s′, μ′(s′|θ^μ′)) − Z_i(s, a), i, j = 1, …, N
where s′ represents the next observation information obtained by the coal-fired power generation system executing the control input a under the observation information s;
step 4032, calculating the quantile Huber loss function between value distributions
ρ^κ_{τ̂_i}(δ_ij) = |τ̂_i − 𝕀{δ_ij < 0}| · L_κ(δ_ij)/κ,
where L_κ(δ_ij) = ½δ_ij² if |δ_ij| ≤ κ and L_κ(δ_ij) = κ(|δ_ij| − ½κ) otherwise; 𝕀{δ_ij < 0} = 1 if δ_ij < 0 holds and 0 otherwise; κ is the hyperparameter of the Huber loss function;
step 4033, minimizing the quantile Huber loss function
L(θ^Z) = (1/|B|) Σ_{(s,a,r,s′)∈B} Σ_{i=1}^N (1/N) Σ_{j=1}^N ρ^κ_{τ̂_i}(δ_ij)
to update the critic network parameters, where B is the standardized coal-fired power generation training set;
step 4034, updating the actor network along the gradient direction of Z(s, μ(s|θ^μ)):
∇_{θ^μ}J ≈ (1/|B|) Σ_{s∈B} ∇_a (1/N) Σ_{i=1}^N Z_i(s, a)|_{a=μ(s|θ^μ)} ∇_{θ^μ}μ(s|θ^μ);
Step 4035, updating the target network parameters based on the estimated network parameters
Wherein xi represents θ μ′ And theta Z′ Is used for updating the amplitude of the update.
CN202310812903.9A 2023-07-04 2023-07-04 Coal-fired power generation system coordination control method based on value distribution DDPG algorithm Active CN116755409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310812903.9A CN116755409B (en) 2023-07-04 2023-07-04 Coal-fired power generation system coordination control method based on value distribution DDPG algorithm

Publications (2)

Publication Number Publication Date
CN116755409A true CN116755409A (en) 2023-09-15
CN116755409B CN116755409B (en) 2024-01-12

Family

ID=87953192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310812903.9A Active CN116755409B (en) 2023-07-04 2023-07-04 Coal-fired power generation system coordination control method based on value distribution DDPG algorithm

Country Status (1)

Country Link
CN (1) CN116755409B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474295A * 2023-12-26 2024-01-30 Changchun University of Technology Multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016157188A * 2015-02-23 2016-09-01 Denso IT Laboratory, Inc. Attitude estimation model generation device and attitude estimation device
CN110365056A * 2019-08-14 2019-10-22 Electric Power Research Institute of China Southern Power Grid Distributed-energy-participating distribution network voltage regulation optimization method based on DDPG
US20200327411A1 * 2019-04-14 2020-10-15 Di Shi Systems and Method on Deriving Real-time Coordinated Voltage Control Strategies Using Deep Reinforcement Learning
CN115123159A * 2022-06-27 2022-09-30 Chongqing University of Posts and Telecommunications AEB control method and system based on DDPG deep reinforcement learning
US11571809B1 * 2019-09-15 2023-02-07 X Development Llc Robotic control using value distributions
US20230071450A1 * 2021-09-09 2023-03-09 Siemens Aktiengesellschaft System and method for controlling large scale power distribution systems using reinforcement learning
CN115933410A * 2023-01-09 2023-04-07 China University of Mining and Technology Q-learning-based optimal tracking control method for dual-time-scale coal-fired power generation system
CN116131254A * 2022-12-26 2023-05-16 Shanghai Jiao Tong University Economic dispatching method of cogeneration unit based on DDPG algorithm
CN116207750A * 2023-02-28 2023-06-02 Kunming Power Supply Bureau of Yunnan Power Grid Co., Ltd. Power distribution network reactive power optimization method based on deep deterministic policy gradient algorithm
CN116316911A * 2023-03-28 2023-06-23 Electric Power Research Institute of State Grid Jiangsu Electric Power Co., Ltd. Active-reactive coordination optimization method and device for power system
CN116341376A * 2023-03-17 2023-06-27 Xi'an Jiaotong University Space thermionic nuclear power control method based on DDPG deep reinforcement learning algorithm


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
LAI, TUNG-YI et al.: "Combining Deep Deterministic Policy Gradient with Cross-Entropy Method", 2019 International Conference on Technologies and Applications of Artificial Intelligence (TAAI)
LI Ye; LIU Zhongxin; LAN Ge; SADER Malika; CHEN Zengqiang: "A DDPG-based solution for optimal consensus of continuous-time linear multi-agent systems", Science China Technological Sciences, vol. 66, no. 8
LIU Jianwei; GAO Feng; LUO Xionglin: "A survey of deep reinforcement learning based on value function and policy gradient", Chinese Journal of Computers, no. 06
PENG Liuyang; SUN Yuanzhang; XU Jian; LIAO Siyang; YANG Li: "Adaptive economic dispatch under uncertainty based on deep reinforcement learning", Automation of Electric Power Systems, vol. 44, no. 9
PENG Liuyang; SUN Yuanzhang; XU Jian; LIAO Siyang; YANG Li: "Adaptive economic dispatch under uncertainty based on deep reinforcement learning", Automation of Electric Power Systems, no. 09
SU Shihui; LEI Yong; LI Yongkai; ZHU Yingwei: "Medium- and short-term photovoltaic power prediction based on improved DDPG algorithm", Semiconductor Optoelectronics, no. 05

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474295A * 2023-12-26 2024-01-30 Changchun University of Technology Multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm
CN117474295B * 2023-12-26 2024-04-26 Changchun University of Technology Multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm

Also Published As

Publication number Publication date
CN116755409B (en) 2024-01-12

Similar Documents

Publication Publication Date Title
CN109062053B (en) Denitration ammonia injection control method based on multivariate correction
US20060184477A1 (en) Method and apparatus for optimizing a system model with gain constraints using a non-linear programming optimizer
CN113433911B (en) Accurate control system and method for ammonia spraying of denitration device based on accurate concentration prediction
CN116755409B (en) Coal-fired power generation system coordination control method based on value distribution DDPG algorithm
CN108121215B (en) Process control loops method of evaluating performance and device based on full loop reconstruct emulation
CN106599586A (en) Neural network-based SCR intelligent ammonia-spraying optimization method and apparatus
Na et al. Design of a fuzzy model predictive power controller for pressurized water reactors
Feng et al. Controller optimization approach using LSTM-based identification model for pumped-storage units
CN112213945A (en) Improved robust prediction control method and system for electric vehicle participating in micro-grid group frequency modulation
CN114909706B (en) Two-level network balance regulation and control method based on reinforcement learning algorithm and differential pressure control
Kim et al. Design of a model predictive load-following controller by discrete optimization of control rod speed for PWRs
CN116822380A (en) Collaborative optimization method for tail gas recycling in copper smelting process based on digital twin
CN113885328A (en) Nuclear power tracking control method based on integral reinforcement learning
CN115933410B (en) Optimal tracking control method for double-time-scale coal-fired power generation system based on Q learning
CN115001002B (en) Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling
CN111612255B (en) Wind power plant power curve modeling method based on support vector regression
CN114326395A (en) Intelligent generator set control model online updating method based on working condition judgment
Chen et al. Modeling of goethite iron precipitation process based on time-delay fuzzy gray cognitive network
Lindberg A comparison between MPC and PID controllers for education and steam reformers
Ponkiya Improved Artificial Intelligence-based Optimization and Energy Dispatch Techniques for Integrated Energy System
Katikaneni Power Dispatch and Storage Configuration Optimization of an Integrated Energy System using Deep Reinforcement Learning and Hyperparameter Tuning
CN116488150A (en) New energy micro-grid optimizing operation method based on improved model predictive control
CN118157224A (en) Automatic parameter adjusting method for wind farm and related components
CN118192249A (en) Boiler turbine system load control method based on experience-oriented Q learning
Kiš et al. Nearly-optimal Explicit MPC-based Reference Governors with Long Prediction Horizons Generated with Machine Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant