CN115310760A - Gas system dynamic scheduling method based on improved proximal policy optimization - Google Patents
- Publication number
- CN115310760A (application number CN202210781220.7A)
- Authority
- CN
- China
- Prior art keywords
- gas
- pipe network
- scheduling
- model
- optimization
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06Q10/06312 - Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
- G06N20/00 - Machine learning
- G06N3/04 - Neural networks; architecture, e.g. interconnection topology
- G06N3/08 - Neural networks; learning methods
- G06Q50/06 - Energy or water supply
Abstract
The invention discloses a gas system dynamic scheduling method based on improved proximal policy optimization (PPO), comprising the following steps: (1) determine the production-plan interval and the gas-producing and gas-consuming devices according to the scheduling optimization process of the gas system, and establish a gas pipe-network model; (2) determine the initial state of the gas pipe-network model from the initial values of gas output and pipe-network pressure, and update it iteratively from there; (3) construct, from the optimization objective function, a reward function that evaluates how good an action is in the current state; (4) the reinforcement learning agent model obtains an action a from the state and updates the model until one scheduling run is completed, recording the states, actions and rewards of the run and updating the model's network parameters; (5) after iterative training is finished, apply the model to a test set for testing; (6) use the trained model for scheduling optimization of the gas system. With this method, the load capacity of the gas pipe network can be predicted more accurately and the pressure balance of the network is effectively improved.
Description
Technical Field
The invention relates to the fields of gas-system balancing and artificial-intelligence applications, and in particular to a gas system dynamic scheduling method based on improved proximal policy optimization.
Background
The gas system is an important component of an oil refinery's energy system and one of the refinery's most important fuel sources. The gas pipe network is the main equipment carrying gas transportation, but its pressure is bounded above and below: if gas production far exceeds consumption, the pressure breaks through the network's upper limit and creates a safety hazard; if production falls far short of consumption, the pressure drops below the lower limit and mechanical failure is likely.
At present, few scheduling-optimization algorithms are actually deployed on gas systems; most scheduling still relies on manual experience or traditional methods. Whether scheduling is done manually or with traditional, mainly heuristic, methods, the quality of the resulting solutions is mediocre, and the effectiveness of rules set by manual experience varies widely from one decision maker to another. The strategies a heuristic algorithm can find in limited time are only slightly better than manual operation and often carry a degree of randomness, so stable results cannot be guaranteed.
Chinese patent publication No. CN101794119A discloses a balancing and optimal-scheduling method for a gas system based on prediction data. The method acquires the data needed to trigger the gas system from a scheduling system; predicts, from those data, the gas generation of each production device and the energy demand of the heating-furnace boilers over a future preset period; judges from the predictions whether production and demand will be balanced over that period and, when they are not, optimizes the scheduling strategy and scheme accordingly; and displays the optimized strategy and scheme through a client so that dispatchers can apply it. However, the method depends on fairly accurate historical data to predict production and demand: if historical data are insufficient, or current conditions differ greatly from the historical situation, prediction accuracy is hard to guarantee. The prediction model also requires production-plan scheduling data to be given in advance, which makes accurate prediction difficult when the production plan changes dynamically. In addition, the mixed-integer linear programming algorithm the method uses needs repeated iterations for each schedule, so its running time is long and real-time scheduling is hard to achieve.
Compared with traditional methods, using deep reinforcement learning for scheduling optimization is a completely data-driven approach with the following advantages:
(1) Generalization: traditional methods mostly start from scratch on each new problem and iterate toward a better solution, whereas deep reinforcement learning gives the algorithm the ability to learn; having analyzed and solved some problems, it can solve a newly given problem effectively.
(2) Flexibility: deep reinforcement learning can reduce the time complexity to linear and, combined with mature parallel acceleration, can be applied to large-scale problems.
(3) Universality: a trained model can be applied to problems of different scales and with different parameters, without designing a new training setup for each problem.
However, neither academia nor industry has yet studied or applied deep reinforcement learning to the scheduling optimization of refinery gas systems.
Disclosure of Invention
The invention provides a gas system dynamic scheduling method based on improved proximal policy optimization for dynamically scheduling a gas system; it can better predict the load capacity of the gas pipe network and effectively improve the pressure balance of the network.
A gas system dynamic scheduling method based on improved proximal policy optimization comprises the following steps:
(1) Determine the production-plan interval and the gas-producing and gas-consuming devices according to the scheduling optimization process of the gas system, and establish a gas pipe-network model;
(2) Determine the initial state of the gas pipe-network model from the initial values of gas output and pipe-network pressure, and update it iteratively from there;
(3) Construct, from the optimization objective function, a reward function that evaluates how good an action is in the current state. The reward is expressed by the income generated by the gas-consuming devices and the pressure-balance level of the gas pipe network, with the following formula:
In the formula, x_ik denotes the state of the i-th device in the k-th time period; p_ik the highest gain the i-th device can obtain in the k-th time period; c_ik the maximum consumption the i-th device can reach in the k-th time period; W_k the gas pipe-network pressure in the k-th time period; W_normal the pipe-network pressure in the fully balanced state; α_k the penalty factor for pipe-network pressure imbalance in the k-th time period; and n the number of devices;
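The formula itself is rendered as an image in the source and is not reproduced here. A minimal sketch of a per-period reward of the shape the definitions suggest, i.e. consumption income minus a pressure-imbalance penalty; this is an assumed form for illustration, not the patent's exact expression:

```python
def reward(x, p, alpha_k, w_k, w_normal):
    """Reward for one time period k (assumed form): income from the
    gas-consuming devices minus a penalty proportional to the deviation
    of the pipe-network pressure W_k from the balanced pressure W_normal.

    x        -- device states x_ik, each in [0, 1]
    p        -- per-device maximum gains p_ik
    alpha_k  -- penalty factor for pressure imbalance in period k
    """
    income = sum(x_i * p_i for x_i, p_i in zip(x, p))
    penalty = alpha_k * abs(w_k - w_normal)
    return income - penalty
```

Note that the cleaner the pressure tracking (W_k close to W_normal), the smaller the penalty term, matching the stated goal of maximizing income while keeping the network balanced.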
(4) Build a reinforcement learning agent model; obtain an action a from the model's state and update the agent until one scheduling run is completed; record the states, actions and rewards of the run, update the agent's network parameters, and raise the reward through iterative training;
(5) After iterative training is finished, apply the reinforcement learning agent model to a test set and visualize how the pipe-network pressure changes, to confirm the safety and reliability of the model;
(6) Save the reinforcement learning agent model and use the trained model directly for scheduling optimization of the gas system.
Further, in step (1), the gas-consuming devices in the gas system are divided into two classes. The first class consumes gas as a switching quantity: a device either consumes all of the gas supplied to it, or none at all. The second class has a valve that regulates gas consumption, which varies continuously between 0 and c_ik;
Assuming the first class contains m devices, their action range is x_ik ∈ {0, 1}, i = 1, 2, ..., m; k = 1, 2, ..., N. The second class contains n - m devices, with action range x_jk ∈ [0, 1], j = m + 1, m + 2, ..., n; k = 1, 2, ..., N.
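The mixed discrete/continuous action range above can be sketched as follows; the function name and the uniform sampling are illustrative, not taken from the patent:

```python
import random

def sample_action(m, n):
    """Sample one joint action for the mixed device set:
    devices 1..m are on/off (x in {0, 1}); devices m+1..n have a
    regulating valve (x anywhere in [0, 1])."""
    switch_part = [random.choice([0, 1]) for _ in range(m)]
    valve_part = [random.random() for _ in range(n - m)]
    return switch_part + valve_part

a = sample_action(2, 5)  # two on/off devices, three valve devices
```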
In step (2), the state of the gas pipe-network model consists of the state of each device and the actual pipe-network pressure at the current time. Supplying the reinforcement learning agent model with the current pipe-network pressure gives it the ability to predict and control the pressure so as to maintain balance while increasing income.
In step (4), one scheduling run of the reinforcement learning network specifically comprises the following steps:
(4-1) First initialize the policy network parameters θ_0; θ_k denotes the parameters obtained from the previous round of training. In each iteration, θ_k is updated by interacting with the environment to obtain a set of state-action pairs; β is adjusted dynamically according to the KL divergence, and the advantage function is estimated with the proximal policy optimization formula;
(4-2) The critic network learns to estimate the value of the current policy and, parameterized according to the current policy, computes the future discounted reward;
(4-3) The actor network learns a stochastic policy π parameterized by θ_π, so as to take, with maximum probability, the action that maximizes the sum of future returns. The policy is therefore parameterized by θ_π and generates a probability distribution over the set of available actions at time t, formulated as:
where R denotes the reward obtained by taking action a in state s at time t, and E denotes the mathematical expectation;
(4-4) Update the parameters by computing the temporal-difference error (TD-error): δ_t = R_t + γV(s_{t+1}) - V(s_t);
(4-5) Use the Tanh function for activation, expressed as: tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x});
(4-6) Optimize the accumulated loss with the Adam algorithm, iteratively updating the neural-network weights on the training data; Adam maintains an independent adaptive learning rate for each parameter.
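Step (4-1)'s dynamic adjustment of β by KL divergence corresponds to the adaptive-KL penalty variant of PPO. A sketch of the usual adaptation rule; the 1.5x tolerance band and the doubling/halving factor are common defaults assumed here, not values taken from the patent:

```python
def adapt_beta(beta, kl, kl_target, factor=2.0):
    """Adaptive-KL PPO heuristic: grow the penalty coefficient beta
    when the new policy drifts too far from the old one (KL above
    target), shrink it when updates are too conservative."""
    if kl > 1.5 * kl_target:
        beta *= factor       # policy moved too far: penalize harder
    elif kl < kl_target / 1.5:
        beta /= factor       # policy barely moved: relax the penalty
    return beta
```

After each batch of updates, the measured KL divergence between the new and old policies is fed through this rule before the next iteration.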
In step (4-1), the proximal policy optimization formula is expressed as:
J_PPO^{θ_k}(θ) = J^{θ_k}(θ) - β · KL(θ, θ_k)
where J_PPO^{θ_k}(θ) is the optimized objective function, β is the penalty factor, and KL(θ, θ_k) measures the similarity between θ and θ_k.
wherein s is t Is the state at time t, R t Is from s t Conversion to s t+1 T represents the total number of scheduled time instants, Y is a discount factor, where 0 < Y ≦ 1, e represents the mathematical expectation of a future discount reward.
In step (4-3), during training an action a_t is sampled from the set of available actions A_t according to the probability output of the policy network, so that action selection retains a degree of randomness to encourage exploration; during testing, the highest-probability action is selected instead.
Preferably, in step (5), the trained reinforcement learning agent model is verified on a pre-generated test set: the total income is computed, the curve of pipe-network pressure over the test run is plotted, and the model's control of pipe-network pressure balance is verified.
Compared with the prior art, the invention has the following beneficial effects:
1. The gas system dynamic scheduling method based on improved proximal policy optimization is adapted to the different gas-consuming devices found in practice, so that the algorithm can solve the mixed problem of 0/1 consumption devices and non-0/1 consumption devices.
2. For the practical problem of unbalanced pipe-network pressure, the method takes maximizing consumption income and minimizing pipe-network fluctuation as its objectives and trains the model with improved proximal policy optimization; the trained model yields a scheduling optimization strategy efficiently and can guide actual gas scheduling to a certain extent.
3. The pressure-change curve shows that the pipe-network pressure stays within its upper and lower limits, so the method effectively balances the pipe-network pressure and improves the safety of the scheduling process.
4. The method has short solving time and good solution quality; for production scenarios demanding real-time scheduling, a schedule can be obtained directly from the trained network model.
5. Being based on deep reinforcement learning, the method can train the network on a small-scale problem and transfer it directly to a large-scale scheduling problem, solving large-scale optimal scheduling effectively with good performance and improving the adaptability of the scheduling optimization strategy.
Drawings
FIG. 1 is a topology diagram of the gas-system pipe network in an embodiment of the present invention;
FIG. 2 is a structural diagram of the reinforcement learning agent model constructed in an embodiment of the present invention;
FIG. 3 is a training curve of the improved proximal policy optimization algorithm in an embodiment of the present invention;
FIG. 4 is a pressure-change curve of the improved proximal policy optimization algorithm in an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and embodiments, which are intended to aid understanding of the invention without limiting it in any way.
Taking the gas system of an oil-refining enterprise as an example, the establishment of a dynamic scheduling model for the gas system and a simulation study are described in detail below.
As shown in fig. 1, a gas system dynamic scheduling method based on improved proximal policy optimization mainly includes:
step 1, determining a production plan interval and devices for producing and consuming gas, and establishing a gas pipe network model.
In this embodiment, the initial pipe-network pressure is 100 kPa, with upper and lower limits of 110 kPa and 90 kPa respectively. The production-plan interval is [0, T] with T = 30; the gas production in each time period is y_k, k = 1, 2, ..., N, and the gas consumption is c_k, k = 1, 2, ..., N. There are n = 5 gas-consuming devices in total, of which the first class has m = 2 and the second class has n - m = 3.
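The embodiment's parameters can be collected in a single configuration table; the key names are illustrative, not from the patent:

```python
# Embodiment parameters from the description above.
env_config = {
    "W_init_kpa": 100.0,   # initial pipe-network pressure
    "W_upper_kpa": 110.0,  # upper pressure limit
    "W_lower_kpa": 90.0,   # lower pressure limit
    "T": 30,               # production-plan interval [0, T]
    "n_devices": 5,        # gas-consuming devices in total
    "m_switch": 2,         # first-class (on/off) devices
}
# Second-class (valve-regulated) devices: n - m.
env_config["n_valve"] = env_config["n_devices"] - env_config["m_switch"]
```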
And 2, determining the initial state of the gas pipe network model according to the initial values of the gas output and the gas pipe network pressure, and iteratively updating on the basis.
In this embodiment, the initial state is the concatenation of the state of each device at the current time and the actual pipe-network pressure. In the initial state the production devices are about to produce gas according to the production plan for the 1st period and the consuming devices are not yet running: s_1 = [tank_1, tank_2, ..., tank_n, W_1], where tank_1 = tank_2 = ... = tank_n = 0 and W_1 = W_normal + input_1, with input_1 the gas injected by the production devices in the first period.
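A sketch of the initial-state construction described above; the names follow the text, and the numeric value of input_1 is illustrative:

```python
def initial_state(n, w_normal, input_1):
    """Build s_1 = [tank_1, ..., tank_n, W_1]: all consuming devices
    idle (tank_i = 0), pressure raised above the balanced level by the
    first period's gas input."""
    tanks = [0.0] * n
    w_1 = w_normal + input_1
    return tanks + [w_1]

s1 = initial_state(5, 100.0, 3.0)  # n = 5 devices, illustrative input
```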
Step 3, construct, from the optimization objective function, a reward function that evaluates how good an action is in the current state; the reward is expressed by the income generated by the gas-consuming devices and the pressure-balance level of the gas pipe network, with the same formula as in step (3) above.
Step 4, the reinforcement learning agent model obtains action a from the state and updates the model until one scheduling run is completed, recording the states, actions and rewards of the run and updating the network parameters; the reward is improved over a number of iterations.
In this embodiment, the structure of the reinforcement learning agent model (an actor-critic network) is shown in fig. 2, with the following parameters: hidden layers: 3; neurons per hidden layer: 128; actor-network learning rate: 5e-5; critic-network learning rate: 1e-3; iterations: 2000. Actions and state updates are obtained through the three fully connected layers with Tanh activation, and the total reward is improved.
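A minimal pure-Python sketch of a fully connected forward pass with Tanh activation, as in the actor-critic network described above; the layer sizes here are tiny for illustration, whereas the embodiment uses three hidden layers of 128 neurons:

```python
import math

def tanh(x):
    """Tanh activation: (e^x - e^-x) / (e^x + e^-x)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def mlp_forward(state, weights, biases):
    """Forward pass through fully connected layers, applying Tanh
    after each layer.  weights[i] is a list of rows (one per output
    neuron of layer i); biases[i] matches that layer's outputs."""
    h = state
    for W, b in zip(weights, biases):
        h = [tanh(sum(w * x for w, x in zip(row, h)) + bi)
             for row, bi in zip(W, b)]
    return h
```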
Step 5, after the iterations are completed, the model is applied to a test set and the change in pipe-network pressure is visualized.
As shown in fig. 3, 2000 training iterations were run in this embodiment. The training curve rises rapidly and converges in a short time, showing that the proposed scheduling method effectively realizes dynamic scheduling of the gas pipe-network system: training is efficient, the trained reinforcement learning agent model is stable, and high income is obtained while the pipe-network pressure is kept essentially balanced, demonstrating good reliability and practicability.
As shown in fig. 4, the pressure curve on the test set stays within the upper and lower limits, showing that the improved proximal policy optimization algorithm effectively balances the pipe-network pressure and improves the safety of the scheduling process.
In addition, the algorithm's average income over 30 test sets reaches 751; the gas generated by the production devices is fully utilized while pipe-network pressure balance is maintained, positive income is obtained, and the effectiveness of the scheduling is fully verified.
The embodiments described above illustrate the technical solution and advantages of the present invention. It should be understood that they are only specific embodiments and do not limit the invention; any modification, addition or equivalent made within the scope of the principles of the invention falls within the scope of protection of the invention.
Claims (8)
1. A gas system dynamic scheduling method based on improved proximal policy optimization, characterized by comprising the following steps:
(1) Determine the production-plan interval and the gas-producing and gas-consuming devices according to the scheduling optimization process of the gas system, and establish a gas pipe-network model;
(2) Determine the initial state of the gas pipe-network model from the initial values of gas output and pipe-network pressure, and update it iteratively from there;
(3) Construct, from the optimization objective function, a reward function that evaluates how good an action is in the current state, expressed by the income generated by the gas-consuming devices and the pressure-balance level of the gas pipe network, with the following formula:
In the formula, x_ik denotes the state of the i-th device in the k-th time period; p_ik the highest gain the i-th device can obtain in the k-th time period; c_ik the maximum consumption the i-th device can reach in the k-th time period; W_k the gas pipe-network pressure in the k-th time period; W_normal the pipe-network pressure in the fully balanced state; α_k the penalty factor for pipe-network pressure imbalance in the k-th time period; and n the number of devices;
(4) Build a reinforcement learning agent model; obtain an action a from the model's state and update the agent until one scheduling run is completed; record the states, actions and rewards of the run, update the agent's network parameters, and raise the reward through iterative training;
(5) After iterative training is finished, apply the reinforcement learning agent model to a test set and visualize the change in pipe-network pressure, to ensure the safety and reliability of the model;
(6) Save the reinforcement learning agent model and use the trained model directly for scheduling optimization of the gas system.
2. The gas system dynamic scheduling method based on improved proximal policy optimization of claim 1, characterized in that in step (1) the gas-consuming devices in the gas system are divided into two classes: the first class consumes gas as a switching quantity, i.e. either all supplied gas is consumed or none at all; the second class has a valve regulating the gas consumption, which varies continuously between 0 and c_ik;
Assuming the first class contains m devices, their action range is x_ik ∈ {0, 1}, i = 1, 2, ..., m; k = 1, 2, ..., N; the second class contains n - m devices, with action range x_jk ∈ [0, 1], j = m + 1, m + 2, ..., n; k = 1, 2, ..., N.
3. The gas system dynamic scheduling method based on improved proximal policy optimization of claim 1, characterized in that in step (2) the state of the gas pipe-network model consists of the state of each device at the current time and the actual pipe-network pressure, and the reinforcement learning agent model is supplied with the current pipe-network pressure so that it can predict and control the pressure to maintain balance while increasing income.
4. The gas system dynamic scheduling method based on improved near-end strategy optimization of claim 1, wherein in the step (4), the implementation of one-time scheduling by the reinforcement learning neural network specifically comprises the following steps:
(4-1) first, a network parameter θ of a policy is initialized 0 ,θ k For the parameters obtained from the previous training, theta for each iteration k Updating and interacting with the environment to obtain a group of state-action pairs, dynamically adjusting beta according to KL divergence, and estimating an advantage function by using a near-end strategy optimization formula
(4-2) critic Web learning to estimate the value of the current strategyAnd parameterized according to current strategyTo calculate a future discount reward
(4-3) actor network learning by theta π Parameterizing the resulting random strategy pi in order to take the action with the maximum probability of maximizing the future return sum; thus, the strategy is represented by θ π Parameterize and generate a probability distribution of the set of available actions at time t, formulated as:
where R represents the reward function evaluated by taking action a at state s and time t, and E represents the mathematical expectation;
(4-4) updating the parameter by calculating the timing difference ERROR TD-ERROR, and the formula is:
(4-5) the Tanh function is adopted for activation, expressed as follows:
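The Tanh expression image is omitted in this rendering; its standard definition, tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}), can be written out directly:

```python
import math

def tanh(x):
    """Hyperbolic tangent: (e^x - e^-x) / (e^x + e^-x), with range (-1, 1)."""
    ex, emx = math.exp(x), math.exp(-x)
    return (ex - emx) / (ex + emx)
```

Its zero-centered, bounded output is why it is a common activation choice for policy and value networks.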
and (4-6) the accumulated loss is optimized with the Adam algorithm, which iteratively updates the neural network weights based on the training data and designs an independent adaptive learning rate for each parameter.
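Adam's per-parameter adaptive learning rates come from first- and second-moment estimates of the gradient; a minimal single-parameter sketch of the standard Adam update (not patent-specific code) minimizing f(θ) = θ²:

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, scaled step."""
    m = b1 * m + (1 - b1) * grad        # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)           # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# minimize f(theta) = theta^2, whose gradient is 2 * theta
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
```

The second-moment term v is what gives each parameter its own effective step size, which is the "independent adaptive learning rate" referred to above.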
5. The gas system dynamic scheduling method based on improved near-end strategy optimization according to claim 4, wherein in step (4-1), the near-end strategy optimization formula is expressed as:
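The formula image is absent from this rendering; the adaptive KL-penalty form of proximal policy optimization (the standard expression, assumed here to correspond to the patent's improved variant) is:

```latex
L^{\mathrm{KL}}(\theta) \;=\; \mathbb{E}_t\!\left[
\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)}\,\hat{A}_t
\;-\; \beta\, \mathrm{KL}\!\left[\pi_{\theta_k}(\cdot \mid s_t)\,\big\|\,\pi_\theta(\cdot \mid s_t)\right]
\right]
```

where \(\hat{A}_t\) is the estimated advantage function and β is the coefficient that is dynamically adjusted according to the KL divergence.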
6. The gas system dynamic scheduling method based on improved near-end strategy optimization according to claim 4, wherein in step (4-2), the future discounted reward is expressed as:
wherein s_t is the state at time t, R_t is the reward obtained in the transition from s_t to s_{t+1}, T represents the total number of scheduled time instants, γ is the discount coefficient with 0 < γ ≤ 1, and E represents the mathematical expectation of the future discounted reward.
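With these symbols, the standard discounted return (the sum of γ^k · R_{t+k} over the remaining scheduling steps, consistent with 0 < γ ≤ 1 above) can be computed as:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^k * R_{t+k} over the remaining steps of the episode."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```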
7. The gas system dynamic scheduling method based on improved near-end strategy optimization according to claim 4, wherein in step (4-3), during training, the probability output of the policy network is used to sample action a_t from the available action set A_t, so that the selected action has a degree of randomness that encourages exploration; during testing, the action with the highest probability is selected instead.
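The train/test distinction in claim 7 can be sketched as follows, with a toy probability vector standing in for the policy network's output:

```python
import random

def select_action(probs, training):
    """During training, sample from the policy's distribution (exploration);
    during testing, pick the highest-probability action (exploitation)."""
    if training:
        return random.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(probs)), key=lambda i: probs[i])

greedy = select_action([0.1, 0.7, 0.2], training=False)  # always index 1
```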
8. The gas system dynamic scheduling method based on improved near-end strategy optimization according to claim 1, wherein in step (5), the trained reinforcement learning agent model is verified on a pre-generated test set: the total profit is calculated, the change curve of the pipe network pressure during the test is plotted, and the model's ability to keep the pipe network pressure balanced is verified.
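The verification step can be sketched as a loop over a pre-generated test set, assuming a hypothetical environment interface (the `reset`/`step` names and the `"pressure"` key are illustrative assumptions, not from the patent):

```python
def evaluate(env, policy, episodes):
    """Run the trained agent greedily on the test set, accumulating profit
    and recording the pipe-network pressure at every step for plotting."""
    total_profit, pressures = 0.0, []
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)                 # greedy action at test time
            state, reward, done = env.step(action)
            total_profit += reward
            pressures.append(state["pressure"])    # for the pressure curve
    return total_profit, pressures
```

The recorded `pressures` list is what would be plotted as the pressure change curve, and `total_profit` is the scalar used to compare scheduling policies.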
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210781220.7A CN115310760A (en) | 2022-07-04 | 2022-07-04 | Gas system dynamic scheduling method based on improved near-end strategy optimization |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210781220.7A CN115310760A (en) | 2022-07-04 | 2022-07-04 | Gas system dynamic scheduling method based on improved near-end strategy optimization |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115310760A true CN115310760A (en) | 2022-11-08 |
Family
ID=83856660
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210781220.7A Pending CN115310760A (en) | 2022-07-04 | 2022-07-04 | Gas system dynamic scheduling method based on improved near-end strategy optimization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115310760A (en) |
- 2022-07-04 CN CN202210781220.7A patent/CN115310760A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112186743B (en) | Dynamic power system economic dispatching method based on deep reinforcement learning | |
CN112465664B (en) | AVC intelligent control method based on artificial neural network and deep reinforcement learning | |
CN111523737B (en) | Automatic optimization-seeking adjustment method for operation mode of deep Q network-driven power system | |
CN106920008A (en) | A kind of wind power forecasting method based on Modified particle swarm optimization BP neural network | |
CN107316099A (en) | Ammunition Storage Reliability Forecasting Methodology based on particle group optimizing BP neural network | |
WO2023070293A1 (en) | Long-term scheduling method for industrial byproduct gas system | |
CN113869795B (en) | Long-term scheduling method for industrial byproduct gas system | |
CN104636985A (en) | Method for predicting radio disturbance of electric transmission line by using improved BP (back propagation) neural network | |
CN114216256B (en) | Ventilation system air volume control method of off-line pre-training-on-line learning | |
CN111062170A (en) | Transformer top layer oil temperature prediction method | |
CN114909706B (en) | Two-level network balance regulation and control method based on reinforcement learning algorithm and differential pressure control | |
CN104834975A (en) | Power network load factor prediction method based on intelligent algorithm optimization combination | |
CN107194460A (en) | The quantum telepotation recurrent neural network method of Financial Time Series Forecasting | |
CN116048028A (en) | Technological parameter optimization method based on reinforcement learning | |
CN106200379A (en) | A kind of distributed dynamic matrix majorization method of Nonself-regulating plant | |
CN109408896B (en) | Multi-element intelligent real-time monitoring method for anaerobic sewage treatment gas production | |
CN102663493A (en) | Delaying nerve network used for time sequence prediction | |
CN114566971A (en) | Real-time optimal power flow calculation method based on near-end strategy optimization algorithm | |
CN109932909A (en) | The big system of fired power generating unit desulphurization system couples Multi-variables optimum design match control method | |
Wei et al. | A combination forecasting method of grey neural network based on genetic algorithm | |
CN105389614A (en) | Implementation method for neural network self-updating process | |
CN115310760A (en) | Gas system dynamic scheduling method based on improved near-end strategy optimization | |
CN111799820A (en) | Double-layer intelligent hybrid zero-star cloud energy storage countermeasure regulation and control method for power system | |
CN116362635A (en) | Regional power grid source-load collaborative scheduling learning optimization method based on master-slave gaming | |
CN110826763B (en) | Middle-long term contract electric quantity decomposition method based on guided learning strategy |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||