CN113112077A - HVAC control system based on multi-step prediction deep reinforcement learning algorithm - Google Patents
- Publication number
- CN113112077A (application CN202110403130.XA)
- Authority
- CN
- China
- Prior art keywords
- neural network
- output
- value
- current
- environment temperature
- Prior art date
- Legal status: Granted (status assumed by Google Patents; not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for ac mains or ac distribution networks
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2203/00—Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
- H02J2203/20—Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02E—REDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
- Y02E40/00—Technologies for an efficient electrical power generation, transmission or distribution
- Y02E40/70—Smart grids as climate change mitigation technology in the energy generation sector
Abstract
The invention relates to the intelligent control of heating, ventilation and air conditioning (HVAC) systems, which regulate temperature, humidity, air cleanliness and ventilation, and in particular to an HVAC control system based on a long short-term memory (LSTM) neural network with a generalized correntropy (GC) loss function and a deep reinforcement learning (DRL) algorithm. The method comprises the following steps: collecting the outdoor ambient temperature, the indoor ambient temperature and the grid electricity price; preprocessing the collected data; predicting the outdoor ambient temperature multiple steps into the future from its historical data; and controlling the power output of the HVAC system with the Deep Deterministic Policy Gradient (DDPG) algorithm of DRL, based on the predicted future outdoor temperatures, the indoor ambient temperature and the grid electricity price. The invention can intelligently control the HVAC system in real time to reduce the user's cost while ensuring the user's comfort, and has high practical engineering application value.
Description
Technical Field
The invention relates to the intelligent optimal control of HVAC systems, in particular to a method for intelligently controlling an HVAC system based on a GC-LSTM neural network and a DRL algorithm.
Background
Household users are the end users of the power grid; their electricity-use habits, together with the addition of distributed renewable energy sources, directly create peaks and troughs in grid load, which can severely impact and seriously threaten the grid. With the development of the smart grid and the implementation of demand-response strategies in recent years, residential users have changed from passively consuming power to actively participating in the grid; in the smart-grid environment, the grid's electricity-price and generation information is communicated bidirectionally with the users' demand information. Within a household, the air conditioning system accounts for about 35% of total electricity consumption, so intelligently controlling the output power of the HVAC system according to the grid electricity price and the ambient temperature, while maintaining a given level of user comfort, is of great significance for reducing electricity use, lowering user cost and mitigating the greenhouse effect.
At present, HVAC systems mainly adopt traditional closed-loop control or model predictive control. In closed-loop control, an indoor temperature sensor is installed and the HVAC system stops working when the indoor temperature reaches a set value; such systems are simple to operate and easy to implement, but in a smart-grid environment with demand response they can hardly adjust power according to dynamic electricity prices, and thus fail to meet energy-saving and emission-reduction standards. Model predictive control drives the HVAC system from an accurate model of the indoor temperature variation, but the complexity of indoor temperature dynamics limits the accuracy of such modeling. With the development of intelligent algorithms, researchers have also proposed optimizing HVAC control with particle swarm optimization and genetic algorithms, which optimize the HVAC power output under a real-time electricity-price mechanism to reduce user cost; however, these algorithms are difficult to tune, and they do not account for the time delay between the HVAC power output and the resulting change in indoor temperature, so user comfort is not truly guaranteed. It is therefore necessary to first predict the future outdoor ambient temperature.
Disclosure of Invention
The invention provides an HVAC control method based on a multi-step prediction deep reinforcement learning algorithm, addressing the nonlinearity and randomness of the outdoor ambient temperature and the smart-grid electricity price, and the time delay between the HVAC system's output power and the resulting indoor temperature change.
The HVAC control system based on the multi-step prediction deep reinforcement learning algorithm adopts the following technical scheme; its model structure is shown in figure 1. The system comprises two stages, multi-step prediction of the outdoor ambient temperature and real-time control of the indoor temperature, wherein the prediction stage of the outdoor ambient temperature comprises the following steps:
Step one: according to the actual data points of the outdoor environment, select the outdoor ambient temperatures X = [T_1, …, T_i] at i consecutive moments as the input of the multi-step temperature prediction model, with h = [h_{i+1}, …, h_{i+n}] as the real output of the model, where n is the number of prediction steps;
Step two: preprocess the acquired data, correct abnormal data, and convert the time-series data into supervised-sequence data;
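Step two's conversion of a time series into supervised sequences can be sketched as a sliding-window transform, with window length i and horizon n as in step one (the function name here is illustrative, not from the patent):

```python
import numpy as np

def series_to_supervised(series, n_in, n_out):
    """Slide a window over a 1-D temperature series: each row pairs
    n_in past readings (input X) with the n_out following readings
    (target h), matching X = [T_1..T_i], h = [h_{i+1}..h_{i+n}]."""
    series = np.asarray(series, dtype=float)
    n = len(series) - n_in - n_out + 1
    X = np.stack([series[k:k + n_in] for k in range(n)])
    y = np.stack([series[k + n_in:k + n_in + n_out] for k in range(n)])
    return X, y

temps = [20.0, 21.0, 22.5, 23.0, 22.0, 21.5, 20.5, 19.0]
X, y = series_to_supervised(temps, n_in=3, n_out=2)
# X.shape == (4, 3), y.shape == (4, 2)
# X[0] = [20.0, 21.0, 22.5], y[0] = [23.0, 22.0]
```

Each row of X and y then forms one input/output pair for training the prediction model.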
Step three: input the input quantity into a long short-term memory neural network with a generalized correntropy loss function, which forgets, memorizes and learns from the input through its forgetting gate, input gate and output gate; the nonlinear regression model of this GC-LSTM network is described as follows:
1) Input X = [T_1, …, T_i] into the first block of the LSTM network. The forgetting gate uses a sigmoid (σ) function to determine how much of the current input X_t and the previous output h_{t-1} is retained by the current block, i.e. the output of the forgetting gate is f_t = σ(w_f[h_{t-1}, X_t] + b_f), where w_f and b_f are the weights and bias of the neural network;
2) The input gate determines which information to update: first the update information i_t = σ(w_i[h_{t-1}, X_t] + b_i) is obtained from the σ function; next a new candidate value c̃_t = tanh(w_c[h_{t-1}, X_t] + b_c) is generated by the tanh function; finally the cell state c_t of the current block is determined jointly by the output of the forgetting gate, the output of the input gate, the new candidate value and the cell state of the previous block, that is: c_t = f_t * c_{t-1} + i_t * c̃_t;
3) The output gate produces the model output: first an initial output o_t = σ(w_o[h_{t-1}, X_t] + b_o) is obtained from the σ function; then the cell state c_t from 2) is scaled to a value between -1 and 1 by the activation function tanh, finally giving the model output h_t = o_t * tanh(c_t);
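The gate equations of 1) to 3) can be sketched as a single LSTM block forward pass in NumPy (the random weights below are placeholders; a trained model learns them):

```python
import numpy as np

def lstm_cell(x_t, h_prev, c_prev, w, b):
    """One LSTM block: forget gate f_t, input gate i_t, candidate c~_t,
    cell state c_t = f_t*c_prev + i_t*c~_t, output h_t = o_t*tanh(c_t)."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
    concat = np.concatenate([h_prev, x_t])       # [h_{t-1}, X_t]
    f_t = sigma(w["f"] @ concat + b["f"])        # forget gate output
    i_t = sigma(w["i"] @ concat + b["i"])        # input gate (update info)
    c_tilde = np.tanh(w["c"] @ concat + b["c"])  # new candidate value
    c_t = f_t * c_prev + i_t * c_tilde           # updated cell state
    o_t = sigma(w["o"] @ concat + b["o"])        # initial output
    h_t = o_t * np.tanh(c_t)                     # block output
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, n_in = 4, 1
w = {k: rng.standard_normal((hidden, hidden + n_in)) * 0.1 for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}
h, c = lstm_cell(np.array([0.5]), np.zeros(hidden), np.zeros(hidden), w, b)
```

Since o_t lies in (0, 1) and tanh(c_t) in (-1, 1), every component of the block output h_t stays within (-1, 1).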
4) Calculate the error between the true value Y_t and the predicted value h_t with the generalized correntropy (GC) loss function, as in the following equation:

L_GC = (1/N) Σ_{t=1}^{N} [1 − (α / (2βΓ(1/α))) exp(−|Y_t − h_t|^α / β^α)]

where N is the number of samples, Γ is the gamma function, α > 0 is the shape parameter and β > 0 is the bandwidth parameter. Over multiple training iterations, update the weights w and biases b of the neural network by mini-batch gradient descent so as to minimize the error between the true and predicted values;
Step four: finally, the LSTM network with the generalized correntropy loss function yields a nonlinear mapping model from the outdoor ambient temperatures at the previous i moments to the outdoor ambient temperatures at the next n moments;
the real-time control of the indoor temperature comprises the following steps:
Step one: collect the outdoor ambient temperatures X = [T_1, …, T_i] at i consecutive moments and, with the GC-LSTM network, obtain the predicted outdoor temperatures h = [h_{i+1}, …, h_{i+n}] at the next n moments; obtain the current grid electricity price ρ_t and the indoor temperature T^in_t; take h, ρ_t and T^in_t as the environment information, that is: S_t = {h, ρ_t, T^in_t};
Step two: the current state information StInputting to the Actor current neural network of the deep reinforcement learning DDPG algorithm, and based on the current strategy and Gaussian noiseTo select an actionat∈[Pmin,Pmax]Gaussian noise (Gaussian)Is to increase the exploration rate of the action, and is reduced along with the increase of the number of iterative cycles, thetaμIs the Actor's current neural network parameter, PminAnd PmaxMinimum and maximum output power of the HVAC system, respectively;
Step three: perform action a_t to control the output power of the air conditioner; the power output of the HVAC system changes the indoor ambient temperature according to the room's thermal dynamics; then obtain an immediate reward r_t and reach the next state S_{t+1};
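The indoor-temperature response to the action a_t is given in the patent only as a figure; a common choice in HVAC reinforcement-learning work is a first-order equivalent-thermal-parameters model, sketched here purely as an illustrative assumption (the inertia factor eps and the cooling gain are hypothetical parameters, not from the patent):

```python
def indoor_temp_step(t_in, t_out, a_t, eps=0.7, gain=2.5):
    """Assumed first-order thermal model: the next indoor temperature is
    a blend of the current indoor temperature and the outdoor temperature,
    lowered by the cooling effect of the HVAC power a_t (kW)."""
    return eps * t_in + (1.0 - eps) * (t_out - gain * a_t)

# Cooling example: with the HVAC off the room drifts toward the outdoor 30 C;
# with 2 kW of cooling power the next temperature is lower.
off = indoor_temp_step(26.0, 30.0, 0.0)
on = indoor_temp_step(26.0, 30.0, 2.0)
assert on < off
```

The time-delay problem noted in the background shows up here directly: a_t affects the temperature only through the next-step state, which is why the controller benefits from the multi-step outdoor forecast h.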
Step four: will (S)t,at,rt,St+1) Storing the data into an experience pool buff-C;
Step five: if the amount of data in the experience pool buff-C exceeds the capacity threshold C_M, randomly take M samples (S_i, a_i, r_i, S_{i+1}), i = 1, 2, …, M, from buff-C and perform the following steps; otherwise go directly to step eleven;
Step six: calculate the target expected value y_i = r_i + γQ'(S_{i+1}, μ'(S_{i+1}|θ^μ')|θ^Q'), where μ'(S_{i+1}|θ^μ') is the optimal action obtained from the Actor target network, Q'(S_{i+1}, μ'(S_{i+1}|θ^μ')|θ^Q') is the future target value output by the Critic target network Q' based on the state and optimal-action information at the next moment, γ is the discount factor, and θ^μ' and θ^Q' are the parameters of the Actor target network and the Critic target network respectively;
Step seven: the Critic current neural network Q of the DDPG algorithm evaluates the action a_t taken, calculating the evaluation value Q(S_t, a_t|θ^Q), where θ^Q are the parameters of the Critic current network;
Step eight: calculate the error between the target expected values and the evaluation values of the samples with the mean squared error L = (1/M) Σ_{i=1}^{M} (y_i − Q(S_i, a_i|θ^Q))², and update the Critic current network parameters θ^Q by mini-batch gradient descent;
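Steps six and eight can be sketched numerically: given a mini-batch of rewards, the Critic target network's values for the next states, and the Critic current network's evaluations (all hypothetical numbers below), the target y_i = r_i + γQ' and the mean-squared error between target and evaluation are:

```python
import numpy as np

def critic_targets_and_loss(r, q_next, q_eval, gamma=0.99):
    """y_i = r_i + gamma * Q'(S_{i+1}, mu'(S_{i+1}))   (target expected value)
    L = (1/M) * sum_i (y_i - Q(S_i, a_i))^2            (critic error)"""
    y = np.asarray(r, float) + gamma * np.asarray(q_next, float)
    loss = float(np.mean((y - np.asarray(q_eval, float)) ** 2))
    return y, loss

# Hypothetical mini-batch of M = 2 transitions:
y, loss = critic_targets_and_loss(r=[1.0, 0.5], q_next=[2.0, 1.0],
                                  q_eval=[2.9, 1.5], gamma=0.9)
# y = [1.0 + 0.9*2.0, 0.5 + 0.9*1.0] = [2.8, 1.4]
```

In training, the gradient of this loss with respect to θ^Q drives the mini-batch gradient-descent update of the Critic current network.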
Step nine: updating Actor current neural network parameter theta by using sample strategy gradientμThe following equation:
Step ten: soft-copy the parameters of the Critic and Actor current networks to the Critic and Actor target networks respectively, namely:
θ^Q' ← τθ^Q + (1−τ)θ^Q'
θ^μ' ← τθ^μ + (1−τ)θ^μ'
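The soft copy of step ten is a convex blend of parameter vectors, moving each target parameter a small fraction τ toward its current-network counterpart:

```python
import numpy as np

def soft_update(theta_target, theta_current, tau=0.01):
    """theta' <- tau*theta + (1 - tau)*theta', as in step ten; small tau
    makes the target networks track the current networks slowly, which
    stabilizes the DDPG targets."""
    return (tau * np.asarray(theta_current, float)
            + (1.0 - tau) * np.asarray(theta_target, float))

target = np.array([0.0, 1.0])
current = np.array([1.0, 0.0])
updated = soft_update(target, current, tau=0.1)
# updated = [0.1, 0.9]
```

With τ = 1 this would be a hard copy; typical DDPG values are on the order of 0.001 to 0.01.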
Step eleven: take the state at the next moment as the current state, that is: S_t ← S_{t+1}; loop from step one to step eleven until a converged Actor current network is obtained; output the network parameters θ^μ to obtain the final HVAC control system model, then perform step twelve;
Step twelve: input the current state S_t into the Actor current network of the deep reinforcement learning DDPG algorithm, select the action a_t from the optimal policy, and perform a_t to control the power output of the HVAC system.
Drawings
FIG. 1 is a schematic diagram of the establishment of an HVAC intelligent control system.
Fig. 2 is a graph of loss functions of the outdoor environment temperature training set and the test set in the debugging stage, where 1 represents a loss function curve of the outdoor environment temperature training set, and 2 represents a loss function curve of the outdoor environment temperature test set.
Fig. 3 is a graph showing a real value and a predicted value of the outdoor environment temperature test set at the debugging stage, where 3 represents the predicted value of the outdoor environment temperature test set, and 4 represents the real value of the outdoor environment temperature test set.
Detailed Description
The collected real ambient temperature data are taken as the experimental object to train and test the HVAC control system based on the multi-step prediction deep reinforcement learning algorithm.
The HVAC control system based on the multi-step prediction deep reinforcement learning algorithm comprises two stages of multi-step prediction of outdoor environment temperature and real-time control of indoor temperature, wherein the prediction stage of the outdoor environment temperature comprises the following steps:
Step one: according to the actual data points of the outdoor environment, select the outdoor ambient temperatures X = [T_1, …, T_i] at i = 6 consecutive moments as the model input, with h = [h_{i+1}, …, h_{i+n}] as the real output of the model; the sampling interval is 30 minutes.
Step two: preprocessing the acquired data, correcting abnormal data, converting the data of the time sequence into data of a supervision sequence, and dividing the data into 2500 training sets and 1000 testing sets.
Step three: setting the number of cells of the long-short term memory neural network as 100, the training times as 500, the learning rate as 0.001 and the batch of the minimum batch gradient descent method as 32;
Step four: input the training-set inputs into the LSTM network with the generalized correntropy loss function, which forgets, memorizes and learns from the input through its forgetting gate, input gate and output gate; the nonlinear regression process of this GC-LSTM network is described as follows:
1) Input X = [T_1, …, T_i] into the first block of the LSTM network. The forgetting gate uses a sigmoid (σ) function to determine how much of the current input X_t and the previous output h_{t-1} is retained by the current block, i.e. the output of the forgetting gate is f_t = σ(w_f[h_{t-1}, X_t] + b_f), where w_f and b_f are the weights and bias of the neural network;
2) The input gate determines which information to update: first the update information i_t = σ(w_i[h_{t-1}, X_t] + b_i) is obtained from the σ function; next a new candidate value c̃_t = tanh(w_c[h_{t-1}, X_t] + b_c) is generated by the tanh function; finally the cell state c_t of the current block is determined jointly by the output of the forgetting gate, the output of the input gate, the new candidate value and the cell state of the previous block, that is: c_t = f_t * c_{t-1} + i_t * c̃_t;
3) The output gate produces the model output: first an initial output o_t = σ(w_o[h_{t-1}, X_t] + b_o) is obtained from the σ function; then the cell state c_t from 2) is scaled to a value between -1 and 1 by the activation function tanh, finally giving the model output h_t = o_t * tanh(c_t);
4) Calculate the error between the true value Y_t and the predicted value h_t with the generalized correntropy (GC) loss function, as in the following equation:

L_GC = (1/N) Σ_{t=1}^{N} [1 − (α / (2βΓ(1/α))) exp(−|Y_t − h_t|^α / β^α)]

where N is the number of samples, Γ is the gamma function, α > 0 is the shape parameter and β > 0 is the bandwidth parameter. Over multiple training iterations, update the weights w and biases b of the neural network by mini-batch gradient descent so as to minimize the error between the true and predicted values;
Step five: finally, the LSTM network with the generalized correntropy loss function yields a nonlinear mapping model from the outdoor ambient temperatures at the previous i moments to those at the next n moments, whose accuracy is then tested with the test set;
Step six: test the accuracy of the model on the test set, using the root mean square error (RMSE) between the true and predicted values, the probability density distribution of the error, and R² as evaluation indexes of the model, defined as:

RMSE = sqrt( (1/m) Σ_{i=1}^{m} (y_i − h_i)² )

R² = 1 − Σ_{i=1}^{m} (y_i − h_i)² / Σ_{i=1}^{m} (y_i − ȳ)²

p(e) = (1/m) Σ_{i=1}^{m} k(e − e_i)

where y_i and h_i are the true and predicted values at each step, ȳ is the mean of the true samples at each step, m is the number of test samples, and k(·) is a Gaussian kernel function; the probability density function of the error is estimated with a sliding-window approach.
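The RMSE and R² indexes of step six can be computed directly from the paired true and predicted test values (the sample numbers below are illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error over the m test samples."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y = [20.0, 22.0, 24.0, 26.0]   # true temperatures
h = [20.5, 21.5, 24.5, 25.5]   # predicted temperatures
# rmse(y, h) == 0.5; r2(y, h) == 1 - 1.0/20.0 == 0.95
```

An R² near 1 and a small RMSE together indicate that the multi-step forecasts track the real outdoor temperature closely.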
The real-time control of the indoor temperature comprises the following steps:
Step one: acquire the outdoor ambient temperatures X = [T_1, …, T_i] at i = 6 consecutive moments and, with the GC-LSTM neural network model, obtain the predicted outdoor temperatures h = [h_{i+1}, …, h_{i+n}] at the next n = 3 moments; obtain the current grid electricity price ρ_t and indoor temperature T^in_t, and divide the data into a training set of 2500 samples and a test set of 1000 samples. Take h, ρ_t and T^in_t as the environment information, that is: S_t = {h, ρ_t, T^in_t};
Step two: setting a DDPG algorithm of deep reinforcement learning as four neural networks, wherein a current neural network of an Actor and a target neural network of the Actor have three layers of neural networks with the same structure, a hidden layer activation function is tanh, a current neural network of Critic and the target neural network of Critic have the same neural network structure, and the hidden layer activation function is relu;
Step three: input the current state S_t from the training set into the Actor current network and select an action a_t = μ(S_t|θ^μ) + N_t, a_t ∈ [P_min, P_max], based on the current policy and Gaussian noise N_t, where P_min and P_max are the minimum and maximum output power of the HVAC system;
Step four: perform action a_t to control the output power of the air conditioner, then obtain an immediate reward r_t and reach the next state S_{t+1}. The reward r_t reflects the user's electricity cost and comfort: it penalizes the electricity cost ρ_t·a_t and the deviation of the indoor temperature from the comfort range [T_min, T_max], where T_min and T_max are the minimum and maximum comfortable temperatures, and λ_1 and λ_2 are weighting factors balancing the two terms of the reward;
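A reward consistent with step four, penalizing electricity cost via λ_1 and comfort-range violation via λ_2, might look like the following; the exact functional form is an assumption, since the patent's reward equation is given only in a figure:

```python
def reward(price, power, t_in, t_min=20.0, t_max=24.0, lam1=1.0, lam2=10.0):
    """Assumed reward: negative electricity cost minus a penalty for the
    indoor temperature leaving the comfort band [t_min, t_max]; lam1 and
    lam2 weight cost versus comfort as in step four."""
    cost = price * power
    discomfort = max(t_in - t_max, 0.0) + max(t_min - t_in, 0.0)
    return -lam1 * cost - lam2 * discomfort

inside = reward(price=0.5, power=2.0, t_in=22.0)   # comfortable: only cost counts
outside = reward(price=0.5, power=2.0, t_in=26.0)  # 2 C above t_max: extra penalty
```

Raising λ_2 relative to λ_1 makes the learned policy favor comfort over cost savings, and vice versa.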
Step five: store (S_t, a_t, r_t, S_{t+1}) into the experience pool buff-C;
Step six: then randomly take M samples (S_i, a_i, r_i, S_{i+1}), i = 1, 2, …, M, from the experience pool buff-C;
Step seven: calculating the expectation y of the target based on the status of the next time and the action obtained by the target network of the Actori=ri+γQ'(Si+1,μ'(Si+1|θμ')|θQ');
Step eight: action a taken by Critic current neural network Q pair of DDPG algorithmtEvaluation is performed to calculate an evaluation value Q (S)t,ai|θQ);
Step nine: calculating an error value between a target desired value and an evaluation value of a sample using a root mean square errorUpdating the parameters of the Critic current neural network by using a minimum batch gradient descent method;
Step ten: update the Actor current network parameters θ^μ with the sampled policy gradient;
Step eleven: soft copying the parameters of the current neural networks of the Ctric and the Actor to the target neural network parameters of the Ctric and the Actor respectively;
θ^Q' ← τθ^Q + (1−τ)θ^Q'
θ^μ' ← τθ^μ + (1−τ)θ^μ'
Step twelve: obtain a converged Actor current network by training on the training set and output the network parameters θ^μ, taking the reward obtained in each training iteration and the per-step error L as the indicators of network convergence;
Step thirteen: input the current state S_t of the test set into the Actor current network of the DDPG algorithm, select the action a_t from the optimal policy, and perform a_t to control the power output of the HVAC system, taking the electricity cost of the HVAC system and the user's comfort as the performance indexes of the system.
The advantages of the invention are: the long short-term memory neural network predicts the future outdoor ambient temperature, improving user comfort, and the generalized correntropy loss function is used as the LSTM loss function to improve prediction accuracy; then, based on the DDPG algorithm, the power output of the HVAC system is intelligently adjusted according to the changes in the grid electricity price, the indoor temperature and the future outdoor temperature, saving the user's electricity cost while ensuring the user's comfort.
The above description is only one embodiment of the present invention; the structural features of the invention are not limited thereto, and any changes or modifications made by those skilled in the art within the scope of the invention are covered by it.
Claims (1)
1. The HVAC control system based on the multi-step prediction deep reinforcement learning algorithm, characterized in that it comprises two stages, multi-step prediction of the outdoor ambient temperature and real-time control of the indoor temperature, wherein the prediction stage of the outdoor ambient temperature comprises the following steps:
Step one: according to the actual data points of the outdoor environment, select the outdoor ambient temperatures X = [T_1, …, T_i] at i consecutive moments as the input of the multi-step temperature prediction model, with h = [h_{i+1}, …, h_{i+n}] as the real output of the model, where n is the number of prediction steps;
Step two: preprocess the acquired data, correct abnormal data, and convert the time-series data into supervised-sequence data;
Step three: input the input quantity into a long short-term memory neural network with a generalized correntropy loss function, which forgets, memorizes and learns from the input through its forgetting gate, input gate and output gate; the nonlinear regression model of this GC-LSTM network is described as follows:
1) Input X = [T_1, …, T_i] into the first block of the LSTM network. The forgetting gate uses a sigmoid (σ) function to determine how much of the current input X_t and the previous output h_{t-1} is retained by the current block, i.e. the output of the forgetting gate is f_t = σ(w_f[h_{t-1}, X_t] + b_f), where w_f and b_f are the weights and bias of the neural network;
2) The input gate determines which information to update: first the update information i_t = σ(w_i[h_{t-1}, X_t] + b_i) is obtained from the σ function; next a new candidate value c̃_t = tanh(w_c[h_{t-1}, X_t] + b_c) is generated by the tanh function; finally the cell state c_t of the current block is determined jointly by the output of the forgetting gate, the output of the input gate, the new candidate value and the cell state of the previous block, that is: c_t = f_t * c_{t-1} + i_t * c̃_t;
3) The output gate produces the model output: first an initial output o_t = σ(w_o[h_{t-1}, X_t] + b_o) is obtained from the σ function; then the cell state c_t from step 2) is scaled to a value between -1 and 1 by the activation function tanh, finally giving the model output h_t = o_t * tanh(c_t);
4) Calculate the error between the true value Y_t and the predicted value h_t with the generalized correntropy loss function, as in the following equation:

L_GC = (1/N) Σ_{t=1}^{N} [1 − (α / (2βΓ(1/α))) exp(−|Y_t − h_t|^α / β^α)]

where N is the number of samples, Γ is the gamma function, α > 0 is the shape parameter and β > 0 is the bandwidth parameter. Over multiple training iterations, update the weights w and biases b of the neural network by mini-batch gradient descent so as to minimize the error between the true and predicted values;
Step four: finally, the long short-term memory neural network with the generalized correntropy loss function yields a nonlinear mapping model from the outdoor ambient temperatures at the previous i moments to the outdoor ambient temperatures at the next n moments;
the real-time control of the indoor temperature comprises the following steps:
Step one: collect the outdoor ambient temperatures X = [T_1, …, T_i] at i consecutive moments and, with the GC-LSTM network, obtain the predicted outdoor temperatures h = [h_{i+1}, …, h_{i+n}] at the next n moments; obtain the current grid electricity price ρ_t and the indoor temperature T^in_t; take h, ρ_t and T^in_t as the environment information, that is: S_t = {h, ρ_t, T^in_t};
Step two: input the current state S_t into the Actor current neural network of the deep reinforcement learning DDPG algorithm and select an action a_t = μ(S_t|θ^μ) + N_t, a_t ∈ [P_min, P_max], based on the current policy plus Gaussian noise N_t. The Gaussian noise increases the exploration rate of the actions and is reduced as the number of iteration cycles grows; θ^μ denotes the parameters of the Actor current neural network, and P_min and P_max are the minimum and maximum output power of the HVAC system, respectively;
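The action-selection rule of step two (policy output plus decaying Gaussian noise, clipped to the HVAC power range) can be sketched as below; the decay schedule and noise scale are illustrative assumptions, as the patent only states that the noise shrinks as iterations increase.

```python
import random

def select_action(mu_value, episode, p_min=0.0, p_max=5.0,
                  sigma0=0.5, decay=0.99):
    """a_t = mu(S_t|theta_mu) + N_t, clipped to [P_min, P_max].

    mu_value is the Actor network's deterministic output; the Gaussian
    noise standard deviation decays with the iteration count (episode).
    """
    sigma = sigma0 * (decay ** episode)   # exploration shrinks over training
    a = mu_value + random.gauss(0.0, sigma)
    return min(max(a, p_min), p_max)
```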
Step three: execute action a_t to control the output power of the air conditioner; the power output of the HVAC system changes the indoor ambient temperature according to the room's thermal dynamics. A timely reward r_t is then obtained and the system reaches the next state S_{t+1};
Step four: store the transition (S_t, a_t, r_t, S_{t+1}) into the experience pool buff-C;
Step five: if the amount of data in the experience pool buff-C is larger than C_M, randomly draw M samples (S_i, a_i, r_i, S_{i+1}), i = 1, 2, …, M, from buff-C and perform the following steps; otherwise, go directly to step eleven;
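Steps four and five describe a standard experience-replay buffer; a minimal sketch, with capacity and method names chosen for illustration:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool buff-C: stores (S_t, a_t, r_t, S_{t+1}) transitions
    and samples mini-batches once enough data has accumulated."""

    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)  # old transitions are evicted

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def ready(self, c_m):
        """True once the stored data volume exceeds the threshold C_M."""
        return len(self.buf) > c_m

    def sample(self, m):
        """Randomly draw M transitions without replacement."""
        return random.sample(self.buf, m)
```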
Step six: calculate the target expected value y_i = r_i + γ·Q′(S_{i+1}, μ′(S_{i+1}|θ^μ′)|θ^Q′), where μ′(S_{i+1}|θ^μ′) is the optimal action obtained from the Actor target neural network, Q′(S_{i+1}, μ′(S_{i+1}|θ^μ′)|θ^Q′) is the future target value output by the Critic target network Q′ based on the state information and the optimal action at the next moment, γ is the discount factor, and θ^μ′ and θ^Q′ are the parameters of the Actor target neural network and of the Critic target network, respectively;
Step seven: the Critic current neural network Q of the DDPG algorithm evaluates the adopted action a_t and calculates the evaluation value Q(S_t, a_t|θ^Q), where θ^Q denotes the parameters of the Critic current neural network;
Step eight: calculate the error between the target expected value and the evaluation value over the samples with the mean squared error L = (1/M)·Σ_{i=1..M} (y_i − Q(S_i, a_i|θ^Q))², and update the parameters θ^Q of the Critic current neural network by mini-batch gradient descent;
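Steps six to eight (target value computation and the Critic error) can be sketched as follows, assuming the Q-values for the batch have already been computed by the Critic networks; function names are illustrative.

```python
import numpy as np

def critic_targets(rewards, q_next, gamma=0.99):
    """y_i = r_i + gamma * Q'(S_{i+1}, mu'(S_{i+1}|theta_mu')|theta_Q')."""
    return np.asarray(rewards) + gamma * np.asarray(q_next)

def critic_loss(y, q_current):
    """Mean squared error between target values y_i and Critic evaluations."""
    return float(np.mean((np.asarray(y) - np.asarray(q_current)) ** 2))
```

In a full implementation, the gradient of this loss with respect to θ^Q drives the mini-batch gradient descent update of step eight.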
Step nine: update the Actor current neural network parameters θ^μ using the sampled policy gradient, as in the following equation:

∇_{θ^μ} J ≈ (1/M)·Σ_{i=1..M} ∇_a Q(S, a|θ^Q)|_{S=S_i, a=μ(S_i)} · ∇_{θ^μ} μ(S|θ^μ)|_{S=S_i};
Step ten: soft-copy the parameters of the Critic and Actor current neural networks to the Critic and Actor target neural network parameters, respectively, that is:
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′
where τ is the soft-update coefficient;
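The soft-update rule of step ten can be sketched element-wise over parameter lists; in practice it is applied to every weight tensor of the networks, and the value of τ (small, e.g. 0.005) is an assumption, as the patent does not specify it.

```python
def soft_update(target_params, online_params, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta', element-wise.

    target_params: current target-network parameters (theta').
    online_params: current online-network parameters (theta).
    Returns the new target parameters.
    """
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]
```

With τ = 1 this degenerates to a hard copy; a small τ makes the target networks track the online networks slowly, which stabilizes the TD targets of step six.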
Step eleven: take the state at the next moment as the state at the current moment, that is: S_t ← S_{t+1}; loop from step one to step eleven until a converged Actor current neural network is obtained; output its parameters θ^μ to obtain the final HVAC control system model, then perform step twelve;
Step twelve: input the current state S_t into the trained Actor current neural network of the deep reinforcement learning DDPG algorithm, select the action a_t based on the optimal policy, execute a_t, and control the power output of the HVAC system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110403130.XA CN113112077B (en) | 2021-04-14 | 2021-04-14 | HVAC control system based on multi-step prediction deep reinforcement learning algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110403130.XA CN113112077B (en) | 2021-04-14 | 2021-04-14 | HVAC control system based on multi-step prediction deep reinforcement learning algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113112077A true CN113112077A (en) | 2021-07-13 |
CN113112077B CN113112077B (en) | 2022-06-10 |
Family
ID=76716975
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110403130.XA Active CN113112077B (en) | 2021-04-14 | 2021-04-14 | HVAC control system based on multi-step prediction deep reinforcement learning algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113112077B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113485498A (en) * | 2021-07-19 | 2021-10-08 | 北京工业大学 | Indoor environment comfort level adjusting method and system based on deep learning |
CN113659246A (en) * | 2021-10-20 | 2021-11-16 | 中国气象科学研究院 | Battery system suitable for polar region ultralow temperature environment and temperature control method thereof |
CN113741449A (en) * | 2021-08-30 | 2021-12-03 | 南京信息工程大学 | Multi-agent control method for air-sea cooperative observation task |
CN113940218A (en) * | 2021-09-30 | 2022-01-18 | 上海易航海芯农业科技有限公司 | Intelligent heat supply method and system for greenhouse |
CN114488811A (en) * | 2022-01-25 | 2022-05-13 | 同济大学 | Greenhouse environment energy-saving control method based on second-order Voltalla model prediction |
CN115412923A (en) * | 2022-10-28 | 2022-11-29 | 河北省科学院应用数学研究所 | Multi-source sensor data credible fusion method, system, equipment and storage medium |
TWI795283B (en) * | 2022-05-04 | 2023-03-01 | 台灣松下電器股份有限公司 | Control method of air conditioning system |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102353119A (en) * | 2011-08-09 | 2012-02-15 | 北京建筑工程学院 | Control method of VAV (variable air volume) air-conditioning system |
CN105805822A (en) * | 2016-03-24 | 2016-07-27 | 常州英集动力科技有限公司 | Heat supply energy saving control method and system based on neural network prediction |
CN105870483A (en) * | 2016-03-31 | 2016-08-17 | 华中科技大学 | Thermoelectric synergic control method of solid oxide fuel cell during power tracking process |
JP2016205739A (en) * | 2015-04-24 | 2016-12-08 | 京セラ株式会社 | Power control method, power control device and power control system |
US20190102668A1 (en) * | 2017-10-04 | 2019-04-04 | Hengshuai Yao | Method of prediction of a state of an object in the environment using an action model of a neural network |
CN110458443A (en) * | 2019-08-07 | 2019-11-15 | 南京邮电大学 | A kind of wisdom home energy management method and system based on deeply study |
US20190354071A1 (en) * | 2018-05-18 | 2019-11-21 | Johnson Controls Technology Company | Hvac control system with model driven deep learning |
CN111080002A (en) * | 2019-12-10 | 2020-04-28 | 华南理工大学 | Deep learning-based multi-step prediction method and system for building electrical load |
CN111365828A (en) * | 2020-03-06 | 2020-07-03 | 上海外高桥万国数据科技发展有限公司 | Model prediction control method for realizing energy-saving temperature control of data center by combining machine learning |
US20210049460A1 (en) * | 2019-08-15 | 2021-02-18 | Noodle Analytics, Inc. | Deep probabilistic decision machines |
CN112460741A (en) * | 2020-11-23 | 2021-03-09 | 香港中文大学(深圳) | Control method of building heating, ventilation and air conditioning system |
CN112561728A (en) * | 2020-10-28 | 2021-03-26 | 西安交通大学 | Attention mechanism LSTM-based comprehensive energy consumption cost optimization method, medium and equipment |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102353119A (en) * | 2011-08-09 | 2012-02-15 | 北京建筑工程学院 | Control method of VAV (variable air volume) air-conditioning system |
JP2016205739A (en) * | 2015-04-24 | 2016-12-08 | 京セラ株式会社 | Power control method, power control device and power control system |
CN105805822A (en) * | 2016-03-24 | 2016-07-27 | 常州英集动力科技有限公司 | Heat supply energy saving control method and system based on neural network prediction |
CN105870483A (en) * | 2016-03-31 | 2016-08-17 | 华中科技大学 | Thermoelectric synergic control method of solid oxide fuel cell during power tracking process |
US20190102668A1 (en) * | 2017-10-04 | 2019-04-04 | Hengshuai Yao | Method of prediction of a state of an object in the environment using an action model of a neural network |
US20190354071A1 (en) * | 2018-05-18 | 2019-11-21 | Johnson Controls Technology Company | Hvac control system with model driven deep learning |
CN110458443A (en) * | 2019-08-07 | 2019-11-15 | 南京邮电大学 | A kind of wisdom home energy management method and system based on deeply study |
US20210049460A1 (en) * | 2019-08-15 | 2021-02-18 | Noodle Analytics, Inc. | Deep probabilistic decision machines |
CN111080002A (en) * | 2019-12-10 | 2020-04-28 | 华南理工大学 | Deep learning-based multi-step prediction method and system for building electrical load |
CN111365828A (en) * | 2020-03-06 | 2020-07-03 | 上海外高桥万国数据科技发展有限公司 | Model prediction control method for realizing energy-saving temperature control of data center by combining machine learning |
CN112561728A (en) * | 2020-10-28 | 2021-03-26 | 西安交通大学 | Attention mechanism LSTM-based comprehensive energy consumption cost optimization method, medium and equipment |
CN112460741A (en) * | 2020-11-23 | 2021-03-09 | 香港中文大学(深圳) | Control method of building heating, ventilation and air conditioning system |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113485498A (en) * | 2021-07-19 | 2021-10-08 | 北京工业大学 | Indoor environment comfort level adjusting method and system based on deep learning |
CN113741449A (en) * | 2021-08-30 | 2021-12-03 | 南京信息工程大学 | Multi-agent control method for air-sea cooperative observation task |
CN113741449B (en) * | 2021-08-30 | 2023-07-14 | 南京信息工程大学 | Multi-agent control method for sea-air collaborative observation task |
CN113940218A (en) * | 2021-09-30 | 2022-01-18 | 上海易航海芯农业科技有限公司 | Intelligent heat supply method and system for greenhouse |
CN113659246A (en) * | 2021-10-20 | 2021-11-16 | 中国气象科学研究院 | Battery system suitable for polar region ultralow temperature environment and temperature control method thereof |
CN113659246B (en) * | 2021-10-20 | 2022-01-25 | 中国气象科学研究院 | Battery system suitable for polar region ultralow temperature environment and temperature control method thereof |
CN114488811A (en) * | 2022-01-25 | 2022-05-13 | 同济大学 | Greenhouse environment energy-saving control method based on second-order Voltalla model prediction |
CN114488811B (en) * | 2022-01-25 | 2023-08-29 | 同济大学 | Greenhouse environment energy-saving control method based on second-order Woltai model prediction |
TWI795283B (en) * | 2022-05-04 | 2023-03-01 | 台灣松下電器股份有限公司 | Control method of air conditioning system |
CN115412923A (en) * | 2022-10-28 | 2022-11-29 | 河北省科学院应用数学研究所 | Multi-source sensor data credible fusion method, system, equipment and storage medium |
CN115412923B (en) * | 2022-10-28 | 2023-02-03 | 河北省科学院应用数学研究所 | Multi-source sensor data credible fusion method, system, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113112077B (en) | 2022-06-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113112077B (en) | HVAC control system based on multi-step prediction deep reinforcement learning algorithm | |
CN112614009B (en) | Power grid energy management method and system based on deep expectation Q-learning | |
CN110705743B (en) | New energy consumption electric quantity prediction method based on long-term and short-term memory neural network | |
CN110084367B (en) | Soil moisture content prediction method based on LSTM deep learning model | |
CN109659933B (en) | Electric energy quality prediction method for power distribution network with distributed power supply based on deep learning model | |
CN114370698B (en) | Indoor thermal environment learning efficiency improvement optimization control method based on reinforcement learning | |
CN113572157B (en) | User real-time autonomous energy management optimization method based on near-end policy optimization | |
CN107704875A (en) | Based on the building load Forecasting Methodology and device for improving IHCMAC neutral nets | |
CN116187601B (en) | Comprehensive energy system operation optimization method based on load prediction | |
CN112070262B (en) | Air conditioner load prediction method based on support vector machine | |
CN112926795A (en) | SBO (statistical analysis) -based CNN (continuous casting) optimization-based high-rise residential building group heat load prediction method and system | |
CN114239991A (en) | Building heat supply load prediction method, device and equipment based on data driving | |
CN111898856B (en) | Analysis method of physical-data fusion building based on extreme learning machine | |
CN114119273A (en) | Park comprehensive energy system non-invasive load decomposition method and system | |
Dong et al. | Short-term building cooling load prediction model based on DwdAdam-ILSTM algorithm: A case study of a commercial building | |
Godahewa et al. | Simulation and optimisation of air conditioning systems using machine learning | |
CN116880169A (en) | Peak power demand prediction control method based on deep reinforcement learning | |
Zhang et al. | Data-driven model predictive and reinforcement learning based control for building energy management: A survey | |
CN115169839A (en) | Heating load scheduling method based on data-physics-knowledge combined drive | |
CN113962454A (en) | LSTM energy consumption prediction method based on dual feature selection and particle swarm optimization | |
CN114200839A (en) | Office building energy consumption intelligent control model for dynamic monitoring of coupled environmental behaviors | |
Yu et al. | Research on Intelligent Air Conditioning Optimization Control Algorithms Based on Neural Networks and Heuristic Algorithms | |
CN117973644B (en) | Distributed photovoltaic power virtual acquisition method considering optimization of reference power station | |
CN115840986B (en) | Energy management method based on stochastic model predictive control | |
Saranya et al. | AI buildings: design of artificially intelligent buildings in the energy sector with an autonomous federated learning approach |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||