CN114739229A - Heat exchange process important parameter control method based on reinforcement learning


Info

Publication number
CN114739229A
CN114739229A
Authority
CN
China
Prior art keywords
heat exchange process
control
reinforcement learning
important
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210221479.6A
Other languages
Chinese (zh)
Inventor
王轩 (Wang Xuan)
王瑞 (Wang Rui)
田华 (Tian Hua)
蔡金文 (Cai Jinwen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202210221479.6A
Publication of CN114739229A
Legal status: Pending


Classifications

    • F: MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F28: HEAT EXCHANGE IN GENERAL
    • F28F: DETAILS OF HEAT-EXCHANGE AND HEAT-TRANSFER APPARATUS, OF GENERAL APPLICATION
    • F28F27/00: Control arrangements or safety devices specially adapted for heat-exchange or heat-transfer apparatus
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02B: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO BUILDINGS, e.g. HOUSING, HOUSE APPLIANCES OR RELATED END-USER APPLICATIONS
    • Y02B30/00: Energy efficient heating, ventilation or air conditioning [HVAC]
    • Y02B30/70: Efficient control or regulation technologies, e.g. for control of refrigerant flow, motor or heating
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/12: Improving ICE efficiencies

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Thermal Sciences (AREA)
  • Mechanical Engineering (AREA)
  • General Engineering & Computer Science (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a reinforcement-learning-based method for controlling important parameters of a heat exchange process, comprising the following steps. First, within a reinforcement learning algorithm framework, the controller of the heat exchange process parameters is taken as the reinforcement learning agent, and the actions output by the agent are the control variables. Second, a heat exchanger boundary condition control signal, or an actuator control signal that directly influences the heat exchanger boundary condition, is taken as the agent's output action; the important state parameters of the heat exchange process are taken as the agent's observations; and a reward function is constructed so that the closer the controlled parameter is to the control target, the larger the instant reward. Through continuous training with the reinforcement learning algorithm, the agent's output actions converge toward the maximum of the reward function, finally yielding an agent that can accurately control the variation of the important state parameters of the heat exchange process. The invention can accurately and reliably control important heat exchanger parameters during the heat exchange process and improves the control accuracy of heat exchange process parameters under severely and frequently fluctuating boundary conditions.

Description

Heat exchange process important parameter control method based on reinforcement learning
Technical Field
The invention relates to the technical field of energy utilization, in particular to a heat exchange process important parameter control method based on reinforcement learning.
Background
At present, heat exchange processes are ubiquitous in the energy, chemical, power and other industries, and are a very important link in industrial production.
The heat exchange process has a crucial influence on the steady-state and dynamic performance of the whole system. Therefore, to ensure safe and efficient operation of the system, some important parameters in the heat exchanger, such as fluid temperature and pressure, must be effectively controlled. However, the boundary conditions of actual heat exchange processes often fluctuate frequently and severely; for example, a heat exchanger recovering flue-gas waste heat from a vehicle internal combustion engine faces frequent and severe flue-gas fluctuations, which poses a great challenge for parameter control.
Conventional PID (proportional-integral-derivative) control often fails to deliver a satisfactory control effect in the face of such strongly fluctuating boundary conditions and suffers from poor control accuracy.
Therefore, there is an urgent need for a technology that can accurately and reliably control important heat exchanger parameters during the heat exchange process and improve the control accuracy of heat exchange process parameters under severely and frequently fluctuating boundary conditions.
Disclosure of Invention
The invention aims to provide a reinforcement-learning-based method for controlling important parameters of a heat exchange process, addressing the technical defects in the prior art.
To this end, the invention provides a heat exchange process important parameter control method based on reinforcement learning, comprising the following steps:
firstly, using a reinforcement learning algorithm framework, taking the controller of the heat exchange process parameters as the reinforcement learning agent, the actions output by the agent being the control variables;
and secondly, taking a heat exchanger boundary condition control signal, or an actuator control signal that directly influences the heat exchanger boundary condition, as the agent's output action; taking the important state parameters of the heat exchange process as the agent's observations; constructing a reward function such that the closer the controlled parameter is to the control target, the larger the instant reward; and, through continuous training with the reinforcement learning algorithm, making the agent's output actions converge toward the maximum of the reward function, finally obtaining an agent that can accurately control the variation of the important state parameters of the heat exchange process.
Preferably, in the second step, when the heat exchanger is an evaporator: if the boundary condition control signal of the evaporator is adopted, the action output by the agent is the cold-fluid inlet flow of the evaporator; if an actuator control signal that directly influences the boundary condition is adopted, the action is the rotational speed signal of the pump.
Preferably, in the second step, the important state parameters of the heat exchange process serving as the agent's observations specifically include the temperatures and pressures of the cold and hot fluids, their rates of change, the flow rates of the cold and hot fluids, and the control error of the controlled important state parameter of the heat exchange process.
Preferably, in the second step, the control target is specifically the target value of the important state parameter of the heat exchange process;
when the heat exchanger is an evaporator, it is the target value of the superheat degree at the evaporator cold-fluid outlet.
Preferably, in the second step, the pre-constructed reward function is a function of the error obtained by directly subtracting the actual value of the controlled heat exchanger parameter from the control reference value;
the smaller the error, the larger the value of the reward function;
the controlled heat exchanger parameters are important state parameters of the heat exchange process.
Preferably, when the controlled important state parameter of the heat exchange process is the superheat degree of the working medium at the evaporator outlet, the pre-constructed reward function is as follows:
(Equation (1): the reward function; it is reproduced only as an image in the original document.)
In Equation (1), e denotes the error between the superheat degree and its target value, a denotes the action, sup denotes the superheat degree, and the subscript t denotes the time step.
Compared with the prior art, the reinforcement-learning-based method for controlling important parameters of a heat exchange process provided by the invention is scientifically designed; it solves the problem of poor control accuracy of heat exchange process parameters under transient fluctuations, can accurately and reliably control important heat exchanger parameters during the heat exchange process, improves the control accuracy of heat exchange process parameters under severely and frequently fluctuating boundary conditions, and is of great practical significance.
Drawings
FIG. 1 is a flow chart of a method for controlling important parameters of a heat exchange process based on reinforcement learning according to the present invention;
FIG. 2a is a schematic diagram of the first control structure adopted in an embodiment of the reinforcement-learning-based heat exchange process important parameter control method provided by the present invention, i.e., a control schematic in which the agent's action is a boundary condition control signal (a heat exchanger boundary condition control signal);
FIG. 2b is a schematic diagram of the second control structure adopted in an embodiment of the method, i.e., a control schematic in which the agent's action is an actuator control signal (an actuator signal that directly influences the heat exchanger boundary condition);
FIG. 3 is a schematic diagram of the control effect of the trained agent under a fluctuating heat source using the control structure of FIG. 2a according to the present invention;
FIG. 4 is a schematic diagram of the DDPG algorithm framework.
Detailed Description
In order that those skilled in the art will better understand the technical solution of the present invention, the present invention will be further described in detail with reference to the accompanying drawings and embodiments.
Referring to fig. 1 to 4, the invention provides a heat exchange process important parameter control method based on reinforcement learning, which comprises the following steps:
firstly, using a reinforcement learning algorithm framework, the controller of the heat exchange process parameters (when the heat exchanger is an evaporator in a Rankine cycle, the cold-fluid temperature controller of the evaporator) is taken as the reinforcement learning agent; the actions output by the agent are the control variables;
it should be noted that when the heat exchanger is an evaporator in a Rankine cycle (that is, taking control of the superheat degree at the evaporator cold-fluid outlet as an example), the reinforcement-learning-based heat exchange process important parameter control method provided by the present invention becomes a reinforcement-learning-based superheat control method.
And secondly, a boundary condition control signal of the heat exchanger (e.g., a control signal for the flow of the cold or hot fluid in an evaporator), or an actuator control signal that directly influences the heat exchanger boundary condition (e.g., a pump rotational speed control signal), is taken as the agent's output action, and the important state parameters of the heat exchange process are taken as the agent's observations. A reward function is constructed so that the closer the process is to the control target, the larger the instant reward. Through continuous training with the reinforcement learning algorithm, the agent's output actions converge toward the maximum of the reward function, finally yielding an agent, which may also be called an intelligent controller, that can accurately control the variation of the important state parameters of the heat exchange process.
In the second step, in terms of implementation, the control target may specifically be the target value of the important state parameter of the heat exchange process; when the heat exchanger is an evaporator in a Rankine cycle, it is the target superheat value at the evaporator cold-fluid outlet.
It should be noted that the control method of the present invention uses a reinforcement learning algorithm framework: a reinforcement learning agent serves as the controller of a heat exchange process parameter (for example, the superheat degree of the evaporator cold fluid), and the agent's output action is the control variable. The agent obtains an instant reward by observing the environment state (for example, taking important state parameters of the heat exchange process as observations), continuously interacts with the environment (i.e., the heat exchange process) according to the constructed reward function, and converges toward actions with larger rewards, finally yielding a trained agent, or intelligent controller, that can accurately control the parameter.
In the invention, the optimal control strategy is learned continuously through a reinforcement learning algorithm. Given the strong optimization and decision-making capability that reinforcement learning, and especially deep reinforcement learning, has demonstrated in other fields, the control method can outperform traditional control methods for heat exchange process parameter control.
In the second step, specifically, when the heat exchanger is an evaporator in a Rankine cycle: if the boundary condition control signal of the evaporator is adopted, the action (i.e., the control variable) output by the agent is the cold-fluid inlet flow of the evaporator, and the control structure is as shown in FIG. 2a; if a control signal of an actuator that directly influences the boundary condition (e.g., the working medium pump) is adopted, the control signal is the pump rotational speed signal, and the control structure is as shown in FIG. 2b.
In the present invention, in terms of specific implementation, the important heat exchange process parameters to be controlled specifically include parameters such as the temperature and pressure of the hot or cold fluid, for example the superheat degree at the evaporator cold-fluid outlet.
Specifically, the boundary conditions of the heat exchanger are the inlet and outlet flow rates of the cold and hot fluids and their inlet temperatures. For example, the boundary conditions may be the inlet temperatures and flow rates of the cold fluid (working medium) and hot fluid (heat source) in the evaporator, together with the outlet flow rates; these parameters are necessary to define the heat exchange process.
The cold fluid inlet flow is the flow at the evaporator cold fluid inlet, and the cold fluid outlet flow is the flow at the evaporator cold fluid outlet. The flow rate of the hot fluid (i.e., the heat source) is the flow rate at the hot fluid inlet of the evaporator, and the hot fluid outlet flow rate is the flow rate at the hot fluid outlet of the evaporator.
In the present invention, as shown in FIG. 2b, in the embodiment, when the heat exchanger is an evaporator, the only actuator that directly influences the boundary condition is the working medium pump, which changes the cold-fluid flow boundary condition.
In the second step, the agent's output action is either a control signal that changes a boundary condition of the heat exchanger containing the controlled parameter (a heat exchanger boundary condition control signal), or a signal for an actuator that directly influences the heat exchanger boundary condition. For example, adjusting the flow of the cold or hot fluid (reflected at the actuator as a change in the corresponding fluid pump's rotational speed) changes the fluid temperature after heat exchange.
It should be noted that, in the present invention, the heat exchanger boundary condition control signal is a change signal for that boundary condition, such as a change signal for the boundary flow rate. Controlling these boundary conditions requires corresponding actuators; for example, flow control requires a pump, so a pump control signal is needed to adjust the pump's rotational speed and thereby control the boundary flow rate.
In the invention, the heat exchanger boundary condition control signal and the actuator control signal that directly influences the boundary condition both act on the heat exchange process parameters. For example, referring to FIGS. 2a and 2b, when the heat exchanger is an evaporator, the controlled boundary condition is the cold-fluid flow, which is achieved by controlling the pump rotational speed; the working medium pump speed control signal therefore directly influences the boundary cold-fluid flow.
In the invention, the heat exchanger boundary condition control signal is provided by the agent, i.e., the controller of the heat exchange process parameters (when the heat exchanger is an evaporator in a Rankine cycle, the evaporator's cold-fluid temperature controller). In the first control structure, the heat exchanger boundary condition control signal is transmitted to a PID controller via digital signal transmission; in the second control structure, the actuator control signal that directly influences the heat exchanger boundary condition is transmitted to the pump (e.g., the working medium pump).
In the second step, in concrete implementation, if the agent's output action is a heat exchanger boundary condition control signal rather than an actuator control signal that directly influences the boundary condition, a controller is additionally needed to make the actuator track the agent's output action;
for example, a PID (proportional-integral-derivative) controller may be used to control the actuator.
In the present invention, when the heat exchanger is an evaporator, the flow control signal output by the agent (i.e., the output action) is a boundary condition that must be realized by changing the pump rotational speed; the signal is therefore used as a tracking reference for the pump, and the flow is tracked by adjusting the pump speed. The pump flow is the working medium flow at the evaporator inlet. The PID controller's function is to track a reference signal: the flow signal is sent to the pump's PID controller as the reference, and upon receiving it the PID controller outputs a control signal that adjusts the pump speed so that the pump flow follows the flow control signal.
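As an illustration of this tracking arrangement, the following is a minimal Python sketch of a discrete-time PID controller that adjusts the pump speed so the measured flow follows the flow reference produced by the agent; the gains and all numerical values are hypothetical placeholders, not parameters disclosed by the invention.

```python
class PID:
    """Discrete-time PID controller that tracks a reference signal."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, reference, measurement):
        error = reference - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Hypothetical usage: the agent outputs a flow reference; the PID
# adjusts the pump speed so the measured flow tracks that reference.
pid = PID(kp=50.0, ki=5.0, kd=0.0, dt=0.1)  # placeholder gains
pump_speed = 1500.0                          # rpm, placeholder
flow_reference = 0.12                        # kg/s, from the agent
flow_measured = 0.10                         # kg/s, from a sensor
pump_speed += pid.step(flow_reference, flow_measured)
```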
It should be noted that, in the second step, using the boundary condition control signal as the agent's action suits situations where many other processes (i.e., other heat exchange processes that change the fluid temperature) lie between the actuator and the controlled heat exchanger: the boundary condition control signal then acts directly, is unaffected by those processes, and those processes need not be modeled as part of the environment during training, which saves training cost. If the actuator control signal (i.e., the actuator control signal that directly influences the heat exchanger boundary condition) is used directly as the agent's action, the actuator's effect is influenced by those intermediate processes, which must then be included in the environment during training. If there is no other process between the actuator and the controlled heat exchanger (i.e., no other heat exchange process that changes the fluid temperature), directly using the actuator control signal as the agent's output action is appropriate.
In the second step, in concrete implementation, a pre-built high-precision dynamic simulation model of the heat exchanger, or the actual heat exchange process, serves as the interaction environment of the reinforcement learning agent.
It should be noted that the environment interacting with the agent is the heat exchange process, which may be either the actual heat exchange process or a high-precision dynamic simulation model of it. The agent's observations of the environment are the important state quantities of the heat exchange process, including the temperatures and pressures of the cold and hot fluids, their rates of change, the flow rates of the cold and hot fluids, and the control error of the controlled heat exchanger parameter. Here, for example, the pressure is the pressure of the cold and hot fluids in the evaporator.
In the present invention, the environment with which the agent interacts is the controlled process itself.
In the invention, the pre-built high-precision dynamic simulation model of the heat exchanger is a mathematical model, obtainable from existing mathematical mechanisms via conventional modeling approaches, and is used to simulate the actual heat exchange process.
For the present invention, the agent evaluates the merit of an action by the reward obtained, computed by a reward function whose defining property is that an action bringing the controlled parameter closer to the target value earns a larger reward. During training, each time the agent applies an action to the environment, the environment immediately produces the corresponding output and returns an instant reward evaluating that action. As the agent explores a large number of action outputs, the reinforcement learning training algorithm gradually steers the agent's actions toward larger rewards, i.e., toward higher control accuracy. After many rounds of training, an agent, or intelligent controller, that can accurately control the variation of the parameters (i.e., the important state parameters of the heat exchange process) is finally obtained.
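The interaction just described can be summarized by a short sketch of one training episode; `env` (the heat exchange process or its simulation model) and `agent` are hypothetical interfaces introduced only for illustration.

```python
def run_episode(env, agent, max_steps=1000):
    """One training episode: the agent acts, the environment returns an
    observation and an instant reward, and the agent learns from it."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(obs)                     # e.g. flow or pump-speed signal
        next_obs, reward, done = env.step(action)   # environment responds at once
        agent.learn(obs, action, reward, next_obs)  # push actions toward larger reward
        obs = next_obs
        total_reward += reward
        if done:                                    # e.g. superheat out of bounds
            break
    return total_reward
```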
In the second step, in concrete implementation, the important state parameters of the heat exchange process serving as the agent's observations specifically include the temperatures and pressures of the cold and hot fluids, their rates of change, the flow rates of the cold and hot fluids, and the control error of the controlled heat exchanger parameter (i.e., the controlled important state parameter of the heat exchange process).
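For concreteness, here is a sketch of how such an observation vector might be assembled; the sensor field names are illustrative assumptions, not identifiers prescribed by the invention.

```python
def build_observation(sensors, target):
    """Assemble the agent's observation from heat exchange measurements.

    `sensors` is assumed to provide temperatures and pressures of the cold
    and hot fluids, their rates of change, and the fluid flow rates."""
    controlled = sensors["cold_outlet_superheat"]     # controlled parameter
    return [
        sensors["cold_temperature"], sensors["hot_temperature"],
        sensors["cold_pressure"], sensors["hot_pressure"],
        sensors["cold_temperature_rate"], sensors["hot_temperature_rate"],
        sensors["cold_pressure_rate"], sensors["hot_pressure_rate"],
        sensors["cold_flow"], sensors["hot_flow"],
        controlled - target,                          # control error
    ]
```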
In the second step, in concrete implementation, the pre-constructed reward function is a function of the error (i.e., the control error of the controlled heat exchanger parameter) obtained by directly subtracting the actual value of the controlled heat exchanger parameter (i.e., the important state parameter of the heat exchange process) from the control reference value. The reward function is characterized in that the smaller the error, the larger its value.
It should be noted that the actual value of the controlled heat exchanger parameter (i.e., the important state parameter of the heat exchange process) is its measured value in the heat exchange process, obtainable by model calculation or by actual measurement;
the control reference value of the controlled heat exchanger parameter is a general term referring to the target value, which is specified manually.
It should be noted that the error between the actual value of the controlled heat exchanger parameter and the control reference value is obtained by directly subtracting the actual value from the control reference value.
It should be noted that the reward function has no universal formula, but it has the general property that the larger the error between the actual value and the control reference value of the controlled heat exchanger parameter, the smaller the reward.
In the second step, in terms of specific implementation, when the controlled heat exchanger parameter (i.e., the important state parameter in the heat exchange process) is the superheat degree of the working medium at the outlet of the evaporator, the pre-constructed reward function is as follows:
(Equation (1): the reward function; it is reproduced only as an image in the original document.)
In Equation (1), e denotes the error between the superheat degree and its target value, a denotes the action, sup denotes the superheat degree, and the subscript t denotes the time step.
In Equation (1), the first five terms evaluate reference-tracking performance: the smaller the tracking error, the larger the reward. The sixth term discourages overly frequent fluctuations of the pump speed. The seventh term stops training and returns a large penalty whenever the superheat falls below its lower limit or exceeds its upper limit, saving training time.
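Since Equation (1) is reproduced only as an image, the following Python sketch illustrates one possible reward with the seven-term structure just described; the grouping of the five tracking terms, the weights, and the superheat limits are all assumptions for illustration, not the patented formula.

```python
def reward(errors, d_speed, superheat, sup_min=5.0, sup_max=40.0):
    """Seven-term reward in the spirit of Equation (1).

    errors    : five tracking-error terms (assumed grouping; smaller is better)
    d_speed   : last change in pump speed (sixth term: discourage chatter)
    superheat : current superheat; leaving [sup_min, sup_max] ends the
                episode with a large penalty (seventh term)."""
    r = -sum(w * abs(e) for w, e in zip([1.0, 0.8, 0.6, 0.4, 0.2], errors))
    r -= 0.1 * abs(d_speed)            # penalize frequent pump-speed changes
    done = superheat < sup_min or superheat > sup_max
    if done:
        r -= 100.0                     # large penalty, training episode stops
    return r, done
```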
In the invention, the control method is trained by a reinforcement learning algorithm, and various reinforcement learning algorithms may be used, such as deep reinforcement learning algorithms.
In order that the technical solution of the present invention may be understood more clearly, it is described below through a specific example.
Example.
Control of the superheat degree at the cold-fluid outlet of the evaporator in a Rankine cycle is taken as an example. In this embodiment, the environment is a dynamic simulation model of the evaporator (a mathematical model, built in the conventional manner from existing mathematical mechanisms and used to simulate the actual heat exchange process).
The boundary conditions of the model are the inlet temperatures and flow rates of the cold fluid (i.e., working medium) and the hot fluid (i.e., heat source), together with the outlet flow rates. Inlet temperature and flow can generally be regulated actively, so these data can either be given directly or serve as control variables.
In this embodiment, the inlet temperatures of the cold and hot fluids and the flow rate of the hot fluid are assigned manually, while the cold-fluid inlet flow serves as the control variable and is controlled directly through the pump rotational speed; the environment model in this embodiment therefore includes a pump model (a mathematical model, built in the conventional manner from the pump performance curve and used to calculate the flow through the pump).
The working medium outlet of the evaporator is connected to an expander, so the change in working medium outlet flow can be calculated with an expansion valve model (a mathematical model, built in the conventional manner from the valve performance curve and used to calculate the flow through the valve). The hot fluid (i.e., heat source) is an open system whose pressure is approximately atmospheric, so its outlet flow can be calculated from its inlet flow.
It should be noted that the inlet and outlet flows of the heat source are given manually, while the cold-side inlet flow is calculated by the pump model and the outlet flow by the valve model.
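A minimal sketch of what such boundary-flow models might look like; the affinity-law pump relation and the orifice-style valve equation are common textbook forms used here as assumptions, not the specific performance curves of the embodiment.

```python
import math

def pump_flow(speed_rpm, speed_ref=1450.0, flow_ref=0.12):
    """Pump flow from rotational speed via the affinity law (flow ~ speed);
    the reference operating point is a hypothetical placeholder."""
    return flow_ref * speed_rpm / speed_ref

def valve_flow(p_up, p_down, rho, cv=1.2e-4):
    """Orifice-style valve flow: m_dot = Cv * sqrt(rho * dP);
    Cv is a hypothetical placeholder coefficient."""
    dp = max(p_up - p_down, 0.0)
    return cv * math.sqrt(rho * dp)
```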
In the invention, the model of the heat exchanger (e.g., the evaporator) can be established using the traditional finite volume method, completing the construction of the agent's interaction environment model.
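As an illustration of the finite volume approach, here is a heavily simplified single-phase, one-dimensional sketch with an explicit energy balance per cell; a real evaporator model would add two-phase property calculations and pressure dynamics, and every numerical value below is a placeholder.

```python
def step_finite_volume(T, T_hot, m_dot, T_inlet, dt,
                       cp=2000.0, mass_cell=0.05, UA_cell=150.0):
    """Advance the cold-side cell temperatures T by one explicit time step.

    Each cell exchanges heat with the hot stream (UA_cell * dT) and
    advects enthalpy from its upstream neighbour at flow rate m_dot.
    All property values are hypothetical placeholders."""
    T_new = list(T)
    for i, Ti in enumerate(T):
        q = UA_cell * (T_hot[i] - Ti)            # heat from the hot stream, W
        T_up = T_inlet if i == 0 else T[i - 1]   # upstream cell temperature
        adv = m_dot * cp * (T_up - Ti)           # enthalpy advection, W
        T_new[i] = Ti + dt * (q + adv) / (mass_cell * cp)
    return T_new

# Hypothetical usage: 10 cells, hot stream at 400 K, cold inlet at 300 K.
T = [300.0] * 10
for _ in range(1000):
    T = step_finite_volume(T, T_hot=[400.0] * 10, m_dot=0.1,
                           T_inlet=300.0, dt=0.01)
```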
As mentioned above, the cold-fluid (working medium) flow is the control variable, and in this embodiment it is changed by changing the rotational speed of the working medium pump. If the working medium flow control signal (i.e., the heat exchanger boundary condition control signal) is adopted as the agent's output action, the control structure of FIG. 2a is used; if the pump speed control signal (i.e., the actuator control signal that directly influences the heat exchanger boundary condition) is adopted directly, the control structure of FIG. 2b is used.
The agent's observations are the outlet pressure and temperature of the cold fluid (i.e., working medium) and their rates of change, together with the working medium flow rate and the error between the actual value and the target value of the controlled heat exchanger parameter (i.e., its control error).
Following the principle that the reward grows as the controlled heat exchanger parameter (i.e., the superheat degree of the working medium at the evaporator outlet) approaches its target value, the reward function shown in Equation (1) was constructed for this example.
(Equation (1), as given above; the reward function is reproduced only as an image in the original document.)
The symbols and the roles of the seven terms are as described for Equation (1) above.
In this embodiment, control of the superheat degree of the evaporator outlet working medium in the organic Rankine cycle is treated as a continuous control problem, and the DDPG (Deep Deterministic Policy Gradient) algorithm, a reinforcement learning algorithm, is adopted. DDPG is an actor-critic deep reinforcement learning algorithm: it directly optimizes the policy (the actor) while training a critic to evaluate action values. The actor selects an action, and the critic tells the actor how good that action is. Through continual iteration, the actor converges to an optimal action policy while the critic improves the accuracy of its value function approximation. Both the actor and the critic are represented by deep neural networks.
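A minimal PyTorch sketch of what the actor and critic networks might look like; the layer sizes and the tanh action squashing are assumptions, not hyperparameters disclosed by the invention.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps an observation s to a deterministic action mu(s|theta_mu)."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # action scaled to [-1, 1]
        )
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Maps (s, a) to the action value Q(s, a|omega_Q)."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```

Here `obs_dim` would match the length of the observation vector described earlier, and `act_dim` would be 1 for a single flow or pump-speed signal.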
The DDPG algorithm framework is shown in FIG. 4. The actor network μ(s|θ^μ) takes the observation s as input and outputs the action that maximizes the long-term return. The critic network Q(s, a|ω^Q) takes the observation s and the action a as input and outputs the action value. To improve the stability of the training optimization process, the DDPG agent creates two additional networks, μ′(s|θ^{μ′}) and Q′(s, a|ω^{Q′}), called the target actor and the target critic. θ^μ and ω^Q are the network parameters, and the training process consists of optimizing θ^μ and ω^Q; the target parameters θ^{μ′} and ω^{Q′} are updated from the latest values of θ^μ and ω^Q. The specific training process is as follows:
Initialize the critic networks Q(s, a|ω^Q) and Q′(s, a|ω^{Q′}) with the same random parameters ω_0, and the actor networks μ(s|θ^μ) and μ′(s|θ^{μ′}) with the same random parameters θ_0.
The following steps are repeated at each training step (a consolidated code sketch of one training iteration is given after step 4):
1. For the current observation s, namely the working medium outlet pressure and temperature, their rates of change, the error between the actual and target values of the controlled parameter, and the accumulated error, take the pump speed change action a = μ(s) + NM, where NM is a random noise model. The evaporator simulation model executes action a, returns the next observation s′, and the instant reward r is calculated by the reward function, Equation (1). The experience (s, a, r, s′) is stored in a replay memory buffer. Each training sample is denoted (s_i, a_i, r_i, s_{i+1}).
2. Randomly draw N samples from the replay memory buffer and update the critic parameters ω^Q by minimizing the loss function L in Equation (2), where γ is the discount factor used to compute the long-term reward.
L = (1/N) Σ_{i=1}^{N} (y_i - Q(s_i, a_i|ω^Q))²   Equation (2)
y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1}|θ^{μ′})|ω^{Q′})   Equation (3)
3. Update the actor network parameters θ^μ with the policy gradient of Equations (4) to (6) so as to maximize the expected discounted reward.
∇_{θ^μ}J ≈ (1/N) Σ_{i=1}^{N} G_{ai} G_{μi}   Equation (4)
G_{ai} = ∇_a Q(s_i, a|ω^Q)|_{a=μ(s_i)}   Equation (5)
G_{μi} = ∇_{θ^μ} μ(s_i|θ^μ)   Equation (6)
4. Update the target actor parameters θ^{μ′} and the target critic parameters ω^{Q′} according to Equations (7) and (8).
ω^{Q′} = τω^Q + (1 - τ)ω^{Q′}   Equation (7);
θ^{μ′} = τθ^μ + (1 - τ)θ^{μ′}   Equation (8);
where τ is the target smoothing factor.
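Combining steps 1 to 4, the following compact PyTorch sketch performs one training iteration in the spirit of Equations (2) to (8), reusing the Actor/Critic classes sketched earlier; the batch size, discount factor γ, and smoothing factor τ are illustrative placeholders, and the buffer is assumed to store experience tuples of tensors.

```python
import random
import torch
import torch.nn.functional as F

def ddpg_iteration(buffer, actor, critic, actor_t, critic_t,
                   opt_actor, opt_critic, N=64, gamma=0.99, tau=0.005):
    """One DDPG training iteration following Equations (2)-(8)."""
    # Step 2: sample N experiences (s, a, r, s') from the replay buffer.
    s, a, r, s2 = (torch.stack(x) for x in zip(*random.sample(buffer, N)))

    # Equations (2)-(3): critic update by minimizing the TD loss.
    with torch.no_grad():
        y = r + gamma * critic_t(s2, actor_t(s2)).squeeze(-1)
    loss_q = F.mse_loss(critic(s, a).squeeze(-1), y)
    opt_critic.zero_grad(); loss_q.backward(); opt_critic.step()

    # Equations (4)-(6): actor update along the deterministic policy gradient.
    loss_pi = -critic(s, actor(s)).mean()
    opt_actor.zero_grad(); loss_pi.backward(); opt_actor.step()

    # Equations (7)-(8): soft update of the target networks.
    with torch.no_grad():
        for p, p_t in zip(critic.parameters(), critic_t.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
        for p, p_t in zip(actor.parameters(), actor_t.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
```

Exploration noise (step 1) and the filling of the buffer are assumed to happen in the surrounding episode loop, as sketched earlier.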
This process is repeated for each training episode until the actor and critic networks converge, at which point training of the intelligent controller is complete. Using the control structure of FIG. 2a, the control effect of the trained agent under a fluctuating heat source is shown in FIG. 3. Comparing this method with conventional PID control in FIG. 3, the method of the invention clearly tracks the target superheat far better than conventional PID control, demonstrating excellent control performance.
In conclusion, compared with the prior art, the reinforcement-learning-based method for controlling important parameters of a heat exchange process provided by the invention is scientifically designed; it solves the problem of poor control accuracy of heat exchange process parameters under transient fluctuations, accurately and reliably controls important heat exchanger parameters during the heat exchange process, improves the control accuracy of heat exchange process parameters under severely and frequently fluctuating boundary conditions, and is of great practical significance.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the invention, and such improvements and refinements shall also fall within the protection scope of the invention.

Claims (6)

1. A heat exchange process important parameter control method based on reinforcement learning is characterized by comprising the following steps:
step one, using a reinforcement learning algorithm framework, taking the controller of the heat exchange process parameters as the reinforcement learning agent, the actions output by the agent being the control variables;
and step two, taking the heat exchanger boundary condition control signal, or an actuator control signal that directly influences the heat exchanger boundary condition, as the agent's output action; taking the important state parameters of the heat exchange process as the agent's observations; constructing a reward function such that the closer the controlled parameter is to the control target, the larger the instant reward; and, through continuous training with the reinforcement learning algorithm, making the agent's output actions converge toward the maximum of the reward function, finally obtaining an agent that can accurately control the variation of the important state parameters of the heat exchange process.
2. The reinforcement learning-based heat exchange process important parameter control method according to claim 1, wherein in the second step, when the heat exchanger is an evaporator: if the boundary condition control signal of the evaporator is adopted, the action output by the agent is the cold-fluid inlet flow of the evaporator; if an actuator control signal that directly influences the boundary condition is adopted, the action is the rotational speed signal of the pump.
3. The reinforcement learning-based heat exchange process important parameter control method according to claim 1, wherein in the second step the important state parameters of the heat exchange process serving as the agent's observations specifically include the temperatures and pressures of the cold and hot fluids, their rates of change, the flow rates of the cold and hot fluids, and the control error of the controlled important state parameter of the heat exchange process.
4. The reinforcement learning-based heat exchange process important parameter control method according to claim 1, wherein in the second step the control target is specifically the target value of the important state parameter of the heat exchange process;
when the heat exchanger is an evaporator, it is the target value of the superheat degree at the evaporator cold-fluid outlet.
5. The reinforcement learning-based heat exchange process important parameter control method according to claim 1, wherein in the second step the pre-constructed reward function is a function of the error obtained by directly subtracting the actual value of the controlled heat exchanger parameter from the control reference value;
the smaller the error, the larger the value of the reward function;
the controlled heat exchanger parameters are important state parameters of the heat exchange process.
6. The reinforcement learning-based heat exchange process important parameter control method according to any one of claims 1 to 5, wherein, when the controlled important state parameter of the heat exchange process is the superheat degree of the working medium at the evaporator outlet, the pre-constructed reward function is as follows:
(Equation (1): the reward function; it is reproduced only as an image in the original document.)
In Equation (1), e denotes the error between the superheat degree and its target value, a denotes the action, sup denotes the superheat degree, and the subscript t denotes the time step.
CN202210221479.6A 2022-03-07 2022-03-07 Heat exchange process important parameter control method based on reinforcement learning Pending CN114739229A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210221479.6A CN114739229A (en) 2022-03-07 2022-03-07 Heat exchange process important parameter control method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210221479.6A CN114739229A (en) 2022-03-07 2022-03-07 Heat exchange process important parameter control method based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN114739229A (en) 2022-07-12

Family

ID=82274971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210221479.6A Pending CN114739229A (en) 2022-03-07 2022-03-07 Heat exchange process important parameter control method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN114739229A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111006451A (en) * 2019-12-10 2020-04-14 江西艾维斯机械有限公司 Integrated air compressor and control method thereof
US20210180891A1 (en) * 2019-12-11 2021-06-17 Baltimore Aircoil Company, Inc. Heat Exchanger System with Machine-Learning Based Optimization
CN113255206A (en) * 2021-04-02 2021-08-13 河海大学 Hydrological prediction model parameter calibration method based on deep reinforcement learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111006451A (en) * 2019-12-10 2020-04-14 江西艾维斯机械有限公司 Integrated air compressor and control method thereof
US20210180891A1 (en) * 2019-12-11 2021-06-17 Baltimore Aircoil Company, Inc. Heat Exchanger System with Machine-Learning Based Optimization
CN113255206A (en) * 2021-04-02 2021-08-13 河海大学 Hydrological prediction model parameter calibration method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG XUAN et al., "Control of superheat of organic Rankine cycle under transient heat source based on deep reinforcement learning", Applied Energy, vol. 278, pp. 1-12 *

Similar Documents

Publication Publication Date Title
Wang et al. Adaptive neural network model based predictive control for air–fuel ratio of SI engines
US9926866B2 (en) System and method for exhaust gas recirculation flow correction using temperature measurements
Wang et al. Control of superheat of organic Rankine cycle under transient heat source based on deep reinforcement learning
CN101498534A (en) Multi-target intelligent control method for electronic expansion valve of refrigeration air conditioner heat pump system
Li et al. A novel cascade temperature control system for a high-speed heat-airflow wind tunnel
JP2017129120A (en) Discrete time rate-based model predictive control method for internal combustion engine air path control
Shi et al. Dual-mode fast DMC algorithm for the control of ORC based waste heat recovery system
CN109268159B (en) Control method of fuel-air ratio system of lean-burn gasoline engine
CN105240846A (en) Method for controlling combustion process of circulating fluidized bed boiler on basis of multivariable generalized predictive control optimization
CN113093526A (en) Overshoot-free PID controller parameter setting method based on reinforcement learning
Wang et al. Deep reinforcement learning-PID based supervisor control method for indirect-contact heat transfer processes in energy systems
Glenn et al. Observer design of critical states for air path flow regulation in a variable geometry turbocharger exhaust gas recirculation diesel engine
CN115857330A (en) Rankine cycle waste heat recovery system optimization control method based on deep reinforcement learning
Zhang et al. Modeling and output feedback control of automotive air conditioning system
Shih et al. Reinforcement-learning-based output-feedback control of nonstrict nonlinear discrete-time systems with application to engine emission control
CN114739229A (en) Heat exchange process important parameter control method based on reinforcement learning
CN114660928B (en) BP neural network and fuzzy adaptive coupled PID temperature regulation system and method
US20190211753A1 (en) Feedforward and feedback architecture for air path model predictive control of an internal combustion engine
Wang et al. Adaptive neural network model based predictive control of an internal combustion engine with a new optimization algorithm
Molana et al. Analysis and simulation of active surge control in centrifugal compressor based on multiple model controllers
CN105673094A (en) Turbine rotating speed control method based on active-disturbance-rejection control
CN113977571B (en) Flexible joint robot output torque control method
CN113219841B (en) Nonlinear control method for underwater multi-joint hydraulic mechanical arm based on adaptive robustness
CN116047897A (en) Gas turbine predictive control method based on parameter self-adaptive disturbance rejection controller
El Hadef et al. New physics-based turbocharger data-maps extrapolation algorithms: validation on a spark-ignited engine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20220712