CN113885328A - Nuclear power tracking control method based on integral reinforcement learning - Google Patents

Nuclear power tracking control method based on integral reinforcement learning Download PDF

Info

Publication number
CN113885328A
CN113885328A (application CN202111212559.7A)
Authority
CN
China
Prior art keywords
nuclear power
evaluation network
iteration
strategy
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111212559.7A
Other languages
Chinese (zh)
Inventor
仲伟峰
王蒙轩
赵晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202111212559.7A priority Critical patent/CN113885328A/en
Publication of CN113885328A publication Critical patent/CN113885328A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G05 — CONTROLLING; REGULATING
    • G05B — CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/04 — Adaptive control systems, electric, involving the use of models or simulators
    • G05B13/042 — Adaptive control systems, electric, involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Monitoring And Testing Of Nuclear Reactors (AREA)

Abstract

The invention discloses a nuclear power tracking control method based on integral reinforcement learning, which comprises the following steps: selecting an initial strategy, initializing the relevant parameters, and selecting an initial power point and a desired power point; starting the global iteration and, within it, the local iteration, in which an evaluation network is trained by a policy-iteration integral reinforcement learning algorithm and its weights are corrected, the evaluation network being used to approximate the tracking error performance index function, its weights being used to evaluate the performance of the current tracking error control system, and an optimal control strategy being selected through the execution process so as to minimize the total cost of one global iteration; judging whether the current local iteration is finished and, if not, returning to the local iteration, otherwise updating the iterative performance index function and the tracking control law to obtain the optimal tracking control strategy; and completing the global policy iteration to obtain the optimal tracking control strategy, tracking to the desired power point, and calculating the total cost. The invention can thus continuously learn and adjust the current strategy so as to track to the desired power point.

Description

Nuclear power tracking control method based on integral reinforcement learning
Technical Field
The embodiment of the invention relates to the technical field of power control of nuclear power units, in particular to a nuclear power tracking control method based on integral reinforcement learning.
Background
In recent years, the greenhouse effect and air pollution caused by coal-fired power generation have become increasingly serious, and coal reserves are declining year by year. Nuclear energy, as a clean energy source with the advantages of no pollution and low transportation cost, has attracted wide attention from many countries and has been applied and popularized in the power generation industry. The safety of nuclear power systems is likewise a constant concern in many fields, so the problem of power regulation has become a focus. A stable, safe and efficient power control method for nuclear power units is particularly important for the whole nuclear power industry.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
In view of the above, the present invention is proposed to provide a nuclear power tracking control method based on integral reinforcement learning, which at least partially solves the above problems.
In order to achieve the above object, according to one aspect of the present invention, the following technical solutions are provided:
a method of nuclear power tracking control based on reinforcement integral learning, the method comprising:
s1: selecting an initial strategy, initializing relevant parameters, and selecting an initial power point and an expected power point;
s2: performing global iteration, and updating an iterative tracking error performance index function according to an iterative control sequence to obtain an optimal tracking error performance index function;
s3: performing local iteration, training an evaluation network by using an integral reinforcement learning algorithm, correcting the weight of the evaluation network, and obtaining an optimal error control strategy by using the optimal tracking error performance index function;
s4: judging whether the current local iteration is finished, if not, returning to the local iteration step, otherwise, updating the iterative tracking error performance index function and the control law to obtain the optimal tracking error performance index function;
s5: and (4) completing the iteration of the global strategy to obtain an optimal tracking control strategy, tracking to an expected power point, and calculating the total cost.
Compared with the prior art, the technical scheme at least has the following beneficial effects:
the embodiment of the invention constructs the self-learning power tracking controller based on the self-adaptive dynamic programming algorithm through the neural network, can continuously learn, adjust and adapt to different nuclear power states through real-time operation, and can track the working points of different nuclear power units.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention, are incorporated in and constitute a part of this specification; they illustrate embodiments of the invention and together with the description serve to explain the invention without unduly limiting it. It is obvious that the drawings in the following description are only some embodiments, and that a person skilled in the art can derive other drawings from them without inventive effort. In the drawings:
FIG. 1 is a schematic illustration of a nuclear power system model shown in accordance with an exemplary embodiment;
fig. 2 is a flowchart illustrating a nuclear power generating unit power tracking control method based on integral reinforcement learning according to an exemplary embodiment.
Detailed Description
In order to more clearly illustrate the objects, technical solutions and advantages of the present invention, the present invention is further described in detail below with reference to the accompanying drawings in combination with specific examples.
Adaptive dynamic programming, proposed by Paul J. Werbos, has developed rapidly since the 1980s. It is mainly used to overcome the "curse of dimensionality" in dynamic programming, which it does by solving through repeated iterative optimization. In recent years, adaptive dynamic programming algorithms have shown great advantages in solving optimal control problems. An adaptive dynamic programming method generally uses an actor-critic structure, with neural networks approximating the tracking error performance index function and the control strategy; the analytic solution of the equation is approached gradually by an iterative method, finally converging to the optimal tracking error performance index function and the optimal tracking control strategy.
The adaptive dynamic programming method uses a function approximation structure (such as a neural network) to approximate the tracking error performance index function and the control strategy in the dynamic programming equation so as to satisfy the optimality principle, thereby obtaining the optimal error control and the optimal tracking error performance index function of the system. The adaptive dynamic programming structure mainly comprises a dynamic system, a control network and an evaluation network. The evaluation network approximates the optimal cost function and provides an evaluation that guides the execution network to generate optimal control. After the output of the execution network acts on the dynamic system, the rewards/penalties generated at different stages of the dynamic system influence the evaluation network, which in turn guides the execution network to update its control strategy, so that the total cost (i.e., the sum of rewards/penalties) reaches the optimal value.
The integral reinforcement learning adaptive dynamic programming method does not depend on a system model; the weights of the controller and evaluator neural networks are adjusted based on the system states generated in real time and the corresponding control actions. The method can be run online, and the controller and evaluator neural networks iteratively converge to the optimal control strategy and the optimal tracking error performance index function. It is particularly suitable for solving optimal control problems of linear or nonlinear continuous systems online.
FIG. 1 is a schematic diagram of a nuclear power system to which an embodiment of the present invention is applied, schematically illustrating the reaction heat-transfer model of the nuclear power system. The nuclear power system consists of one reactor and two cooling loops. Q merely represents heat transfer and has no further meaning for the nuclear power system model. The nuclear power system comprises five system states: the power percentage represents the generated power percentage of the system (the full-load generated power is 2500 MW); the delayed neutron concentration represents the relative concentration of delayed neutrons in the reaction vessel of the nuclear power system; the reactor core temperature (also denoted T_f) is the average temperature of the reactor core of the nuclear power system; the coolant output temperature represents the average temperature of the coolant inside the nuclear power system; and the reactivity represents the reactivity change of the nuclear power system caused by the up-and-down movement of the control rod. The system uses only the movement speed of the control rod as the control signal: when the control rod moves up or down at a certain speed, the reaction inside the system's reactor core changes accordingly. The faster the control rod moves upward, the more intense the reaction; moving the control rod downward has the opposite effect.
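For reference in the illustrative sketches that follow, the five states can be collected into a single vector. A minimal sketch in Python, in which the ordering and the numeric values are assumptions for illustration only:

    import numpy as np

    # Illustrative five-state vector of the nuclear power model:
    # [power percentage, delayed neutron concentration, core temperature T_f,
    #  coolant output temperature, control-rod reactivity]
    x = np.array([1.0, 1.0, 290.0, 285.0, 0.0])  # placeholder values, not plant data

    # The single control signal is the control-rod movement speed u(t).
    u = 0.0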
As shown in fig. 2, an embodiment of the present invention provides a power tracking control method for a nuclear power system based on integral reinforcement learning, which may include steps S1 to S5.
S1: the initialization parameters include: nuclear power system parameters, evaluation network parameters, global iteration duration, integral time constant, local iteration duration, convergence accuracy and target parameters; the nuclear power system parameters are nuclear power model system parameters, and the model comprises five system input and output states.
The nuclear power system model mainly comprises the neutron reaction equations inside the reactor core, two temperature feedback models of the reactor, and the reactivity equation of the control rod. In studies of reactor characteristics, control by means of the control rod is often used: the control rod has a very strong neutron-absorbing capacity, its movement rate is easy to control, it is convenient to operate, and it regulates reactivity with high accuracy. Its influence on reactivity can be expressed in two ways: a change in position and a change in velocity.
In addition, an initial power operating point and a desired power operating point must be selected, and an initial stabilizing control strategy determined. The following parameters are also initialized: the global training step count, the local iteration step count, the neural network structure (such as the numbers of input nodes, hidden nodes and output-layer nodes), and the neural network weights.
Illustratively, the structure of the evaluation network is set to 5-15-1, wherein 5 is the number of input nodes of the evaluation network, 15 is the number of hidden nodes, and 1 is the number of output nodes; the number of hidden nodes can be adjusted empirically to obtain the best approximation effect, and the convergence accuracy is defined as 1.0 × 10^-2.
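As an illustrative sketch only, such a 5-15-1 evaluation network can be written in Python/NumPy as follows; the tanh hidden activation and the initialization scale are assumptions, not prescribed by the method:

    import numpy as np

    class CriticNetwork:
        """Minimal 5-15-1 evaluation (critic) network mapping a 5-dimensional
        input to a scalar tracking error performance index estimate.

        The 5-15-1 layer sizes follow the text; the tanh hidden activation
        and the initialization scale are illustrative assumptions."""

        def __init__(self, n_in=5, n_hidden=15, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(0.0, 0.1, size=(n_hidden, n_in))  # input -> hidden
            self.W2 = rng.normal(0.0, 0.1, size=(1, n_hidden))     # hidden -> output

        def features(self, x):
            # Hidden-layer feature vector phi(x); the output is linear in it.
            return np.tanh(self.W1 @ np.asarray(x, dtype=float))

        def value(self, x):
            # Approximated tracking error performance index J_e for input x.
            return float(self.W2 @ self.features(x))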
In the execution stage, the embodiment of the invention uses a simplified finite-dimensional control variable, i.e., a finite, predetermined set of nuclear power operating points is used for tracking.
In practical applications, the initial operating point and the desired operating point can be set according to actual requirements; the power model and parameter settings of the nuclear power unit must likewise be physically meaningful.
S2: during global training, the iterative tracking error performance index function is updated according to the iterative control sequence so as to obtain the optimal tracking error performance index function;
specifically, according to the requirement of the integral reinforcement learning method of the controller, weight initialization training work needs to be performed on the evaluation network.
Training an evaluation network by using an integral reinforcement learning algorithm: evaluating the input values of the network includes: five states x (t) of nuclear power unit working point and five states x of nuclear power unit expected working pointd(t) nuclear power unit tracking error control strategy ue(t) the output value is a tracking error performance indicator function Je(t) of (d). Wherein, Je(t) the tracking error performance indicator function is referred to as the J function for short. Optimal tracking error control strategy ue(t) is approximated by a tracking error performance indicator function obtained from the evaluation network.
The weight initialization of the evaluation network is performed within the global iteration. Preferably, the weights can be re-initialized at the start of each global iteration; on the basis of preserving the stability and convergence speed of the evaluation network, this better ensures its convergence, so that the optimal power tracking control strategy of the nuclear power system can be found as soon as possible.
In the execution stage, the input data of the evaluation network are the difference x_e(t) between the five state outputs x(t) of the nuclear power unit and the desired power point x_d(t), together with the optimal tracking error control strategy u_e(t) obtained from the trained evaluation network. The output data of the evaluation network is the tracking error performance index function J_e(t).
According to the Bellman equation, the output J_e(t+T) of the evaluation network at the next moment and the utility function U(t) are used to calculate the output data J_e(t) at the current moment, the calculation formula being:

J_e(t) = ∫_t^{t+T} U(τ) dτ + J_e(t+T)

The global iterative error control law u_e^i is then used to update the global-iteration J_e function.
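A minimal sketch of this integral Bellman computation, assuming the utility U has been sampled on a uniform grid over [t, t+T] (the trapezoidal quadrature is an illustrative choice):

    import numpy as np

    def bellman_target(utility_samples, dt, j_next):
        """J_e(t) = integral of U(tau) over [t, t+T] plus J_e(t+T).

        utility_samples: U(tau) on a uniform grid covering [t, t+T]
        dt:              grid spacing, so T = dt * (len(utility_samples) - 1)
        j_next:          evaluation-network output J_e(t+T)
        """
        u = np.asarray(utility_samples, dtype=float)
        integral = dt * (0.5 * u[0] + u[1:-1].sum() + 0.5 * u[-1])  # trapezoid rule
        return integral + j_next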
The following example describes in detail the process of obtaining the optimal tracking error performance index function.
At time t, let x(t) denote the five input/output states of the nuclear power unit, x_d(t) the desired power point, x_e(t) the resulting tracking error of the system, and u_e(t) the tracking error control strategy; the error control system can then be defined as:

x_e(t+1) = f(x(t) − x_d(t), u_e(t), t)
wherein f can be derived from a nuclear power unit power model. The utility function is defined as follows:
U(t) = α[x_e(t)]² + β[u_e(t)]²
wherein α and β are constants; u_e(t) is the difference between the current control law of the nuclear power unit and the desired operating control law. The utility function U(t) represents, at time t, the sum of the utilities of the deviation between the current and desired operating points of the nuclear power unit and of the control rod control law.
We give a new form of utility function:

U(x_e(t), u_e(t)) = x_e^T(t) Q x_e(t) + u_e^T(t) R u_e(t)

wherein Q and R are positive definite matrices, and our global tracking error performance index function can be defined as:

J_e(x_e(t)) = ∫_t^∞ U(x_e(τ), u_e(τ)) dτ
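Written out as a sketch (with Q and R left as identity matrices by default, an illustrative assumption):

    import numpy as np

    def utility(x_e, u_e, Q=None, R=None):
        """Quadratic utility U = x_e^T Q x_e + u_e^T R u_e, Q and R positive definite."""
        x_e, u_e = np.atleast_1d(x_e), np.atleast_1d(u_e)
        Q = np.eye(x_e.size) if Q is None else Q
        R = np.eye(u_e.size) if R is None else R
        return float(x_e @ Q @ x_e + u_e @ R @ u_e)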
the Hamiltonian equation can be derived as follows:
Figure RE-GDA0003392010600000073
then we have one
Figure RE-GDA0003392010600000074
Such that the following equation is satisfied:
Figure RE-GDA0003392010600000075
the optimal tracking error control law can be expressed as:
Figure RE-GDA0003392010600000076
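The stationarity step behind this law can be written out explicitly; assuming, as above, the input-affine error dynamics ẋ_e = f(x_e) + g(x_e) u_e and a symmetric R:

    \frac{\partial H}{\partial u_e}
      = 2 R u_e + g^{T}(x_e)\,\frac{\partial J_e^{*}}{\partial x_e} = 0
    \quad\Longrightarrow\quad
    u_e^{*} = -\tfrac{1}{2} R^{-1} g^{T}(x_e)\,\frac{\partial J_e^{*}}{\partial x_e}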
Define the initial error control law u_e^0. For the iterative performance index function J_e^i and control law u_e^i, we have

J_e^i(x_e(t)) = ∫_t^{t+T} U(x_e(τ), u_e^i(τ)) dτ + J_e^i(x_e(t+T))

where i = 0, 1, 2, …; the error tracking control law can then be obtained from the following equation:

u_e^{i+1}(t) = −(1/2) R^{−1} g^T(x_e) ∂J_e^i/∂x_e
As i → ∞, J_e^i converges to the optimal value.
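Taken together, the global iteration can be sketched as a generic policy-iteration loop; the function names critic_eval and improve are hypothetical stand-ins for the evaluation-network training and the control-law update described above:

    import numpy as np

    def policy_iteration(critic_eval, improve, u0, max_iter=50, tol=1e-2):
        """Generic policy iteration: evaluate J_e^i for u_e^i, then improve.

        critic_eval(u): returns critic parameters approximating J_e^i for policy u
        improve(J):     returns the improved control law u_e^{i+1}
        tol:            convergence accuracy (1.0e-2 in the text)
        """
        u = u0                      # initial stabilizing error control law u_e^0
        J_prev = None
        for i in range(max_iter):
            J = critic_eval(u)      # policy evaluation via the integral Bellman equation
            u = improve(J)          # policy improvement step
            if J_prev is not None and np.linalg.norm(J - J_prev) < tol:
                break               # J_e^i has (approximately) converged to J_e*
            J_prev = J
        return u, J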
S3: performing local iteration, training an evaluation network by using an integral reinforcement learning algorithm, correcting the weight of the evaluation network, and obtaining an optimal error control strategy by using the optimal tracking error performance index function;
the goal of the local training iteration is to obtain the optimum
Figure RE-GDA00033920106000000810
Under the condition of given initial stable control strategy, let us make the control law ue 0. Let the integration duration T equal to 1, and select the local training iteration duration as 30 steps.
The tracking error performance index function update rule is as follows:

J_e^i(x_e(t)) = ∫_t^{t+T} U(x_e(τ), u_e^i(τ)) dτ + J_e^i(x_e(t+T))

The optimal error control law update rule is as follows:

u_e^{i+1}(t) = −(1/2) R^{−1} g^T(x_e) ∂J_e^i/∂x_e
As i → ∞, J_e^i converges to the optimal value J_e*.
Then, the weight of the evaluation network is updated to approximate the optimal tracking error performance index function.
Wherein, the update rule is as follows. Approximating the performance index by the evaluation network as J_e(x_e) ≈ W_CL^T φ(x_e), where φ(·) is the hidden-layer feature vector, each sampling instant t_k yields one equation of the integral Bellman relation with residual

e(t_k) = W_CL^T [φ(x_e(t_k+T)) − φ(x_e(t_k))] + ∫_{t_k}^{t_k+T} U(x_e(τ), u_e(τ)) dτ

Stacking N samples,

X = [φ(x_e(t_1+T)) − φ(x_e(t_1)), …, φ(x_e(t_N+T)) − φ(x_e(t_N))]^T

Y = [∫_{t_1}^{t_1+T} U dτ, …, ∫_{t_N}^{t_N+T} U dτ]^T

and minimizing the residuals in the least-squares sense gives

W_CL = −(X^T X)^{−1}(X^T Y)

wherein e(t_k) is the weight-vector deviation (Bellman residual) of the evaluation network, X is the matrix of inner-product (feature) differences of the network, Y is the vector of utility function values approximated by the network, and W_CL is the weight of the evaluation network.
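A sketch of this batch least-squares weight update, assuming the evaluation network output is linear in its hidden-layer features φ(x_e) as above (the data layout is an assumption):

    import numpy as np

    def least_squares_critic_update(phi_t, phi_tT, integrated_utility):
        """Batch least-squares critic update W_CL = -(X^T X)^{-1} (X^T Y).

        phi_t:              (N, H) features phi(x_e(t_k)) at each window start
        phi_tT:             (N, H) features phi(x_e(t_k + T)) at each window end
        integrated_utility: length-N integrals of U over each [t_k, t_k + T]
        """
        X = phi_tT - phi_t                      # feature differences, one row per sample
        Y = np.asarray(integrated_utility, dtype=float)
        # W minimizing ||X W + Y||^2, i.e. W = -(X^T X)^{-1} X^T Y;
        # lstsq is used instead of an explicit inverse for numerical robustness.
        W, *_ = np.linalg.lstsq(X, -Y, rcond=None)
        return W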
Since the error control strategy and the tracking error performance index function change with the weights of the controller and evaluator neural networks, adjusting those weights amounts to updating the error control strategy and the tracking error performance index function. In the execution stage, the finite set of control variables is substituted into the optimal tracking error performance index function J_e* approximated by the evaluation network. The optimal error control strategy is obtained approximately from the tracking error performance index function given by the evaluation network, selecting the control variable that minimizes it as the optimal tracking error control strategy:

u_e*(t) = arg min_{u_e} J_e(x_e(t), u_e)

the minimum being taken over the finite candidate set of control variables.
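A sketch of this execution-stage selection over the finite candidate set; the candidate control-rod speeds and the critic interface are illustrative assumptions:

    import numpy as np

    def select_optimal_control(critic_value, x_e, candidate_controls):
        """Pick the candidate control minimizing the approximated J_e.

        critic_value(x_e, u_e): evaluation-network estimate of J_e
        candidate_controls:     finite set of control-rod speed candidates
        """
        costs = [critic_value(x_e, u) for u in candidate_controls]
        return candidate_controls[int(np.argmin(costs))]

    # Example: symmetric control-rod speed candidates (illustrative values only)
    candidates = np.linspace(-0.05, 0.05, 11)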
the evaluation network is used for approximating an optimal tracking error performance index function, evaluating the performance of the nuclear power control rod system by using the evaluation network weight, and selecting an optimal tracking control strategy through an execution flow to minimize the total tracking error cost of global training.
S4: judging whether the current local iteration is finished, if not, returning to the local iteration step, otherwise, updating the iterative tracking error performance index function and the error control law to obtain the optimal tracking error performance index function;
specifically, after local iteration is completed, whether the current iteration number reaches an iteration threshold value is determined, and if yes, an iterative tracking error performance index function and an error control law are updated to obtain an optimal tracking error performance index function and an optimal error control strategy.
If not, go to step S3; otherwise, step S5 is executed.
S5: and (4) finishing the iteration of the global strategy to obtain an optimal tracking error control strategy, tracking to a desired power point, and calculating the total cost (tracking error and control rod control cost).
Calculating the total cost requires substituting the optimal tracking error control strategy u_e* into the actual model; since the utility function U(x_e, u_e) depends on the actual model, the total cost can be approximated by the resulting optimal tracking error performance index function J_e*.
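As a final illustrative sketch, the total cost can also be accumulated along the simulated closed-loop trajectory with the quadratic utility from above (α, β and the sampling interval are illustrative):

    import numpy as np

    def total_cost(x_e_traj, u_e_traj, dt, alpha=1.0, beta=1.0):
        """Accumulate U(t) = alpha*|x_e|^2 + beta*|u_e|^2 along the trajectory.

        x_e_traj: (N, 5) sampled tracking errors
        u_e_traj: (N, 1) sampled error controls
        dt:       sampling interval; alpha, beta are illustrative weights
        """
        U = alpha * np.sum(np.square(x_e_traj), axis=1) \
            + beta * np.sum(np.square(u_e_traj), axis=1)
        return float(np.sum(U) * dt)   # rectangle-rule approximation of the integral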
Although the steps in this embodiment are described in the foregoing sequence, those skilled in the art will understand that, in order to achieve the effect of this embodiment, the different steps need not be executed in such a sequence, and may be executed simultaneously (in parallel) or in an inverted sequence, and these simple changes are all within the protection scope of the present invention. The technical solutions provided by the embodiments of the present invention are described in detail above. Although specific examples have been employed herein to illustrate the principles and practice of the invention, the foregoing descriptions of embodiments are merely provided to assist in understanding the principles of embodiments of the invention; also, it will be apparent to those skilled in the art that variations may be made in the embodiments and applications of the invention without departing from the spirit and scope of the invention.
It should be noted that the flowcharts mentioned herein are not limited to the forms shown herein, and may be divided and/or combined.
It should be noted that: the numerals and text in the figures are only used to illustrate the invention more clearly and are not to be considered as an undue limitation of the scope of the invention.
The present invention is not limited to the above-described embodiments, and any variations, modifications, or alterations that may occur to one skilled in the art without departing from the spirit of the invention fall within the scope of the invention.

Claims (8)

1. A nuclear power system power tracking control method based on integral reinforcement learning is characterized by comprising the following steps:
s1: selecting an initial strategy, initializing relevant parameters, and selecting an initial power point and an expected power point;
s2: performing global iteration, and updating an iterative tracking error performance index function according to an iterative control sequence to obtain an optimal tracking error performance index function;
s3: performing local iteration, training an evaluation network by using an integral reinforcement learning algorithm, correcting the weight of the evaluation network, and obtaining an optimal tracking control strategy by using the optimal tracking performance index function;
s4: judging whether the current local iteration is finished, if not, returning to the local iteration step, otherwise, updating the iterative tracking error performance index function and the tracking control law to obtain the optimal tracking error performance index function;
s5: and (4) completing the iteration of the global strategy to obtain an optimal tracking control strategy, tracking to an expected power point, and calculating the total cost.
2. The method according to claim 1, wherein in the step S1, the initialization parameters comprise: nuclear power system parameters, evaluation network parameters, global iteration duration, integral time constant, local iteration duration, convergence accuracy and target parameters; the nuclear power system parameters are nuclear power model system parameters, and the model comprises five system input and output states.
3. The method of claim 2, characterized in that the structure of the evaluation network is set to 5-15-1 and the convergence accuracy is defined as 1.0 × 10^-2, wherein 5 is the number of input nodes of the evaluation network, 15 is the number of hidden nodes of the evaluation network, and 1 is the number of output nodes of the evaluation network.
4. The method of claim 1, wherein the step S1 further comprises selecting an initial control strategy, wherein the error control strategy is obtained from a conventional PID or MPC strategy, so as to obtain an initial stabilizing control law.
5. The method of claim 1, wherein in step S3, the input data of the evaluation network include the 5 operating states x(t) of the nuclear power unit, the tracking error value x_e(t) relative to the desired power operating point x_d(t), and the tracking control strategy u_e(t) of the nuclear power control rods; the output data of the evaluation network comprise the tracking error performance index function J_e(t);
according to the Bellman equation, the output J_e(t+T) of the evaluation network at the next integration moment and the utility function U(t) are used to calculate the output data J_e(t) at the current moment by the following formula:

J_e(t) = ∫_t^{t+T} U(τ) dτ + J_e(t+T)

wherein x_e(t) is the tracking error value between the 5 operating states x(t) of the nuclear power unit and the desired power operating point x_d(t); the utility function U(t) represents the sum, at time t, of the utilities of the tracking error value x_e(t) and of the tracking control strategy u_e(t) of the nuclear power control rods.
6. The method of claim 5, wherein the utility function U (t) is calculated by:
U(t) = α[x_e(t)]² + β[u_e(t)]²

wherein α and β are constants, and u_e(t) is the difference between the current control law of the nuclear power unit and the desired operating control law.
7. The method of claim 1, wherein in the step S3, the input data of the execution phase of the evaluation network includes relative power coefficient of the nuclear power plant to be controlled, relative concentration of delayed neutrons, average temperature of the reactor core, average temperature of coolant, and reactivity of control rods; the output data of the execution stage of the evaluation network comprises an optimal tracking control strategy; and the optimal tracking control strategy is obtained approximately according to a tracking error performance index function obtained by the evaluation network.
8. The method according to claim 1, wherein in the step S3, the update rule of the evaluation network is as follows: approximating the performance index as J_e(x_e) ≈ W_CL^T φ(x_e), where φ(·) is the feature vector of the evaluation network, each sampling instant t_k yields one integral Bellman equation with residual

e(t_k) = W_CL^T [φ(x_e(t_k+T)) − φ(x_e(t_k))] + ∫_{t_k}^{t_k+T} U(x_e(τ), u_e(τ)) dτ

stacking N samples into

X = [φ(x_e(t_1+T)) − φ(x_e(t_1)), …, φ(x_e(t_N+T)) − φ(x_e(t_N))]^T

Y = [∫_{t_1}^{t_1+T} U dτ, …, ∫_{t_N}^{t_N+T} U dτ]^T

and minimizing the residuals in the least-squares sense gives

W_CL = −(X^T X)^{−1}(X^T Y)

wherein e(t_k) is the weight-vector deviation of the evaluation network, X is the matrix of inner-product (feature) differences of the network, Y is the vector of utility function values approximated by the network, and W_CL is the weight of the evaluation network.
CN202111212559.7A 2021-10-18 2021-10-18 Nuclear power tracking control method based on integral reinforcement learning Pending CN113885328A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111212559.7A CN113885328A (en) 2021-10-18 2021-10-18 Nuclear power tracking control method based on integral reinforcement learning

Publications (1)

Publication Number Publication Date
CN113885328A 2022-01-04

Family

ID=79003527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111212559.7A Pending CN113885328A (en) 2021-10-18 2021-10-18 Nuclear power tracking control method based on integral reinforcement learning

Country Status (1)

Country Link
CN (1) CN113885328A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103217899A (en) * 2013-01-30 2013-07-24 中国科学院自动化研究所 Q-function self-adaptation dynamic planning method based on data
CN104022503A (en) * 2014-06-18 2014-09-03 中国科学院自动化研究所 Electric-energy optimal control method for intelligent micro-grid with energy storage device
CN105843037A (en) * 2016-04-11 2016-08-10 中国科学院自动化研究所 Q-learning based control method for temperatures of smart buildings
US20190384237A1 (en) * 2018-06-13 2019-12-19 Mitsubishi Electric Research Laboratories, Inc. System and Method for Data-Driven Output Feedback Control
CN111650830A (en) * 2020-05-20 2020-09-11 天津大学 Four-rotor aircraft robust tracking control method based on iterative learning
CN111679577A (en) * 2020-05-27 2020-09-18 北京交通大学 Speed tracking control method and automatic driving control system of high-speed train

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880942A (en) * 2022-05-23 2022-08-09 西安交通大学 Nuclear reactor power and axial power distribution reinforcement learning decoupling control method
CN114880942B (en) * 2022-05-23 2024-03-12 西安交通大学 Nuclear reactor power and axial power distribution reinforcement learning decoupling control method
CN117075588A (en) * 2023-10-18 2023-11-17 北京网藤科技有限公司 Safety prediction fitting method and system for industrial automation control behaviors
CN117075588B (en) * 2023-10-18 2024-01-23 北京网藤科技有限公司 Safety prediction fitting method and system for industrial automation control behaviors

Similar Documents

Publication Publication Date Title
CN113885328A (en) Nuclear power tracking control method based on integral reinforcement learning
CN109901403A (en) A kind of face autonomous underwater robot neural network S control method
CN104991444B (en) Non-linearity PID self-adaptation control method based on Nonlinear Tracking Differentiator
Taeib et al. Tuning optimal PID controller
CN111324167B (en) Photovoltaic power generation maximum power point tracking control method
Gouadria et al. Comparison between self-tuning fuzzy PID and classic PID controllers for greenhouse system
Chidrawar et al. Generalized predictive control and neural generalized predictive control
CN113868961A (en) Power tracking control method based on adaptive value iteration nuclear power system
CN114722693A (en) Optimization method of two-type fuzzy control parameter of water turbine regulating system
Kostadinov et al. Online weight-adaptive nonlinear model predictive control
Ramírez et al. Min-max predictive control of a heat exchanger using a neural network solver
CN116755409B (en) Coal-fired power generation system coordination control method based on value distribution DDPG algorithm
CN115327890B (en) Method for optimizing main steam pressure of PID control thermal power depth peak shaving unit by improved crowd searching algorithm
Yu et al. A Knowledge-based reinforcement learning control approach using deep Q network for cooling tower in HVAC systems
CN116880191A (en) Intelligent control method of process industrial production system based on time sequence prediction
Feng et al. Nonlinear model predictive control for pumped storage plants based on online sequential extreme learning machine with forgetting factor
Berger et al. Neurodynamic programming approach for the PID controller adaptation
Wakitani et al. Design and application of a data-driven PID controller
Hajipour et al. Optimized neuro observer-based sliding mode control for a nonlinear system using fuzzy static sliding surface
Yao et al. An approach to solving optimal control problems of nonlinear systems by introducing detail-reward mechanism in deep reinforcement learning
El Aoud et al. Intelligent control for a greenhouse climate
CN112615364A (en) Novel wide-area intelligent cooperative control method for power grid stability control device
Liu On a method of single neural PID feedback compensation control
CN117970782B (en) Fuzzy PID control method based on fish scale evolution GSOM improvement
CN111663032B (en) Active disturbance rejection temperature control method for amorphous iron core annealing furnace

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination