CN113807029B - Deep reinforcement learning-based double-time-scale new energy power grid voltage optimization method - Google Patents

Deep reinforcement learning-based double-time-scale new energy power grid voltage optimization method

Info

Publication number
CN113807029B
Authority
CN
China
Prior art keywords
time
network
node
scale
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111217697.4A
Other languages
Chinese (zh)
Other versions
CN113807029A (en)
Inventor
李鹏
姜磊
王加浩
夏辉
高一航
李建宜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Electric Power University
Original Assignee
North China Electric Power University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Electric Power University filed Critical North China Electric Power University
Priority to CN202111217697.4A priority Critical patent/CN113807029B/en
Publication of CN113807029A publication Critical patent/CN113807029A/en
Application granted granted Critical
Publication of CN113807029B publication Critical patent/CN113807029B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 - Computer-aided design [CAD]
    • G06F 30/20 - Design optimisation, verification or simulation
    • G06F 30/27 - Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • H - ELECTRICITY
    • H02 - GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J - CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J 3/00 - Circuit arrangements for ac mains or ac distribution networks
    • H02J 3/18 - Arrangements for adjusting, eliminating or compensating reactive power in networks
    • H - ELECTRICITY
    • H02 - GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J - CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J 3/00 - Circuit arrangements for ac mains or ac distribution networks
    • H02J 3/24 - Arrangements for preventing or reducing oscillations of power in networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2111/00 - Details relating to CAD techniques
    • G06F 2111/04 - Constraint-based CAD
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2113/00 - Details relating to the application field
    • G06F 2113/04 - Power grid distribution networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02E - REDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E 40/00 - Technologies for an efficient electrical power generation, transmission or distribution
    • Y02E 40/30 - Reactive power compensation

Landscapes

  • Engineering & Computer Science (AREA)
  • Power Engineering (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

A double-time-scale power grid voltage optimization method based on deep reinforcement learning comprises the following steps: dividing the day into the long-time-scale intervals and short-time-scale intervals of the double-time-scale method; performing long-time-scale power grid voltage optimization based on the DQN algorithm to obtain a long-time-scale parallel capacitor bank switching plan; and performing short-time-scale reactive voltage optimization based on the DDPG algorithm to obtain a short-time-scale output plan for the continuous reactive power compensation devices. The invention makes the advantages of the various reactive compensation devices complement one another, provides stronger reactive voltage optimization capability, can schedule the capacitor switching plan as a whole across all optimization time points within a day, and effectively achieves fast optimization.

Description

Deep reinforcement learning-based double-time-scale new energy power grid voltage optimization method
Technical Field
The invention relates to double-time-scale power grid voltage optimization methods, and in particular to a deep reinforcement learning-based double-time-scale new energy power grid voltage optimization method.
Background
With the construction of a new type of power system in which new energy is the main body, the penetration of various renewable energy sources keeps rising, and the randomness and dynamics of load demand response keep growing, which poses great challenges to the operation and control of a modern power grid.
Reactive voltage optimization of the power grid can, to a certain extent, effectively and economically mitigate the large-range voltage fluctuations of the power system caused by disturbances. The reactive voltage optimization problem of the power grid can be regarded as a complex nonlinear optimization problem with numerous objectives, variables and constraints.
At present, the methods for handling dynamic reactive voltage optimization mainly include traditional operations-research optimization methods, heuristic search methods and the like. However, these methods often suffer from slow convergence, heavy computation and a tendency to fall into local optima. Furthermore, most existing methods are model-based and depend heavily on model accuracy, which is impractical for power systems with large-scale new energy integration. To reduce the influence of model accuracy on control performance, artificial intelligence algorithms have been applied to the field of reactive voltage optimization, so that the power system can respond in a timely and accurate manner under various conditions, which undoubtedly provides a new idea for power system operation and control.
Disclosure of Invention
The invention aims to solve the technical problem of providing a deep reinforcement learning-based double-time-scale new energy power grid voltage optimization method capable of simultaneously considering discrete and continuous reactive compensation devices.
The technical scheme adopted by the invention is as follows: the double-time-scale new energy power grid voltage optimization method based on deep reinforcement learning comprises the following steps:
1) Dividing the long-time-scale intervals and the short-time-scale intervals of the double-time-scale method, respectively:
Divide the day into K_l long time intervals, the initial times of the long time intervals within the day being τ = 0, …, K_l - 1, and subdivide each long time interval into K_s short time intervals, the initial times of the short time intervals within a long time interval being t = 0, …, K_s - 1;
2) Carrying out long-time scale power grid voltage optimization based on a DQN algorithm, comprising the following steps: establishing a long-time scale power grid voltage optimization model, integrating multiple targets by using a membership function, designing a reward function aiming at the long-time scale power grid voltage optimization model, and solving the long-time scale power grid voltage optimization model by using a DQN algorithm to obtain a long-time scale parallel capacitor bank switching plan;
3) short-time scale reactive voltage optimization is carried out based on a DDPG algorithm, and the method comprises the following steps: establishing a short-time-scale power grid voltage optimization model, designing a reward function aiming at the short-time-scale power grid voltage optimization model, and solving the short-time-scale power grid voltage optimization model by using a DDPG algorithm to obtain a short-time-scale continuous reactive power compensation device output plan.
The double-time-scale new energy power grid voltage optimization method based on deep reinforcement learning has the following advantages:
1. Through the cooperation of the two agents on the long and short time scales, the invention makes the advantages of the various reactive compensation devices complement one another and provides stronger reactive voltage optimization capability.
2. On the long time scale, the design aims at suppressing the large-range voltage fluctuations caused by conventional load demand changes and at minimizing the network loss of the whole system; with the DQN algorithm as the optimization kernel, the capacitor switching plan can be scheduled as a whole across all optimization time points within a day.
3. On the short time scale, the design aims at the fast and frequent grid voltage fluctuations caused by large-scale integration of new energy; with the DDPG algorithm as the optimization kernel, fast optimization is effectively achieved.
4. The proposed power grid voltage optimization method can effectively handle frequent grid voltage fluctuations under high-proportion new energy access and is of practical significance for engineering application.
Drawings
FIG. 1 is a flow chart of a double-time scale new energy power grid voltage optimization method based on deep reinforcement learning according to the invention;
FIG. 2 is a schematic diagram of an improved IEEE39 node test system in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of a typical long-time-scale capacitor bank switching plan within a day in an example of the present invention;
FIG. 4 is a graph of long time scale daily average network loss in an example of the invention;
FIG. 5 is a schematic diagram of reactive power output of a typical short-time-scale continuous reactive power compensation device in an example of the present invention;
FIG. 6a is a schematic diagram of the voltage optimization effect at node 6 within a typical day in an example of the invention;
FIG. 6b is a schematic diagram of the voltage optimization effect at node 23 within a typical day in an example of the invention;
FIG. 6c is a schematic diagram of the voltage optimization effect at node 26 within a typical day in an example of the invention.
Detailed Description
The invention provides a deep reinforcement learning-based dual-time scale new energy grid voltage optimization method, which is described in detail below with reference to embodiments and drawings.
As shown in fig. 1, the method for optimizing the voltage of the dual-time-scale new energy power grid based on deep reinforcement learning of the present invention includes the following steps:
1) Dividing the long-time-scale intervals and the short-time-scale intervals of the double-time-scale method, respectively:
Divide the day into K_l long time intervals, the initial times of the long time intervals within the day being τ = 0, …, K_l - 1, and subdivide each long time interval into K_s short time intervals, the initial times of the short time intervals within a long time interval being t = 0, …, K_s - 1;
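For illustration only (this sketch is not part of the claimed method), the time-scale division can be expressed in a few lines of Python; the values K_l = 24 (1 h long intervals) and K_s = 12 (5 min short intervals) are taken from the embodiment described later.

    # Double-time-scale division of one day (illustrative values from the embodiment:
    # long-time-scale decision period 1 h, short-time-scale decision period 5 min).
    K_L = 24   # number of long time intervals per day  -> tau = 0, ..., K_L - 1
    K_S = 12   # number of short intervals per long interval -> t = 0, ..., K_S - 1

    def interval_start(tau: int, t: int) -> float:
        """Return the start of short interval t inside long interval tau, in hours."""
        long_len = 24.0 / K_L            # 1 h
        short_len = long_len / K_S       # 5 min
        return tau * long_len + t * short_len

    # Example: the short interval (tau=12, t=6) starts at 12:30.
    print(interval_start(12, 6))  # 12.5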
2) Carrying out long-time scale power grid voltage optimization based on a DQN algorithm; the method comprises the following steps:
(2.1) establishing a long-time scale power grid voltage optimization model:
The objective function F_l(T) of the long-time-scale power grid voltage optimization model is:
min F_l(T) = [f_1(T), f_2(T)]
[Equation images in the original: f_1(T) accumulates the deviation of the hub-node voltage amplitudes v_p from the reference value v_ref over the K_l long time intervals; f_2(T) accumulates the network loss expressed through the branch active powers p_ij.]
where T is the switching-state vector of all parallel capacitor banks, and f_1(T) and f_2(T) are respectively the first and second sub-objectives of the objective function F_l(T); T(τ) denotes the switching-state vector of all parallel capacitor banks at time τ, the switching state of each parallel capacitor bank being expressed by its switching gear; N is the number of nodes in the power grid; v_p denotes the voltage amplitude of a hub node; v_ref is the voltage reference value of the hub node; p_ij denotes the active power flowing from node i to node j; K_l denotes the number of long time intervals in a day;
Considering the grid operation power flow constraints and the voltage constraint:
p_i = v_i Σ_{j=1..N} v_j (G_ij cos ω_ij + B_ij sin ω_ij)
q_i = v_i Σ_{j=1..N} v_j (G_ij sin ω_ij - B_ij cos ω_ij)
v_i^min ≤ v_i ≤ v_i^max
where p_i denotes the active power injected into node i; q_i denotes the reactive power injected into node i; v_i denotes the voltage amplitude of node i; v_i^max and v_i^min denote respectively the upper and lower limits of the voltage amplitude of node i; G_ij denotes the conductance between node i and node j; B_ij denotes the susceptance between node i and node j; ω_ij denotes the voltage phase-angle difference between node i and node j;
The switching of the m-th parallel capacitor bank is limited by the upper and lower limits T_m^max and T_m^min of its switching gear T_m:
T_m^min ≤ T_m ≤ T_m^max
the upper limit of the switching times of the parallel capacitor bank in one day is set as
Figure GDA0003642153320000039
Figure GDA00036421533200000310
Wherein, C m And the switching times of the mth parallel capacitor bank in one day are shown.
(2.2) Integrating the multiple objectives by means of a membership function:
[Equation image in the original: the membership function μ(f_β).]
where f_β* denotes the attainable optimal value of the β-th sub-objective in its own dimension, β = 1, 2; δ_β is the tolerance of the sub-objective value f_β and is used to define the boundary that the objective function can reach. For any sub-objective, when the corresponding objective value lies within the tolerance range, the membership function μ(f_β) maps the value of f_β into [0, 1]; when the value of f_β lies outside the tolerance, the value of the membership function is set to 0; and when a new optimal value of the sub-objective is found, the value of the membership function is set to 1;
The new objective function after the membership-function mapping is:
min[-μ(F_l)], with μ(F_l) = k_1·μ(f_1) + k_2·μ(f_2)
where μ(F_l) is the membership function corresponding to the objective function F_l; f_1 and f_2 are respectively the first and second sub-objectives of the objective function; k_1 and k_2 are the weight coefficients of the two sub-objectives.
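As an illustration only, a minimal Python sketch of such a membership mapping and of the weighted combination is given below. The piecewise-linear form inside the tolerance band is an assumption made for the sketch; the patent only specifies the boundary behaviour (0 outside the tolerance, 1 at a new optimum, values in [0, 1] inside the tolerance).

    # Illustrative sketch (assumption: linear interpolation inside the tolerance band).
    def membership(f_value: float, f_best: float, delta: float) -> float:
        """Map a sub-objective value to [0, 1].

        f_value: current value of the sub-objective f_beta (to be minimized)
        f_best:  best (smallest) value found so far for this sub-objective
        delta:   tolerance delta_beta of the sub-objective
        """
        if f_value <= f_best:              # a new optimal value has been found
            return 1.0
        if f_value > f_best + delta:       # outside the tolerance band
            return 0.0
        # inside the tolerance band: assumed linear decrease from 1 to 0
        return (f_best + delta - f_value) / delta

    def combined_objective(f1, f2, f1_best, f2_best, delta1, delta2, k1=0.5, k2=0.5):
        """Weighted membership value mu(F_l) = k1*mu(f1) + k2*mu(f2)."""
        return k1 * membership(f1, f1_best, delta1) + k2 * membership(f2, f2_best, delta2)

    # Example: voltage-deviation and network-loss sub-objectives with equal weights.
    print(combined_objective(f1=0.8, f2=12.0, f1_best=0.5, f2_best=10.0,
                             delta1=1.0, delta2=5.0))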
(2.3) Designing the reward function r_l(τ) for the long-time-scale power grid voltage optimization model:
[Equation image in the original: r_l(τ) combines the membership value μ(F_l) with penalty terms, weighted by σ_l, for node-voltage limit violations and for exceeding the daily capacitor switching limit.]
where μ(F_l) denotes the membership function corresponding to the objective function F_l; σ_l is the penalty factor for long-time-scale voltage limit violations and capacitor switching-count violations; v_i(τ) denotes the voltage amplitude of node i at time τ; C_m(τ) denotes the number of times the m-th parallel capacitor bank has been switched within the day at time τ.
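For illustration only, the sketch below shows one way such a penalty-augmented reward could be computed; the functional form, the voltage limits and the penalty weighting are assumptions, since the patent gives the formula only as an image (the daily switching limit of 6 is taken from the embodiment).

    # Illustrative sketch of a penalty-augmented long-time-scale reward (assumed form).
    def long_timescale_reward(mu_F_l, voltages, switch_counts,
                              v_min=0.95, v_max=1.05, c_max=6, sigma_l=1.0):
        """Membership value minus penalties for voltage and switching-count violations.

        mu_F_l:        weighted membership value of the long-time-scale objective
        voltages:      iterable of node voltage amplitudes v_i(tau), p.u.
        switch_counts: iterable of per-bank switching counts C_m(tau) so far today
        """
        voltage_violation = sum(max(0.0, v - v_max) + max(0.0, v_min - v)
                                for v in voltages)
        switching_violation = sum(max(0, c - c_max) for c in switch_counts)
        return mu_F_l - sigma_l * (voltage_violation + switching_violation)

    # Example: one node slightly above the upper limit, no switching violation.
    print(long_timescale_reward(0.8, [1.06, 1.00, 0.98], [3, 5]))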
(2.4) solving the long-time-scale power grid voltage optimization model by using a DQN algorithm to obtain a long-time-scale parallel capacitor bank switching plan; the method comprises the following steps:
(2.4.1) Calculating the DQN network loss function L(θ):
L(θ) = E{ [ r_l(τ) + γ·max_{a_l ∈ A} q_π(s(τ), a_l | θ_target) - q_π(s(τ-1), a_l(τ) | θ) ]² }
where r_l(τ) denotes the reward function of the long-time-scale power grid voltage optimization model; s(τ) and s(τ-1) denote respectively the states of the agent at times τ and τ-1 and consist of the information matrix set {v, p, q, T, C, Q}, in which v, p and q are respectively the voltage amplitude vector of all nodes and the vectors of active and reactive power injected at all nodes, and T, C and Q are respectively the switching-state vector of all parallel capacitor banks, the vector of switching counts of all parallel capacitor banks within the day, and the reactive power output vector of all continuous reactive power compensation devices; A is the action space of the agent; a_l denotes an action selectable by the agent; a_l(τ) = T(τ) is the action obtained by the agent executing its policy based on the state s(τ-1) at time τ-1; q_π is the output of the estimated-value network; the target-network parameters are copied from the estimated-value network at a fixed step interval, so that the target network lags the estimated-value network to a certain extent; θ and θ_target are respectively the estimated-value network parameters and the target network parameters; γ denotes the discount (attenuation) factor;
(2.4.2) Updating the estimated-value network parameters by stochastic gradient descent:
θ_{τ+1} = θ_τ - α·∇_θ L(θ_τ)
where θ_{τ+1} and θ_τ denote respectively the estimated-value network parameters at times τ+1 and τ; α denotes the learning rate of the estimated-value network during the update; ∇ denotes the gradient operator.
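For illustration only, a minimal PyTorch sketch of one DQN update step of the kind described above is given below; the network architecture, batch handling and hyperparameters are assumptions and are not taken from the patent. In the patent's convention the transition is s(τ-1) → a_l(τ) → s(τ), which corresponds to (s_prev, a, s_next) here.

    # Illustrative DQN update step (assumed architecture and hyperparameters).
    import torch
    import torch.nn as nn

    class QNet(nn.Module):
        def __init__(self, state_dim: int, n_actions: int):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_actions))
        def forward(self, s):
            return self.net(s)

    state_dim, n_actions, gamma = 32, 7, 0.95   # e.g. 7 gears of a capacitor bank
    q_net = QNet(state_dim, n_actions)          # estimated-value (online) network
    q_target = QNet(state_dim, n_actions)       # target network, lags the online one
    q_target.load_state_dict(q_net.state_dict())
    optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

    def dqn_update(s_prev, a, r, s_next):
        """One stochastic-gradient step on L(theta) for a batch of transitions."""
        with torch.no_grad():
            y = r + gamma * q_target(s_next).max(dim=1).values      # target value
        q_sa = q_net(s_prev).gather(1, a.unsqueeze(1)).squeeze(1)   # q(s(tau-1), a(tau))
        loss = nn.functional.mse_loss(q_sa, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Example with a random batch of 16 transitions.
    s_prev = torch.randn(16, state_dim); s_next = torch.randn(16, state_dim)
    a = torch.randint(0, n_actions, (16,)); r = torch.randn(16)
    print(dqn_update(s_prev, a, r, s_next))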
3) Optimizing the short-time-scale power grid voltage based on a DDPG algorithm; the method comprises the following steps:
(3.1) establishing a short-time-scale power grid voltage optimization model:
The objective function F_s of the short-time-scale power grid voltage optimization model is:
[Equation image in the original: F_s accumulates the deviation of the hub-node voltage amplitudes v_p(t) from the reference value v_ref over the K_s short time intervals of the current long time interval.]
where Q is the vector formed by the reactive power outputs of the continuous reactive power compensation devices; Q(t) denotes this vector at time t; v_p denotes the voltage amplitude of a hub node; v_ref is the voltage reference value of the hub node; K_s denotes the number of short time intervals within a long time interval;
Considering the grid operation power flow constraints and the voltage constraint, the constraint conditions are:
p_i = v_i Σ_{j=1..N} v_j (G_ij cos ω_ij + B_ij sin ω_ij)
q_i = v_i Σ_{j=1..N} v_j (G_ij sin ω_ij - B_ij cos ω_ij)
v_i^min ≤ v_i ≤ v_i^max
where p_i denotes the active power injected into node i; q_i denotes the reactive power injected into node i; v_i denotes the voltage amplitude of node i; v_i^max and v_i^min denote respectively the upper and lower limits of the voltage amplitude of node i; G_ij denotes the conductance between node i and node j; B_ij denotes the susceptance between node i and node j; ω_ij denotes the voltage phase-angle difference between node i and node j;
In order to cope with emergencies in the power system, a certain reserve must be kept during the adjustment of the continuous reactive power compensation devices; the constraint condition on the continuous reactive power compensation devices during adjustment is:
q_con,n^min ≤ q_con,n ≤ q_con,n^max
where q_con,n is the reactive power output value of the n-th continuous reactive power compensation device, and q_con,n^max and q_con,n^min are respectively its upper and lower reactive power output limits;
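As a small illustration (not part of the claimed method), the reserve constraint can be enforced by clipping the agent's commanded output to the narrowed band; the ±120 Mvar device range and the ±80 Mvar reserve-adjusted range are taken from the embodiment below, while the clipping itself is an assumed implementation detail.

    # Illustrative enforcement of the reactive-power reserve constraint (assumed detail).
    Q_DEVICE_MAX = 120.0   # Mvar, physical range of the continuous device (embodiment)
    Q_RESERVE_MAX = 80.0   # Mvar, range left after reserving capacity for emergencies

    def apply_reserve_constraint(q_command: float) -> float:
        """Clip the commanded reactive output q_con,n to the reserve-adjusted band."""
        return max(-Q_RESERVE_MAX, min(Q_RESERVE_MAX, q_command))

    print(apply_reserve_constraint(95.0))   # -> 80.0, 15 Mvar kept in reserve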
(3.2) Designing the reward function r_s(t) for the short-time-scale power grid voltage optimization model:
[Equation image in the original: r_s(t) combines the membership value μ(F_s) with a penalty term, weighted by σ_s, for node-voltage limit violations.]
where μ(F_s) denotes the membership function corresponding to the objective function F_s; σ_s is the penalty factor for short-time-scale voltage limit violations; v_i(t) denotes the voltage amplitude of node i at time t.
(3.3) Solving the short-time-scale power grid voltage optimization model by using the DDPG algorithm to obtain the short-time-scale output plan of the continuous reactive power compensation devices; the method comprises the following steps:
(3.3.1) Calculating the Critic network loss function L(θ_q):
L(θ_q) = E_t{ [ y(t) - q(s(t), a_s(t) | θ_q) ]² } ≈ (1/M) Σ_{m=1..M} [ y_m(t) - q(s_m(t), a_s^m(t) | θ_q) ]²
where E_t{·} denotes the expectation of the corresponding target value over all values of t; s(t) is the state of the agent at time t and consists of the information matrix set {v, p, q, T, C, Q}, in which v, p and q are respectively the voltage amplitude vector of all nodes and the vectors of active and reactive power injected at all nodes, and T, C and Q are respectively the switching-state vector of all parallel capacitor banks, the vector of switching counts of all parallel capacitor banks within the day, and the reactive power output vector of all continuous reactive power compensation devices; a_s(t) = Q(t) denotes the action taken by the agent at time t, obtained by executing the policy based on the state s(t-1) at time t-1; θ_q are the parameters of the estimated-value network of the Critic network; q is the output of the estimated-value network of the Critic network. Because the DDPG algorithm is a gradient method based on a deterministic policy, the probability distribution of the actions selected in different states cannot be determined, so the expectation is replaced by the average over M samples; the M samples are drawn at random without repetition from the replay memory; s_m(t) and a_s^m(t) are respectively the agent state and the agent action of the m-th sample; y_m(t) is the label (target value) of the m-th sample. The label y(t) is given by:
y(t) = r_s(t) + γ·q_target{ s(t+1), ψ_target[ s(t) | θ_ψ' ] | θ_q' }
where r_s(t) is the reward function of the short-time-scale power grid voltage optimization model; s(t) and s(t+1) are the states of the agent at times t and t+1, respectively; ψ_target is the output of the target network of the Actor network; θ_ψ' are the parameters of the target network of the Actor network; q_target is the output of the target network of the Critic network; θ_q' are the parameters of the target network of the Critic network; γ is the discount (attenuation) factor;
(3.3.2) The Actor network is evaluated by the Critic network and is updated by a gradient-update method; the gradient is computed as:
∇_{θ_ψ} J ≈ (1/M) Σ_{m=1..M} ∇_a q(s, a | θ_q)|_{s=s_m(t), a=ψ(s_m(t))} · ∇_{θ_ψ} ψ(s | θ_ψ)|_{s=s_m(t)}
where E{·} denotes the expectation of the corresponding target value over all values of t; q is the output of the estimated-value network of the Critic network; ψ is the output of the estimated-value network of the Actor network; s and a denote respectively the agent state and the action taken by the agent; θ_q are the parameters of the estimated-value network of the Critic network; θ_ψ are the parameters of the estimated-value network of the Actor network; ∇_a(·) denotes the gradient of the corresponding target value with respect to the action a; ∇_{θ_ψ}(·) denotes the gradient of the corresponding target value with respect to the parameters θ_ψ; M is the number of samples; s_m(t) and a_s^m(t) are respectively the agent state and the agent action of the m-th sample;
(3.3.3) Updating the parameters of the Actor network and of the Critic network, respectively:
θ_ψ^{t+1} = θ_ψ^t + α_1·∇_{θ_ψ} J
θ_q^{t+1} = θ_q^t - α_2·∇_{θ_q} L(θ_q)
where θ_ψ^{t+1} and θ_ψ^t denote respectively the parameters of the estimated-value network of the Actor network at times t+1 and t; θ_q^{t+1} and θ_q^t denote respectively the parameters of the estimated-value network of the Critic network at times t+1 and t; α_1 and α_2 denote respectively the learning rates of the estimated-value network of the Actor network and of the estimated-value network of the Critic network during the update; ∇ denotes the gradient operator.
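For illustration only, a compact PyTorch sketch of a DDPG critic/actor update of the kind described in steps (3.3.1) to (3.3.3) is given below; the network sizes, optimizers and hyperparameters are assumptions and are not taken from the patent. Note that the sketch follows the common DDPG convention of evaluating the target policy at the next state, whereas the patent's time-index convention pairs s(t+1) with ψ_target[s(t)].

    # Illustrative DDPG update (assumed architecture and hyperparameters).
    import torch
    import torch.nn as nn

    state_dim, action_dim, gamma = 32, 3, 0.95   # e.g. 3 continuous SVG outputs

    def mlp(in_dim, out_dim):
        return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

    actor, actor_target = mlp(state_dim, action_dim), mlp(state_dim, action_dim)
    critic, critic_target = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)
    actor_target.load_state_dict(actor.state_dict())
    critic_target.load_state_dict(critic.state_dict())
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

    def ddpg_update(s, a, r, s_next):
        """One critic step (MSE to the label y) and one actor step (policy gradient)."""
        # Critic: y = r + gamma * q_target(s', psi_target(s'))
        with torch.no_grad():
            a_next = actor_target(s_next)
            y = r.unsqueeze(1) + gamma * critic_target(torch.cat([s_next, a_next], dim=1))
        q = critic(torch.cat([s, a], dim=1))
        critic_loss = nn.functional.mse_loss(q, y)
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        # Actor: ascend q(s, psi(s)) by minimizing its negative mean.
        actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        return critic_loss.item(), actor_loss.item()

    # Example with a random minibatch of M = 16 transitions.
    s = torch.randn(16, state_dim); a = torch.randn(16, action_dim)
    r = torch.randn(16); s_next = torch.randn(16, state_dim)
    print(ddpg_update(s, a, r, s_next))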
Examples are given below:
Following the flow chart of the deep reinforcement learning-based double-time-scale power grid voltage optimization method shown in fig. 1, voltage optimization is carried out for the modified IEEE 39-node test system shown in fig. 2 under uncertain new energy output and uncertain load. Node 6, node 23 and node 26 are set as the hub nodes of their regions. Node 33 and node 37 are wind farms, each with a rated capacity of 500 MW. Parallel capacitor bank 1 and parallel capacitor bank 2 are installed at node 4 and node 8 of the original system, respectively; their parameters are identical, with a maximum gear of 6, 50 Mvar per gear, and a maximum of 6 adjustments per day. Node 6, node 23 and node 26 are each connected to a continuous reactive power compensation device with an adjustable range of -120 Mvar to +120 Mvar. Considering the influence of emergencies on the grid voltage, the SVG reserves a reactive power margin for emergency reactive support, so in the reactive voltage method of this embodiment the adjustable range is narrowed to -80 Mvar to +80 Mvar. To match the action times of the reactive power compensation devices, the long-time-scale decision period selected in this embodiment is 1 h and the short-time-scale decision period is 5 min. 420 days of power system operation data are constructed from typical daily load curves and typical wind power output curves and used as training data to drive the two agents. The comprehensive operating performance is analysed by comparing the optimization effect and the running time obtained with the deep reinforcement learning algorithm and with a genetic algorithm.
The capacitor banks are limited by the number of switching operations during switching. As can be seen from fig. 3, with the proposed method, whether the switching plan for the next moment is executed can be decided according to the current number of capacitor switchings and the grid operating conditions at that moment. At about 4:00, when wind power output is ample and load demand is low, the parallel capacitor banks are placed at a lower gear and provide less reactive support to the grid; at about 12:00, when wind power output is low and load demand is high, the parallel capacitor banks are adjusted to a higher gear to suppress large-range voltage fluctuations. With the number of capacitor switchings limited, if the long time scale were optimized only according to the current grid operating conditions, the optimization effect could not be fully exploited once the daily limit is reached.
As can be seen from fig. 4, over days 200 to 400 the average loss reduction rate of the proposed method is 5.24%, compared with 4.66% under genetic-algorithm optimization, which fully demonstrates the superiority of the proposed method.
The reactive power output of each continuous device over a typical short-time-scale day can be seen from fig. 5. The continuous reactive power compensation devices are mainly used to suppress the fast and frequent voltage fluctuations caused by the uncertainty of new energy, so their output changes frequently within the day.
As can be seen from fig. 6a, fig. 6b and fig. 6c, the proposed method achieves a good voltage optimization effect compared with the genetic algorithm over a typical day. The cumulative optimization time of the proposed method over a typical day is only 137.58 s, whereas the genetic algorithm needs 685.44 s, which demonstrates the speed of the method in solving the decision problem.
In conclusion, through the cooperation of the two agents on the long and short time scales, the advantages of the various reactive compensation devices complement each other, and the proposed method offers stronger reactive voltage optimization capability and good feasibility.

Claims (7)

1. A double-time-scale new energy power grid voltage optimization method based on deep reinforcement learning is characterized by comprising the following steps:
1) Dividing the long-time-scale intervals and the short-time-scale intervals of the double-time-scale method, respectively:
Divide the day into K_l long time intervals, the initial times of the long time intervals within the day being τ = 0, …, K_l - 1, and subdivide each long time interval into K_s short time intervals, the initial times of the short time intervals within a long time interval being t = 0, …, K_s - 1;
2) Carrying out long-time-scale power grid voltage optimization based on the DQN algorithm, comprising the following steps: establishing a long-time-scale power grid voltage optimization model, integrating the multiple objectives by using a membership function, designing a reward function for the long-time-scale power grid voltage optimization model, and solving the long-time-scale power grid voltage optimization model by using the DQN algorithm to obtain a long-time-scale parallel capacitor bank switching plan; the long-time-scale power grid voltage optimization model is as follows:
(1) The objective function F_l(T) of the long-time-scale power grid voltage optimization model is:
min F_l(T) = [f_1(T), f_2(T)]
[Equation images in the original: f_1(T) accumulates the deviation of the hub-node voltage amplitudes v_p from the reference value v_ref over the K_l long time intervals; f_2(T) accumulates the network loss expressed through the branch active powers p_ij.]
where T is the switching-state vector of all parallel capacitor banks, and f_1(T) and f_2(T) are respectively the first and second sub-objectives of the objective function F_l(T); T(τ) denotes the switching-state vector of all parallel capacitor banks at time τ, the switching state of each parallel capacitor bank being expressed by its switching gear; N is the number of nodes in the power grid; v_p denotes the voltage amplitude of a hub node; v_ref is the voltage reference value of the hub node; p_ij denotes the active power flowing from node i to node j; K_l denotes the number of long time intervals in a day;
(2) Considering the grid operation power flow constraints and the voltage constraint:
p_i = v_i Σ_{j=1..N} v_j (G_ij cos ω_ij + B_ij sin ω_ij)
q_i = v_i Σ_{j=1..N} v_j (G_ij sin ω_ij - B_ij cos ω_ij)
v_i^min ≤ v_i ≤ v_i^max
where p_i denotes the active power injected into node i; q_i denotes the reactive power injected into node i; v_i denotes the voltage amplitude of node i; v_i^max and v_i^min denote respectively the upper and lower limits of the voltage amplitude of node i; G_ij denotes the conductance between node i and node j; B_ij denotes the susceptance between node i and node j; ω_ij denotes the voltage phase-angle difference between node i and node j;
(3) The switching of the m-th parallel capacitor bank is limited by the upper and lower limits T_m^max and T_m^min of its switching gear T_m:
T_m^min ≤ T_m ≤ T_m^max
(4) The number of switching operations of a parallel capacitor bank within one day is limited by the upper bound C_m^max:
C_m ≤ C_m^max
where C_m denotes the number of switching operations of the m-th parallel capacitor bank within one day;
3) short-time scale reactive voltage optimization is carried out based on a DDPG algorithm, and the method comprises the following steps: establishing a short-time-scale power grid voltage optimization model, designing a reward function aiming at the short-time-scale power grid voltage optimization model, and solving the short-time-scale power grid voltage optimization model by using a DDPG algorithm to obtain a short-time-scale continuous reactive power compensation device output plan.
2. The deep reinforcement learning-based double-time-scale new energy grid voltage optimization method according to claim 1, wherein the step 2) of integrating multiple targets by using membership functions adopts the following formula:
[Equation image in the original: the membership function μ(f_β).]
where f_β* denotes the attainable optimal value of the β-th sub-objective in its own dimension, β = 1, 2; δ_β is the tolerance of the sub-objective value f_β and is used to define the boundary that the objective function can reach; for any sub-objective, when the corresponding objective value lies within the tolerance range, the membership function μ(f_β) maps the value of f_β into [0, 1]; when the value of f_β lies outside the tolerance, the value of the membership function is set to 0; and when a new optimal value of the sub-objective is found, the value of the membership function is set to 1;
the new objective function after the membership-function mapping is:
min[-μ(F_l)], with μ(F_l) = k_1·μ(f_1) + k_2·μ(f_2)
where μ(F_l) is the membership function corresponding to the objective function F_l; f_1 and f_2 are respectively the first and second sub-objectives of the objective function; k_1 and k_2 are the weight coefficients of the two sub-objectives.
3. The deep reinforcement learning-based double-time-scale new energy power grid voltage optimization method according to claim 1, wherein in step 2) the reward function r_l(τ) is designed for the long-time-scale power grid voltage optimization model as follows:
[Equation image in the original: r_l(τ) combines the membership value μ(F_l) with penalty terms, weighted by σ_l, for node-voltage limit violations and for exceeding the daily capacitor switching limit.]
where μ(F_l) denotes the membership function corresponding to the objective function F_l; σ_l is the penalty factor for long-time-scale voltage limit violations and capacitor switching-count violations; v_i(τ) denotes the voltage amplitude of node i at time τ; C_m(τ) denotes the number of times the m-th parallel capacitor bank has been switched within the day at time τ.
4. The deep reinforcement learning-based double-time-scale new energy power grid voltage optimization method according to claim 1, wherein the step 2) of solving the long-time-scale power grid voltage optimization model by using the DQN algorithm to obtain the long-time-scale parallel capacitor bank switching plan comprises the following steps:
(1) Calculating the DQN network loss function L(θ):
L(θ) = E{ [ r_l(τ) + γ·max_{a_l ∈ A} q_π(s(τ), a_l | θ_target) - q_π(s(τ-1), a_l(τ) | θ) ]² }
where r_l(τ) denotes the reward function of the long-time-scale power grid voltage optimization model; s(τ) and s(τ-1) denote respectively the states of the agent at times τ and τ-1 and consist of the information matrix set {v, p, q, T, C, Q}, in which v, p and q are respectively the voltage amplitude vector of all nodes and the vectors of active and reactive power injected at all nodes, and T, C and Q are respectively the switching-state vector of all parallel capacitor banks, the vector of switching counts of all parallel capacitor banks within the day, and the reactive power output vector of all continuous reactive power compensation devices; A is the action space of the agent; a_l denotes an action selectable by the agent; a_l(τ) = T(τ) is obtained by the agent executing its policy based on the state s(τ-1) at time τ-1; q_π is the output of the estimated-value network; the target-network parameters are copied from the estimated-value network at a fixed step interval, so that the target network lags the estimated-value network to a certain extent; θ and θ_target are respectively the estimated-value network parameters and the target network parameters; γ denotes the discount (attenuation) factor;
(2) Updating the estimated-value network parameters by stochastic gradient descent:
θ_{τ+1} = θ_τ - α·∇_θ L(θ_τ)
where θ_{τ+1} and θ_τ denote respectively the estimated-value network parameters at times τ+1 and τ; α denotes the learning rate of the estimated-value network during the update; ∇ denotes the gradient operator.
5. The deep reinforcement learning-based dual-time scale new energy grid voltage optimization method according to claim 1, wherein the establishing of the short-time scale grid voltage optimization model in step 3) comprises:
The objective function F_s of the short-time-scale power grid voltage optimization model is:
[Equation image in the original: F_s accumulates the deviation of the hub-node voltage amplitudes v_p(t) from the reference value v_ref over the K_s short time intervals of the current long time interval.]
where Q is the vector formed by the reactive power outputs of the continuous reactive power compensation devices; Q(t) denotes this vector at time t; v_p denotes the voltage amplitude of a hub node; v_ref is the voltage reference value of the hub node; K_s denotes the number of short time intervals within a long time interval;
considering the grid operation power flow constraints and the voltage constraint, the constraint conditions are:
p_i = v_i Σ_{j=1..N} v_j (G_ij cos ω_ij + B_ij sin ω_ij)
q_i = v_i Σ_{j=1..N} v_j (G_ij sin ω_ij - B_ij cos ω_ij)
v_i^min ≤ v_i ≤ v_i^max
where p_i denotes the active power injected into node i; q_i denotes the reactive power injected into node i; v_i denotes the voltage amplitude of node i; v_i^max and v_i^min denote respectively the upper and lower limits of the voltage amplitude of node i; G_ij denotes the conductance between node i and node j; B_ij denotes the susceptance between node i and node j; ω_ij denotes the voltage phase-angle difference between node i and node j;
in order to cope with emergencies in the power system, a certain reserve must be kept during the adjustment of the continuous reactive power compensation devices; the constraint condition on the continuous reactive power compensation devices during adjustment is:
q_con,n^min ≤ q_con,n ≤ q_con,n^max
where q_con,n is the reactive power output value of the n-th continuous reactive power compensation device, and q_con,n^max and q_con,n^min are respectively its upper and lower reactive power output limits.
6. The deep reinforcement learning-based double-time-scale new energy power grid voltage optimization method according to claim 1, wherein in step 3) the reward function r_s(t) is designed for the short-time-scale power grid voltage optimization model as follows:
[Equation image in the original: r_s(t) combines the membership value μ(F_s) with a penalty term, weighted by σ_s, for node-voltage limit violations.]
where μ(F_s) denotes the membership function corresponding to the objective function F_s; σ_s is the penalty factor for short-time-scale voltage limit violations; v_i(t) denotes the voltage amplitude of node i at time t.
7. The deep reinforcement learning-based dual-time-scale new energy grid voltage optimization method according to claim 1, wherein the step 3) of solving the short-time-scale grid voltage optimization model by using a DDPG algorithm to obtain a short-time-scale continuous reactive power compensation device output plan comprises the following steps:
(1) Calculating the Critic network loss function L(θ_q):
L(θ_q) = E_t{ [ y(t) - q(s(t), a_s(t) | θ_q) ]² } ≈ (1/M) Σ_{m=1..M} [ y_m(t) - q(s_m(t), a_s^m(t) | θ_q) ]²
where E_t{·} denotes the expectation of the corresponding target value over all values of t; s(t) is the state of the agent at time t and consists of the information matrix set {v, p, q, T, C, Q}, in which v, p and q are respectively the voltage amplitude vector of all nodes and the vectors of active and reactive power injected at all nodes, and T, C and Q are respectively the switching-state vector of all parallel capacitor banks, the vector of switching counts of all parallel capacitor banks within the day, and the reactive power output vector of all continuous reactive power compensation devices; a_s(t) = Q(t) denotes the action taken by the agent at time t, obtained by executing the policy based on the state s(t-1) at time t-1; θ_q are the parameters of the estimated-value network of the Critic network; q is the output of the estimated-value network of the Critic network; because the DDPG algorithm is a gradient method based on a deterministic policy, the probability distribution of the actions selected in different states cannot be determined, so the expectation is replaced by the average over M samples, the M samples being drawn at random from the replay memory; s_m(t) and a_s^m(t) are respectively the agent state and the agent action of the m-th sample; y_m(t) is the label (target value) of the m-th sample; the label y(t) is given by:
y(t) = r_s(t) + γ·q_target{ s(t+1), ψ_target[ s(t) | θ_ψ' ] | θ_q' }
where r_s(t) is the reward function of the short-time-scale power grid voltage optimization model; s(t) and s(t+1) are the states of the agent at times t and t+1, respectively; ψ_target is the output of the target network of the Actor network; θ_ψ' are the parameters of the target network of the Actor network; q_target is the output of the target network of the Critic network; θ_q' are the parameters of the target network of the Critic network; γ is the discount (attenuation) factor;
(2) The Actor network is evaluated by the Critic network and is updated by a gradient-update method; the gradient is computed as:
∇_{θ_ψ} J ≈ (1/M) Σ_{m=1..M} ∇_a q(s, a | θ_q)|_{s=s_m(t), a=ψ(s_m(t))} · ∇_{θ_ψ} ψ(s | θ_ψ)|_{s=s_m(t)}
where E{·} denotes the expectation of the corresponding target value over all values of t; q is the output of the estimated-value network of the Critic network; ψ is the output of the estimated-value network of the Actor network; s and a denote respectively the agent state and the action taken by the agent; θ_q are the parameters of the estimated-value network of the Critic network; θ_ψ are the parameters of the estimated-value network of the Actor network; ∇_a(·) denotes the gradient of the corresponding target value with respect to the action a; ∇_{θ_ψ}(·) denotes the gradient of the corresponding target value with respect to the parameters θ_ψ; M is the number of samples; s_m(t) and a_s^m(t) are respectively the agent state and the agent action of the m-th sample;
(3) Updating the parameters of the Actor network and of the Critic network, respectively:
θ_ψ^{t+1} = θ_ψ^t + α_1·∇_{θ_ψ} J
θ_q^{t+1} = θ_q^t - α_2·∇_{θ_q} L(θ_q)
where θ_ψ^{t+1} and θ_ψ^t denote respectively the parameters of the estimated-value network of the Actor network at times t+1 and t; θ_q^{t+1} and θ_q^t denote respectively the parameters of the estimated-value network of the Critic network at times t+1 and t; α_1 and α_2 denote respectively the learning rates of the estimated-value network of the Actor network and of the estimated-value network of the Critic network during the update; ∇ denotes the gradient operator.
CN202111217697.4A 2021-10-19 2021-10-19 Deep reinforcement learning-based double-time-scale new energy power grid voltage optimization method Active CN113807029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111217697.4A CN113807029B (en) 2021-10-19 2021-10-19 Deep reinforcement learning-based double-time-scale new energy power grid voltage optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111217697.4A CN113807029B (en) 2021-10-19 2021-10-19 Deep reinforcement learning-based double-time-scale new energy power grid voltage optimization method

Publications (2)

Publication Number Publication Date
CN113807029A CN113807029A (en) 2021-12-17
CN113807029B true CN113807029B (en) 2022-07-29

Family

ID=78898027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111217697.4A Active CN113807029B (en) 2021-10-19 2021-10-19 Deep reinforcement learning-based double-time-scale new energy power grid voltage optimization method

Country Status (1)

Country Link
CN (1) CN113807029B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114243718B (en) * 2021-12-23 2023-08-01 华北电力大学(保定) Reactive voltage coordination control method for power grid based on DDPG algorithm
CN114336667B (en) * 2022-01-22 2023-06-27 华北电力大学(保定) Reactive voltage intelligent optimization method for high-proportion wind-solar new energy power grid
CN116054185B (en) * 2023-03-30 2023-06-02 武汉新能源接入装备与技术研究院有限公司 Control method of reactive power compensator

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408962A (en) * 2021-07-28 2021-09-17 贵州大学 Power grid multi-time scale and multi-target energy optimal scheduling method

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104993522B (en) * 2015-06-30 2018-01-19 中国电力科学研究院 A kind of active distribution network Multiple Time Scales coordination optimization dispatching method based on MPC
CN106487042B (en) * 2016-11-22 2019-01-15 合肥工业大学 A kind of Multiple Time Scales micro-capacitance sensor voltage power-less optimized controlling method
CN106953359B (en) * 2017-04-21 2019-08-27 中国农业大学 A kind of active reactive coordinating and optimizing control method of power distribution network containing distributed photovoltaic
CN108964042B (en) * 2018-07-24 2021-10-15 合肥工业大学 Regional power grid operating point scheduling optimization method based on deep Q network
US20200327411A1 (en) * 2019-04-14 2020-10-15 Di Shi Systems and Method on Deriving Real-time Coordinated Voltage Control Strategies Using Deep Reinforcement Learning
CN110535146B (en) * 2019-08-27 2022-09-23 哈尔滨工业大学 Electric power system reactive power optimization method based on depth determination strategy gradient reinforcement learning
CN112600218B (en) * 2020-11-30 2022-07-29 华北电力大学(保定) Multi-time-scale optimization control method for reactive voltage of power grid comprising photovoltaic energy storage system
CN112711902A (en) * 2020-12-15 2021-04-27 国网江苏省电力有限公司淮安供电分公司 Power grid voltage calculation method based on Monte Carlo sampling and deep learning
CN113363997B (en) * 2021-05-28 2022-06-14 浙江大学 Reactive voltage control method based on multi-time scale and multi-agent deep reinforcement learning
CN113363998B (en) * 2021-06-21 2022-06-28 东南大学 Power distribution network voltage control method based on multi-agent deep reinforcement learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408962A (en) * 2021-07-28 2021-09-17 贵州大学 Power grid multi-time scale and multi-target energy optimal scheduling method

Also Published As

Publication number Publication date
CN113807029A (en) 2021-12-17

Similar Documents

Publication Publication Date Title
CN113807029B (en) Deep reinforcement learning-based double-time-scale new energy power grid voltage optimization method
Wang et al. Toward the prediction level of situation awareness for electric power systems using CNN-LSTM network
CN113363998B (en) Power distribution network voltage control method based on multi-agent deep reinforcement learning
CN114217524A (en) Power grid real-time self-adaptive decision-making method based on deep reinforcement learning
CN112733462A (en) Ultra-short-term wind power plant power prediction method combining meteorological factors
CN109255726A (en) A kind of ultra-short term wind power prediction method of Hybrid Intelligent Technology
CN108599180A (en) A kind of electric distribution network reactive-voltage optimization method considering power randomness
CN111525587A (en) Reactive load situation-based power grid reactive voltage control method and system
CN112564125B (en) Dynamic reactive power optimization method for power distribution network based on variable-step-length longhorn beetle whisker algorithm
CN112418496B (en) Power distribution station energy storage configuration method based on deep learning
CN114336632A (en) Method for correcting alternating current power flow based on model information assisted deep learning
Zhang et al. Deep reinforcement learning for load shedding against short-term voltage instability in large power systems
CN112818588A (en) Optimal power flow calculation method and device for power system and storage medium
CN115313403A (en) Real-time voltage regulation and control method based on deep reinforcement learning algorithm
CN113872213B (en) Autonomous optimization control method and device for power distribution network voltage
CN112001066A (en) Deep learning-based method for calculating limit transmission capacity
CN113344283A (en) Energy internet new energy consumption capacity assessment method based on edge intelligence
CN117200213A (en) Power distribution system voltage control method based on self-organizing map neural network deep reinforcement learning
Kang et al. Stability analysis of TSK fuzzy systems
CN112994016B (en) Method and system for adjusting restoration resolvable property of power flow of power system
CN115829258A (en) Electric power system economic dispatching method based on polynomial chaotic approximate dynamic programming
CN114937999A (en) Machine learning-based steady-state reactive power optimization method for synchronous generator to improve voltage transient stability
Obert et al. Efficient distributed energy resource voltage control using ensemble deep reinforcement learning
Lin et al. Reactive power optimization in area power grid based on improved Tabu search algorithm
Zhu et al. Electric vehicle load forecasting based on improved neural network based on differential evolution algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant