CN118017523A - Voltage control method, device, equipment and medium for electric power system - Google Patents

Publication number: CN118017523A
Application number: CN202410420112.6A
Authority: CN (China)
Other languages: Chinese (zh)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Prior art keywords: voltage control, power, network layer, voltage, strategy
Inventors: 沈增祥, 陈晓峰, 陈飘, 钟灵军, 赵银龙
Current and original assignee: Hangzhou Hongsheng Electric Power Design Consulting Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by Hangzhou Hongsheng Electric Power Design Consulting Co ltd; priority to CN202410420112.6A

Landscapes

  • Supply And Distribution Of Alternating Current (AREA)
Abstract

The invention discloses a voltage control method, device, equipment and medium for a power system, relating to the field of power system dispatching. The method comprises the following steps: acquiring power operation data of distributed power supplies in a power system; inputting the power operation data into a trained voltage control model to obtain the voltage control scheme output by the model, where the voltage control scheme adjusts the reactive power correspondingly output by the distributed power supplies so as to minimize voltage deviation. The voltage control model is obtained based on historical power operation data and a Markov decision and reinforcement learning process, with offline agents exploring the environment through a behavior strategy and training the network parameters with a strategy gradient algorithm. The invention can effectively improve power flow distribution, alleviate voltage limit violations and stabilize voltage fluctuation, thereby achieving real-time optimal control of distribution network voltage under high-proportion new-energy access.

Description

Voltage control method, device, equipment and medium for electric power system
Technical Field
The invention relates to the technical field of power system dispatching, in particular to a voltage control method, device, equipment and medium of a power system.
Background
At present, the capacity of new energy sources such as distributed generators (Distributed Generator, DG) connected to the power distribution network is increasing year by year, but the randomness of the new energy output poses serious challenges to the safe and stable operation of the distribution network. How to achieve real-time optimal control of distribution network voltage under high-proportion new-energy access is an important subject to be solved by the industry.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a voltage control method, apparatus, device and medium for an electric power system, so as to solve the problem in the prior art that real-time optimal control of distribution network voltage under high-proportion new-energy access cannot be achieved.
According to a first aspect, an embodiment of the present invention provides a voltage control method of an electric power system, the method including:
acquiring power operation data of a distributed power supply in a power system; the power operation data includes at least: load active power, load reactive power, distributed power source active power, voltage amplitude and voltage phase angle;
inputting the electric power operation data into a trained voltage control model to obtain a voltage control scheme output by the voltage control model; the voltage control scheme adjusts the reactive power correspondingly output by the distributed power supplies so as to minimize voltage deviation;
The voltage control model is obtained based on historical power operation data, markov decision and reinforcement learning processes, and offline agents explore the environment based on behavior strategies and train network parameters by using a strategy gradient algorithm.
With reference to the first aspect, in a first implementation manner of the first aspect, the voltage control model is trained by:
acquiring historical power operation data of a distributed power supply in a power system, and constructing a voltage control model for solving the voltage control problem based on the historical power operation data;
determining agents, states, observations, actions and rewards in the optimal power flow model based on Markov decision and reinforcement learning processes; the intelligent agents are distributed power supplies in the power system, and the intelligent agents are in one-to-one correspondence with the distributed power supplies;
Constructing a neural network structure of the voltage control model;
offline training is carried out on the intelligent agent, and interaction records of the intelligent agent are determined through continuous interaction between the offline intelligent agent and the environment;
Samples are extracted from the interaction records through experience playback, and neural network parameters are updated based on the samples and a strategy gradient algorithm.
With reference to the first implementation manner of the first aspect, in a second implementation manner of the first aspect, the voltage control model includes: a policy network layer and a global value network layer, wherein the policy network layer is equal to the intelligent agents in number and corresponds to the intelligent agents one by one;
The input end of the global value network layer is connected with the output end of the strategy network layer, and the strategy network layer and the global value network layer are both composed of fully-connected neural networks;
the strategy network layer is used for generating individual actions according to local observation of the area where the intelligent agent is located in the environment; the global value network layer is used for scoring actions of all the agents according to the states to determine global value.
With reference to the second implementation manner of the first aspect, in a third implementation manner of the first aspect, the voltage control model further includes: target strategy network layers, which are equal in number to the agents and correspond to the agents one by one, and a target value network layer;
the target strategy network layers are in one-to-one correspondence with the strategy network layers, and the target strategy network layers are connected with the corresponding strategy network layers; the target value network layer corresponds to and is connected with the global value network layer;
the target strategy network layer and the target value network layer are both composed of fully-connected neural networks.
With reference to the first implementation manner of the first aspect, in a fourth implementation manner of the first aspect, the determining, based on a markov decision and reinforcement learning process, an agent, an environment, a state, an action, and a reward in the optimal power flow model specifically includes:
Determining an agent and an environment; the environment is a power distribution network environment of the power system, and the intelligent body is a distributed power supply in the power system;
Determining a state; the state is a node characteristic quantity, and the node characteristic quantity at least comprises: load active power, load reactive power, distributed power source active power, voltage amplitude and voltage phase angle;
Determining an observation; the observation consists of local observation of each intelligent agent, wherein the local observation is node characteristic quantity in an area where the intelligent agent observes;
Determining an action; the action is composed of individual actions of each intelligent agent, wherein the individual actions are distributed power supply reactive power corresponding to the intelligent agent at the current moment;
determining a reward; the intelligent agents are in complete cooperative relationship and share rewards.
With reference to the fourth implementation manner of the first aspect, in a fifth implementation manner of the first aspect, the performing offline training on the agent, and determining an interaction record of the agent by continuously interacting the offline agent with the environment specifically includes:
determining the individual action output by the strategy network layer corresponding to the intelligent agent, and adding noise for the individual action; the noise is randomly extracted from Gaussian distribution with a mean value set to 0 and a standard deviation set to a preset value;
offline training is carried out on the intelligent agent, and an interaction record generated by interaction between the offline intelligent agent and the environment is recorded; the interaction record comprises the state of the agent at the current moment, the observation of the current moment, the action of the current moment, the rewards of the current moment, the state of the next moment adjacent to the current moment and the observation of the next moment adjacent to the current moment.
With reference to the first implementation manner of the first aspect, in a sixth implementation manner of the first aspect, samples are extracted from the interaction record through experience playback, and parameters of the neural network are updated based on the samples and a policy gradient algorithm, which specifically includes:
storing the interaction records into an experience playback array, and extracting at least one interaction record from the experience playback array as a sample;
sequentially determining, according to the observation of the agent in the sample at the next time adjacent to the current time and through the corresponding target strategy network layer and target value network layer, the target individual action and the target value at the next time adjacent to the current time, and determining the time difference target at the current time from the target individual action and the target value at the next time adjacent to the current time;
Determining the value of the intelligent agent at the current moment through a global value network layer according to the state and the action of the intelligent agent in the sample, and determining a time difference error according to the value at the current moment and a time difference target;
According to the time difference errors respectively corresponding to the intelligent agents in the sample, updating network parameters of the global value network layer by using a gradient descent algorithm;
determining, according to the observation of the agent in the sample at the current time, the action of the agent at the current time by using the strategy network layer, and determining the value at the current time by combining the historical actions of the agents in the sample other than this agent;
determining policy gradients of the agents in the sample according to a chained rule, and updating network parameters of a policy network layer by combining a gradient rising algorithm according to the average value of the policy gradients corresponding to each agent in the sample;
based on a preset soft update factor, updating network parameters of the target strategy network layer and the target value network layer by using a soft update algorithm.
According to a second aspect, an embodiment of the present invention further provides a voltage control apparatus of an electric power system, the apparatus including:
the acquisition module is used for acquiring power operation data of the distributed power supply in the power system; the power operation data includes at least: load active power, load reactive power, distributed power source active power, distributed power source reactive power, voltage amplitude and voltage phase angle;
The control module is used for inputting the electric power operation data into a trained voltage control model to obtain a voltage control scheme output by the voltage control model; the voltage control scheme adjusts the reactive power correspondingly output by the distributed power supplies so as to minimize voltage deviation;
The voltage control model is obtained based on historical power operation data, markov decision and reinforcement learning processes, exploration by utilizing a behavior strategy and training of network parameters by utilizing a strategy gradient algorithm.
According to a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the voltage control method of the power system as described in any one of the above when the program is executed.
According to a fourth aspect, an embodiment of the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a method of controlling voltage of an electrical power system as described in any of the above.
According to the voltage control method, device, equipment and medium of the power system, a voltage control model is obtained through training: it is based on historical power operation data and a Markov decision and reinforcement learning process, with offline agents exploring the environment through a behavior strategy and training the network parameters with a strategy gradient algorithm. The voltage control scheme obtained from the voltage control model is therefore a superior voltage control strategy that can effectively improve power flow distribution, alleviate voltage limit violations and stabilize voltage fluctuation, thereby achieving real-time optimal control of distribution network voltage under high-proportion new-energy access.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and should not be construed as limiting the invention in any way, in which:
fig. 1 shows a schematic flow chart of a voltage control method of an electric power system provided by the invention;
fig. 2 is a schematic structural diagram of a voltage control model in the voltage control method of the power system according to the present invention;
fig. 3 shows a schematic structural diagram of a voltage control device of the power system provided by the invention;
Fig. 4 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
At present, the capacity of new energy sources such as DG connected to the power distribution network is increasing year by year, but the randomness of the new energy output poses serious challenges to the safe and stable operation of the distribution network. At the same time, the inverters with which these new energy sources are equipped can output continuously adjustable reactive power, providing a flexible dispatching means for the distribution network.
The traditional centralized control method involves a large amount of computation when there are many optimization variables and nodes, and has difficulty coping with the rapid voltage fluctuations caused by intermittent new-energy generation. In addition, the centralized control method relies on complete communication facilities, and the investment cost is high.
Therefore, how to realize the real-time optimal control of the voltage of the power distribution network under the high-proportion new energy access is an important topic to be solved in the industry.
To solve the above problems, this embodiment provides a voltage control method for an electric power system, which aims to achieve voltage optimization control of the distribution network under high-proportion new-energy access. The voltage control method of the embodiment of the invention can be used in electronic equipment, including but not limited to computers, mobile terminals, etc. Fig. 1 is a schematic flow chart of the voltage control method of the power system provided by the embodiment of the invention; as shown in fig. 1, the method includes the following steps:
S10, acquiring power operation data of a distributed power supply in the power system.
The power operation data of the various DGs in the power system (distribution network) are the data generated by the DGs under various operating conditions. The power operation data may be stored in the electronic device in advance or acquired by the electronic device from outside, for example from an external detection device.
The specific form in which the power operation data is acquired is not limited, as long as the electronic device can obtain it; a corresponding database may also be built for the power operation data of the various DGs in the distribution network under various operating conditions.
Preferably, the power operation data includes: load active power, load reactive power, distributed power source active power, voltage amplitude and voltage phase angle.
S20, inputting the power operation data into a trained voltage control model to obtain a voltage control scheme output by the voltage control model, wherein the voltage control scheme adjusts the reactive power correspondingly output by the distributed power supplies so as to minimize voltage deviation.
In the embodiment of the invention, the voltage control model is obtained based on historical power operation data, markov decision and reinforcement learning processes, and offline agents explore the environment based on a behavior strategy and train network parameters by using a strategy gradient algorithm.
In the embodiment of the invention, the offline data related to all the collected historical power operation data is used as sample data, providing massive data support for subsequent training. It will be appreciated that the historical power operation data has the same form as the power operation data collected in real time.
The embodiment of the invention applies a deep reinforcement learning (Deep Reinforcement Learning, DRL) method. As a data-driven method, DRL adapts to complex environment states while achieving online real-time dispatching control without depending on forecast data. It can learn a decision and dispatching method from historical experience, has strong adaptive learning and decision capability for nonlinear and non-convex problems, does not require modeling of an uncertain, real-time-changing physical model, and is suitable for complex and changeable power system scenarios containing various new energy sources.
The trained voltage control model in the embodiment of the invention realizes the minimization of the system voltage deviation by adjusting the reactive power output by DG inverters, such as photovoltaic inverters and wind turbine inverters, in the distribution network.
Assume that the total number of time steps is T, that the distribution network contains N nodes and M DGs, that the time index is t, the node index is i, and the photovoltaic (agent) index is j. In order to effectively distinguish the severity of the deviation of the node voltage from the reference value within the safe range, the embodiment of the invention adopts a U-shaped function to describe the voltage deviation, using a Gaussian distribution probability density function within the safe range and a Laplace distribution probability density function outside the safe range. Specifically:

f(V_{i,t}) = a_1 · (1 − exp(−(V_{i,t} − V_ref)² / (2σ²))),  V_min ≤ V_{i,t} ≤ V_max
f(V_{i,t}) = a_2 − a_3 · exp(−|V_{i,t} − V_ref| / a_4),  otherwise

wherein f(·) represents the voltage deviation function; V_{i,t} represents the voltage magnitude of node i at time t; V_min represents the safe lower voltage limit, V_max the safe upper voltage limit, and V_ref the voltage reference value; exp(−(V_{i,t} − V_ref)²/(2σ²)) is the Gaussian density term with mean V_ref and standard deviation σ; and a_1, a_2, a_3 and a_4 represent the first to fourth shape parameters, which are used to adjust the smoothness of the voltage deviation function.
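A minimal Python sketch of this U-shaped deviation function. The concrete values below (shape parameters a_1–a_4, σ, and the 0.95–1.05 p.u. safe range) are illustrative assumptions, not values given in the patent:

```python
import math

# Hypothetical shape parameters and safe range (illustrative only).
A1, A2, A3, A4 = 1.0, 2.0, 1.0, 0.05
V_MIN, V_MAX, V_REF, SIGMA = 0.95, 1.05, 1.0, 0.02

def voltage_deviation(v: float) -> float:
    """U-shaped deviation penalty: Gaussian-pdf-based inside the safe
    range, Laplace-pdf-based (and steeper) outside it."""
    if V_MIN <= v <= V_MAX:
        # 1 - Gaussian density term: the penalty is 0 at the reference voltage
        return A1 * (1.0 - math.exp(-((v - V_REF) ** 2) / (2 * SIGMA ** 2)))
    # 1 - Laplace density term (scaled): keeps growing as |v - V_ref| grows
    return A2 - A3 * math.exp(-abs(v - V_REF) / A4)
```

The inner branch grades deviations smoothly within the safe band, while the outer branch penalizes any limit violation much more heavily.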
In the voltage control model, each DG in the distribution network is considered an individual agent (the total number of agents is M). A Markov model of the voltage control problem is then established; its core elements mainly comprise the state, observation, action and reward. Specifically:

s_t = {x_{1,t}, x_{2,t}, …, x_{N,t}},  x_{i,t} = [P^L_{i,t}, Q^L_{i,t}, P^G_{i,t}, Q^G_{i,t−1}, V_{i,t}, θ_{i,t}]

wherein s_t represents the state, the set of the feature quantities x_{i,t} of all nodes; x_{i,t} is the feature quantity of node i at time t; P^L_{i,t} represents the load active power of node i at time t (the current time); Q^L_{i,t} represents the load reactive power of node i at time t; P^G_{i,t} represents the DG active power of node i at time t; Q^G_{i,t−1} represents the DG reactive power of node i at time t−1 (the previous time adjacent to the current time); V_{i,t} represents the voltage magnitude of node i at time t; and θ_{i,t} represents the voltage phase angle of node i at time t.
If there is no load or DG on a certain node, the corresponding power is 0.
o_t = (o_{1,t}, o_{2,t}, …, o_{M,t}),  o_{j,t} = {x_{i,t} | i ∈ N_j}

wherein o_t represents the (joint) observation, which is composed of the local observations o_{j,t} of the individual agents. It should be noted that, owing to limitations of communication and of privacy, each DG can only observe the node feature quantities in the area where it is located: N_j denotes the node set of the area where agent j is located.
a_t = (a_{1,t}, a_{2,t}, …, a_{M,t}),  a_{j,t} = Q^G_{j,t}

wherein a_t represents the (joint) action, which is composed of the individual actions a_{j,t} of the individual agents; a_{j,t} represents the DG reactive power of agent j at time t.
Finally, r_t represents the reward; since the agents are in a fully cooperative relationship, the reward can be shared among them. The Markov model of the voltage control problem in the embodiment of the invention is essentially a multi-agent deep reinforcement learning model, and its aim is to solve for the optimal joint control strategy so as to maximize the cumulative discounted return of the reward.
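The patent states only that the reward is shared by the fully cooperative agents; a natural aggregation, assumed here purely for illustration, is the negative total voltage deviation over all nodes:

```python
def shared_reward(voltages, deviation_fn):
    """Shared cooperative reward: negative total voltage deviation over
    all nodes (assumed aggregation; the patent only states that the
    reward is shared by all agents)."""
    return -sum(deviation_fn(v) for v in voltages)

# Every agent receives the same scalar, so cooperation is implicit.
r = shared_reward([1.00, 1.02, 0.98], lambda v: (v - 1.0) ** 2)
```

Because all agents see the same scalar, improving any node's voltage profile improves every agent's return.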
According to the voltage control method of the power system described above, a voltage control model is obtained through training: it is based on historical power operation data and a Markov decision and reinforcement learning process, with offline agents exploring the environment through a behavior strategy and training the network parameters with a strategy gradient algorithm. The voltage control scheme obtained from the voltage control model is therefore a superior voltage control strategy that can effectively improve power flow distribution, alleviate voltage limit violations and stabilize voltage fluctuation, thereby achieving real-time optimal control of distribution network voltage under high-proportion new-energy access.
As shown in fig. 2, in the embodiment of the present invention, the neural network of the voltage control model (Markov model), i.e., the multi-agent deep reinforcement learning model, has an actor-critic structure. The network comprises M policy network layers π_j and one global value network layer Q; the number of policy network layers equals the number of agents, one per agent. The input of the global value network layer Q is connected to the outputs of the policy network layers π_j, and both the policy network layers and the global value network layer consist of fully connected neural networks. The network parameter of policy network layer π_j is θ_j, and the network parameter of the global value network layer Q is w.
The policy network layer π_j generates the individual action from the local observation o_{j,t} of the area N_j where agent j is located in the distribution network environment, and the global value network layer Q scores the actions of all agents according to the state to obtain the global value. Specifically:

a_{j,t} = π_j(o_{j,t}; θ_j),  q_t = Q(s_t, a_t; w)

To alleviate the problem of value over-estimation, the embodiment of the invention also introduces target policy network layers π'_j and a target value network layer Q'. The network structure of π'_j is essentially the same as that of π_j, except that its network parameter is set to θ'_j; the network structure of Q' is essentially the same as that of Q, except that its network parameter is set to w'. Likewise, the target policy network layers are equal in number to the agents and correspond to the agents one by one; each target policy network layer π'_j is connected to the corresponding policy network layer π_j, and the target value network layer Q' corresponds to and is connected to the global value network layer Q.
In the embodiment of the invention, the voltage control model is obtained through training the following steps:
a10, acquiring historical power operation data of a distributed power supply in the power system, and constructing a voltage control model for solving the voltage control problem based on the historical power operation data;
A20, determining an agent, a state, observation, actions and rewards in the optimal power flow model based on a Markov decision and reinforcement learning process, wherein the agent is a distributed power supply in the power system and corresponds to the distributed power supply one by one;
a30, constructing a neural network structure of a voltage control model;
A40, performing offline training on the intelligent agent, and determining an interaction record of the intelligent agent through continuous interaction between the offline intelligent agent and the environment;
And A50, extracting a sample from the interaction record through experience playback, and updating the neural network parameters based on the sample and a strategy gradient algorithm.
During the exploration process of the voltage control model, i.e., during offline training, the offline agent j collects experience through a behavior strategy, which means adding random noise ε to the output of the current policy network. Specifically:

a_{j,t} = π_j(o_{j,t}; θ_j) + ε,  ε ~ N(0, σ_ε)

wherein the noise ε is randomly drawn from a Gaussian distribution with mean 0 and standard deviation σ_ε, which then determines the individual action.
In the embodiment of the invention, σ_ε is a preset value that can be configured by the user, and may be either a fixed value or an interval.
After one exploration step, the offline agent generates an interaction record in the distribution network environment; the interaction record can be represented by the six-tuple (s_t, o_t, a_t, r_t, s_{t+1}, o_{t+1}) and is stored in an experience playback array, and when the experience playback array is full, the oldest stored interaction record is replaced with the latest obtained one. It can be seen that the interaction record comprises the state s_t of the agent at the current time, the observation o_t at the current time, the action a_t at the current time, the reward r_t at the current time, the state s_{t+1} at the next time adjacent to the current time, and the observation o_{t+1} at the next time adjacent to the current time.
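The six-tuple record and the oldest-first replacement described above map directly onto a bounded deque. The class below is a minimal sketch (names are illustrative):

```python
import random
from collections import deque, namedtuple

# Six-tuple interaction record (s_t, o_t, a_t, r_t, s_next, o_next)
Transition = namedtuple("Transition", "s o a r s_next o_next")

class ReplayBuffer:
    """Fixed-capacity experience playback array; when full, the oldest
    record is discarded in favor of the newest (deque with maxlen)."""
    def __init__(self, capacity: int):
        self.buf = deque(maxlen=capacity)

    def push(self, *record):
        self.buf.append(Transition(*record))

    def sample(self, batch_size: int):
        """Uniform random mini-batch for experience replay."""
        return random.sample(self.buf, batch_size)
```

Random sampling breaks the temporal correlation of consecutive interaction records, which stabilizes the gradient updates described below.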
More specifically, step a50 specifically includes:
A51, storing the interaction records into an experience playback array, and extracting at least one interaction record from the experience playback array as a sample.
After a certain number of interaction records, i.e., interaction experiences, have been collected in the experience playback array, a small batch (a preset number) of samples is randomly drawn from the experience playback array and, according to Monte Carlo simulation, the expectation is replaced by the sample mean so as to update the network parameters of the Markov model's neural networks. Assume that the size of the small batch of samples is B, wherein sample b is (s_t^b, o_t^b, a_t^b, r_t^b, s_{t+1}^b, o_{t+1}^b).
A52, according to the observation of the agent in the sample at the next time adjacent to the current time, sequentially determining the target individual action and the target value at the next time adjacent to the current time through the corresponding target policy network layer and target value network layer, and determining the time difference target at the current time from the target individual action and the target value at the next time adjacent to the current time.
When updating the network parameters of the global value network layer, the time difference algorithm is used to update the parameter w of the global value network layer Q. Specifically, for the observation o_{j,t+1}^b of agent j in sample b at the next time adjacent to the current time, the target individual action and the target value at that next time are calculated in turn through the corresponding target policy network layer π'_j and the target value network layer Q', which yields the time difference target y_t^b, namely:

a'_{j,t+1}^b = π'_j(o_{j,t+1}^b; θ'_j)
q'_{t+1}^b = Q'(s_{t+1}^b, a'_{t+1}^b; w')
y_t^b = r_t^b + γ · q'_{t+1}^b

wherein γ represents the discount rate.
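The time difference target is a one-line computation; the `done` flag and the default discount rate of 0.99 below are illustrative assumptions not specified in the patent:

```python
def td_target(r, q_next, gamma=0.99, done=False):
    """Time difference target y = r + gamma * Q'(s', a').  q_next comes
    from the target networks, not the online ones, which stabilizes
    training; gamma is the discount rate (0.99 is illustrative)."""
    return r + (0.0 if done else gamma * q_next)
```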
A53, determining the value of the intelligent agent at the current moment through the global value network layer according to the state and the action of the intelligent agent in the sample, and determining the time difference error according to the value at the current moment and the time difference target.
According to the state s_t^b and the (joint) action a_t^b of the agent in sample b, the value q_t^b of the agent at the current time is calculated through the global value network layer Q, and the time difference error δ_t^b is obtained, namely:

q_t^b = Q(s_t^b, a_t^b; w)
δ_t^b = q_t^b − y_t^b
A54, according to the time-difference errors corresponding to the agents in the sample, update the network parameter of the global value network layer using a gradient-descent algorithm.

That is, from the time-difference errors δ_b of the B samples, the network parameter w of the global value network layer Q is updated by gradient descent:

w ← w − α_w · (1/B) · Σ_{b=1}^{B} δ_b · ∇_w Q(s_b, a_b; w)

where α_w denotes the learning rate of the global value network layer Q.
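The mean-TD-error gradient-descent step of A54 can be sketched with a linear value function, for which the gradient is explicit. The linear form, feature layout and learning rate are assumptions made so the update is checkable by hand:

```python
import numpy as np

def critic_update(w, feats, td_targets, alpha=0.01):
    """One gradient-descent step on a linear global value network.

    q(s, a; w) = w . phi(s, a), where phi is the concatenated
    (state, joint action) feature vector, so grad_w q = phi.
    The update averages delta_b * grad_w q_b over the mini-batch,
    matching the sample-mean (Monte Carlo) approximation.
    """
    w = np.asarray(w, dtype=float)
    feats = np.asarray(feats, dtype=float)       # shape (B, dim)
    td_targets = np.asarray(td_targets, dtype=float)
    q = feats @ w                                # q_b for each sample
    delta = q - td_targets                       # TD errors delta_b
    grad = (delta[:, None] * feats).mean(axis=0) # mean gradient
    return w - alpha * grad                      # gradient descent
```

A deep value network would replace the explicit gradient with backpropagation, but the averaged descent step is the same.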
A55, according to the observation of the agents at the current moment in the sample, determine the action of each agent at the current moment using its policy network layer, and determine the value at the current moment in combination with the recorded actions of the other agents in the sample.

When updating the network parameters of the policy network layers, a policy-gradient algorithm is used. Specifically, for agent i in sample b, its observation o_{b,i} at the current moment is fed into the policy network layer μ_i to compute a temporary action â_{b,i} = μ_i(o_{b,i}; θ_i), which is combined with the recorded actions of the other agents in sample b (all agents except i) to obtain a temporary value q̂_{b,i} = Q(s_b, a_{b,1}, ..., â_{b,i}, ..., a_{b,n}; w). Here â_{b,i} and q̂_{b,i} are temporary variables produced by the neural networks for agent i in the b-th sample and are used only to solve the policy gradient.
A56, determine the policy gradients of the agents in the sample according to the chain rule, and update the network parameters of the policy network layers with a gradient-ascent algorithm, using the mean of the policy gradients corresponding to each agent over the samples.

The policy gradient is computed by the chain rule as g_{b,i} = ∇_{θ_i} μ_i(o_{b,i}; θ_i) · ∇_{â_{b,i}} q̂_{b,i}, and the network parameters θ_i of the policy network layer μ_i are then updated by gradient ascent from the mean of the policy gradients g_{b,i} over the B samples:

θ_i ← θ_i + α_θ · (1/B) · Σ_{b=1}^{B} g_{b,i}

where α_θ denotes the learning rate of the policy network layer μ_i.
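The chain-rule ascent step of A56 can be sketched with a linear scalar-action policy, where both factors of the chain rule are explicit. The linear form and learning rate are illustrative assumptions:

```python
import numpy as np

def actor_update(theta, obs_batch, dq_da, beta=0.01):
    """One gradient-ascent step on a linear policy a = theta . o.

    By the chain rule, grad_theta q = (dq/da) * (da/dtheta); for this
    linear sketch da/dtheta = o, so the per-sample policy gradient is
    g_b = dq_da_b * o_b. The gradients are averaged over the mini-batch
    before the ascent step, as in the text.
    """
    theta = np.asarray(theta, dtype=float)
    obs_batch = np.asarray(obs_batch, dtype=float)  # shape (B, dim)
    dq_da = np.asarray(dq_da, dtype=float)          # dQ/da per sample
    grads = dq_da[:, None] * obs_batch              # g_b for each sample
    return theta + beta * grads.mean(axis=0)        # gradient ascent
```

For a deep policy, `dq_da` would come from differentiating the global value network with respect to the temporary action, then backpropagating through the policy network.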
A57, based on a preset soft-update factor, update the network parameters of the target policy network layers and the target value network layer using a soft-update algorithm.

That is, based on a preset soft-update factor τ, the network parameters θ'_i of the target policy network layers μ'_i and the network parameter w' of the target value network layer Q' are updated by the soft-update algorithm:

θ'_i ← τ · θ_i + (1 − τ) · θ'_i,    w' ← τ · w + (1 − τ) · w'
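The soft update of A57 is a simple convex blend of online and target parameters; a minimal sketch, with the factor value assumed:

```python
def soft_update(target_params, online_params, tau=0.01):
    """Soft (Polyak) update: target <- tau * online + (1 - tau) * target.

    A small tau makes the target networks track the online networks
    slowly, which stabilizes the time-difference targets.
    """
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]
```

The same blend is applied to every parameter of each target policy network layer and of the target value network layer.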
The voltage control device of the power system provided by the embodiment of the invention is described below; the voltage control device described below and the voltage control method of the power system described above may be referred to in correspondence with each other.
In order to solve the above problems, this embodiment provides a voltage control device of a power system, which aims to realize real-time optimal voltage control of a distribution network with a high proportion of new energy access. The voltage control device according to the embodiment of the present invention may be used in an electronic device, including but not limited to a computer, a mobile terminal, and the like. Fig. 3 is a schematic diagram of the voltage control device of the power system according to the embodiment of the present invention; as shown in fig. 3, the device includes:
An acquisition module 10, configured to acquire power operation data of distributed power supplies in the power system;
The power operation data of the various DGs (distributed generators) in the power system (distribution network) are the data generated by the DGs under various operating conditions. The data may be stored in the electronic device in advance, or may be acquired by the electronic device from outside, for example from an external detection device.

The specific form of acquisition of the power operation data is not limited, as long as the electronic device can obtain the data; a corresponding database can be built for the power operation data of the various DGs in the distribution network under various operating conditions.
Preferably, the power operation data includes: load active power, load reactive power, distributed power source active power, voltage amplitude and voltage phase angle.
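The five features listed above can be grouped into one record per node; a minimal sketch, in which the class and field names are illustrative, not from the patent:

```python
from dataclasses import dataclass

@dataclass
class PowerOperationRecord:
    """One node's operating-point features, as listed in the patent."""
    load_active_power: float      # load P
    load_reactive_power: float    # load Q
    dg_active_power: float        # distributed power source P
    voltage_magnitude: float      # voltage amplitude (per unit)
    voltage_phase_angle: float    # voltage phase angle (radians)

    def as_vector(self):
        # Flatten to the feature order used when feeding the model
        return [self.load_active_power, self.load_reactive_power,
                self.dg_active_power, self.voltage_magnitude,
                self.voltage_phase_angle]
```

Stacking such records over nodes and time steps yields the database of operating conditions referred to above.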
The control module 20 is configured to input the power operation data into the trained voltage control model and obtain the voltage control scheme output by the voltage control model, where the voltage control scheme adjusts the reactive power output by the corresponding distributed power supplies so as to minimize the voltage deviation.
In the embodiment of the invention, the voltage control model is obtained based on historical power operation data and on Markov decision and reinforcement learning processes, with offline agents exploring the environment according to a behavior strategy and network parameters trained using a strategy gradient algorithm.
In the embodiment of the invention, all of the collected historical power operation data are used offline as sample data, providing massive data support for subsequent training. It will be appreciated that the historical power operation data have the same form as the power operation data collected in real time.
The trained voltage control model in the embodiment of the invention minimizes the system voltage deviation by adjusting the reactive power output by DG inverters, such as photovoltaic inverters and wind turbines, in the distribution network.
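The control objective, minimizing the system voltage deviation, can be expressed as a simple penalty over node voltages. The quadratic form and the nominal value of 1.0 per unit are assumptions for illustration; the patent does not give the exact deviation measure here:

```python
def voltage_deviation(voltages, v_nominal=1.0):
    """Sum of squared per-unit deviations from the nominal voltage.

    The voltage control scheme adjusts the reactive power of the DG
    inverters so as to drive this quantity toward zero.
    """
    return sum((v - v_nominal) ** 2 for v in voltages)
```

A reinforcement-learning reward built on this quantity would typically be its negative, so that minimizing deviation maximizes reward.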
According to the voltage control device of the power system, the voltage control model is obtained through training: it is built from historical power operation data on the basis of Markov decision and reinforcement learning processes, with offline agents exploring the environment according to a behavior strategy and network parameters trained using a strategy gradient algorithm. The voltage control scheme obtained from the voltage control model is therefore a superior voltage control strategy, which can effectively improve the power flow distribution, relieve voltage limit violations and stabilize voltage fluctuations, thereby realizing real-time optimal voltage control of a distribution network with a high proportion of new energy access.
Fig. 4 illustrates a physical schematic diagram of an electronic device, as shown in fig. 4, which may include: processor 510, communication interface (Communications Interface) 520, memory 530, and communication bus 540, wherein processor 510, communication interface 520, memory 530 complete communication with each other through communication bus 540. Processor 510 may invoke logic commands in memory 530 to perform a voltage control method of a power system, the method comprising:
acquiring power operation data of a distributed power supply in a power system; the power operation data includes at least: load active power, load reactive power, distributed power source active power, voltage amplitude and voltage phase angle;
Inputting the electric power operation data into a trained voltage control model to obtain a voltage control scheme output by the voltage control model; the voltage control scheme is used for adjusting reactive power correspondingly output by the distributed power supply to control voltage deviation minimization;
The voltage control model is obtained based on historical power operation data, Markov decision and reinforcement learning processes, and offline agents explore the environment based on behavior strategies and train network parameters by using a strategy gradient algorithm.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or alternatively by hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disk, including several commands for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the various embodiments or of some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method of voltage control of an electrical power system, the method comprising:
acquiring power operation data of a distributed power supply in a power system; the power operation data includes at least: load active power, load reactive power, distributed power source active power, voltage amplitude and voltage phase angle;
Inputting the electric power operation data into a trained voltage control model to obtain a voltage control scheme output by the voltage control model; the voltage control scheme is used for adjusting reactive power correspondingly output by the distributed power supply to control voltage deviation minimization;
The voltage control model is obtained based on historical power operation data, Markov decision and reinforcement learning processes, with offline agents exploring the environment based on a behavior strategy and training network parameters by using a strategy gradient algorithm;
the voltage control model is obtained through training the following steps:
acquiring historical power operation data of a distributed power supply in a power system, and constructing a voltage control model for solving the voltage control problem based on the historical power operation data;
determining agents, states, observations, actions and rewards in the voltage control model based on Markov decision and reinforcement learning processes; the agents are distributed power supplies in the power system, and the agents are in one-to-one correspondence with the distributed power supplies;
Constructing a neural network structure of the voltage control model;
offline training is carried out on the intelligent agent, and interaction records of the intelligent agent are determined through continuous interaction between the offline intelligent agent and the environment;
Samples are extracted from the interaction records through experience playback, and neural network parameters are updated based on the samples and a strategy gradient algorithm.
2. The method of voltage control of an electrical power system of claim 1, wherein the voltage control model comprises: a policy network layer and a global value network layer, wherein the policy network layer is equal to the intelligent agents in number and corresponds to the intelligent agents one by one;
The input end of the global value network layer is connected with the output end of the strategy network layer, and the strategy network layer and the global value network layer are both composed of fully-connected neural networks;
the strategy network layer is used for generating individual actions according to local observation of the area where the intelligent agent is located in the environment; the global value network layer is used for scoring actions of all the agents according to the states to determine global value.
3. The method of voltage control of an electrical power system of claim 2, wherein the voltage control model further comprises: the target strategy network layers are equal to the intelligent agents in number and correspond to the intelligent agents one by one, and a target value network layer;
the target strategy network layers are in one-to-one correspondence with the strategy network layers, and the target strategy network layers are connected with the corresponding strategy network layers; the target value network layer corresponds to and is connected with the global value network layer;
the target strategy network layer and the target value network layer are both composed of fully-connected neural networks.
4. The method according to claim 2, wherein determining agents, environments, states, observations, actions and rewards in the voltage control model based on Markov decision and reinforcement learning processes comprises:
Determining an agent and an environment; the environment is a power distribution network environment of the power system, and the intelligent body is a distributed power supply in the power system;
Determining a state; the state is a node characteristic quantity, and the node characteristic quantity at least comprises: load active power, load reactive power, distributed power source active power, voltage amplitude and voltage phase angle;
Determining an observation; the observation consists of local observation of each intelligent agent, wherein the local observation is node characteristic quantity in an area where the intelligent agent observes;
Determining an action; the action is composed of individual actions of each intelligent agent, wherein the individual actions are distributed power supply reactive power corresponding to the intelligent agent at the current moment;
determining a reward; the intelligent agents are in complete cooperative relationship and share rewards.
5. The method for controlling voltage of power system according to claim 4, wherein the step of performing offline training on the agent and determining the interaction record of the agent by continuously interacting the offline agent with the environment comprises:
determining the individual action output by the strategy network layer corresponding to the intelligent agent, and adding noise for the individual action; the noise is randomly extracted from Gaussian distribution with a mean value set to 0 and a standard deviation set to a preset value;
offline training is carried out on the intelligent agent, and an interaction record generated by interaction between the offline intelligent agent and the environment is recorded; the interaction record comprises the state of the agent at the current moment, the observation of the current moment, the action of the current moment, the rewards of the current moment, the state of the next moment adjacent to the current moment and the observation of the next moment adjacent to the current moment.
6. The method according to claim 1, wherein samples are extracted from the interaction records by empirical playback and neural network parameters are updated based on the samples and a policy gradient algorithm, comprising:
storing the interaction records into an experience playback array, and extracting at least one interaction record from the experience playback array as a sample;
The method comprises the steps of sequentially determining target individual actions and target values at the next moment adjacent to the current moment according to the observation of the next moment adjacent to the current moment of an intelligent agent in a sample and according to a corresponding target strategy layer and a target value network layer, and determining a time difference target at the current moment according to the target individual actions and the target values at the next moment adjacent to the current moment;
Determining the value of the intelligent agent at the current moment through a global value network layer according to the state and the action of the intelligent agent in the sample, and determining a time difference error according to the value at the current moment and a time difference target;
According to the time difference errors respectively corresponding to the intelligent agents in the sample, updating network parameters of the global value network layer by using a gradient descent algorithm;
The method comprises the steps of observing the current moment of an agent in a sample, determining the action of the agent at the current moment by utilizing a strategy network layer, and determining the value of the current moment by combining the historical actions of agents except the agent in the sample;
determining policy gradients of the agents in the sample according to a chained rule, and updating network parameters of a policy network layer by combining a gradient rising algorithm according to the average value of the policy gradients corresponding to each agent in the sample;
based on a preset soft update factor, updating network parameters of the target strategy network layer and the target value network layer by using a soft update algorithm.
7. A voltage control device for an electrical power system, the device comprising:
the acquisition module is used for acquiring power operation data of the distributed power supply in the power system; the power operation data includes at least: load active power, load reactive power, distributed power source active power, distributed power source reactive power, voltage amplitude and voltage phase angle;
The control module is used for inputting the electric power operation data into a trained voltage control model to obtain a voltage control scheme output by the voltage control model; the voltage control scheme is used for adjusting reactive power correspondingly output by the distributed power supply to control voltage deviation minimization;
the voltage control model is obtained based on historical power operation data, on the basis of Markov decision and reinforcement learning processes, by utilizing a behavior strategy to explore and utilizing a strategy gradient algorithm to train network parameters;
the voltage control model is obtained through training the following steps:
acquiring historical power operation data of a distributed power supply in a power system, and constructing a voltage control model for solving the voltage control problem based on the historical power operation data;
determining agents, states, observations, actions and rewards in the voltage control model based on Markov decision and reinforcement learning processes; the agents are distributed power supplies in the power system, and the agents are in one-to-one correspondence with the distributed power supplies;
Constructing a neural network structure of the voltage control model;
offline training is carried out on the intelligent agent, and interaction records of the intelligent agent are determined through continuous interaction between the offline intelligent agent and the environment;
Samples are extracted from the interaction records through experience playback, and neural network parameters are updated based on the samples and a strategy gradient algorithm.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the voltage control method of the power system according to any one of claims 1 to 6 when the program is executed.
9. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the voltage control method of the power system according to any one of claims 1 to 6.
CN202410420112.6A 2024-04-09 2024-04-09 Voltage control method, device, equipment and medium for electric power system Pending CN118017523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410420112.6A CN118017523A (en) 2024-04-09 2024-04-09 Voltage control method, device, equipment and medium for electric power system

Publications (1)

Publication Number Publication Date
CN118017523A true CN118017523A (en) 2024-05-10




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination