CN117833353A - Simulation training method, device and equipment for power grid active control intelligent agent

Simulation training method, device and equipment for power grid active control intelligent agent

Info

Publication number
CN117833353A
Authority
CN
China
Prior art keywords
state data
power grid
initial
active control
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311630737.7A
Other languages
Chinese (zh)
Inventor
周毅
沈维健
周良才
陈清
高佳宁
范栋琦
闪鑫
王波
王天禄
骆玮
徐峰
徐希
李雷
郑义明
孙小磊
刘理达
孙志豪
余飞翔
陆廷骧
吴自博
张楷
杨永瑞
夏正国
周志涛
李林鑫
滕书宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Branch Of State Grid Corp ltd
NARI Nanjing Control System Co Ltd
Original Assignee
East China Branch Of State Grid Corp ltd
NARI Nanjing Control System Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Branch Of State Grid Corp ltd, NARI Nanjing Control System Co Ltd filed Critical East China Branch Of State Grid Corp ltd
Priority to CN202311630737.7A
Publication of CN117833353A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a simulation training method, device and equipment for a power grid active control agent, relates to the technical field of power system simulation, and can solve the problem that an agent trained by deep reinforcement learning cannot be applied to an actual power grid model topology. The method comprises the following steps: acquiring an agent constructed based on deep reinforcement learning; constructing a power grid simulation environment based on an actual power grid model topology and historical operation data; acquiring at least one set of initial state data of power grid equipment in the power grid simulation environment; training the agent to generate an initial active control action strategy; if the initial active control action strategy meets a preset check condition, updating the initial state data according to the strategy to obtain the next state data; calculating an initial action reward value from the initial state data and the next state data; and continuing to train the agent with the next state data and the initial action reward value until a preset stopping condition is reached, thereby obtaining the trained agent.

Description

Simulation training method, device and equipment for power grid active control intelligent agent
Technical Field
The invention relates to the technical field of power system simulation, and in particular to a simulation training method, device and equipment for a power grid active control agent.
Background
In recent years, with the rapid development of big data analytics and new-generation artificial intelligence technology, using data-driven artificial intelligence to address the new problems and challenges facing the power grid has become a new means and method in the construction of modern power systems. Deep reinforcement learning, in particular, plays an important role in decision support for power grid dispatching.
To study the application of deep reinforcement learning to power grid dispatching, laboratory-level simulation environments have been built on IEEE benchmark topologies to verify its effectiveness in grid regulation and control. However, the characteristics of an actual power grid differ markedly from the IEEE benchmark topologies, so an agent trained in this way cannot adapt to a grid model topology that is inconsistent with its training environment, which makes it difficult to deploy deep reinforcement learning in a real system. How to train an agent that meets actual power grid dispatching requirements is therefore a pressing problem.
Disclosure of Invention
In view of the above, the invention provides a simulation training method, device and equipment for a power grid active control agent, which can solve the technical problem that an agent trained by deep reinforcement learning cannot be applied to an actual power grid model topology.
According to a first aspect of the present invention, there is provided a simulation training method of an active control agent for a power grid, the method comprising:
acquiring an agent constructed based on deep reinforcement learning, constructing a power grid simulation environment based on an actual power grid model topology and historical operation data, acquiring at least one set of initial state data of power grid equipment in the power grid simulation environment, training the agent with one set of initial state data, and generating an initial active control action strategy for the current moment;
judging whether the initial active control action strategy meets a preset check condition, and if so, updating the initial state data according to the initial active control action strategy to obtain the next state data for the next moment;
and calculating an initial action reward value from the initial state data and the next state data, and continuing to train the agent with the next state data and the initial action reward value until a preset stopping condition is reached, thereby obtaining the trained agent.
Preferably, the method further comprises:
and if the initial active control action strategy does not meet the preset check condition, training the intelligent agent again by using the initial state data until the initial active control action strategy meets the preset check condition.
Preferably, the judging whether the initial active control action strategy meets a preset check condition includes:
judging whether the length of the action list in the initial active control action strategy is equal to the number of units in the current environment; and
judging whether each unit's current active output, after the active adjustment amount in the initial active control action strategy is superimposed on it, remains within the unit's output limits.
Preferably, the method further comprises:
and setting the preset stopping condition, wherein the preset stopping condition includes the number of sets of initial state data and the number of training iterations per set.
Preferably, after the trained agent is obtained, the method further comprises:
acquiring state data to be adjusted of the power grid equipment;
inputting the state data to be adjusted into the trained intelligent agent to generate an optimal active control action strategy;
and updating the state data to be adjusted according to the optimal active control action strategy.
Preferably, the acquiring at least one set of initial state data of the power grid equipment includes:
and acquiring different sets of initial state data of the power grid equipment corresponding to different preset moments according to the actual power grid model topological structure and the historical operation data.
According to a second aspect of the present invention, there is provided a simulation training apparatus for a power grid active control agent, the apparatus comprising:
an acquisition module, configured to acquire an agent constructed based on deep reinforcement learning, construct a power grid simulation environment based on an actual power grid model topology and historical operation data, acquire at least one set of initial state data of power grid equipment in the power grid simulation environment, train the agent with one set of initial state data, and generate an initial active control action strategy for the current moment;
a judging module, configured to judge whether the initial active control action strategy meets a preset check condition, and if so, update the initial state data according to the initial active control action strategy to obtain the next state data for the next moment;
and a first training module, configured to calculate an initial action reward value from the initial state data and the next state data, and continue to train the agent with the next state data and the initial action reward value until a preset stopping condition is reached, thereby obtaining the trained agent.
Preferably, the apparatus further comprises:
and the second training module is used for training the intelligent agent again by using the initial state data if the initial active control action strategy does not meet the preset check condition until the initial active control action strategy meets the preset check condition.
According to a third aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described method for simulation training of an active control agent of a power grid.
According to a fourth aspect of the present application, there is provided a computer device, including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, where the processor implements the simulation training method of the power grid active control agent when executing the program.
By means of the above technical scheme, the simulation training method, device and equipment for a power grid active control agent provided by the invention first acquire an agent constructed based on deep reinforcement learning, construct a power grid simulation environment based on an actual power grid model topology and historical operation data, acquire at least one set of initial state data of power grid equipment in the power grid simulation environment, train the agent with one set of initial state data, and generate an initial active control action strategy for the current moment. It is then judged whether the initial active control action strategy meets a preset check condition; if so, the initial state data is updated according to the initial active control action strategy to obtain the next state data for the next moment. Finally, an initial action reward value is calculated from the initial state data and the next state data, and the agent continues to be trained with the next state data and the initial action reward value until a preset stopping condition is reached, yielding the trained agent. In this technical scheme, on the one hand, the simulation training environment is built from the actual power grid model topology and historical operation data, so the actual power grid data (that is, the at least one set of initial state data of the power grid equipment) is obtained from that topology and those data; on the other hand, the grid state data is updated by executing the active control action strategies output by the agent, realizing interaction between the agent and the simulation environment. The state data and action reward values calculated in the invention therefore conform to actual power grid characteristics, and iteratively training the agent with them enables the trained agent to meet actual power grid dispatching requirements, giving it strong adaptability in actual grid applications. In the prior art, by contrast, a laboratory-level simulation environment is built on an IEEE benchmark topology; because that topology differs from an actual power grid, an agent trained in such an environment cannot adapt to a topology inconsistent with its training environment, cannot meet actual power grid dispatching requirements, and adapts poorly in actual grid applications.
The foregoing is merely an overview of the technical scheme of the invention. To make the technical means of the invention clearer, so that it can be implemented in accordance with the description, and to make the above and other objects, features and advantages of the invention more readily apparent, specific embodiments of the invention are set forth below.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention; they do not constitute an undue limitation of the present application. In the drawings:
fig. 1 shows a schematic flow chart of a simulation training method for a power grid active control agent according to an embodiment of the invention;
fig. 2 shows a schematic flow chart of another simulation training method for a power grid active control agent according to an embodiment of the invention;
fig. 3 shows a schematic structural diagram of a simulation training device for a power grid active control agent according to an embodiment of the invention;
fig. 4 shows a schematic structural diagram of another simulation training device for a power grid active control agent according to an embodiment of the invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
This embodiment provides a simulation training method for a power grid active control agent; as shown in fig. 1, the method comprises the following steps:
101. Acquire an agent constructed based on deep reinforcement learning, construct a power grid simulation environment based on an actual power grid model topology and historical operation data, acquire at least one set of initial state data of power grid equipment in the power grid simulation environment, train the agent with one set of initial state data, and generate an initial active control action strategy for the current moment.
It should be noted that deep reinforcement learning describes and solves the problem of an agent learning a strategy that maximizes return, or achieves a specific goal, while interacting with an environment. The agent learns by trial and error: it obtains rewards through interaction with the environment and aims to maximize the cumulative reward. In deep reinforcement learning, the environment evaluates how good the agent's actions are rather than telling the agent how to produce correct actions; in this way, the agent gains knowledge in an act-and-evaluate loop and improves its action plan to suit the environment. Learning is thus treated as an exploratory evaluation process: the agent selects an action, the environment's state changes upon receiving it, and a reward or penalty signal is generated and fed back to the agent, which selects the next action based on that signal and the environment's current state, preferring actions that have been positively reinforced (rewarded). The selected action affects not only the immediate reward value but also the environment's state at the next moment, and hence the final reward. In other words, training the agent means learning what action strategy to output so that the initial state can be adjusted by that strategy toward a specific target state.
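The interaction loop just described can be made concrete with a short sketch. The following Python fragment is only a minimal illustration of the act-evaluate-feedback cycle; the env and agent objects and their reset, step, select_action and learn methods are hypothetical placeholders, not interfaces defined by the invention.

```python
# Minimal sketch of the agent-environment loop described above (names are assumed).
def run_episode(env, agent, max_steps):
    state = env.reset()                        # start from a set of initial state data
    for _ in range(max_steps):
        action = agent.select_action(state)    # the agent outputs an action strategy
        next_state, reward = env.step(action)  # the environment changes state and
                                               # evaluates the action with a reward
        agent.learn(state, action, reward, next_state)  # feedback reinforces actions
        state = next_state                     # that move toward the target state
```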
Note that the agent constructed based on deep reinforcement learning has not yet been trained at this point.
Constructing the power grid simulation environment based on the actual power grid model topology and historical operation data means storing the actual power grid model topology and historical operation data in an environment, taking that environment as the initial power grid simulation environment, setting initial values for the simulation environment parameters, and setting the simulation environment state space. In the resulting power grid simulation environment, at least one set of initial state data of the power grid equipment can be obtained from the actual power grid model topology and the historical operation data, and the initial active control action strategy generated by the agent can be executed in simulation on the actual power grid model topology, so that the strategy updates the initial state data of the power grid equipment to the next state data.
The actual power grid model topology includes: for each line, at least one of voltage class, thermal stability limit and power limit; for each bus, at least one of voltage class, upper voltage limit, lower voltage limit and the station to which it belongs; for each transformer, at least one of voltage class, thermal stability limit, power limit and the station to which it belongs; as well as the line-bus connection relationships, the transformer-bus connection relationships, and the section composition and power limits. The historical operation data includes: for each line, at least one of number, name, active value, reactive value and current value; for each main transformer, at least one of number, name, active value, reactive value and current value; and for each bus, at least one of number, name, active value and reactive value. The power grid simulation environment parameters include the number of decimal digits retained for the current moment and the allowed precision of action strategies. The power grid simulation environment state space includes: the current moment, the current step count, the active and reactive power of each unit/line/bus, the convergence status of the line power flow, the on/off state of each unit, and the bus voltage amplitudes and phase angles.
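To make the enumeration above easier to follow, the sketch below lays the simulation environment state space out as a Python dataclass. The field names and types are assumptions chosen to mirror the items listed; the invention does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field

@dataclass
class GridEnvState:
    """Assumed layout of the power grid simulation environment state space."""
    current_time: str                                # current moment
    step_count: int                                  # current step count
    unit_p: list = field(default_factory=list)       # active power of each unit
    unit_q: list = field(default_factory=list)       # reactive power of each unit
    line_p: list = field(default_factory=list)       # active power of each line
    line_q: list = field(default_factory=list)       # reactive power of each line
    bus_v_mag: list = field(default_factory=list)    # bus voltage amplitudes
    bus_v_angle: list = field(default_factory=list)  # bus voltage phase angles
    unit_on: list = field(default_factory=list)      # on/off state of each unit
    flow_converged: bool = True                      # line power flow convergence
```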
As for obtaining at least one set of initial state data of the power grid equipment from the actual power grid model topology and the historical operation data: the topology consists of nodes and the connections between them, which together constitute the power grid equipment, and the historical operation data is the historical operation data of that equipment. Several preset moments can therefore be selected, and the topology together with the historical operation data at each preset moment corresponds to one set of initial state data of the power grid equipment, thereby yielding at least one set of initial state data.
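Under the assumption that the historical operation data is indexed by timestamp, the selection of preset moments can be sketched as follows; the record field names reuse the GridEnvState sketch above and are likewise hypothetical.

```python
# Assumed sketch: one preset moment -> one set of initial state data.
def build_initial_states(history, preset_times):
    """history maps a timestamp to the equipment measurements at that moment."""
    initial_states = []
    for t in preset_times:
        record = history[t]            # line/transformer/bus values at moment t
        initial_states.append(GridEnvState(
            current_time=t,
            step_count=0,
            unit_p=record["unit_p"],
            line_p=record["line_p"],
            bus_v_mag=record["bus_v_mag"],
        ))
    return initial_states
```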
102. Judge whether the initial active control action strategy meets a preset check condition, and if so, update the initial state data according to the initial active control action strategy to obtain the next state data for the next moment.
103. Calculate an initial action reward value from the initial state data and the next state data, and continue to train the agent with the next state data and the initial action reward value until a preset stopping condition is reached, thereby obtaining the trained agent.
For steps 101-103 of this embodiment: one preset moment is taken as the current moment and the corresponding set of initial state data is input into the agent; after training on that set is completed, the other sets are trained in turn until all sets have been used. The agent needs to reach a specific target state through the action strategies it outputs. When a set of initial state data is input, the agent, seeking that target state, outputs an initial active control action strategy (initially at random); after the strategy is executed, the next state data for the next moment is obtained. Ideally the next state data would be the target state, but in general it is not, so the initial active control action strategy output by the agent is evaluated, for example on whether the next state data has moved closer to the target state than the initial state data. The evaluation is measured by the initial action reward value, which is fed back to the agent; the agent receives this feedback together with the next state data and continues training on that basis.
In summary, the training process of the agent on the initial state data is described in detail below:
Multiple sets of initial state data are acquired for different preset moments (one preset moment corresponds to one set). One preset moment is taken as the current moment t and the corresponding set of initial state data is input into the agent, which outputs an initial active control action strategy (initially at random). If the initial active control action strategy meets the preset check condition, it is executed: the power grid equipment is updated from the initial state data at moment t to the next state data at moment t+1, and the initial action reward value is calculated from the state data at t and at t+1. The next state data and the initial action reward value are then input into the agent for continued training; the agent outputs a second active control action strategy, and if it meets the preset check condition it is executed, updating the equipment from the state data at t+1 to the state data at t+2, with the second action reward value calculated from the state data at t+1 and t+2, and so on until the preset number of training iterations for that set is reached. Similarly, another preset moment is then taken as the current moment and the corresponding set of initial state data is input into the agent; the training procedure is the same and is not repeated here. When all sets of initial state data have been trained, the agent obtained through this iterative training process is the trained agent.
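The iteration just described, one set of initial state data per preset moment with repeated strategy generation, checking, state update and reward feedback, can be summarized in the following sketch. check_action, grid_step and compute_reward are placeholders standing in for the corresponding steps of the embodiment, not code disclosed by the invention; the retraining branch follows option 1 of step 204 below (retry on the same set).

```python
# Assumed training-loop sketch: outer loop over sets, inner loop over moments t, t+1, ...
def train(agent, initial_state_sets, iters_per_set):
    for initial_state in initial_state_sets:       # one preset moment per set
        state = initial_state
        for _ in range(iters_per_set):
            action = agent.select_action(state)    # active control action strategy
            if not check_action(action, state):    # preset check condition
                continue                           # retrain on the same state data
            next_state = grid_step(state, action)  # t -> t+1 state update
            reward = compute_reward(state, next_state)
            agent.learn(state, action, reward, next_state)
            state = next_state
    return agent                                   # the trained agent
```

The loop bounds also encode the preset stopping condition of step 205 below: the number of sets of initial state data and the number of training iterations per set.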
With the simulation training method provided by this embodiment, the training environment is built from the actual power grid model topology and historical operation data, so the initial state data comes from the actual grid, and the agent interacts with the environment by executing the active control action strategies it outputs. The state data and action reward values obtained in this way conform to actual power grid characteristics, and iteratively training the agent on them yields a trained agent adapted to actual power grid dispatching requirements, with strong adaptability in actual grid applications; an agent trained in a laboratory-level simulation environment built on an IEEE benchmark topology, by contrast, cannot adapt to a topology inconsistent with its training environment.
Further, as a refinement and extension of the specific implementation of the foregoing embodiment, and to fully describe the specific implementation process, this embodiment provides another simulation training method for a power grid active control agent, as shown in fig. 2, the method comprising:
201. Construct a power grid simulation environment based on an actual power grid model topology and historical operation data, and, in the power grid simulation environment, acquire different sets of initial state data of the power grid equipment corresponding to different preset moments according to the topology and the historical operation data.
202. Acquire an agent constructed based on deep reinforcement learning, train the agent with one set of initial state data, and generate an initial active control action strategy for the current moment.
The specific implementation of steps 201 and 202 is the same as that of step 101 and is not repeated here.
203. Judge whether the initial active control action strategy meets a preset check condition, and if so, update the initial state data according to the initial active control action strategy to obtain the next state data for the next moment.
The action strategy consists of a unit active power adjustment list ΔP_g and an adjustable load adjustment list ΔP_l; specifically, ΔP_g = [ΔP_g1, ΔP_g2, …, ΔP_gn] and ΔP_l = [ΔP_l1, ΔP_l2, …, ΔP_lm], where n is the number of units and m is the number of adjustable loads.
In the application scenario of this embodiment, an active control action strategy means that the generated action strategy performs active power control.
For this embodiment, as one implementation, judging whether the initial active control action strategy meets a preset check condition includes: checking the dimension of the agent's action list, that is, judging whether the action list length in the initial active control action strategy equals the number of units in the current environment; and checking the adjustment value limits, that is, judging whether each unit's current active output, after the active adjustment amount is superimposed on it, remains within the unit's output limits.
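Below is a standalone sketch of these two checks, assuming the unit portion of the action list is the ΔP_g vector described above and that the units' current outputs and limits are available as equal-length lists. All names are illustrative, and this is one possible form of the check_action placeholder used in the earlier training-loop sketch.

```python
# Assumed sketch of the preset check condition (names are illustrative).
def check_action(dp_units, n_units, unit_p_now, unit_p_min, unit_p_max):
    # 1) Dimension check: the unit action list must match the number of units
    #    in the current environment.
    if len(dp_units) != n_units:
        return False
    # 2) Limit check: each unit's current active output plus its adjustment
    #    must remain within that unit's output limits.
    for p_now, dp, p_min, p_max in zip(unit_p_now, dp_units, unit_p_min, unit_p_max):
        if not (p_min <= p_now + dp <= p_max):
            return False
    return True
```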
204. If the initial active control action strategy does not meet the preset check condition, train the agent again with the initial state data until the initial active control action strategy meets the preset check condition.
For this embodiment, the retraining policy applied when the preset check condition is not satisfied may be preset, and may be either (1) retraining without switching to another set of initial state data, or (2) switching to another set of initial state data. Retraining the agent with the initial state data then means retraining the agent according to this preset retraining policy and the initial state data.
Specifically, there are multiple sets of initial state data. Suppose an agent trained with a certain set produces an initial active control action strategy that does not meet the preset check condition. As one implementation, since the action strategy output by the agent for a given input is random, a strategy that fails the check may be followed, on retraining with the same set of data, by one that passes; there is thus no need to switch to another set, and the agent is retrained with the same set of initial state data. As another implementation, to improve the efficiency of action strategy generation, another set of initial state data is used when the initial active control action strategy does not meet the preset check condition.
It should be noted that whether an active control action strategy satisfies the preset check condition only needs to be determined before that strategy is used to update the state data.
205. Set the preset stopping condition, where the preset stopping condition includes the number of sets of initial state data and the number of training iterations per set.
For example, if there are 5 sets of initial state data and each set is trained 100 times, then once every set has completed its 100 training iterations the preset stopping condition is reached, training stops, and the trained agent is obtained.
206. Calculate an initial action reward value from the initial state data and the next state data, and continue to train the agent with the next state data and the initial action reward value until the preset stopping condition is reached, thereby obtaining the trained agent.
It should be noted that after a set of initial state data is trained for the first time, the corresponding initial action reward value is calculated from the initial state data (moment t) and the next state data (moment t+1); after the second training iteration, the corresponding second action reward value is calculated from the state data at moment t+1 and the state data at moment t+2, and so on. The quantities entering the reward calculation are as follows: C_sys is an index measuring the degree of system power flow imbalance, where P_i^or and P_i^ex are the active power measured at the sending end and the receiving end of the i-th line and K is the number of system lines; L is the number of stability sections of the system, P_inter_i is the power flow of the i-th section, and P̄_inter_i is the stability limit of the i-th section; D_of is the system section term, a sum of squares over the sections; D_p is the system power imbalance, where n is the number of generators and m is the number of adjustable loads; P_loss is the system power shortage; E_1 is a positive offset coefficient, used to make the reward positive when the corresponding condition is satisfied; and E_2-E_4 are positive constants that adjust the ratios between the respective components and limit the range of reward values.
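The closed-form expression of the reward is not reproduced in the text above, so the sketch below is purely an assumed illustration: it combines the defined quantities as a positive offset E_1 minus weighted penalty terms, with E_2-E_4 as the weighting constants. Only the symbol meanings come from the passage above; the combination form, the use of absolute values, and the default weights are assumptions.

```python
# Assumed reward sketch; the patent's actual formula is not reproduced in this text.
def compute_reward(line_p_or, line_p_ex, section_flows, section_limits,
                   p_loss, E1=1.0, E2=0.1, E3=0.1, E4=0.1):
    # C_sys: degree of system power flow imbalance, from the active power at the
    # sending and receiving ends of each of the K lines.
    c_sys = sum(abs(p_or - p_ex) for p_or, p_ex in zip(line_p_or, line_p_ex))
    # D_of: sum of squares over the L stability sections (flow vs. stability limit).
    d_of = sum((p / p_lim) ** 2 for p, p_lim in zip(section_flows, section_limits))
    # D_p: system power imbalance, represented here by the power shortage P_loss.
    d_p = abs(p_loss)
    return E1 - E2 * c_sys - E3 * d_of - E4 * d_p
```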
207. Acquire the state data to be adjusted of the power grid equipment, input it into the trained agent to generate an optimal active control action strategy, and update the state data to be adjusted according to the optimal active control action strategy.
It should be noted that in steps 201-206 training is completed automatically in the power grid simulation environment, so the trained agent is more accurate in actual power grid applications. In this embodiment, after the trained agent is obtained, the state data to be adjusted is input; the trained agent can output the optimal active control action strategy for that input, and executing this strategy updates the state data to be adjusted so that the adjusted state data reaches the target state.
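As a final sketch, step 207 might look as follows in code; apply_strategy is a hypothetical stand-in for executing the optimal active control action strategy on the grid equipment.

```python
# Assumed sketch of applying the trained agent (step 207).
def adjust(trained_agent, state_to_adjust):
    strategy = trained_agent.select_action(state_to_adjust)    # optimal action strategy
    adjusted_state = apply_strategy(state_to_adjust, strategy) # execute the strategy
    return adjusted_state   # state data updated toward the target state
```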
As with the first embodiment, the training environment of this embodiment is built from the actual power grid model topology and historical operation data, and the agent interacts with it by executing the active control action strategies it outputs; the resulting state data and action reward values conform to actual power grid characteristics, so iterative training yields a trained agent adapted to actual power grid dispatching requirements, unlike agents trained in laboratory-level environments built on IEEE benchmark topologies.
Further, as a specific implementation of the methods shown in fig. 1 and fig. 2, an embodiment of the invention provides a simulation training device for a power grid active control agent, as shown in fig. 3, where the device includes an acquisition module 31, a judging module 32 and a first training module 33;
the acquisition module 31 is configured to acquire an agent constructed based on deep reinforcement learning, construct a power grid simulation environment based on an actual power grid model topology and historical operation data, acquire at least one set of initial state data of power grid equipment in the power grid simulation environment, train the agent with one set of initial state data, and generate an initial active control action strategy for the current moment;
the judging module 32 is configured to judge whether the initial active control action strategy meets a preset check condition, and if so, update the initial state data according to the initial active control action strategy to obtain the next state data for the next moment;
the first training module 33 is configured to calculate an initial action reward value from the initial state data and the next state data, and continue to train the agent with the next state data and the initial action reward value until a preset stopping condition is reached, thereby obtaining the trained agent.
In a specific application scenario, as shown in fig. 4, the device further includes a second training module 34, configured to retrain the agent with the initial state data if the initial active control action strategy does not meet the preset check condition, until the initial active control action strategy meets the preset check condition.
Correspondingly, to judge whether the initial active control action strategy meets a preset check condition, the judging module 32 may specifically be configured to judge whether the action list length in the initial active control action strategy equals the number of units in the current environment, and to judge whether each unit's current active output, after the active adjustment amount is superimposed on it, remains within the unit's output limits.
In a specific application scenario, as shown in fig. 4, the device further includes a setting module 35, configured to set the preset stopping condition, where the preset stopping condition includes the number of sets of initial state data and the number of training iterations per set.
In a specific application scenario, as shown in fig. 4, the device further includes an application module 36, configured to acquire the state data to be adjusted of the power grid equipment, input the state data to be adjusted into the trained agent to generate an optimal active control action strategy, and update the state data to be adjusted according to the optimal active control action strategy.
Correspondingly, to acquire at least one set of initial state data of the power grid equipment, the acquisition module 31 may specifically be further configured to acquire different sets of initial state data of the power grid equipment corresponding to different preset moments according to the actual power grid model topology and the historical operation data.
It should be noted that, for other descriptions of the functional units of the simulation training device for a power grid active control agent provided in this embodiment, reference may be made to the corresponding descriptions of fig. 1 and fig. 2, which are not repeated here.
Based on the methods shown in fig. 1 and fig. 2, this embodiment correspondingly further provides a storage medium, which may be volatile or non-volatile, having a computer program stored thereon; when executed by a processor, the program implements the simulation training method for a power grid active control agent shown in fig. 1 and fig. 2.
Based on such an understanding, the technical scheme of the invention may be embodied in the form of a software product, which may be stored in a storage medium (such as a CD-ROM, a USB flash drive or a removable hard disk) and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method of each implementation scenario of the invention.
Based on the methods shown in fig. 1 and fig. 2 and the virtual device embodiments shown in fig. 3 and fig. 4, and to achieve the above object, this embodiment further provides a computer device comprising a storage medium and a processor: the storage medium stores a computer program, and the processor executes the computer program to implement the simulation training method for a power grid active control agent shown in fig. 1 and fig. 2.
Optionally, the computer device may further include a user interface, a network interface, a camera, radio frequency (RF) circuitry, sensors, audio circuitry, a Wi-Fi module, and the like. The user interface may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a USB interface, a card-reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (such as a Wi-Fi interface), and the like.
Those skilled in the art will appreciate that the computer device structure provided in this embodiment does not limit the physical device, which may include more or fewer components, combine certain components, or arrange the components differently.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the computer device described above, supporting the execution of the information processing program and other software and/or programs. The network communication module is used for communication among the components within the storage medium, as well as with other hardware and software in the information processing entity device.
From the above description of the embodiments, it will be apparent to those skilled in the art that the invention may be implemented by means of software plus a necessary general-purpose hardware platform, or by hardware.
Those skilled in the art will appreciate that the drawings are merely schematic illustrations of preferred implementation scenarios, and that the modules or flows in the drawings are not necessarily required to practice the invention. Those skilled in the art will also appreciate that the modules of a device in an implementation scenario may be distributed in the device of that scenario as described, or may be located, with corresponding changes, in one or more devices different from that scenario; the modules of an implementation scenario may be combined into one module or further split into a plurality of sub-modules.
The above sequence numbers are merely for description and do not indicate any ranking of the implementation scenarios. The foregoing disclosure is merely illustrative of some embodiments of the invention, and the invention is not limited thereto; modifications may be made by those skilled in the art without departing from the scope of the invention.

Claims (10)

1. A simulation training method for a power grid active control intelligent agent, characterized by comprising the following steps:
acquiring an agent constructed based on deep reinforcement learning, constructing a power grid simulation environment based on an actual power grid model topological structure and historical operation data, acquiring at least one set of initial state data of power grid equipment in the power grid simulation environment, training the agent by using one set of initial state data, and generating an initial active control action strategy at the current moment;
judging whether the initial active control action strategy meets a preset check condition, if so, updating the initial state data according to the initial active control action strategy to obtain next state data at the next moment;
and calculating an initial action reward value according to the initial state data and the next state data, and continuing to train the intelligent agent by using the next state data and the initial action reward value until a preset stopping condition is reached, so as to obtain the trained intelligent agent.
2. The method according to claim 1, wherein the method further comprises:
and if the initial active control action strategy does not meet the preset check condition, training the intelligent agent again by using the initial state data until the initial active control action strategy meets the preset check condition.
3. The method according to claim 1 or 2, wherein said judging whether the initial active control action strategy meets a preset check condition comprises:
judging whether the length of the action list in the initial active control action strategy is equal to the number of units in the current environment; and
judging whether each unit's current active output, after the active adjustment amount in the initial active control action strategy is superimposed on it, remains within the unit's output limits.
4. The method according to claim 1, wherein the method further comprises:
and setting the preset stopping condition, wherein the preset stopping condition comprises the number of sets of the initial state data and the number of training iterations per set.
5. The method of claim 1, wherein after the trained agent is obtained, the method further comprises:
acquiring state data to be adjusted of the power grid equipment;
inputting the state data to be adjusted into the trained intelligent agent to generate an optimal active control action strategy;
and updating the state data to be adjusted according to the optimal active control action strategy.
6. The method of claim 1, wherein the acquiring at least one set of initial state data of the power grid equipment comprises:
and acquiring different sets of initial state data of the power grid equipment corresponding to different preset moments according to the actual power grid model topological structure and the historical operation data.
7. A simulation training apparatus for an active control agent of a power grid, the apparatus comprising:
an acquisition module, configured to acquire an agent constructed based on deep reinforcement learning, construct a power grid simulation environment based on an actual power grid model topological structure and historical operation data, acquire at least one set of initial state data of power grid equipment in the power grid simulation environment, train the agent with one set of initial state data, and generate an initial active control action strategy for the current moment;
a judging module, configured to judge whether the initial active control action strategy meets a preset check condition, and if so, update the initial state data according to the initial active control action strategy to obtain the next state data for the next moment;
and a first training module, configured to calculate an initial action reward value according to the initial state data and the next state data, and continue to train the intelligent agent by using the next state data and the initial action reward value until a preset stopping condition is reached, so as to obtain the trained intelligent agent.
8. The apparatus of claim 7, wherein the apparatus further comprises:
and a second training module, configured to train the intelligent agent again by using the initial state data if the initial active control action strategy does not meet the preset check condition, until the initial active control action strategy meets the preset check condition.
9. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the simulation training method for a power grid active control agent according to any one of claims 1 to 6.
10. A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the simulation training method for a power grid active control agent according to any one of claims 1 to 6 when executing the computer program.
CN202311630737.7A 2023-11-30 2023-11-30 Simulation training method, device and equipment for power grid active control intelligent agent Pending CN117833353A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311630737.7A CN117833353A (en) 2023-11-30 2023-11-30 Simulation training method, device and equipment for power grid active control intelligent agent

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311630737.7A CN117833353A (en) 2023-11-30 2023-11-30 Simulation training method, device and equipment for power grid active control intelligent agent

Publications (1)

Publication Number Publication Date
CN117833353A 2024-04-05

Family

ID=90504875

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311630737.7A Pending CN117833353A (en) 2023-11-30 2023-11-30 Simulation training method, device and equipment for power grid active control intelligent agent

Country Status (1)

Country Link
CN (1) CN117833353A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112615379A (en) * 2020-12-10 2021-04-06 浙江大学 Power grid multi-section power automatic control method based on distributed multi-agent reinforcement learning
US20210356923A1 (en) * 2020-05-15 2021-11-18 Tsinghua University Power grid reactive voltage control method based on two-stage deep reinforcement learning
CN113964884A (en) * 2021-11-17 2022-01-21 国家电网有限公司华东分部 Power grid active frequency regulation and control method based on deep reinforcement learning
WO2022077693A1 (en) * 2020-10-15 2022-04-21 中国科学院深圳先进技术研究院 Load prediction model training method and apparatus, storage medium, and device
CN115293052A (en) * 2022-09-01 2022-11-04 国家电网有限公司华北分部 Power system active power flow online optimization control method, storage medium and device
CN115833147A (en) * 2022-12-12 2023-03-21 广东电网有限责任公司 Reactive voltage optimization method, device, equipment and medium based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination