CN113627533B - Power equipment overhaul decision generation method based on reinforcement learning - Google Patents

Power equipment overhaul decision generation method based on reinforcement learning Download PDF

Info

Publication number
CN113627533B
CN113627533B · CN202110920543.5A · CN202110920543A
Authority
CN
China
Prior art keywords
power grid
power
loss
overall operation
power equipment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110920543.5A
Other languages
Chinese (zh)
Other versions
CN113627533A (en
Inventor
李睿凡
王泽元
杜一帆
熊永平
刘子全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Beijing University of Posts and Telecommunications
State Grid Jiangsu Electric Power Co Ltd
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Beijing University of Posts and Telecommunications
State Grid Jiangsu Electric Power Co Ltd
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, Beijing University of Posts and Telecommunications, State Grid Jiangsu Electric Power Co Ltd, Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202110920543.5A priority Critical patent/CN113627533B/en
Publication of CN113627533A publication Critical patent/CN113627533A/en
Application granted granted Critical
Publication of CN113627533B publication Critical patent/CN113627533B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • G06F18/295Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Marketing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Mathematical Physics (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Medical Informatics (AREA)
  • Educational Administration (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Water Supply & Treatment (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Probability & Statistics with Applications (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

A power equipment overhaul decision generation method based on reinforcement learning, in the technical field of power equipment overhaul, addresses the problems of existing reinforcement-learning policy-modeling approaches, which require large amounts of data and use that data inefficiently. The method comprises the following steps: calculating a first cut set and, from it, a first weight quantifying each power device's contribution to power grid outage loss; modeling the power equipment overhaul decision generation problem as a Markov decision process and defining the running states of the power equipment; solving the Markov decision process with a reinforcement learning method to obtain an optimal strategy and its value matrix, where the first weight is weighted into the overall operation loss of the power grid and the reinforcement learning objective is to minimize that loss; and calculating a second cut set, computing a second weight from it, weighting the second weight into the overall operation loss of the power grid, and improving the optimal strategy. The invention indirectly realizes communication among multiple power devices, achieves high data utilization, and has a low application threshold in this professional field.

Description

Power equipment overhaul decision generation method based on reinforcement learning
Technical Field
The invention relates to the technical field of power equipment overhaul, in particular to a power equipment overhaul decision generation method based on reinforcement learning.
Background
Power equipment overhaul means maintaining equipment during circuit operation to keep it in good working order, thereby preventing equipment damage that would seriously affect the operation of the power grid. At present, overhaul strategies for power equipment are usually made manually: technicians with rich experience in power equipment management and extensive power domain expertise subjectively judge whether equipment needs overhaul by scoring its condition. Most of this work relies on online detection, offline detection, and periodic disassembly inspection to support the manual decisions. Such manual strategies demand considerable experience, and many of them amount to organizational measures such as "formulating maintenance strategies" or "improving skills training", so they are inefficient and transfer poorly. Besides manually decided maintenance strategies, current power equipment overhaul methods also score equipment according to its operating state and schedule overhaul in order of the scores. Manual substation overhaul decisions require extensive expertise and detailed knowledge of the circuit's operating state, while scoring-based methods still usually require manually designed scores and likewise transfer poorly.
Power equipment overhaul can be understood as a sequential decision problem, so reinforcement learning can be used to model the relationship between overhaul strategies and rewards. Reinforcement learning, inspired by trial-and-error in animal learning, trains an agent (here, a power device) using the reward obtained from interaction with the environment as a feedback signal, and therefore has strong exploration and autonomous learning capabilities. Multi-agent reinforcement learning refers to systems that contain two or more agents with relationships between them; it has been used in many fields and often yields strategies beyond human experience. For example, deep Q networks have been applied in multi-agent reinforcement learning to guide multi-workflow scheduling in service clouds, optimizing workflow completion time and user cost; multi-agent deep reinforcement learning has been used to optimize sewage treatment processes, training the agents with environment-interaction rewards as feedback; and coordinated multi-agent methods have been used to automatically build hedging strategies that reduce the loss risk of a given stock portfolio. Some research combines power equipment overhaul with deep reinforcement learning, designing a deep recurrent Q-network multi-agent reinforcement learning model with good optimization and decision capability at low maintenance cost. However, current deep-learning-based reinforcement learning represents states and models policies with neural networks, requires large amounts of data, uses that data inefficiently, and has a high application threshold in this professional field.
Disclosure of Invention
The invention provides a power equipment overhaul decision generation method based on reinforcement learning, which aims to solve the problems of existing reinforcement-learning policy-modeling approaches: they require large amounts of data and use that data inefficiently.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a power equipment overhaul decision generation method based on reinforcement learning comprises the following steps:
step one, calculating, according to connection and conduction conditions of all power equipment in a power grid, all first cut sets whose failure can cause an outage of the current power grid; and calculating, according to the first cut sets, static weights of the power equipment characterizing its contribution to power grid outage loss;
step two, modeling the power equipment overhaul decision generation problem as a Markov decision process, and defining, according to the information of all power equipment in the power grid, the possible running states of the power equipment representing its degree of damage, the running states forming the state set of the Markov decision process; solving the Markov decision process with a reinforcement learning method to obtain an optimal strategy and a value matrix of the optimal strategy, the static weights being weighted into the overall operation loss of the power grid, and the reinforcement learning taking minimization of the statically weighted overall operation loss of the power grid as its goal;
step three, calculating, according to connection and conduction conditions of all power equipment in the power grid to be overhauled, all second cut sets whose failure can cause an outage of the current power grid, and calculating, according to the second cut sets, dynamic weights of the power equipment characterizing its contribution to power grid outage loss; weighting the dynamic weights into the overall operation loss of the power grid, taking minimization of the dynamically weighted overall operation loss as the objective, improving the optimal strategy by using the dynamically weighted overall operation loss and the value matrix obtained in step two, and selecting the action with the minimum overall operation loss of the power grid as the final maintenance strategy of the power grid to be overhauled.
The beneficial effects of the invention are as follows:
the invention provides a power equipment overhaul decision generation method based on reinforcement learning, which is used for solving by a reinforcement learning dynamic programming method, calculating the importance of equipment by utilizing a cut set in a power grid, weighting the importance in the dynamic programming solving process, and realizing the connection between equipment decisions by utilizing the characteristic that overhaul actions can cause the change of the cut set. The communication among a plurality of devices can be indirectly realized through the change of the cutset and the integration into the decision process. The invention has the advantages of less data, high data utilization rate, repeated utilization of the connection and conduction conditions of all power equipment in the power grid, repeated utilization of the whole operation loss of the power grid and lower application threshold in the professional field.
Drawings
Fig. 1 is a graph of six grid power profiles obtained by simulation.
Fig. 2 is a schematic diagram of a cut-set of a method for generating a power equipment overhaul decision based on reinforcement learning according to the present invention.
Fig. 3 is a schematic diagram of states and state transitions of a power equipment overhaul decision generation method based on reinforcement learning according to the present invention.
Fig. 4 is a schematic flow diagram of a strategy optimization method for generating a power equipment overhaul decision based on reinforcement learning.
Fig. 5 is a flowchart of a method for generating a power equipment overhaul decision based on reinforcement learning according to the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, including a description of the general terms, but the present invention may be practiced otherwise than as described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.
A power equipment overhaul decision generation method based on reinforcement learning comprises the following steps:
Step one: during reinforcement learning, calculate, according to the connection and conduction conditions of all power equipment in the power grid, all first cut sets whose failure can cause an outage of the current power grid; and calculate, according to the first cut sets, the static weights of the power equipment characterizing its contribution to power grid outage loss.
Step two: model the power equipment overhaul decision generation problem as a Markov decision process, and define, according to the information of all power equipment in the power grid, the possible running states of the power equipment representing its degree of damage; the resulting running states form the state set of the Markov decision process. Solve the Markov decision process with a reinforcement learning method to obtain an optimal strategy and a value matrix of the optimal strategy, where the static weights are weighted into the reinforcement learning reward, that is, into the overall operation loss of the power grid, and the reinforcement learning goal is to minimize the statically weighted overall operation loss of the power grid.
Step three: for the power grid to be overhauled, calculate, according to the connection and conduction conditions of all its power equipment, all second cut sets whose failure can cause an outage of the current power grid, and calculate, according to the second cut sets, the dynamic weights of the power equipment characterizing its contribution to power grid outage loss. Weight the dynamic weights into the overall operation loss of the power grid, take minimizing the dynamically weighted overall operation loss as the objective, improve the optimal strategy by using the dynamically weighted overall operation loss and the value matrix obtained in step two, and select the action with the minimum dynamically weighted overall operation loss as the final maintenance strategy of the power grid to be overhauled.
Markov decision processes are typically used to model reinforcement learning. A Markov decision process (MDP) can be represented by a five-tuple <S, A, P, R, γ>, where: S is a finite set of states; A is a finite set of actions; P is the probability transition matrix, satisfying P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a], where s and s' both denote states, s being the current state and s' the state after the transition, P^a_{ss'} is the transition probability from the current state s to the post-transition state s' when action a is taken, S_t denotes the state at time t (the running time) with S_t ∈ S and S_{t+1} ∈ S, and A_t ∈ A denotes the action at time t; R is the reward function, satisfying R(s, a) = E[R_{t+1} | S_t = s, A_t = a], the expected reward obtained by starting in state s and taking action a, i.e. the reward without cut-set weighting, where E denotes expectation and R_{t+1} is the reward at time t+1; γ is the discount factor, γ ∈ [0, 1]. The return G_t evaluates an episode as the sum of all discounted rewards starting at time t:
G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1}
where R_{t+k+1} is the reward at time t+k+1 and k is an integer with k ≥ 0.
The policy π of an agent is a distribution over actions given a state:
π(a|s) = P[A_t = a | S_t = s].
The state value function v_π(s) describes the expected return an agent obtains when starting in state s and following policy π: v_π(s) = E_π[G_t | S_t = s], where E_π[·] denotes the expectation under policy π.
The action value function q_π(s, a) describes the expected return an agent obtains when starting in state s, taking action a, and following policy π: q_π(s, a) = E_π[G_t | S_t = s, A_t = a].
The dynamic programming method is divided into two parts of strategy evaluation and strategy iteration.
In policy evaluation, each iteration round updates the approximation of the state value function v_π with the Bellman equation, i.e. for every s ∈ S:
v_new(s) = Σ_{a ∈ A} π(a|s) [ R(s, a) + γ Σ_{s' ∈ S} P^a_{ss'} v(s') ]    (1)
Let
v() = v_new()    (2)
where v() denotes the value matrix V at the current time step (the updated value matrix is used for the subsequent policy update and value update), v(s') denotes the entry of the value matrix V for state s' at the current time step, and v_new() denotes the updated value matrix. Here v_π refers to the state value function under the optimal policy; provided v_π exists in step two, the sequence v() converges to v_π.
In policy improvement, a greedy algorithm selects, for each state s, the action a with the largest reward based on the value function of the current time step, i.e. the latest v(), where P^a_{ss'} denotes the probability of moving from state s to state s' when action a is executed in the current state s. Selecting the largest-reward action for every state s implements the policy iteration. The policy of the current time step, π(a|s), is updated greedily:
π(a|s) = argmax_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} P^a_{ss'} v(s') ]    (3)
in the dynamic programming method, two parts of strategy evaluation and strategy iteration are respectively and alternately executed in a circulating way, and the two parts affect each other, but the whole part of the strategy evaluation and the strategy iteration tend to the optimal solution.
The invention regards electric equipment as an intelligent agent in reinforcement learning, models an electric equipment overhaul decision generation problem as a Markov decision process, and a specific process of an electric equipment overhaul decision generation method based on reinforcement learning is as follows:
and acquiring information such as connection and conduction conditions of each power equipment in the power grid, power grid operation power and the like. As fig. 1 simulates the state of 6 different power curves of the power grid, the abscissa of the power graph represents time in "days", and the ordinate represents power in "MKW". And obtaining maintenance loss and damage loss of different power equipment in the power grid. The whole operation loss of the power grid is divided into maintenance loss R of power equipment in the power grid M And damage loss R of electrical equipment in an electrical network F Two types of maintenance of electrical equipment are that the electrical equipment which is not damaged is inspected and maintained, and the damaged electrical equipment needs to be replaced. The maintenance loss of the power equipment in the power grid comprises the economic loss R of the individual power equipment caused by the maintenance of the power equipment M,1 And economic loss R of power grid power failure caused by power equipment overhaul M,2 Damage loss of power equipment in a power grid comprises economic loss R of individual power equipment caused by damage of power equipment F,1 And economic loss R of power grid power failure caused by damage of power equipment F,2 . The overall operational loss of the whole grid (economic loss of grid blackout) can thus be expressed as:
wherein R is M =R M,1 +R M,2 ,R F =R F,1 +R F,2 T represents the running time, N T The method is characterized in that the total operation duration of the power grid is represented, the economic loss of power failure of the power grid is continuously changed along with the operation power of the power grid, reinforcement learning aims at minimizing the overall operation loss of the power grid, namely, minimizing the overall operation loss of the power grid is used as reinforcement learning rewards, and weighting is carried out in reinforcement learning rewards, namely, the overall operation loss of the power grid is minimized.
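For illustration only, a minimal sketch of how the total loss R accumulates from the four loss components over the operating horizon; the per-day values are assumed numbers, not data from the patent.

```python
# Assumed per-day losses over a 5-day horizon (arbitrary economic units).
R_M1 = [0.0, 2.0, 0.0, 0.0, 2.0]   # individual-device loss caused by overhaul
R_M2 = [0.0, 1.5, 0.0, 0.0, 1.5]   # grid-outage loss caused by overhaul
R_F1 = [0.0, 0.0, 4.0, 0.0, 0.0]   # individual-device loss caused by damage
R_F2 = [0.0, 0.0, 6.0, 0.0, 0.0]   # grid-outage loss caused by damage

# R = sum over t of (R_M + R_F), with R_M = R_M1 + R_M2 and R_F = R_F1 + R_F2.
R = sum(m1 + m2 + f1 + f2 for m1, m2, f1, f2 in zip(R_M1, R_M2, R_F1, R_F2))
print(R)  # 17.0
```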
Calculate, from the connection and conduction conditions of the grid's equipment, all cut sets whose failure can cause an outage of the current power grid. The terms "first" and "second" cut set merely distinguish that the two may not be the same cut set. A cut set captures an outage risk: when all the power equipment in a cut set is damaged, the grid suffers an outage. A cut set is a set of some of the power devices such that the current grid loses power when all of them fail, and no proper subset of the cut set has this property.
There are two problems with multiple power devices: the importance of each device to the overall operation of the grid is inconsistent, and the service decisions of the devices may affect the service decisions of other devices. In order to solve the problems, the invention provides a cut-set method for calculating the importance degree of each device in the power grid, and then the importance degree is applied to dynamic programming solution.
The cut set here specifically refers to a point (vertex) cut set. A point cut set is a set of some vertices of the original graph such that removing all of them splits the graph into two parts. Formally: the power grid G is a connected graph formed by power equipment and wires; the vertex set V is the set whose elements are all the power devices in G, i.e. each power device is a vertex. Let U be a subset of the vertex set V of the grid G. If deleting all the vertices of U from the connected graph G leaves G - U disconnected, and no proper subset of U makes G - U disconnected, then U is a point cut set of the graph G.
From the connection state of power equipment A, B, C, D, and E in the grid of Fig. 2, the cut sets of the current grid can be read off by inspection. As shown in Fig. 2, the cut sets of the original circuit grid graph are {A, C, D}, {B, A}, {E, C, D}, {B, E}; they can also be enumerated programmatically, as in the sketch below.
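The sketch below is an illustration, not part of the patent: it assumes one source-sink topology for devices A to E that is consistent with the cut sets listed above (the actual wiring of Fig. 2 is not given in the text), and finds all point cut sets by brute force.

```python
from itertools import combinations

# One assumed topology consistent with the cut sets {A,C,D}, {B,A}, {E,C,D}, {B,E}.
edges = [("s", "A"), ("A", "E"), ("E", "t"),
         ("s", "B"), ("B", "C"), ("C", "t"), ("B", "D"), ("D", "t")]
devices = ["A", "B", "C", "D", "E"]

def connected(removed):
    """Depth-first search from 's': is 't' still reachable after deleting the removed devices?"""
    adj = {}
    for u, v in edges:
        if u in removed or v in removed:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, stack = {"s"}, ["s"]
    while stack:
        for nxt in adj.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return "t" in seen

# A point cut set disconnects the grid, and no proper subset of it does.
cut_sets = []
for k in range(1, len(devices) + 1):
    for subset in combinations(devices, k):
        if not connected(set(subset)) and \
           not any(set(c) < set(subset) for c in cut_sets):
            cut_sets.append(subset)

print(cut_sets)  # [('A', 'B'), ('B', 'E'), ('A', 'C', 'D'), ('C', 'D', 'E')]
```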
In addition, overhauling equipment also changes the cut sets. When a power device is overhauled, new cut sets arise that depend on the overhaul actions of the other devices, and the reward of each action is recomputed from the new cut sets, which realizes decision communication between different power devices.
The weight of each power device's contribution to the overall operation loss of the power grid can be computed from the grid's cut sets, and weighting it into the reinforcement learning reward makes the learning attend to the differing importance of the devices to the grid. The weight of device_i computed from the cut sets is
w_i^cut = (1/N) Σ_{m=1}^{N} (1/T_m) · I(cut_m, device_i)
where cut_m denotes the m-th cut set, T_m denotes the total number of power devices in that cut set, N denotes the number of cut sets, and I(cut_m, device_i) is computed as
I(cut_m, device_i) = 1 if device_i ∈ cut_m, and 0 otherwise,
with i and m both positive integers. As shown in Fig. 2, the cut sets of the original circuit grid graph are {A, C, D}, {B, A}, {E, C, D}, {B, E}, so the weight of power device A is 1/4 · (1/3 + 1/2) ≈ 0.2083; the weights of devices B, C, and so on are computed in the same way.
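A minimal sketch of this weight computation, reproducing the worked value for device A above; the helper function name is our own illustration.

```python
from fractions import Fraction

cut_sets = [{"A", "C", "D"}, {"B", "A"}, {"E", "C", "D"}, {"B", "E"}]

def device_weight(device, cut_sets):
    # w_i^cut = (1/N) * sum over cut sets containing the device of 1/|cut_m|
    n = len(cut_sets)
    return sum(Fraction(1, len(c)) for c in cut_sets if device in c) / n

for d in "ABCDE":
    print(d, device_weight(d, cut_sets), float(device_weight(d, cut_sets)))
# A and E -> 5/24 ≈ 0.2083 (matching the example above); B -> 1/4; C and D -> 1/6
```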
And carrying out reinforcement learning dynamic programming solution based on the equipment weight.
According to the specific information of the power equipment, the running states representing the degree of damage are defined for each device in the grid. In this embodiment four states are defined: S1 (good), S2 (slightly degraded), S3 (severely degraded), and S4 (damaged). The state transition matrix of a power device is computed from the first-order Markov assumption together with the device's historical information. The historical information is used to estimate the probabilities of the device being in each running state (S1 to S4), from which the transition probabilities are computed via Bayes' equation. Fig. 3 shows the state transition diagram of a power device, where P_{bc} denotes the transition probability from state S_b to state S_c and λ_b denotes the probability of going from state S_b to the damaged state S4, with b taking values 1, 2, 3 and c taking values 2, 3, 4. When a power device is not overhauled, its probability transition matrix (also called the state transition matrix) is assembled from these entries as shown in Fig. 3.
A state transition is the probability that a power device in a given state at time t moves to a different state at time t+1; the time unit for determining the device state is one day.
The action set of each power device is [fix, nofix], where fix means overhaul and nofix means no overhaul. To better match practice, only the nofix action may be selected when a device is in state S1, only the fix action may be selected in state S4, and action selection in the other states is unrestricted.
An overall optimization objective of reinforcement learning is defined, i.e. to minimize the overall operational losses of the grid.
The Markov decision process is solved with a weighted dynamic programming algorithm that uses the weights of the power devices, adding the static weights or dynamic weights to the policy evaluation, as in Fig. 4.
In the second step, a reinforcement learning method is applied to solve a Markov decision process to obtain an optimal strategy and a value matrix of the optimal strategy, and the method comprises the following steps:
step 2.1, initializing a value matrix V of the power equipment to obtain an initialized value matrix, and initializing a strategy pi of the power equipment to obtain an initialized strategy;
step 2.2, weighting the overall operation loss of the power grid by using the static weight, taking the overall operation loss of the power grid weighted by the static weight as a target, and updating the value matrix by using a Belman equation according to the overall operation loss of the power grid weighted by the static weight, an initialized strategy and the initialized value matrix;
step 2.3, aiming at minimizing the overall operation loss of the power grid weighted by the static weight, and updating a strategy by using a greedy algorithm according to the overall operation loss of the power grid weighted by the static weight and the latest value matrix; the method comprises the steps of taking the overall operation loss of a power grid as a target, and updating a value matrix by utilizing a Belman equation according to the overall operation loss of the power grid weighted by static weights and the latest strategy;
and 2.4, judging whether the minimization of the overall operation loss of the power grid weighted by the static weight is realized, if so, completing the second step, otherwise, returning to the step 2.3.
The improvement of the optimal strategy in the third step comprises the following steps:
step 3.1, weighting the overall operation loss of the power grid by using dynamic weights, taking the overall operation loss of the minimized power grid as a target, and updating the value matrix of the optimal strategy by using a Belman equation according to the overall operation loss of the power grid weighted by the dynamic weights, the optimal strategy and the value matrix of the optimal strategy;
step 3.2, aiming at minimizing the overall operation loss of the power grid weighted by the dynamic weight, updating the optimal strategy by using a greedy algorithm according to the overall operation loss of the power grid weighted by the dynamic weight and the value matrix of the latest optimal strategy; the method comprises the steps of taking the overall operation loss of a power grid weighted by dynamic weights as a target, and updating a value matrix of an optimal strategy by utilizing a Belman equation according to the overall operation loss of the power grid weighted by the dynamic weights and the latest optimal strategy;
and 3.3, judging whether the minimization of the overall operation loss of the power grid weighted by the dynamic weight is realized, if the minimization of the overall operation loss of the power grid weighted by the dynamic weight is realized, selecting the action of the minimum overall operation loss of the power grid as a final maintenance strategy of the power grid to be maintained, otherwise, returning to the step 3.2.
Namely, in the second step, firstly initializing a value matrix V of the power equipment and initializing a strategy matrix pi of the power equipment; then, the Belman equation is used to combine the optimization objective for policy evaluation and policy improvement. And in the second step, strategy evaluation and strategy improvement are carried out for a plurality of times to obtain a final value matrix V and a strategy pi, the final strategy pi is called an optimal strategy, and the final value matrix V is called the value matrix V of the optimal strategy.
Based on the optimal strategy and the value matrix of the optimal strategy, step three performs dynamic generation of the power equipment overhaul strategy for the grid, obtaining an overhaul strategy suited to the current grid. Specifically: according to the running state of the grid's equipment, compute the cut sets of the current grid and recompute the weights of the different devices; then carry out the policy improvement step again using the resulting dynamic weights and the value matrix of the optimal strategy, i.e. improve the optimal strategy, and select the action with the largest reward, per formula (5), as the final maintenance strategy for the current state.
Policy evaluation estimates the value of executing the current policy in each state, i.e. the overall operation loss of the power grid. It comprises two parts: the direct reward R̂_i^cut(s, a) of device i taking the current action a in the current state s, and the indirect reward of the state transition caused by taking action a in state s, represented by the value matrix V; both are weighted by the weight (static or dynamic) and written back into the value matrix V.
The direct reward R̂_i^cut(s, a) is computed as
R̂_i^cut(s, a) = w_i^cut · ( R_M(s, a) + R_F(s, a) )
where cut denotes the first cut set or the second cut set, w_i^cut denotes the weight of device i computed from the first or second cut set, R_M(s, a) denotes the overhaul loss of power equipment in the grid for state s and action a, R_F(s, a) denotes the damage loss of power equipment in the grid for state s and action a, and R̂_i^cut(s, a) denotes the overall operation loss of the grid for state s and action a based on cut, i.e. the reward of device i under the weighting of the current cut set (first or second). In other words, when computing the grid loss caused by a power device, the weight obtained from the cut set is used to weight the overall operation loss of the grid caused by that device, and w_i^cut may be the static weight or the dynamic weight. Solving then proceeds by dynamic programming, so that the power equipment's awareness of circuit connectivity is obtained and reflected in the strategy.
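A short sketch of the cut-set-weighted direct reward; the loss values are assumed numbers, and the weight of device A is the one computed above.

```python
# Illustrative losses for one device (assumed): maintenance loss R_M and damage
# loss R_F per (state, action) pair, in arbitrary economic units.
R_M = {("S3", "fix"): 5.0, ("S3", "nofix"): 0.0}
R_F = {("S3", "fix"): 0.0, ("S3", "nofix"): 12.0}
weights = {"A": 5 / 24}  # static weight of device A from the Fig. 2 cut sets

def cut_set_weighted_reward(device, state, action, weights, R_M, R_F):
    # R̂_i^cut(s, a) = w_i^cut * (R_M(s, a) + R_F(s, a))
    return weights[device] * (R_M[(state, action)] + R_F[(state, action)])

print(cut_set_weighted_reward("A", "S3", "fix", weights, R_M, R_F))    # ≈ 1.042
print(cut_set_weighted_reward("A", "S3", "nofix", weights, R_M, R_F))  # 2.5
```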
The overall flow of the weighted dynamic programming solution is shown in Fig. 5. The first cut sets and the second cut sets are provided by the cut-set management module and the states by the state management module; overhauling a power device changes the grid's cut sets, so in the next state the weight of each device must be recomputed from the current cut set. State management means that, within the current time period, overhaul decisions are made in turn for the devices in states S2 and S3. Weighted dynamic programming yields the value matrix of the current policy, the policy is then updated with the current value matrix, and the action with the larger reward is selected as the policy for the current state. Replacing R(s, a) in formula (3) with R̂_i^cut(s, a), the update formula for the greedy policy update in step 2.3 and for the greedy optimal-policy update in step 3.2 is
π(a | s_i) = argmax_{a ∈ A} [ R̂_i^cut(s_i, a) + γ Σ_{s' ∈ S} P^a_{s_i s'} v(s') ]    (5)
where s_i denotes the current state s of device i, and the formula means that device i selects the action with the maximum reward in its current state as the corresponding strategy; formula (5) is then executed with this as the latest policy π.
Correspondingly, replacing R(s, a) with R̂_i^cut(s, a) in the policy-evaluation update of formulas (1) and (2), the update formula for the value matrix updated by the Bellman equation in step 2.3 and in step 3.2 is
v_new(s) = Σ_{a ∈ A} π(a|s) [ R̂_i^cut(s, a) + γ Σ_{s' ∈ S} P^a_{ss'} v(s') ]    (4)
after which v() = v_new() is set for the next executions of formula (4) and formula (5).
To construct a power overhaul strategy for multiple devices, the invention solves the problem with a dynamic programming method of reinforcement learning, computes equipment importance from the cut sets of the grid, weights that importance into the dynamic programming solution, and exploits the fact that overhaul actions change the cut sets to link the decisions of different devices. Communication among multiple devices is thus realized indirectly, by feeding cut-set changes back into the decision process. The method needs little data and uses it efficiently: the connection and conduction conditions of all power equipment and the overall operation loss of the power grid are reused repeatedly, and the application threshold in this professional field is low. The effectiveness of the proposed method is verified with simulated data and by evaluating the strategy on a variety of circuits.

Claims (6)

1. The power equipment overhaul decision generation method based on reinforcement learning is characterized by comprising the following steps of:
step one, calculating, according to connection and conduction conditions of all power equipment in a power grid, all first cut sets whose failure can cause an outage of the current power grid; and calculating, according to the first cut sets, static weights of the power equipment characterizing its contribution to power grid outage loss;
step two, modeling the power equipment overhaul decision generation problem as a Markov decision process, and defining, according to the information of all the power equipment in the power grid, the possible running states of the power equipment representing its degree of damage, the running states forming the state set of the Markov decision process; solving the Markov decision process by using a reinforcement learning method to obtain an optimal strategy and a value matrix of the optimal strategy, the static weights being weighted into the overall operation loss of the power grid in the reinforcement learning, whose goal is to minimize the statically weighted overall operation loss of the power grid;
step three, calculating, according to connection and conduction conditions of all power equipment in a power grid to be overhauled, all second cut sets whose failure can cause an outage of the current power grid, and calculating, according to the second cut sets, dynamic weights of the power equipment characterizing its contribution to power grid outage loss; the dynamic weights are weighted into the overall operation loss of the power grid, the dynamically weighted overall operation loss of the power grid is minimized, the dynamically weighted overall operation loss of the power grid and the value matrix obtained in the second step are utilized to improve the optimal strategy, and the action with the minimum overall operation loss of the power grid is selected as the final maintenance strategy of the power grid to be maintained;
in the second step, a reinforcement learning method is applied to solve a Markov decision process to obtain an optimal strategy and a value matrix of the optimal strategy, and the method comprises the following steps:
step 2.1, initializing a value matrix V of the power equipment to obtain an initialized value matrix, and initializing a strategy pi of the power equipment to obtain an initialized strategy;
step 2.2, weighting the overall operation loss of the power grid by using the static weight, taking the overall operation loss of the power grid weighted by the static weight as a target, and updating the value matrix by using a Belman equation according to the overall operation loss of the power grid weighted by the static weight, an initialized strategy and the initialized value matrix;
step 2.3, aiming at minimizing the overall operation loss of the power grid weighted by the static weight, and updating a strategy by using a greedy algorithm according to the overall operation loss of the power grid weighted by the static weight and the latest value matrix; the method comprises the steps of taking the overall operation loss of a power grid weighted by static weights as a target, and updating a value matrix by utilizing a Belman equation according to the overall operation loss of the power grid weighted by the static weights and the latest strategy;
step 2.4, judging whether the minimization of the overall operation loss of the power grid weighted by the static weight is realized, if the minimization of the overall operation loss of the power grid is realized, finishing the step II, otherwise returning to the step 2.3;
the improvement of the optimal strategy in the third step comprises the following steps:
step 3.1, weighting the overall operation loss of the power grid by using dynamic weights, taking the overall operation loss of the power grid weighted by the dynamic weights as a target, and updating the value matrix of the optimal strategy by using a Bellman equation according to the overall operation loss of the power grid weighted by the dynamic weights, the optimal strategy and the value matrix of the optimal strategy;
step 3.2, aiming at minimizing the overall operation loss of the power grid weighted by the dynamic weight, updating the optimal strategy by using a greedy algorithm according to the overall operation loss of the power grid weighted by the dynamic weight and the value matrix of the latest optimal strategy; the method comprises the steps of taking the overall operation loss of a power grid weighted by dynamic weights as a target, and updating a value matrix of an optimal strategy by utilizing a Belman equation according to the overall operation loss of the power grid weighted by the dynamic weights and the latest optimal strategy;
and 3.3, judging whether the minimization of the overall operation loss of the power grid weighted by the dynamic weight is realized, if the minimization of the overall operation loss of the power grid weighted by the dynamic weight is realized, selecting the action of the minimum overall operation loss of the power grid as a final maintenance strategy of the power grid to be maintained, otherwise, returning to the step 3.2.
2. The method for generating overhaul decisions for electric power equipment based on reinforcement learning as claimed in claim 1, wherein said step one further comprises obtaining the overhaul loss R_M and the damage loss R_F of the power equipment in the power grid, the overhaul loss comprising the economic loss R_{M,1} of the individual power equipment caused by its overhaul and the economic loss R_{M,2} of the power grid outage caused by the overhaul, and the damage loss comprising the economic loss R_{F,1} of the individual power equipment caused by its damage and the economic loss R_{F,2} of the power grid outage caused by the damage; and the overall operation loss of the power grid in the second step is
R = Σ_{t=0}^{N_T} ( R_M + R_F ),  with R_M = R_{M,1} + R_{M,2} and R_F = R_{F,1} + R_{F,2},
wherein t represents the running time and N_T indicates the total duration of grid operation.
3. The method for generating a maintenance decision of a power plant based on reinforcement learning according to claim 1, wherein the first cut sets and the second cut sets are point cut sets whose elements are power devices; a point cut set is such that the power grid suffers an outage when all the power equipment in the point cut set is damaged, while no proper subset of the point cut set causes a power grid outage when all of its power equipment is damaged.
4. The method for generating a power equipment overhaul decision based on reinforcement learning according to claim 1, wherein the overall operation loss of the power grid is
R̂_i^cut(s, a) = w_i^cut · ( R_M(s, a) + R_F(s, a) )
wherein cut represents the first cut set or the second cut set, w_i^cut represents the weight of the device i calculated by the first cut set or the second cut set, R_M(s, a) represents the maintenance loss of power equipment in the power grid in state s under action a, R_F(s, a) represents the damage loss of power equipment in the power grid in state s under action a, and R̂_i^cut(s, a) represents the overall operation loss of the grid in state s under action a based on cut.
5. The method for generating a power equipment overhaul decision based on reinforcement learning as claimed in claim 4, characterized in that
w_i^cut = (1/N) Σ_{m=1}^{N} (1/T_m) · I(cut_m, device_i)
wherein cut_m represents the m-th cut set, device_i represents the power equipment i, i and m are positive integers, T_m represents the total number of power devices in the current cut set, N represents the number of cut sets, and I(cut_m, device_i) is calculated as
I(cut_m, device_i) = 1 if device_i ∈ cut_m, and 0 otherwise.
6. The method for generating a maintenance decision for electric power equipment based on reinforcement learning as set forth in claim 4, wherein the update formula for the value matrix updated by the Bellman equation in step 2.3 and in step 3.2 is
v_new(s) = Σ_{a ∈ A} π(a|s) [ R̂_i^cut(s, a) + γ Σ_{s' ∈ S} P^a_{ss'} v(s') ]    (4)
with v() = v_new() then set for the next executions of formula (4) and formula (5);
and the update formula for the greedy policy update in step 2.3 and for the greedy optimal-policy update in step 3.2 is
π(a | s_i) = argmax_{a ∈ A} [ R̂_i^cut(s_i, a) + γ Σ_{s' ∈ S} P^a_{s_i s'} v(s') ]    (5)
wherein A represents the action set of action a, γ represents the discount factor, S represents the finite state set, s and s' both belong to S, s represents the current state and s' the state after the transition, s_i refers to the current state s of device i, the formula in (5) means that the power equipment i selects the action of maximum reward in its current state as the corresponding strategy, P^a_{ss'} represents the probability transition matrix from the current state s to the post-transition state s' under action a, v() represents the value matrix at the current time step, and v_new() represents the updated value matrix.
CN202110920543.5A 2021-08-11 2021-08-11 Power equipment overhaul decision generation method based on reinforcement learning Active CN113627533B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110920543.5A CN113627533B (en) 2021-08-11 2021-08-11 Power equipment overhaul decision generation method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110920543.5A CN113627533B (en) 2021-08-11 2021-08-11 Power equipment overhaul decision generation method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN113627533A CN113627533A (en) 2021-11-09
CN113627533B true CN113627533B (en) 2023-11-10

Family

ID=78384535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110920543.5A Active CN113627533B (en) 2021-08-11 2021-08-11 Power equipment overhaul decision generation method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN113627533B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581249B (en) * 2022-03-22 2024-05-31 山东大学 Financial product recommendation method and system based on investment risk bearing capacity assessment
CN114662799B (en) * 2022-05-18 2022-07-26 国网四川省电力公司电力科学研究院 Power transmission line maintenance plan optimization method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113036772A (en) * 2021-05-11 2021-06-25 国网江苏省电力有限公司南京供电分公司 Power distribution network topology voltage adjusting method based on deep reinforcement learning
WO2021135554A1 (en) * 2019-12-31 2021-07-08 歌尔股份有限公司 Method and device for planning global path of unmanned vehicle

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021135554A1 (en) * 2019-12-31 2021-07-08 歌尔股份有限公司 Method and device for planning global path of unmanned vehicle
CN113036772A (en) * 2021-05-11 2021-06-25 国网江苏省电力有限公司南京供电分公司 Power distribution network topology voltage adjusting method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Simplified reinforcement learning algorithm for dynamic configuration of elevator group control; 朱德文 (Zhu Dewen); 中国电梯 (China Elevator), No. 02; full text *

Also Published As

Publication number Publication date
CN113627533A (en) 2021-11-09

Similar Documents

Publication Publication Date Title
CN112615379B (en) Power grid multi-section power control method based on distributed multi-agent reinforcement learning
CN113627533B (en) Power equipment overhaul decision generation method based on reinforcement learning
CN114217524B (en) Power grid real-time self-adaptive decision-making method based on deep reinforcement learning
CN104037943B (en) A kind of voltage monitoring method and system that improve grid voltage quality
CN105138717A (en) Transformer state evaluation method by optimizing neural network with dynamic mutation particle swarm
CN109215344B (en) Method and system for urban road short-time traffic flow prediction
CN104037761B (en) AGC power multi-objective random optimization distribution method
CN112131206B (en) Multi-model database OrientDB parameter configuration automatic tuning method
CN113435625B (en) Dynamic economic dispatching optimization method and device for power system
CN109146131A (en) A kind of wind-power electricity generation prediction technique a few days ago
CN115765050A (en) Power system safety correction control method, system, equipment and storage medium
CN111445065A (en) Energy consumption optimization method and system for refrigeration group control of data center
CN116995682B (en) Adjustable load participation active power flow continuous adjustment method and system
CN111799820B (en) Double-layer intelligent hybrid zero-star cloud energy storage countermeasure regulation and control method for power system
CN112488543A (en) Intelligent work site shift arrangement method and system based on machine learning
CN111144639A (en) Subway equipment fault prediction method and system based on ALLN algorithm
CN115912367A (en) Intelligent generation method for operation mode of power system based on deep reinforcement learning
CN115526504A (en) Energy-saving scheduling method and system for water supply system of pump station, electronic equipment and storage medium
CN114298429A (en) Power distribution network scheme aided decision-making method, system, device and storage medium
CN113555876A (en) Line power flow regulation and control method and system based on artificial intelligence
CN116632852B (en) Power grid branch active power flow out-of-limit correction control method, system, equipment and medium
CN117933673B (en) Line patrol planning method and device and line patrol planning system
Subbaraj et al. Evolutionary techniques based combined artificial neural networks for peak load forecasting
Xue et al. Application of genetic algorithm to middle-long term optimal combination power load forecast
CN116154771B (en) Control method of power equipment, equipment control method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant