CN111539492A - Abnormal electricity utilization judgment system and method based on reinforcement learning - Google Patents

Abnormal electricity utilization judgment system and method based on reinforcement learning

Info

Publication number
CN111539492A
CN111539492A
Authority
CN
China
Prior art keywords
network model
state
power utilization
value
reward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010649574.7A
Other languages
Chinese (zh)
Other versions
CN111539492B (en)
Inventor
陈应林
陈勉舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Gelanruo Intelligent Technology Co.,Ltd.
Original Assignee
Wuhan Glory Road Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Glory Road Intelligent Technology Co ltd filed Critical Wuhan Glory Road Intelligent Technology Co ltd
Priority to CN202010649574.7A priority Critical patent/CN111539492B/en
Publication of CN111539492A publication Critical patent/CN111539492A/en
Application granted granted Critical
Publication of CN111539492B publication Critical patent/CN111539492B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Water Supply & Treatment (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Testing And Monitoring For Control Systems (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention relates to an abnormal electricity utilization judgment system and method based on reinforcement learning. The judgment system is a DRQN (Deep Recurrent Q-Network) model for judging abnormal electricity utilization. The Q network model takes the current state as input and the currently selected action as output, and the state is used as a judgment index to determine the reward and punishment value of the current round. Each time the Q network model has been trained a set number of times, the network parameters of the target Q network model are synchronized with the network parameters of the Q network model. The power utilization probability sequence to be tested is input into the trained DRQN model, the state is used as a dynamic threshold for the sequence, and whether power utilization is abnormal is judged according to this dynamic threshold. Because the current state serves both as the judgment index that determines the reward and punishment value and as the dynamic threshold, the system can update the threshold according to a user's real-time power data, which effectively improves generalization across users.

Description

Abnormal electricity utilization judgment system and method based on reinforcement learning
Technical Field
The invention relates to the field of load prediction of power systems, in particular to an abnormal electricity utilization judgment system and method based on reinforcement learning.
Background
With the wide deployment of smart electricity meters, abnormal electricity utilization detection has become an important means of studying customers' abnormal consumption behaviors and discovering unexpected electricity utilization events in time. During grid operation, whether the metering device fails or the user steals electricity, the user's true electricity utilization data cannot be collected; such false electricity utilization data are called abnormal electricity utilization data. If these situations are not discovered and handled in time, they seriously interfere with the power supply order of normal users and power supply companies, causing economic losses and safety hazards in power supply and utilization. Automatic monitoring of users' abnormal electricity utilization is therefore of great practical significance.
Current abnormal electricity utilization detection methods can be classified into the following types: methods based on system state, methods based on game theory, and data-driven methods.
With the rapid development of massive data and artificial intelligence technology in the smart grid era, data-driven detection of abnormal electricity utilization behavior has gradually become a research hotspot. Data-driven methods can be divided into three categories: classification, regression and clustering, where classification and regression are supervised learning methods and clustering is an unsupervised learning method.
A classification method divides the input set into several classes according to the input feature quantities. In abnormal electricity utilization detection, classification aims to divide the set of users into a normal class and an abnormal class according to each user's feature quantities. Classification-based methods need large labeled training sets to provide samples, and classification precision is improved through training. The core of a classification method is the classifier model; common classifiers include extreme learning machines, support vector machines, k-nearest neighbor algorithms, neural networks and other models. Because they use labeled data sets, classification methods achieve higher accuracy than other methods, which has also made them a research hotspot.
Analysis of abnormal electricity consumption behavior at the user level must consider the user's long-term behavior, but analyzing long-term electricity consumption behavior by feeding a classifier directly, for example inputting one month or even one year of electricity consumption data into the classifier at once, is unsuitable for the following two reasons:
(1) In long-term electricity consumption data, normal electricity consumption data dominate, so the small portion of abnormal electricity consumption data is effectively noise, which seriously degrades the classification result output by the classifier.
(2) Inputting long-term electricity utilization data directly into the classifier implies a long detection period, so there is a long delay before a user's abnormal electricity utilization behavior is identified, which does not match practical application scenarios.
Most classification methods therefore analyze long-term electricity usage behavior on the basis of short-term electricity usage behavior. Specifically, the long-term data are divided into multiple short-period segments, which are input into a classifier that outputs the probability that each short-period sample comes from an abnormal electricity user. Generally, judging whether a short-term sample comes from an abnormal electricity user requires a judgment threshold: when the probability output by the classifier exceeds the threshold, the short-term sample is judged to come from an abnormal electricity user.
When a user's long-term data are divided into n short-term segments and each is input into the classifier, a probability sequence of length n is obtained. The next problem is what proportion of the n short-term samples must be judged as coming from an abnormal user before the user is judged, on the basis of long-term electricity utilization behavior, to be an abnormal electricity user. In conventional classification methods this decision proportion has to be set by empirical reference and has no fixed value.
In summary, most of the current abnormal electricity detection methods based on classification still have the following problems:
(1) When facing electricity consumption data from different users, it is difficult to determine a unified standard for either the short-period judgment threshold or the long-period judgment proportion.
(2) Once model training is completed and the decision threshold and decision proportion are fixed, it is difficult to produce accurate decisions when the distribution of the tested user's power consumption data changes; manpower and computing resources must then often be spent adjusting the model parameters or the decision threshold and proportion.
Disclosure of Invention
Aiming at the above technical problems in the prior art, the invention provides an abnormal electricity utilization judgment system and method based on reinforcement learning, which solve the problem that the judgment threshold and judgment proportion of traditional classifier methods are difficult to determine.
The technical scheme adopted to solve the above technical problems is as follows: an abnormal electricity utilization judgment system based on reinforcement learning, wherein the judgment system is a constructed DRQN model comprising: a memory bank, a Q network model and a target Q network model;
the memory bank is used for storing the current state, the currently selected action, the state of the next step and the reward and punishment value of the current round;
the Q network model takes the current state as input and the currently selected action as output, and takes the state as a decision index to determine the reward and punishment value of the current round;
the target Q network model has the same structure as the Q network model; each time the Q network model has been trained a set number of times, the network parameters of the target Q network model are synchronized with the network parameters of the Q network model; the target Q network model takes the next-step state stored in the memory bank as input; a loss is calculated from the outputs of the Q network model and the target Q network model together with the reward and punishment values, and the network parameters of the Q network model are updated according to the loss;
and inputting the power utilization probability sequence to be tested into the trained DRQN model, taking the state as a dynamic threshold of the power utilization probability sequence to be tested, and judging whether the power utilization is abnormal or not according to the dynamic threshold.
An abnormal electricity utilization judgment method based on the judgment system comprises the following steps:
step 1, obtaining a user abnormal electricity utilization probability sequence output by a classifier and sample data of a corresponding original label, and dividing the sample data into a training set and a test set;
step 2, iterating the DRQN model using the training set, and completing the training of the Q network model during the iteration process of the DRQN model;
the iteration process of the DRQN model comprises: according to the n input power utilization probability sequences, determining the reward and punishment value of the current round by taking the dynamic state as a decision index and obtaining the next-step state;
step 3, inputting the power utilization probability sequence to be tested into the trained DRQN model, taking the state as a dynamic threshold of the sequence, and judging whether the power utilization is abnormal according to the dynamic threshold.
The invention has the following beneficial effects: the reward and punishment value is determined by taking the current state as the decision index, and the current state is used as a dynamic threshold, so the system can update the threshold according to the user's real-time power data, which effectively improves generalization across users. Compared with a common classifier algorithm, the discrimination threshold and discrimination proportion parameters are easily obtained. User data from online testing can be used directly for online learning, so the network parameters can be adjusted without extra manpower or computing resources while maintaining accuracy over the long term. The input probability sequence can have a dynamic length, as long as the length is not less than 10, which gives the method great flexibility in practical applications.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the state is a five-dimensional array composed of 5 values, including a four-dimensional decision threshold and a one-dimensional decision ratio;
the four decision threshold dimensions are respectively thresholds for the maximum value, the median and the average value, plus a set threshold A; the one-dimensional decision proportion is the proportion exceeding the threshold A.
Further, the actions include:
and selecting two of the five-dimensional arrays of the state by the DRQN model, and adding a set fixed value and subtracting a set fixed value to the two selected arrays respectively.
Further, after the Q network model outputs the currently selected action, an environment function takes the current state and the currently selected action as input and outputs the state of the next step.
Further, the calculation process of the reward and punishment value is as follows: taking the state as the judgment index, the number m of correctly judged samples among the n input samples is counted, and the reward and punishment value is calculated from m and n.
Further, the state used as the decision index is the next-step state obtained after the DRQN model completes a complete iteration of one round;
n power utilization probability sequences are input over the n steps of each round, where each step comprises: inputting the current state, outputting the currently selected action, and obtaining the next-step state from the environment function; the next-step state obtained after the DRQN model completes the complete iteration of one round is input into the reward and punishment function, which outputs the reward and punishment value.
Further, inputting the power utilization probability sequence to be tested into the trained DRQN model, taking the current state as a dynamic threshold of the power utilization probability sequence to be tested, and determining whether the power utilization is abnormal according to the dynamic threshold includes:
calculating, for the power utilization probability sequence to be tested, the maximum value a1, the median b1, the average value c1 and the proportion e1 of days exceeding the threshold A, and comparing a1, b1, c1 and e1 with the values a2, b2, c2, A and e2 of the five-dimensional dynamic-threshold array to obtain an output of 1 or 0;
and if the output is 1, judging that the power utilization probability sequence to be detected belongs to an abnormal power utilization user, otherwise, judging that the power utilization probability sequence to be detected is a normal user.
Further, the step 2 comprises:
step 201, initializing the Q network model, the target Q network model and the memory bank;
step 202, initializing a state, and inputting a value of the initial state into the Q network model;
step 203, the Q network model outputs the currently selected action, inputs the current state and the currently selected action into an environment function, and outputs the state of the next step;
step 204, inputting the next state in step 203 as the current state into the Q network model;
step 205, repeating steps 203 to 204 n times to complete the iteration of one round, and inputting the state of the last step, the n power utilization probability sequences and the corresponding original labels into a reward and punishment function to obtain the reward and punishment value of the round;
step 206, repeating the steps 202 to 205 for a set number of times, accumulating the quadruple data of each round, and storing the quadruple data into the memory bank;
the quadruple is
Figure 796591DEST_PATH_IMAGE003
,
Figure 997765DEST_PATH_IMAGE004
Which is indicative of the current state of the device,
Figure 581194DEST_PATH_IMAGE005
indicating the action that is currently selected and,
Figure 830909DEST_PATH_IMAGE006
the status of the next step is shown,
Figure 878500DEST_PATH_IMAGE007
representing a reward and penalty value of the current round;
step 207, when the quadruple data stored in the memory bank reach a set amount, training the Q network model between step 203 and step 204, including: randomly sampling the n quadruples (s, a, s', r) of one round from the memory bank, then sequentially inputting the current states s of the n quadruples into the Q network model and the next-step states s' into the target Q network model;
step 208, each time the Q network model has been trained a set number of times, synchronizing the network parameters of the target Q network model with those of the Q network model, calculating a loss value according to the outputs of the Q network model and the target Q network model and the reward and punishment value, and updating the network parameters of the Q network model according to the loss value.
Further, the step 3 comprises:
calculating, for the power utilization probability sequence to be tested, the maximum value a1, the median b1, the average value c1 and the proportion e1 of days exceeding the threshold A, and comparing a1, b1, c1 and e1 with the values a2, b2, c2, A and e2 of the five-dimensional dynamic-threshold array to obtain an output of 1 or 0;
and if the output is 1, judging that the power utilization probability sequence to be detected belongs to an abnormal power utilization user, otherwise, judging that the power utilization probability sequence to be detected is a normal user.
The beneficial effect of adopting the above further schemes is: when facing electricity consumption data from different users, a unified standard is available to determine the short-period judgment threshold and the long-period judgment proportion; after model training is finished, the judgment threshold and judgment proportion are adjusted in real time, so that an accurate judgment result can still be produced even when the distribution of a tested user's power consumption data changes.
Drawings
Fig. 1 is an interaction diagram of an abnormal electricity usage determination system based on reinforcement learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the structure and data interaction of the Q network model according to the embodiment of the present invention;
FIG. 3 is a schematic diagram of a structure and data interaction of a target Q network model according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for training a DRQN model according to an embodiment of the present invention;
fig. 5 is a flowchart of an abnormal user detection method according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
The invention provides an abnormal electricity utilization judgment system based on reinforcement learning, which is a constructed DRQN model comprising: a memory bank, a Q network model and a target Q network model.
The memory bank is used for storing the current state, the currently selected action, the state of the next step and the reward and punishment value of the current round.
The Q network model takes the current state as input and the currently selected action as output, and takes the state as a decision index to determine the reward and punishment value of the current round.
The structure of the target Q network model is the same as that of the Q network model, and each time the Q network model has been trained a set number of times, the network parameters of the target Q network model are synchronized with the network parameters of the Q network model; the target Q network model takes the next-step state stored in the memory bank as input; the loss is calculated from the outputs of the Q network model and the target Q network model and the reward and punishment values, and the network parameters of the Q network model are updated according to the loss.
And inputting the power utilization probability sequence to be detected into the trained DRQN model, taking the state as a dynamic threshold of the power utilization probability sequence to be detected, and judging whether the power utilization is abnormal or not according to the dynamic threshold.
In the reinforcement-learning-based abnormal electricity utilization judgment system, the current state is used as the judgment index to determine the reward and punishment value, and the current state is used as a dynamic threshold, so the system can update the threshold according to the user's real-time power data, which effectively improves generalization across users. Compared with a common classifier algorithm, the discrimination threshold and discrimination proportion parameters are easily obtained. User data from online testing can be used directly for online learning, so the network parameters can be adjusted without extra manpower or computing resources while maintaining accuracy over the long term. The input probability sequence can have a dynamic length, as long as the length is not less than 10, which gives the method great flexibility in practical applications.
Example 1
Embodiment 1 of the present invention is an embodiment of an abnormal electricity utilization judgment system based on reinforcement learning. Fig. 1 is an interaction schematic diagram of the system. As can be seen from fig. 1, the judgment system is a constructed DRQN model comprising: a memory bank, a Q network model and a target Q network model.
The memory bank is used for storing the current state, the currently selected action, the state of the next step and the reward and punishment value of the current round.
In particular, the memory bank stores quadruples (s, a, s', r), where s denotes the current state, a denotes the currently selected action, s' denotes the next-step state, and r denotes the reward and punishment value of the current round.
Preferably, the current state s and the next-step state s' are each a five-dimensional array of 5 values, comprising a four-dimensional decision threshold and a one-dimensional decision proportion.
The four decision threshold dimensions are thresholds for the maximum value, the median and the average value, plus a set threshold A, where the set threshold A may be the decision threshold of an ordinary classifier; the one-dimensional decision proportion is the proportion of samples exceeding the threshold A.
The action a is the selection, by the DRQN model, of two of the values in the five-dimensional state array; a set fixed value is added to one selected value and a set fixed value is subtracted from the other.
Fig. 2 and fig. 3 are schematic diagrams of structures and data interaction of a Q network model and a target Q network model provided by an embodiment of the present invention, respectively.
As can be seen from fig. 2 and 3, the target Q network model and the Q network model have the same structure, composed of two fully-connected layers, FC (Fully Connected) Layer1 and FC Layer2, and a Long Short-Term Memory (LSTM) layer.
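By way of illustration only, the FC-FC-LSTM structure can be sketched in PyTorch as follows; the hidden width, the 5-dimensional state input and the 10-value output (one value per add/subtract action) are assumptions made for the sketch rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Sketch of the Q / target-Q structure: two fully connected layers
    followed by an LSTM layer, mapping a sequence of 5-dim states to
    10 action values (add/subtract a fixed step on one of 5 thresholds)."""

    def __init__(self, state_dim=5, hidden_dim=64, n_actions=10):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)   # FC Layer1
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)  # FC Layer2
        self.lstm = nn.LSTM(hidden_dim, n_actions, batch_first=True)  # LSTM layer

    def forward(self, states, hidden=None):
        # states: (batch, seq_len, state_dim) -- the states of one round as a sequence
        x = torch.relu(self.fc1(states))
        x = torch.relu(self.fc2(x))
        out, hidden = self.lstm(x, hidden)  # out: (batch, seq_len, n_actions)
        return out, hidden
```

At decision time, the action is taken as the argmax of the last output vector, matching the argmax selection described below.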
The Q network model takes the current state as input and the currently selected action as output, i.e., the input of the Q network model is s and the output is a, and the reward and punishment value of the current round is determined by taking the state as the judgment index.
Preferably, after the Q network model outputs the currently selected action, the environment function takes the current state and the currently selected action as input and outputs the next-step state; that is, the input of the environment function is s and a, and the output is s'. Specifically, the current dynamic threshold and the selected add/subtract action are input, and the dynamic threshold of the next step is output.
The five-dimensional state array is input into the Q network model and passes through the two fully-connected layers and the LSTM layer to give output1; the selected action is then obtained through an argmax function, i.e., a fixed value is added to or subtracted from the threshold of the corresponding dimension. Concretely, the action space is a 5 × 2 array corresponding to the 5 threshold dimensions, each threshold having two actions (adding a fixed value and subtracting a fixed value), and output1 selects one of these ten scalars through the argmax function.
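A minimal sketch of this action selection and of the environment function follows; the fixed step size of 0.05 and the clipping of thresholds to [0, 1] are illustrative assumptions, not values stated in the patent.

```python
import numpy as np

FIXED_STEP = 0.05  # assumed size of the "set fixed value"

def select_action(output1):
    """output1: array of 10 Q-values; returns (dimension index, sign)."""
    idx = int(np.argmax(output1))        # one of the ten scalars
    dim, direction = divmod(idx, 2)      # dimension 0..4, direction 0=add, 1=subtract
    sign = 1 if direction == 0 else -1
    return dim, sign

def environment_step(state, action):
    """Apply the selected add/subtract action to the 5-dim dynamic threshold."""
    dim, sign = action
    next_state = np.array(state, dtype=float)
    next_state[dim] = np.clip(next_state[dim] + sign * FIXED_STEP, 0.0, 1.0)
    return next_state
```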
the calculation process of the reward and punishment value is as follows: taking the state as a judgment index, calculating that m samples in input n samples are judged correctly, and then awarding and punishing values
Figure 320698DEST_PATH_IMAGE001
The reward and punishment value is calculated by taking the final dynamic threshold as the decision threshold and decision proportion: if the judgment is correct, the reward and punishment value is positive; otherwise it is negative.
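The exact reward formula is not reproduced here; a simple stand-in that is positive when most samples are judged correctly and negative otherwise, consistent with the description above, could look like the following (illustrative only, not the patent's own formula).

```python
def reward(m, n):
    """Round reward: m of the n input samples were judged correctly by the
    final dynamic threshold. Illustrative formula: positive when more than
    half are correct, negative otherwise."""
    return (2 * m - n) / n
```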
Preferably, the state used as the decision index is the next state obtained after the DRQN model completes a complete iteration of one round.
n power utilization probability sequences are input over the n steps of each round. Each step comprises: inputting the current state, outputting the currently selected action, and obtaining the next-step state from the environment function; the next-step state obtained in one step is used as the input state of the following step. This process cycles for the n steps of one round, and the next-step state output at the last step is the final state of the whole round, i.e., the final dynamic threshold obtained after the DRQN model has run one round. This final next-step state is input into the reward and punishment function, which outputs the reward and punishment value.
Each time the Q network model has been trained the set number of times, the network parameters of the target Q network model are synchronized with those of the Q network model. The target Q network model takes the next-step state stored in the memory bank as input, i.e., the input of the target Q network model is s'. The loss is calculated according to the outputs of the Q network model and the target Q network model together with the reward and punishment value, and the network parameters of the Q network model are updated according to the loss.
Specifically, during training the inputs of the Q network model and the target Q network model both come from samples in the memory bank: the input of the Q network model is the current state, giving output1, and the input of the target Q network model is the next-step state, giving output2. The loss is then calculated from output1, output2 and the reward and punishment value, using a hyper-parameter that is preferably 0.9.
And inputting the power utilization probability sequence to be detected into the trained DRQN model, taking the state as a dynamic threshold of the power utilization probability sequence to be detected, and judging whether the power utilization is abnormal or not according to the dynamic threshold.
Preferably, the process of determining whether the power consumption is abnormal includes:
The maximum value a1, the median b1, the average value c1 and the proportion e1 of days exceeding the threshold A are calculated for the power utilization probability sequence to be tested, and a1, b1, c1 and e1 are compared with the values a2, b2, c2, A and e2 of the five-dimensional dynamic-threshold array to obtain an output of 1 or 0.
The output represents the judgment made by the threshold on the power utilization probability sequence sample to be tested: if the output is 1, the sequence is judged to belong to an abnormal electricity user; otherwise it is judged to belong to a normal user. Combined with the true label, this determines whether the judgment is correct.
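By way of illustration, the statistics and the comparison against the learned dynamic threshold can be sketched as below; requiring all four comparisons to hold is one plausible reading of the decision rule and is an assumption, since the exact comparison formula is not reproduced here.

```python
import numpy as np

def decide(prob_seq, dynamic_threshold):
    """prob_seq: daily abnormality probabilities for one user.
    dynamic_threshold: the learned 5-dim state [a2, b2, c2, A, e2].
    Returns 1 for an abnormal-electricity user, 0 for a normal user."""
    a2, b2, c2, A, e2 = dynamic_threshold
    p = np.asarray(prob_seq, dtype=float)
    a1 = p.max()                  # maximum probability
    b1 = np.median(p)             # median probability
    c1 = p.mean()                 # average probability
    e1 = (p > A).mean()           # proportion of days exceeding threshold A
    # Assumed rule: flag the user when every statistic exceeds its threshold.
    return int(a1 > a2 and b1 > b2 and c1 > c2 and e1 > e2)
```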
Example 2
Embodiment 2 of the invention is an embodiment of an abnormal electricity utilization judgment method based on reinforcement learning. The judgment method is based on the abnormal electricity utilization judgment system provided by the embodiments of the invention, and comprises the following steps:
step 1, obtaining a user abnormal electricity utilization probability sequence output by a classifier and sample data of a corresponding original label, and dividing the sample data into a training set and a testing set.
Step 2, iterating the DRQN model using the training set; the training of the Q network model is completed during the iteration of the DRQN model.
The iteration process of the DRQN model comprises: according to the n input power utilization probability sequences, determining the reward and punishment value of the current round by taking the dynamic state as the decision index and obtaining the next-step state, i.e., generating a dynamic decision threshold and decision proportion.
Preferably, as shown in fig. 4, which is a flowchart of a method for training a DRQN model provided in an embodiment of the present invention, as can be seen from fig. 1 to fig. 4, step 2 includes:
Step 201, initializing a Q network model, a target Q network model and a memory bank.
Specifically, initializing the memory bank means emptying it. The memory bank may be a container of size 20000 × 50: since the fixed number of steps per round is preferably 50, a total of 20000 rounds of data can be stored, each round containing 50 quadruples (s, a, s', r).
Step 202, starting iteration of a round, initializing the state, starting iteration of the first step, and inputting the value of the initial state into the Q network model.
Specifically, the state is preferably initialized to [0.6, 0.4, 0.4, 0.5, 0.1].
In fig. 1, the dashed box shows the flow of one step of the DRQN model's iteration, and the thick arrow indicates that data are passed to the reward and punishment function after one round is completed.
Step 203, the Q network model outputs the currently selected action; the current state and the currently selected action are input into the environment function, which outputs the next-step state, completing the current step.
Within one step of the iteration, the input of the Q network is the current state coming from the environment function, the output is the currently selected action, and the action is passed into the environment function to obtain the next-step state.
Step 204, starting the next step of the iteration: the next-step state from step 203 is input into the Q network model as the current state.
Step 205, repeating steps 203 to 204 n times to complete the iteration of one round, and inputting the state of the last step, the n power utilization probability sequences and the corresponding original labels into the reward and punishment function to obtain the reward and punishment value of the round.
Step 206, repeating steps 202 to 205 a set number of times, accumulating the quadruple data of each round, and storing the quadruple data in the memory bank.
The quadruple is (s, a, s', r), where s denotes the current state, a the currently selected action, s' the next-step state, and r the reward and punishment value of the current round.
Each round has n steps and therefore n quadruples (s, a, s', r). After one round of iteration is finished, the states s, the next-step states s', the actions a and the reward and punishment values r of all steps are stored in the memory bank.
Step 207, when the quadruple data stored in the memory bank reach a set amount, the Q network model is trained between step 203 and step 204, including: randomly sampling the n quadruples (s, a, s', r) of one round from the memory bank to start a round of training, then sequentially inputting the current states s of the n quadruples into the Q network model and the next-step states s' into the target Q network model.
Formal training starts once the memory bank holds a certain amount of data, specifically when its data volume reaches at least 1000 rounds. When the Q network model is trained, its input is the current state sampled from the memory bank; only the Q network model in the DRQN participates in the training, which is inserted into the iteration. Specifically, after step 202 and step 203, training of the Q network model begins: one round of data, i.e., n quadruples (s, a, s', r), is randomly sampled from the memory bank to start a round of training; the current states s of the n quadruples are sequentially input into the Q network model and the next-step states s' into the target Q network model; only the Q network model is trained, and the loss value is calculated from the output of the Q network model, the output of the target Q network model and the reward and punishment value r. The network parameters of the Q network model are then updated according to the loss, after which steps 204 and 205 continue.
Step 208, each time the Q network model has been trained the set number of times, the network parameters of the target Q network model are synchronized with those of the Q network model; a loss value is calculated from the outputs of the Q network model and the target Q network model and the reward and punishment value, and the network parameters of the Q network model are updated according to the loss value.
One pass through steps 201 to 207 is called one epoch of training; multiple epochs are trained repeatedly, and the state, action and reward and punishment value data of each round are continuously stored in the memory bank; if the memory bank reaches its capacity limit, the earliest data are deleted. In fig. 1, the dashed arrow indicates what happens after every several epochs of training: specifically, it is checked whether 100 training rounds have elapsed; if so, the target Q network copies the network parameters of the Q network once, and otherwise its network parameters remain fixed.
Specifically, it is determined whether 50000 rounds of training have been completed; if so, the whole training process ends; otherwise, the process returns to step 202 and training continues with the next round.
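Pieced together, steps 201 to 208 above can be sketched roughly as below, reusing the illustrative MemoryBank, QNetwork, environment_step and td_loss defined earlier; the round length of 50, the purely greedy (exploration-free) action choice and the handling of the per-round reward are simplifying assumptions.

```python
import numpy as np
import torch

N_STEPS = 50                              # steps (probability samples) per round
INIT_STATE = (0.6, 0.4, 0.4, 0.5, 0.1)    # preferred initial state

def run_round(q_net, init_state=INIT_STATE):
    """Steps 202-205: one round of n state -> action -> next-state transitions."""
    state = np.array(init_state, dtype=float)
    transitions, hidden = [], None
    for _ in range(N_STEPS):
        s = torch.tensor(state, dtype=torch.float32).view(1, 1, -1)
        q_values, hidden = q_net(s, hidden)
        idx = int(q_values.detach().view(-1).argmax())     # one of the ten actions
        action = (idx // 2, 1 if idx % 2 == 0 else -1)     # (dimension, +/- direction)
        next_state = environment_step(state, action)
        transitions.append((state.copy(), idx, next_state.copy()))
        state = next_state
    return state, transitions  # the final state is the round's dynamic threshold
```

After each round (steps 206 to 208), the reward and punishment value would be computed from the final threshold together with the n probability sequences and their labels, the quadruples stored in the memory bank, and, once at least 1000 rounds are stored, the Q network trained on a randomly sampled round, with the target network copying its parameters every 100 training rounds, as described above.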
Step 3, inputting the power utilization probability sequence to be tested into the trained DRQN model, taking the state as the dynamic threshold of the sequence, and judging whether the power utilization is abnormal according to the dynamic threshold.
Preferably, the step 3 comprises:
The maximum value a1, the median b1, the average value c1 and the proportion e1 of days exceeding the threshold A are calculated for the power utilization probability sequence to be tested, and a1, b1, c1 and e1 are compared with the values a2, b2, c2, A and e2 of the five-dimensional dynamic-threshold array to obtain an output of 1 or 0.
The output represents the judgment made by the threshold on the power utilization probability sequence sample to be tested: if the output is 1, the sequence is judged to belong to an abnormal electricity user; otherwise it is judged to belong to a normal user. Combined with the true label, this determines whether the judgment is correct.
Example 3
Embodiment 3 of the present invention is a specific application embodiment of the reinforcement-learning-based abnormal electricity utilization judgment system. Fig. 5 is a flowchart of the abnormal user detection method provided by this embodiment. In this application embodiment, user data sampled from power companies in province G and province J are used to train a DRQN model. The data set contains electricity consumption data of more than 300 users, with a single user having at most 311 days of electricity consumption records. The sampling interval of the data set is 0.5 h, so each user has 48 electricity usage records per day. The users' electricity usage data are first down-sampled to one record per hour, and the user data are then uniformly clipped to 300 days of electricity usage records.
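For illustration, the down-sampling and clipping could be done as follows; summing each pair of half-hourly readings into an hourly value is an assumption, since the text does not specify how the down-sampling is performed.

```python
import numpy as np

def preprocess(readings_half_hourly, n_days=300):
    """readings_half_hourly: array of shape (days, 48) of 0.5 h meter readings.
    Returns an array of shape (n_days, 24) of hourly values."""
    x = np.asarray(readings_half_hourly, dtype=float)
    hourly = x.reshape(x.shape[0], 24, 2).sum(axis=2)   # 48 half-hours -> 24 hours
    return hourly[:n_days]                              # uniformly clip to 300 days
```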
A simple two-layer fully-connected network with a softmax layer is built as the initial classifier. Data from 250 users are selected as the training set, and the users' daily electricity data records and corresponding labels are input for learning. After the classifier is trained, the user data are input once more, and a probability sequence indicating whether each user exhibits abnormal electricity utilization behavior on each day is output; the length of this probability sequence is 300.
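A classifier of the kind described (two fully connected layers with a softmax output, here assumed to take one day's 24 hourly readings) might be sketched as follows; the hidden width is an assumption.

```python
import torch.nn as nn

class DailyClassifier(nn.Module):
    """Two fully connected layers with a softmax output over {normal, abnormal}."""

    def __init__(self, input_dim=24, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),
            nn.Softmax(dim=-1),
        )

    def forward(self, daily_readings):
        # Column 1 of the output is the probability of abnormal electricity use.
        return self.net(daily_readings)
```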
The probability sequences output by the classifier are used to train the DRQN model, with 50 samples input per round. After the DRQN model is trained, the remaining 81 users' data are used as the test set and input into the DRQN model for testing; a corresponding dynamic threshold is obtained, and whether each user is an abnormal electricity user is judged according to this threshold. In tests, the highest accuracy on the test set reached 96.7%, a clear improvement over the performance of a fixed threshold on the same test set, as detailed in Table 1 below.
Table 1. Comparison of the performance of the dynamic threshold of the present invention versus a fixed threshold on the test set
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent replacements, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An abnormal power utilization judgment system based on reinforcement learning, characterized in that the judgment system is a constructed DRQN model comprising: a memory bank, a Q network model and a target Q network model;
the memory bank is used for storing the current state, the currently selected action, the state of the next step and the reward and punishment value of the current round;
the Q network model takes the current state as input and the currently selected action as output, and takes the current state as a decision index to determine the reward and punishment value of the current round;
the target Q network model has the same structure as the Q network model, and each time the Q network model has been trained a set number of times, the network parameters of the target Q network model are synchronized with the network parameters of the Q network model; the target Q network model takes the next-step state stored in the memory bank as input; a loss is calculated according to the outputs of the Q network model and the target Q network model and the reward and punishment values, and the network parameters of the Q network model are updated according to the loss;
inputting the power utilization probability sequence to be tested into the trained DRQN model, taking the current state as a dynamic threshold of the power utilization probability sequence to be tested, and judging whether the power utilization is abnormal or not according to the dynamic threshold.
2. The decision system according to claim 1, wherein the current state and the next state are five-dimensional arrays of 5 values, including four-dimensional decision thresholds and one-dimensional decision ratios;
the four decision threshold dimensions are respectively thresholds for the maximum value, the median and the average value, plus a set threshold A; the one-dimensional decision proportion is the proportion exceeding the threshold A.
3. The decision system of claim 2, wherein the action comprises:
and selecting two of the five-dimensional arrays of the state by the DRQN model, and adding a set fixed value and subtracting a set fixed value to the two selected arrays respectively.
4. The decision system according to claim 1, wherein, after the Q network model outputs the currently selected action, an environment function takes the current state and the currently selected action as input and outputs the state of the next step.
5. The decision system according to claim 1 or 4, wherein the reward and punishment value is calculated by: taking the current state as the judgment index, counting the number m of correctly judged samples among the n input samples, and calculating the reward and punishment value from m and n.
6. The decision system according to claim 5, wherein the state as a decision indicator is a next-step state obtained after the DRQN model completes a complete iteration of one round;
inputting n power utilization probability sequences over the n steps of each round, wherein each step comprises: inputting the current state, outputting the currently selected action, and obtaining the next-step state from the environment function; and inputting the next-step state obtained after the DRQN model completes the complete iteration of one round into the reward and punishment function, which outputs the reward and punishment value.
7. The decision system according to claim 2, wherein the power utilization probability sequence to be tested is input into the trained DRQN model, the current state is used as a dynamic threshold of the power utilization probability sequence to be tested, and the determining whether the power utilization is abnormal according to the dynamic threshold comprises:
calculating, for the power utilization probability sequence to be tested, the maximum value a1, the median b1, the average value c1 and the proportion e1 of days exceeding the threshold A, and comparing a1, b1, c1 and e1 with the values a2, b2, c2, A and e2 of the five-dimensional dynamic-threshold array to obtain an output of 1 or 0;
and if the output is 1, judging that the power utilization probability sequence to be detected belongs to an abnormal power utilization user, otherwise, judging that the power utilization probability sequence to be detected is a normal user.
8. An abnormal electricity utilization judging method based on the judging system of any one of claims 1 to 7, characterized in that the judging method comprises:
step 1, obtaining a user abnormal electricity utilization probability sequence output by a classifier and sample data of a corresponding original label, and dividing the sample data into a training set and a test set;
step 2, the DRQN model is iterated by utilizing the training set, and the training of the Q network model is completed during the iteration process of the DRQN model;
the iteration process of the DRQN model comprises: according to the n input power utilization probability sequences, determining the reward and punishment value of the current round by taking the dynamic state as a decision index and obtaining the next-step state;
step 3, inputting the power utilization probability sequence to be tested into the trained DRQN model, taking the state as a dynamic threshold of the sequence, and judging whether the power utilization is abnormal according to the dynamic threshold.
9. The decision method according to claim 8, the step 2 comprising:
step 201, initializing the Q network model, the target Q network model and a memory base;
step 202, initializing a state, and inputting a value of the initial state into the Q network model;
step 203, the Q network model outputs the currently selected action, inputs the current state and the currently selected action into an environment function, and outputs the state of the next step;
step 204, inputting the next state in step 203 as the current state into the Q network model;
step 205, repeating steps 203 to 204 n times to complete the iteration of one round, and inputting the state of the last step, the n power utilization probability sequences and the corresponding original labels into a reward and punishment function to obtain the reward and punishment value of the round;
step 206, repeating the steps 202 to 205 for a set number of times, accumulating the quadruple data of each round, and storing the quadruple data into the memory bank;
the quadruple is (s, a, s', r), wherein s denotes the current state, a denotes the currently selected action, s' denotes the next-step state, and r denotes the reward and punishment value of the current round;
step 207, when the quadruple data stored in the memory bank reach a set amount, training the Q network model between step 203 and step 204, including: randomly sampling the n quadruples (s, a, s', r) of one round from the memory bank, then sequentially inputting the current states s of the n quadruples into the Q network model and the next-step states s' into the target Q network model;
step 208, each time the Q network model has been trained a set number of times, synchronizing the network parameters of the target Q network model with those of the Q network model, calculating a loss value according to the outputs of the Q network model and the target Q network model and the reward and punishment value, and updating the network parameters of the Q network model according to the loss value.
10. The decision method according to claim 8, the step 3 comprising:
calculating, for the power utilization probability sequence to be tested, the maximum value a1, the median b1, the average value c1 and the proportion e1 of days exceeding the threshold A, and comparing a1, b1, c1 and e1 with the values a2, b2, c2, A and e2 of the five-dimensional dynamic-threshold array to obtain an output of 1 or 0;
and if the output is 1, judging that the power utilization probability sequence to be detected belongs to an abnormal power utilization user, otherwise, judging that the power utilization probability sequence to be detected is a normal user.
CN202010649574.7A 2020-07-08 2020-07-08 Abnormal electricity utilization judgment system and method based on reinforcement learning Active CN111539492B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010649574.7A CN111539492B (en) 2020-07-08 2020-07-08 Abnormal electricity utilization judgment system and method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010649574.7A CN111539492B (en) 2020-07-08 2020-07-08 Abnormal electricity utilization judgment system and method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN111539492A true CN111539492A (en) 2020-08-14
CN111539492B CN111539492B (en) 2020-11-20

Family

ID=71979835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010649574.7A Active CN111539492B (en) 2020-07-08 2020-07-08 Abnormal electricity utilization judgment system and method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111539492B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101596345B1 (en) * 2014-12-05 2016-02-25 김규호 Integrated electricity energy management system using smart grid
CN106779069A (en) * 2016-12-08 2017-05-31 国家电网公司 A kind of abnormal electricity consumption detection method based on neutral net
US20190385051A1 (en) * 2018-06-14 2019-12-19 Accenture Global Solutions Limited Virtual agent with a dialogue management system and method of training a dialogue management system
CN109347149A (en) * 2018-09-20 2019-02-15 国网河南省电力公司电力科学研究院 Micro-capacitance sensor energy storage dispatching method and device based on depth Q value network intensified learning
CN109142831A (en) * 2018-09-21 2019-01-04 国网安徽省电力公司电力科学研究院 A kind of resident's exception electricity consumption analysis method and device based on impedance analysis
US20200119556A1 (en) * 2018-10-11 2020-04-16 Di Shi Autonomous Voltage Control for Power System Using Deep Reinforcement Learning Considering N-1 Contingency
CN110488872A (en) * 2019-09-04 2019-11-22 中国人民解放军国防科技大学 A kind of unmanned plane real-time route planing method based on deeply study
CN110837532A (en) * 2019-10-17 2020-02-25 福建网能科技开发有限责任公司 Method for detecting electricity stealing behavior of charging pile based on big data platform
CN111047094A (en) * 2019-12-12 2020-04-21 国网浙江省电力有限公司 Meter reading data anomaly analysis method based on deep learning algorithm
CN111223007A (en) * 2019-12-31 2020-06-02 深圳供电局有限公司 User abnormal electricity utilization behavior analysis early warning method, device, equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JINGDA WU等: "Continuous reinforcement learning of energy management with deep Q network for a power split hybrid electric bus", 《APPLIED ENERGY》 *
刘全等: "深度强化学习综述", 《计算机学报》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705688A (en) * 2021-08-30 2021-11-26 华侨大学 Method and system for detecting abnormal electricity utilization behavior of power consumer
CN113705688B (en) * 2021-08-30 2023-08-04 华侨大学 Abnormal electricity consumption behavior detection method and system for power users
CN114638555A (en) * 2022-05-18 2022-06-17 国网江西综合能源服务有限公司 Power consumption behavior detection method and system based on multilayer regularization extreme learning machine

Also Published As

Publication number Publication date
CN111539492B (en) 2020-11-20

Similar Documents

Publication Publication Date Title
Yoon et al. Semi-supervised learning with deep generative models for asset failure prediction
CN109214566A (en) Short-term wind power prediction method based on shot and long term memory network
Han et al. Short-term forecasting of individual residential load based on deep learning and K-means clustering
CN111178611A (en) Method for predicting daily electric quantity
CN107133695A (en) A kind of wind power forecasting method and system
CN111539492B (en) Abnormal electricity utilization judgment system and method based on reinforcement learning
CN108805193A (en) A kind of power loss data filling method based on mixed strategy
CN114297036A (en) Data processing method and device, electronic equipment and readable storage medium
CN111210072B (en) Prediction model training and user resource limit determining method and device
CN117556369B (en) Power theft detection method and system for dynamically generated residual error graph convolution neural network
Mohammad et al. Short term load forecasting using deep neural networks
CN114358362A (en) Electric vehicle load prediction method under condition of data shortage
Shehzad et al. Deep learning-based meta-learner strategy for electricity theft detection
CN113762591A (en) Short-term electric quantity prediction method and system based on GRU and multi-core SVM counterstudy
Zhou et al. Statistics-based method for large-scale group decision-making with incomplete linguistic distribution fuzzy information: Incorporating reliability and entropy
Abid et al. Multi-directional gated recurrent unit and convolutional neural network for load and energy forecasting: A novel hybridization
Mosavi Extracting most discriminative features on transient multivariate time series by bi-mode hybrid feature selection scheme for transient stability prediction
Dong et al. Research on academic early warning model based on improved SVM algorithm
CN115409317A (en) Transformer area line loss detection method and device based on feature selection and machine learning
CN114169416A (en) Short-term load prediction method under small sample set based on transfer learning
CN113589034A (en) Electricity stealing detection method, device, equipment and medium for power distribution system
Liu et al. A GCN-based adaptive generative adversarial network model for short-term wind speed scenario prediction
Huang et al. Electricity Theft Detection based on Iterative Interpolation and Fusion Convolutional Neural Network
Odoom A Methodology in Utilizing Machine Learning Algorithm for Electricity Theft Detection in Ghana
Lee et al. Teaching and learning the AI modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: Room 1803-1805, building 2-07, guanggu.core center, 303 Guanggu Avenue, Donghu New Technology Development Zone, Wuhan City, Hubei Province, 430000

Patentee after: Wuhan Gelanruo Intelligent Technology Co.,Ltd.

Address before: Room 1803-1805, building 2-07, guanggu.core center, 303 Guanggu Avenue, Donghu New Technology Development Zone, Wuhan City, Hubei Province, 430000

Patentee before: WUHAN GLORY ROAD INTELLIGENT TECHNOLOGY Co.,Ltd.