CN111539492A - Abnormal electricity utilization judgment system and method based on reinforcement learning - Google Patents

Abnormal electricity utilization judgment system and method based on reinforcement learning

Info

Publication number
CN111539492A
CN111539492A
Authority
CN
China
Prior art keywords
network model
state
power utilization
value
reward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010649574.7A
Other languages
Chinese (zh)
Other versions
CN111539492B (en)
Inventor
陈应林
陈勉舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Gelanruo Intelligent Technology Co.,Ltd.
Original Assignee
Wuhan Glory Road Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Glory Road Intelligent Technology Co ltd filed Critical Wuhan Glory Road Intelligent Technology Co ltd
Priority to CN202010649574.7A priority Critical patent/CN111539492B/en
Publication of CN111539492A publication Critical patent/CN111539492A/en
Application granted granted Critical
Publication of CN111539492B publication Critical patent/CN111539492B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Water Supply & Treatment (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Testing And Monitoring For Control Systems (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention relates to an abnormal electricity utilization judgment system and method based on reinforcement learning. The judgment system is a DRQN (Deep Recurrent Q-Network) model for judging abnormal electricity utilization. The Q network model takes the current state as input and the currently selected action as output, and the state is used as a judgment index to determine the reward and punishment value of the current round. Each time the Q network model has been trained a set number of times, the network parameters of the target Q network model are synchronized with the network parameters of the Q network model. The power utilization probability sequence to be tested is input into the trained DRQN model, the state is used as a dynamic threshold for the sequence, and whether power utilization is abnormal is judged according to this dynamic threshold. Because the current state serves both as the judgment index that determines the reward and punishment value and as the dynamic threshold, the system can update the threshold according to a user's real-time power data, which effectively improves generalization across users.

Description

Abnormal electricity utilization judgment system and method based on reinforcement learning
Technical Field
The invention relates to the field of load prediction of power systems, in particular to an abnormal electricity utilization judgment system and method based on reinforcement learning.
Background
With the wide deployment of smart electricity meters, abnormal electricity utilization detection has become an important means of studying customers' abnormal consumption behaviors and discovering unexpected electricity utilization events in time. During grid operation, whether the metering device fails or the user steals electricity, the user's true electricity utilization data cannot be collected; such false electricity utilization data are called abnormal electricity utilization data. If these situations are not discovered and handled in time, they seriously interfere with the power supply order of normal users and power supply companies, causing economic losses and safety hazards in power supply and utilization. Automatic monitoring of users' abnormal electricity utilization is therefore of great practical significance.
Current abnormal electricity utilization detection methods can be classified into the following types: methods based on system state, methods based on game theory, and data-driven methods.
With the rapid development of massive data and artificial intelligence technology in the smart grid era, data-driven detection of abnormal electricity utilization behavior has gradually become a research hotspot. Data-driven methods can be divided into three categories: classification, regression and clustering, where classification and regression are supervised learning methods and clustering is an unsupervised learning method.
A classification method divides the input set into several classes according to the input feature quantities. In abnormal electricity utilization detection, classification aims to divide the set of users into a normal class and an abnormal class according to each user's feature quantities. Classification-based methods need large labeled training sets to provide samples, and classification precision is improved through training. The core of a classification method is the classifier model; common classifiers include extreme learning machines, support vector machines, k-nearest neighbor algorithms, neural networks and other models. Because they use labeled data sets, classification methods achieve higher accuracy than other methods, which has also made them a research hotspot.
Analysis of abnormal electricity consumption behavior at the user level must consider the user's long-term behavior, but analyzing long-term electricity consumption behavior by feeding a classifier directly, for example inputting one month or even one year of electricity consumption data into the classifier at once, is unsuitable for the following two reasons:
(1) In long-term electricity consumption data, normal electricity consumption data dominate, so the small portion of abnormal electricity consumption data is effectively noise, which seriously degrades the classification result output by the classifier.
(2) Inputting long-term electricity utilization data directly into the classifier implies a long detection period, so there is a long delay before a user's abnormal electricity utilization behavior is identified, which does not match practical application scenarios.
Most classification methods therefore analyze long-term electricity usage behavior on the basis of short-term electricity usage behavior. Specifically, the long-term data are divided into multiple short-period segments, which are input into a classifier that outputs the probability that each short-period sample comes from an abnormal electricity user. Generally, judging whether a short-term sample comes from an abnormal electricity user requires a judgment threshold: when the probability output by the classifier exceeds the threshold, the short-term sample is judged to come from an abnormal electricity user.
When a user's long-term data are divided into n short-term segments and each is input into the classifier, a probability sequence of length n is obtained. The next problem is what proportion of the n short-term samples must be judged as coming from an abnormal user before the user is judged, on the basis of long-term electricity utilization behavior, to be an abnormal electricity user. In conventional classification methods this decision proportion has to be set by empirical reference and has no fixed value.
In summary, most of the current abnormal electricity detection methods based on classification still have the following problems:
(1) When facing electricity consumption data from different users, it is difficult to determine a unified standard for either the short-period judgment threshold or the long-period judgment proportion.
(2) Once model training is completed and the decision threshold and decision proportion are fixed, it is difficult to produce accurate decisions when the distribution of the tested user's power consumption data changes; manpower and computing resources must then often be spent adjusting the model parameters or the decision threshold and proportion.
Disclosure of Invention
Aiming at the above technical problems in the prior art, the invention provides an abnormal electricity utilization judgment system and method based on reinforcement learning, which solve the problem that the judgment threshold and judgment proportion of traditional classifier methods are difficult to determine.
The technical scheme adopted to solve the above technical problems is as follows: an abnormal electricity utilization judgment system based on reinforcement learning, wherein the judgment system is a constructed DRQN model comprising: a memory bank, a Q network model and a target Q network model;
the memory bank is used for storing the current state, the currently selected action, the state of the next step and the reward and punishment value of the current round;
the Q network model takes the current state as input and the currently selected action as output, and takes the state as a decision index to determine the reward and punishment value of the current round;
the target Q network model has the same structure as the Q network model; each time the Q network model has been trained a set number of times, the network parameters of the target Q network model are synchronized with the network parameters of the Q network model; the target Q network model takes the next-step state stored in the memory bank as input; a loss is calculated from the outputs of the Q network model and the target Q network model together with the reward and punishment values, and the network parameters of the Q network model are updated according to the loss;
and inputting the power utilization probability sequence to be tested into the trained DRQN model, taking the state as a dynamic threshold of the power utilization probability sequence to be tested, and judging whether the power utilization is abnormal or not according to the dynamic threshold.
An abnormal electricity utilization judgment method based on the judgment system comprises the following steps:
step 1, obtaining a user abnormal electricity utilization probability sequence output by a classifier and sample data of a corresponding original label, and dividing the sample data into a training set and a test set;
step 2, iterating the DRQN model using the training set, and completing the training of the Q network model during the iteration process of the DRQN model;
the iteration process of the DRQN model comprises: according to the n input power utilization probability sequences, determining the reward and punishment value of the current round by taking the dynamic state as a decision index and obtaining the next-step state;
step 3, inputting the power utilization probability sequence to be tested into the trained DRQN model, taking the state as a dynamic threshold of the sequence, and judging whether the power utilization is abnormal according to the dynamic threshold.
The invention has the following beneficial effects: the reward and punishment value is determined by taking the current state as the decision index, and the current state is used as a dynamic threshold, so the system can update the threshold according to the user's real-time power data, which effectively improves generalization across users. Compared with a common classifier algorithm, the discrimination threshold and discrimination proportion parameters are easily obtained. User data from online testing can be used directly for online learning, so the network parameters can be adjusted without extra manpower or computing resources while maintaining accuracy over the long term. The input probability sequence can have a dynamic length, as long as the length is not less than 10, which gives the method great flexibility in practical applications.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the state is a five-dimensional array composed of 5 values, including a four-dimensional decision threshold and a one-dimensional decision ratio;
the four decision threshold dimensions are respectively thresholds for the maximum value, the median and the average value, plus a set threshold A; the one-dimensional decision proportion is the proportion exceeding the threshold A.
Further, the actions include:
and selecting two of the five-dimensional arrays of the state by the DRQN model, and adding a set fixed value and subtracting a set fixed value to the two selected arrays respectively.
Further, after the Q network model outputs the currently selected action, an environment function takes the current state and the currently selected action as input and outputs the state of the next step.
Further, the calculation process of the reward and punishment value is as follows: taking the state as the judgment index, the number m of correctly judged samples among the n input samples is counted, and the reward and punishment value is calculated from m and n.
Further, the state used as the decision index is the next-step state obtained after the DRQN model completes a complete iteration of one round;
n power utilization probability sequences are input over the n steps of each round, where each step comprises: inputting the current state, outputting the currently selected action, and obtaining the next-step state from the environment function; the next-step state obtained after the DRQN model completes the complete iteration of one round is input into the reward and punishment function, which outputs the reward and punishment value.
Further, inputting the power utilization probability sequence to be tested into the trained DRQN model, taking the current state as a dynamic threshold of the power utilization probability sequence to be tested, and determining whether the power utilization is abnormal according to the dynamic threshold includes:
calculating, for the power utilization probability sequence to be tested, the maximum value a1, the median b1, the average value c1 and the proportion e1 of days exceeding the threshold A, and comparing a1, b1, c1 and e1 with the values a2, b2, c2, A and e2 of the five-dimensional dynamic-threshold array to obtain an output of 1 or 0;
and if the output is 1, judging that the power utilization probability sequence to be detected belongs to an abnormal power utilization user, otherwise, judging that the power utilization probability sequence to be detected is a normal user.
Further, the step 2 comprises:
step 201, initializing the Q network model, the target Q network model and the memory bank;
step 202, initializing a state, and inputting a value of the initial state into the Q network model;
step 203, the Q network model outputs the currently selected action, inputs the current state and the currently selected action into an environment function, and outputs the state of the next step;
step 204, inputting the next state in step 203 as the current state into the Q network model;
step 205, repeating steps 203 to 204 n times to complete the iteration of one round, and inputting the state of the last step, the n power utilization probability sequences and the corresponding original labels into a reward and punishment function to obtain the reward and punishment value of the round;
step 206, repeating the steps 202 to 205 for a set number of times, accumulating the quadruple data of each round, and storing the quadruple data into the memory bank;
the quadruple is
Figure 796591DEST_PATH_IMAGE003
,
Figure 997765DEST_PATH_IMAGE004
Which is indicative of the current state of the device,
Figure 581194DEST_PATH_IMAGE005
indicating the action that is currently selected and,
Figure 830909DEST_PATH_IMAGE006
the status of the next step is shown,
Figure 878500DEST_PATH_IMAGE007
representing a reward and penalty value of the current round;
step 207, when the quadruple data stored in the memory bank reach a set amount, training the Q network model between step 203 and step 204, including: randomly sampling the n quadruples (s, a, s', r) of one round from the memory bank, then sequentially inputting the current states s of the n quadruples into the Q network model and the next-step states s' into the target Q network model;
step 208, each time the Q network model has been trained a set number of times, synchronizing the network parameters of the target Q network model with those of the Q network model, calculating a loss value according to the outputs of the Q network model and the target Q network model and the reward and punishment value, and updating the network parameters of the Q network model according to the loss value.
Further, the step 3 comprises:
calculating, for the power utilization probability sequence to be tested, the maximum value a1, the median b1, the average value c1 and the proportion e1 of days exceeding the threshold A, and comparing a1, b1, c1 and e1 with the values a2, b2, c2, A and e2 of the five-dimensional dynamic-threshold array to obtain an output of 1 or 0;
and if the output is 1, judging that the power utilization probability sequence to be detected belongs to an abnormal power utilization user, otherwise, judging that the power utilization probability sequence to be detected is a normal user.
The beneficial effect of adopting the above further schemes is: when facing electricity consumption data from different users, a unified standard is available to determine the short-period judgment threshold and the long-period judgment proportion; after model training is finished, the judgment threshold and judgment proportion are adjusted in real time, so that an accurate judgment result can still be produced even when the distribution of a tested user's power consumption data changes.
Drawings
Fig. 1 is an interaction diagram of an abnormal electricity usage determination system based on reinforcement learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the structure and data interaction of the Q network model according to the embodiment of the present invention;
FIG. 3 is a schematic diagram of a structure and data interaction of a target Q network model according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for training a DRQN model according to an embodiment of the present invention;
fig. 5 is a flowchart of an abnormal user detection method according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
The invention provides an abnormal electricity utilization judgment system based on reinforcement learning, which is a constructed DRQN model comprising: a memory bank, a Q network model and a target Q network model.
The memory bank is used for storing the current state, the currently selected action, the state of the next step and the reward and punishment value of the current round.
The Q network model takes the current state as input and the currently selected action as output, and takes the state as a decision index to determine the reward and punishment value of the current round.
The structure of the target Q network model is the same as that of the Q network model, and each time the Q network model has been trained a set number of times, the network parameters of the target Q network model are synchronized with the network parameters of the Q network model; the target Q network model takes the next-step state stored in the memory bank as input; the loss is calculated from the outputs of the Q network model and the target Q network model and the reward and punishment values, and the network parameters of the Q network model are updated according to the loss.
And inputting the power utilization probability sequence to be detected into the trained DRQN model, taking the state as a dynamic threshold of the power utilization probability sequence to be detected, and judging whether the power utilization is abnormal or not according to the dynamic threshold.
In the reinforcement-learning-based abnormal electricity utilization judgment system, the current state is used as the judgment index to determine the reward and punishment value, and the current state is used as a dynamic threshold, so the system can update the threshold according to the user's real-time power data, which effectively improves generalization across users. Compared with a common classifier algorithm, the discrimination threshold and discrimination proportion parameters are easily obtained. User data from online testing can be used directly for online learning, so the network parameters can be adjusted without extra manpower or computing resources while maintaining accuracy over the long term. The input probability sequence can have a dynamic length, as long as the length is not less than 10, which gives the method great flexibility in practical applications.
Example 1
Embodiment 1 of the present invention is an embodiment of an abnormal electricity utilization judgment system based on reinforcement learning. Fig. 1 is an interaction schematic diagram of the system. As can be seen from fig. 1, the judgment system is a constructed DRQN model comprising: a memory bank, a Q network model and a target Q network model.
The memory bank is used for storing the current state, the currently selected action, the state of the next step and the reward and punishment value of the current round.
In particular, the memory bank stores quadruples (s, a, s', r), where s denotes the current state, a denotes the currently selected action, s' denotes the next-step state, and r denotes the reward and punishment value of the current round.
Preferably, the current state s and the next-step state s' are each a five-dimensional array of 5 values, comprising a four-dimensional decision threshold and a one-dimensional decision proportion.
The four decision threshold dimensions are thresholds for the maximum value, the median and the average value, plus a set threshold A, where the set threshold A may be the decision threshold of an ordinary classifier; the one-dimensional decision proportion is the proportion of samples exceeding the threshold A.
The action a is the selection, by the DRQN model, of two of the values in the five-dimensional state array; a set fixed value is added to one selected value and a set fixed value is subtracted from the other.
Fig. 2 and fig. 3 are schematic diagrams of structures and data interaction of a Q network model and a target Q network model provided by an embodiment of the present invention, respectively.
As can be seen from fig. 2 and 3, the target Q network model and the Q network model have the same structure, composed of two fully-connected layers, FC (Fully Connected) Layer1 and FC Layer2, and a Long Short-Term Memory (LSTM) layer.
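By way of illustration only, the FC-FC-LSTM structure can be sketched in PyTorch as follows; the hidden width, the 5-dimensional state input and the 10-value output (one value per add/subtract action) are assumptions made for the sketch rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Sketch of the Q / target-Q structure: two fully connected layers
    followed by an LSTM layer, mapping a sequence of 5-dim states to
    10 action values (add/subtract a fixed step on one of 5 thresholds)."""

    def __init__(self, state_dim=5, hidden_dim=64, n_actions=10):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)   # FC Layer1
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)  # FC Layer2
        self.lstm = nn.LSTM(hidden_dim, n_actions, batch_first=True)  # LSTM layer

    def forward(self, states, hidden=None):
        # states: (batch, seq_len, state_dim) -- the states of one round as a sequence
        x = torch.relu(self.fc1(states))
        x = torch.relu(self.fc2(x))
        out, hidden = self.lstm(x, hidden)  # out: (batch, seq_len, n_actions)
        return out, hidden
```

At decision time, the action is taken as the argmax of the last output vector, matching the argmax selection described below.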
The Q network model takes the current state as input and the currently selected action as output, i.e., the input of the Q network model is s and the output is a, and the reward and punishment value of the current round is determined by taking the state as the judgment index.
Preferably, after the Q network model outputs the currently selected action, the environment function takes the current state and the currently selected action as input and outputs the next-step state; that is, the input of the environment function is s and a, and the output is s'. Specifically, the current dynamic threshold and the selected add/subtract action are input, and the dynamic threshold of the next step is output.
The five-dimensional state array is input into the Q network model and passes through the two fully-connected layers and the LSTM layer to give output1; the selected action is then obtained through an argmax function, i.e., a fixed value is added to or subtracted from the threshold of the corresponding dimension. Concretely, the action space is a 5 × 2 array corresponding to the 5 threshold dimensions, each threshold having two actions (adding a fixed value and subtracting a fixed value), and output1 selects one of these ten scalars through the argmax function.
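A minimal sketch of this action selection and of the environment function follows; the fixed step size of 0.05 and the clipping of thresholds to [0, 1] are illustrative assumptions, not values stated in the patent.

```python
import numpy as np

FIXED_STEP = 0.05  # assumed size of the "set fixed value"

def select_action(output1):
    """output1: array of 10 Q-values; returns (dimension index, sign)."""
    idx = int(np.argmax(output1))        # one of the ten scalars
    dim, direction = divmod(idx, 2)      # dimension 0..4, direction 0=add, 1=subtract
    sign = 1 if direction == 0 else -1
    return dim, sign

def environment_step(state, action):
    """Apply the selected add/subtract action to the 5-dim dynamic threshold."""
    dim, sign = action
    next_state = np.array(state, dtype=float)
    next_state[dim] = np.clip(next_state[dim] + sign * FIXED_STEP, 0.0, 1.0)
    return next_state
```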
the calculation process of the reward and punishment value is as follows: taking the state as a judgment index, calculating that m samples in input n samples are judged correctly, and then awarding and punishing values
Figure 320698DEST_PATH_IMAGE001
The reward and punishment value is calculated by taking the final dynamic threshold as the decision threshold and decision proportion: if the judgment is correct, the reward and punishment value is positive; otherwise it is negative.
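The exact reward formula is not reproduced here; a simple stand-in that is positive when most samples are judged correctly and negative otherwise, consistent with the description above, could look like the following (illustrative only, not the patent's own formula).

```python
def reward(m, n):
    """Round reward: m of the n input samples were judged correctly by the
    final dynamic threshold. Illustrative formula: positive when more than
    half are correct, negative otherwise."""
    return (2 * m - n) / n
```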
Preferably, the state used as the decision index is the next state obtained after the DRQN model completes a complete iteration of one round.
n power utilization probability sequences are input over the n steps of each round. Each step comprises: inputting the current state, outputting the currently selected action, and obtaining the next-step state from the environment function; the next-step state obtained in one step is used as the input state of the following step. This process cycles for the n steps of one round, and the next-step state output at the last step is the final state of the whole round, i.e., the final dynamic threshold obtained after the DRQN model has run one round. This final next-step state is input into the reward and punishment function, which outputs the reward and punishment value.
Each time the Q network model has been trained the set number of times, the network parameters of the target Q network model are synchronized with those of the Q network model. The target Q network model takes the next-step state stored in the memory bank as input, i.e., the input of the target Q network model is s'. The loss is calculated according to the outputs of the Q network model and the target Q network model together with the reward and punishment value, and the network parameters of the Q network model are updated according to the loss.
Specifically, during training the inputs of the Q network model and the target Q network model both come from samples in the memory bank: the input of the Q network model is the current state, giving output1, and the input of the target Q network model is the next-step state, giving output2. The loss is then calculated from output1, output2 and the reward and punishment value, using a hyper-parameter that is preferably 0.9.
And inputting the power utilization probability sequence to be detected into the trained DRQN model, taking the state as a dynamic threshold of the power utilization probability sequence to be detected, and judging whether the power utilization is abnormal or not according to the dynamic threshold.
Preferably, the process of determining whether the power consumption is abnormal includes:
The maximum value a1, the median b1, the average value c1 and the proportion e1 of days exceeding the threshold A are calculated for the power utilization probability sequence to be tested, and a1, b1, c1 and e1 are compared with the values a2, b2, c2, A and e2 of the five-dimensional dynamic-threshold array to obtain an output of 1 or 0.
The output represents the judgment made by the threshold on the power utilization probability sequence sample to be tested: if the output is 1, the sequence is judged to belong to an abnormal electricity user; otherwise it is judged to belong to a normal user. Combined with the true label, this determines whether the judgment is correct.
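By way of illustration, the statistics and the comparison against the learned dynamic threshold can be sketched as below; requiring all four comparisons to hold is one plausible reading of the decision rule and is an assumption, since the exact comparison formula is not reproduced here.

```python
import numpy as np

def decide(prob_seq, dynamic_threshold):
    """prob_seq: daily abnormality probabilities for one user.
    dynamic_threshold: the learned 5-dim state [a2, b2, c2, A, e2].
    Returns 1 for an abnormal-electricity user, 0 for a normal user."""
    a2, b2, c2, A, e2 = dynamic_threshold
    p = np.asarray(prob_seq, dtype=float)
    a1 = p.max()                  # maximum probability
    b1 = np.median(p)             # median probability
    c1 = p.mean()                 # average probability
    e1 = (p > A).mean()           # proportion of days exceeding threshold A
    # Assumed rule: flag the user when every statistic exceeds its threshold.
    return int(a1 > a2 and b1 > b2 and c1 > c2 and e1 > e2)
```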
Example 2
Embodiment 2 of the invention is an embodiment of an abnormal electricity utilization judgment method based on reinforcement learning. The judgment method is based on the abnormal electricity utilization judgment system provided by the embodiments of the invention, and comprises the following steps:
step 1, obtaining a user abnormal electricity utilization probability sequence output by a classifier and sample data of a corresponding original label, and dividing the sample data into a training set and a testing set.
Step 2, iterating the DRQN model using the training set; the training of the Q network model is completed during the iteration of the DRQN model.
The iteration process of the DRQN model comprises: according to the n input power utilization probability sequences, determining the reward and punishment value of the current round by taking the dynamic state as the decision index and obtaining the next-step state, i.e., generating a dynamic decision threshold and decision proportion.
Preferably, as shown in fig. 4, which is a flowchart of a method for training a DRQN model provided in an embodiment of the present invention, as can be seen from fig. 1 to fig. 4, step 2 includes:
Step 201, initializing a Q network model, a target Q network model and a memory bank.
Specifically, initializing the memory bank means emptying it. The memory bank may be a container of size 20000 × 50: since the fixed number of steps per round is preferably 50, a total of 20000 rounds of data can be stored, each round containing 50 quadruples (s, a, s', r).
Step 202, starting iteration of a round, initializing the state, starting iteration of the first step, and inputting the value of the initial state into the Q network model.
Specifically, the state is preferably initialized to [0.6, 0.4, 0.4, 0.5, 0.1].
In fig. 1, the dashed box shows the flow of one step of the DRQN model's iteration, and the thick arrow indicates that data are passed to the reward and punishment function after one round is completed.
Step 203, the Q network model outputs the currently selected action; the current state and the currently selected action are input into the environment function, which outputs the next-step state, completing the current step.
Within one step of the iteration, the input of the Q network is the current state coming from the environment function, the output is the currently selected action, and the action is passed into the environment function to obtain the next-step state.
Step 204, starting the next step of the iteration: the next-step state from step 203 is input into the Q network model as the current state.
Step 205, repeating steps 203 to 204 n times to complete the iteration of one round, and inputting the state of the last step, the n power utilization probability sequences and the corresponding original labels into the reward and punishment function to obtain the reward and punishment value of the round.
Step 206, repeating steps 202 to 205 a set number of times, accumulating the quadruple data of each round, and storing the quadruple data in the memory bank.
The quadruple is (s, a, s', r), where s denotes the current state, a the currently selected action, s' the next-step state, and r the reward and punishment value of the current round.
Each round has n steps and therefore n quadruples (s, a, s', r). After one round of iteration is finished, the states s, the next-step states s', the actions a and the reward and punishment values r of all steps are stored in the memory bank.
Step 207, when the quadruple data stored in the memory bank reach a set amount, the Q network model is trained between step 203 and step 204, including: randomly sampling the n quadruples (s, a, s', r) of one round from the memory bank to start a round of training, then sequentially inputting the current states s of the n quadruples into the Q network model and the next-step states s' into the target Q network model.
Formal training starts once the memory bank holds a certain amount of data, specifically when its data volume reaches at least 1000 rounds. When the Q network model is trained, its input is the current state sampled from the memory bank; only the Q network model in the DRQN participates in the training, which is inserted into the iteration. Specifically, after step 202 and step 203, training of the Q network model begins: one round of data, i.e., n quadruples (s, a, s', r), is randomly sampled from the memory bank to start a round of training; the current states s of the n quadruples are sequentially input into the Q network model and the next-step states s' into the target Q network model; only the Q network model is trained, and the loss value is calculated from the output of the Q network model, the output of the target Q network model and the reward and punishment value r. The network parameters of the Q network model are then updated according to the loss, after which steps 204 and 205 continue.
Step 208, each time the Q network model has been trained the set number of times, the network parameters of the target Q network model are synchronized with those of the Q network model; a loss value is calculated from the outputs of the Q network model and the target Q network model and the reward and punishment value, and the network parameters of the Q network model are updated according to the loss value.
One pass through steps 201 to 207 is called one epoch of training; multiple epochs are trained repeatedly, and the state, action and reward and punishment value data of each round are continuously stored in the memory bank; if the memory bank reaches its capacity limit, the earliest data are deleted. In fig. 1, the dashed arrow indicates what happens after every several epochs of training: specifically, it is checked whether 100 training rounds have elapsed; if so, the target Q network copies the network parameters of the Q network once, and otherwise its network parameters remain fixed.
Specifically, it is determined whether 50000 rounds of training have been completed; if so, the whole training process ends; otherwise, the process returns to step 202 and training continues with the next round.
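Pieced together, steps 201 to 208 above can be sketched roughly as below, reusing the illustrative MemoryBank, QNetwork, environment_step and td_loss defined earlier; the round length of 50, the purely greedy (exploration-free) action choice and the handling of the per-round reward are simplifying assumptions.

```python
import numpy as np
import torch

N_STEPS = 50                              # steps (probability samples) per round
INIT_STATE = (0.6, 0.4, 0.4, 0.5, 0.1)    # preferred initial state

def run_round(q_net, init_state=INIT_STATE):
    """Steps 202-205: one round of n state -> action -> next-state transitions."""
    state = np.array(init_state, dtype=float)
    transitions, hidden = [], None
    for _ in range(N_STEPS):
        s = torch.tensor(state, dtype=torch.float32).view(1, 1, -1)
        q_values, hidden = q_net(s, hidden)
        idx = int(q_values.detach().view(-1).argmax())     # one of the ten actions
        action = (idx // 2, 1 if idx % 2 == 0 else -1)     # (dimension, +/- direction)
        next_state = environment_step(state, action)
        transitions.append((state.copy(), idx, next_state.copy()))
        state = next_state
    return state, transitions  # the final state is the round's dynamic threshold
```

After each round (steps 206 to 208), the reward and punishment value would be computed from the final threshold together with the n probability sequences and their labels, the quadruples stored in the memory bank, and, once at least 1000 rounds are stored, the Q network trained on a randomly sampled round, with the target network copying its parameters every 100 training rounds, as described above.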
Step 3, inputting the power utilization probability sequence to be tested into the trained DRQN model, taking the state as the dynamic threshold of the sequence, and judging whether the power utilization is abnormal according to the dynamic threshold.
Preferably, the step 3 comprises:
The maximum value a1, the median b1, the average value c1 and the proportion e1 of days exceeding the threshold A are calculated for the power utilization probability sequence to be tested, and a1, b1, c1 and e1 are compared with the values a2, b2, c2, A and e2 of the five-dimensional dynamic-threshold array to obtain an output of 1 or 0.
The output represents the judgment made by the threshold on the power utilization probability sequence sample to be tested: if the output is 1, the sequence is judged to belong to an abnormal electricity user; otherwise it is judged to belong to a normal user. Combined with the true label, this determines whether the judgment is correct.
Example 3
Embodiment 3 of the present invention is a specific application embodiment of the reinforcement-learning-based abnormal electricity utilization judgment system. Fig. 5 is a flowchart of the abnormal user detection method provided by this embodiment. In this application embodiment, user data sampled from power companies in province G and province J are used to train a DRQN model. The data set contains electricity consumption data of more than 300 users, with a single user having at most 311 days of electricity consumption records. The sampling interval of the data set is 0.5 h, so each user has 48 electricity usage records per day. The users' electricity usage data are first down-sampled to one record per hour, and the user data are then uniformly clipped to 300 days of electricity usage records.
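For illustration, the down-sampling and clipping could be done as follows; summing each pair of half-hourly readings into an hourly value is an assumption, since the text does not specify how the down-sampling is performed.

```python
import numpy as np

def preprocess(readings_half_hourly, n_days=300):
    """readings_half_hourly: array of shape (days, 48) of 0.5 h meter readings.
    Returns an array of shape (n_days, 24) of hourly values."""
    x = np.asarray(readings_half_hourly, dtype=float)
    hourly = x.reshape(x.shape[0], 24, 2).sum(axis=2)   # 48 half-hours -> 24 hours
    return hourly[:n_days]                              # uniformly clip to 300 days
```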
A simple two-layer fully-connected network with a softmax layer is built as the initial classifier. Data from 250 users are selected as the training set, and the users' daily electricity data records and corresponding labels are input for learning. After the classifier is trained, the user data are input once more, and a probability sequence indicating whether each user exhibits abnormal electricity utilization behavior on each day is output; the length of this probability sequence is 300.
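A classifier of the kind described (two fully connected layers with a softmax output, here assumed to take one day's 24 hourly readings) might be sketched as follows; the hidden width is an assumption.

```python
import torch.nn as nn

class DailyClassifier(nn.Module):
    """Two fully connected layers with a softmax output over {normal, abnormal}."""

    def __init__(self, input_dim=24, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),
            nn.Softmax(dim=-1),
        )

    def forward(self, daily_readings):
        # Column 1 of the output is the probability of abnormal electricity use.
        return self.net(daily_readings)
```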
The probability sequences output by the classifier are used to train the DRQN model, with 50 samples input per round. After the DRQN model is trained, the remaining 81 users' data are used as the test set and input into the DRQN model for testing; a corresponding dynamic threshold is obtained, and whether each user is an abnormal electricity user is judged according to this threshold. In tests, the highest accuracy on the test set reached 96.7%, a clear improvement over the performance of a fixed threshold on the same test set, as detailed in Table 1 below.
Table 1. Comparison of the performance of the dynamic threshold of the present invention versus a fixed threshold on the test set
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent replacements, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An abnormal power utilization judgment system based on reinforcement learning, characterized in that the judgment system is a constructed DRQN model comprising: a memory bank, a Q network model and a target Q network model;
the memory bank is used for storing the current state, the currently selected action, the state of the next step and the reward and punishment value of the current round;
the Q network model takes the current state as input and the currently selected action as output, and takes the current state as a decision index to determine the reward and punishment value of the current round;
the target Q network model has the same structure as the Q network model, and each time the Q network model has been trained a set number of times, the network parameters of the target Q network model are synchronized with the network parameters of the Q network model; the target Q network model takes the next-step state stored in the memory bank as input; a loss is calculated according to the outputs of the Q network model and the target Q network model and the reward and punishment values, and the network parameters of the Q network model are updated according to the loss;
inputting the power utilization probability sequence to be tested into the trained DRQN model, taking the current state as a dynamic threshold of the power utilization probability sequence to be tested, and judging whether the power utilization is abnormal or not according to the dynamic threshold.
2. The decision system according to claim 1, wherein the current state and the next state are five-dimensional arrays of 5 values, including four-dimensional decision thresholds and one-dimensional decision ratios;
the four decision threshold dimensions are respectively thresholds for the maximum value, the median and the average value, plus a set threshold A; the one-dimensional decision proportion is the proportion exceeding the threshold A.
3. The decision system of claim 2, wherein the action comprises:
and selecting two of the five-dimensional arrays of the state by the DRQN model, and adding a set fixed value and subtracting a set fixed value to the two selected arrays respectively.
4. The decision system according to claim 1, wherein, after the Q network model outputs the currently selected action, an environment function takes the current state and the currently selected action as input and outputs the state of the next step.
5. The decision system according to claim 1 or 4, wherein the reward and punishment value is calculated by: taking the current state as the judgment index, counting the number m of correctly judged samples among the n input samples, and calculating the reward and punishment value from m and n.
6. The decision system according to claim 5, wherein the state as a decision indicator is a next-step state obtained after the DRQN model completes a complete iteration of one round;
inputting n power utilization probability sequences over the n steps of each round, wherein each step comprises: inputting the current state, outputting the currently selected action, and obtaining the next-step state from the environment function; and inputting the next-step state obtained after the DRQN model completes the complete iteration of one round into the reward and punishment function, which outputs the reward and punishment value.
7. The decision system according to claim 2, wherein the power utilization probability sequence to be tested is input into the trained DRQN model, the current state is used as a dynamic threshold of the power utilization probability sequence to be tested, and the determining whether the power utilization is abnormal according to the dynamic threshold comprises:
calculating, for the power utilization probability sequence to be tested, the maximum value a1, the median b1, the average value c1 and the proportion e1 of days exceeding the threshold A, and comparing a1, b1, c1 and e1 with the values a2, b2, c2, A and e2 of the five-dimensional dynamic-threshold array to obtain an output of 1 or 0;
and if the output is 1, judging that the power utilization probability sequence to be detected belongs to an abnormal power utilization user, otherwise, judging that the power utilization probability sequence to be detected is a normal user.
8. An abnormal electricity utilization judging method based on the judging system of any one of claims 1 to 7, characterized in that the judging method comprises:
step 1, obtaining a user abnormal electricity utilization probability sequence output by a classifier and sample data of a corresponding original label, and dividing the sample data into a training set and a test set;
step 2, the DRQN model is iterated by utilizing the training set, and the training of the Q network model is completed during the iteration process of the DRQN model;
the iteration process of the DRQN model comprises: according to the n input power utilization probability sequences, determining the reward and punishment value of the current round by taking the dynamic state as a decision index and obtaining the next-step state;
step 3, inputting the power utilization probability sequence to be tested into the trained DRQN model, taking the state as a dynamic threshold of the sequence, and judging whether the power utilization is abnormal according to the dynamic threshold.
9. The decision method according to claim 8, the step 2 comprising:
step 201, initializing the Q network model, the target Q network model and a memory base;
step 202, initializing a state, and inputting a value of the initial state into the Q network model;
step 203, the Q network model outputs the currently selected action, inputs the current state and the currently selected action into an environment function, and outputs the state of the next step;
step 204, inputting the next state in step 203 as the current state into the Q network model;
step 205, repeating steps 203 to 204 n times to complete the iteration of one round, and inputting the state of the last step, the n power utilization probability sequences and the corresponding original labels into a reward and punishment function to obtain the reward and punishment value of the round;
step 206, repeating the steps 202 to 205 for a set number of times, accumulating the quadruple data of each round, and storing the quadruple data into the memory bank;
the quadruple is (s, a, s', r), wherein s denotes the current state, a denotes the currently selected action, s' denotes the next-step state, and r denotes the reward and punishment value of the current round;
step 207, when the quadruple data stored in the memory bank reach a set amount, training the Q network model between step 203 and step 204, including: randomly sampling the n quadruples (s, a, s', r) of one round from the memory bank, then sequentially inputting the current states s of the n quadruples into the Q network model and the next-step states s' into the target Q network model;
step 208, each time the Q network model has been trained a set number of times, synchronizing the network parameters of the target Q network model with those of the Q network model, calculating a loss value according to the outputs of the Q network model and the target Q network model and the reward and punishment value, and updating the network parameters of the Q network model according to the loss value.
10. The decision method according to claim 8, the step 3 comprising:
calculating, for the power utilization probability sequence to be tested, the maximum value a1, the median b1, the average value c1 and the proportion e1 of days exceeding the threshold A, and comparing a1, b1, c1 and e1 with the values a2, b2, c2, A and e2 of the five-dimensional dynamic-threshold array to obtain an output of 1 or 0;
and if the output is 1, judging that the power utilization probability sequence to be detected belongs to an abnormal power utilization user, otherwise, judging that the power utilization probability sequence to be detected is a normal user.
CN202010649574.7A 2020-07-08 2020-07-08 Abnormal electricity utilization judgment system and method based on reinforcement learning Active CN111539492B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010649574.7A CN111539492B (en) 2020-07-08 2020-07-08 Abnormal electricity utilization judgment system and method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010649574.7A CN111539492B (en) 2020-07-08 2020-07-08 Abnormal electricity utilization judgment system and method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN111539492A true CN111539492A (en) 2020-08-14
CN111539492B CN111539492B (en) 2020-11-20

Family

ID=71979835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010649574.7A Active CN111539492B (en) 2020-07-08 2020-07-08 Abnormal electricity utilization judgment system and method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111539492B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101596345B1 (en) * 2014-12-05 2016-02-25 김규호 Integrated electricity energy management system using smart grid
CN106779069A (en) * 2016-12-08 2017-05-31 国家电网公司 A kind of abnormal electricity consumption detection method based on neutral net
US20190385051A1 (en) * 2018-06-14 2019-12-19 Accenture Global Solutions Limited Virtual agent with a dialogue management system and method of training a dialogue management system
CN109347149A (en) * 2018-09-20 2019-02-15 国网河南省电力公司电力科学研究院 Micro-capacitance sensor energy storage dispatching method and device based on depth Q value network intensified learning
CN109142831A (en) * 2018-09-21 2019-01-04 国网安徽省电力公司电力科学研究院 A kind of resident's exception electricity consumption analysis method and device based on impedance analysis
US20200119556A1 (en) * 2018-10-11 2020-04-16 Di Shi Autonomous Voltage Control for Power System Using Deep Reinforcement Learning Considering N-1 Contingency
CN110488872A (en) * 2019-09-04 2019-11-22 中国人民解放军国防科技大学 A kind of unmanned plane real-time route planing method based on deeply study
CN110837532A (en) * 2019-10-17 2020-02-25 福建网能科技开发有限责任公司 Method for detecting electricity stealing behavior of charging pile based on big data platform
CN111047094A (en) * 2019-12-12 2020-04-21 国网浙江省电力有限公司 Meter reading data anomaly analysis method based on deep learning algorithm
CN111223007A (en) * 2019-12-31 2020-06-02 深圳供电局有限公司 User abnormal electricity utilization behavior analysis early warning method, device, equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JINGDA WU等: "Continuous reinforcement learning of energy management with deep Q network for a power split hybrid electric bus", 《APPLIED ENERGY》 *
刘全等: "深度强化学习综述", 《计算机学报》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705688A (en) * 2021-08-30 2021-11-26 华侨大学 Method and system for detecting abnormal electricity utilization behavior of power consumer
CN113705688B (en) * 2021-08-30 2023-08-04 华侨大学 Abnormal electricity consumption behavior detection method and system for power users
CN114638555A (en) * 2022-05-18 2022-06-17 国网江西综合能源服务有限公司 Power consumption behavior detection method and system based on multilayer regularization extreme learning machine

Also Published As

Publication number Publication date
CN111539492B (en) 2020-11-20

Similar Documents

Publication Publication Date Title
Yoon et al. Semi-supervised learning with deep generative models for asset failure prediction
CN109214566A (en) Short-term wind power prediction method based on shot and long term memory network
Han et al. Short-term forecasting of individual residential load based on deep learning and K-means clustering
CN111178611A (en) Method for predicting daily electric quantity
CN107133695A (en) A kind of wind power forecasting method and system
CN111539492B (en) Abnormal electricity utilization judgment system and method based on reinforcement learning
CN108805193A (en) A kind of power loss data filling method based on mixed strategy
CN114297036A (en) Data processing method and device, electronic equipment and readable storage medium
CN111210072B (en) Prediction model training and user resource limit determining method and device
CN117556369B (en) Power theft detection method and system for dynamically generated residual error graph convolution neural network
Mohammad et al. Short term load forecasting using deep neural networks
CN114358362A (en) Electric vehicle load prediction method under condition of data shortage
Shehzad et al. Deep learning-based meta-learner strategy for electricity theft detection
CN113762591A (en) Short-term electric quantity prediction method and system based on GRU and multi-core SVM counterstudy
Zhou et al. Statistics-based method for large-scale group decision-making with incomplete linguistic distribution fuzzy information: Incorporating reliability and entropy
Abid et al. Multi-directional gated recurrent unit and convolutional neural network for load and energy forecasting: A novel hybridization
Mosavi Extracting most discriminative features on transient multivariate time series by bi-mode hybrid feature selection scheme for transient stability prediction
Dong et al. Research on academic early warning model based on improved SVM algorithm
CN115409317A (en) Transformer area line loss detection method and device based on feature selection and machine learning
CN114169416A (en) Short-term load prediction method under small sample set based on transfer learning
CN113589034A (en) Electricity stealing detection method, device, equipment and medium for power distribution system
Liu et al. A GCN-based adaptive generative adversarial network model for short-term wind speed scenario prediction
Huang et al. Electricity Theft Detection based on Iterative Interpolation and Fusion Convolutional Neural Network
Odoom A Methodology in Utilizing Machine Learning Algorithm for Electricity Theft Detection in Ghana
Lee et al. Teaching and learning the AI modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: Room 1803-1805, building 2-07, guanggu.core center, 303 Guanggu Avenue, Donghu New Technology Development Zone, Wuhan City, Hubei Province, 430000

Patentee after: Wuhan Gelanruo Intelligent Technology Co.,Ltd.

Address before: Room 1803-1805, building 2-07, guanggu.core center, 303 Guanggu Avenue, Donghu New Technology Development Zone, Wuhan City, Hubei Province, 430000

Patentee before: WUHAN GLORY ROAD INTELLIGENT TECHNOLOGY Co.,Ltd.