CN115936926A - SMOTE-GBDT-based unbalanced electricity stealing data classification method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN115936926A
Authority
CN
China
Prior art keywords
data
gbdt
smote
model
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211702959.0A
Other languages
Chinese (zh)
Inventor
卜龙敏
赵丹
张璨辉
陈红
徐慧婷
杨帆
刘映兰
龙乙林
王翔宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Metering Center of State Grid Hunan Electric Power Co Ltd
Original Assignee
Metering Center of State Grid Hunan Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-12-29
Filing date
2022-12-29
Publication date
2023-04-07
Application filed by Metering Center of State Grid Hunan Electric Power Co Ltd
Priority to CN202211702959.0A
Publication of CN115936926A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a SMOTE-GBDT-based unbalanced electricity stealing data classification method and device, computer equipment and a storage medium. The method comprises the following steps: step one, collecting user load data; step two, filling missing values and standardizing the data; step three, oversampling with the SMOTE algorithm using different K values; step four, training a GBDT model with default parameters on each oversampled data set; step five, evaluating the classification performance on the data sets generated with the different K values and finding the K-neighbor value that gives the best classification; step six, training a GBDT model on the data set with the best oversampling effect; and step seven, tuning the model parameters by combining grid search with cross validation to find the optimal parameters of the GBDT model, thereby obtaining the optimal electricity stealing analysis model. The invention enables analysis of users' electricity stealing behavior from the user load data in the metering system and improves the application value of the metering data of electricity consumers.

Description

SMOTE-GBDT-based unbalanced electricity stealing data classification method and device, computer equipment and storage medium
Technical Field
The invention belongs to the technical field of electric power and artificial intelligence, and relates to an unbalanced electricity stealing data classification method and device based on SMOTE-GBDT, computer equipment and a storage medium.
Background
As power networks continue to expand, the probability that they are damaged by electricity stealing also rises sharply. Electricity stealing not only harms the economic benefits of the power system; the devices used to steal electricity often damage distribution lines and equipment and create serious safety hazards for the persons and property of nearby residents. At present, the traditional power system is developing into a smart grid with digital control and communication capability, while more and more technological means are being applied to electricity stealing. Traditional electricity stealing detection and data analysis methods cannot keep pace with the development of the power system, and power departments need a new and efficient method for analyzing electricity stealing behavior. How to raise the anti-electricity-stealing level of the power grid and improve the efficiency of electricity stealing inspection has therefore become a problem to be solved urgently.
Non-invasive load analysis can analyze user behavior by measuring and analyzing information such as current, voltage and power at the power load inlet of the metering device, and has the advantages of being simple, economical, and easy to popularize and apply. In the present method, for user load data in a metering system, a mean interpolation method is used to fill missing values in the electricity consumption time series; new samples of high accuracy are synthesized by exploiting the fact that adjacent points in the feature space have similar features; and finally GBDT is used to classify electricity-stealing users, thereby achieving analysis of users' electricity stealing behavior and improving the application value of the metering data of electricity customers.
Disclosure of Invention
The invention aims to provide a SMOTE-GBDT-based unbalanced electricity stealing data classification method and device, computer equipment and a storage medium, in which missing values are filled and samples are artificially generated for the unbalanced data, so as to improve the electricity stealing detection effect.
The technical scheme adopted by the invention is as follows:
in a first aspect, the invention provides an unbalanced electricity stealing data classification method based on SMOTE-GBDT, which comprises the following steps:
step one, collecting user load data;
step two, missing value filling and standardization processing are carried out on the data;
step three, oversampling with different K values based on the SMOTE algorithm;
step four, training a GBDT model with default parameters on each oversampled data set;
step five, evaluating the classification performance on the data sets generated with different K values, and finding the K-neighbor value that gives the best classification;
step six, training a GBDT model based on the data set with the best oversampling effect;
and step seven, tuning the model parameters by combining grid search with cross validation to find the optimal parameters of the GBDT model, thereby obtaining the optimal electricity stealing analysis model.
Further, in the second step, a mean interpolation method is adopted to fill the missing values, and the formula is shown as (1):
Figure BDA0004025163760000021
where the index i denotes the i-th user, the index j denotes the j-th day, $x_{ij}$ is the power consumption of the i-th user on the j-th day, and $f_1(x_{ij})$ is the mean-interpolated value of $x_{ij}$.
Further, in the second step, the data set is normalized using the min-max normalization method, calculated as in formula (2):
$$f_3(x_{ij}) = \frac{x_{ij} - x_{i\min}}{x_{i\max} - x_{i\min}} \qquad (2)$$
where x denotes the electricity consumption data of the user, i denotes user i, j denotes day j, $x_{ij}$ is the power consumption of user i on day j, $x_{i\min}$ is the minimum power consumption of the i-th user, $x_{i\max}$ is the maximum power consumption of the i-th user, and $f_3(x_{ij})$ is the min-max normalized value of $x_{ij}$.
Further, the GBDT model has the best overall performance when trained on a data set obtained by taking the SMOTE algorithm K value to be 5.
In a second aspect, the present invention provides an unbalanced electricity stealing data classification device based on SMOTE-GBDT, which includes:
the data acquisition module is used for collecting user load data in the metering system;
the data preprocessing module is used for filling missing values and carrying out standardization processing on the data;
the oversampling module is used for oversampling the preprocessed data with the SMOTE algorithm using different K values;
the model training and generating module is used for training a GBDT model with default parameters on each oversampled data set, evaluating the classification performance on the data sets generated with different K values, finding the K-neighbor value that gives the best classification, applying SMOTE to the user load data set with the optimal K value to generate a new data set, training the GBDT model on this data set, and tuning the model parameters by combining grid search with cross validation to find the optimal parameters of the GBDT model;
and the model application module is used for inputting the original user load data into the trained optimal GBDT model for computational analysis.
In a third aspect, the present invention provides a computer device, which includes a memory and a processor, the memory and the processor are communicatively connected, the memory stores computer instructions, and the processor executes the computer instructions, thereby executing the data classification method of the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium storing computer instructions for causing a computer to perform the data classification method according to the first aspect.
The beneficial effects of the invention are as follows: for user load data in a metering system, a mean interpolation method is used to fill missing values in the electricity consumption time series; new samples of high accuracy are synthesized by exploiting the fact that adjacent points in the feature space have similar features; and finally GBDT is used to classify electricity-stealing users, thereby achieving analysis of users' electricity stealing behavior and improving the application value of the metering data of electricity customers.
Drawings
FIG. 1 is a flow chart of the electricity stealing analysis process of the present invention;
FIG. 2 is a flow chart of the user load data preprocessing of the present invention;
FIG. 3 is a flow chart of user load data set expansion based on SMOTE algorithm;
FIG. 4 is a flow chart of the electricity stealing analysis process based on missing value filling and SMOTE-XGBoost.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
1. Oversampling technology based on SMOTE
SMOTE (Synthetic Minority Oversampling Technique) is an improvement on random oversampling. Random oversampling randomly samples the minority class in a data set at a certain ratio and then stacks the data obtained from each round of sampling. Because the same data may be drawn repeatedly over multiple rounds, the stacked data set contains a large number of duplicate values, which eventually causes the model to overfit. SMOTE instead analyzes the neighborhood of each minority sample and randomly synthesizes new data between two minority samples, steadily increasing the number of minority samples until it is roughly balanced with the number of majority samples. The specific flow of the algorithm is shown in FIG. 3.
(1) First, each time a sample x is selected from the minority class of the original data set, and its distance to every other sample in the minority class is calculated using the Euclidean distance to obtain the K nearest neighbors of x.
(2) Second, a sampling ratio, i.e. a sampling magnification N (N must be a positive integer), is set according to the degree of imbalance of the original data set.
(3) Then, for each minority class sample x, one sample $x_i$ is randomly selected from its K nearest neighbors at a time, N in total. For each selection, a random number between 0 and 1 is generated and a new sample $x_{new}$ is synthesized between x and $x_i$, calculated as in equation (1).
$$x_{new} = x + \mathrm{rand}(0,1) \cdot (x_i - x) \qquad (1)$$
After these steps are repeated, N synthetic samples have been generated for each minority class sample, and the whole minority class is expanded to N times its original size.
In oversampling, the SMOTE algorithm relies on the property that adjacent points in the feature space have similar features, and generates new data in the feature space that is highly similar to the original data.
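The three SMOTE steps above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch under stated assumptions, not the patent's implementation: the function names `k_nearest` and `smote`, the `seed` parameter, and the representation of samples as lists of floats are all hypothetical choices made for the example.

```python
import math
import random
from typing import List


def k_nearest(sample: List[float], minority: List[List[float]], k: int) -> List[List[float]]:
    """Step (1): K nearest minority-class neighbours of `sample` by Euclidean distance."""
    others = [s for s in minority if s is not sample]
    others.sort(key=lambda s: math.dist(sample, s))
    return others[:k]


def smote(minority: List[List[float]], k: int, n: int, seed: int = 0) -> List[List[float]]:
    """Generate n synthetic samples per minority sample (n is the sampling magnification N)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    synthetic = []
    for x in minority:
        neighbours = k_nearest(x, minority, k)
        for _ in range(n):
            x_i = rng.choice(neighbours)  # step (3): pick one of the K neighbours
            gap = rng.random()            # rand(0, 1)
            # equation (1): x_new = x + rand(0, 1) * (x_i - x)
            synthetic.append([a + gap * (b - a) for a, b in zip(x, x_i)])
    return synthetic
```

For example, `smote([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], k=2, n=3)` returns 12 synthetic points, each lying on a line segment between a minority sample and one of its two nearest neighbours.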
2. GBDT-based classification
GBDT (Gradient Boosting Decision Tree) combines multiple weak learners into a strong learner according to a certain combination strategy. GBDT uses a forward stagewise algorithm that chains all the regression trees together and takes their cumulative result as the final result [17]. The key to using GBDT for classification is that the negative gradient of the loss function evaluated at the current model serves as an approximation of the residual in the boosting tree algorithm for regression, and the next regression tree is fitted to it. The GBDT binary classification algorithm proceeds as follows:
(1) Initialize the first weak learner $F_0(x)$:
Figure BDA0004025163760000041
where P(Y = 1|x) is the proportion of samples with Y = 1 in the training set; in the binary GBDT classification model, P = P(Y = 1|x) is the predicted probability that a given input x is a positive sample. The learner is thus initialized with prior information.
(2) Calculate the negative gradient at the m-th iteration:
Figure BDA0004025163760000042
(3) Fit the data $(x_i, r_{m,i})$ with a CART regression tree to obtain the m-th regression tree, whose leaf node regions are $R_{m,j}$, where $j = 1, 2, \dots, J_m$ and $J_m$ is the number of leaf nodes of the m-th regression tree.
(4) For each of the $J_m$ leaf node regions, $j = 1, 2, \dots, J_m$, calculate the best-fit value:
Figure BDA0004025163760000043
(5) Update the strong learner $F_m(x)$:
Figure BDA0004025163760000044
(6) Obtain the expression of the final strong learner $F_M(x)$:
Figure BDA0004025163760000051
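The equation images for steps (1) to (6) are not reproduced in this text. For orientation only, binary-classification GBDT with a log-loss objective is commonly written as below. This is an assumption based on the surrounding definitions, not a transcription of the patent's figures; the symbols $r_{m,i}$ (negative gradient), $c_{m,j}$ (leaf value) and the indicator $I(\cdot)$ are conventional notation introduced here.

```latex
% (1) Initial learner: prior log-odds of the positive class
F_0(x) = \log\frac{P(Y=1\mid x)}{1 - P(Y=1\mid x)}

% (2) Negative gradient of the log loss at iteration m, for sample i
r_{m,i} = -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}}
        = y_i - \frac{1}{1 + e^{-F_{m-1}(x_i)}}

% (4) Best-fit value of leaf region R_{m,j} (one Newton step)
c_{m,j} = \frac{\sum_{x_i \in R_{m,j}} r_{m,i}}
               {\sum_{x_i \in R_{m,j}} (y_i - r_{m,i})(1 - y_i + r_{m,i})}

% (5) Update of the strong learner
F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J_m} c_{m,j}\, I\bigl(x \in R_{m,j}\bigr)

% (6) Final strong learner after M iterations
F_M(x) = F_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J_m} c_{m,j}\, I\bigl(x \in R_{m,j}\bigr)
```

In this form $y_i - r_{m,i}$ equals the currently predicted probability for sample i, so the denominator of the leaf value is the sum of $p_i(1 - p_i)$ over the leaf.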
The invention discloses an unbalanced electricity stealing data classification method based on SMOTE-GBDT.
(1) Missing value filling
For the missing values contained in the electricity consumption data set, directly deleting the feature fields that contain missing values, or leaving them unprocessed, would greatly affect the classification performance of the electricity stealing detection model. This patent therefore completes the missing data by filling the missing values with a mean interpolation method, with the formula as follows:
Figure BDA0004025163760000052
where the index i denotes the i-th user, the index j denotes the j-th day, $x_{ij}$ is the power consumption of the i-th user on the j-th day, and $f_1(x_{ij})$ is the mean-interpolated value of $x_{ij}$.
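Because the interpolation formula itself is only available as an image, the sketch below illustrates one common reading of mean interpolation, offered here purely as an assumption: a missing reading is replaced by the mean of the readings on the adjacent days when both are present, and by 0 otherwise. The function name `mean_interpolate` and the use of `None` to mark a missing value are hypothetical.

```python
from typing import List, Optional


def mean_interpolate(series: List[Optional[float]]) -> List[float]:
    """Fill missing daily readings (None) in one user's consumption series.

    Assumed rule (not confirmed by the patent text): use the mean of the two
    adjacent days if both are present in the original series, else fall back to 0.0.
    """
    filled = []
    for j, value in enumerate(series):
        if value is not None:
            filled.append(float(value))
            continue
        prev_val = series[j - 1] if j > 0 else None
        next_val = series[j + 1] if j + 1 < len(series) else None
        if prev_val is not None and next_val is not None:
            filled.append((prev_val + next_val) / 2.0)
        else:
            filled.append(0.0)  # assumed fallback when a neighbour is also missing
    return filled
```

Neighbours are read from the original series rather than from already-filled values, so the result does not depend on the order in which gaps are processed.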
(2) Power load data standardization: the data set is normalized using the min-max normalization method, calculated as follows:
$$f_3(x_{ij}) = \frac{x_{ij} - x_{i\min}}{x_{i\max} - x_{i\min}}$$
where x denotes the electricity consumption data of the user, i denotes user i, j denotes day j, $x_{ij}$ is the power consumption of user i on day j, $x_{i\min}$ is the minimum power consumption of the i-th user, $x_{i\max}$ is the maximum power consumption of the i-th user, and $f_3(x_{ij})$ is the min-max normalized value of $x_{ij}$.
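Per-user min-max normalization as defined above can be sketched as follows. The function name `min_max_normalize` is hypothetical, and mapping a constant series to all zeros is an assumption made to avoid division by zero; the patent text does not say how that case is handled.

```python
from typing import List


def min_max_normalize(series: List[float]) -> List[float]:
    """Scale one user's daily consumption to [0, 1] using that user's own min and max."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # Assumed handling of a constant series (x_imax == x_imin)
        return [0.0 for _ in series]
    return [(x - lo) / (hi - lo) for x in series]
```

For instance, `min_max_normalize([2.0, 4.0, 6.0])` gives `[0.0, 0.5, 1.0]`.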
Then, an optimal electricity stealing behavior analysis model is obtained through two-stage optimization:
1) Selecting the K value for oversampling in the SMOTE algorithm
For the preprocessed data, the SMOTE algorithm is applied with different K values for oversampling, a GBDT model with default parameters is trained on each resulting data set, the classification performance on the data sets generated with the different K values is evaluated, and the K-neighbor value that gives the best classification is found. Compared with the original data set and with the data sets oversampled using other K values, the GBDT model trained on the data set obtained with the SMOTE K value set to 5 has the best overall performance.
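The comparison across K values can be sketched as below. The patent text does not name the evaluation metric, so the F1 score used here is an assumption, and `f1_score`, `best_k` and the `predictions_by_k` mapping (K value to predicted labels of the model trained on that K's oversampled data) are hypothetical names for illustration.

```python
from typing import Dict, List, Tuple


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 of the positive (electricity-stealing) class, with labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_k(y_true: List[int], predictions_by_k: Dict[int, List[int]]) -> Tuple[int, Dict[int, float]]:
    """Score the predictions obtained for each K and return the best K with all scores."""
    scores = {k: f1_score(y_true, pred) for k, pred in predictions_by_k.items()}
    return max(scores, key=scores.get), scores
```

A metric that focuses on the minority class, such as F1 or recall, is a natural choice here because plain accuracy can look high on unbalanced data even when few stealing users are detected.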
2) GBDT model parameter optimization based on grid search and cross validation combination
According to the optimal K value, SMOTE is applied to the user load data set to generate a new data set, a GBDT model is trained on this data set, and the GBDT model parameters are tuned by combining grid search with cross validation.
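Grid search combined with cross validation can be sketched generically as follows. This is a minimal standard-library illustration, not the patent's implementation: `k_fold_indices`, `grid_search_cv` and the callback `train_and_score` (which would fit a GBDT with the given parameters on the training indices and return a validation score) are hypothetical names, and the parameters that are tuned are not specified in the source.

```python
import itertools
from statistics import mean
from typing import Callable, Dict, Iterator, List, Tuple


def k_fold_indices(n_samples: int, n_folds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each fold (interleaved split)."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_samples) if i not in held]
        yield train, held_out


def grid_search_cv(
    train_and_score: Callable[[Dict[str, float], List[int], List[int]], float],
    n_samples: int,
    param_grid: Dict[str, List[float]],
    n_folds: int = 5,
) -> Tuple[Dict[str, float], float]:
    """Try every parameter combination and keep the one with the best mean fold score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mean(
            train_and_score(params, train, valid)
            for train, valid in k_fold_indices(n_samples, n_folds)
        )
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Averaging the score over folds makes the chosen parameters less sensitive to any single train/validation split than a one-off hold-out evaluation would be.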
Based on the foregoing method, this embodiment further provides an unbalanced electricity stealing data classification apparatus, which includes:
the data acquisition module is used for collecting user load data in the metering system;
the data preprocessing module is used for filling missing values and carrying out standardization processing on the data;
the oversampling module is used for oversampling the preprocessed data by adopting an SMOTE algorithm according to different K values;
the model training and generating module is used for training a GBDT model with default parameters on each oversampled data set, evaluating the classification performance on the data sets generated with different K values, finding the K-neighbor value that gives the best classification, applying SMOTE to the user load data set with the optimal K value to generate a new data set, training the GBDT model on this data set, and tuning the model parameters by combining grid search with cross validation to find the optimal parameters of the GBDT model;
and the model application module is used for inputting the original user load data into the trained optimal GBDT model for calculation and analysis.
The embodiment also provides a computer device, which includes a memory and a processor, the memory and the processor are communicatively connected with each other, the memory stores computer instructions, and the processor executes the computer instructions, so as to execute the data classification method.
The present embodiment also provides a computer-readable storage medium storing computer instructions for causing a computer to execute the data classification method described above.
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It should be understood by those skilled in the art that the above embodiments do not limit the scope of the present invention in any way, and all technical solutions obtained by using equivalent substitution methods fall within the scope of the present invention.
The parts not involved in the present invention are the same as or can be implemented using the prior art.

Claims (7)

1. An unbalanced electricity stealing data classification method based on SMOTE-GBDT is characterized by comprising the following steps:
step one, collecting user load data;
step two, missing value filling and standardization processing are carried out on the data;
step three, oversampling with different K values based on the SMOTE algorithm;
step four, training a GBDT model with default parameters on each oversampled data set;
step five, evaluating the classification performance on the data sets generated with different K values, and finding the K-neighbor value that gives the best classification;
step six, training a GBDT model based on the data set with the best oversampling effect;
and step seven, tuning the model parameters by combining grid search with cross validation to find the optimal parameters of the GBDT model, thereby obtaining the optimal electricity stealing analysis model.
2. The SMOTE-GBDT-based unbalanced electricity stealing data classification method according to claim 1, wherein in the second step, the missing values are filled by using a mean interpolation method, and the formula is shown in (1):
Figure FDA0004025163750000011
where the index i denotes the i-th user, the index j denotes the j-th day, $x_{ij}$ is the power consumption of the i-th user on the j-th day, and $f_1(x_{ij})$ is the mean-interpolated value of $x_{ij}$.
3. The SMOTE-GBDT-based unbalanced electricity stealing data classification method according to claim 1, wherein in the second step, the data set is normalized using the min-max normalization method, calculated as in formula (2):
$$f_3(x_{ij}) = \frac{x_{ij} - x_{i\min}}{x_{i\max} - x_{i\min}} \qquad (2)$$
where x denotes the electricity consumption data of the user, i denotes user i, j denotes day j, $x_{ij}$ is the power consumption of user i on day j, $x_{i\min}$ is the minimum power consumption of the i-th user, $x_{i\max}$ is the maximum power consumption of the i-th user, and $f_3(x_{ij})$ is the min-max normalized value of $x_{ij}$.
4. The SMOTE-GBDT-based unbalanced electricity stealing data classification method according to claim 1, wherein the GBDT model has the best overall performance when trained on a data set obtained with the SMOTE algorithm K value set to 5.
5. An unbalanced electricity stealing data classification device based on SMOTE-GBDT is characterized by comprising:
the data acquisition module is used for collecting user load data in the metering system;
the data preprocessing module is used for filling missing values and carrying out standardization processing on the data;
the oversampling module is used for oversampling the preprocessed data by adopting an SMOTE algorithm according to different K values;
the model training and generating module is used for training a GBDT model with default parameters on each oversampled data set, evaluating the classification performance on the data sets generated with different K values, finding the K-neighbor value that gives the best classification, applying SMOTE to the user load data set with the optimal K value to generate a new data set, training the GBDT model on this data set, and tuning the model parameters by combining grid search with cross validation to find the optimal parameters of the GBDT model;
and the model application module is used for inputting the original user load data into the trained optimal GBDT model for computational analysis.
6. A computer device comprising a memory and a processor, wherein the memory and the processor are communicatively connected, the memory stores computer instructions, and the processor executes the computer instructions to perform the data classification method according to any one of claims 1 to 4.
7. A computer-readable storage medium storing computer instructions for causing a computer to perform the data classification method of any one of claims 1-4.
CN202211702959.0A 2022-12-29 2022-12-29 SMOTE-GBDT-based unbalanced electricity stealing data classification method and device, computer equipment and storage medium Pending CN115936926A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211702959.0A CN115936926A (en) 2022-12-29 2022-12-29 SMOTE-GBDT-based unbalanced electricity stealing data classification method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211702959.0A CN115936926A (en) 2022-12-29 2022-12-29 SMOTE-GBDT-based unbalanced electricity stealing data classification method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115936926A true CN115936926A (en) 2023-04-07

Family

ID=86652505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211702959.0A Pending CN115936926A (en) 2022-12-29 2022-12-29 SMOTE-GBDT-based unbalanced electricity stealing data classification method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115936926A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116881639A (en) * 2023-07-10 2023-10-13 国网四川省电力公司营销服务中心 Electricity larceny data synthesis method based on generation countermeasure network

Similar Documents

Publication Publication Date Title
Zhao et al. Full-scale distribution system topology identification using Markov random field
Yang et al. Hybrid prediction method for wind speed combining ensemble empirical mode decomposition and Bayesian ridge regression
CN111313403B (en) Markov random field-based network topology identification method for low-voltage power distribution system
CN113177357B (en) Transient stability assessment method for power system
CN110739692B (en) Power distribution network structure identification method based on probability map model
CN111654392A (en) Low-voltage distribution network topology identification method and system based on mutual information
CN113935237A (en) Power transmission line fault type distinguishing method and system based on capsule network
CN114283320A (en) Target detection method based on full convolution and without branch structure
Wu et al. Gridtopo-GAN for distribution system topology identification
CN114841199A (en) Power distribution network fault diagnosis method, device, equipment and readable storage medium
CN115936926A (en) SMOTE-GBDT-based unbalanced electricity stealing data classification method and device, computer equipment and storage medium
Chen et al. Real‐time recognition of power quality disturbance‐based deep belief network using embedded parallel computing platform
CN117556369B (en) Power theft detection method and system for dynamically generated residual error graph convolution neural network
CN114021425A (en) Power system operation data modeling and feature selection method and device, electronic equipment and storage medium
CN112595918A (en) Low-voltage meter reading fault detection method and device
CN117194219A (en) Fuzzy test case generation and selection method, device, equipment and medium
CN111831955A (en) Lithium ion battery residual life prediction method and system
CN111199363A (en) Method for realizing topology recognition by maximum correlation screening algorithm
CN109697511B (en) Data reasoning method and device and computer equipment
CN113128130B (en) Real-time monitoring method and device for judging stability of direct-current power distribution system
CN116400168A (en) Power grid fault diagnosis method and system based on depth feature clustering
CN116010831A (en) Combined clustering scene reduction method and system based on potential decision result
CN113158134B (en) Method, device and storage medium for constructing non-invasive load identification model
CN115713032A (en) Power grid prevention control method, device, equipment and medium
CN116226748A (en) Multi-label co-occurrence network discrimination-based electricity stealing type detection method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination